back to _why's Estate

The Fully Upturned Bin

Managing memory and efficient disposal of waste in Ruby

You've been ignoring garbage collection again, haven't you? I'm sure of it! And why shouldn't you? GC just works. You can't be expected to interfere with the details of allocating memory. What can you do about it? As for emancipation of the heap--all you can do is a GC.start, right? Which happens regularly anyway, right?

We need to have a talk about managing memory in Ruby. Right now. The Pickaxe is a blooming incredible resource, but it says nearly nothing truly useful about memory management. Hop to page 369 in the Pickaxe II and you'll find the lone caution related to managing memory in Ruby.

Summary is: Dave had a CSV parser which was bogging down the processor. Basically, a huge string was building, like a giant snowball rumbling down the mountainside, and Ruby's GC had to keep stopping the snowball and weighing it and cleaning it every step of the way.

His lesson, from page 370:

The answer was simple and surprisingly effective. Rather than build the result string as it went along, the code was changed to store each CSV row as an element in an array... we were no longer building an ever-growing string that forced garbage collection.

Okay, whoa, whoa. Hold up. Adding to strings forces garbage collection? And adding to an array doesn't!? This isn't what Dave's saying, but there's some knowledge of Ruby internals that Dave's veiling here. And it makes it hard to draw conclusions about what, in fact, he is saying.

See, this is what I mean. Black magic everywhere. And the only place you can go to find out what's going on is the Ruby source code. Or the Ruby Hacker's Guide by Minero Aoki, which, being in Japanese, remains encrypted to a large majority of us. But notice that Aoki-san has devoted an entire chapter to GC. Still, I can't seem to find practical guidelines for Ruby developers in any language.

I'm going to limit my instruction to three major mistakes many Rubyists make concerning GC. I consistently run into the first two. The third I've only just started thinking about, and I'm realizing it's a very prevalent mistake, one I've personally been making for years.

When Ruby Sweeps

Let's wrap up a clear conclusion to Dave's problem above. His string was causing GC to run too much. This raises two questions: what triggers GC? And what is "too much"?

Garbage collection is triggered when:

  1. Ruby goes to allocate memory and its internal memory counter shows that not enough is available inside Ruby. This just means that Ruby will garbage collect first, before grabbing more of your system's total memory for its own use. (Incidentally, when Ruby is reallocating memory, it won't check this counter.)
  2. An attempt at allocation fails. In that case, Ruby will run GC to try to free up memory before resorting to a memory error.
  3. All objects have been free'd and the object heap needs to be reset.
  4. An extension uses rb_gc() or rb_gc_start() directly. (For example, the IO and Socket classes will start GC if no file or socket handles can be opened.)
  5. You manually trigger it with GC.start or ObjectSpace.garbage_collect.
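You can watch the fifth condition in action. Here's a little sketch of my own (make_garbage is a hypothetical helper, not anyone's API): pile up some throwaway strings in a method, let them fall out of scope, then sweep by hand.

```ruby
# make_garbage churns out strings that become unreachable the
# moment the method returns.
def make_garbage
  10_000.times { "a throwaway string" * 10 }
  nil
end

make_garbage   # the strings are now unmarked, awaiting the sweep
GC.start       # condition five: trigger the collection by hand
```

GC.start and ObjectSpace.garbage_collect are the same call wearing different hats, so use whichever suits your mood.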

Really only the first one is terribly important. The others will likely happen infrequently. The fourth and fifth are only important if any scripts or extensions you use are needlessly starting GC, but I don't know of any. As you'll see in a moment, most extensions commit errors of an opposite nature: they're guilty of ignoring GC.

The Dotted Lines Mapping Ruby's Brain

I think it's important to deal with some concrete details that may seem tedious for Ruby authors to cover, but I think are important to know at some point. What I'm saying is: I'm going to quote some of Ruby's source code. Don't worry if you don't know C, it's just here as a citation to back me up.

From gc.c:

 #define GC_MALLOC_LIMIT 8000000

Ruby's internal memory counter starts at eight megabytes: GC_MALLOC_LIMIT is how much allocation Ruby will tolerate before it sweeps. This means that if you stay under eight megabytes' worth of allocation, your script may never call GC. Once you breach this limit, Ruby will begin to watch your allocation needs and resize the limit accordingly. When it comes down to it, Ruby is just continually fighting to keep you under the 8 meg limit. If it can get you back under that limit, you're Red October, cruising quietly under the radar again.

If you're allocating lots of small portions that add up to a big breakfast for GC, then you'll see GC run more frequently. If you allocate a huge amount and whittle down from there, Ruby will probably go easier on you.
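You can make this measurable, though only on newer Rubies: GC.stat arrived long after this article, so treat the following as a sketch rather than period-accurate advice.

```ruby
# GC.stat(:count) is the number of collections run so far (a
# latter-day addition, not available in the Ruby of 2005).
# Watch it tick while we litter the heap with small strings.
runs_before = GC.stat(:count)
100_000.times { "tiny" + "throwaway" }   # lots of small allocations
runs_after = GC.stat(:count)
# the counter never goes backward, and with this much litter
# it usually climbs
```

Lots of small, short-lived allocations push the counter up; one big allocation you whittle down tends to leave it alone.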

Ruby's Piles of Objects

From gc.c:

 static struct heaps_slot {
    RVALUE *slot;
    int limit;
 } *heaps;
 static int heaps_length = 0;
 static int heaps_used   = 0;

 #define HEAP_MIN_SLOTS 10000
 static int heap_slots = HEAP_MIN_SLOTS;

Ruby's object heap is broken up into manageable heaps which store objects as they are created. By default you get ten heaps, which have a minimum of ten-thousand objects in each heap. As Ruby adds heaps, they get bigger by a factor of 1.8.

The important thing to note here, again, are the heap boundaries. Once Ruby has allocated ten-thousand objects, a new heap is needed. After the next eighteen-thousand objects, another new heap is needed.
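The march of heap sizes, by that factor of 1.8, can be sketched out (this is an illustration of the growth rule, not code from gc.c itself):

```ruby
# Each new heap is 1.8 times the size of the last, starting from
# the ten-thousand-slot minimum.
sizes = []
slots = 10_000
4.times do
  sizes << slots
  slots = (slots * 1.8).to_i
end
# sizes is now [10000, 18000, 32400, 58320]
```

So the boundary lines land at ten thousand, twenty-eight thousand, sixty thousand and change, and onward.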

I'm not bringing this up so you'll start counting your objects one-by-one. Well, maybe a little. It's tough to know roughly how many objects are being loaded, given the wealth of libraries we all depend on. But now you have an idea of where GC strikes on these boundary lines.
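If you do want a count, ObjectSpace will oblige. A rough census, specific to Matz's Ruby (the walker skips internal objects, so this is an estimate, not an audit):

```ruby
# each_object walks the heaps we just toured. Passing a class
# narrows the walk to live instances of that class.
string_count = 0
ObjectSpace.each_object(String) { string_count += 1 }
# thousands, even for a nearly empty script, thanks to the
# core library
```

Run it at interesting moments, before and after a big parse, say, and you'll see where your boundary-line troubles come from.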

Mistake #1: Trash Everywhere, Temporary Objects Everywhere

Back to the original question: why was Dave's string incurring GC costs more than the arrays?

It's not because adding strings forces garbage collection and adding to arrays doesn't. The concat operator (<<) for both strings and arrays is a reallocation, a widening of the string or array to accommodate the addition, which doesn't force garbage collection.

The problem is the pile of data as a whole. In his first situation, he had two types of data stockpiling: (1) a temporary string for each row in his CSV file, with fixed quotations and such things, and (2) the giant string containing everything. If each string is 1k and there are 5,000 rows...

Scenario One: build a big string from little strings

temporary strings: 5 megs (5,000k)
massive string: 5 megs (5,000k)
TOTAL: 10 megs (10,000k)

Dave's improved script swapped the massive string for an array. He kept the temporary strings, but stored them in an array. The array will only end up costing 5000 * sizeof(VALUE) rather than the full size of each string. And generally, a VALUE is four bytes.

Scenario Two: storing strings in an array

strings: 5 megs (5,000k)
massive array: 20k
TOTAL: 5.02 megs

Then, when we need to make a big string, we call join. Now we're up to ten megs and suddenly all those strings become temporary strings and they can all be released at once. It's a huge cost at the end, but it's a lot more efficient than a gradual crescendo that eats resources the whole time.
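In miniature, the two scenarios look like this (a sketch of my own, not Dave's actual parser):

```ruby
# Scenario one: the ever-growing string, reallocated as it swells.
big = ""
5_000.times { |i| big << "store,#{i},100.00\n" }

# Scenario two: cheap array slots now, one big join at the end.
rows = []
5_000.times { |i| rows << "store,#{i},100.00\n" }
csv = rows.join

# both produce the same string; the array route just piles its
# temporaries in one spot and releases them all after the join
```

Same destination, but the second route does its spending in one lump sum instead of taxing you the whole trip.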

The lessons here are:

  1. Temporary strings and temporary objects will stockpile. Just because you leave a method doesn't mean they are gone. You have to wait for one of the five conditions listed earlier to see GC happen.
  2. Arrays, hashes, objects are memory cheap. Worry about your strings. Do you have string corpses everywhere?
  3. Build your object from all the pieces, avoiding temporary objects as much as possible. Then, when you're done with it, toss the big object out of scope at once!

Mistake #2: Needy Scripts, Hanging On Too Tightly

Ruby's GC is called mark-and-sweep. The "mark" stage checks objects to see if they are still in use. If an object is in a variable that can still be used in the current scope, the object (and any object inside that object) is marked for keeping. If the variable is long gone, off in another method, the object isn't marked. The "sweep" stage then frees objects which haven't been marked.

If you stuff something in an array and you happen to keep that array around, it's all marked. If you stuff something in a constant or global variable, it's forever marked.

Despite what you think, you DO have the power to allocate and free! If your whole program is in one long script with no methods helping you out to create scopes, then chances are that everything you create will stick around until the program ends. And if that program runs indefinitely, then someday--some fateful day--you will have a problem.
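A method is the cheapest emancipation papers you can give your objects. A hypothetical sketch (summarize is my invention, not anyone's API):

```ruby
# Temporaries created inside a method become sweepable once the
# method returns, so long as you only return the small result.
def summarize(items)
  blown_up = items.map { |i| i.to_s * 100 }  # big temporaries...
  blown_up.size                              # ...but only this escapes
end

count = summarize((1..1_000).to_a)
# the thousand inflated strings are now unmarked, fair game
# for the sweep
```

Had the same loop run at the top level of the script, blown_up would stay in scope, and stay marked, until the program ended.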

I see this abuse most commonly in situations where someone is looping through a database result set and hanging on to data.

 totals = {}
 db.query( "SELECT * FROM stores" ).each_hash do |store|
   # Transformations on the data, perhaps even selecting
   # other result sets...

   totals[store['id']] = store
 end
That totals hash is going to grow. And I just wonder: do you need all the data you're hanging on to? Maybe you can query again later if you really need certain data?

And, furthermore, are you stepping out of scope when you're done with the totals hash? Or are you weighing down the rest of your program with it?
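One way out, sketched with a plain array standing in for the result set (each_hash belongs to the MySQL extension, so this is an imitation with made-up store data):

```ruby
# Keep the totals, toss the rows: store only the field you need
# instead of the whole hash for every store.
rows = [
  { 'id' => 1, 'name' => 'Fred Meyer',    'sales' => 120.0 },
  { 'id' => 2, 'name' => 'Piggly Wiggly', 'sales' => 75.5  }
]

totals = {}
rows.each do |store|
  totals[store['id']] = store['sales']   # a lone Float, not the row
end
# the row hashes can now be swept; totals stays small
```

If you later discover you need a store's name after all, a second query is usually cheaper than carting every row around for the life of the program.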

Mistake #3: Avoiding ALLOC_N

It probably sounds like I'm ripping on the Pickaxe, but I'm only pointing to some inadequacies in our documentation--and it is our documentation, you know? But here's one area the Pickaxe gets right.

From page 293:

To work correctly with the garbage collector, you should use the following memory allocation routines. These routines do a little bit more work than the standard malloc. For instance, if ALLOC_N determines that it cannot allocate the desired amount of memory, it will invoke the garbage collector to try to reclaim some space. It will raise a NoMemError if it can't or if the requested amount of memory is invalid.

Sometimes you may be linking to shared libs which have their own memory management. These will fight with Ruby for allocation, which generally isn't a problem if you are lying low.

If your extension has control over the allocation of memory, using these functions will ensure that the garbage collector gets called at reasonable times. Since my YAML parser doesn't use the Ruby routines, Ruby can't help free up space should the parser need to use memory for a large incoming stream. But this will be changing in the next release.

As a final advisory, I do think there are times when it makes sense to turn off GC. Generally, though, mentally walking through how GC is going to deal with your data is enough to help you refactor.

by why the lucky stiff

june 22, 2005