A few months ago I wrote an article on garbage collection, Java and C++. One of the comments asked why heap allocation is so slow.

Heap allocators are slow because they are complex. The real trick is, why are they complex?
A malloc has to balance several competing factors with each other. Different mallocs are tuned for different things, and a common trick to improve performance in C/C++ based apps is to use a different heap allocator to normal. The malloc used on Linux is an improved version of Doug Lea’s generic malloc which is tuned for typical workloads.
A malloc must:
- Be able to quickly find a chunk of memory of at least the requested size
- Be able to quickly free that chunk again
- Avoid wasting space
- Try to keep allocations together contiguously, as that improves locality and cache utilisation
The simplest implementation of malloc() just increments a pointer each time it’s called, so advancing the edge of the heap, and the simplest implementation of free() does nothing. In fact that’s what modern Java mallocs actually do (when you call new). Of course it’s not quite that simple – if you use new inside a tight loop expect to see a large speed hit. The last time I optimised a piece of Java code, removing new from loops was one of the biggest wins.
In a C++ app with no garbage collector, of course this is a fast but naive strategy. It won’t be long until you run out of memory. Instead free() has to mark that block as no longer in use, then malloc() needs to reuse it when it can.
This sounds simple but can quickly lead to poor performance. The next simplest implementation of malloc() would scan the heap looking for blocks marked as free of the right size, return that if found else increment the edge of the heap. On UNIX this is done by using the sbrk syscall to move the “program break”.
But what if there are no blocks of exactly the right size? The obvious improvement is to let malloc() return freed blocks that are equal to or larger than the requested size. The app won’t use all the returned space of course so we’re still wasting some, but we can chop the returned block in two before returning it.
Hmm. This works but for some reason our app is seeing large random drops in performance. What’s up with that?
mallocs must understand the hardware

Let’s take a quick detour into hardware design. When a CPU wants to read a value from memory into a register, the cache controller might eventually need to communicate with main RAM. When it does so, it writes the address required onto the bus, and the RAM controller circuitry at the other end will write the requested values back. For various reasons, CPUs don’t see memory as an array of bytes but rather machine words which on a 32 bit system will be 32 bits long, on 64bit systems 64 bits and so on. Despite this, as a programmer you see memory in terms of an array of bytes (or bits) and you can access any one of them as you wish.
The impedence mismatch between the two views can sometimes cause weird problems. If you attempt to read into the middle of a machine word, most CPUs will trigger an exception and stop running your program. We say the access was not aligned. x86 chips are a bit more friendly, they have dedicated circuitry that will “fix up” the access and trigger two reads, merge them together and return the unaligned value as the programmer might expect. Because of the extra work involved, unaligned access is slow.
For this reason, it’s important that a malloc implementation ensure that it always returns aligned memory. Compilers will also pad structures to ensure that the members fall on aligned addresses (meaning that a structure may actually be bigger than you’d think). Between the compiler and the malloc, the system co-operates to keep unaligned accesses to a minimum and so improve performance.
back to malloc
OK, so we fix our malloc to ensure it always returns blocks that are properly aligned and performance is back to what we’d expect. Great. Well, there is still work to do. Let’s say we allocate a block of 12 bytes, another one of 12 bytes, then free both of them. Now we allocate a block of 24 bytes. The obvious thing to do is re-use both of the freed blocks, but under our previous algorithm that won’t happen. We need to merge the two freed blocks together, so we can find them again inside malloc().
Where should this merging take place? The obvious place is inside free, but then code that expects free to be fast will suddenly start to suck. Worse, merging could take arbitrarily long …. we could merge two blocks, and find that we can suddenly merge many more freed blocks together. How long should we spend doing this? It depends on the behaviour of the app.
Most allocators don’t provide any time guarantees at all for malloc or free. They can take as long as they want, for this sort of reason.
OK, so we merge free blocks together, make sure to allocate aligned memory, re-use freed memory where possible …. hmm what else? Well, our malloc is still quite slow because it has to scan the entire heap looking for freed memory. This could actually be slower than just grabbing some space at the end, because you might swap in the entire heap.
A better solution would partition the heap into partitions of different sizes, allowing you to scan only the part that might contain free chunks of the required size. Of course if you also merge free blocks then these partitions may stop being accurate, and once allocated memory cannot be moved without special tricks.
Unfortunately, it’s still possible to lose with such an allocator. An obvious case is a process that allocates lots of small blocks while it processes some data, then allocates another block to hold the result, then frees the intermediate blocks. The result is a large span of empty space within the heap available to be reused, but at the top of the heap right next to the program break is a result block that is still in use. This high block now pins the break so the empty space cannot be returned to the operating system, forcing other processes into swap.
A smart allocator can use mmap rather than relying on sbrk and unmap the intermediate free pages, so returning the memory to the operating systems global free pool. However, this requires you to track free-ness not only at a block but also page level.
Phew! As you can see, making a semi-decent allocator is a lot of work. Doug Lea has been at it for literally decades, and todays modern allocators are a careful balance of compromises. Nowadays the Linux glibc can even do error detection and spot heap corruption.
GLib provides special purpose allocators you can use if you find malloc to be a bottleneck, but if you do first check to see if you can eliminate the allocation entirely.
Photo by Shannon Mary and atonal