Archive for October, 2006

why is heap allocation slow

October 26, 2006

A few months ago I wrote an article on garbage collection, Java and C++. One of the comments asked why heap allocation is so slow.

Heap allocators are slow because they are complex. The real trick is, why are they complex?

A malloc has to balance several competing factors with each other. Different mallocs are tuned for different things, and a common trick to improve performance in C/C++ based apps is to use a different heap allocator to normal. The malloc used on Linux is an improved version of Doug Lea’s generic malloc which is tuned for typical workloads.

A malloc must:

  • Be able to quickly find a chunk of memory of at least the requested size
  • Be able to quickly free that chunk again
  • Avoid wasting space
  • Try to keep allocations together contiguously, as that improves locality and cache utilisation

The simplest implementation of malloc() just increments a pointer each time it’s called, so advancing the edge of the heap, and the simplest implementation of free() does nothing. In fact that’s what modern Java mallocs actually do (when you call new). Of course it’s not quite that simple – if you use new inside a tight loop expect to see a large speed hit. The last time I optimised a piece of Java code, removing new from loops was one of the biggest wins.

In a C++ app with no garbage collector, of course this is a fast but naive strategy. It won’t be long until you run out of memory. Instead free() has to mark that block as no longer in use, then malloc() needs to reuse it when it can.

This sounds simple but can quickly lead to poor performance. The next simplest implementation of malloc() would scan the heap looking for blocks marked as free of the right size, return that if found else increment the edge of the heap. On UNIX this is done by using the sbrk syscall to move the “program break”.

But what if there are no blocks of exactly the right size? The obvious improvement is to let malloc() return freed blocks that are equal to or larger than the requested size. The app won’t use all the returned space of course so we’re still wasting some, but we can chop the returned block in two before returning it.

Hmm. This works but for some reason our app is seeing large random drops in performance. What’s up with that?

mallocs must understand the hardware

Let’s take a quick detour into hardware design. When a CPU wants to read a value from memory into a register, the cache controller might eventually need to communicate with main RAM. When it does so, it writes the address required onto the bus, and the RAM controller circuitry at the other end will write the requested values back. For various reasons, CPUs don’t see memory as an array of bytes but rather machine words which on a 32 bit system will be 32 bits long, on 64bit systems 64 bits and so on. Despite this, as a programmer you see memory in terms of an array of bytes (or bits) and you can access any one of them as you wish.

The impedence mismatch between the two views can sometimes cause weird problems. If you attempt to read into the middle of a machine word, most CPUs will trigger an exception and stop running your program. We say the access was not aligned. x86 chips are a bit more friendly, they have dedicated circuitry that will “fix up” the access and trigger two reads, merge them together and return the unaligned value as the programmer might expect. Because of the extra work involved, unaligned access is slow.

For this reason, it’s important that a malloc implementation ensure that it always returns aligned memory. Compilers will also pad structures to ensure that the members fall on aligned addresses (meaning that a structure may actually be bigger than you’d think). Between the compiler and the malloc, the system co-operates to keep unaligned accesses to a minimum and so improve performance.

back to malloc

OK, so we fix our malloc to ensure it always returns blocks that are properly aligned and performance is back to what we’d expect. Great. Well, there is still work to do. Let’s say we allocate a block of 12 bytes, another one of 12 bytes, then free both of them. Now we allocate a block of 24 bytes. The obvious thing to do is re-use both of the freed blocks, but under our previous algorithm that won’t happen. We need to merge the two freed blocks together, so we can find them again inside malloc().

Where should this merging take place? The obvious place is inside free, but then code that expects free to be fast will suddenly start to suck. Worse, merging could take arbitrarily long …. we could merge two blocks, and find that we can suddenly merge many more freed blocks together. How long should we spend doing this? It depends on the behaviour of the app.

Most allocators don’t provide any time guarantees at all for malloc or free. They can take as long as they want, for this sort of reason.

OK, so we merge free blocks together, make sure to allocate aligned memory, re-use freed memory where possible …. hmm what else? Well, our malloc is still quite slow because it has to scan the entire heap looking for freed memory. This could actually be slower than just grabbing some space at the end, because you might swap in the entire heap.

A better solution would partition the heap into partitions of different sizes, allowing you to scan only the part that might contain free chunks of the required size. Of course if you also merge free blocks then these partitions may stop being accurate, and once allocated memory cannot be moved without special tricks.

Unfortunately, it’s still possible to lose with such an allocator. An obvious case is a process that allocates lots of small blocks while it processes some data, then allocates another block to hold the result, then frees the intermediate blocks. The result is a large span of empty space within the heap available to be reused, but at the top of the heap right next to the program break is a result block that is still in use. This high block now pins the break so the empty space cannot be returned to the operating system, forcing other processes into swap.

A smart allocator can use mmap rather than relying on sbrk and unmap the intermediate free pages, so returning the memory to the operating systems global free pool. However, this requires you to track free-ness not only at a block but also page level.

Phew! As you can see, making a semi-decent allocator is a lot of work. Doug Lea has been at it for literally decades, and todays modern allocators are a careful balance of compromises. Nowadays the Linux glibc can even do error detection and spot heap corruption.

GLib provides special purpose allocators you can use if you find malloc to be a bottleneck, but if you do first check to see if you can eliminate the allocation entirely.

Photo by Shannon Mary and atonal

autopackage 1.2 finally released

October 20, 2006

Hear ye, hear ye. Autopackage 1.2 is finally out!

It’s been a long time in coming, largely due to a lack of time and motivation on my part but thanks to the hard work of Curtis, Isak and Taj the release has finally seen the light of day – a huge relief for all concerned I’m sure!

The list in the announcement makes it look like the changes are far smaller than they really are. A throwaway statement about statistical analysis hides a complete framework for optimisation of installer success rates. That was one of Curtis’ successes. In the spirit of open source collaboration you can now see which projects are frequently failing to install and could use work, along with hard numbers to show how many users are having problems. The statistics collection process itself is entirely anonymous (we don’t even record IP addresses) and can be easily disabled by the user if they so wish.

Other changes include improved user interfaces in many parts, improved localisation, the ability to uninstall older versions installed using RPM, better C++ support and more. Most of the new features are interesting for developers rather than end users, but I think this is correct and as it should be – the job of an installer is to be as boring and inconspicuous as possible. There’s also the usual assortment of many small bug fixes, new APIs and distro compatibility improvements.

Well done everyone! Well done.

on a new patent system

October 13, 2006

Today at work we had a lecture from Joseph Stiglitz, a well known Nobel-prize winning economist. Given his credentials I was hoping for insightful analysis and enlightening comment, but got neither. It was rather disappointing. Lots of time spent talking about stuff we already knew (global warming is bad kids!) and very little time spent on solutions. From this I conclude that the guys at the top are just as unsure of what to do as us guys at the bottom.

a modest proposal

There is probably something deeply wrong with this idea. If so be sure to point it out in the comments (on plan99). I don’t bite!

The current patent system has many problems, and I won’t enumerate them all because that’d be preaching to the choir. If you read this journal you probably work in software and have been burned by it before in some way. This proposal is designed to fix the following problems:

  • Patents last for 20 years, a number chosen pretty much arbitrarily. For software this is way too long, for other areas it might actually be too short. When the patent system was created by Henry VIII I’m sure this seemed like a reasonable length of time – after all, the whole point was to stop trade secrets being passed down the through a family. When you are dealing with generations, 20 years probably seems like an instant.
  • Patents grant a monopoly to the owner, and monopolies are bad.
  • There is no connection between how much it cost to develop an idea and how much you can get for it. So there is an epidemic of patent trolls that simply buy up patents on obvious things then use them to blackmail big companies.

a new system

Let us assume that company A develops a new wonder-drug, and it costs $100 million to do so. This figure is wrong by an order of magnitude of course but it’ll make the maths easier. The value is known because A must provide an audit trail showing in a provable manner what the costs of the project were. Let’s oversimplify for now and assume that every project must pay for itself.

We have a variable II which is a percentage, let’s set it at 20% to start with. II stands for Innovation Incentive and is some arbitrary figure that can be adjusted upwards to increase the incentive to invent things, and downwards to decrease the incentive. I’ve tried to avoid arbitrary figures in this system but we need at least one.

20% of 100 million is 20 million, so we add these two numbers together and put it into a “pool”, which now stands at $120 million.

Anybody who is in the pool has a license to use the technology, anybody out of the pool does not. Unlike the existing patent system though the originators of the patent cannot control who is allowed into the pool – anybody can join as long as they contribute to the value of the pool.

The contributions work like this. When company B wants to join it must pay half of the original sum, or $60 million. For this price B gets the ability to compete with A. Perhaps it cannot make as much money as A can because it lacks first mover advantage, or maybe it can make more due to learning from As mistakes. It doesn’t matter.

The $60 million that B pays goes straight into As bank account. Now both company A and B are down $60 million, but both have equal access to the technology.

Let us assume company C wishes to join the pool. It must pay $40 million ($120m/3). The $40 million is split both ways between A and B, so they both receive $20 million each. Now C is down $40 million, as is B, as is A. They have effectively all paid $40 million towards the cost of developing the technology.

Company D wishes to join, and must pay $30 million, split 3 ways, so now A B and C receive $10 million each. They are now all down $30 million.

And so it continues on and on, with more and more companies joining the fray as the barrier to entry gets lower and lower. The drug that at first may have been affordable only by rich Westerners eventually become available as generics thanks to the African drug companies only having to contribute a small amount to get access to the technology. Effectively the cost of developing the technology has been spread out across many companies.

As this process continues, the amount that each company has “lost” will reduce to zero. Eventually company A will get back $120 million or some figure quite close to it …. it’ll never reach exactly $120 million of course, which is why we need the Innovation Incentive percentage at the start – to ensure A makes a profit on developing the technology. Company B will get close to $60 million etc.

Increasing the II will make the products based on the technology more expensive for the end user. However, it makes the rewards for creating the technology proportionally greater. Decreasing it will decrease costs but also make creating new patents less attractive.

the advantages

This represents an improvement over the existing system in several ways:

  • There is no fixed length of time before an idea becomes “free”, because the idea is not really restricted. Anybody who can stump up the entry price can join in, and if the idea takes off they will eventually get their money back.
  • It isn’t possible to patent obvious things and then extract billions from your competitors, because you have to prove that developing the idea cost you something. It’s not enough to merely have a “brain wave”, your patent must represent real, hard work.
  • You can’t force competitors out of the market a la Eolas because you don’t have a monopoly.
  • The cost of licensing the technology will eventually drop so close to zero that the pool can be abandoned entirely, leading the way for commoditisation (eg, generic drugs).

the disadvantages

Firstly, I’m not entirely sure my thinking is straight. I have a vague feeling that this system doesn’t actually work how I think it works, but am not smart enough to see why not.

Apart from that:

  • How many inventors are pushed on by the promise of theoretically limitless rewards? Would a fixed II kill innovation?
  • The cost of entering the pool early might not be outweighed by first mover advantage
  • It’s not clear how you could incrementally evolve the current system towards it. I have a few ideas on this but it’ll have to wait for another post.
  • The example figure of 20% I gave for the II might be way too low. Maybe 200 percent is more appropriate.
  • Some companies research many ideas that cost money but never pan out, then rely on massive earnings from a golden goose to save them. In the fractional pool model, this becomes a lot harder because you cannot show that the costs incurred in one failed project contributed to the cost of the successful project.
  • The need to audit research costs might be unworkable or prove offputting to some companies.

Clearly there are many unanswered questions here and many potential pitfalls, but given that the existing system is so deeply flawed surely anything must be better than it.

If you like this idea, please let me know by writing a short message in the comments. Alternatively digg it, blog it, discuss it, slashdot it etc. Thanks.