Archive for July, 2008

Peak oil awareness …

July 9, 2008

… is growing slowly, but most people still aren’t making the connection between high oil prices and the fundamental shift in production rates that is peak oil. This image shows Google searches for “oil price” (red) against “peak oil” (blue):

Protocol buffers now open source!

July 7, 2008

Yay, it finally happened. Google released protocol buffers as an open source library.

Protocol buffers are one of those ideas that have a million different implementations but very few good ones. As a result everybody invents their own, and we lose interoperability and time for no good reason. Protocol buffers more or less hit the sweet spot between features and simplicity – I certainly wouldn’t want to design a file format or network protocol without them these days, and now I won’t have to, even if I’m doing work outside of Google.

Here’s a quick intro to what they are.

Protocol buffers are binary XML

Well, sort of. XML is more complicated than protocol buffers are – protobufs have no concept of namespaces, attributes vs subelements, character escaping, DTDs or any of the other things that make XML complicated. But the essential idea is the same – it’s a way of representing trees of structured data in such a way that they can be extended in a backwards compatible manner.

The key features of protocol buffers are:

  • A very efficient yet simple binary encoding. A minimal protobuf takes only a few bytes due to clever use of variable-length integers.
  • A simple specification syntax that lets you define a schema far more easily than you could with XML Schema.
  • A compiler that produces objects representing your structures in C++, Python or Java. These objects present a much cleaner and simpler interface than the XML DOM – it’s quite feasible to represent all your program’s internal state this way, whereas it’d be painful to replace a native object hierarchy with XML.
  • A set of tools that let you quickly serialize and deserialize them.
  • A lightweight (incomplete) set of interfaces that let you hook protobufs up to an RPC system.
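
To make the variable-length integer point concrete, here is a minimal sketch of the base-128 "varint" scheme in plain Python. It is a hand-rolled illustration rather than the library's own code: the function names are mine, and it only handles non-negative integers. Each byte carries 7 bits of the number, and the top bit says whether more bytes follow.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # lowest 7 bits go out first
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # top bit set: more bytes follow
        else:
            out.append(low7)         # top bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


if __name__ == "__main__":
    for n in (1, 127, 128, 300, 2**32):
        encoded = encode_varint(n)
        assert decode_varint(encoded)[0] == n
        print(n, encoded.hex(), len(encoded), "byte(s)")
```

Small numbers, which are by far the most common in practice, fit in a single byte, and only large values pay for extra bytes.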

Usage in RPC

Protocol buffers are so named because they were developed as a wire protocol for server communication within Google. Over time this developed into a full high-performance RPC system that supports many advanced features. In particular it’s very easy to debug and troubleshoot, a feature I find invaluable day in, day out. At heart though, this RPC system is still based on protocol buffers – the RPC protocols are defined using them.

The key feature here is that it’s very easy to extend protocols based on protobufs over time. Alternatives like CORBA, DCOM or Ice don’t have a particularly elegant approach to this – if you want to introduce a new parameter to an RPC, for instance, you need to introduce a whole new interface and then translate the call through. In a protobuf-based system, you just mark the new field as optional and then call has_foo() before accessing it in your new server. When the client is ready to use the new parameter, one set_foo() call is all it takes.
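
Here is a rough sketch of why that works, written in plain Python instead of against real generated classes. On the wire, each field is a numbered key followed by its value, so a reader can tell which fields are present and pass over the ones it does not know. The request layout (user_id as field 1, page_size as field 2), the default of 10 and all the function names are made up for illustration, and the sketch only supports integer fields.

```python
from typing import Dict, Tuple

WIRE_VARINT = 0  # wire type for varint-encoded integers


def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_message(fields: Dict[int, int]) -> bytes:
    """Write each field as key = (field_number << 3) | wire_type, then value."""
    out = bytearray()
    for number, value in sorted(fields.items()):
        out += encode_varint((number << 3) | WIRE_VARINT)
        out += encode_varint(value)
    return bytes(out)


def decode_message(data: bytes) -> Dict[int, int]:
    """Read every field present; absent fields simply never appear."""
    fields: Dict[int, int] = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        number, wire_type = key >> 3, key & 0x7
        if wire_type != WIRE_VARINT:
            raise ValueError("this sketch only supports varint fields")
        fields[number], pos = decode_varint(data, pos)
    return fields


# Hypothetical request: field 1 existed from the start, field 2 was added later.
USER_ID, PAGE_SIZE = 1, 2


def old_server(data: bytes) -> str:
    request = decode_message(data)
    return f"user={request[USER_ID]}"  # never looks at field 2


def new_server(data: bytes) -> str:
    request = decode_message(data)
    # Plays the role of has_page_size(): fall back when the client omitted it.
    page_size = request[PAGE_SIZE] if PAGE_SIZE in request else 10
    return f"user={request[USER_ID]} page_size={page_size}"


if __name__ == "__main__":
    old_client = encode_message({USER_ID: 42})
    new_client = encode_message({USER_ID: 42, PAGE_SIZE: 50})
    print(new_server(old_client))
    print(new_server(new_client))
    print(old_server(new_client))
```

All four combinations of old and new client and server keep working, which is the whole point: nobody has to upgrade in lockstep.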

The next key feature is that it’s very efficient on the wire. Protocols like SOAP or XML-RPC are really not designed for efficiency at all. In many cases, this won’t be a big deal, but for Google it is because we push everything so hard.

Protocol buffers as a file format

Because they efficiently serialize to binary, can be used from at least 3 key languages (and more in future) and can be extended over time, protocol buffers are a perfect fit for many file formats. Most open source programs these days have based their file formats on XML. OpenOffice, AbiWord, Inkscape and more all use XML markup languages to save their data. Because XML tends to be very large, they often zip it, resulting in a very slow and complicated piece of code to load or save these files. This matters – a big part of the complexity of the Word/Excel formats is due to “quicksave”, a feature that users love as it lets auto-save be less intrusive and more frequent, but which complicates the codebase considerably.

Protocol buffers have a more or less optimal binary representation and can be deserialized into in-memory objects extremely fast. Nothing stops you gzipping part or all of them if you want to eliminate some redundancy, but it’ll be redundancy on the application level that is eliminated, not on the format level.
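
For a feel of the size difference, the sketch below hand-encodes one small record in the protobuf wire layout and sets it next to an equivalent XML snippet. The record (a named circle with coordinates), its field numbers and the helper names are invented for the example, and a real program would use a schema and generated classes instead of building bytes by hand.

```python
import zlib


def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    """Wire type 0: key varint followed by the value varint."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number: int, payload: bytes) -> bytes:
    """Wire type 2: key varint, length varint, then the raw bytes."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload


# Hypothetical drawing-program record; names and field numbers are invented.
XML_RECORD = b"<shape><name>circle</name><x>120</x><y>45</y><radius>30</radius></shape>"
BINARY_RECORD = (
    bytes_field(1, b"circle")
    + varint_field(2, 120)
    + varint_field(3, 45)
    + varint_field(4, 30)
)

if __name__ == "__main__":
    print("XML bytes:   ", len(XML_RECORD))
    print("binary bytes:", len(BINARY_RECORD))
    # Compression still helps with application-level redundancy. Repeating
    # the record is only a crude stand-in for a file of many similar records.
    many = BINARY_RECORD * 100
    print("100 records: ", len(many), "->", len(zlib.compress(many)), "compressed")
```

The binary form spends its bytes almost entirely on the data itself, so whatever compression removes afterwards is repetition in your data, not in the markup.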

Go check it out

Seriously, it’ll only take a few minutes to read the docs, and you’ll add a valuable tool to your arsenal. Whilst the problem that protocol buffers solve isn’t new, this is one of the best implementations I’ve seen so far. I hope the open source community embraces this system as a way to make easier-to-use, more efficient file formats and network protocols for its applications.