I want to write some stuff about Microsoft Singularity. It’s cool and everybody with an interest in computing should be talking about it. Here’s a summary for those who don’t want to read all the papers.
**what is it**
Singularity is an operating system research project. It’s a team of smart people who were asked: “what might an operating system look like, if it were designed from the ground up for dependability?”
People on pop forums like Slashdot and OSNews have been wishing for _years_ that Microsoft would throw away Windows and start from scratch, to address problems like reliability and malware. Usually their wish revolves around rebasing Windows onto some form of UNIX, but that’s a crap idea and wouldn’t actually achieve their wish at all. If you want to address problems that are caused by fundamental design decisions, you need to revisit them. This is what Singularity does.
Dependability is a pretty broad topic. At minimum it means not crashing, and it means being secure. But, although the Singularity researchers are exploring many topics, they don’t have a wide-open mandate … it’s not chartered to do GUI research for instance.
**how is it different?**
That’s what I want to talk about here.
Singularity is a high-performance, single-address-space microkernel design, which uses static type verification to enforce reliability properties and flexible, pattern-based ACLs.
Oooh, there’s a lot of scary academic talk in that sentence. Let’s figure out what it means. This is going to be complicated, because Singularity is pretty different to textbook OS designs.
It’s _high performance_. Performance isn’t actually a goal of the project, but the researchers are smart enough to realise that if they don’t keep it in mind, their research will wander into the weeds and become completely uncommercialisable. There are hints that Microsoft are thinking of one day using this research in real products, so it’s important to be fast (more on that later).
It’s also a _microkernel_.
For now let’s just focus on the fact that it’s a microkernel. We can cover the other things in future blog posts. Skip to the bottom if you already know this stuff – I’ll assume here that you, my dear reader, aren’t entirely sure what a microkernel is or why they are supposed to be slow.
Historically, there are two ways to design an operating system – really the same way, but with different cost:benefit analyses applied to certain decisions. The choice is between a microkernel and a monolithic kernel. Note that we’re talking low level designs here … this stuff is all independent of whether you use a task bar or a dock in your UI.
Recall that in a microkernel design, subsystems like the filesystem and network drivers run as more-or-less regular programs outside the kernel itself (which is distinguished from other programs basically by running in a special CPU mode). The kernel proper only handles starting processes/threads, sending messages between them and a small number of misc things like CPU scheduling. True microkernels are hard to find these days. You’ve probably used them without realising it – for instance, QNX is an operating system designed for embedded applications like Cisco routers, and QNX is a pure microkernel design.
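To make that shape concrete, here’s a toy sketch in plain C (Singularity itself is written in Sing#, a C# dialect – this is just an illustration): the “filesystem” is an ordinary process, and a pair of pipes stands in for the kernel’s message-passing machinery.

```c
/* Toy illustration of the microkernel idea, in plain C: the "filesystem"
 * is just another user-mode process, and the client talks to it purely by
 * exchanging messages. A pair of pipes stands in for the kernel's IPC
 * machinery. Purely illustrative - no real microkernel looks like this. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct request { char path[64]; };   /* "please read this file"  */
struct reply   { char data[80]; };   /* "here's what was in it"  */

int main(void) {
    int to_server[2], to_client[2];
    pipe(to_server);
    pipe(to_client);

    if (fork() == 0) {
        /* The "filesystem server": an ordinary process waiting for requests. */
        struct request req;
        read(to_server[0], &req, sizeof req);

        struct reply rep;
        snprintf(rep.data, sizeof rep.data, "pretend contents of %s", req.path);
        write(to_client[1], &rep, sizeof rep);
        return 0;
    }

    /* The client: format a request message, send it, wait for the answer. */
    struct request req;
    strcpy(req.path, "/etc/motd");
    write(to_server[1], &req, sizeof req);

    struct reply rep;
    read(to_client[0], &rep, sizeof rep);
    printf("server replied: %s\n", rep.data);
    return 0;
}
```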
Sidenote: Here’s a quick recap of virtual memory. When your code reads some value from memory, the CPU internally converts that address from a virtual address into a physical address it can give to the memory controllers. On a 32-bit CPU they’re both 32-bit pointers, and you’ll probably never see the raw physical address unless you’re actually a kernel developer. The conversion is done by a component of the CPU called the MMU (memory management unit), and is subject to an access control check. Memory is split into “pages”, which are 4KB each on Intel/AMD chips in the standard case, and each page can be mapped independently. Each page mapping has permission bits – read/write/execute – like a UNIX file would.
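If you want to see per-page permissions in action, here’s a minimal Linux/POSIX sketch – nothing Singularity-specific, just the standard mmap/mprotect calls that expose what the MMU is doing:

```c
/* Tiny demo of per-page permissions: map two pages, make the second one
 * read-only, and note that the protection really is enforced per page.
 * Linux/POSIX sketch, purely for illustration. */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long psz = sysconf(_SC_PAGESIZE);            /* 4096 on typical x86 */
    char *p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    p[0]   = 'a';                                /* first page: read/write */
    p[psz] = 'b';                                /* second page: fine, for now */

    mprotect(p + psz, psz, PROT_READ);           /* second page becomes read-only */

    printf("reading the read-only page still works: %c\n", p[psz]);
    /* p[psz] = 'c';   <- uncommenting this segfaults: the MMU refuses the
                           write and the kernel turns that into SIGSEGV */
    return 0;
}
```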
This memory mapping is the foundation of all security in existing operating systems. It prevents a buggy program splatting another program accidentally, and because only the kernel can update the page tables, and all hardware access has to go via the kernel, it means a program running in user-mode can’t really do anything interesting unless the kernel allows it. And because the MMU won’t let you read kernel memory, you can’t force it to give you that permission. It also means that we can use swap files to let the disk pretend it’s a RAM chip – just unmap the part of the process’s address space that was swapped out, catch the error when the program tries to read from it and load it back in.
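Here’s that swap trick in miniature – a Linux-only sketch that reserves an inaccessible page, catches the fault, and “pages it back in” from the fault handler. A real kernel would also copy the data back from disk at this point, and wouldn’t be doing any of this from a userspace signal handler; this is just to show the mechanism.

```c
/* The trick behind swapping, demonstrated in miniature: reserve a page with
 * no access rights at all, let the program fault when it touches it, and
 * grant access again from the fault handler. Linux-specific sketch; calling
 * mprotect from a signal handler is fine for a demo, not for real code. */
#include <signal.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

static void on_fault(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    /* "Page it back in": make the faulting page accessible again, then let
       the faulting instruction restart and succeed. A kernel would also
       read the swapped-out contents back from disk here. */
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* PROT_NONE: any access faults, just like touching a swapped-out page. */
    char *p = mmap(NULL, page_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    p[0] = 42;                       /* faults; the handler "pages it in" */
    printf("after the fault: %d\n", p[0]);
    return 0;
}
```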
Virtual memory is jolly good and is one of the biggest improvements to computer reliability in the past 13 years. Windows 3.1 didn’t use it to keep programs out of each other’s memory, Windows 95 did, and that right there was why many people upgraded. The advantages of the microkernel then are obvious … more use of virtual memory means buggy kernel components can’t blue-screen the computer like they can today. If your filesystem crashes, just restart it!
In a monolithic design, filesystems, drivers and even web servers are all loaded into the kernel itself and all run in privileged code. The kernel still provides message passing systems for user-mode processes to communicate, but they aren’t used anywhere near as much. Every mainstream server or desktop OS is monolithic – Windows, Linux and MacOS. Note that whilst Linux has always been monolithic, Windows NT started out as a microkernel, and MacOS X – being based on Mach – is theoretically one today. I don’t know anybody who believes that though.
It can be hard to say whether a particular system is truly a microkernel or a monolithic design, because it’s not a boolean yes/no thing – for instance, Linux runs its graphics subsystem in a separate process (the X server) whereas Windows _used_ to do that but doesn’t do it anymore. Nonetheless, everybody agrees that Linux is not a microkernel. A good smell test is whether the filesystems are running in kernel mode or not – graphics can be a grey area, but filesystems generally aren’t.
Anyway, Singularity being a microkernel might seem strange, because historically the debate has always been won by microkernels in academia and by monolithic kernels in the market, largely for performance reasons. These arguments were going strong in the 80s and early 90s, and you can read the infamous Torvalds vs Tanenbaum debate on it here. So at first it might appear that Singularity is just another academic exploration of the theoretically clean thing to do, at the cost of real world usability. But it’s not so.
_He won the debate_
**why are microkernels slower than monolithic kernels?**
Microkernels are typically slower than monolithic kernels because there is a cost associated with transitioning between user mode, kernel mode, and back again. What’s more, there’s an additional cost for switching the CPU between two user mode processes: a context switch.
These costs are small but real, and when you do bazillions of them per second can come to completely dominate the CPU such that you’re not getting any actual work done. Measuring those costs is hard, although the Singularity team have managed it.
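You can get a crude feel for the raw user→kernel→user transition cost on any Linux box by hammering a cheap syscall in a loop – a rough sketch, not the careful methodology the Singularity papers use:

```c
/* Rough-and-ready measurement of the user -> kernel -> user round trip:
 * time a few million cheap syscalls and divide. This captures only the bare
 * transition cost, not context switches or TLB effects; numbers will vary
 * wildly between machines. Linux sketch. */
#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const long N = 5 * 1000 * 1000;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        syscall(SYS_getpid);                 /* go to the kernel every time */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("about %.0f ns per syscall round trip\n", ns / N);
    return 0;
}
```

On recent hardware this usually comes out somewhere in the tens to low hundreds of nanoseconds per round trip, depending on the CPU and the kernel.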
The reason they cost precious time is because the CPU has to do unusual work to make them happen, and because the majority of the CPU’s time is spent _not_ doing that unusual work, it tends to not be well optimized (this has changed in recent generations of x86 chips, but the general point holds).
For example, when you invoke a syscall to make the kernel do something, you use a special CPU instruction. That used to be “int 0x80” on Linux, but these days you can use the “sysenter” opcode on kernels and x86 CPUs that support it (nearly all do). Control then transitions to the kernel. This is pretty fast on modern computers, but it wasn’t always so – and in fact early versions of Windows actually abused an illegal instruction because they found triggering a CPU exception was a faster way to get into kernel mode than using an interrupt (the official way). Intel fixed that.
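For the curious, here’s what the old-style entry looks like by hand – a sketch that only works on x86 Linux, and on 64-bit kernels only if the legacy int 0x80 path is still enabled:

```c
/* Making the old-style Linux syscall by hand: put the syscall number in eax
 * and execute "int $0x80". x86 Linux only; purely for illustration. */
#include <stdio.h>

int main(void) {
    int pid;
    __asm__ volatile("int $0x80"    /* trap into the kernel the old way     */
                     : "=a"(pid)    /* the result comes back in eax         */
                     : "a"(20));    /* 20 = __NR_getpid in the 32-bit table */
    printf("the kernel says our pid is %d\n", pid);
    return 0;
}
```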
Context switching is more expensive, firstly because it obviously involves a transition to kernel space, so you pay for the cost of doing that, but mostly because reconfiguring the page tables is slow.
Reconfiguring the page tables is slow partly because, again, it’s an unusual operation (it involves poking special registers on x86 chips), but mostly because it requires flushing the _translation lookaside buffers_. These buffers cache the results of the MMU’s lookups. Even though MMUs are custom-designed hardware and very fast, they’re still not free, and yet the translations are needed every time code accesses memory, which is all the time, so caching makes a lot of sense.
This also makes it hard to quantify exactly what a context switch costs you. We know it costs _something_ because of CPU design fundamentals, but the actual cost is spread out over the code in the new process to run. Immediately after a context switch then, your computer is running a little bit slower, and then picks up steam as the TLB fills up.
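Here’s a blunt userspace sketch of that effect: the same number of memory reads, either spread across thousands of pages or kept inside one page. Data-cache misses muddy the numbers badly, so treat this as a hint of TLB behaviour rather than a measurement.

```c
/* Same number of memory reads, either spread across thousands of pages
 * (each needs its own virtual -> physical translation) or confined to a
 * single page (one TLB entry covers everything). Cache misses are mixed
 * into these numbers too, so this is a hint, not a clean measurement. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGES     16384              /* 64 MB of 4 KB pages */
#define PAGE_SIZE 4096
#define PASSES    50

static double now_ns(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e9 + t.tv_nsec;
}

int main(void) {
    volatile char *buf = malloc((size_t)PAGES * PAGE_SIZE);
    for (long i = 0; i < (long)PAGES * PAGE_SIZE; i += PAGE_SIZE)
        buf[i] = 1;                              /* fault every page in first */

    long sum = 0;

    double t0 = now_ns();
    for (int pass = 0; pass < PASSES; pass++)
        for (long i = 0; i < (long)PAGES * PAGE_SIZE; i += PAGE_SIZE)
            sum += buf[i];                       /* one read per page: TLB thrashes */
    double spread = now_ns() - t0;

    t0 = now_ns();
    for (int pass = 0; pass < PASSES; pass++)
        for (long i = 0; i < PAGES; i++)
            sum += buf[i % PAGE_SIZE];           /* same count, one page: TLB stays hot */
    double local = now_ns() - t0;

    printf("spread over %d pages: %.1f ms, single page: %.1f ms (checksum %ld)\n",
           PAGES, spread / 1e6, local / 1e6, sum);
    return 0;
}
```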
So we have two conflicting priorities here. On one hand, using virtual memory to separate address spaces can improve reliability by insulating programs from each other, which is good, but on the other hand, it costs us some hard-to-measure amount of performance, which is bad. Worse, although it’s true that CPUs have got faster over time, they got faster at running code and not at doing address space manipulations, so we can’t rely on Moore’s Law to bail us out this time.
Microkernels are based on the idea of sending messages between processes running in separate address spaces. Thus to read a file, first you have to send a message from your program to the filesystem server. This means formatting the message in your own memory space (fast), invoking the “send message” syscall (not quite as fast), the kernel then copies the message into its own address space (sorta slow), does a context switch to the filesystem server (slow), and then copies the message into the filesystem server’s memory space before leaving kernel mode.
Once the filesystem reads the data, you have to do the whole thing in reverse, but this time copying the data back in a message … and because the cost of a message send goes up with its size, this is even slower than the initial request!
In contrast, in a monolithic design, you format your request (fast), do a read syscall (not quite as fast), wait whilst the filesystem gets your data, the kernel then copies your data into your memory space (or perhaps the hardware will do that if you’re using DMA) and returns to user mode … wow, simpler and faster! The disadvantage is that if the filesystem code is buggy, you blue-screen and lose everything.
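You can feel the difference on an ordinary Linux machine with a rough sketch like the one below: a direct read() from the kernel versus bouncing the same buffer through a separate “server” process over pipes. Real microkernels use far better optimized IPC than pipes, so the absolute gap here is exaggerated, but the extra copies and context switches show up clearly.

```c
/* Monolithic style: ask the kernel directly. Microkernel style (crudely
 * faked): send the request to a separate "filesystem server" process over
 * a pipe and have it send the data back, paying for the extra copies and
 * context switches. Linux sketch; real microkernel IPC is far better
 * optimized than pipes, so this exaggerates the gap. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 20000
#define BUFSZ  4096

static double now_ns(void) {
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e9 + t.tv_nsec;
}

int main(void) {
    char buf[BUFSZ];

    /* Monolithic style: one syscall, the kernel copies the data straight in. */
    int zero = open("/dev/zero", O_RDONLY);
    double t0 = now_ns();
    for (int i = 0; i < ROUNDS; i++)
        read(zero, buf, BUFSZ);
    printf("direct read:      %.0f ns/request\n", (now_ns() - t0) / ROUNDS);

    /* Microkernel style: the request goes to a server process, data comes back. */
    int req[2], rep[2];
    pipe(req);
    pipe(rep);
    if (fork() == 0) {                       /* the "filesystem server" */
        close(req[1]); close(rep[0]);        /* keep only our own ends   */
        char sbuf[BUFSZ], cmd;
        int fd = open("/dev/zero", O_RDONLY);
        while (read(req[0], &cmd, 1) == 1) {
            read(fd, sbuf, BUFSZ);           /* do the actual work        */
            write(rep[1], sbuf, BUFSZ);      /* copy the reply back out   */
        }
        _exit(0);
    }
    double t1 = now_ns();
    for (int i = 0; i < ROUNDS; i++) {
        write(req[1], "r", 1);               /* send the request message  */
        ssize_t got = 0;                     /* wait for (and copy) reply */
        while (got < BUFSZ)
            got += read(rep[0], buf + got, BUFSZ - got);
    }
    printf("via server + IPC: %.0f ns/request\n", (now_ns() - t1) / ROUNDS);

    close(req[1]);                           /* lets the server exit */
    return 0;
}
```

The pipe version typically comes out several times slower per request, which is exactly the gap the microkernel versus monolithic argument has always been about.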
About 80% of Windows crashes are caused by crappy drivers. So, if you could stop drivers from exploding the system in the same way we can stop apps, we could eliminate 80% of the world’s blue screens! That’s pretty cool. It also means that drivers could have security enforced. Today if you install a 3rd party filesystem, who knows what you’re getting? Unless you review and compile the code yourself, you just have to trust whoever gave it to you. Even if they mean well, a bug in the new driver can open a local root exploit, compromising the entire security system.
It’s probably not surprising then that academia preferred the slow-but-robust solution, and desktop OS vendors preferred the fast-but-unstable one. It’s not that they didn’t try! Windows NT started out as a pure microkernel design, but even with super-optimized IPC they eventually gave up and moved the whole shebang into kernel space, including the GUI. They got a lot of stick for that, but it made Windows feel snappier and more responsive, thus making users happy.
**the singularity approach**
Singularity manages to have its cake and eat it too. It gets both the robustness benefits of a microkernel, and manages to get even better performance than a monolithic kernel. Neat!
But I wrote too much above. How it pulls off this trick will have to wait for next time.
cpu photo by nadya peek