Archive for January, 2008

singularity: part 2

January 26, 2008

Last time I talked about Singularity I rambled about micro-kernels for a while. You probably knew that stuff backwards anyway. Onwards!

**How does it work?**

Singularity pulls off the impressive feat of being a pure microkernel design that nonetheless is ~30% faster than the traditional approach, and about 10% faster than the monolithic approach for a file-io heavy benchmark. How does it do that?

The trick is very simple – they just throw away hardware memory protection entirely. In Singularity, everything runs in kernel mode, and everything runs in the same address space. The MMU, in other words, doesn’t do anything. There are no “processes” in the traditional sense.

Obviously, if that were the _only_ thing they did it wouldn’t be very interesting. 85% of Windows blue-screens are caused by drivers, not kernel bugs (I suspect the rest are caused by hardware failure). Innumerable privilege escalation vulnerabilities can be caused by bad drivers. Every privilege escalation is a gift to the bad guys. Protection against software bugs is what makes micro-kernels useful.

Programs in Singularity _are_ isolated from each other, but the isolation is done entirely in software, using type theory instead of silicon. They can do this, because programs in Singularity are all implemented in a C# derivative called Sing# (although in theory any .NET language could be used, Singularity uses a few features that they added to C#, so other languages would need the same minor extensions).

You already know that in most modern languages like Java and C# you can’t access memory directly. Most obviously, there’s no way to write the following C in Java:

*((char *)0x1234) = 'X'; // overwrite the byte at address 0x1234

It’s not just that there’s no syntax for it. It’s that the Java/C# compilers produce a form of assembly language that can be checked by a bytecode verifier to mathematically prove that the program does not do this. Once proven, we can translate the JVM/.NET opcodes into actual code that the CPU can run, confident that if we don’t give the code a reference to an object, it can’t read or write to it. That’s not only good for reliability – we can build a security system on top of that!

**exceptions to the above**

OK, so that was the theory. In practice it wasn’t so easy. Just like theoretically “sandboxed” programs written in C can escape the sandbox by exploiting kernel bugs, the same is true of Java and .NET – by exploiting bugs in the VM or native class libraries (written in C++) the security system can be compromised.

Worse, in some cases, features like reflection can be used to access objects that people did not expect.

Both these problems have caused breaches in JVM applet security in the past. Given how huge the modern JREs are, that isn’t surprising.

**the solution**

Singularity’s solution is again straightforward – almost all of the system is itself written in C#, with only a small amount written in C++ and assembly language. This immediately makes the core “kernel” robust against the most common types of attacks. Even then, Microsoft are working on research that will let them prove the safety of the tiny remaining pieces of unsafe code.

The building block in Singularity is called a SIP, for “software isolated process”. Singularity retains the idea of a process being something that owns resources, has its own memory space and is independently scheduled, but it’s all enforced with software and mathematics.

A SIP has its own heap, its own garbage collector (you can choose one from several the OS offers, reflecting the fact that no one GC fits all situations), and its own memory pages. So it’s more isolated than some similar approaches (like KaffeOS), in which all code is loaded into a single runtime, and objects can be exchanged between different programs. This has the disadvantage that you can’t easily exchange objects between two SIPs. It has the pragmatic advantage that SIPs can be quite different from each other – not only using different garbage collectors, but also different language runtimes and in-memory layouts. And it means they can be deallocated quickly, without doing a full-heap GC, which could potentially be extremely slow.

Given that we’re no longer using the CPU’s hardware modes, it’s no longer clear how we should define what the “kernel” really is. In Singularity, the “kernel” is the software component that connects SIPs together, does memory management, loads new SIPs and handles some other tasks. Logically, it also includes the trusted/unsafe parts of the system written in C++ or assembly. Some of these are actually attached to a SIP, like the garbage collectors, but because they are just trusted to be correct they can be thought of as a part of the kernel.

SIPs call into the kernel simply by doing a regular function call. You need a way to mark the stack so the kernel’s own garbage collector doesn’t interfere with the SIP’s garbage collector, but that’s easy and fast. Thus all the syscall overhead is avoided.

SIPs communicate via “channels”. These are message based pipes, sort of like UNIX sockets, except strongly typed and faster. A channel is actually a mathematical abstraction – sending a message via such a pipe does not involve any copying or fancy hardware tricks. It’s simply updating a few pointers in memory. Because of this, sending even very large amounts of data between SIPs (say between the network driver and a web server) is fast.
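Here’s a rough model of that idea in Python. `Owned` and `Channel` are hypothetical names, not Singularity’s API, and the toy checks at run time what I understand Singularity to check statically. Sending moves the sender’s only reference to the buffer into a queue, so the cost doesn’t depend on how big the buffer is.

```python
from collections import deque

class Owned:
    """Wraps the single reference a SIP holds to a message buffer."""
    def __init__(self, payload):
        self._payload = payload

    def take(self):
        # Hand over the reference and forget it, so the old owner can
        # no longer touch the buffer.
        if self._payload is None:
            raise ValueError("buffer was already sent")
        payload, self._payload = self._payload, None
        return payload

class Channel:
    """One-directional message pipe between two SIPs."""
    def __init__(self):
        self._queue = deque()

    def send(self, owned):
        # Only the reference is enqueued; no bytes are copied.
        self._queue.append(owned.take())

    def receive(self):
        # The receiver becomes the new (sole) owner.
        return Owned(self._queue.popleft())

def demo():
    buffer = bytearray(1024 * 1024)        # a 1 MB "network packet"
    sender_handle = Owned(buffer)
    channel = Channel()
    channel.send(sender_handle)
    receiver_handle = channel.receive()
    same_object = receiver_handle.take() is buffer
    try:
        sender_handle.take()               # sender tries to reuse the buffer
        sender_lost_access = False
    except ValueError:
        sender_lost_access = True
    return same_object, sender_lost_access
```

`demo()` returns `(True, True)`: the receiver ends up with the very same buffer object, and the sender can no longer reach it.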

**enforcing security**

It might seem odd that invoking the kernel is just a regular function call. Surely that’s not possible? What stops a SIP from simply invoking the instructions to control the hard disk itself?

The answer is that a SIP is shipped (say, on CD) as a set of MSIL bytecode files. These files are not compiled at runtime as in a regular Java or .NET system. Instead, when the software is installed, it is compiled ahead of time into native code after being statically checked using a variety of analyses. Software installation is a privileged operation in Singularity – it’s handled by the kernel itself. Only software installed by the kernel will be allowed to run.

Because you can’t represent CPU specific instructions like “write to this IO port” or “trigger this interrupt” in safe MSIL, the only way for a piece of code to do that is to be linked with some trusted native library that will do it for you. Because the kernel is in charge of software installation, it can verify that only certain software is linked with such libraries, and even then, only in certain ways. Thus, the kernel can control access to hardware resources without relying on hardware checks – by controlling who has access to the CPU (and how) at install time.
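As a sketch of what such an install-time check could look like (in Python, with made-up library names and a made-up policy format; I’m not claiming this is how Singularity’s installer is actually structured):

```python
# Hypothetical names standing in for trusted native libraries.
TRUSTED_LIBS = {"io_ports", "interrupts", "dma"}

def may_install(package_name, linked_libs, policy):
    """Return True if the package links only against trusted libraries
    that the policy grants it.

    linked_libs: set of library names the package links against.
    policy:      dict mapping package name -> set of trusted libraries
                 that package is allowed to use.
    """
    requested = set(linked_libs) & TRUSTED_LIBS   # the dangerous ones
    granted = policy.get(package_name, set())
    return requested <= granted

# Only the disk driver may touch IO ports and DMA.
policy = {"disk_driver": {"io_ports", "dma"}}
```

With this policy, `may_install("disk_driver", {"io_ports", "collections"}, policy)` passes, while a `web_server` package that links against `io_ports` is refused before it ever gets to run.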

**manifests**

In a regular OS, a process more or less corresponds to a program. A few programs have multiple processes, for instance, iTunes installs a program that navel gazes until you plug in an iPod. But generally speaking, one program uses up one process, and actually it’s quite common for one process to contain more than one program – for instance, a web browser that hosts plugins.

Because SIPs are so cheap, it’s reasonable – encouraged, even – to split a single program into several co-operating SIPs. Thus we have a problem – how do we start a program? What even _is_ a program in such a setup?

The answer is that a program in Singularity is defined by a manifest. A manifest is (like in .NET) an XML file describing the SIPs that make up a program, and the connections between them. A manifest is mostly auto-generated from metadata annotations in the code. When you start a program, you actually invoke a manifest.

**drivers in singularity**

This has interesting implications for how drivers are managed. Unfortunately, I blew my word count again. It’ll have to wait for next time.

singularity: part 1

January 6, 2008

I want to write some stuff about Microsoft Singularity. It’s cool and everybody with an interest in computing should be talking about it. Here’s a summary for those who don’t want to read all the papers.

**what is it**

Singularity is an operating system research project. It’s a team of smart people who were asked “what might an operating system look like, if it was designed from the ground up for dependability?”

People on pop forums like Slashdot and OSNews have been wishing for _years_ that Microsoft would throw away Windows and start from scratch, to address problems like reliability and malware. Usually their wish revolves around rebasing Windows onto some form of UNIX, but that’s a crap idea and wouldn’t actually achieve their wish at all. If you want to address problems that are caused by fundamental design decisions, you need to revisit them. This is what Singularity does.

Dependability is a pretty broad topic. At minimum it means not crashing, and it means being secure. But, although the Singularity researchers are exploring many topics, they don’t have a wide-open mandate … it’s not chartered to do GUI research for instance.

**how is it different?**

That’s what I want to talk about here.

Singularity is a high performance, single address space microkernel design, which uses static type verification to enforce reliability properties and flexible pattern based ACLs.

Oooh, there’s a lot of scary academic talk in that sentence. Let’s figure out what it means. This is going to be complicated, because Singularity is pretty different to textbook OS designs.

It’s _high performance_. Performance isn’t actually a goal of the project, but the researchers are smart enough to realise that if they don’t keep it in mind, their research will wander into the weeds and become completely uncommercialisable. There are hints that Microsoft are thinking of one day using this research in real products, so it’s important to be fast (more on that later).

It’s also a _microkernel_.

For now let’s just focus on the fact that it’s a microkernel. We can cover the other things in future blog posts. Skip to the bottom if you already know this stuff – I’ll assume here that you, my dear reader, aren’t entirely sure what a microkernel is or why they are supposed to be slow.

**a microwhat?**

Historically, there are two ways to design an operating system, which are actually the same way but using different cost:benefit analyses for certain decisions. These are whether to use a microkernel, or a monolithic kernel. Note that we’re talking low level designs here … this stuff is all independent of whether you use a task bar or a dock in your UI.

Recall that in a microkernel design, subsystems like the filesystem and network drivers run as more-or-less regular programs outside the kernel itself (which is distinguished from other programs basically by running in a special CPU mode). The kernel proper only handles starting processes/threads, sending messages between them and a small number of misc things like CPU scheduling. True microkernels are hard to find these days. You’ve probably used them without realising it – for instance, QNX is an operating system designed for embedded applications like Cisco routers, and QNX is a pure microkernel design.

Sidenote: Here’s a quick recap of virtual memory. When your code reads some value from memory, the CPU internally converts that address from a virtual address into a physical address it can give to the memory controllers. On a 32 bit CPU they’re both 32 bit pointers, and you’ll probably never see the raw physical address unless you’re actually a kernel developer. The conversion is done by a component of the CPU called the MMU (memory management unit), and is subject to an access control check. Memory is split into “pages”, which are 4 KB each on Intel/AMD chips in the standard case, and each page can be mapped independently. Each page mapping has permission bits – read/write/execute – like a UNIX file would.
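The arithmetic is simple enough to sketch in a few lines of Python. This is a toy model: the page table is a flat dict (real x86 page tables are multi-level structures in memory), and `split`, `translate` and the example mapping are names I made up for illustration.

```python
PAGE_SIZE = 4096      # 4 KB pages
OFFSET_BITS = 12      # 2**12 == 4096

def split(vaddr):
    """Split a virtual address into (virtual page number, offset)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

def translate(page_table, vaddr, access):
    """Translate a virtual address to a physical one.

    page_table maps virtual page number -> (physical frame, permissions),
    where permissions is a string made of 'r', 'w' and 'x'.
    access is the single permission letter being requested.
    """
    vpn, offset = split(vaddr)
    entry = page_table.get(vpn)
    if entry is None:
        raise LookupError("page fault: page %d is not mapped" % vpn)
    frame, perms = entry
    if access not in perms:
        raise PermissionError("page %d does not allow '%s'" % (vpn, access))
    # Same offset within the page, different page-sized frame.
    return (frame << OFFSET_BITS) | offset

# Virtual page 1 lives in physical frame 0x80, readable and writable.
page_table_example = {1: (0x80, "rw")}
```

So virtual address 0x1234 is page 1, offset 0x234, and with the mapping above it lands at physical address 0x80234. Asking to execute from that page raises `PermissionError`, which is the toy version of the access control check.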

This memory mapping is the foundation of all security in existing operating systems. It prevents a buggy program splatting another program accidentally, and because only the kernel can update the page tables, and all hardware access has to go via the kernel, it means a program running in user-mode can’t really do anything interesting unless the kernel allows it. And because the MMU won’t let you read kernel memory, you can’t force it to give you that permission. It also means that we can use swap files to let the disk pretend it’s a RAM chip – just unmap the part of the process’s address space that was swapped out, catch the error when the program tries to read from it and load it back in.
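The swap trick can be modelled in a few lines too. This is only a sketch: `PagedMemory` is a hypothetical class, the “disk” is just a second dict, and a real kernel would also have to pick pages to evict.

```python
class PagedMemory:
    """Toy model of demand paging: pages are either resident in RAM
    or swapped out to disk; touching a swapped page faults it back in."""

    def __init__(self, resident, swapped):
        self.resident = dict(resident)   # page number -> bytes in RAM
        self.swapped = dict(swapped)     # page number -> bytes on disk
        self.faults = 0                  # page faults handled so far

    def read(self, page, offset):
        if page not in self.resident:
            self._page_fault(page)       # the "catch the error" step
        return self.resident[page][offset]

    def _page_fault(self, page):
        if page not in self.swapped:
            # Not mapped anywhere: a genuine bad access.
            raise MemoryError("access to unmapped page %d" % page)
        self.faults += 1
        # Load the page back from disk and map it again.
        self.resident[page] = self.swapped.pop(page)
```

The program doing the reading never notices: the first read from a swapped-out page costs one fault, and every later read from that page hits RAM directly.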

Virtual memory is jolly good and is one of the biggest improvements to computer reliability in the past 13 years. Windows 3.1 didn’t use it, Windows 95 did and that right there was why many people upgraded. The advantages of the microkernel then are obvious …. more use of virtual memory means buggy kernel components can’t blue-screen the computer like they can today. If your filesystem crashes, just restart it!

In a monolithic design, filesystems, drivers and even web servers are all loaded into the kernel itself and all run in privileged code. The kernel still provides message passing systems for user-mode processes to communicate, but they aren’t used anywhere near as much. Every mainstream server or desktop OS is monolithic – Windows, Linux and MacOS. Note that whilst Linux has always been monolithic, Windows NT started out as a microkernel, and MacOS X – being based on Mach – is theoretically one today. I don’t know anybody who believes that though.

It can be hard to say whether a particular system is truly a microkernel or a monolithic design, because it’s not a boolean yes/no thing – for instance, Linux runs its graphics subsystem in a separate process (the X server) whereas Windows _used_ to do that but doesn’t do it anymore. Nonetheless, everybody agrees that Linux is not a microkernel. A good smell test is whether the filesystems are running in kernel mode or not – graphics can be a grey area, but the filesystems are generally not.

Anyway, Singularity being a microkernel might seem strange, because historically the debate has in academia always been won by microkernels and in the market has always been won by monolithic kernels, largely for performance reasons. These arguments were going strong in the 80s and you can read the infamous Torvalds vs Tanenbaum debate on it here. So at first it might appear that Singularity is just another academic exploration of the theoretically clean thing to do, at the cost of real world usability. But it’s not so.


He won the debate

**why are microkernels slower than monolithic kernels?**

Microkernels are typically slower than monolithic kernels because there is a cost associated with transitioning between user mode, kernel mode, and back again. What’s more, there’s an additional cost for switching the CPU between two user mode processes: a context switch.

These costs are small but real, and when you do bazillions of them per second they can come to completely dominate the CPU such that you’re not getting any actual work done. Measuring those costs is hard, although the Singularity team have managed it.

They cost precious time because the CPU has to do unusual work to make them happen, and because the majority of the CPU’s time is spent _not_ doing that unusual work, it tends to not be well optimized (this has changed in recent generations of x86 chips, but the general point holds).

For example, when you invoke a syscall to make the kernel do something, you use a special CPU instruction. That used to be “int $0x80” on Linux but these days you can use the “sysenter” opcode on kernels and x86 CPUs that support it (nearly all do). Control then transitions to the kernel. This is pretty fast on modern computers, but it wasn’t always so – and in fact early versions of Windows actually abused an illegal instruction because they found triggering a CPU exception was a faster way to get into kernel mode than using an interrupt (the official way). Intel fixed that :)

Context switching is more expensive, firstly because it obviously involves a transition to kernel space, so you pay for the cost of doing that, but mostly because reconfiguring the page tables is slow.

Reconfiguring the page tables is slow partly because, again, it’s an unusual operation (it involves poking special registers on x86 chips), but mostly because it requires flushing the _translation lookaside buffers_. These buffers cache the results of the MMU’s lookups. Even though MMUs are custom designed hardware and very fast, they’re still not free and yet the translations are needed every time code accesses memory, which is all the time, thus caching makes a lot of sense.

This also makes it hard to quantify exactly what a context switch costs you. We know it costs _something_ because of CPU design fundamentals, but the actual cost is spread out over the code in the new process to run. Immediately after a context switch then, your computer is running a little bit slower, and then picks up steam as the TLB fills up.

So we have two conflicting priorities here. On one hand, using virtual memory to separate address spaces can improve reliability by insulating programs from each other, which is good, but on the other hand, it costs us some hard-to-measure amount of performance, which is bad. Worse, although it’s true that CPUs have got faster over time, they got faster at running code and not at doing address space manipulations, so we can’t rely on Moore’s Law to bail us out this time.

**message passing**

Micro-kernels are based on the idea of sending messages between processes running in separate address spaces. Thus to read a file, first you have to send a message from your program to the filesystem server. This means formatting the message in your own memory space (fast) and invoking the “send message” syscall (not quite as fast). The kernel then copies the message into its own address space (sorta slow), does a context switch to the filesystem server (slow), and copies the message into the filesystem server’s memory space before leaving kernel mode.

Once the filesystem reads the data, you have to do the whole thing in reverse but this time copying the data back in a message …. because the cost of a message send goes up with its size, this is even slower than the initial request!

In contrast, in a monolithic design, you format your request (fast), do a read syscall (not quite as fast), wait whilst the filesystem gets your data, the kernel then copies your data into your memory space (or perhaps the hardware will do that if you’re using DMA) and returns to user mode …. wow, simpler and faster! The disadvantage is that if the filesystem code is buggy, you blue-screen and lose everything.
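To put rough numbers on the difference, here’s a bookkeeping sketch in Python. It just counts the steps as I described them above; the function names are mine, the counts are not measurements, and for the monolithic case I only count the copy of the data back to the caller.

```python
def microkernel_read(request_bytes, reply_bytes):
    """Count the work for one file read via a filesystem server."""
    cost = {"syscalls": 0, "context_switches": 0, "bytes_copied": 0}
    # Request: client -> kernel -> filesystem server.
    cost["syscalls"] += 1                        # client's "send message"
    cost["bytes_copied"] += 2 * request_bytes    # into kernel, out to server
    cost["context_switches"] += 1                # client -> server
    # Reply: filesystem server -> kernel -> client.
    cost["syscalls"] += 1                        # server's "send message"
    cost["bytes_copied"] += 2 * reply_bytes      # into kernel, out to client
    cost["context_switches"] += 1                # server -> client
    return cost

def monolithic_read(request_bytes, reply_bytes):
    """Count the work for one file read with an in-kernel filesystem."""
    return {
        "syscalls": 1,                  # the read() call itself
        "context_switches": 0,          # the filesystem runs in the kernel
        "bytes_copied": reply_bytes,    # one copy out to the caller
    }
```

For a 64 byte request and a 4096 byte reply, the microkernel path comes out at 2 syscalls, 2 context switches and 8320 bytes copied, against 1 syscall, no context switches and 4096 bytes copied for the monolithic path.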

About 80% of Windows crashes are caused by crappy drivers. So, if you could stop drivers from exploding the system in the same way we can stop apps, we can eliminate 80% of the world’s blue screens! That’s pretty cool. It also means that drivers could have security enforced. Today if you install a 3rd party filesystem, who knows what you’re getting? Unless you review and compile the code yourself, you just have to trust whoever gave it to you. Even if they mean well, a bug in the new driver can open a local root exploit, compromising the entire security system :(

It’s probably not surprising then that academia preferred the slow-but-robust solution, and desktop OS vendors preferred the fast-but-unstable solution. It’s not that they didn’t try! Windows NT started out as a pure microkernel solution, but even with super-optimized IPC they eventually gave up and moved the whole shebang into kernel space, including the GUI, which they got a lot of stick for but made Windows feel snappier and more responsive thus making users happy.

**the singularity approach**

Singularity manages to have its cake, and eat it. It gets both the robustness benefits of a microkernel, and manages to get even better performance than a monolithic kernel. Neat!

But I wrote too much above. How it pulls off this trick will have to wait for next time.

cpu photo by nadya peek

siemens gigaset se551 firmware

January 5, 2008

Here’s a note for the interwebs. The GigaSet SE 551 adsl/cable router firmware is really quite nice, and gets a lot of things right. Sadly, the same cannot be said for the people who run their website.

Unfortunately, as with all software, you may need to upgrade it from time to time. The firmware is shipped on their site as a Windows EXE, despite consisting only of a readme file and the BIN that you upload into the router. What is a humble Linux or Mac user to do?

Fortunately, the Windows EXE is only a Winzip self extractor. Here’s the magic incantation to get your grubby paws on it using Linux:

unzip Gigaset_SE551_WLAN_DSL_Cable_V2706_int.exe

Probably, a similar trick will work on MacOS.

more attacks in the ksa

January 3, 2008

Looks like Saudi Arabia stopped another planned al-Qaeda attack. From the article:

The conservative Muslim kingdom also said it arrested 208 suspected Al-Qaeda militants over the past few months who were plotting assassinations and an attack on a logistical oil facility.

208 seems absurdly high, and given the KSA’s rather dubious standards of justice, it’s unlikely there were actually 208 terrorists involved.

Nonetheless, the risk remains of a suicide attack on key pieces of infrastructure such as the Abqaiq stabilisation facility, or the refinery at Ras Tanura. I last wrote about this in April – to recap, approximately 6-7% of the world’s oil supply passes through Abqaiq, and any attack that managed to shut it down would cause an instant oil shock in the west.

That makes it a very juicy target for anybody who doesn’t like us. Back then, only a fairly ragtag bunch of terrorists were trying to damage it. Now we have another risk – if Iran is attacked it will fire missiles not at our ships (which they can easily destroy) but at Abqaiq or a part of the KSA pipeline network. A clever strategic strike for sure, because it doesn’t hurt soldiers …. it hurts the average US citizen, in the wallet.

The Saudi regime takes this threat seriously, and is building up a dedicated army to protect the petroleum infrastructure:

Saudi security officials say the force now has about 9,000 of its planned 35,000 troops. Still to come are a helicopter force crucial for pipeline defense and an air-defense system designed to thwart both suicide aircraft and missiles.

In other news, oil brings in the new year by passing $100/barrel. Cheers!