Friday, February 6, 2009

Old Article Redux: "... an argument for asymmetric multiprocessing"

My old article mentioned in the title made the argument that certain subtasks (to use a general term) in a workload don't require the full performance potential of a modern out-of-order core.  Of course, I attached the label AMP to it, and that was a bad idea.  Always use the newest, hippest, most up-to-date buzzwords when describing your ideas.  I'm not sure what that buzzword is in this case, but in 2003, Kumar, Farkas, Jouppi, Ranganathan, and Tullsen called the same idea Single-ISA heterogeneous multi-core architectures in their paper at MICRO-36.

I guess I'm in pretty good company, but at the time (December 2006) I was already 3 years late, and this update over 5 years past.  I still like the idea, and am looking for good avenues of inquiry in the area.

Friday, September 5, 2008

Compiling LuaJIT on FreeBSD/amd64

LuaJIT 1.x only has a 32-bit x86 JIT. This one-line change to the makefile will allow it to compile on 64-bit/amd64/EM64T/x64 FreeBSD.  To be clear, this just lets you compile the luajit executable as a 32-bit program, not magically enable a 64-bit JIT.  Tested on FreeBSD 7.0-RELEASE.

--- LuaJIT-1.1.4/src/Makefile 2008-02-05 10:00:00.000000000 -0600
+++ LuaJIT-1.1.4-32/src/Makefile 2008-09-05 16:17:06.425074394 -0500
@@ -127,7 +127,7 @@
@echo " $(PLATS)"

bsd:
- $(MAKE) all MYCFLAGS="-DLUA_USE_POSIX -DLUA_USE_DLOPEN" MYLIBS="-Wl,-E"
+ $(MAKE) all MYCFLAGS="-m32 -DLUA_USE_POSIX -DLUA_USE_DLOPEN" MYLIBS="-m32 -B/usr/lib32 -L/usr/lib32 -Wl,-E"

bsd_rl:
$(MAKE) all MYCFLAGS="-DLUA_USE_POSIX -DLUA_USE_DLOPEN -DLUA_USE_READLINE" MYLIBS="-Wl,-E -lreadline"

Saturday, October 6, 2007

Microarchitectural Trends Demand Asynchronous Software Design

It is likely that within the next few decades, semiconductor manufacturing will reach the fundamental physical limitations of the universe. That is, until we figure out how to do computation using subatomic particles.

Intel says in their 45 nm "Fun Facts" that a silicon atom is 0.24 nm in size (presumably diameter). We are within a factor of 200 of designing circuits at the precision of individual silicon atoms. While that's definitely cool and amazing (that's the whole reason they mention it at all), it doesn't exactly give me a warm fuzzy. At least they acknowledge that multi-core is the future and are heading toward it.

This suggests that at some point, there will be a limit to the speed of serial execution (moreso than there already is) and programs will have to be written concurrently to be maximally scalable. (This statement is probably completely obvious to all my readers, but I felt I had to say it anyway.)

As I have already argued in other posts and will reiterate here: asynchronous software design can only help you if you actually use it. These concepts have been around for decades but dealing with synchronous OS interfaces and single processor machines made it less than ideal from an efficiency standpoint. The processor trends are taking care of themselves but the OS interface remains. Once that is taken care of, there will be nothing left but human inertia preventing concurrent programming from taking over.

Syscall Innovation Synergizes With Many-Core Architectures

Suppose our little syscall network is just too rich, too sophisticated, and too pompous for an actual kernel to have to deal with. That's okay; we can still find some ways to squeeze some advantages out of the idea.

Just like syscalls are usually shielded by libraries, so would it be with async syscalls. Behind the scenes, this library can create a thread for the program just for the purpose of doing this syscall combinator magic. In this design, the thread that I previously suggested as a potential "kernel thread" is now just a user thread that the user doesn't explicitly control or necessarily know about. It's just a manifestation of the inner workings of a rich library component. The advantage here is that we still have another execution context that can be migrated to a different core from the rest of the running process. We've primarily lost the advantage of not dealing with the privilege-barrier data copying issue. For the most part that is probably not terrible as in the cases where we want to turn around syscall responses into further syscalls (directory entries get turned around into open >>= read's), the amount of data is small.

Again, the real amount of work happening in this thread is very small and so would have a very small resident memory footprint. It wouldn't be terribly surprising for the necessary logic to reside in one OS page, say 4K or 8K or whatever. It also would only deal with integer operations, most likely.

This means that it's an ideal candidate to be run on a less powerful core as advocated in my previous post on asymmetric multiprocessors. That particular post was not terribly well received or even understood for that matter (I'll take most of the blame for that), but ultimately the fact is that "it's just another core" that happens to be slower and less featureful but is still suitable for performing simple integer tasks. The syscall combinator idea is just one way of being able to take advantage of such a resource.

Finally, if we remain resigned to a user privilege level implementation, we can mostly ignore the safety issues of infinite iteration and recursion and could increase the richness of the language. The downside is that it may then be harder to distribute the logic to other nodes in distributed systems with complex trust relationships.

One reasonable compromise might be that the kernel can handle simple, linear combinator networks that can be represented by DAGs but that any networks that include a cycle (a feedback loop, if you will) must remain in user space. This allows the OS to go back to the ostrich approach and generally speaking eliminates the necessity of a dedicated kernel thread.

Implementing Sandboxed/Distributable Syscall Combinator Networks

It's getting late, so I might elide some of the technical details (a la Fermat, "I have a lovely proof but it would just take too long to type"). I hypothesize that the actual implementation of $SUBJECT could be relatively simple.

The queueing network is essentially just a message router; it does not perform modifications to the data. At least, I don't envision that we would, although we could. It just further complicates the sandboxed environment, so I say for now that no values are changed, they are only deconstructed/bound to variables.

The queueing network may send a message to an async syscall but it really does no computation other than limited things like case statements.

Anyway, I point out this simplicity because all the message queues can be handled by one event loop, probably not unlike how Erlang has worked for many years (pre-SMP support). The different queue behaviors are flattened into one big case statement, essentially. This allows the asynchronous syscall combinator language to be executed in a kernel thread that can be paired with the user process. Even if the user process has N threads for whatever reason, all receiving messages from different flows, only one kernel thread is necessary to handle that. Furthermore, this thread should be completely re-entrant so it could theoretically execute on every core in the system. This could be handy on big file servers: if different buses had their interrupts routed to different CPUs, if an I/O completes, neither execution nor data has to be sent to another CPU in order to find out what to do with it. Obviously I/O distribution for locality optimization forces the distribution of this thread to all participating hosts (although this will likely divide the operations into mutually exclusive sets so each host is only dealing with the types of messages it will get).

Another Way To Look At Syscall Message Combinators

Consider a more complex example along the lines of grep -R -l foo . - "List any files in this directory or any descendent directory that contain the word 'foo'". It gets exceedingly harder to come up with this, but it could be done.

scandir d = opendir(d) >>= stream_readdir >>= map (\f -> case (type f): when file -> open(f) >>= stream_read_unordered(); when dir -> scandir d)
scandir '.'


I'm not a Haskell expert so I'm taking some liberties with syntax. I read this as follows: open a directory and stream its contents into a message queue, but the message queue "terminates locally" (does not get sent to another process) by being sent to a function that either opens and streams the file, as before, or recursively starts another directory scan. At a slightly lower level, we might consider creating a local message queue that executes opendir >>= stream_readdir on every message (and all its outputs go to the same message queue), then we seed the queue with an initial '.'. That way there is no recursion, per se, but there is a cycle in our little message queue network. The messages that have no declared destination inside that network are delivered back to the client. In this particular case, we're receiving unordered blocks from files just as before.

By this point we've pretty much passed an entire monadic Haskell program to the kernel and are expecting it to be okay with spawning its own little queueing network on our behalf. Isn't this crazy?

Well, it definitely is, of course, but it also makes some sense.

To start with, it eliminates unnecessary message passing across privilege boundaries. The program, even though it performs IO, is written with pure functions, like Haskell IO is.

The combinators and function bindings easily reduce to a simple queueing network (or can, anyway, supposing we design it carefully), which plays well into our whole asynchronous system call universe that works by passing messages around. Interestingly, this example demonstrates the case where the kernel is executing a syscall for each message in a stream created by the syscall that you actually asked for. The user program doesn't actually know how many syscalls are being executed on his behalf, he just starts getting a barrage of messages and scans them for 'foo' as fast as he can.

Finally, again this carries the data locality advantage. If we have some language that we declare "safe" enough to run "in the kernel"* (even in a controlled eval, which is OK, because the slow stuff is the I/O, and doing a little message passing on the behalf of some interpreted code is relatively fast in comparison), then perhaps we could expect our neighbor, who doesn't necessarily trust us at all but does let us access his data, to run it too (sure, sandbox it, whatever).

This is very important because in a distributed system, the "operating system" very likely does not own the resource we are recursively grep'ing. The message clump can have individual message queues (internal ones like one that kicks off an opendir) and data generating nodes (like the stream_readdir or stream_read_unordered) distributed to different physical resources in the distributed system in accordance to the locality of the data being accessed by each.

So, yes, it is crazy, but if the language is suitably restrictive, we can get a limited form of process migration, and I/O migration is probably the most rewarding subset of them since I/O is so inherently latent, whether considering storage or network I/O.

* Compare what Sun had to do with the Dtrace language, D. They had to prevent infinite loops, which we potentially could have here. But ultimately the program would have the same behavior itself if not running within the kernel, but here, we might be able to actually examine the situation in this limited universe and do something about it. Further musing: data/codata analysis with structural recursion and so forth might be able to reason about cycles and declare them safe. Like transactional systems, we might optimistically let them run until we have some way of "suspecting" that they are having trouble, then then examine the situation (the traditional "hmm, this transaction is taking too long, oh what do you know it is actually deadlocked now that I look at it, zap" approach).

Some Commentary on Perceived Originality

The clumping concept is fairly obvious; I don't know how practical it is or how many message passing systems may already use it, etc., but it seems obvious.

The way that the exact ordering semantics of read streams could be specified was sort of inspired by the way the ZFS authors created a rich transactional interface somewhere in their rampant layering violation. Specifying exactly what you want as a client can be tedious, but it can definitely pay off. It pays off for ZFS and I think it will pay off here, too.

Finally, I think the real originality is the syscall combinator language idea. I don't know that anyone has really gone there before.

Einstein is alleged to have said "the secret to creativity is knowing how to hide your sources." Well, I think I already gave part of it away. Obviously combinators are out there in the functional language world and using that word is very suggestive as to my inspiration. I was also tangentially inspired by the way data flow is implemented in UT's TRIPS architecture. If you squint, more similarities between the approaches may become apparent...

More Ideas For The Previous grep Example

So I mentioned in the original async syscall post that grep -l foo * could do all the open()'s, all the read()'s, and all the close()'s in parallel. Whenever an open call returns success, we can immediately send a read.

What is particularly cool about this example is there are no data ordering dependences anywhere. Instead of read, we could use a new syscall that allows the kernel to push individual blocks from a file back to the user program in any order. Granted, it is relatively rare that a user program doesn't care about the order, but it can happen, and it can greatly loosen many of the constraints inside the OS when it does. So, to review, we can create a new syscall like read except it allows the kernel to send back data read from some file back to the user program, asynchronously. See the advantage here? If the whole idea is to read the whole file, why should we have to ask for each block? That's a big waste of effort overall. In the standard synchronous syscall world each read stops the process as execution enters the kernel. All of that wasted effort disappears.

At some point the kernel might swamp a user process, so the kernel should monitor unacknowledged messages and the amount of memory they consume in order to slow things down if need be. In that way it's very much like TCP again.

Ok, so that was pretty cool. What else can improve? How about we get rid of close? That's a pretty easy one. If all the file contents have been sent to the process, it can also send one final message indicating "that's all" or "EOF" and automatically close the file.

Here's another obvious idea with a great payoff. Messages should be clumped together if the user program desires, kind of like readv/writev or Linux's concept of a socket cork. Or perhaps we can just separate message construction from delivery. Anyway, this is all irrelevant to my point.

If we add message clumping, then we can pass many messages at one time. The act of sending a message or clump of messages is the slow part, even in a message passing design where the client isn't necessarily even giving up the CPU. It still has to deal with memory permissions and copying data from userspace to kernel space and so on. So if we can send lots of messages at one time, we can minimize that slow stuff.

One final way we can enhance clumping to get the most out of it is to allow a simple combinator language, if you will. It should be fairly straightforward to be able to encode "open() this file, I don't even want to know the fd (unless it fails, in which case send me a message), because I want you to go ahead and initiate a streamed_unordered_read() on it." Granted this level of detail is kind of reaching when compared to current interfaces, as it has a lot of potential generality that lends itself to its own domain specific language, but nevertheless, I'm sure we can come up with a simple set of primitives that allow these sorts of practical message flows to be set up with a single message clump.

So, what's the final tally? One message clump. We will pass a set of independent message pairs where each message pair is like this: "open("foo") >>= stream_unordered_read()". You like that? A little Haskell operator in our syscall message language. Pretty handy, no?

Asynchronous System Calls - Inspiration and Acknowledgements

First of all, I don't claim any sort of originality for the concept; I am merely advocating it.

I just watched a several-month-old video of a Google Tech Talk by Andrew Morton and apparently there is a movement afoot by some members of the Linux kernel community who are trying to do this in Linux. Good for them! Linux already does Asynchronous I/O ("AIO") but this new effort appears to be a generalization of that. However it probably won't help much at first as a great deal of programs would need to be greatly revised or rewritten to take advantage, but that will never happen without the foundational kernel work.

My actual inspirations are the languages CSP, Erlang, NewSqueak, Alef (Plan 9), and Limbo (Inferno). The message passing approach really does make things easier and better. However it can only do good where it is actually used! If the program itself uses CSP techniques while the OS interface requires per-thread synchronicity, then you strangle the program to the extent that it wants to engage in a concurrent conversation with the OS.

Friday, October 5, 2007

All System Calls Should Be Asynchronous

I think that all system calls (hereafter syscalls) in an operating system should be asynchronous. A syscall is a call that has to be serviced by the operating system and therefore has to leave userspace and cross the privilege barrier.

From a high level, what do I mean by this? In general I mean that when I call open(), I don't necessarily want the kernel to stop my process, do a bunch of I/O and bookkeeping stuff, and then resume execution of my program with a brand new file descriptor nestled in the ABI approved return value location. What I would rather do is for my process to send a message that will cross the privilege barrier into the kernel and be handled sometime. Note that this is one of the microkernel vs. monolithic kernel things (microkernels typically use message passing) but I am not trying to engage in that debate. This is about user program/kernel interface design and its effects on user program design.

Obviously this could be done from the user's standpoint by creating a thread for every call so that when the call blocks it stops nothing else but a dummy thread context whose only purpose is to wait (i.e. to do nothing). This is stupid because it adds complexity to get around a fundamentally less useful system call design. Thread creation is definitely not cheap so it would be good to forgo that unnecessary step and let the system get on with doing what you want.

This is a fairly uneducated opinion but I think that the synchronous system call design is an artifact of the uniprocessor era. (That is, it made obvious sense up until 2005*.) In that mode of operation, it is logical to stop the executing process to do what it asks as it will surely be needed soon anyway, and the process will definitely need to be stopped in that event since the system only has one execution resource.

In this brave new world of multi-core processors, such fundamental assumptions start to break down. In my mind it is perfectly reasonable to assume that a computer should do what it is capable of, and now it's capable of servicing system calls from user processes at the same time it is executing said processes. I will try to describe a number of reasons why this is desirable.

First of all, it's just a more powerful design. You can easily emulate the old syscall behavior by making the desired syscall and then calling one of the few remaining syscalls that blocks (say, a syscall whose fundamental (and not incidental) purpose is to block) - execution should not continue until something happens. In this case we want a message to arrive reporting on the status of our completed syscall. This shows that asynchronous syscalls are at least as powerful as synchronous ones.

However, here is an example of something that synchronous syscalls cannot do. Let's say a user runs grep foo -l *. This basically says "list all the files in this directory that contain the word 'foo'." With the asynchronous design, we can do something pretty cool - we can open all the files at once. We don't actually care which one gets opened first. Synchronous syscalls can't do this. They will enforce some arbitrary ordering, which may introduce inefficiency, but fundamentally they are already much more inefficient as they can only be opened one at a time. The application developer could use the fake-asynchronous-syscall mechanism here by explicitly creating threads for each open() call. What a hassle! It also follows that the asynchronous design can execute the read and close calls in parallel also. Further interesting tweaks can be made that I will discuss in another post.

The difference here is analogous to the difference between the flow control methodologies of stop and wait and sliding window. Synchronous syscalls are the stop and wait methodology and asynchronous syscalls are like sliding window, except that in theory the window is of unbounded size. This also presents obvious gains in the world of distributed systems. If the running instance of grep has poor locality to the directory it is reading from, the synchronous syscalls will grind performance to a halt as latency will be incurred on each request. The asynchronous syscalls can work more like TCP where the amount of in-flight data is allowed to grow as long as the flow rate can be efficiently maintained. Of course each individual operation is no faster, and with increased I/O load in the system, it may be slower, but the overall throughput will greatly increase. In the words of Tony Hoare "Concurrent Programs Wait Faster".

* AMD introduced the first mass market dual-core consumer chips in 2005. Obviously IBM had already done dual-core with the Power4 and SMP machines had been around for a long time before that. However, as goes the consumer market, inevitably so does the computer industry as a whole, so I think this data point is of the utmost relevance. At the time of this writing, you can't even buy a single-core computer from Apple.

Tuesday, October 2, 2007

Seemingly Obligatory Rant Re: chroot

Recent events have prompted this post. In my previous post I suggested that chroot is a security mechanism. Apparently some Linux 2.2 maintainer guy thinks I am mistaken.

Recently Alan Cox of Linux fame ranted about how "chroot is not a security tool." Much discussion ensued at reddit and Slashdot and probably many other places.

I tend to think Alan Cox is being a jackass about the whole thing, kind of like a lexicographer spouting off about how "y'all" isn't a word. Alan Cox says "hey, it's not intended as a security thing, and since you can only do it as root, and root can do anything, it doesn't help security." Umm, sure. Well, somewhere along the way, for better or worse, people started using chroot in various attempts to improve security for individual applications. Even OpenBSD, the most widely used security-focused Unix-like OS, uses the mechanism. They use it "correctly": a daemon is setuid and uses its escalated privileges very carefully for setting up its minimum threat footprint, chrooting, and then giving up root privileges. In this case, the fact that root can get out of a chroot is irrelevant. If a program is running as root, you would expect it to be able to do anything. But we aren't running the programs as root. Get a clue, Alan.

I liken this to the difference between descriptive and prescriptive approaches. Alan is trying to prescribe to Unix users and administrators: "Hey kids, that's not a security feature you have there. Stupid." The reality is that it is used that way and it can be done non-stupidly.

Now, it would be more useful if Unix were more like Plan 9 and exposed all resources in the file namespace...

Monday, July 16, 2007

A binding device for Plan9?

I'm not an expert on Plan9 by any means. Unfortunately this means that this idea may be completely useless, but I found it interesting nevertheless.

In Unix, a standard way to restrict a process's access to the outside world so that it won't be able to do "bad" things (either intentionally, by accident, or by being played by a malicious attacker) is to use chroot. Typically you would chroot into a directory providing no more access to the outside world than completely necessary, often an empty directory designed for that very purpose (e.g. OpenBSD prefers /var/empty).

FreeBSD one-upped the concept with jail(8)/jail(2), which is basically like chroot(8) /chroot(2) with its own IP address. This provides the handy ability of sandboxing virtual server instances for untrustworthy users (i.e. customers).

The FreeBSD kernel also has a very interesting concept called securelevel where certain features of the system can be locked down for increased security (e.g. disabling /dev/mem, disallowing write access to partitions for any program other than mount, and locking the firewall configuration, at successively increasing securelevels). Once the kernel's securelevel is increased it can never be lowered. Restarting the system is the only way to regain access to these mechanisms.

I consider chroot/jail and securelevel to be very similar concepts: one for processes and one for the kernel. The process/kernel cannot (at least, it is our goal, so hopefully) increase its threat footprint after the chroot or securelevel increase.

Now for my idea.

Plan9 can already do things analogous to chroot/securelevel at a per-process level by removing binds with unmount until some minimum level of privilege is reached. However the one final piece in my mind is locking the process into those bindings forever. This finality of chroot/jail/securelevel can be achieved with one small tweak to Plan9. Rather than bind being a syscall, why not expose it as a device, say, /dev/bind? Then a process could give up its binding privilege by unmounting /dev/bind.

Bind is already a very simple syscall; you pass it two strings (paths) and some flags. This could very easily be done instead using Plan9's standard "open() and write() to a device" paradigm. Maybe I'm not seeing another angle of complexity or perhaps this idea actually has no practical value for a variety of reasons. Whatever the case may be, I would be happy to be educated by a plan9 enthusiast.

Sunday, July 15, 2007

Directory-Sensitive History

I want a feature for Unix-like OSes called Directory-Sensitive History. It's fairly simple really. Whenever I use a history-related feature in the shell, like scrolling up in my history to choose a previous command, using ! to match some past command by common prefix, or typing 'history', it should take into account what directory I am in. It is ok if I actually have to tell the shell that I want a particular directory to have directory-sensitive history. If this means that different directories beneath my home directory also have history files (like .bash_history) then so be it. At this point, I just want this feature and don't feel like being terribly picky in how exactly it's engineered (that's for later :)).

While preparing this post, I researched the concept for the first time, though I first thought of it several months ago. I was sure that I wasn't the first to think of such a thing. If you query Google for "directory sensitive history" (quoted just so) you get 2 hits. Sure enough, a fellow named Eric Eide at the University of Utah talked about it briefly in his master's thesis in 1995 and cited a paper by Greenberg and Witten from 1988 where it was determined that the technique was able to increase the certainty of what a user would do next.

Even by 1988 Unix was fairly mature (Wikipedia tells us that UNIX System V Release 4, SunOS 3.2, and 4.3BSD Tahoe were the cutting edge flavors of that era) so it stands to reason that the idea had been thought of multiple times in the two decades prior. So it is definitely not new, which should come to no surprise.

However, correct me if I'm wrong (please), but this feature is not available in any software still around, and when I say software, I mean unix shells. The shells I consider to be "still around", in order of my perception from most to least "aroundness", are bash, sh, ksh, tcsh, zsh, and csh. Please do not flame me or argue over this list; it is not a value judgment on any shell, just how I gauge the popularity contest at this moment.

Finally, a short example of why this is useful. Say you're developing some piece of software and you have to issue a series of commands just-so in order to build it. You may sleep a few times between development sessions and you would like for those commands to be the most recent items in your history when you are in that directory. I know I could save tons of time if I could partition my histories based on tasks, approximated by using specific directories.

I want it. Anyone else who feels strongly about this care to comment?