Monday, July 26, 2010

A Retrospective on SEDA

I keep bumping into references online to my PhD thesis work on the Staged Event-Driven Architecture, or SEDA. I thought this had been long forgotten, but I guess not. It's been about 10 years since I did the bulk of that work (the major paper was published in SOSP 2001), so I thought it would be interesting to think back on what we got right and what we got wrong. Just to summarize, SEDA is a design for highly-concurrent servers based on a hybrid of event-driven and thread-driven concurrency. The idea is to break the server logic into a series of stages connected with queues; each stage has a (small and dynamically-sized) thread pool to process incoming events, and passes events to other stages. The advantages of the design include modularity, the ability to scale to large numbers of concurrent requests, and (most importantly, I think) explicit control of overload, through the queues.

Apparently quite a few systems have been influenced by SEDA, including some major components that drive Google and Amazon. I occasionally hear war stories from folks that tried the SEDA design and abandoned it when the performance did not meet up with expectations. The events-versus-threads debate continues to rage on. See, for example, this recent post comparing the performance of Node.js and Clojure. (Who knew that people would be talking about implementing high-performance servers in JavaScript and LISP? And I thought using Java for SEDA was crazy....)

Some historical context

It's important to keep in mind that I started work on SEDA around 1999. At the time, the server landscape looked pretty different than it does now. Linux threads were suffering a lot of scalability problems, so it was best to avoid using too many of them. Multicore machines were rare. Finally, at the time nearly all papers about Web server performance focused on bulk throughput for serving static Web pages, without regard for end-to-end request latency.

These days, things are pretty different. Linux threading implementations have vastly improved. Multicores are the norm. With the rise of AJAX and "Web 2.0," request latency matters a lot more.

Before we start splitting hairs, I want to emphasize that the SEDA work is about a server architecture, not an implementation. Yes, I implemented a prototype of SEDA (called Sandstorm) in Java, but I never considered Sandstorm to be the main contribution. Unfortunately, a lot of follow-on work has compared C or C++ implementations of alternate server designs to my original Java implementation. It is really hard to draw many conclusions from this, in part because Sandstorm was heavily tuned for the particular JVM+JIT+threading+GC combination I was using at the time. (I spent an incredible amount of time trying to get gcj to be robust enough to run my code, but eventually gave up after around six months of hacking on it.) Probably the best head-to-head comparison I have seen is David Pariag et al.'s paper in EuroSys 2007, where they do a nice job of factoring out these implementation effects.

What we got wrong

In retrospect, there definitely a few things about the SEDA design that I would rethink today.

The most critical is the idea of connecting stages through event queues, with each stage having its own separate thread pool. As a request passes through the stage graph, it experiences multiple context switches, and potentially long queueing at busy stages. This can lead to poor cache behavior and greatly increase response time. Note that under reasonably heavy load, the context switch overhead is amortized across a batch of requests processed at each stage, but on a lightly (or moderately) loaded server, the worst case context switching overhead can dominate.

If I were to design SEDA today, I would decouple stages (i.e., code modules) from queues and thread pools (i.e., concurrency boundaries). Stages are still useful as a structuring primitive, but it is probably best to group multiple stages within a single "thread pool domain" where latency is critical. Most stages should be connected via direct function call. I would only put a separate thread pool and queue in front of a group of stages that have long latency or nondeterministic runtime, such as performing disk I/O. (This approach harkens back to the original Flash event-driven server design that SEDA was inspired by.) This is essentially the design we used in the Pixie operating system.

I was never completely happy with the SEDA I/O interface. My original work on Java NBIO was used as the foundation for Sandstorm's event-driven socket library. (I was also one of the members of the Java Community Process group that defined the java.nio extensions, but I preferred to use my own library since I wrote the code and understood it.) However, layering the SEDA stage abstraction on top proved to be a real pain; there are multiple threads responsible for polling for request completion, incoming sockets, and so forth, and performance is highly sensitive to the timing of these threads. I probably spent more time tuning the sockets library than any other part of the design. (It did not surprise me to learn that people trying to run Sandstorm on different JVMs and threading libraries had trouble getting the same performance: I found those parameters through trial-and-error.) The fact that SEDA never included proper nonblocking disk I/O was disappointing, but this just wasn't available at the time (and I decided, wisely, I think, not to take it on as part of my PhD.)

Of course, while Java is a great implementation language for servers, I didn't implement Sandstorm with much regards for memory efficiency, so it kind of sucks in that regard compared to leaner server implementations.

What we got right

I chose to implement SEDA using Java, in order to tie into the larger Berkeley Ninja project which was all in Java. It turned out that my Java code was beating servers implemented in C, so I saw no reason to switch languages. I still believe that had I tried to do this work in C, I would still be writing my PhD thesis today. Case in point: Rob von Behren, who did a follow-on project to SEDA, called Capriccio, in C, never finished his PhD :-) Never mind -- we both work for Google now.

The most important contribution of SEDA, I think, was the fact that we made load and resource bottlenecks explicit in the application programming model. Regardless of how one feels about threads vs. events vs. stages, I think this is an extremely important design principle for robust, well-behaved systems. SEDA accomplishes this through the event queues between stages, which allow the application to inspect, reorder, drop, or refactor requests as they flow through the service logic. Requests are never "stalled" somewhere under the covers -- say, blocking on an I/O or waiting for a thread to be scheduled. You can always get at them and see where the bottlenecks are, just by looking at the queues. I haven't seen another high performance server design that tries to do this -- they mostly focus on peak performance, not performance under overload conditions, which was my main concern. I also think that SEDA makes it easier to design services that are load aware, though I leave it as an exercise to the reader to determine how you would do it in a conventional thread or event-driven framework.

Honestly, we never took full advantage of this, and I struggled somewhat to come up with a good benchmark to demonstrate the importance of this idea. (When you're using SpecWeb99, you can't really drop or refactor Web page requests.) Benchmarks are tricky, but I think that many real-world services have the opportunity to leverage SEDA's explicit load conditioning model.

Some general comments

I'm not really working on high performance server designs anymore (although my stint at Google may or may not take me back in that direction). I'm also not up on all of the latest literature on the topic, so maybe there is a killer design out there that solves all of these problems once and for all.

One thing I learned doing this work is that one should always be skeptical of simple, "clean" benchmarks that try to demonstrate the peak or best-case performance of a given server design. My original benchmarks of SEDA involved fetching the same static 8KB web page over and over. Not surprisingly, it yields about the same performance no matter what server design you use. This benchmark hardly stresses the I/O, memory, threading, or socket layers of any system, and is more likely to highlight performance differences in the corner cases. (Believe me, I've read plenty of papers that use much dumber benchmarks than this. SpecWeb99, which we used in the SOSP paper, is only marginally better.)

It's harder to do, but I think it's important to evaluate performance in the context of a "real" application, one that involves all of the load and complexity you'd see in a real service. So I am not convinced by microbenchmarks anymore; it is like showing off a new automobile design running on a flat, even, indoor track with no wind drag, no adverse weather, no other traffic, and no need for seatbelts or airbags. Usually as soon as you load it up with realistic conditions, things start to break. Achieving good, robust performance across a wide range of loads is the real challenge.

7 comments:

  1. Great retrospective, thanks Matt!

    What's your view on the Actors model (as available in Scala)? Would you leverage that if you had to implement Sandstorm again?

    The way I see it is basically every Actor a micro SEDA stage, with its own inbox or queue in SEDA terms..


    ps: if you get any response about current research work, could you post it on here as well?

    ReplyDelete
  2. Perhaps some things have come full circle:

    http://developers.slashdot.org/story/10/07/27/1925209/Java-IO-Faster-Than-NIO

    ReplyDelete
  3. Interesting !

    This is very similar to JGroups, which originally also had protocols ('stages') connected to each other, with queues and associated threads serving those.

    That was around 1998, so around the same time you started on SEDA.

    I changed the design soon after that, abandoned queues and threads in favor of direct method calls, and fronted the whole thing with a thread pool, whose threads would take a message and send it up through the protocol stack to the application.

    I did like SEDA, and also Ninja, which IMO was much better than RMI (IIRC using reflection rather than the stupid rmic compiler to create stubs and skeletons)...
    Cheers, good work !

    ReplyDelete
  4. Thanks for the update. It's great to read followup on such influential work. I don't think a thread pool is necessarily evil, we just have very primitive scheduling options at the OS level. And in a distributed system the event queue is likely to be something like SQS or RabbitMQ, which pushes the problem elsewhere.

    BTW, if you are interested in an architecture that deals explicitly with overload conditions, you might be interested in AppBackplane - http://possibility.com/wiki/index.php?title=AppBackplane. With more sophisticated approaches we can start to utilize all the CPU in a shared work environment.

    ReplyDelete
  5. Speaking of scheduling, what do you think about Bianca Schroeder's work on Web scheduling and QoS?

    ReplyDelete
  6. Hi Matt, remember the startup in the Mountain View condo you visited and talked about SEDA.

    Well, I ended up having to have transaction snapshot nodes, TxNodes, which where connected together with SEDA Stages rather than just one collection of Stages.
    Each TxNode would have the data associated with a unit of work stored transactionally (in a DB) and would send such work units through its the next set of Stages.
    When the work unit got to the next TxNode, it would store the current snapshot of the work in the DB and, in the same transaction, notify the previous TxNode that it could clear its snapshot for the unit of work.
    A TxNode could send more than one unit of work at a time to the next TxNode.
    This was required because for our particular clients, speed was less important than not losing a unit of work (a purchase order).
    Prior to doing this, an OOM could result in a lost order.
    Afterwards, using the Java 5 OOM notification capability, a set of Stages could be notified that memory was getting low, and the Stages could abort all work, releasing the memory and avoiding OOM.
    The TxNodes would then gradually restart the aborted units of work (oldest to youngest).


    It was also true that sometimes a set of Stages required that more than on Stage be involved in the same transaction (not a snapshot transaction).
    I thus abstracted the running of a job from a SEDA Stage so that one set of Stages might use a thread pool while another set might use a single
    thread.
    Using a single thread for a set of Stages allowed having multiple Stages participate in the same transaction. (Application Servers store transaction information in thread locals, hence a single thread had to be used.)
    Whether a set of Stages used a thread pool or a single thread was a configuration option.

    In our case, ultimately business decisions rather than software design determined the fate of the application.

    Richard

    ReplyDelete
  7. It would seem reasonable to have worst case context switching at light load while getting better amortization at higher loads. That does seem to be part of the point to it.

    One item that isn't generally discussed is the comprehensibility of the programming model. How hard is it for someone to grasp and apply the principle? SEDA seems straightforward to me and an easy model to comprehend.

    The stages also make one think about what sort of processing makes sense. How do you decompose any given problem.

    Sure, on might say the same about direct method calls but I don't think that's true. Passing events from stage to stage decouples the problem domain.

    Just one example that you mention is the ability to monitor the load going from stage to stage by examining the queues. That's tougher to do with a single thread running through method calls.

    ReplyDelete