Volatile and Decentralized: July 2010

Friday, July 30, 2010

Proposal: Abolish faculty offices

Posited: faculty offices are detrimental to the advancement of scientific knowledge.

At Google, everyone sits out in the open at clusters of desks (not cubicles, God no). It looks a little something like this:

(This appears to be a picture from Google's Kirkland, WA office, but we have a similar setup in Cambridge.)

Today I swung by Harvard to my big, empty office, which looks like this:

Of course, it's an awesome office, one of the most spacious that I've seen in an academic CS building. You could easily pack eight grad students in there, sitting on top of a large pile of undergrads.

I got to thinking. In most academic settings, faculty are isolated in their own separate offices -- isolated from one another, from the students, from the rest of the world. This can't possibly be good for cross-fertilization of ideas. Although I leave my office door open whenever I'm there, people hardly ever drop by -- I guess I am pretty intimidating. (Or maybe it's my ferocious guard dog that I bring with me to work.)

Of course, having my own office is great for meetings, but there are plenty of places I could hold meetings instead. And it's nice to have a place for all of my books and journals, but really, shouldn't those be in a communal library anyway? And I guess the office is nice for when I want to shut out the world and try to concentrate, but that's nothing a pair of noise-canceling headphones can't fix.

So here's the idea -- let's get rid of faculty offices. Get everyone sitting together in open-floorplan space, interacting, communicating, innovating. Just like startups. Why not? This is the model that the Berkeley RADLab uses. All of the faculty sit together in an open space. Here's a picture of Randy Katz at his desk in the lab, surrounded by British war paraphernalia:

Doesn't he look happy? (You can read more about the RADLab design philosophy here.)

To be honest, when I started at Google I was pretty concerned about the lack of an office. I was sure that I would be unable to concentrate sitting out in the open, and would get annoyed at all of the distractions and bodily odors of the people around me. On the contrary, I've found that it actually helps my productivity to be in an active space with other people hacking away around me. Also, the noise level is rarely an issue. People are generally respectful and it's a little like working in a coffee shop.

When I get back to Harvard, I think I'll move into the lab with my grad students. (I can hear the groaning now.)

Monday, July 26, 2010

A Retrospective on SEDA

I keep bumping into references online to my PhD thesis work on the Staged Event-Driven Architecture, or SEDA. I thought this had been long forgotten, but I guess not. It's been about 10 years since I did the bulk of that work (the major paper was published in SOSP 2001), so I thought it would be interesting to think back on what we got right and what we got wrong. Just to summarize, SEDA is a design for highly concurrent servers based on a hybrid of event-driven and thread-driven concurrency. The idea is to break the server logic into a series of stages connected with queues; each stage has a (small and dynamically sized) thread pool to process incoming events, and passes events to other stages. The advantages of the design include modularity, the ability to scale to large numbers of concurrent requests, and (most importantly, I think) explicit control of overload, through the queues.
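To make the stage-and-queue structure concrete, here is a minimal sketch in Python. It is not Sandstorm's actual Java API: the `Stage` class, its parameters, and the two-stage parse/respond pipeline are all hypothetical. Each stage owns an event queue and a small pool of worker threads, and hands its output to the next stage by enqueueing it. The pool here is fixed-size, whereas SEDA sized its pools dynamically.

```python
import queue
import threading

class Stage:
    """Hypothetical SEDA-style stage: an incoming event queue plus a small,
    fixed-size pool of worker threads (real SEDA resized pools dynamically)."""

    def __init__(self, name, handler, num_threads=2, next_stage=None):
        self.name = name
        self.handler = handler          # handler(event) -> event for next stage, or None
        self.next_stage = next_stage    # downstream Stage, if any
        self.queue = queue.Queue()      # unbounded here; a real stage would bound and monitor it
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(num_threads)]

    def start(self):
        for worker in self.workers:
            worker.start()

    def enqueue(self, event):
        self.queue.put(event)

    def _run(self):
        while True:
            event = self.queue.get()
            try:
                out = self.handler(event)
                if out is not None and self.next_stage is not None:
                    # Passing an event downstream is a queue handoff, not a call.
                    self.next_stage.enqueue(out)
            finally:
                self.queue.task_done()

# Usage: a two-stage pipeline, parse -> respond.
results = []
results_lock = threading.Lock()

def respond(path):
    with results_lock:
        results.append("200 OK " + path)

respond_stage = Stage("respond", respond)
parse_stage = Stage("parse", lambda raw: raw.strip().upper(), next_stage=respond_stage)
respond_stage.start()
parse_stage.start()

for raw in [" /a ", " /b ", " /c "]:
    parse_stage.enqueue(raw)
parse_stage.queue.join()    # every parsed event has been handed downstream
respond_stage.queue.join()  # every response has been produced
print(sorted(results))
```

Every request in this sketch crosses one queue and one thread handoff per stage, which is exactly the property revisited below under "What we got wrong."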

Apparently quite a few systems have been influenced by SEDA, including some major components that drive Google and Amazon. I occasionally hear war stories from folks who tried the SEDA design and abandoned it when the performance did not live up to expectations. The events-versus-threads debate continues to rage on. See, for example, this recent post comparing the performance of Node.js and Clojure. (Who knew that people would be talking about implementing high-performance servers in JavaScript and LISP? And I thought using Java for SEDA was crazy....)

Some historical context

It's important to keep in mind that I started work on SEDA around 1999. At the time, the server landscape looked pretty different than it does now. Linux threads were suffering a lot of scalability problems, so it was best to avoid using too many of them. Multicore machines were rare. Finally, at the time nearly all papers about Web server performance focused on bulk throughput for serving static Web pages, without regard for end-to-end request latency.

These days, things are pretty different. Linux threading implementations have vastly improved. Multicores are the norm. With the rise of AJAX and "Web 2.0," request latency matters a lot more.

Before we start splitting hairs, I want to emphasize that the SEDA work is about a server architecture, not an implementation. Yes, I implemented a prototype of SEDA (called Sandstorm) in Java, but I never considered Sandstorm to be the main contribution. Unfortunately, a lot of follow-on work has compared C or C++ implementations of alternate server designs to my original Java implementation. It is really hard to draw many conclusions from this, in part because Sandstorm was heavily tuned for the particular JVM+JIT+threading+GC combination I was using at the time. (I spent an incredible amount of time trying to get gcj to be robust enough to run my code, but eventually gave up after around six months of hacking on it.) Probably the best head-to-head comparison I have seen is David Pariag et al.'s paper in EuroSys 2007, where they do a nice job of factoring out these implementation effects.

What we got wrong

In retrospect, there are definitely a few things about the SEDA design that I would rethink today.

The most critical is the idea of connecting stages through event queues, with each stage having its own separate thread pool. As a request passes through the stage graph, it experiences multiple context switches, and potentially long queueing at busy stages. This can lead to poor cache behavior and greatly increase response time. Note that under reasonably heavy load, the context switch overhead is amortized across a batch of requests processed at each stage, but on a lightly (or moderately) loaded server, the worst case context switching overhead can dominate.
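A back-of-the-envelope model shows why load matters here. Assume (a simplification, not a measurement) that every stage boundary costs one context switch per worker wakeup, and that a wakeup drains whatever batch of events has piled up in the queue. The per-request switching cost then shrinks with batch size, and batches are only large when the server is busy. The stage count and switch cost below are made-up illustrative numbers.

```python
def switch_overhead_per_request(num_stages, switch_cost_us, batch_size):
    """Simplified model (an assumption): one context switch per stage per
    wakeup, shared by the `batch_size` events drained on that wakeup."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return num_stages * switch_cost_us / batch_size

# Hypothetical numbers: 5 stages, 5 microseconds per switch.
light = switch_overhead_per_request(5, 5.0, batch_size=1)    # one event per wakeup
heavy = switch_overhead_per_request(5, 5.0, batch_size=50)   # deep queues under load
print(light, heavy)  # 25.0 0.5
```

Under this model a lightly loaded server pays the full switching cost on every request, which is the worst case described above.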

If I were to design SEDA today, I would decouple stages (i.e., code modules) from queues and thread pools (i.e., concurrency boundaries). Stages are still useful as a structuring primitive, but it is probably best to group multiple stages within a single "thread pool domain" where latency is critical. Most stages should be connected via direct function call. I would only put a separate thread pool and queue in front of a group of stages that have long latency or nondeterministic runtime, such as performing disk I/O. (This approach harkens back to the original Flash event-driven server design that SEDA was inspired by.) This is essentially the design we used in the Pixie operating system.
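A rough sketch of that decoupled structure, again in Python with hypothetical names throughout: the cheap, deterministic stages (`parse`, `route`) are plain functions composed by direct call on the caller's thread, and only the long-latency step crosses a concurrency boundary, via the standard library's `concurrent.futures.ThreadPoolExecutor`. The "disk" is an in-memory dict standing in for real file I/O.

```python
from concurrent.futures import ThreadPoolExecutor

# Cheap, deterministic stages: still separate modules, but joined by direct calls.
def parse(raw):
    return raw.strip()

def route(path):
    return path.lstrip("/") or "index.html"

# Hypothetical stand-in for a file system; a real server would do disk I/O here.
FAKE_DISK = {"index.html": b"<html>home</html>", "about.html": b"<html>about</html>"}

def read_file(name):
    return FAKE_DISK.get(name, b"404")

# Only the long-latency group gets its own thread pool (and hence its own queue).
io_pool = ThreadPoolExecutor(max_workers=4)

def handle(raw):
    # parse -> route runs inline with no queue hop or context switch;
    # the blocking read is the single concurrency boundary.
    name = route(parse(raw))
    return io_pool.submit(read_file, name)

futures = [handle(r) for r in [" / ", " /about.html ", " /missing "]]
bodies = [f.result() for f in futures]
io_pool.shutdown()
print(bodies)
```

The module boundaries survive as function boundaries, so the code stays structured as stages, while a request pays for at most one thread handoff instead of one per stage.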

I was never completely happy with the SEDA I/O interface. My original work on Java NBIO was used as the foundation for Sandstorm's event-driven socket library. (I was also one of the members of the Java Community Process group that defined the java.nio extensions, but I preferred to use my own library since I wrote the code and understood it.) However, layering the SEDA stage abstraction on top proved to be a real pain; there are multiple threads responsible for polling for request completion, incoming sockets, and so forth, and performance is highly sensitive to the timing of these threads. I probably spent more time tuning the sockets library than any other part of the design. (It did not surprise me to learn that people trying to run Sandstorm on different JVMs and threading libraries had trouble getting the same performance: I found those parameters through trial and error.) The fact that SEDA never included proper nonblocking disk I/O was disappointing, but this just wasn't available at the time (and I decided, wisely, I think, not to take it on as part of my PhD).

Of course, while Java is a great implementation language for servers, I didn't implement Sandstorm with much regard for memory efficiency, so it kind of sucks on that front compared to leaner server implementations.

What we got right

I chose to implement SEDA using Java in order to tie into the larger Berkeley Ninja project, which was all in Java. It turned out that my Java code was beating servers implemented in C, so I saw no reason to switch languages. I still believe that had I tried to do this work in C, I would still be writing my PhD thesis today. Case in point: Rob von Behren, who did a follow-on project to SEDA in C called Capriccio, never finished his PhD :-) Never mind -- we both work for Google now.

The most important contribution of SEDA, I think, was the fact that we made load and resource bottlenecks explicit in the application programming model. Regardless of how one feels about threads vs. events vs. stages, I think this is an extremely important design principle for robust, well-behaved systems. SEDA accomplishes this through the event queues between stages, which allow the application to inspect, reorder, drop, or refactor requests as they flow through the service logic. Requests are never "stalled" somewhere under the covers -- say, blocking on an I/O or waiting for a thread to be scheduled. You can always get at them and see where the bottlenecks are, just by looking at the queues. I haven't seen another high-performance server design that tries to do this -- they mostly focus on peak performance, not performance under overload conditions, which was my main concern. I also think that SEDA makes it easier to design services that are load aware, though I leave it as an exercise for the reader to work out how you would do it in a conventional thread or event-driven framework.
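Here is a small illustration of what "explicit" buys you. This is a hypothetical sketch, not SEDA's real queue interface: `ConditionedQueue`, its `max_depth` threshold, and the integer priorities are assumptions. Because the queue is an ordinary object in application code, the service can check its depth, serve urgent requests first, and refuse new work once a threshold is reached.

```python
import heapq
import itertools

class ConditionedQueue:
    """Hypothetical sketch of SEDA-style explicit load conditioning: the
    stage's queue is visible to application code, which can inspect its
    depth, reorder requests by priority, and shed load."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._heap = []                # entries: (priority, seq, request); lower = more urgent
        self._seq = itertools.count()  # tie-breaker: FIFO order within one priority
        self.dropped = []              # requests rejected at admission time

    def __len__(self):
        return len(self._heap)

    def offer(self, request, priority):
        """Admit the request, or reject it if the queue is already full.
        Returns True if the request was admitted."""
        if len(self._heap) >= self.max_depth:
            self.dropped.append(request)  # e.g., send a cheap "server busy" reply
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), request))
        return True

    def take(self):
        """Remove and return the most urgent queued request."""
        return heapq.heappop(self._heap)[2]

q = ConditionedQueue(max_depth=3)
for request, priority in [("bulk-report", 5), ("checkout", 0), ("search", 1), ("thumbnail", 5)]:
    q.offer(request, priority)
print(len(q), q.dropped)  # 3 ['thumbnail']
served = [q.take() for _ in range(len(q))]
print(served)  # ['checkout', 'search', 'bulk-report']
```

Rejecting at the door is the simplest possible policy; a stage could just as well evict a lower-priority request or degrade its response. Either way, the decision is made in application code instead of being buried in a kernel socket buffer or a thread scheduler.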

Honestly, we never took full advantage of this, and I struggled somewhat to come up with a good benchmark to demonstrate the importance of this idea. (When you're using SpecWeb99, you can't really drop or refactor Web page requests.) Benchmarks are tricky, but I think that many real-world services have the opportunity to leverage SEDA's explicit load conditioning model.

Some general comments

I'm not really working on high performance server designs anymore (although my stint at Google may or may not take me back in that direction). I'm also not up on all of the latest literature on the topic, so maybe there is a killer design out there that solves all of these problems once and for all.

One thing I learned doing this work is that one should always be skeptical of simple, "clean" benchmarks that try to demonstrate the peak or best-case performance of a given server design. My original benchmarks of SEDA involved fetching the same static 8KB web page over and over. Not surprisingly, it yields about the same performance no matter what server design you use. This benchmark hardly stresses the I/O, memory, threading, or socket layers of any system, while real performance differences are more likely to show up in the corner cases. (Believe me, I've read plenty of papers that use much dumber benchmarks than this. SpecWeb99, which we used in the SOSP paper, is only marginally better.)

It's harder to do, but I think it's important to evaluate performance in the context of a "real" application, one that involves all of the load and complexity you'd see in a real service. So I am not convinced by microbenchmarks anymore; it is like showing off a new automobile design running on a flat, even, indoor track with no wind drag, no adverse weather, no other traffic, and no need for seatbelts or airbags. Usually as soon as you load it up with realistic conditions, things start to break. Achieving good, robust performance across a wide range of loads is the real challenge.

Thursday, July 22, 2010

Fatherhood and professorhood

My little boy, Sidney, turned a year old this past week. I've been reflecting a lot lately on how much my life has changed since having a baby. I've also met a bunch of junior faculty members who ask what it's like trying to juggle being a prof with being a parent. To be sure, I was pretty worried that it would be really hard to balance my work responsibilities with having a kid. At first I screwed up royally, but now I've found a good balance and it really works. Best of all, I love being a dad -- it has been worth all the sleepless nights, cleaning up barf and poop, and learning how to steer a spoonful of beets into an unwilling mouth.

Of course, being a dad is totally different than being a mom, and I can't imagine how much harder it must be for women in academia who want to have children. My wife is also in an academic career. When Sidney was first born, she took 3 months off of work, but this was hard for both of us -- for her, because she never got a break from taking care of the baby during the day, and for me, since I wasn't doing a good job at balancing my job with being a new dad. Fortunately, Sidney was born about a week after I submitted my tenure case materials, so I could relax a little, but being a prof still involves a lot of day-to-day stress.

My biggest mistake was not taking teaching relief as soon as the baby was born. I was slated to teach one of our intro courses, which had around 80 students, so it would have been a real problem had the course not been covered that term. I figured since I had taught the class a couple of times before it would be easy -- I planned to lean heavily on the teaching assistants and mostly waltz in twice a week to give lectures I had already prepared. What I didn't account for is that with so many students there is always a fire to put out somewhere -- a student who needs special attention, allegations of cheating, TAs dropping the ball -- so you are still "on call" even if the lectures and assignments have been prepared well in advance. The biggest stressor was having to teach on days without having had any sleep the night before. In retrospect, trying to teach that term was a huge mistake, and I should have put my own sanity before the department teaching schedule.

Since then, things have improved greatly, and I am so happy and proud to be a dad. The thing that nobody tells you is that newborn babies aren't much fun. They can't yet smile, laugh, control their extremities, see more than 6 inches away, or do much of anything except eat, sleep, cry, and poop. Once they hit 10 or 12 weeks things really take a big turn, and now that Sidney is a year old he is a total hoot. He just started walking last week and it's the funniest thing in the world to watch.

The biggest change in my life is that I can no longer work in the evenings and on the weekends. When I'm home, I'm daddy, and finding time to sit down at the laptop to get anything done is pretty hard. After Sidney's 8pm bedtime I can get some things done, but by that time, my two priorities are having a nice cocktail and getting a good night's sleep. (By the way, I am a big fan of the Ferber method for helping babies learn to sleep on their own. Greg Morrisett described the technique to me as "exponential backoff." We did this with Sidney when he was 4 months old, and since then he has consistently slept from 8pm to 6am almost every night. It works.)

On the flip side, when I'm in the office, I am very focused on getting work done, since I know I can't work as well in the evenings. So rather than put off things until after dinner, I try to knock them off during the day. As a result I'm a lot more productive and less scattered. I feel like a total slacker leaving the Google office at 5pm sharp every day, but I have to get home to meet the nanny. That's life. Now that Sidney is a little older we've been taking him out to restaurants and happy hour -- there's nothing like feeding the baby a bottle while nursing a nice cold beer of my own. So life is good. Professors can also be parents. I just can't wait to start teaching Sidney C++.

Friday, July 16, 2010

The Amazing Undergrads of Summer

I've often said that one of the best things about being at Harvard is the students. The undergrads in particular are really out-of-this-world, needle-goes-to-eleven, scary smart. (There's no way I would have ever managed to get into Harvard as an undergrad.) I also love getting undergrads involved in research, and have had some great experiences. Some of my former students have gone off to do PhDs at Berkeley, Stanford, and MIT, off to medical and business school, and off to work at Microsoft, Amazon, and Google. Others have started little companies, like Facebook. I'm really proud of the students that have passed through my research lab and my classes.

But the batch of undergrads I'm working with this summer is off the charts in so many ways. I'm so excited about what they're doing I feel like I have to brag a little.

Most of them are working on the RoboBees project. In case you haven't heard, this project is Sean Hannity's #1 waste of government stimulus money, and our goal is to build a colony of micro-sized robotic bees. We have a bunch of undergrads involved this summer on a wide range of projects. Over the last two weeks, I asked them each to give 5-10 minute updates to the group on their status, and expected most of them to say that they hadn't done very much yet. I was totally blown away that each and every one of them had already done more than we expected them to get done in the entire summer. They are kicking total ass. In no particular order:

Matt Chartier '12 is studying the use of RoboBees for exploring underground environments, like mines and collapsed buildings. He's put together a very nice subterranean environment generator for our RoboBees simulator -- it generates very realistic mine tunnels and shafts -- and is looking at different sensors for detecting warm bodies in an underground setting.

Diana Cai '13 has developed a 3D visualization environment for the simulator by hooking up Java3D and the JBullet physics engine. This thing is sweet -- we can watch the simulated bees fly through the environment, pan the view angle, and change a bunch of other parameters. This is going to be one of the most important tools for this project and Diana has knocked it out of the park. Check out the movie below for an example of it in action.

Lucia Mocz '13 is developing a simulation of the optical flow sensors that we plan to use on the RoboBees platform. Optical flow will allow the RoboBees to navigate, avoid obstacles, and more, and now we can explore how effectively these sensors enable closed-loop control. The last time I talked with Lucia she was geeking out on tuning the gains for her PID control loop for hover control. Keep in mind she just finished her freshman year at Harvard -- I didn't even know what a PID control loop was until grad school!

Peter Bailis '11 is cranking on getting our micro-helicopter testbed up and running, writing TinyOS code to control the sensors, motors, and servos, and making it possible to control a swarm of helis via a Python API from a PC. He's also working on the new distributed OS that we're developing for RoboBees (all very top secret stuff!). Here's a little video of one of our helis taking off and landing (and not crashing) using Peter's code. Today -- one helicopter taking off. Tomorrow -- world domination:

Rose Cao '11 is exploring the use of harmonic radar for tracking RoboBees in the field. The idea is to outfit each bee with a lightweight transponder that reflects radar at a specific frequency which we can detect. Of course, we also need to worry about disambiguating multiple bees, which could be done by controlling their flight patterns. Rose also gave the funniest and most creative PowerPoint presentation I've seen in a long time!

Neena Kamath '11 and Noah Olsman '12 (a student at USC, here on the RoboBees REU program) are working with Radhika Nagpal on algorithms for crop pollination, exploring a wide range of themes including random walks, Levy flight patterns, adaptive gradients, and energy awareness. This stuff is super fun and highlights the power of a large colony of RoboBees working together.

Finally, a shout-out to my non-RoboBee student, Thomas Buckley '12, who is working on integrating our Mercury wearable sensor platform with LabView to make it possible for non-experts to program and tune the sensor network in different clinical settings. No more hacking nesC code just to change the sampling parameters!

All of these great students are supported by the National Science Foundation, National Instruments, and Harvard's PRISE program for undergraduate research. Thanks to all of them for their support!

Wednesday, July 7, 2010

The subtle art of managing a research group

One thing that you rarely learn before starting a faculty job is how much work goes into managing a research group. During my pre-tenure days, this meant squeezing the most productivity out of my students and making sure they were hitting on all cylinders. Now that I have tenure, my role is more like a bodhisattva -- simply to make sure that my students (and postdocs and research staff) are successful in whatever they do. Of course, productivity and success have a high degree of overlap, but they are not exactly the same thing.

There are many subtle things that one needs to know to make sure that a research group is functioning properly. A lot of it has to do with making sure that the personalities mesh well. For a while, I tried to get all of my students to work together on One Big Project. We would have these big group meetings and write design docs, but over time it became clear to me that it just wasn't working. It finally dawned on me that a couple of the students in my group (including one who had developed most of the core code that I wanted everyone else to rely on) were not that interested in working with other people -- they were far better off doing their own thing. I've also had students who work fantastically in a team setting, sometimes to a fault. Those students are really good at doing service work and helping others, when they really should be more selfish and push their own agenda first. In general it's good to have a mix of personalities with different strengths in the group. If everyone is gunning to be head honcho, it isn't going to work.

Of course, most junior faculty go into the job with zero management training or skills. One's experience in grad school no doubt has a big influence on their approach to running a research group. My advisor was David Culler, who is known to be extremely hands-off with his students (though he can do this amazing Jedi mind trick -- to get his students to do his bidding -- that I have never quite mastered). I took after David, though I find that I am much happier hacking alongside the students, rather than only discussing things at the whiteboard. I also see lots of junior faculty who live in the lab with their students and have their hands all over their code, so there are many different paths to enlightenment.

All along I wished I had more management experience or training. Early on, I was given a copy of Robert Boice's Advice for New Faculty Members, and frankly found it to be fairly useless. It is unnecessarily cryptic: the very first section is entitled "Rationale for a Nihil Nimus (Moderate) Approach to Teaching" -- I am still not sure what the hell that means, but it certainly wasn't any help for someone starting up a big research lab.

It turns out that MIT Professional Education runs a short course on Leadership Skills for Engineering and Science Faculty. (I signed up for this a few years ago but they canceled the course due to low enrollment! I certainly hope to take it one day.) Another useful resource is the Harvard Business Review Paperback Series, which is a collection of (very short and readable) books on management topics, some of which are germane to science faculty running a lab. For example, the book on motivating people gets into the various ways of getting your "employees" (a.k.a. students) to be productive, and talks all about the pros and cons of the carrot versus the stick. Synopsis: If you can get inside the head of an unmotivated student and figure out what they want, you can motivate them to do anything. This must be the key behind Culler's Jedi mind trick.

Sunday, July 4, 2010

First week at Google

I started work at Google this week, and did orientation at the mothership in Mountain View. It was an awesome experience, and I had more fun than I have had in years. I certainly learned a hell of a lot. A bunch of "Nooglers" -- more than 100! -- were starting the same week, including Amin Vahdat, who is taking a sabbatical there as well. I've been asked a lot what I will be working on at Google. I can't provide details, but I'm a software engineer doing networking-related projects out of Google's Boston office. I won't be doing "research"; I'll be building and deploying real systems. I'm very excited.

Clearly, I haven't been there long enough to have any informed opinions on the place, but first impressions are important -- so here goes.

First, it should be no surprise that I'm blown away by the scale of the problems that Google is working on and the resources they bring to bear on those problems. Before last week, the largest number of machines I'd ever used at once was a couple of hundred; on my fourth day at Google I was running jobs on two orders of magnitude more. It is a humbling experience.
Having worked on and thought about "big systems" for so many years, being able to work on a real big system is an amazing experience. Doing an internship at Google should be mandatory for students who want to do research in this area.

The place is very young and energetic. There are few people over 40 wandering the halls. I was also impressed with the fraction of women engineers -- much higher than I was expecting. Everyone that I have met so far is incredibly smart, and the overall culture is focused on getting shit done, with a minimum of bureaucracy.

Orientation was a little chaotic. The very first presentation was how to use the videoconference system -- this did not seem like the right place to start. Of course, there is so much to learn that they have no choice but to throw you in the deep end of the pool and point you at a bunch of resources for getting up to speed on Google's massive infrastructure.

Google is famous for having a "bottom up" approach to engineering. Development is driven by short projects, typically staffed by a few engineers over a timeframe of 3-12 months. Rather than a manager or VP handing down requirements, anyone can start a new project and seed it in their 20% time. If the project gains momentum it can be officially allocated engineering resources, and generally the tech lead needs to recruit other engineers to work on it. (GMail started this way.) Inevitably, there is some degree of overlap and competition between projects, but this seems like a good thing since it rewards execution and follow-through.

Figuring out what the heck is going on can be pretty challenging. Fortunately, Google has an internal search engine covering every project, every document, and every line of code within the company, which helps tremendously. Internally, the corporate culture is very open, and with few exceptions, every engineer has access to everything going on within the company.

I hope that I will be able to continue blogging about my Google experience -- their blog policy is pretty reasonable, though I won't be able to share technical details. But from now on I need to include the following disclaimer:

This is my personal blog. The views expressed here are mine alone and not those of my employer.