Volatile and Decentralized: May 2013

I just got back from HotOS 2013 and, frankly, it was a little depressing. Mind you, the conference was really well-organized; there were lots of great people; an amazing venue; and fine work by the program committee and chair... but I could not help being left with the feeling that the operating systems community is somewhat stuck in a rut.

It did not help that the first session was about how to make network and disk I/O faster, a topic that has been a recurring theme for as long as "systems" has existed as a field. HotOS is supposed to represent the "hot topics" in the area, but when we're still arguing about problems that are 25 years old, it starts to feel not-so-hot.

Of the 27 papers presented at the workshop, only about 2 or 3 would qualify as bold, unconventional, or truly novel research directions. The rest were basically extended abstracts of conference submissions that are either already in preparation or will be submitted in the next year or so. This is a perennial problem for HotOS, and when I chaired it in 2011 we had the same problem. So I can't fault the program committee on this one -- they have to work with the submissions they get, and often the "best" and most polished submissions represent the most mature (and hence less speculative) work. (Still, this year there was no equivalent to Dave Ackley's paper in 2011 which challenged us to "pledge allegiance to the light cone.")

This got me thinking about what research areas I wish the systems research community would spend more time on. I wrote a similar blog post after attending HotMobile 2013, so it's only fair that I would subject the systems community to the same treatment. A few ideas...

Obligatory diisclaimer: Everything in this post is my personal opinion and does not represent the view of my employer.

An escape from configuration hell: A lot of research effort is focused on better techniques for finding and mitigating software bugs. In my experience at Google, the vast majority of production failures arise not due to bugs in the software, but bugs in the (often vast and incredibly complex) configuration settings that control the software. A canonical example is when someone bungles an edit to a config file which gets rolled out to the fleet, and causes jobs to start behaving in new and often not-desirable ways. The software is working exactly as intended, but the bad configuration is leading it to do the wrong thing.

This is a really hard problem. A typical Google-scale system involves many interacting jobs running very different software packages each with their own different mechanisms for runtime configuration: whether they be command-line flags, some kind of special-purpose configuration file (often in a totally custom ASCII format of some kind), or a fancy dynamically updated key-value store. The configurations are often operating at very different levels of abstraction --- everything from deciding where to route network packets, to Thai and Slovak translations of UI strings seen by users. "Bad configurations" are not just obvious things like syntax errors; they also include unexpected interactions between software components when a new (perfectly valid) configuration is used.

There are of course tools for testing configurations, catching problems and rapidly rolling back bad changes, etc. but a tremendous amount of developer and operational energy goes into fixing problems arising due to bad configurations. This seems like a ripe area for research.

Understanding interactions in a large, production system: The common definition of a "distributed system" assumes that the interactions between the individual components of the system are fairly well-defined, and dictated largely by whatever messaging protocol is used (cf., two phase commit, Paxos, etc.) In reality, the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.

Let me give a concrete example. Recently we encountered a problem where a bunch of jobs in one datacenter started crashing due to running out of file descriptors. Since this roughly coincided with a push of a new software version, we assumed that there must have been some leak in the new code, so we rolled back to the old version -- but the crash kept happening. We couldn't just take down the crashing jobs and let the traffic flow to another datacenter, since we were worried that the increased load would trigger the same bug elsewhere, leading to a cascading failure. The engineer on call spent many, many hours trying different things and trying to isolate the problem, without success. Eventually we learned that another team had changed the configuration of their system which was leading to many more socket connections being made to our system, which put the jobs over the default file descriptor limit (which had never been triggered before). The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.

Somehow there needs to be a way to perform offline analysis and testing of large, complex systems so that we can catch these kinds of problems before they crop up in production. Of course we have extensive testing infrastructure, but the "hard" problems always come up when running in a real production environment, with real traffic and real resource constraints. Even integration tests and canarying are a joke compared to how complex production-scale systems are. I wish I had a way to take a complete snapshot of a production system and run it in an isolated environment -- at scale! -- to determine the impact of a proposed change. Doing so on real hardware would be cost-prohibitive (even at Google), so how do you do this in a virtual or simulated setting?

I'll admit that these are not easy problems for academics to work on. Unless you have access to a real production system, it's unlikely you'll encounter this problem in an academic setting. Doing internships at companies is a great way to get exposure to this kind of thing. Replicating this problem in an academic environment may be difficult.

Pushing the envelope on new computing platforms: I also wish the systems community would come back to working on novel and unconventional computing platforms. The work on sensor networks in the 2000's really challenged our assumptions about the capabilities and constraints of a computer system, and forced us down some interesting paths in terms of OS, language, and network protocol design. In doing these kinds of explorations, we learn a lot about how "conventional" OS concepts map (or don't map) onto the new platform, and the new techniques can often find a home in a more traditional setting: witness how the ideas from Click have influenced all kinds of systems unrelated to its original goals.

I think it is inevitable that in our lifetimes we will have a wearable computing platform that is "truly embedded": either with a neural interface, or with something almost as good (e.g. seamless speech input and visual output in a light and almost-invisible form factor). I wore my Google Glass to HotOS, which stirred up a lot of discussions around privacy issues, what the "killer apps" are, what abstractions the OS should support, and so forth. I would call Google Glass an early example of the kind of wearable platform that may well replace smartphones, tablets, and laptops as the personal computing interface of choice in the future. If that is true, then now is the time for the academic systems community to start working out how we're going to support such a platform. There are vast issues around privacy, energy management, data storage, application design, algorithms for vision and speech recognition, and much more that come up in this setting.

These are all juicy and perfectly valid research problems for the systems community -- if only it is bold enough to work on them.

Wednesday, May 15, 2013

What I wish systems researchers would work on

Startup Life: Three Months In

Search This Blog