Tuesday, May 10, 2011

How can academics do research on cloud computing?

This week I'm in Napa for HotOS 2011 -- the premier workshop on operating systems. HotOS is in its 24th year -- it started as the Workshop on Workstation Operating Systems in 1987. More on HotOS in a forthcoming blog post, but for now I wanted to comment on a very lively discussion that took place during the panel session yesterday.

The panel consisted of Mendel Rosenblum from Stanford (and VMware, of course); Rebecca Isaacs from Microsoft Research; John Wilkes from Google; and Ion Stoica from Berkeley. The charge to the panel was to discuss the gap between academic research in cloud computing and the realities faced by industry. This came about in part because a bunch of cloud papers were submitted to HotOS from academic research groups. In some cases, the PC felt that the papers were trying to solve the wrong problems, or making incorrect assumptions about the state of cloud computing in the real world. We thought it would be interesting to hear from both academic and industry representatives about whether and how academic researchers can hope to do work on the cloud, given that there's no way for a university to build something at the scale and complexity of a real-world cloud platform. The concern is that academics will be relegated to working on little problems at the periphery, or come up with toy solutions.

The big challenge, as I see it, is how to enable academics to do interesting and relevant work on the cloud when it's nearly impossible to build up the infrastructure in a university setting. John Wilkes made the point that he never wanted to see another paper submission showing a 10% performance improvement in Hadoop, and he's right -- this is not the right problem for academics to be working on. Not because a 10% improvement is not useful, or because Hadoop is a bad platform, but because those kinds of problems are already being solved by industry. In my opinion, the best role for academia is to open up new areas and look well beyond where industry is working. But this is often at odds with the desire for academics to work on "industry relevant" problems, as well as to get funding from industry. Too often I think academics fall into the trap of working on things that might as well be done at a company.

Much of the debate at HotOS centered around the industry vs. academic divide, and a fair bit of it was targeted at my previous blog posts on this topic. Timothy Roscoe argued that academia's role was to shed light on complex problems and gain understanding, not just to engineer solutions. I agree with this. Sometimes at Google, I feel that we are in such a rush to implement that we don't take the time to understand the problems deeply enough: build something that works and move on to the next problem. Of course, you have to move fast in industry. The pace is very different than academia, where a PhD student needs to spend multiple years focused on a single problem to get a dissertation written about it.

We're not there yet, but there are some efforts to open up cloud infrastructure to academic research. OpenCirrus is a testbed supported by HP, Intel, and Yahoo! with more than 10,000 cores that academics can use for systems research. Microsoft has opened up its Azure cloud platform for academic research. Only one person at HotOS raised their hand when asked if anyone was using this -- this is really unfortunate. (My theory is that academics have an allergic reaction to programming in C# and Visual Studio, which is too bad, since this is a really great platform if you can get over the toolchain.) Google is offering a billion core hours through its Exacycle program, and Amazon has a research grant program as well.

Providing infrastructure is only one part of the solution. Knowing what problems to work on is the other. Many people at HotOS bemoaned the fact that companies like Google are so secretive about what they're doing, and it's hard to learn what the "real" challenges are from the outside. My answer to this is to spend time at Google as a visiting scientist, and send your students to do internships. Even though it might not result in a publication, I can guarantee you will learn a tremendous amount about what the hard problems are in cloud computing and where the great opportunities are for academic work. (Hell, my mind was blown after my first couple of days at Google. It's like taking the red pill.)

A few things that jump to mind as ripe areas for academic research on the cloud:
  • Understanding and predicting performance at scale, with uncertain workloads and frequent node failures.
  • Managing workloads across multiple datacenters with widely varying capacity, occasional outages, and constrained inter-datacenter network links.
  • Building failure recovery mechanisms that are robust to massive correlated outages. (This is what brought down Amazon's EC2 a few weeks ago.)
  • Debugging large-scale cloud applications: tools to collect, visualize, and inspect the state of jobs running across many thousands of cores.
  • Managing dependencies in a large codebase that relies upon a wide range of distributed services like Chubby and GFS.
  • Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on.

11 comments:

  1. Surely someone on the panel pointed out the difference between using a cloud infrastructure that someone has opened up, versus being able to change the architecture/implementation of that infrastructure. Certainly using Azure or EC2 lets you do some interesting things. But it's working on the underlying system that is the proper domain of "systems" research as we've understood it in the past.

    This seems to me to be something best solved by having a nationally funded research infrastructure -- a cloud that PhD students can hack. Given the pervasive and growing importance of this computing model across many companies, significant investment seems warranted. And making sure that companies don't own all the possible testbeds for innovation seems pretty fundamental from a policy point of view.

  2. David - I agree completely. This is what OpenCirrus seems to provide.

  3. Awesome post. Very timely!

  4. Sounds like a really interesting panel.

    Perhaps there is an analogy with computer architecture. Certainly you don't need a billion-dollar fab to do research on processor designs, and a lot of the reason for that is having good simulators. However, the vast majority of datacenter operators won't release even aggregate statistics of traffic patterns, traces, or workloads.

    We'll never have a "cycle accurate" simulator for the datacenter--there is way more complexity in a mega-datacenter than in a processor design. But having some traces, statistics, or workloads could help academic researchers choose problems better. Hadoop isn't a good example, since you probably do need large scale to get believable results. But something like 802.1Qau-QCN doesn't necessarily need a gazillion nodes to be interesting.

  5. I am thinking of something more to post but I just want to say at this time:

    Eric Brewer is going on leave at Google for 2 years, as VP Engineering, working on Cloud Computing.

    !!!!!!

  6. Sounds like a great panel, and hopefully one that will encourage more academic infrastructure. I wanted to add that if you look beyond the Googles of the world, there are a lot of interesting cloud users who are much more approachable. These are the many startups, academic teams, etc starting to use cloud computing (see for example http://aws.amazon.com/solutions/case-studies). They need programming models, storage systems, debugging tools, etc just like the Googles, and furthermore, they need solutions that will work *without* an army of SREs to support them. These users are often glad to try collaboration with academics and open source software. Of course, you will not get to work on all aspects of cloud computing this way, but in a sense, these are the more exciting cloud users if you care about democratizing the ability to run large computations.

  7. George - I tend to agree that having good simulation or emulation tools would help a lot. The systems community does not tend to believe in such tools, no matter how good they are... but given the scale at which people want to work it may be necessary to get back to that mode of doing research.

    Matei - it would be great for academics to build relationships with startups and others that are willing to be more open. Most of them are using EC2 which does not make it easy to do "systems" research on how to build a cloud though.

  8. The NSF/LANL PRoBE project is another example of a testbed for cloud computing infrastructure. I'm trying to get them to join OpenCirrus (for which my slogan was "PlanetLab for data centers".)

  9. Hi Matt,

    For what it's worth, I was actually arguing against simulations (I am very skeptical of simulations in general).

    Rather, by providing information about workloads and more information about what actually goes on in datacenters, outside researchers will be able to select problems better. There are certainly problems that have solutions that don't require Google-sized datacenters to evaluate. As one example, QCN could prove to be quite useful in datacenters, and can be prototyped and evaluated on a few racks.

  10. Hi Matt,

    Good post, interesting stuff that I've thought about myself for several years. As a PhD student at the University of Virginia, I've looked at OpenCirrus and talked to some Intel people involved, and while I like the idea, the actual execution seems to leave a lot to be desired. My understanding (and this may have changed in the year or so since I really looked at it) is that each company/participant basically gets to decide what resources and what level of access they are willing to give researchers. If you look at the projects page, most of them are *ahem* Hadoop optimization/application projects.

    A researcher seems very unlikely to get access below the VM layer, where a lot of the really interesting cloud research is. There is certainly work to be done in how to use the cloud and how to manage failure, performance variation, and scaling resources at the application level, but to get academic research on the actual infrastructure layers, academics need access to large-scale hardware and competent tools and admins to run them. This, I think, is a tricky problem from a funding perspective. There is another project of note, OGF's FutureGrid (https://portal.futuregrid.org/), that aims to be a systems research platform (they use Nimbus, Eucalyptus, etc), but OS-level access is highly questionable there as well. The fundamental problem is that no one wants to buy a bunch (like 10,000) of machines and then hand over root-level access to them to a bunch of academics.

  11. George - it would be great if Google (or others) could release public workload datasets. I have no idea if Google would go for this, but it's probably worth asking around... I'll see what I can dig up.

    Zach - this is indeed a problem, but the question is how much systems research *can* be done on a platform that provides higher-level abstractions, like Hadoop. PlanetLab has been tremendously successful even though you don't get "raw" access to the machine, although it also limits certain kinds of questions. Rather than focus on the negative I wonder how much people have tried to leverage these platforms rather than assume that they won't work.

