
How can academics do research on cloud computing?

This week I'm in Napa for HotOS 2011 -- the premier workshop on operating systems. HotOS is in its 24th year -- it started as the Workshop on Workstation Operating Systems in 1987. More on HotOS in a forthcoming blog post, but for now I wanted to comment on a very lively discussion that took place during the panel session yesterday.

The panel consisted of Mendel Rosenblum from Stanford (and VMware, of course); Rebecca Isaacs from Microsoft Research; John Wilkes from Google; and Ion Stoica from Berkeley. The charge to the panel was to discuss the gap between academic research in cloud computing and the realities faced by industry. This came about in part because a bunch of cloud papers were submitted to HotOS from academic research groups. In some cases, the PC felt that the papers were trying to solve the wrong problems, or making incorrect assumptions about the state of cloud computing in the real world. We thought it would be interesting to hear from both academic and industry representatives about whether and how academic researchers can hope to do work on the cloud, given that there's no way for a university to build something at the scale and complexity of a real-world cloud platform. The concern is that academics will be relegated to working on little problems at the periphery, or will come up with toy solutions.

The big challenge, as I see it, is how to enable academics to do interesting and relevant work on the cloud when it's nearly impossible to build up the infrastructure in a university setting. John Wilkes made the point that he never wanted to see another paper submission showing a 10% performance improvement in Hadoop, and he's right -- this is not the right problem for academics to be working on. Not because a 10% improvement is not useful, or because Hadoop is a bad platform, but because those kinds of problems are already being solved by industry. In my opinion, the best role for academia is to open up new areas and look well beyond where industry is working. But this is often at odds with the desire of academics to work on "industry relevant" problems, as well as to get funding from industry. Too often I think academics fall into the trap of working on things that might as well be done at a company.

Much of the debate at HotOS centered around the industry vs. academia divide, and a fair bit of it was targeted at my previous blog posts on this topic. Timothy Roscoe argued that academia's role was to shed light on complex problems and gain understanding, not just to engineer solutions. I agree with this. Sometimes at Google, I feel that we are in such a rush to implement that we don't take the time to understand the problems deeply enough: build something that works and move on to the next problem. Of course, you have to move fast in industry. The pace is very different from academia, where a PhD student needs to spend multiple years focused on a single problem to get a dissertation written about it.

We're not there yet, but there are some efforts to open up cloud infrastructure to academic research. OpenCirrus is a testbed supported by HP, Intel, and Yahoo! with more than 10,000 cores that academics can use for systems research. Microsoft has opened up its Azure cloud platform for academic research. Only one person at HotOS raised their hand when asked if anyone was using it -- which is really unfortunate. (My theory is that academics have an allergic reaction to programming in C# and Visual Studio, which is too bad, since this is a really great platform if you can get over the toolchain.) Google is offering a billion core hours through its Exacycle program, and Amazon has a research grant program as well.

Providing infrastructure is only one part of the solution. Knowing what problems to work on is the other. Many people at HotOS bemoaned the fact that companies like Google are so secretive about what they're doing, and it's hard to learn what the "real" challenges are from the outside. My answer to this is to spend time at Google as a visiting scientist, and send your students to do internships. Even though it might not result in a publication, I can guarantee you will learn a tremendous amount about what the hard problems are in cloud computing and where the great opportunities are for academic work. (Hell, my mind was blown after my first couple of days at Google. It's like taking the red pill.)

A few things that jump to mind as ripe areas for academic research on the cloud:
  • Understanding and predicting performance at scale, with uncertain workloads and frequent node failures.
  • Managing workloads across multiple datacenters with widely varying capacity, occasional outages, and constrained inter-datacenter network links.
  • Building failure recovery mechanisms that are robust to massive correlated outages. (This is what brought down Amazon's EC2 a few weeks ago.)
  • Debugging large-scale cloud applications: tools to collect, visualize, and inspect the state of jobs running across many thousands of cores.
  • Managing dependencies in a large codebase that relies upon a wide range of distributed services like Chubby and GFS.
  • Handling both large-scale upgrades to computing capacity as well as large-scale outages seamlessly, without having to completely shut down your service and everything it depends on.
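To make the first of these areas concrete: even without a datacenter, one can build and study models of job completion time under failures. Here is a minimal Monte Carlo sketch in Python -- all the parameters (mean task time, failure probability, retry count) are made up for illustration, not measurements from any real cluster:

```python
import random

def simulate_job(num_tasks, mean_task_s=60.0, fail_prob=0.01, retries=3):
    """Estimate the makespan of a parallel job whose tasks may fail
    partway through and be retried. Parameters are illustrative."""
    finish_times = []
    for _ in range(num_tasks):
        elapsed = 0.0
        for _attempt in range(retries + 1):
            run = random.expovariate(1.0 / mean_task_s)
            if random.random() < fail_prob:
                elapsed += run * random.random()  # task dies partway through
            else:
                elapsed += run
                break
        finish_times.append(elapsed)
    # Assuming perfect parallelism, the job ends when the slowest task does.
    return max(finish_times)

random.seed(42)
samples = sorted(simulate_job(10_000) for _ in range(20))
print(f"median makespan: {samples[10]:.0f}s")
```

Even this toy model shows the makespan dominated by stragglers and retries rather than the mean task time -- exactly the kind of tail behavior that is hard to reason about analytically and is ripe for academic study.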


  1. Surely someone on the panel pointed out the difference between using a cloud infrastructure that someone has opened up, versus being able to change the architecture/implementation of that infrastructure. Certainly using Azure or EC2 lets you do some interesting things. But it's working on the underlying system that is the proper domain of "systems" research as we've understood it in the past.

    This seems to me to be something best solved by having a nationally funded research infrastructure -- a cloud that PhD students can hack. Given the pervasive and growing importance of this computing model across many companies, significant investment seems warranted. And making sure that companies don't own all the possible testbeds for innovation seems pretty fundamental from a policy point of view.

  2. David - I agree completely. This is what OpenCirrus seems to provide.

  3. Awesome post. Very timely!

  4. Sounds like a really interesting panel.

    Perhaps there is an analogy with computer architecture. Certainly you don't need a billion-dollar fab to do research on processor designs, and a lot of the reason for that is having good simulators. However, the vast majority of datacenter operators won't release even aggregate statistics of traffic patterns, traces, or workloads.

We'll never have a "cycle accurate" simulator for the datacenter--there is way more complexity in a mega-datacenter than in a processor design. But having some traces, statistics, or workloads could help academic researchers choose problems better. Hadoop isn't a good example, since you probably do need large scale to get believable results. But something like 802.1Qau-QCN doesn't necessarily need a gazillion nodes to be interesting.
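    To illustrate why even modest traces would help: a toy replay of a hypothetical (arrival, cores, runtime) trace against a fixed-capacity cluster already surfaces queueing behavior that pure intuition misses. The trace schema here is invented for the sketch -- a real released trace would dictate its own format:

```python
import heapq

def replay(trace, total_cores):
    """Replay a toy trace of (arrival_s, cores, runtime_s) jobs on a
    cluster with a fixed core count; return per-job queueing delays."""
    free = total_cores
    running = []   # min-heap of (finish_time, cores) for in-flight jobs
    delays = []
    now = 0.0
    for arrival, cores, runtime in sorted(trace):
        now = max(now, arrival)
        # If the cluster is full, wait for running jobs to finish.
        while free < cores:
            finish, c = heapq.heappop(running)
            now = max(now, finish)
            free += c
        delays.append(now - arrival)
        free -= cores
        heapq.heappush(running, (now + runtime, cores))
    return delays

trace = [(0.0, 8, 100.0), (1.0, 8, 50.0), (2.0, 8, 10.0)]
print(replay(trace, 16))  # [0.0, 0.0, 49.0]
```

    The shortest job in this toy trace waits 49 seconds behind two long ones -- a head-of-line blocking effect whose magnitude depends entirely on the workload mix, which is precisely why real traces matter.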

  5. I am thinking of something more to post but I just want to say at this time:

    Eric Brewer is going on leave at Google for 2 years, as VP Engineering, working on Cloud Computing.


  6. Sounds like a great panel, and hopefully one that will encourage more academic infrastructure. I wanted to add that if you look beyond the Googles of the world, there are a lot of interesting cloud users who are much more approachable. These are the many startups, academic teams, etc. starting to use cloud computing. They need programming models, storage systems, debugging tools, etc. just like the Googles, and furthermore, they need solutions that will work *without* an army of SREs to support them. These users are often glad to try collaboration with academics and open source software. Of course, you will not get to work on all aspects of cloud computing this way, but in a sense, these are the more exciting cloud users if you care about democratizing the ability to run large computations.

  7. George - I tend to agree that having good simulation or emulation tools would help a lot. The systems community does not tend to believe in such tools, no matter how good they are... but given the scale at which people want to work it may be necessary to get back to that mode of doing research.

    Matei - it would be great for academics to build relationships with startups and others that are willing to be more open. Most of them are using EC2 which does not make it easy to do "systems" research on how to build a cloud though.

  8. The NSF/LANL PRObE project is another example of a testbed for cloud computing infrastructure. I'm trying to get them to join OpenCirrus (for which my slogan was "PlanetLab for data centers").

  9. Hi Matt,

    For what it's worth, I was actually arguing against simulations (I am very skeptical of simulations in general).

    Rather, by providing information about workloads and more information about what actually goes on in datacenters, outside researchers will be able to select problems better. There are certainly problems that have solutions that don't require Google-sized datacenters to evaluate. As one example, QCN could prove to be quite useful in datacenters, and can be prototyped and evaluated on a few racks.

  10. Hi Matt,

    Good post, interesting stuff that I've thought about myself for several years. As a PhD student at the University of Virginia, I've looked at OpenCirrus and talked to some Intel people involved, and while I like the idea, the actual execution seems to leave a lot to be desired. My understanding (and this may have changed in the year or so since I really looked at it) is that each company/participant basically gets to decide what resources and what level of access they are willing to give researchers. If you look at the projects page, most of them are *ahem* Hadoop optimization/application projects.

    A researcher seems very unlikely to get access below the VM layer, where a lot of the really interesting cloud research is. There is certainly work to be done in how to use the cloud and how to manage failure, performance variation, and scaling resources at the application level, but to get academic research on the actual infrastructure layers, academics need access to large-scale hardware and competent tools and admins to run them. This, I think, is a tricky problem from a funding perspective. There is another project of note, OGF's FutureGrid, which aims to be a systems research platform (they use Nimbus, Eucalyptus, etc.), but OS-level access is highly questionable there as well. The fundamental problem is that no one wants to buy a bunch (like 10,000) of machines and then hand over root-level access to them to a bunch of academics.

  11. George - it would be great if Google (or others) could release public workload datasets. I have no idea if Google would go for this, but it's probably worth asking around... I'll see what I can dig up.

    Zach - this is indeed a problem, but the question is how much systems research *can* be done on a platform that provides higher-level abstractions, like Hadoop. PlanetLab has been tremendously successful even though you don't get "raw" access to the machine, although it also limits certain kinds of questions. Rather than focus on the negative I wonder how much people have tried to leverage these platforms rather than assume that they won't work.

