Tuesday, October 19, 2010
Computing at scale, or, how Google has warped my brain
Most of the code I'm writing is in Python, but makes heavy use of Google technologies such as MapReduce, BigTable, GFS, Sawzall, and a bunch of other things that I'm not at liberty to discuss in public. Within about a week of starting at Google, I had code running on thousands of machines all over the planet, with surprisingly little overhead.
As an academic, I have spent a lot of time thinking about and designing "large scale systems", though before coming to Google I rarely had a chance to actually work on them. At Berkeley, I worked on the 200-odd node NOW and Millennium clusters, which were great projects, but pale in comparison to the scale of the systems I use at Google every day.
A few lessons and takeaways from my experience so far...
The cloud is real. The idea that you need a physical machine close by to get any work done is completely out the window at this point. My only machine at Google is a Mac laptop (with a big honking monitor and wireless keyboard and trackpad when I am at my desk). I do all of my development work on a virtual Linux machine running in a datacenter somewhere -- I am not sure exactly where, not that it matters. I ssh into the virtual machine to do pretty much everything: edit code, fire off builds, run tests, etc. The systems I build are running in various datacenters and I rarely notice or care where they are physically located. Wide-area network latencies are low enough that this works fine for interactive use, even when I'm at home on my cable modem.
In contrast, back at Harvard, there are discussions going on about building up new resources for scientific computing, and talk of converting precious office and lab space on campus (where space is extremely scarce) into machine rooms. I find this idea fairly misdirected, given that we should be able to either leverage a third-party cloud infrastructure for most of this, or at least host the machines somewhere off-campus (where it would be cheaper to get space anyway). There is rarely a need for the users of the machines to be anywhere physically close to them anymore. Unless you really don't believe in remote management tools, the idea that we're going to displace students or faculty lab space to host machines that don't need to be on campus makes no sense to me.
The tools are surprisingly good. It is amazing how easy it is to run large parallel jobs on massive datasets when you have a simple interface like MapReduce at your disposal. Forget about complex shared-memory or message passing architectures: that stuff doesn't scale, and is so incredibly brittle anyway (think about what happens to an MPI program if one core goes offline). The other Google technologies, like GFS and BigTable, make large-scale storage essentially a non-issue for the developer. Yes, there are tradeoffs: you don't get the same guarantees as a traditional database, but on the other hand you can get something up and running in a matter of hours, rather than weeks.
Log first, ask questions later. It should come as no surprise that debugging a large parallel job running on thousands of remote processors is not easy. So, printf() is your friend. Log everything your program does, and if something seems to go wrong, scour the logs to figure it out. Disk is cheap, so better to just log everything and sort it out later if something seems to be broken. There's little hope of doing real interactive debugging in this kind of environment, and most developers don't get shell access to the machines they are running on anyway. For the same reason I am now a huge believer in unit tests -- before launching that job all over the planet, it's really nice to see all of the test lights go green.