Tuesday, October 19, 2010

Computing at scale, or, how Google has warped my brain

A number of people at Google have stickers on their laptops that read "my other computer is a data center." Having been at Google for almost four months, I realize now that my whole concept of computing has radically changed since I started working here. I now take it for granted that I'll be able to run jobs on thousands of machines, with reliable job control and sophisticated distributed storage readily available.

Most of the code I'm writing is in Python, but makes heavy use of Google technologies such as MapReduce, BigTable, GFS, Sawzall, and a bunch of other things that I'm not at liberty to discuss in public. Within about a week of starting at Google, I had code running on thousands of machines all over the planet, with surprisingly little overhead.

As an academic, I have spent a lot of time thinking about and designing "large-scale systems", though before coming to Google I rarely had a chance to actually work on them. At Berkeley, I worked on the 200-odd node NOW and Millennium clusters, which were great projects, but pale in comparison to the scale of the systems I use at Google every day.

A few lessons and takeaways from my experience so far...

The cloud is real. The idea that you need a physical machine close by to get any work done is completely out the window at this point. My only machine at Google is a Mac laptop (with a big honking monitor and wireless keyboard and trackpad when I am at my desk). I do all of my development work on a virtual Linux machine running in a datacenter somewhere -- I am not sure exactly where, not that it matters. I ssh into the virtual machine to do pretty much everything: edit code, fire off builds, run tests, etc. The systems I build are running in various datacenters and I rarely notice or care where they are physically located. Wide-area network latencies are low enough that this works fine for interactive use, even when I'm at home on my cable modem.

In contrast, back at Harvard, there are discussions going on about building up new resources for scientific computing, and talk of converting precious office and lab space on campus (where space is extremely scarce) into machine rooms. I find this idea fairly misdirected, given that we should be able to either leverage a third-party cloud infrastructure for most of this, or at least host the machines somewhere off-campus (where it would be cheaper to get space anyway). There is rarely a need for the users of the machines to be anywhere physically close to them anymore. Unless you really don't believe in remote management tools, the idea that we're going to displace students or faculty lab space to host machines that don't need to be on campus makes no sense to me.

The tools are surprisingly good. It is amazing how easy it is to run large parallel jobs on massive datasets when you have a simple interface like MapReduce at your disposal. Forget about complex shared-memory or message passing architectures: that stuff doesn't scale, and is so incredibly brittle anyway (think about what happens to an MPI program if one core goes offline). The other Google technologies, like GFS and BigTable, make large-scale storage essentially a non-issue for the developer. Yes, there are tradeoffs: you don't get the same guarantees as a traditional database, but on the other hand you can get something up and running in a matter of hours, rather than weeks.
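To make the programming model concrete, here is a toy word count in plain Python. This is only a sketch of the MapReduce interface, not Google's actual API: the real framework handles partitioning, shuffling, and worker failures, and the developer writes little more than the two functions below.

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)

def run_local(docs):
    """Toy single-process driver standing in for the real framework:
    it runs the map phase, shuffles values by key, then reduces."""
    shuffled = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in mapper(doc_id, text):
            shuffled[key].append(value)
    return dict(kv for key, values in sorted(shuffled.items())
                for kv in reducer(key, values))

counts = run_local({"d1": "the cloud is the computer", "d2": "the cloud"})
```

The point is that nothing in `mapper` or `reducer` knows how many machines exist; scaling from one laptop to thousands of workers is the framework's problem, not the programmer's.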

Log first, ask questions later. It should come as no surprise that debugging a large parallel job running on thousands of remote processors is not easy. So, printf() is your friend. Log everything your program does, and if something seems to go wrong, scour the logs to figure it out. Disk is cheap, so better to just log everything and sort it out later if something seems to be broken. There's little hope of doing real interactive debugging in this kind of environment, and most developers don't get shell access to the machines they are running on anyway. For the same reason I am now a huge believer in unit tests -- before launching that job all over the planet, it's really nice to see all of the test lights go green.
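A minimal sketch of the combination described above -- log every transformation, and gate the launch on unit tests going green -- using only the standard library (the function and logger names are mine, purely for illustration):

```python
import logging
import unittest

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def normalize(word):
    # Log every transformation: on a thousand remote machines, these
    # log lines are often the only record of what the program did.
    cleaned = word.strip().lower()
    log.info("normalize(%r) -> %r", word, cleaned)
    return cleaned

class NormalizeTest(unittest.TestCase):
    # The "test lights" to check before launching a planet-wide job.
    def test_strips_and_lowercases(self):
        self.assertEqual(normalize("  Cloud\n"), "cloud")

    def test_empty_input(self):
        self.assertEqual(normalize(""), "")

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeTest))
```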


  1. Sun Microsystems coined the slogan "The Network is the Computer." Maybe Google will make it happen with "The Data Center is the Computer."

  2. PCs with more RAM or faster CPUs may not be required at all in the future. All complex computation will be done in data centers. Way to go!

  3. Chintan - don't forget that the machines in the datacenters need more RAM and CPU and disk and all that over time.

  4. Can you explain how you "do development on a virtual Linux machine"?

    Is the Linux machine running your IDE, which you are viewing over VNC on the Mac? I have found that annoying.

    Or, is the IDE running natively on the Mac and you are somehow moving files over to the Linux machine?

    Or, is there some other tool that I don't know about?

  5. Unfortunately, commercially available tools are not at the same level as the tools at Google. Our servers were moved from our lab to a state where electricity is cheap. The off-the-shelf remote-control tools we have (both commercial stand-alone KVMs and blade management tools from IBM) don't play well with anything but Windows. I spent a couple of days trying to do a stock Fedora install to a blade using a CIFS-mounted virtual CD and a virtual keyboard, display, and mouse before giving up and having an on-site warm body burn a disc, pop it in the drive, and install locally. (And I work for a very large systems company -- one everybody here has heard of!)

  6. Anon re "doing development on a virtual machine": I use this wonderful development environment called "vi" which works beautifully over an ssh connection :-) Some people here do run a remote desktop client which works fine too.

    Anon re "commercially available tools": I don't know the state of third-party cloud solutions. I've used blade management tools from IBM without much trouble at Harvard.

  7. Anon re "commercially available tools":
    For that stuff, there are good third-party solutions. I have no experience with blades, but if the machine can PXE-boot and you can get a virtual keyboard and display, it should be doable.
    There are also plenty of solutions for the non-blade case: DRAC cards for Windows machines, and serial (yes, serial) kits to control Linux machines -- just configure every machine to redirect output to the serial port, set up a PXE server, and you are set.

    We have several datacenters around the world with a setup based on PXE (for both Windows and Linux), DRAC cards, and serial redirection, and I rarely need to send someone to the machines physically (usually only when something in a machine is broken and needs repairs).

    And another possibility: on top of that, add virtualization. Nothing like being able to migrate a virtual server to another node when the original node has hardware problems.

  8. I am now waiting on stickers for USB drives that say: "My other drive is a GFS cell".

  9. Regarding datacenters: I cannot agree more. Recently I needed a Linux host for some software repositories and the like. The traditional solution at my university would be to "buy a machine." Instead, I logged into Rackspace Cloud, clicked a button, and for $10/month the server was there. From now on, the only hardware I buy is my laptop and screen; for everything else, I will use a datacenter.

  10. Anon re "doing development on a virtual machine"

    VNC is not designed for long-haul wide-area connections. An option is NX, which compresses VNC traffic, significantly reduces latency, and makes long-haul remote desktop usable.

  11. Yes, NX works amazingly well over WAN in my experience, using Eclipse, etc. I have even used it while tethered to my phone's 3G connection.

  12. Well, aside from tools, there are also a "few" SREs who make things go.

  13. Anon re: SREs. Indeed! SREs (site reliability engineers, for non-Googlers) have one of the most important jobs here. There is a six-month rotation for software engineers to learn the SRE trade that I'd love to do, but unfortunately Cambridge doesn't have a large enough SRE team to host it.

  14. Matt, if you're really serious about the 6-month SRE rotation, there's a GWS presence in Cambridge, and I think we could work something out. Email me if you're interested.

  15. I am curious -- what has changed in your views on the client-side role of operating systems?

  16. @Maulik

    “When the network becomes as fast as the processor, the computer hollows out and spreads across the network.” -- Eric Schmidt, 1993

    This is all about overcoming the von Neumann bottleneck. You may be interested in reading John Backus's 1977 ACM Turing Award paper -- Backus was way ahead of his time.


  17. While the concept of daily large-scale computing is certainly an important message, as a developer I was more intrigued by how you set up your development environment on a remote machine with a text-based interface. Very old school but, I suspect, very effective as well. My whole career as a programmer has been bound to microcomputers; I've always done my development work locally and, for the last 15 years, in graphical IDEs. I suspect the old way gives you some freedom, though, and of course some serious computing, storage, and bandwidth at your command. No offline work, though, unless you replicate your environment in a local VM.

  18. A very practical question: if we move to a world in which it is more effective for us to use low-powered laptops as windows into remotely managed server farms, do we have to throw away our graphical IDEs in favor of text-mode editors (emacs and vi)?

  19. Great post! The cool part is the data center is all Linux and the desktop / laptop computer is running Linux or the Mac OS.



  20. Here is a great introduction to this from Google: http://research.google.com/pubs/pub35290.html

  21. @Edward Benson, or perhaps our IDEs become web-based editors. Our source code, the compiler, and all the other compute-intensive parts live in the cloud. Look up Bespin and Ecco.

  22. The cloud and its costs don't scale well for all data sets.

    Admittedly, things have changed since this report came out in 2008 (such as being able to ship physical disks to Amazon for import into their storage systems), but much of it still holds true.


  23. Google has the ability to do this in a way that others don't :)

  24. Interesting post. I think that we've passed from "The network is the computer" to "The Internet is the database"...

  25. Cool post! I interned at Google over the summer, and felt the same effect. I am a fan of unit tests now as well.

  26. I liked the fact that you mentioned that unit testing should be done a priori. Can't stress that enough. In an environment like the one you've experienced, I can appreciate that it needs to be done.

  27. Did you start using unit tests after joining Google, or did you use them before?

    Anyway, unit tests rock!

  28. Hi Matt,

    Interesting comments about debugging -- unit tests are great, and the success of printf depends on (or requires) the independence of the machines/processes from each other, which may be your case.

    But regarding the line "There's little hope of doing real interactive debugging in this kind of environment, [..]" -- take a look at 200,000 cores being debugged at the same time. It's not the same environment, but it shows what can be done at scale.

  29. Automated Unit Testing : http://www.parasoft.com/jsp/technologies/unit_testing.jsp?itemId=326

  30. Great post. Great comments alike. Matt, I just wonder if you could share your .vimrc file (a trimmed down version I guess) to learn new tricks from a Vim power user. Thanks.

  31. Very good information, thank you.

  32. Hi Matt, you mention MapReduce and brittle MPI infrastructure. Aren't these two different layers? MapReduce is one parallel algorithm. Aren't there others? And MPI is one parallel implementation method underlying an algorithm such as MapReduce or matrix multiply. What is MapReduce based on, and can it be used for matrix multiply?
    Can you provide a systematic picture here?
    Thanks for the article!!

  33. The comparison between MPI and MapReduce was simply to define the two extremes in the parallel programming model space. MPI is designed for tightly-coupled systems where processors and communication rarely fail; MapReduce works on much larger, loosely-coupled systems where failures are commonplace. It's safe to say that MR scales much better than MPI, but is better suited for computations that don't require explicit coordination between processors. In MR, the data is primary; in MPI, the computation is primary. There's a vast difference between the two models.
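    To make the contrast concrete, here is a toy sketch in plain Python (not any real framework) of the scheduling trick that lets a MapReduce-style master shrug off failures: because each task is independent and side-effect free, a crashed task can simply be rerun on another machine, whereas an MPI job with tightly coupled processes has no such option.

    ```python
    import random

    random.seed(0)  # fixed seed so the simulated failures are reproducible

    def flaky_map_task(shard, fail_rate=0.3):
        # Stand-in for a map task on a remote worker that sometimes dies.
        if random.random() < fail_rate:
            raise RuntimeError(f"worker running shard {shard} died")
        return sum(shard)  # placeholder for the real per-shard computation

    def run_with_retries(shards, max_attempts=5):
        # The "master": reschedule each failed task instead of aborting the job.
        results = {}
        for i, shard in enumerate(shards):
            for _ in range(max_attempts):
                try:
                    results[i] = flaky_map_task(shard)
                    break
                except RuntimeError:
                    continue  # pretend to reschedule on another machine
            else:
                raise RuntimeError(f"shard {i} failed {max_attempts} times")
        return [results[i] for i in range(len(shards))]

    totals = run_with_retries([[1, 2], [3, 4], [5]])
    ```

    Retrying is only safe because the tasks are pure functions of their input shard -- exactly the property the MapReduce model enforces and MPI does not.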

  34. Since you asked, here are the relevant parts of my .vimrc, which I have been using for years. I am by no means a vim wizard; I tend to stick to the pretty basic commands, and don't do fancy stuff like split windows, multiple buffers, or macros very well. If I got timewarped back to 1976 and had to edit a file on a VAX I would be very comfortable :-)

    set noshowmode
    set ts=8
    set tw=70
    set noruler
    set ai
    set shiftwidth=2
    set background=dark
    set wildmode=longest:full
    set wildmenu
    set hlsearch

    " allow backspacing over everything in insert mode
    set backspace=indent,eol,start
    set copyindent " copy the previous indentation on autoindenting
    "set number " always show line numbers
    set shiftround " use multiple of shiftwidth when indenting with '<' and '>'
    set showmatch " set show matching parenthesis
    set ignorecase " ignore case when searching
    set smartcase " ignore case if search pattern is all lowercase, case-sensitive otherwise
    set smarttab " insert tabs on the start of a line according to shiftwidth, not tabstop
    set incsearch " show search matches as you type

    set history=1000 " remember more commands and search history
    set undolevels=1000 " use many muchos levels of undo
    set wildignore=*.swp,*.bak,*.pyc,*.class
    set title " change the terminal's title
    set visualbell " don't beep
    set noerrorbells " don't beep
    syntax enable
    hi SpellBad cterm=bold ctermfg=227 ctermbg=NONE
    map :syntax enable:set spell
    map :nohlsearch:redraw:
    autocmd FileType c set cindent | set tw=0 | map ij
    autocmd FileType cs set cindent | set tw=0 | map ij
    autocmd FileType nesc set cindent | set tw=0 | map ij
    autocmd FileType cpp set cindent | set tw=0 | map ij
    autocmd FileType java set cindent | set tw=0 | map ij
    autocmd FileType tex set spell
    autocmd FileType txt set spell
    autocmd FileType perl set cindent | set tw=0 | map ij
    autocmd FileType python set cindent | set tw=0 | set shiftwidth=2 | set expandtab | set ts=2 | map ij

  35. Hi Matt, comparing MPI and MapReduce, computations would definitely benefit from a loosely coupled, fault-resilient infrastructure. Can you isolate the infrastructure away from the MapReduce algorithm per se, and provide it in terms of passing out blocks of data for computation, returning results (if not crashed), and repeating until all the results show up?
    We need a resilient computation infrastructure. MapReduce is only one algorithm!! Support all of the "Berkeley 7 Dwarfs"!

  36. It must feel really awesome to have that much computing capacity to write code on.

