Tuesday, October 19, 2010

Computing at scale, or, how Google has warped my brain

A number of people at Google have stickers on their laptops that read "my other computer is a data center." Having been at Google for almost four months, I realize now that my whole concept of computing has radically changed since I started working here. I now take it for granted that I'll be able to run jobs on thousands of machines, with reliable job control and sophisticated distributed storage readily available.

Most of the code I'm writing is in Python, but makes heavy use of Google technologies such as MapReduce, BigTable, GFS, Sawzall, and a bunch of other things that I'm not at liberty to discuss in public. Within about a week of starting at Google, I had code running on thousands of machines all over the planet, with surprisingly little overhead.

As an academic, I have spent a lot of time thinking about and designing "large scale systems", though before coming to Google I rarely had a chance to actually work on them. At Berkeley, I worked on the 200-odd node NOW and Millennium clusters, which were great projects, but pale in comparison to the scale of the systems I use at Google every day.

A few lessons and takeaways from my experience so far...

The cloud is real. The idea that you need a physical machine close by to get any work done is completely out the window at this point. My only machine at Google is a Mac laptop (with a big honking monitor and wireless keyboard and trackpad when I am at my desk). I do all of my development work on a virtual Linux machine running in a datacenter somewhere -- I am not sure exactly where, not that it matters. I ssh into the virtual machine to do pretty much everything: edit code, fire off builds, run tests, etc. The systems I build are running in various datacenters and I rarely notice or care where they are physically located. Wide-area network latencies are low enough that this works fine for interactive use, even when I'm at home on my cable modem.

In contrast, back at Harvard, there are discussions going on about building up new resources for scientific computing, and talk of converting precious office and lab space on campus (where space is extremely scarce) into machine rooms. I find this idea fairly misdirected, given that we should be able to either leverage a third-party cloud infrastructure for most of this, or at least host the machines somewhere off-campus (where it would be cheaper to get space anyway). There is rarely a need for the users of the machines to be anywhere physically close to them anymore. Unless you really don't believe in remote management tools, the idea that we're going to displace students or faculty lab space to host machines that don't need to be on campus makes no sense to me.

The tools are surprisingly good. It is amazing how easy it is to run large parallel jobs on massive datasets when you have a simple interface like MapReduce at your disposal. Forget about complex shared-memory or message-passing architectures: that stuff doesn't scale, and is so incredibly brittle anyway (think about what happens to an MPI program if one core goes offline). The other Google technologies, like GFS and BigTable, make large-scale storage essentially a non-issue for the developer. Yes, there are tradeoffs: you don't get the same guarantees as a traditional database, but on the other hand you can get something up and running in a matter of hours, rather than weeks.
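To make that concrete, here is a toy, single-process Python sketch of the MapReduce programming model -- a word count, the canonical example. (Purely illustrative: the real framework shards both phases across thousands of machines, and every name below is made up.)

    from collections import defaultdict

    # Map phase: emit (key, value) pairs for every input record.
    def count_words_map(document):
        for word in document.split():
            yield word, 1

    # Reduce phase: fold all the values emitted under one key.
    def count_words_reduce(word, counts):
        yield word, sum(counts)

    def run_mapreduce(inputs, mapper, reducer):
        # The "shuffle": group intermediate values by key. A real
        # framework also handles partitioning, scheduling, and
        # re-running tasks when machines fail.
        groups = defaultdict(list)
        for record in inputs:
            for key, value in mapper(record):
                groups[key].append(value)
        for key, values in sorted(groups.items()):
            for output in reducer(key, values):
                yield output

    docs = ["the cloud is real", "the tools are surprisingly good"]
    for word, count in run_mapreduce(docs, count_words_map, count_words_reduce):
        print("%s %d" % (word, count))

The point is that the developer writes only the two pure functions at the top; everything from the shuffle on down is the framework's problem.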

Log first, ask questions later. It should come as no surprise that debugging a large parallel job running on thousands of remote processors is not easy. So, printf() is your friend. Log everything your program does, and if something seems to go wrong, scour the logs to figure it out. Disk is cheap, so better to just log everything and sort it out later if something seems to be broken. There's little hope of doing real interactive debugging in this kind of environment, and most developers don't get shell access to the machines they are running on anyway. For the same reason I am now a huge believer in unit tests -- before launching that job all over the planet, it's really nice to see all of the test lights go green.
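For what it's worth, the same discipline in plain Python -- nothing Google-specific, just the standard logging and unittest modules, with a made-up function under test -- looks something like this:

    import logging
    import unittest

    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s")

    def process_record(record):
        # Log inputs and outputs unconditionally: these lines may be the
        # only evidence of what happened on a machine you can't log into.
        logging.info("processing record %r", record)
        result = record.strip().lower()
        logging.info("produced %r", result)
        return result

    class ProcessRecordTest(unittest.TestCase):
        def test_strips_whitespace_and_lowercases(self):
            self.assertEqual(process_record("  FooBar \n"), "foobar")

        def test_empty_input(self):
            self.assertEqual(process_record(""), "")

    if __name__ == "__main__":
        unittest.main()  # all lights green before the planet-wide launch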

37 comments:

  1. Sun Microsystems coined the phrase "The Network is the Computer" in the late 1990s; maybe Google will make it happen with "The Data Center is the Computer".

  2. PCs with more RAM or faster CPUs may not be required at all in the future. All complex computation will be done in data centres. Way to go!

  3. Chintan - don't forget that the machines in the datacenters need more RAM and CPU and disk and all that over time.

  4. Can you explain how you "do development on a virtual Linux machine"?

    Is the Linux machine running your IDE, which you view using VNC on the Mac? I have found that annoying.

    Or, is the IDE running natively on the Mac while you somehow move files over to the Linux machine?

    Or, is there some other tool that I don't know about?

  5. Unfortunately commercially available tools are not at the same level as the tools at Google. Our servers were moved from our lab to a state where electricity is cheap. The off-the-shelf remote control tools we have (both commercial stand-alone KVMs and blade management tools from IBM) don't play well with anything but Windows. I spent a couple of days trying to do a stock Fedora install using a CIFS-mounted virtual CD to a blade using a virtual keyboard, display, and mouse before giving up and having an on-site warm body burn a disc, pop it in the drive, and install locally. (And I work for a very large systems company--one everybody here has heard of!)

  6. Anon re "doing development on a virtual machine": I use this wonderful development environment called "vi" which works beautifully over an ssh connection :-) Some people here do run a remote desktop client which works fine too.

    Anon re "commercially available tools": I don't know the state of third-party cloud solutions. I've used blade management tools from IBM without much trouble at Harvard.

  7. Anon re "commercially available tools":
    For that stuff, there are good third-party solutions. I have no experience with blades, but if the machine can PXE-boot and you can get a virtual keyboard and display, it should be doable.
    Also, there are plenty of solutions for the non-blade case: DRAC cards for Windows machines, serial (yes, serial) kits to control Linux machines; then just configure every machine to redirect console output to the serial port, set up a PXE server, and you are set.

    We have several datacenters around the world with a setup based on PXE (both for Windows and Linux), DRAC cards, and serial redirection, and I rarely need to send someone to the machines physically (usually only when something in a machine is broken and needs repairs).

    And another possibility: on top of that, add virtualization. Nothing like being able to migrate a virtual server to another node when the original node has hardware problems.

  8. I am now waiting on stickers for USB drives that say: "My other drive is a GFS cell".

  9. Regarding datacenters: I cannot agree more. Recently I needed a Linux host for some software repositories and the like. The traditional solution at my university would be to "buy a machine". Instead, I logged into rackspacecloud, clicked a button, and for $10/month the server was there. From now on, the only hardware I buy is my laptop and screen. For the rest, I will use a datacenter.

  10. Anon re "doing development on a virtual machine":

    VNC is not designed for long-haul, wide-area connections. One option is NX, which compresses X11 traffic, significantly reduces latency, and makes long-haul remote desktop usable.

  11. Yes, NX works amazingly well over WAN in my experience, using Eclipse, etc. I have even used it while tethered to my phone's 3G connection.

  12. Well, aside from the tools, there are also a "few" SREs who make things go.

  13. Anon re: SREs. Indeed! SREs (site reliability engineers, for non-Googlers) have one of the most important jobs here. There is a six-month rotation for software engineers to learn the SRE trade that I'd love to do, but unfortunately Cambridge doesn't have a large enough SRE team to host it.

  14. Matt, if you're really serious about the 6-month SRE rotation, there's a GWS presence in Cambridge, and I think we could work something out. Email me if you're interested.

  15. I am curious -- what has changed in your views on the client-side role of operating systems?

  16. @Maulik

    “When the network becomes as fast as the processor, the computer hollows out and spreads across the network.” -Eric Schmidt, 1993

    This is all about overcoming the von Neumann bottleneck. You may be interested in reading John Backus's ACM Turing Award lecture from 1977. Backus was way ahead of his time with this.

    http://www.thocp.net/biographies/papers/backus_turingaward_lecture.pdf

  17. While the concept of everyday large-scale computing is certainly an important message, as a developer I was more intrigued by how you set up your development environment on a remote machine with a text-based interface. Very old school but, I suspect, very effective as well. My whole career as a programmer has been bound to microcomputers, and I've always done my development work locally -- for the last 15 years, using IDEs in a graphical environment. I suspect the old way gives you some freedom, though, and of course some serious computing, storage, and bandwidth at your command. No offline work, though, unless you replicate your environment in a local VM.

  18. A very practical question: if we move to a world in which it is more effective to use low-powered laptops as windows into remotely managed server farms, do we have to throw away our graphical IDEs in favor of text-mode editors (emacs and vi)?

  19. Great post! The cool part is the data center is all Linux and the desktop / laptop computer is running Linux or the Mac OS.

    http://blogs.computerworld.com/16232/good_bye_windows_hello_linux_mac_says_google

    http://content.dell.com/us/en/gen/d/large-business/google-data-center.aspx

  20. Here is a great introduction to this from Google: http://research.google.com/pubs/pub35290.html

  21. @Edward Benson, or perhaps our IDEs become web-based editors. Our source code, the compiler, and all the other compute-intensive parts live in the cloud. Look up Bespin and Ecco.

  22. The cloud and its costs don't scale well for all data sets.

    http://technews.acm.org/archives.cfm?fo=2008-05-may/may-28-2008.html&hdr=1#363435

    Admittedly, things have changed since this report came out in 2008 (such as being able to ship physical disks to Amazon for import into their storage systems), but much of it still holds true.

    Travis

  23. Google has possibilities that others don't :)

  24. Interesting post. I think that we've passed from "The network is the computer" to "The Internet is the database"...

  25. Cool post! I interned at Google over the summer, and felt the same effect. I am a fan of unit tests now as well.

  26. I liked the fact that you mentioned that unit testing should be done a priori. Can't stress that enough. In an environment like the one you've experienced, I can appreciate that it needs to be done.

  27. Did you start using unit tests after you joined Google, or did you use them before?

    Anyway, unit tests rock!

  28. Hi Matt,

    Interesting comments about debugging - unit tests are great, and the success of printf depends on (or really requires) the machines/processes being independent of each other - which may well be your case.

    But regarding the line "There's little hope of doing real interactive debugging in this kind of environment, [..]" -- take a look at this: 200,000 cores being debugged at the same time. It's not the same environment, but it shows what can be done at scale.

  29. Automated unit testing: http://www.parasoft.com/jsp/technologies/unit_testing.jsp?itemId=326

  30. Great post, and great comments too. Matt, I just wonder if you could share your .vimrc file (a trimmed-down version, I guess) so we can learn new tricks from a Vim power user. Thanks.

  31. Very good information, thank you.

  32. Hi Matt, you mention MapReduce and brittle MPI infrastructure. Aren't these two different layers? MapReduce is one parallel algorithm. Aren't there others? And MPI is one parallel implementation method underlying an algorithm such as MapReduce or matrix multiply. What is MapReduce based on, and can it be used for matrix multiply?
    Can you provide a systematic picture here?
    Thanks for the article!!

  33. The comparison between MPI and MapReduce was simply to define the two extremes in the parallel programming model space. MPI is designed for tightly-coupled systems where processors and communication rarely fail; MapReduce works on much larger, loosely-coupled systems where failures are commonplace. It's safe to say that MR scales much better than MPI, but is better suited for computations that don't require explicit coordination between processors. In MR, the data is primary; in MPI, the computation is primary. There's a vast difference between the two models.
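    To make the failure-model difference concrete, here's a toy Python sketch (purely illustrative; not real MPI or MapReduce code). Because a map task is a deterministic function of its input shard, the framework can simply rerun a failed attempt on another machine; an MPI job typically has no such recourse when a rank disappears.

    import random

    def run_with_retries(task, shard, max_attempts=5):
        # MapReduce-style fault tolerance: the task is a pure function
        # of its input shard, so a failed attempt can simply be
        # rescheduled (in real life, on a different machine).
        for attempt in range(max_attempts):
            try:
                return task(shard)
            except RuntimeError:
                print("attempt %d on %r failed; retrying" % (attempt, shard))
        raise RuntimeError("shard %r kept failing" % shard)

    def flaky_word_count(shard):
        if random.random() < 0.3:  # simulate a machine dying mid-task
            raise RuntimeError("lost machine")
        return len(shard.split())

    shards = ["the data is primary", "the computation is primary"]
    print(sum(run_with_retries(flaky_word_count, s) for s in shards))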

  34. Since you asked, here are the relevant parts of my .vimrc, which I have been using for years. I am by no means a vim wizard; I tend to stick to the pretty basic commands, and don't do fancy stuff like split windows, multiple buffers, or macros very well. If I got timewarped back to 1976 and had to edit a file on a VAX I would be very comfortable :-)

    set noshowmode            " don't show -- INSERT -- in the status line
    set ts=8                  " tab stops every 8 columns
    set tw=70                 " wrap text at 70 columns
    set noruler               " no line/column ruler
    set ai                    " autoindent new lines
    set shiftwidth=2          " indent in steps of 2 columns
    set background=dark       " assume a dark terminal background
    set wildmode=longest:full " complete longest common prefix, then cycle
    set wildmenu              " show command-line completion matches in a menu
    set hlsearch              " highlight all search matches

    " allow backspacing over everything in insert mode
    set backspace=indent,eol,start
    set copyindent " copy the previous indentation on autoindenting
    "set number " always show line numbers
    set shiftround " use multiple of shiftwidth when indenting with '<' and '>'
    set showmatch " set show matching parenthesis
    set ignorecase " ignore case when searching
    set smartcase " ignore case if search pattern is all lowercase, case-sensitive otherwise
    set smarttab " insert tabs on the start of a line according to shiftwidth, not tabstop
    set incsearch " show search matches as you type

    set history=1000 " remember more commands and search history
    set undolevels=1000 " use many muchos levels of undo
    set wildignore=*.swp,*.bak,*.pyc,*.class
    set title " change the terminal's title
    set visualbell " don't beep
    set noerrorbells " don't beep
    syntax enable
    hi SpellBad cterm=bold ctermfg=227 ctermbg=NONE
    " (NB: the <key> and <CR> names in the mappings below were stripped
    " by the blog's HTML escaping, so the actual keys are lost.)
    map :syntax enable:set spell
    map :nohlsearch:redraw:
    autocmd FileType c set cindent | set tw=0 | map ij
    autocmd FileType cs set cindent | set tw=0 | map ij
    autocmd FileType nesc set cindent | set tw=0 | map ij
    autocmd FileType cpp set cindent | set tw=0 | map ij
    autocmd FileType java set cindent | set tw=0 | map ij
    autocmd FileType tex set spell
    autocmd FileType txt set spell
    autocmd FileType perl set cindent | set tw=0 | map ij
    autocmd FileType python set cindent | set tw=0 | set shiftwidth=2 | set expandtab | set ts=2 | map ij

  35. Hi Matt, comparing MPI and MapReduce, computations would definitely benefit from a loosely coupled, fault-resilient infrastructure. Can you isolate the infrastructure from the MapReduce algorithm per se, and provide it as a service that hands out blocks of data for computation, collects the results (when workers don't crash), and repeats until all the results show up?
    We need a resilient computation infrastructure. MapReduce is only one algorithm!! Support all of the "Berkeley 7 Dwarfs"!

  36. It must really feel awesome to have that much computing capacity to write code on...

