Friday, February 28, 2014

Taking the "Hot" out of "Hot Topics" workshops

I just got back from HotMobile 2014 (for which I was the general chair). HotMobile is the mobile systems community's "hot topics" workshop, held annually as a forum for (according to the Call for Papers) "position papers containing highly original ideas" and which "propose new directions of research" or "advocate non-traditional approaches". It's a small workshop (we had about 95 people this year) and the paper submissions are short -- 6 pages, rather than the regular 14.

The HotMobile'14 poster and demo session.
Look how happy those mobile systems researchers are!
Overall, the workshop was great -- lots of good discussions, good talks, interesting ideas. And yet, every time I attend one of these "hot topics" workshops, I end up feeling that the papers fall well short of this lofty goal. This is not limited to the mobile community -- the HotOS community has a similar problem as well.

This has bugged me for a long time, since it often feels as though there is no venue for doing "out of the box" work that is intended to look out five or ten years -- rather than just things that are incremental but not yet ready for publication in a major conference like SOSP or MobiSys. I also have fond memories of HotOS in the late 1990s in which it felt as though many of the papers were there to shake up the status quo and put forward a strong position.

What I've now come to realize is that there is tremendous value in having a small workshop for preliminary (and often incremental) results. The community obviously feels that such a venue is useful, despite its lack of "hotness" -- we had a record number of attendees this year, and (I believe) a near-record number of submissions.

And after all, the main reasons to attend any workshop are the discussions and networking -- not the papers.

The problem is that we insist on calling this a "hot topics" workshop and pretend that it's about far-out ideas that could not be published elsewhere. Instead, I think we should be honest that HotMobile (and HotOS, HotNets, etc.) are really for three kinds of papers:
  1. Preliminary work on a new project which is not yet ready for a major conference. Getting early feedback on a new project is often very useful to researchers, so they know if they are barking up the right trees.

    An example of this from this year is the CMU paper on QuiltView, which proposes allowing users to pose real-time queries ("How is the weather down at the beach in Santa Barbara?") and get back real-time video snippets (from users wearing Google Glass!) in reply. This work is nowhere near mature enough for a full conference, and I hope the authors gained something from the paper reviews and discussion at the workshop to shape their future direction.
  2. An incremental, and possibly vestigial, step towards the next major conference paper on a topic. Many such papers are simply not big enough ideas for a full conference paper, but make a nice "short paper" for the sake of getting some idea out there.

    One example from this year is this paper on the dangers of public IPs for LTE devices. This isn't something that's going to turn into a longer, meatier paper later on, but is probably worth reporting.
  3. The odd wacky paper that falls under the "hot topics" rubric. These are increasingly rare. About the only example from this year is this Duke paper on adding smart capabilities to children's toys with smartphones -- but the idea is not that radical.
Last year at SOSP, there was a one-day workshop called TRIOS ("Timely Results in Operating Systems") which was an informal venue for preliminary work -- exactly to provide an outlet for papers in the first two categories above. At least TRIOS was honest about its intent, so nobody attending could be disappointed that the papers weren't "hot" enough.

So, my humble proposal is to rename the workshop "ColdMobile" and, just to be cheeky, hold it at a ski resort in the winter.

Thursday, January 30, 2014

Getting a job at Google for PhD Students

I happen to sit on one of the hiring committees at Google, which looks at interview packets and makes a recommendation about whether we should extend an offer or not. So I've read a lot of packets, and have seen some of the ways in which applicants succeed or fail to get an offer. Ph.D. students, in particular, tend to get tripped up by the Google interview process, so I thought I'd offer some advice.

While I can't be certain, I imagine this same advice would apply to other companies which have a similar interview process that focuses on coding and algorithms.

(Disclaimer: This is all my personal opinion, and nothing I'm saying here is sanctioned or recommended by Google in any way. In fact, it might be totally wrong. Take it with a grain of salt.)

Google's interview process

Google uses a fairly typical industry interview process: Candidates go through one or two phone screens (or possibly an on-campus interview), and if they do well they are brought on campus for a full interview loop. Each interview is an hour and consists largely of problem solving and coding on the whiteboard. Sometimes a laptop is used.

This same process is used for all software engineering positions, regardless of level: undergrads, PhD students, and seasoned industry candidates all get the same style of interview. I had to go through this interview process upon joining Google as a professor. PhD-level candidates will generally spend one interview slot discussing their thesis work, and the questions may be more "researchy", but by and large it's the same for everyone.

The problem

Ph.D. students often tend to do worse on coding interviews than, say, bachelor's or master's level candidates. Why? Doing a Ph.D. simply does not train you in professional software development skills, and that is (primarily) what a Google interview tests for. Undergrads, paradoxically, often do better because (a) they may have done internships at companies writing code, and (b) they have practiced for this style of interview in the past.

There is a widespread belief that doing a Ph.D. somehow elevates you above the need to demonstrate fundamental algorithms and coding skills. Having a Ph.D. from Berkeley is awesome, but you still gotta be able to write good, clean code.

Also, part of the long process of doing a Ph.D. means you get hyper-specialized, so you get farther away from the "basics". Many of the Google interview questions touch on topics you probably first encountered (and mastered) as a sophomore or junior in college. I don't know about you, but I never dealt with binary search trees or graph connectivity problems directly during my Ph.D. and subsequent years as a faculty member. (Then again I'm just a systems guy, so the most sophisticated data structure I ever deal with is a hash table.)

Why the basics matter

Being at Google means writing production-quality software. We don't have "research labs" where people primarily build prototypes or write papers. I have written about Google's hybrid research model elsewhere -- also see this CACM article for more. While there are exceptions, by and large being at Google means being on a product team building and launching real products. That is even true of the more far-flung projects like self-driving cars and high-altitude Internet balloons. The quality and professionalism of the code you develop matters a great deal.

Doing a Ph.D. generally trains you for building research prototypes. There is a vast difference between this and writing production-quality code. First of all, it's not good enough for the code to make sense -- or be maintainable -- only by you or a small number of collaborators. Adherence to good design, avoiding overcomplicated code, conforming to style guidelines, etc. are all super important. In addition, you have to really concern yourself with robustness, scalability, testability, and performance. Corner cases that aren't interesting for publishing your next paper can't be overlooked.

Most of these skills can only be developed by working with a professional software development team. Research and class projects don't give you a chance to develop these skills. Undergrads gain these skills largely through internships. Unfortunately, most PhD students do internships at research labs, which may or may not provide much opportunity to build production-quality software.

Advice for grad students

If you're interviewing at Google, bone up on your basic algorithms and data structures. Go dust off that sophomore-level textbook and try to page it back in. I also highly recommend the book Cracking the Coding Interview, which gives the best description I have seen of Google-style interviews - it was written by a former Googler.

Don't go in with the attitude that you're above all this. Roll up your sleeves and show them what you've got. I know it may feel silly being asked what seem like basic CS questions, but if you're really as good as you think you are, you should knock them out of the park. (Keep in mind that the questions get harder the better you are doing, so no matter what, you will probably feel like crap at the end of the day.)

Every line of code you write on the whiteboard will get written up as part of the interview packet. Make it squeaky clean. Initialize variables. Use semicolons. Don't forget your constructors. Although writing sloppy pseudocode to get your meaning across might seem adequate (after all, we're all professionals here, aren't we?), attention to detail matters. Code in C++ or Java, which shows maturity. If you can only code in Python or bash shell, you're going to have trouble. If you make the slightest suggestion of wanting to code in Haskell or Lisp, the interviewer will push a hidden button which opens a trap door, dropping you into a bottomless pit. (Just kidding.)

Never, ever suggest you are a "C++ expert", either on your resume or in person. You are not.

Unfortunately, Google interviews tend to be a bit one-sided and you will not have as much opportunity to learn about Google (and what projects you might be working on) as you would like. If you do get an offer, you'll have more opportunities to come back and ask those questions. Google is notoriously secretive, so you have to trust me that there are plenty of cool things to work on.

Finally: Remember that the content of the interview has nothing to do with the kind of projects you would work on here. You're not going to get hired by Google and be asked to implement depth-first-search or reverse a linked list -- trust me on that. I'm pretty sure we have library routines for those already.

Tuesday, January 21, 2014

Your Field Guide to Industrial Research Labs

There are a lot of different kinds of industrial research organizations out there. Identifying them can be tricky, so I've compiled this field guide to help you out.

The Patent Factory Research Lab

This is the classic model of research lab, and the main model that existed when I was a grad student in the late 1990s. Many of these labs no longer exist, or have transformed into one of the models below. Generally attached to a big company, this style of research lab primarily exists to bolster the parent company's patent portfolio. A secondary mission is to somehow inform the long-term product roadmap for the parent company, which may or may not be successful, depending on whether the research lab is located 50 miles or a mere 15 miles away from any buildings in which actual product teams work.

How you know you're visiting this style of lab: The main decorations in researchers' offices are the little paperweights they get for every 20 patents they file.

The Academic Department inside of a Company Research Lab

This model is somewhat rare but it does exist, and a couple of companies have done a superb job building up a lab full of people who would really like to have been professors but who really don't like teaching or getting too close to undergraduates. This style of research lab focuses on cranking out paper after paper after paper and padding the ranks of every program committee in sight with its own members. Product impact is usually limited to demos, or the occasional lucky project which gets taken in by a product team and then ripped to shreds until it no longer resembles the original research in any way.

How you know you're visiting this style of lab: It feels just like grad school, except everyone gets their own office, and there are a lot more Windows desktops than you would normally expect to see.

The Why Are We Still Here, Let's Hope The CEO Doesn't Notice Research Lab

This type of research lab exists only because the C-level executives have either misplaced it or forgotten it exists. Researchers here are experts in flying under the radar, steering clear of anything that might generate the slightest amount of media coverage lest they blow their cover. When asked what they are working on, they generally mumble something about "the cloud" which grants them another two-year reprieve until another VP-level review comes around, at which time everyone scrambles to put together demos and PowerPoint decks to look like they've been busy.

How you know you're visiting this style of lab: Nobody has the slightest idea what's happening in the actual research community, and the project titles sound auto-generated.

The We-Could-Tell-You-But-We'd-Have-To-Kill-You Research Lab

This type of lab deals exclusively in classified defense contracts. These labs all have innocuous-sounding names which evoke the Cold War and bygone days when it was acceptable, and even encouraged, to smoke a pipe while working in the lab. Projects are done under contract from some branch of the military and generally involve satellites, nuclear warheads, lasers, or some combination of the above. On the plus side, this is the type of lab where you are most likely to encounter alien technology or invent time travel.

How you know you're visiting this style of lab: All project names are comprised of inscrutable acronyms such as "JBFM MAXCOMM"; nobody seems to have a sense of humor.

The "We Have a Research Lab Too" Research Lab

This is the model exemplified by startup companies that are feeling jealous that they don't have enough Ph.D.s working for them and feel the need to start " Research" to make their mark on the world. This generally happens the first time such a company hires an ex-academic and makes the mistake of putting them in any kind of leadership role. Projects in this kind of lab aren't that different from regular work on the product teams, apart from the expectation that launching anything will take three times longer than it would take a non-research team.

How you know you're visiting this style of lab: Hoodies with the word "Research" on them; free lunch.

Sunday, January 19, 2014

Google did not steal the smart contact lens from Microsoft

Wired is carrying an article rather provocatively entitled, "Google Stole Its Smart Contact Lens From Microsoft. And That’s a Good Thing." While the article makes a few good points, the gist of the headline is dead wrong. I now work at Google, but I was previously an academic myself and received a significant amount of funding from Microsoft while I was at Harvard. (Standard disclaimer applies: This post represents my own opinion and not that of my employer.)

The Wired article gets it wrong when it claims that Google "stole" the smart contact lens project from Microsoft. It's true that Microsoft funded the original project being done by Babak Parviz when he was on the faculty at the University of Washington. Google subsequently hired Babak (and Brian Otis, another UW faculty member) to develop the project further, as was recently announced on the Google Blog. However, I don't think anyone would consider this "stealing". Suggesting that it is poses a real problem, since it undercuts the open model used by many companies for funding university research.

It would not surprise me if Microsoft hired former faculty to work on projects that were originally funded by Google's university research programs (which, like Microsoft's, provide millions of dollars a year to university projects to undertake research). These kinds of industry research gifts generally have no strings attached. As the recipient of several Microsoft research awards, I could have used the money for anything -- pizza parties for my grad students, extravagant trips to the tropics -- without any repercussions, apart from gaining a poor reputation and probably excluding myself from consideration for future Microsoft awards. Likewise, the research output that these gifts funded had no intellectual property restrictions: the research was wholly owned by the university, and Microsoft received no IP rights whatsoever.

This is a great model for industry research funding. It provides researchers with the maximal amount of flexibility, and does not preclude a researcher from funding one project from multiple sources (even multiple awards from competing companies).

The Wired article does make a good point that Google seems to be doing a good job at taking these kinds of moonshot research ideas (like self-driving cars, Google Glass, and the smart contact lens project) to the next level, beyond the lab. But the implication that Google "stole" the research "from" Microsoft is disingenuous. I am sure most academics, and even Microsoft folks, would agree.

Sunday, August 18, 2013

Rewriting a large production system in Go

My team at Google is wrapping up an effort to rewrite a large production system (almost) entirely in Go. I say "almost" because one component of the system -- a library for transcoding between image formats -- works perfectly well in C++, so we decided to leave it as-is. But the rest of the system is 100% Go, not just wrappers to existing modules in C++ or another language. It's been a fun experience and I thought I'd share some lessons learned.

Plus, the Go language has a cute mascot ... awwww!

Why rewrite?

The first question we must answer is why we considered a rewrite in the first place. When we started this project, we adopted an existing C++ based system, which had been developed over the course of a couple of years by two of our sister teams at Google. It's a good system and does its job remarkably well. However, it has been used in several different projects with vastly different goals, leading to a nontrivial accretion of cruft. Over time, it became apparent that for us to continue to innovate rapidly would be extremely challenging on this large, shared codebase. This is not a ding to the original developers -- it is just a fact that when certain design decisions become ossified, it becomes more difficult to rethink them, especially when multiple teams are sharing the code.

Before doing the rewrite, we realized we needed only a small subset of the functionality of the original system -- perhaps 20% (or less) of what the other projects were doing with it. We were also looking at making some radical changes to its core logic, and wanted to experiment with new features in a way that would not impact the velocity of our team or the others using the code. Finally, the cognitive burden associated with making changes to any large, shared codebase is unbearable -- almost any change required touching lots of code that the developer did not fully understand, and updating test cases with unclear consequences for the other users of the code.

So, we decided to fork off and do a from-scratch rewrite. The bet we made was that taking a productivity hit during the initial rewrite would pay off in spades when we were able to add more features over time. It has also given us an opportunity to rethink some of the core design decisions of our system, which has been extremely valuable for improving our own understanding of its workings.

Why Go?

I'll admit that at first I was highly skeptical of using Go. This production system sits directly on the serving path between users and their content, so it has to be fast. It also has to handle a large query volume, so CPU and memory efficiency are key. Go's reliance on garbage collection gave me pause (pun intended ... har har har), given how much pain Java developers go through to manage their memory footprint. Also, I was not sure how well Go would be supported for the kind of development we wanted to do inside of Google. Our system has lots of dependencies, and the last thing I wanted was to have to reinvent lots of libraries in Go that we already had in C++. Finally, there was also simply the fear of the unknown.

My whole attitude changed when Michael Piatek (one of the star engineers in the group) sent me an initial cut at the core system rewrite in Go, the result of less than a week's work. Unlike the original C++ based system, I could actually read the code, even though I didn't know Go (yet). The #1 benefit we get from Go is the lightweight concurrency provided by goroutines. Instead of a messy chain of dozens of asynchronous callbacks spread over tens of source files, the core logic of the system fits in a couple hundred lines of code, all in the same file. You just read it from top to bottom, and it makes sense.
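To give a flavor of what "read it from top to bottom" means, here is a minimal sketch of the goroutine style. It is not code from our system: fetch and fetchAll are hypothetical names, and fetch just stands in for a blocking backend call.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a hypothetical stand-in for a blocking backend call.
func fetch(key string) string {
	return "content for " + key
}

// fetchAll starts one goroutine per key and collects the results in order.
// The logic reads top to bottom; there are no callbacks to chain together.
func fetchAll(keys []string) []string {
	results := make([]string, len(keys))
	var wg sync.WaitGroup
	for i, k := range keys {
		wg.Add(1)
		go func(i int, k string) {
			defer wg.Done()
			results[i] = fetch(k) // each goroutine writes only its own slot
		}(i, k)
	}
	wg.Wait() // block until every goroutine has finished
	return results
}

func main() {
	for _, r := range fetchAll([]string{"a", "b", "c"}) {
		fmt.Println(r)
	}
}
```

Each request is written as ordinary blocking code, and the concurrency comes from the go statement rather than from splitting the logic into callbacks.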

Michael also made the observation that Go is a language designed for writing Web-based services. Its standard libraries provide all of the machinery you need for serving HTTP, processing URLs, dealing with sockets, doing crypto, processing dates and timestamps, doing compression. Unlike, say, Python, Go is a compiled language and therefore very fast. Go's modular design makes for beautiful decomposition of code across modules, with clear explicit dependencies between them. Its incremental compilation approach makes builds lightning fast. Automatic memory management means you never have to worry about freeing memory (although the usual caveats with a GC-based language apply).
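As a rough illustration of how much the standard library covers, here is a small sketch of an HTTP handler that parses a URL query parameter, using only net/http and its httptest helper package. The handler name, the /hello path, and the "name" parameter are made up for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// helloHandler is a hypothetical handler: it reads the "name" query
// parameter and writes a greeting.
func helloHandler(w http.ResponseWriter, r *http.Request) {
	name := r.URL.Query().Get("name")
	if name == "" {
		name = "world"
	}
	fmt.Fprintf(w, "hello, %s", name)
}

// serveOnce runs the handler against an in-memory request and returns
// the response body, so the example works without opening a socket.
func serveOnce(target string) string {
	req := httptest.NewRequest("GET", target, nil)
	rec := httptest.NewRecorder()
	helloHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Println(serveOnce("/hello?name=gopher"))
	// A real server would instead register the handler and listen:
	//   http.HandleFunc("/hello", helloHandler)
	//   http.ListenAndServe(":8080", nil)
}
```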

Being terse

Syntactically, Go is very succinct. Indeed, the Go style guidelines encourage you to write code as tersely as possible. At first this drove me up the wall, since I was used to using long descriptive variable names and spreading expressions over as many lines as possible. But now I appreciate the terse coding approach, as it makes reading and understanding the code later much, much easier.

Personally, I really like coding in Go. I can get to the point without having to write a bunch of boilerplate just to make the compiler happy. Unlike C++, I don't have to split the logic of my code across header files and .cc files. Unlike Java, you don't have to write anything that the compiler can infer, including the types of variables. Go feels a lot like coding in a lean scripting language, like Python, but you get type safety for free.

Our Go-based rewrite is 121 Go source files totaling about 21K lines of code (including comments). Compare that to the original system, which was 1400 C++ source files with 460K lines of code. (Remember what I said about the new system implementing a small subset of the original system's functionality, though I do feel that the code size reduction is disproportionate to the functionality reduction.)

What about ramp-up time?

Learning Go is easy coming from a C-like language background. There are no real surprises in the language; it pretty much makes sense. The standard libraries are very well documented, and there are plenty of online tutorials. None of the engineers on the team have taken very long at all to come up to speed in the language; heck, even one of our interns picked it up in a couple of days.

Overall, the rewrite has taken about 5 months and is already running in production. We have also implemented 3 or 4 major new features that would have taken much longer to implement in the original C++ based system, for the reasons described above. I estimate that our team's productivity has been improved by at least a factor of ten by moving to the new codebase, and by using Go.

Why not Go?

There are a few things about Go that I'm not super happy about, and that tend to bite me from time to time.

First, you need to "know" whether the variable you are dealing with is an interface or a struct. Structs can implement interfaces, of course, so in general you tend to treat these as the same thing. But when you're dealing with a struct, you might be passing by reference, in which case the type is *myStruct, or you might be passing by value, in which case the type is just myStruct. If, on the other hand, the thing you're dealing with is "just" an interface, you never have a pointer to it -- an interface is a pointer in some sense. It can get confusing when you're looking at code that is passing things around without the * to remember that it might actually "be a pointer" if it's an interface rather than a struct.
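Here is a small sketch of that confusion. The names counter and incrementer are hypothetical and exist only for illustration. Note that byInterface has no * anywhere in its signature, yet it modifies the caller's struct.

```go
package main

import "fmt"

// counter and incrementer are hypothetical names used only for illustration.
type counter struct{ n int }

// Inc has a pointer receiver, so only *counter satisfies incrementer.
func (c *counter) Inc() { c.n++ }

type incrementer interface{ Inc() }

// byValue receives a copy of the struct; the caller's counter is unchanged.
func byValue(c counter) { c.Inc() }

// byPointer receives a *counter; the caller's counter is modified.
func byPointer(c *counter) { c.Inc() }

// byInterface has no * in its signature, but the interface value holds a
// *counter, so the caller's counter is modified here too.
func byInterface(i incrementer) { i.Inc() }

func main() {
	c := counter{}
	byValue(c)
	fmt.Println(c.n) // 0: only the copy was incremented
	byPointer(&c)
	fmt.Println(c.n) // 1
	byInterface(&c)
	fmt.Println(c.n) // 2
}
```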

Go's type inference makes for lean code, but requires you to dig a little to figure out what the type of a given variable is if it's not explicit. So given code like:
foo, bar := someFunc(baz) 
You'd really like to know what foo and bar actually are, in case you want to add some new code to operate on them. If I could get out of the 1970s and use an editor other than vi, maybe I would get some help from an IDE in this regard, but I staunchly refuse to edit code with any tool that requires using a mouse.
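One low-tech workaround that needs no IDE is to print the inferred types with fmt's %T verb. A minimal sketch follows, in which someFunc is a made-up function (the real one could return anything) and typesOf is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// someFunc is a hypothetical stand-in; nothing at the call site reveals
// what it returns.
func someFunc(baz string) (int, []string) {
	return len(baz), strings.Fields(baz)
}

// typesOf reports the dynamic types of two values using the %T verb.
func typesOf(a, b interface{}) string {
	return fmt.Sprintf("%T, %T", a, b)
}

func main() {
	foo, bar := someFunc("hello gopher")
	fmt.Println(typesOf(foo, bar)) // int, []string
}
```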

Finally, Go's liberal use of interfaces allows a struct to implement an interface "by accident". You never have to explicitly declare that a given struct implements a particular interface, although it's good coding style to mention this in the comments. The problem with this is that it can be difficult to tell when you are reading a given segment of code whether the developer intended for their struct to implement the interface that they appear to be projecting onto it. Also, if you want to refactor an interface, you have to go find all of its (undeclared) implementations more or less by hand.
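One common Go idiom for making the intent explicit is a compile-time assertion: assign a value of the struct type to a blank variable of the interface type. A minimal sketch, with hypothetical shape and square types:

```go
package main

import "fmt"

// shape and square are hypothetical names used only for illustration.
type shape interface {
	Area() float64
}

type square struct{ side float64 }

func (s square) Area() float64 { return s.side * s.side }

// Compile-time check: the build fails if square ever stops satisfying
// shape, and the line documents that the implementation is intentional.
var _ shape = square{}

// describe accepts any shape and formats its area.
func describe(s shape) string {
	return fmt.Sprintf("area=%.1f", s.Area())
}

func main() {
	fmt.Println(describe(square{side: 2})) // area=4.0
}
```

This does not list an interface's implementations for you, but when the interface is refactored, every type carrying such an assertion fails to compile until it is updated.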

Most of all I find coding in Go really, really fun. This is a bad thing, since we all know that "real" programming is supposed to be a grueling, painful exercise of fighting with the compiler and tools. So programming in Go is making me soft. One day I'll find myself in the octagon ring with a bunch of sweaty, muscular C++ programmers bare-knuckling it out to the death, and I just know they're going to mop the floor with me. That's OK, until then I'll just keep on cuddling my stuffed gopher and running gofmt to auto-indent my code.

ObDisclaimer: Everything in this post is my personal opinion and does not represent the view of my employer.

Thursday, July 11, 2013

Does the academic process slow innovation?

I've been wondering recently whether the extended, baroque process of doing research in an academic setting (by which I mean either a university or an "academic style" research lab in industry) is doing more harm than good when it comes to the pace of innovation.

Prior to moving to industry, I spent my whole career as an academic. It took me a while to get used to how fast things happen in industry. My team, which is part of Chrome, does a new major release every six weeks. This is head-spinningly fast compared to academic projects. Important decisions are made on the order of days, not months. Projects are started up and executed an order of magnitude faster than it would take a similarly-sized academic research group to get up to speed.

This is not just about having plenty of funding (although that is part of it). It is also about what happens when you abandon the trappings of the academic process, for which the timelines are glacial:
  • A three month wait (typically) to get a decision on a conference submission, during which time you are not allowed to submit similar work elsewhere.
  • A six month wait on hearing back on a grant proposal submission.
  • A year or more wait for a journal publication, with a similar restriction on parallel submissions.
  • Five plus years to get a PhD.
  • Possibly one or two years as a postdoc.
  • Six to eight years to get tenure.
  • A lifetime of scarring as the result of the above. (Okay, I'm kidding. Sort of.)
This is not a problem unique to computer science of course. In the medical field, the average age at which a PI receives their first NIH R01 grant is 44 years. Think about that for a minute. That's 23-some-odd years after graduation before an investigator is considered an "independent" contributor to the research field. Is this good for innovation?


Part of the problem is that the academic process is full of overheads. Take a typical conference program committee for example. Let's say the committee has 15 members, each of whom has 30 papers to review (this is pretty average, for good conferences at least). Each paper takes at least an hour to review (often more) - that's the equivalent of at least 4 work days (that is, assuming academics work only 8 hours a day ... ha ha!). Add on two more full days (minimum) for the program committee meeting and travel, and you're averaging about a full week of work for each PC member. Multiply by 15 -- double it for the two program co-chairs -- and you're talking about around 870 person-hours combined effort to decide on the 25 or so papers that will appear in the conference. That's about 35 person-hours of overhead per paper. This doesn't count any of the overheads associated with actually organizing the conference -- making the budget, choosing the hotel, raising funds, setting up the website, publishing the proceedings, organizing the meals and poster sessions, renting the projectors ... you get my point.
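For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch. It rests on one assumption about how the co-chairs are counted: the two co-chairs are in addition to the 15 members, each at double a regular member's load. Under that reading the numbers land close to the figures quoted above. The function name pcHours is made up for this example.

```go
package main

import "fmt"

// pcHours estimates total program-committee effort in person-hours.
// Assumption: the co-chairs are counted in addition to the regular members,
// each at double a regular member's load.
func pcHours(members, chairs, papersPerMember, hoursPerPaper, meetingDays, hoursPerDay int) int {
	// Reviewing time plus meeting and travel time for one member.
	perMember := papersPerMember*hoursPerPaper + meetingDays*hoursPerDay
	return members*perMember + chairs*2*perMember
}

func main() {
	total := pcHours(15, 2, 30, 1, 2, 8)
	accepted := 25
	fmt.Printf("total: %d person-hours\n", total) // total: 874 person-hours
	fmt.Printf("per accepted paper: %.1f person-hours\n",
		float64(total)/float64(accepted)) // per accepted paper: 35.0 person-hours
}
```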

The question is, does all of this time and effort (a) produce better science or (b) lead to greater understanding or impact? I want to posit that the answer is no. This process was developed decades ago in a pre-digital era where we had no other way to disseminate research results. (Hell, it's gotten much easier to run a program committee now that submissions are done via the web -- it used to be you had to print out 20 copies of your paper and mail them to the program chair who would mail out large packets to each of the committee members.)

But still, we cling to this process because it's the only way we know how to get PhD students hired as professors and get junior faculty tenured -- any attempt to buck the trend would no doubt jeopardize the career of some young academic. It's sad.

How did we get here?

Why do we have these processes in the first place? The main reason is competition for scarce resources. Put simply, there are too many academics, and not enough funding and not enough paper-slots in good conference venues. Much has been said about the sad state of public funding for science research. Too many academics competing for the same pool of money means longer processes for proposal reviews and more time re-submitting proposals when they get rejected.

As far as the limitation on conferences goes, you can't create more conferences out of thin air, because people wouldn't have time to sit on the program committees and travel to all of them (ironic, isn't it?). Whenever someone proposes a new conference venue there are groans of "but how will we schedule it around SOSP and OSDI and NSDI and SIGCOMM?!?" - so forget about that. Actually, I think the best model would be to adopt the practice of some research communities and have one big mongo conference every year that everybody goes to (ideally in Mexico) and have USENIX run it so the scientists can focus on doing science and leave the conference organization to the experts. But I digress.

The industrial research labs don't have the same kind of funding problem, but they still compete for paper-slots. And I believe this inherently slows everything down because you can't do new research when you have to keep backtracking to get that paper you spent so many precious hours on finally published after the third round of rejections with "a strong accept, two weak accepts, and a weak reject" reviews. It sucks.

Innovative != Publishable

My inspiration for writing this post came from the amazing pace at which innovation is happening in industry these days. The most high-profile of these are crazy "moon shot" projects like SpaceX, 23andMe, and Google's high-altitude balloons to deliver Internet access to entire cities. But there are countless other, not-as-sexy innovations happening every day at companies big and small, just focused on changing the world, rather than writing papers about it.

I want to claim that even with all of their resources, had these projects gone down the conventional academic route -- writing papers and the like -- they would have never happened. No doubt if a university had done the equivalent of, say, Google Glass and submitted a MobiSys paper on it, it would have been rejected as "not novel enough" since Thad Starner has been wearing a computer on his head for 20 years. And high-altitude Internet balloons? What's new about that? It's just a different form of WiFi, essentially. Nothing new there.

We still need to publish research, though, which is important for driving innovation. But we should shift to an open, online publication model -- like arXiv -- where everything is "accepted" and papers are reviewed and scored informally after the fact. Work can get published much more rapidly and good work won't be stuck in the endless resubmission cycle. Scientists can stop wasting so much time and energy on program committees and conference organization. (We should still have one big conference every year so people still get to meet and drink and bounce ideas around.)  This model is also much more amenable to publications from industry, who currently have little incentive to run the conference submission gauntlet, unless publishing papers is part of their job description. And academics can still use citation counts or "paper ratings" as the measure by which hiring and promotion decisions are made.

Wednesday, May 15, 2013

What I wish systems researchers would work on

I just got back from HotOS 2013 and, frankly, it was a little depressing. Mind you, the conference was really well-organized; there were lots of great people; an amazing venue; and fine work by the program committee and chair... but I could not help being left with the feeling that the operating systems community is somewhat stuck in a rut.

It did not help that the first session was about how to make network and disk I/O faster, a topic that has been a recurring theme for as long as "systems" has existed as a field. HotOS is supposed to represent the "hot topics" in the area, but when we're still arguing about problems that are 25 years old, it starts to feel not-so-hot.

Of the 27 papers presented at the workshop, only about 2 or 3 would qualify as bold, unconventional, or truly novel research directions. The rest were basically extended abstracts of conference submissions that are either already in preparation or will be submitted in the next year or so. This is a perennial problem for HotOS, and when I chaired it in 2011 we had the same problem. So I can't fault the program committee on this one -- they have to work with the submissions they get, and often the "best" and most polished submissions represent the most mature (and hence less speculative) work. (Still, this year there was no equivalent to Dave Ackley's paper in 2011 which challenged us to "pledge allegiance to the light cone.")

This got me thinking about what research areas I wish the systems research community would spend more time on. I wrote a similar blog post after attending HotMobile 2013, so it's only fair that I would subject the systems community to the same treatment. A few ideas...

Obligatory disclaimer: Everything in this post is my personal opinion and does not represent the view of my employer.

An escape from configuration hell: A lot of research effort is focused on better techniques for finding and mitigating software bugs. In my experience at Google, the vast majority of production failures arise not due to bugs in the software, but bugs in the (often vast and incredibly complex) configuration settings that control the software. A canonical example is when someone bungles an edit to a config file which gets rolled out to the fleet, and causes jobs to start behaving in new and often not-desirable ways. The software is working exactly as intended, but the bad configuration is leading it to do the wrong thing.

This is a really hard problem. A typical Google-scale system involves many interacting jobs running very different software packages, each with its own mechanism for runtime configuration: command-line flags, some kind of special-purpose configuration file (often in a totally custom ASCII format), or a fancy dynamically updated key-value store. The configurations are often operating at very different levels of abstraction -- everything from deciding where to route network packets, to Thai and Slovak translations of UI strings seen by users. "Bad configurations" are not just obvious things like syntax errors; they also include unexpected interactions between software components when a new (perfectly valid) configuration is used.

There are of course tools for testing configurations, catching problems and rapidly rolling back bad changes, etc. but a tremendous amount of developer and operational energy goes into fixing problems arising due to bad configurations. This seems like a ripe area for research.
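To make the "perfectly valid but still bad" case concrete, here is a minimal sketch of a cross-component check. Everything in it is an assumption for illustration: the config keys (`timeout_s`, `retries`), the function name, and the rule itself are hypothetical and are not how any Google tool works. Each config would pass a syntax check on its own; the problem only shows up when the two are considered together.

```python
def check_timeout_interaction(frontend_cfg, backend_cfg):
    """Return a list of problems that only appear when both configs are combined.

    Hypothetical rule: the frontend must wait at least as long as the
    backend's worst case (one initial attempt plus all of its retries).
    """
    problems = []
    worst_case_s = backend_cfg["timeout_s"] * (backend_cfg["retries"] + 1)
    if frontend_cfg["timeout_s"] < worst_case_s:
        problems.append(
            f"frontend timeout {frontend_cfg['timeout_s']}s is shorter than "
            f"backend worst case {worst_case_s}s"
        )
    return problems


# Both configs are individually well-formed (hypothetical values).
frontend = {"timeout_s": 5}
backend = {"timeout_s": 2, "retries": 2}

for problem in check_timeout_interaction(frontend, backend):
    print("config interaction:", problem)
```

A check like this only catches interactions somebody thought to write down in advance, which is exactly why the general problem is hard.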

Understanding interactions in a large, production system: The common definition of a "distributed system" assumes that the interactions between the individual components of the system are fairly well-defined, and dictated largely by whatever messaging protocol is used (e.g., two-phase commit or Paxos). In reality, the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.

Let me give a concrete example. Recently we encountered a problem where a bunch of jobs in one datacenter started crashing due to running out of file descriptors. Since this roughly coincided with a push of a new software version, we assumed that there must have been some leak in the new code, so we rolled back to the old version -- but the crash kept happening. We couldn't just take down the crashing jobs and let the traffic flow to another datacenter, since we were worried that the increased load would trigger the same bug elsewhere, leading to a cascading failure. The engineer on call spent many, many hours trying different things and trying to isolate the problem, without success. Eventually we learned that another team had changed the configuration of their system which was leading to many more socket connections being made to our system, which put the jobs over the default file descriptor limit (which had never been triggered before). The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.
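For readers who have not run into this limit, here is a minimal sketch of how a process could notice that it is getting close to its file descriptor limit. It assumes a Linux system (it reads `/proc/self/fd`, and the `resource` module is Unix-only). The function names and the 0.8 warning threshold are hypothetical choices for illustration, not what our systems actually did.

```python
import os
import resource  # Unix-only standard library module


def fd_soft_limit():
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def count_open_fds():
    """Approximate the number of open descriptors (Linux-specific).

    The count includes the descriptor briefly opened to list the directory.
    """
    return len(os.listdir("/proc/self/fd"))


def near_fd_limit(open_fds, soft_limit, threshold=0.8):
    """True when usage reaches the (hypothetical) warning threshold."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no limit configured, so nothing to exhaust
    return open_fds >= threshold * soft_limit


if __name__ == "__main__":
    if near_fd_limit(count_open_fds(), fd_soft_limit()):
        print("warning: approaching the file descriptor limit")
```

Note that a check like this would have flagged the symptom, not the cause: the extra connections came from another team's configuration change.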

Somehow there needs to be a way to perform offline analysis and testing of large, complex systems so that we can catch these kinds of problems before they crop up in production. Of course we have extensive testing infrastructure, but the "hard" problems always come up when running in a real production environment, with real traffic and real resource constraints. Even integration tests and canarying are a joke compared to how complex production-scale systems are. I wish I had a way to take a complete snapshot of a production system and run it in an isolated environment -- at scale! -- to determine the impact of a proposed change. Doing so on real hardware would be cost-prohibitive (even at Google), so how do you do this in a virtual or simulated setting?

I'll admit that these are not easy problems for academics to work on. Unless you have access to a real production system, you're unlikely to encounter them, and replicating them in an academic environment may be difficult. Doing internships at companies is a great way to get exposure to this kind of thing.

Pushing the envelope on new computing platforms: I also wish the systems community would come back to working on novel and unconventional computing platforms. The work on sensor networks in the 2000s really challenged our assumptions about the capabilities and constraints of a computer system, and forced us down some interesting paths in terms of OS, language, and network protocol design. In doing these kinds of explorations, we learn a lot about how "conventional" OS concepts map (or don't map) onto the new platform, and the new techniques can often find a home in a more traditional setting: witness how the ideas from Click have influenced all kinds of systems unrelated to its original goals.

I think it is inevitable that in our lifetimes we will have a wearable computing platform that is "truly embedded": either with a neural interface, or with something almost as good (e.g. seamless speech input and visual output in a light and almost-invisible form factor). I wore my Google Glass to HotOS, which stirred up a lot of discussions around privacy issues, what the "killer apps" are, what abstractions the OS should support, and so forth. I would call Google Glass an early example of the kind of wearable platform that may well replace smartphones, tablets, and laptops as the personal computing interface of choice in the future. If that is true, then now is the time for the academic systems community to start working out how we're going to support such a platform. There are vast issues around privacy, energy management, data storage, application design, algorithms for vision and speech recognition, and much more that come up in this setting.

These are all juicy and perfectly valid research problems for the systems community -- if only it is bold enough to work on them.