Monday, March 12, 2012

Do you need a PhD?

Since I decamped from the academic world to industry, I am often asked (usually by first or second year graduate students) whether it's "worth it" to get a PhD in Computer Science if you're not planning a research career. After all, you certainly don't need a PhD to get a job at a place like Google (though it helps). Hell, many successful companies (Microsoft and Facebook among them) have been founded by people who never got their undergraduate degrees, let alone a PhD. So why go through the 5-to-10 year, grueling and painful process of getting a PhD when you can just get a job straight out of college (degree or not) and get on with your life, making the big bucks and working on stuff that matters?

Doing a PhD is certainly not for everybody, and I do not recommend it for most people. However, I am really glad I got my PhD rather than just getting a job after finishing my Bachelor's. The number one reason is that I learned a hell of a lot doing the PhD, and most of the things I learned I would never get exposed to in a typical software engineering job. The process of doing a PhD trains you to do research: to read research papers, to run experiments, to write papers, to give talks. It also teaches you how to figure out what problem needs to be solved. You gain a very sophisticated technical background doing the PhD, and having your work subject to the intense scrutiny of the academic peer-review process -- not to mention your thesis committee.

I think of the PhD a little like the Grand Tour, a tradition in the 16th and 17th centuries where youths would travel around Europe, getting a rich exposure to high society in France, Italy, and Germany, learning about art, architecture, language, literature, fencing, riding -- all of the essential liberal arts that a gentleman was expected to have experience with to be an influential member of society. Doing a PhD is similar: You get an intense exposure to every subfield of Computer Science, and have to become the leading world's expert in the area of your dissertation work. The top PhD programs set an incredibly high bar: a lot of coursework, teaching experience, qualifying exams, a thesis defense, and of course making a groundbreaking research contribution in your area. Having to go through this process gives you a tremendous amount of technical breadth and depth.

I do think that doing a PhD is useful for software engineers, especially those that are inclined to be technical leaders. There are many things you can only learn "on the job," but doing a PhD, and having to build your own compiler, or design a new operating system, or prove a complex distributed algorithm from scratch is going to give you a much deeper understanding of complex Computer Science topics than following coding examples on StackOverflow.

Some important stuff I learned doing a PhD:

How to read and critique research papers. As a grad student (and a prof) you have to read thousands of research papers, extract their main ideas, critique the methods and presentation, and synthesize their contributions with your own research. As a result you are exposed to a wide range of CS topics, approaches for solving problems, sophisticated algorithms, and system designs. This is not just about gaining the knowledge in those papers (which is pretty important), but also about becoming conversant in the scientific literature.

How to write papers and give talks. Being fluent in technical communications is a really important skill for engineers. I've noticed a big gap between the software engineers I've worked with who have PhDs and those who don't in this regard. PhD-trained folks tend to give clear, well-organized talks and know how to write up their work and visualize the result of experiments. As a result they can be much more influential.

How to run experiments and interpret the results: I can't overstate how important this is. A systems-oriented PhD requires that you run a zillion measurements and present the results in a way that is both bullet-proof to peer-review criticism (in order to publish) and visually compelling. Every aspect of your methodology will be critiqued (by your advisor, your co-authors, your paper reviewers) and you will quickly learn how to run the right experiments, and do it right.

How to figure out what problem to work on: This is probably the most important aspect of PhD training. Doing a PhD will force you to cast away from shore and explore the boundary of human knowledge. (Matt Might's cartoon on this is a great visualization of this.) I think that at least 80% of making a scientific contribution is figuring out what problem to tackle: a problem that is at once interesting, open, and going to have impact if you solve it. There are lots of open problems that the research community is not interested in (c.f., writing an operating system kernel in Haskell). There are many interesting problems that have been solved over and over and over (c.f., filesystem block layout optimization; wireless multihop routing). There's a real trick to picking good problems, and developing a taste for it is a key skill if you want to become a technical leader.

So I think it's worth having a PhD, especially if you want to work on the hardest and most interesting problems. This is true whether you want a career in academia, a research lab, or a more traditional engineering role. But as my PhD advisor was fond of saying, "doing a PhD costs you a house." (In terms of the lost salary during the PhD years - these days it's probably more like several houses.)


Tuesday, February 7, 2012

My love affair with code reviews

One of the most life-altering events in my move from academia to industry was the discovery of code reviews. This is pretty standard fare for developers in the "real world", but I have never heard of an academic research group using them, and had never done code reviews myself before joining Google.

In short: Code reviews are awesome. Everyone should use them. Heck, my dog should use them. You should too.

For those of you not in the academic research community, you have to understand that academics are terrible programmers. (I count myself among this group.) Academics write sloppy code, with no unit tests, no style guidelines, and no documentation. Code is slapped together by grad students, generally under pressure of a paper deadline, mainly to get some graphs to look pretty without regard for whether anyone is ever going to run the code ever again. Before I came to Google, that was what "programming" meant to me: kind of a necessary side effect of doing research, but the result was hardly anything I would be proud to show my mother. (Or my dog, for that matter.) Oh, sure, I released some open source code as an academic, but now I shudder to think of anyone at a place like Google or Microsoft or Facebook actually reading that code (please don't, I'm begging you).

Then I came to Google. Lesson #1: You don't check anything in until it has been reviewed by someone else. This took some getting used to. Even an innocent four-line change to some "throw away" Python script is subject to scrutiny. And of course, most of the people reviewing my code were young enough to be my students -- having considered myself to be an "expert programmer" (ha!), it is a humbling experience for a 23-year-old one year out of college to show you how to take your 40 lines of crap and turn them into one beautiful, tight function -- and how to generalize it and make it testable and document the damn thing for chrissakes.

So there's a bunch of reasons to love code reviews:

Maintain standards. This is pretty obvious but matters tremendously. The way I think of it, imagine you get hit by a truck one day, and 100 years from now somebody who has never heard of your code gets paged at 3 a.m. because something you wrote was suddenly raising exceptions. Not only does your code have to work, but it also needs to make sense. Code reviews force you to write code that fits together, that adheres to the style guide, that is testable.

Catch bugs before you check in. God, I can't count the number of times someone has pointed out an obvious (or extremely subtle) bug in my code during the code review process. Having another pair of eyes (or often several pairs of eyes) looking at your code is the best way to catch flaws early.

Learn from your peers. I have learned more programming techniques and tricks from doing code reviews than I ever did reading O'Reilly books or even other people's code. A couple of guys on my team are friggin' coding ninjas and suggest all kinds of ways of improving my clunky excuse for software. You learn better design patterns, better approaches for testing, better algorithms by getting direct feedback on your code from other developers.

Stay on top of what's going on. Doing code reviews for other people is the best way to understand what's happening in complex codebase. You get exposed to a lot of different code, different approaches for solving problems, and can chart the evolution of the software over time -- a very different experience than just reading the final product.

I think academic research groups would gain a lot by using code reviews, and of course the things that go with them: good coding practices, a consistent style guide, insistence on unit tests. I'll admit that code quality matters less in a research setting, but it is probably worth the investment to use some kind of process.


The thing to keep in mind is that there is a social aspect to code reviews as well. At Google, you need an LGTM from another developer before you're allowed to submit a patch. It also takes a lot of time to do a good code review, so it's standard practice to break large changes into smaller, more review-friendly pieces. And of course the expectation is you've done your due diligence by testing your code thoroughly before sending it for review.

Don't code reviews slow you down? Somewhat. But if you think of code development as a pipeline, with multiple code reviews in the flight at a time you can still sustain a high issue rate, even if each individual patch has higher latency. Generally developers all understand that being a hardass on you during the review process will come back to bite them some day -- and they understand the tradeoff between the need to move quickly and the need to do things right. I think code reviews can also serve to build stronger teams, since everyone is responsible for doing reviews and ensuring the quality of the shared codebase. So if done right, it's worth it.


Okay, Matt. I'm convinced. How can I too join the code review bandwagon? Glad you asked. The tool we use internally at Google was developed by none other than Guido van Rossum, who has graciously released a similar system called Rietveld as open source. Basically, you install Rietveld on AppEngine, and each developer uses a little Python script to upload patches for review. Reviews are done on the website, and when the review is complete, the developer can submit the patch. Rietveld doesn't care which source control system you use, or where the repository is located -- it just deals with patches. It's pretty slick and I've used it for a couple of projects with success.

Another popular approach is to use GitHub's "pull request" and commenting platform as a code review mechanism. Individual developers clone a master repository, and submit pull requests to the owner of that repository for inclusion. GitHub has a nice commenting system allowing for code reviews to be used with pull requests.

I was floored the other day when I met an engineer from a fairly well-known Internet site who said they didn't use code reviews internally -- and complained about how messy the code was and how poorly designed some pieces were. No kidding! Code reviews aren't the ultimate solution to a broken design process, but they are an incredibly useful tool.

Monday, January 23, 2012

Making universities obsolete

Sebastian Thrun recently announced that he was leaving Stanford to found a free, online university called Udacity. This is based on his experiences teaching the famous intro to AI class, for free, to 160,000 students online.

Is this just Education for the Twitter Generation? Or truly a revolution in how we deliver higher education? Will this ultimately render universities obsolete?

I want to ponder the failings of the conventional higher education model for a minute and see where this leads us, and consider whether something like Udacity is really the solution.

Failure #1: Exclusivity.

In Sebatian's brilliant talk at DLD, he talks about being embarrassed that he was only able to teach a few tens of students at a time, and only to those who can afford $30,000 to attend Stanford. I estimate that I taught fewer than 500 students in total during my eight years on the faculty at Harvard. That's a pretty poor track record by any stretch.

It gets worse. I know plenty of faculty who love to give tough courses, in which they would teach really hard material at the beginning of the semester to "weed out" the weaker students, sometimes being left with only 2 or 3 really committed and really good students in the class. This is so much more satisfying as a professor, since you don't need to worry about tutoring the weaker students, and the fewer students you have, the less work you have to do grading and so on. There is no penalty for doing this - and rarely any incentive given for teaching a larger, more popular course.

Exclusivity is necessary when you only have so much classroom space, or so many dorms, or so many dining halls, so you have to be selective about who enters the hallowed gates of the university. It's also a way of maintaining a brand: even schools, like Harvard, with a "distance education" component go to great lengths to differentiate the "true" Harvard education from a "distance learning certificate," lest they raise the ire of the Old Boys' Network by watering down what it means to get a Harvard degree (not unlike the reaction they got when they started admitting women, way way back in 1977).

Failure #2: Grades.


Can someone remind me why we still have grades? I like what Sebastian says (quoting Salman Khan) about learning to ride a bicycle: It's not as if you get a D learning to ride a bike, then you stop and move onto learning the unicycle. Shouldn't the goal of every course be to get every student to the point of making an A+?

Apparently not. The common argument is that we need grades in order to differentiate the "good" from the "bad" students. Presumably the idea is that if you can't get through a course in the 12-to-13 week semester then you deserve to fail, regardless of whatever is going on in your life and whether you could have learned everything over a longer time span, or with more help, or whatever. And the really smart students, the ones who nail it the first time, and make A's in every class, need to float to the top so they get the first dibs on good jobs or law school or medical school or whatever rewards they have been working all of their young lives to achieve. It would not be fair if everyone made an A+ -- how would the privileged and smart kids gain any advantage over the less privileged, less intelligent kids?

It seems to me that this is completely at odds with the idea of education.

Failure #3: Lectures.


As Sebastian says, universities have been using the lecture format for more than a thousand years. I used to tell students that they were required to come to my lectures, and never provided my lectures by video, lest the students skipped class and watched it on YouTube from their dorms instead. Mostly this was to ensure that everybody in the class got the benefit of my dynamic and entertaining lecture style, which I worked so hard to perfect over the years (complete with a choreographed interpretive dance demonstrating the movement of the disk heads during a Log-Structured Filesystem cleaning operation.) But mostly it was to boost my ego and get some gratification for working so hard on the lectures, by having the students physically there in class as an audience.

Implications


I'm not sure whether Udacity and Khan Academy and iTunes University are really the solution to these problems. Clearly they are not a replacement for the conventional university experience -- you can't go to a frat party, or join a Finals Club, or make out in the library stacks while getting your degree from Online U. (At least not yet.)

But I think there are two important things that online universities bring to the table: (1) Broadening access to higher education, and (2) Leveraging technology to explore new approaches to learning.

The real question is whether broadening access ends up reinforcing the educational caste system: if you're not smart or rich enough to go to a "real university," you become one of those poor, second-class students with a certificate Online U. Would employers or graduate schools ever consider such a certificate, where everyone makes an A+, equivalent to an artium baccalaureus from the Ivy League school of your choice?

If not, is that because we truly believe that students are getting a better education sitting in a dusty classroom and having paid the proverbial $30,000 a year rather than doing the work online? This reminds me of my friends who have been through medical school, where the conventional wisdom is that doctors need to be trained using the classical methods (unbelievable amounts of rote memorization, soul-destroying clinical rotations and countless overnight shifts) because that's how it's been done for hundreds of years -- not because anybody thinks it yields better-trained doctors.

And I think universities have a long way to go towards embracing new technologies and new ways of teaching students. Sebastian makes a great point about the online AI class feeling more "intimate" to some students, in part because it really is a feeling of a one-on-one experience watching a video: you're not sitting in a big lecture hall surrounded by a bunch of other students, you're at home, in your PJs, drinking a beer and watching the video on your own laptop. A lot of this also has to do with Sebastian's teaching style, using a series of short quizzes that are auto-graded by the system. It is not just a lecture. For this reason I think that replacing live courses with videotaped lectures is not going far enough (and may in fact be detrimental).

Another benefit of the video delivery model is that you can replay it infinitely many times. Missed a point? Confused? Rewind and watch it again. What about questions? In large courses almost nobody asks questions, apart from the really smart students who should shut the hell up and not ask questions anyway. There are plenty of ways to deal with questions in an online course format, just not live, during a (limited time) lecture in which your question is likely going to annoy the rest of the class who almost certainly gets it already.

Risks


I'm going to close this little rant with a few caveats. It's fashionable to talk about "University 2.0" and How the Internet Changes Everything and disruptive technologies and all that. But a shallow, 18-minute video on the first 200 years of American History can't replace conventional coursework, deep reading, and essays. You can't tweet your way through college. Learning and teaching are hard work, and need to be taken seriously by both the student and educator.

Although expanding access to education is a great thing, it's simply not the case that everyone is smart enough to do well in any subject. For example, I'm terrible at math (which is why I'm a systems person, natch), and damn near failed to complete my CS theory course requirement at Berkeley as a result. Education should give everyone the opportunity to succeed, but the ultimate responsibility (and raw ability) comes down to the student.

Finally, it goes without saying that the most important experiences I ever had in college were outside of the classroom. I'm not just talking about staying up late and watching "Wayne's World" for the millionth time while drinking Zima, I'm talking about doing research, building things, learning from and being inspired by my fellow students. Making lectures obsolete is one thing; but I'm not sure there can ever be an online replacement for The College Experience writ large. Though 4Chan seems to be a pretty close approximation.

Monday, November 7, 2011

Research without walls

I recently signed the Research Without Walls pledge, which says that I will not do any peer review work for conferences, journals, or other scientific venues that do not make the results available for free via the Web. Like many scientists, I commit hundreds of hours a year to serving on program committees and reviewing journal papers, but the result of that (volunteer) work is essentially that the research results get locked behind a copyright license that is inconsistent with the way in which scientists actually disseminate their results -- for free, via the Web.

I believe that there is absolutely no reason for research results, especially those supported by public funding, not to be made open to the entire world. It's time for the computer science research community to move in this direction. Of course, this is going to mean a big change in the role of the professional societies, such as ACM and IEEE. It's time we made that change, as painful as it might be.


What is open access?

The issue of "open access research" often gets confused with questions such as where the papers are hosted, who owns the copyright, and whether authors are allowed to post their own papers on their website. In most cases, copyright in research publications is not held by the authors, but rather the professional societies that organize a conference or run a journal. For example, ACM and IEEE typically require authors to assign copyright to them, although they might grant the author a license to post their own research papers on their website. However, allowing authors to post papers on the Web is not the same as open access. It is an extremely limited license: posting papers on the Web does not give other scientists or students the right to share or archive those papers, or for anyone to use them for any other purpose other than downloading them for personal use. It is not unlike going to the library and borrowing a book; you still have to return it later, and you can't make copies for others.

With rare exception, every paper I have published is available for download on my website. In most cases, I have a license to do this; in others, I am probably in violation of copyright for doing so. The idea that I might get a cease-and-desist letter one day asking me to take down my own scientific papers bothers me to no end. I worked hard on those papers, and in most cases, spent hundreds of thousands of dollars of public funding to undertake the research that went into each of them.

For most of these publications, I even paid hundreds of dollars to the professional societies -- for membership fees and conference registrations for myself and my students -- to present the work at the associated conference. But yet, I don't own copyright in most of those works, and the main beneficiaries of all of this work are organizations like the ACM. It seems to me that these results should be open for everyone to benefit from, since, well, "we" (meaning, the taxpayers) paid for them.

ACM's Author-izer Service

Recently, the ACM announced a new service called the "Author-izer" (whoever came up with this name will be first against the wall when the revolution comes), that allows authors to generate free links to their publications hosted on the ACM Digital Library. This is not open access, either: this is actually a way for ACM to discourage the spread of "rogue posting" of PDF files and monetize access to the content down the road. For example, those free links will stop working when the website hosting them moves (e.g., when a student graduates). Essentially, ACM wants to control all access to "its" research library, and for good reason: it brings in a lot of revenue.

USENIX's open access policy


USENIX has a much more sane policy. Back in 2008, USENIX  announced that all of their conference proceedings would be open access, and indeed you can download PDFs of all USENIX papers from the corresponding conference website (see, for example, http://www.usenix.org/events/hotcloud11/tech/ for the proceedings from HotCloud'11).

USENIX does not ask authors to assign copyright to them. Instead, for one year from the publication date, USENIX gets an exclusive license to publish the work (both in print and electronic form), with the usual license granted back to the author to post copies on their website. After the one-year exclusivity period, USENIX retains a non-exclusive license to distribute the work forever. This is a good policy, though in my opinion it does not go far enough: USENIX does not require authors to release their work under an open access license. USENIX is kind enough to post PDFs for free on the Web, but tomorrow, USENIX could reverse this decision and put all of those papers behind a paywall, or take them down entirely. (No, I don't think this is going to happen, but you never know.)


University open access initiatives


Another way to fight back is for your home institution to require all of your work be made open. Harvard was one of the first major universities to do this. This ambitious effort, spearheaded by my colleague Stuart Shieber, required all Harvard affiliates to submit copies of their published work to the open-access Harvard DASH archive. While in theory this sounds great, there are several problems with this in practice. First, it requires individual scientists to do the legwork of securing the rights and submitting the work to the archive. This is a huge pain and most folks don't bother. Second, it requires that scientists attach a Harvard-supplied "rider" to the copyright license (e.g., from the ACM or IEEE) allowing Harvard to maintain an open-access copy in the DASH repository. Many, many publishers have pushed back on this. Harvard's response was to allow its affiliates to get an (automatic) waiver of the open-access requirement. Well, as soon as word got out that Harvard was granting these waivers, the publishers started refusing to accept the riders wholesale, claiming that the scientist could just request a waiver. So the publishers tend to win.

Creative Commons for research publications

The only way to ensure that research is de jure open access, rather than merely de facto, is by baking the open access requirement into the copyright license for the work. This is very much in the same spirit as the GPL is for software licensing. What I really want is for all research to be published under something like a Creative Commons Attribution 3.0 Unported license, allowing others to share, remix, and make commercial use of the work as long as attribution is given. This kind of license would prevent professional organizations from locking down research results, and give maximum flexibility for others to make use of the research, while retaining the conventional expectations of attribution. The "remix" clause might seem a little problematic, given that peer review expects original results, but the attribution requirement would not allow someone to submit work that is not their own and claim authorship. And there are many ways in which research can be legitimately remixed: incorporated into a talk, class notes, or collection, for example.

What happens to the publishers?


Traditional scientific publishers, like Elsevier, go out of business. I don't have a problem with that. One can make a strong argument that traditional scientific publishers have fairly limited value in today's world. It used to be that scientists needed publishers to disseminate their work; this has not been true for more than a decade.

Professional organizations, like ACM and IEEE, will need to radically change what they do if they want to stay alive. These organizations do many other things other than run conferences and journals. Unfortunately, a substantial amount of their operating budget comes from controlling access to scientific literature. Open access will drastically change that. Personally, I'd rather be a member of a leaner, more focused professional society that can focus its resources on education and policymaking than supporting a gazillion "Special Interest Groups" and journals that nobody reads.

Seems to me that USENIX strikes the right balance: They focus on running conferences. Yes, you pay through the nose to attend these events, though it's not any more expensive than a typical ACM or IEEE conference. I really do not buy the argument that an ACM-sponsored conference, even one like SOSP, is any better than one run by USENIX. Arguably USENIX does a far better job at running conferences, since they specialize in it. ACM shunts most of the load of conference organization onto inexperienced academics, with predictable results.


A final word

I can probably get away with signing the Research Without Walls pledge because I no longer rely on service on program committees to further my career. (Indeed, the pledge makes it easier for me to say no when asked to do these things.) Not surprisingly, most of the signatories of the pledge have been from industry. To tell an untenured professor that they should sign the pledge and, say, turn down a chance to serve on the program committee for SOSP, would be a mistake.  But this is not to say that academics can't promote open access in other ways: for example, by always putting PDFs on their website, or preferentially sending work to open access venues.

ObDisclaimer: This is my personal blog. The views expressed here are mine alone and not those of my employer.

Friday, November 4, 2011

Highlights from SenSys 2011

ACM SenSys 2011 just wrapped up this week in Seattle. This is the premier conference in the area of wireless sensor networks, although lately the conference has embraced a bunch of other technologies, including sensing on smartphones and micro-air vehicles. It's an exciting conference and brings together a bunch of different areas.

Rather than a full trip report, I wanted to quickly write up two highlights of the conference: The keynote by Michel Maharbiz on cybernetic beetles (!), and an awesome talk by James Biagioni on using smartphone data to automatically determine bus routes and schedules.

Keynote by Mich Maharbiz - Cyborg beetles: building interfaces between the synthetic and the multicellular

Mich is a professor at Berkeley and works in the interface between biology and engineering. His latest project is to adding a "remote control" circuit to a live insect -- a large beetle -- allowing one to control the flight of the insect. Basically, they stick electrodes into the beetle's brain and muscles, and a little microcontroller mounted on the back of the insect sends pulses to cause the insect to take off, land, and turn. A low-power radio on the microcontroller lets you control the flight using, literally, a Wii Mote.

Oh yes ... this is real.
There has been a lot of interest in the research community in building insect-scale flying robots -- the Harvard RoboBees project is just one example. Mich's work takes a different approach: let nature do the work of building the flyer, but augment it with remote control capabilities. These beetles are large enough that they can carry a 3 gram payload, can fly for kilometers at a time, and live up to 180 days.

Mich's group found that by sending simple electrical pulses to the brain and muscles that they could activate and deactivate the insect's flying mechanism, causing it to take off and land. Controlling turns is a bit more complicated, but by stimulating certain muscles behind the wings they can cause the beetle to turn left or right on command.

They have also started looking at how to tap into the beetle's sensory organs -- essentially implanting electrodes behind the eye and antennae -- so it is possible to take electrical recordings of the neural activity. And they are also looking at implanting a micro fuel cell that generates electricity from the insect's hemolymph -- essentially turning its own internal fuel source into a battery.

Mich and I were actually good friends while undergrads at Cornell together. Back then he was trying to build a six-legged insect inspired walking robot. I am not sure if it ever worked, but it's kind of amazing to run into him some 15 years later and see he's still working on these totally out-there ideas.

EasyTracker: Automatic Transit Tracking, Mapping, and Arrival Time Prediction Using Smartphones
James Biagioni, Tomas Gerlich, Timothy Merrifield, and Jakob Eriksson (University of Illinois at Chicago)


James, a PhD student at UIC, gave a great talk on this project. (One of the best conference talks I have seen in a long time. I found out later that he won the best talk award - well deserved!) The idea is amazing: To use GPS data collected from buses to automatically determine both the route and the schedule of the bus system, and give users real-time indications of expected arrival times for each route. All the transit agency has to do is install a GPS-enabled cellphone in each bus (and not even label which bus it is, or which route it would be taking - routes change all the time anyway). The data is collected and processed centrally to automatically build the tracking system for that agency.

The system starts with unlabeled GPS traces to extract routes and locations / times of stops. They use kernel density estimation with a Gaussian kernel function to “clean up” the raw traces and come up with clean route information. Some clever statistical analysis to throw out bogus route data.

To do stop extraction, they use a point density estimate with thresholding for each GPS location, which results in clusters at points where buses tend to stop. This will produce a bunch of "fake" stops at traffic lights and stop signs - the authors decided to err on the side of too many stops than too few, so they consider this to be an acceptable tradeoff.

To extract the bus schedule, they look at the arrival times of buses on individual days and use k-means clustering to determine the “centroid time” of each stop. This works fine for first stop on route (which should be close to true schedule). For downstream stops this data ends up being to be too noisy, so instead they compute the mean travel time to each downstream stop.

Another challenge is labeling buses: Need to know which bus is coming down the road towards you. For this, they use a history of GPS traces from each bus, and build an HMM to determine which route the bus is currently serving. Since buses change routes all the time, even during the same day, this has to be tracked over time. Finally, for arrival time prediction, they use the previously-computed arrival time between stops to estimate when the bus is likely to arrive.

I really liked this work and the nice combination of techniques used to take some noisy and complex sensor data and distill it into something useful.

Wednesday, November 2, 2011

Software is not science

Very often I see conference paper submissions and PhD thesis proposals that center entirely on a piece of software that someone has built. The abstract often starts out something like this:

We have designed METAFOO, a sensor network simulator that accurately captures hardware level power consumption. METAFOO has a modular design that achieves high flexibility by allowing new component models to be plugged into the simulation. METAFOO also incorporates a Java-based GUI environment for visualizing simulation results, as well as plugins to MATLAB, R, and Gnuplot for analyzing simulation runs....


You get the idea.  More often than not, the paper reads like a technical description of the software, with a hairy block diagram with a bunch of boxes and arrows and a detailed narrative on each piece of the system, what language it's implemented in, how many lines of code, etc. The authors of such papers quite earnestly believe that this is going to make a good conference submission.

While this all might be very interesting to someone who plans to use the software or build on it, this is not the point of a scientific publication or a PhD dissertation. All too often, researchers -- especially those in systems -- seem to confuse the scientific question with the software artifact that they build to explore that question. They get hung up on the idea of building a beautiful piece of software, forgetting that the point was to do science.

When I see a paper submission like this, I will start reading it in the hopes that there is some deeper insight or spark of inspiration in the system design. Usually it's not there. The paper gets so wrapped up in describing the artifact that it forgets to establish the scientific contributions that were made in developing the software. These papers do not tend to get into major conferences, and they do not make a good foundation for a PhD dissertation.

In computer systems research, there are two kinds of software that people build. The first class comprises tools used to support other research. This includes things like testbeds, simulators, and so forth. This is often great, and invaluable, software, but not -- in and of itself -- the point of research itself. Countless researchers have used ns2, Emulab, Planetlab, etc. to do their work and without this investment the community can't move forward. But all too often, students seem to think that building a useful tool equates to doing research. It doesn't.

The second, and more important, kind of software is a working prototype to demonstrate an idea. However, the point of the work is the idea that it embodies, not the software itself. Great examples of this include things like Exokernel and Barrelfish. Those systems demonstrated a beautiful set of concepts (operating system extensibility and message-passing in multicore processors respectively), but nobody actually used those pieces of software for anything more than getting graphs for a paper, or maybe a cute demo at a conference.

There are rare exceptions of "research" software that took on a life beyond the prototype phase. TinyOS and Click are two good examples. But this is the exception, not the rule. Generally I would not advise grad students to spend a lot of energy on "marketing" their research prototype. Chances are nobody will use your code anyway, and time you spend turning a prototype into a real system is time better spent pushing the envelope and writing great papers. If your software doesn't happen to embody any radical new ideas, and instead you are spending your time adding a GUI or writing documentation, you're probably spending your time on the wrong thing.

So, how do you write a paper about a piece of software? Three recommendations:

  1. Put the scientific contributions first. Make the paper about the key contributions you are making to the field. Spell them out clearly, on the first page of the paper. Make sure they are really core scientific contributions, not something like "our first contribution is that we built METAFOO." A better example would be, "We demonstrate that by a careful decomposition of cycle-accurate simulation logic from power modeling, we can achieve far greater accuracy while scaling to large numbers of nodes." Your software will be the vehicle you use to prove this point.
  2. Decouple the new ideas from the software itself. Someone should be able to come along and take your great ideas and apply them in another software system or to a completely different problem entirely. The key idea you are promoting should not be linked to whatever hairy code you had to write to show that the idea works in practice. Taking Click as an example, its modular design has been recycled in many, many other software systems (including my own PhD thesis).
  3. Think about who will care about this paper 20 years from now. If your paper is all about some minor feature that you're adding to some codebase, chances are nobody will. Try to bring out what is enduring about your work, and focus the paper on that.




Monday, September 26, 2011

Do we need to reboot the CS publications process?

My friend and colleague Dan Wallach has an interesting piece in this month's Communications of the ACM on Rebooting the CS Publication Process. This is a topic I've spent a lot of time thinking about (and ranting about) the last few years and thought I should weigh in. The TL;DR for Dan's proposal is something like arXiv for CS -- all papers (published or not) are sent to a centralized CSPub repository, where they can be commented on, cited, and reviewed. Submissions to conferences would simply be tagged as such in the CSPub archive, and "journals" would simply consist of tagged collections of papers.

I really like the idea of leveraging Web 2.0 technology to fix the (broken) publication process for CS papers. It seems insane to me that the CS community relies on 18th-century mechanisms for peer review, that clearly do not scale, prevent good work from being seen by larger audiences, and create more work for program chairs having to deal with deadlines, running a reviewing system, and screening for plagiarized content.

Still, I'm concerned that Dan's proposal does not go far enough. Mostly his proposal addresses the distribution issue -- how papers are submitted and archived. It does not fix the problem of authors submitting incremental work. If anything, it could make the problem worse, since I could just spam CSPub with whatever random crap I was working on and hope that (by dint of my fame and amazing good looks) it would get voted up by the plebeian CSPub readership irrespective of its technical merit. (I call this the Digg syndrome.) In the CSPub model, there is nothing to distinguish, say, a first year PhD student's vote from that of a Turing Award winner, so making wild claims and writing goofy position papers is just as likely to get you attention as doing the hard and less glamorous work of real science.

Nor does Dan's proposal appear to reduce reviewing load for conference program committees. Being a cynic, it would seem that if submitting a paper to SOSP simply consisted of setting a flag on my (existing) CSPub paper entry, then you would see an immediate deluge of submissions to major conferences. Authors would no longer have to jump through hoops to submit their papers through an arcane reviewing system and run the gauntlet of cranky program chairs who love nothing more than rejecting papers due to trivial formatting violations. Imagine having your work judged on technical content, rather than font size! I am not sure our community is ready for this.

Then there is the matter of attaining critical mass. arXiV already hosts the Computing Research Repository, which has many of the features that Dan is calling for in his proposal. The missing piece is actual users. I have never visited the site, and don't know anyone -- at least in the systems community -- who uses it. (Proof: There are a grand total of six papers in the "operating systems" category on CORR.) For better or worse, we poor systems researchers are programmed to get our publications from a small set of conferences. The best way to get CSPub to have wider adoption would be to encourage conferences to use it as their main reviewing and distribution mechanism, but I am dubious that ACM or USENIX would allow such a thing, as it takes a lot of control away from them.

The final question is that of anonymity. This is itself a hotly debated topic, but CSPub would seem to require authors to divulge authorship on submission, making it impossible to do double-blind reviewing. I tend to believe that blind reviewing is a good thing, especially for researchers at less-well-known institutions who can't lean on a big name like MIT or Stanford on the byline.

The fact is that we cling to our publication model because we perceive -- rightly or wrongly -- that there is value in the exclusivity of having a paper accepted by a conference. There is value for authors (being one of 20 papers or so in SOSP in a given year is a big deal, especially for grad students on the job market); value for readers (the papers in such a competitive conference have been hand-picked by the greatest minds in the field for your reading pleasure, saving you the trouble of slogging through all of the other crap that got submitted that year); and value for program committee members (you get to be one of the aforementioned greatest minds on the PC in a given year, and wear a fancy ribbon on your name badge when you are at the conference so everybody knows it).

Yes, it's more work for PC members, but not many people turn down an opportunity to be on the OSDI or SOSP program committee because of the workload, and there are certainly enough good people in the community who are willing to do the job. And nothing is stopping you from posting your preprint to arXiv today. But act fast -- yours could be the seventh systems paper up there!