Monday, September 26, 2011

Do we need to reboot the CS publications process?

My friend and colleague Dan Wallach has an interesting piece in this month's Communications of the ACM on Rebooting the CS Publication Process. This is a topic I've spent a lot of time thinking about (and ranting about) the last few years and thought I should weigh in. The TL;DR for Dan's proposal is something like arXiv for CS -- all papers (published or not) are sent to a centralized CSPub repository, where they can be commented on, cited, and reviewed. Submissions to conferences would simply be tagged as such in the CSPub archive, and "journals" would simply consist of tagged collections of papers.

I really like the idea of leveraging Web 2.0 technology to fix the (broken) publication process for CS papers. It seems insane to me that the CS community relies on 18th-century mechanisms for peer review that clearly do not scale, prevent good work from being seen by larger audiences, and create more work for program chairs, who have to deal with deadlines, run a reviewing system, and screen for plagiarized content.

Still, I'm concerned that Dan's proposal does not go far enough. Mostly his proposal addresses the distribution issue -- how papers are submitted and archived. It does not fix the problem of authors submitting incremental work. If anything, it could make the problem worse, since I could just spam CSPub with whatever random crap I was working on and hope that (by dint of my fame and amazing good looks) it would get voted up by the plebeian CSPub readership irrespective of its technical merit. (I call this the Digg syndrome.) In the CSPub model, there is nothing to distinguish, say, a first year PhD student's vote from that of a Turing Award winner, so making wild claims and writing goofy position papers is just as likely to get you attention as doing the hard and less glamorous work of real science.

Nor does Dan's proposal appear to reduce the reviewing load for conference program committees. Cynic that I am, I suspect that if submitting a paper to SOSP simply consisted of setting a flag on my (existing) CSPub paper entry, then you would see an immediate deluge of submissions to major conferences. Authors would no longer have to jump through hoops to submit their papers through an arcane reviewing system and run the gauntlet of cranky program chairs who love nothing more than rejecting papers due to trivial formatting violations. Imagine having your work judged on technical content, rather than font size! I am not sure our community is ready for this.

Then there is the matter of attaining critical mass. arXiv already hosts the Computing Research Repository (CoRR), which has many of the features that Dan is calling for in his proposal. The missing piece is actual users. I have never visited the site, and don't know anyone -- at least in the systems community -- who uses it. (Proof: there are a grand total of six papers in the "operating systems" category on CoRR.) For better or worse, we poor systems researchers are programmed to get our publications from a small set of conferences. The best way to get CSPub wider adoption would be to encourage conferences to use it as their main reviewing and distribution mechanism, but I am dubious that ACM or USENIX would allow such a thing, as it takes a lot of control away from them.

The final question is that of anonymity. This is itself a hotly debated topic, but CSPub would seem to require authors to divulge authorship on submission, making it impossible to do double-blind reviewing. I tend to believe that blind reviewing is a good thing, especially for researchers at less-well-known institutions who can't lean on a big name like MIT or Stanford on the byline.

The fact is that we cling to our publication model because we perceive -- rightly or wrongly -- that there is value in the exclusivity of having a paper accepted by a conference. There is value for authors (being one of 20 papers or so in SOSP in a given year is a big deal, especially for grad students on the job market); value for readers (the papers in such a competitive conference have been hand-picked by the greatest minds in the field for your reading pleasure, saving you the trouble of slogging through all of the other crap that got submitted that year); and value for program committee members (you get to be one of the aforementioned greatest minds on the PC in a given year, and wear a fancy ribbon on your name badge when you are at the conference so everybody knows it).

Yes, it's more work for PC members, but not many people turn down an opportunity to be on the OSDI or SOSP program committee because of the workload, and there are certainly enough good people in the community who are willing to do the job. And nothing is stopping you from posting your preprint to arXiv today. But act fast -- yours could be the seventh systems paper up there!

Saturday, September 10, 2011

Programming != Computer Science

I recently read this very interesting article on ways to "level up" as a software developer. Reading this article brought home something that has been nagging me for a while since joining Google: that there is a huge skill and cultural gap between "developers" and "Computer Scientists." Jason's advice for leveling up in the aforementioned article is very practical: write code in assembly, write a mobile app, complete the exercises in SICP, that sort of thing. This is good advice, but certainly not all that I would want people on my team spending their time on in order to become true technical leaders. Whether you can sling JavaScript all day or know the ins and outs of C++ templates often has little bearing on whether you're able to grasp the bigger, more abstract, less well-defined problems and make headway on them.

For that you need a very different set of skills, which is where I start to draw the line between a Computer Scientist and a developer. Personally, I consider myself a Computer Scientist first and a software engineer second. I am probably not the right guy to crank out thousands of lines of Java on a tight deadline, and I'll be damned if I fully grok C++'s inheritance rules. But this isn't what Google hired me to do (I hope!) and I lean heavily on some amazing programmers who do understand these things better than I do.

Note that I am not defining a Computer Scientist as someone with a PhD -- although it helps. Doing a PhD trains you to think critically, study the literature, design effective experiments, and identify unsolved problems. By no means do you need a PhD to do these things (and not everyone with a PhD can do them, either).

A few observations on the difference between Computer Scientists and Programmers...

Think Big vs. Get 'er Done 

One thing that drove me a little nuts when I first started at Google was how quickly things move, and how often solutions are put in place that are just good enough to move ahead, even if they aren't fully general or completely thought through. Coming from an academic background, I am used to spending years pounding away at a single problem until you have a single, beautiful, general solution that can stand up to a tremendous amount of scrutiny (mostly in the peer review process). Not so in industry -- we gotta move fast, so often it's necessary to solve a problem well enough to get onto the next thing. Some of my colleagues at Google have no doubt been driven batty by my insistence on getting something "right" when they would rather just (and in fact need to) plow ahead.

Another aspect of this is that programmers are often satisfied with something that solves a concrete, well-defined problem and passes the unit tests. What they sometimes don't ask is "what can my approach not do?" They don't always do a thorough job of measurement and analysis: they test something, it seems to work on a few cases, they're terribly busy, so they go ahead and check it in and move on to the next thing. In academia we can spend months doing performance evaluation just to get some pretty graphs that show that a given technical approach works well in a broad range of cases.
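To make the "passes the unit tests" trap concrete, here is a minimal, entirely hypothetical sketch (not from any real codebase): a small helper whose happy-path tests all pass, but which quietly misbehaves on the edge cases nobody thought to ask about.

```python
# Hypothetical example: a helper that passes its happy-path unit tests
# but was never probed with "what can my approach not do?"
def moving_average(values, window):
    """Average of the last `window` items of `values`."""
    return sum(values[-window:]) / window

# The unit tests that "pass", so the code gets checked in:
assert moving_average([2, 4, 6], 3) == 4.0
assert moving_average([1, 2, 3, 4], 2) == 3.5

# Questions a more skeptical analysis would have asked:
#   moving_average([], 3)     -> returns 0.0 (silently wrong, not an error)
#   moving_average([1, 2], 3) -> sums 2 items but divides by 3, giving 1.0
#                                instead of the true average 1.5
```

The point is not that the tests are bad; it's that passing them says nothing about the inputs they never exercise.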

Throwaway prototype vs. robust solution

On the other hand, one thing that Computer Scientists are not often good at is developing production-quality code. I know I am still working at it. The joke is that most academics write code so flimsy that it collapses into a pile of bits as soon as the paper deadline passes. Developing code that is truly robust, scales well, is easy to maintain, well-documented, well-tested, and uses all of the accepted best practices is not something academics are trained to do. I enjoy working with hardcore software engineers at Google who have no problem pointing out the totally obvious mistakes in my own code, or suggesting a cleaner, more elegant approach to some ass-backwards page of code I submitted for review. So there is a lot that Computer Scientists can learn about writing "real" software rather than prototypes.

My team at Google has a good mix of folks from both development and research backgrounds, and I think that's essential to striking the right balance between rapid, efficient software development and pushing the envelope of what is possible.