Monday, November 7, 2011

Research without walls

I recently signed the Research Without Walls pledge, which says that I will not do any peer review work for conferences, journals, or other scientific venues that do not make the results available for free via the Web. Like many scientists, I commit hundreds of hours a year to serving on program committees and reviewing journal papers, but the result of that (volunteer) work is essentially that the research results get locked behind a copyright license that is inconsistent with the way in which scientists actually disseminate their results -- for free, via the Web.

I believe that there is absolutely no reason for research results, especially those supported by public funding, not to be made open to the entire world. It's time for the computer science research community to move in this direction. Of course, this is going to mean a big change in the role of the professional societies, such as ACM and IEEE. It's time we made that change, as painful as it might be.

What is open access?

The issue of "open access research" often gets confused with questions such as where the papers are hosted, who owns the copyright, and whether authors are allowed to post their own papers on their website. In most cases, copyright in research publications is not held by the authors, but rather by the professional societies that organize a conference or run a journal. For example, ACM and IEEE typically require authors to assign copyright to them, although they might grant the author a license to post their own research papers on their website. However, allowing authors to post papers on the Web is not the same as open access. It is an extremely limited license: posting papers on the Web does not give other scientists or students the right to share or archive those papers, or allow anyone to use them for any purpose other than downloading them for personal use. It is not unlike going to the library and borrowing a book; you still have to return it later, and you can't make copies for others.

With rare exception, every paper I have published is available for download on my website. In most cases, I have a license to do this; in others, I am probably in violation of copyright for doing so. The idea that I might get a cease-and-desist letter one day asking me to take down my own scientific papers bothers me to no end. I worked hard on those papers, and in most cases, spent hundreds of thousands of dollars of public funding to undertake the research that went into each of them.

For most of these publications, I even paid hundreds of dollars to the professional societies -- for membership fees and conference registrations for myself and my students -- to present the work at the associated conference. And yet, I don't own copyright in most of those works, and the main beneficiaries of all of this work are organizations like the ACM. It seems to me that these results should be open for everyone to benefit from, since, well, "we" (meaning, the taxpayers) paid for them.

ACM's Author-izer Service

Recently, the ACM announced a new service called the "Author-izer" (whoever came up with this name will be first against the wall when the revolution comes), that allows authors to generate free links to their publications hosted on the ACM Digital Library. This is not open access, either: this is actually a way for ACM to discourage the spread of "rogue posting" of PDF files and monetize access to the content down the road. For example, those free links will stop working when the website hosting them moves (e.g., when a student graduates). Essentially, ACM wants to control all access to "its" research library, and for good reason: it brings in a lot of revenue.

USENIX's open access policy

USENIX has a much saner policy. Back in 2008, USENIX announced that all of their conference proceedings would be open access, and indeed you can download PDFs of all USENIX papers from the corresponding conference website (see, for example, the proceedings from HotCloud'11).

USENIX does not ask authors to assign copyright to them. Instead, for one year from the publication date, USENIX gets an exclusive license to publish the work (both in print and electronic form), with the usual license granted back to the author to post copies on their website. After the one-year exclusivity period, USENIX retains a non-exclusive license to distribute the work forever. This is a good policy, though in my opinion it does not go far enough: USENIX does not require authors to release their work under an open access license. USENIX is kind enough to post PDFs for free on the Web, but tomorrow, USENIX could reverse this decision and put all of those papers behind a paywall, or take them down entirely. (No, I don't think this is going to happen, but you never know.)

University open access initiatives

Another way to fight back is for your home institution to require all of your work be made open. Harvard was one of the first major universities to do this. This ambitious effort, spearheaded by my colleague Stuart Shieber, required all Harvard affiliates to submit copies of their published work to the open-access Harvard DASH archive. While in theory this sounds great, there are several problems with this in practice. First, it requires individual scientists to do the legwork of securing the rights and submitting the work to the archive. This is a huge pain and most folks don't bother. Second, it requires that scientists attach a Harvard-supplied "rider" to the copyright license (e.g., from the ACM or IEEE) allowing Harvard to maintain an open-access copy in the DASH repository. Many, many publishers have pushed back on this. Harvard's response was to allow its affiliates to get an (automatic) waiver of the open-access requirement. Well, as soon as word got out that Harvard was granting these waivers, the publishers started refusing to accept the riders wholesale, claiming that the scientist could just request a waiver. So the publishers tend to win.

Creative Commons for research publications

The only way to ensure that research is de jure open access, rather than merely de facto, is by baking the open access requirement into the copyright license for the work. This is very much in the same spirit as the GPL for software licensing. What I really want is for all research to be published under something like a Creative Commons Attribution 3.0 Unported license, allowing others to share, remix, and make commercial use of the work as long as attribution is given. This kind of license would prevent professional organizations from locking down research results, and give maximum flexibility for others to make use of the research, while retaining the conventional expectations of attribution. The "remix" clause might seem a little problematic, given that peer review expects original results, but the attribution requirement would not allow someone to submit work that is not their own and claim authorship. And there are many ways in which research can be legitimately remixed: incorporated into a talk, class notes, or collection, for example.

What happens to the publishers?

Traditional scientific publishers, like Elsevier, go out of business. I don't have a problem with that. One can make a strong argument that traditional scientific publishers have fairly limited value in today's world. It used to be that scientists needed publishers to disseminate their work; this has not been true for more than a decade.

Professional organizations, like ACM and IEEE, will need to radically change what they do if they want to stay alive. These organizations do many things other than run conferences and journals. Unfortunately, a substantial amount of their operating budget comes from controlling access to scientific literature. Open access will drastically change that. Personally, I'd rather be a member of a leaner, more focused professional society that can focus its resources on education and policymaking than one supporting a gazillion "Special Interest Groups" and journals that nobody reads.

Seems to me that USENIX strikes the right balance: They focus on running conferences. Yes, you pay through the nose to attend these events, though it's not any more expensive than a typical ACM or IEEE conference. I really do not buy the argument that an ACM-sponsored conference, even one like SOSP, is any better than one run by USENIX. Arguably USENIX does a far better job at running conferences, since they specialize in it. ACM shunts most of the load of conference organization onto inexperienced academics, with predictable results.

A final word

I can probably get away with signing the Research Without Walls pledge because I no longer rely on service on program committees to further my career. (Indeed, the pledge makes it easier for me to say no when asked to do these things.) Not surprisingly, most of the signatories of the pledge have been from industry. To tell an untenured professor that they should sign the pledge and, say, turn down a chance to serve on the program committee for SOSP, would be a mistake.  But this is not to say that academics can't promote open access in other ways: for example, by always putting PDFs on their website, or preferentially sending work to open access venues.

ObDisclaimer: This is my personal blog. The views expressed here are mine alone and not those of my employer.

Friday, November 4, 2011

Highlights from SenSys 2011

ACM SenSys 2011 just wrapped up this week in Seattle. This is the premier conference in the area of wireless sensor networks, although lately the conference has embraced a bunch of other technologies, including sensing on smartphones and micro-air vehicles. It's an exciting conference and brings together a bunch of different areas.

Rather than a full trip report, I wanted to quickly write up two highlights of the conference: The keynote by Michel Maharbiz on cybernetic beetles (!), and an awesome talk by James Biagioni on using smartphone data to automatically determine bus routes and schedules.

Keynote by Mich Maharbiz - Cyborg beetles: building interfaces between the synthetic and the multicellular

Mich is a professor at Berkeley and works at the interface between biology and engineering. His latest project is adding a "remote control" circuit to a live insect -- a large beetle -- allowing one to control the flight of the insect. Basically, they stick electrodes into the beetle's brain and muscles, and a little microcontroller mounted on the back of the insect sends pulses to cause the insect to take off, land, and turn. A low-power radio on the microcontroller lets you control the flight using, literally, a Wii Mote.

Oh yes ... this is real.

There has been a lot of interest in the research community in building insect-scale flying robots -- the Harvard RoboBees project is just one example. Mich's work takes a different approach: let nature do the work of building the flyer, but augment it with remote control capabilities. These beetles are large enough that they can carry a 3 gram payload, can fly for kilometers at a time, and live up to 180 days.

Mich's group found that by sending simple electrical pulses to the brain and muscles, they could activate and deactivate the insect's flying mechanism, causing it to take off and land. Controlling turns is a bit more complicated, but by stimulating certain muscles behind the wings they can cause the beetle to turn left or right on command.

They have also started looking at how to tap into the beetle's sensory organs -- essentially implanting electrodes behind the eye and antennae -- so it is possible to take electrical recordings of the neural activity. And they are also looking at implanting a micro fuel cell that generates electricity from the insect's hemolymph -- essentially turning its own internal fuel source into a battery.

Mich and I were actually good friends while undergrads at Cornell together. Back then he was trying to build a six-legged, insect-inspired walking robot. I am not sure if it ever worked, but it's kind of amazing to run into him some 15 years later and see he's still working on these totally out-there ideas.

EasyTracker: Automatic Transit Tracking, Mapping, and Arrival Time Prediction Using Smartphones
James Biagioni, Tomas Gerlich, Timothy Merrifield, and Jakob Eriksson (University of Illinois at Chicago)

James, a PhD student at UIC, gave a great talk on this project. (One of the best conference talks I have seen in a long time. I found out later that he won the best talk award - well deserved!) The idea is amazing: To use GPS data collected from buses to automatically determine both the route and the schedule of the bus system, and give users real-time indications of expected arrival times for each route. All the transit agency has to do is install a GPS-enabled cellphone in each bus (and not even label which bus it is, or which route it would be taking - routes change all the time anyway). The data is collected and processed centrally to automatically build the tracking system for that agency.

The system starts with unlabeled GPS traces and extracts from them the routes and the locations and times of stops. They use kernel density estimation with a Gaussian kernel function to “clean up” the raw traces and come up with clean route information. Some clever statistical analysis throws out bogus route data.
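To give a feel for the kernel density step, here is a minimal sketch of a 2-D Gaussian kernel density estimate over GPS fixes. It is not the EasyTracker implementation: the function name, the bandwidth value, and the toy trace (in local metres) are all assumptions made for illustration. The point is only that locations the buses visit repeatedly end up with high density, while a stray fix does not.

```python
import math

def gaussian_kde_2d(points, query, bandwidth):
    """Estimate the density at `query` from 2-D `points` with a Gaussian kernel."""
    qx, qy = query
    # Normalising constant of an isotropic 2-D Gaussian, averaged over all points.
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        d2 = (px - qx) ** 2 + (py - qy) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total

# Hypothetical trace: a bus driving along y = 0, plus one bogus fix far off the road.
trace = [(float(x), 0.0) for x in range(0, 100, 5)] + [(50.0, 80.0)]

on_route = gaussian_kde_2d(trace, (50.0, 0.0), bandwidth=10.0)
off_route = gaussian_kde_2d(trace, (50.0, 80.0), bandwidth=10.0)
```

Here `on_route` comes out larger than `off_route`, so keeping only the high-density locations would retain the road and drop the outlier.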

To do stop extraction, they use a point density estimate with thresholding for each GPS location, which results in clusters at points where buses tend to stop. This will produce a bunch of "fake" stops at traffic lights and stop signs - the authors decided to err on the side of too many stops rather than too few, so they consider this to be an acceptable tradeoff.
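The density-with-thresholding idea can be sketched roughly as follows. This is an illustration under my own assumptions, not the authors' method: the neighbour-count density, the greedy grouping, and the `radius` and `min_neighbors` values are all hypothetical. A bus that dwells at a stop leaves many fixes close together, so those fixes pass the threshold while fixes recorded in motion do not.

```python
import math

def extract_stops(fixes, radius, min_neighbors):
    """Return centroids of dense clusters of GPS fixes (candidate stops)."""
    # A fix is "dense" if enough fixes (itself included) lie within `radius`.
    dense = [
        p for p in fixes
        if sum(1 for q in fixes if math.dist(p, q) <= radius) >= min_neighbors
    ]
    # Greedily group dense fixes that lie within `radius` of an existing group.
    clusters = []
    for p in dense:
        for cluster in clusters:
            if any(math.dist(p, q) <= radius for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return [
        (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
        for c in clusters
    ]

# Hypothetical data in local metres: one fix every 20 m while moving,
# plus extra fixes recorded while the bus dwells near x = 100 and x = 300.
moving = [(float(x), 0.0) for x in range(0, 400, 20)]
dwell_a = [(100.0 + d, 1.0) for d in (-2.0, -1.0, 0.0, 1.0, 2.0)]
dwell_b = [(300.0 + d, 1.0) for d in (-2.0, -1.0, 0.0, 1.0, 2.0)]

stops = extract_stops(moving + dwell_a + dwell_b, radius=5.0, min_neighbors=4)
```

A red light would produce exactly the same kind of cluster, which is why a scheme like this over-reports stops.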

To extract the bus schedule, they look at the arrival times of buses on individual days and use k-means clustering to determine the “centroid time” of each stop. This works fine for the first stop on a route (which should be close to the true schedule). For downstream stops this data ends up being too noisy, so instead they compute the mean travel time to each downstream stop.
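As a rough sketch of the schedule step, the code below runs one-dimensional k-means over first-stop arrival times (in minutes after midnight) and then shifts the resulting centroids by a mean travel time to get a downstream schedule. Everything concrete here is an assumption for illustration: the sample times, the choice of `k`, the quantile seeding, and the helper names. How the real system picks the number of clusters is not something I know.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm on scalars; returns the sorted centroids."""
    ordered = sorted(values)
    # Seed at evenly spaced quantiles (an arbitrary but deterministic choice).
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous centroid.
        updated = [sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def mean_travel_time(departures, arrivals):
    """Average time between leaving the first stop and reaching a downstream stop."""
    return sum(a - d for d, a in zip(departures, arrivals)) / len(departures)

# Hypothetical arrivals at the first stop over three days (roughly 8:00, 9:00, 10:00).
first_stop_arrivals = [480, 482, 479, 540, 541, 538, 600, 603, 599]
schedule = kmeans_1d(first_stop_arrivals, k=3)

# Hypothetical matched trips: first-stop departure and downstream arrival.
travel = mean_travel_time([480, 541, 600], [492, 552, 613])
downstream_schedule = [t + travel for t in schedule]
```

Each centroid in `schedule` is one scheduled departure, and the downstream schedule is just those centroids offset by the average travel time rather than a separate, noisier clustering.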

Another challenge is labeling buses: you need to know which bus is coming down the road towards you. For this, they use a history of GPS traces from each bus, and build an HMM to determine which route the bus is currently serving. Since buses change routes all the time, even during the same day, this has to be tracked over time. Finally, for arrival time prediction, they use the previously computed travel times between stops to estimate when the bus is likely to arrive.
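The route-labeling idea can be illustrated with a tiny discrete HMM decoded by the Viterbi algorithm. This is a toy under stated assumptions, not the model from the paper: the two routes, the road segments used as observations, and all of the probabilities are made up. The hidden state is the route a bus is serving, and each observation is the road segment its latest GPS fix was matched to.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete HMM, computed in log space."""
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []
    for obs in observations[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            scores[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            pointers[s] = best
        back.append(pointers)
    # Walk the back-pointers from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: a bus usually stays on its route, and each route
# mostly (but not exclusively) produces fixes on its own road segment.
routes = ["route_8", "route_12"]
start_p = {"route_8": 0.5, "route_12": 0.5}
trans_p = {
    "route_8": {"route_8": 0.9, "route_12": 0.1},
    "route_12": {"route_8": 0.1, "route_12": 0.9},
}
emit_p = {
    "route_8": {"seg_a": 0.8, "seg_b": 0.2},
    "route_12": {"seg_a": 0.2, "seg_b": 0.8},
}

fixes = ["seg_a", "seg_a", "seg_a", "seg_b", "seg_b", "seg_b"]
labels = viterbi(fixes, routes, start_p, trans_p, emit_p)
```

With these numbers the decoded labels switch from `route_8` to `route_12` halfway through the trace, which is the kind of mid-day route change the tracker has to follow.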

I really liked this work and the nice combination of techniques used to take some noisy and complex sensor data and distill it into something useful.

Wednesday, November 2, 2011

Software is not science

Very often I see conference paper submissions and PhD thesis proposals that center entirely on a piece of software that someone has built. The abstract often starts out something like this:

We have designed METAFOO, a sensor network simulator that accurately captures hardware level power consumption. METAFOO has a modular design that achieves high flexibility by allowing new component models to be plugged into the simulation. METAFOO also incorporates a Java-based GUI environment for visualizing simulation results, as well as plugins to MATLAB, R, and Gnuplot for analyzing simulation runs....

You get the idea.  More often than not, the paper reads like a technical description of the software, with a hairy block diagram with a bunch of boxes and arrows and a detailed narrative on each piece of the system, what language it's implemented in, how many lines of code, etc. The authors of such papers quite earnestly believe that this is going to make a good conference submission.

While this all might be very interesting to someone who plans to use the software or build on it, this is not the point of a scientific publication or a PhD dissertation. All too often, researchers -- especially those in systems -- seem to confuse the scientific question with the software artifact that they build to explore that question. They get hung up on the idea of building a beautiful piece of software, forgetting that the point was to do science.

When I see a paper submission like this, I will start reading it in the hopes that there is some deeper insight or spark of inspiration in the system design. Usually it's not there. The paper gets so wrapped up in describing the artifact that it forgets to establish the scientific contributions that were made in developing the software. These papers do not tend to get into major conferences, and they do not make a good foundation for a PhD dissertation.

In computer systems research, there are two kinds of software that people build. The first class comprises tools used to support other research. This includes things like testbeds, simulators, and so forth. This is often great, and invaluable, software, but not -- in and of itself -- the point of research. Countless researchers have used ns2, Emulab, PlanetLab, etc. to do their work and without this investment the community can't move forward. But all too often, students seem to think that building a useful tool equates to doing research. It doesn't.

The second, and more important, kind of software is a working prototype to demonstrate an idea. However, the point of the work is the idea that it embodies, not the software itself. Great examples of this include things like Exokernel and Barrelfish. Those systems demonstrated a beautiful set of concepts (operating system extensibility and message-passing in multicore processors respectively), but nobody actually used those pieces of software for anything more than getting graphs for a paper, or maybe a cute demo at a conference.

There are rare exceptions of "research" software that took on a life beyond the prototype phase. TinyOS and Click are two good examples. But this is the exception, not the rule. Generally I would not advise grad students to spend a lot of energy on "marketing" their research prototype. Chances are nobody will use your code anyway, and time you spend turning a prototype into a real system is time better spent pushing the envelope and writing great papers. If your software doesn't happen to embody any radical new ideas, and instead you are spending your time adding a GUI or writing documentation, you're probably spending your time on the wrong thing.

So, how do you write a paper about a piece of software? Three recommendations:

  1. Put the scientific contributions first. Make the paper about the key contributions you are making to the field. Spell them out clearly, on the first page of the paper. Make sure they are really core scientific contributions, not something like "our first contribution is that we built METAFOO." A better example would be, "We demonstrate that by a careful decomposition of cycle-accurate simulation logic from power modeling, we can achieve far greater accuracy while scaling to large numbers of nodes." Your software will be the vehicle you use to prove this point.
  2. Decouple the new ideas from the software itself. Someone should be able to come along and take your great ideas and apply them in another software system or to a completely different problem entirely. The key idea you are promoting should not be linked to whatever hairy code you had to write to show that the idea works in practice. Taking Click as an example, its modular design has been recycled in many, many other software systems (including my own PhD thesis).
  3. Think about who will care about this paper 20 years from now. If your paper is all about some minor feature that you're adding to some codebase, chances are nobody will. Try to bring out what is enduring about your work, and focus the paper on that.
