Tuesday, January 12, 2010

The CS Grad Student Lab Manual

Should CS grad students be required to receive formal training in lab technique?

In most scientific disciplines, a great deal of attention is paid to proper experimental design, data collection, and analysis. A grad student in chemistry or biology learns to adhere to a fairly rigid set of procedures for running an experiment (and documenting the procedure). In CS, we largely assume that grad students (not to mention professors and undergrads) somehow magically know how to do these things properly.

When I was a grad student, I more or less figured out how to run benchmarks, collect and record data, document experimental setup, analyze the raw data, and produce meaningful figures on my own. Sure, I had some mentorship from the more senior grad students in my group (and no small amount of pushback from my advisor when a graph would not make sense to him). But in reality, there was very little oversight in terms of how I ran my experiments and collected results. I logged benchmark output to various homebrew ASCII file formats and cobbled together Perl scripts to churn the output. This evolved considerably over time, adding support for gzipped log files (when they got too big), automatic generation of gnuplot scripts to graph the results, and elaborate use of Makefiles to automate the benchmark runs. Needless to say, I am absolutely certain that all of my scripts were free of bugs and that the results published in my papers are 100% accurate.
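
As a rough illustration of that kind of homebrew pipeline (written here in Python rather than the Perl I actually used; the logs/*.log.gz file names and the two-column log format are invented for the example), a minimal script to crunch gzipped benchmark logs into a gnuplot-ready data file might look something like this:

    #!/usr/bin/env python
    # Hypothetical example: each gzipped log contains lines of the form
    # "<trial> <latency_usec>"; emit one summary row per log file.
    import glob
    import gzip
    import statistics

    def summarize(logfile):
        """Return (mean, stdev) of the latency column in one gzipped log."""
        latencies = []
        with gzip.open(logfile, "rt") as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2:
                    latencies.append(float(fields[1]))
        return statistics.mean(latencies), statistics.stdev(latencies)

    if __name__ == "__main__":
        # results.dat can then be plotted with, e.g.,
        #   plot "results.dat" using 1:2:3 with yerrorbars
        with open("results.dat", "w") as out:
            for i, logfile in enumerate(sorted(glob.glob("logs/*.log.gz"))):
                mean, stdev = summarize(logfile)
                out.write("%d %f %f  # %s\n" % (i, mean, stdev, logfile))

A Makefile rule that regenerates results.dat and the plot whenever new log files appear is roughly the kind of automation described above.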

In my experience, grad students tend to come up with their own procedures, and few of them are directly verifiable. Sometimes I find myself digging into scripts written by one of my students to understand how the statistics were generated. As an extreme example, at one point Sean Rhea (whom I went to grad school with) logged all of his benchmark results directly to a MySQL database and used a set of complex SQL queries to crunch the numbers. For our volcano sensor network deployments, we opted to log everything using XML and wrote some fairly hairy Python code to parse the logs and generate statistics. The advantage of XML is that the data is self-describing and can be manipulated programmatically (your code walks the document tree). It also decouples the logic of reading and writing the logs from the code that manipulates the data. More recently, students in my group have made heavy use of Python pickle files for data logging, which have the advantage of being absolutely trivial to use, but the disadvantage that changes to the Python data structures can make old log files unusable.
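
To make the trade-off concrete, here is a small hypothetical sketch (the record fields and file names are invented) contrasting the two logging styles just mentioned:

    # Hypothetical comparison of pickle-based and XML-based logging.
    import pickle
    import xml.etree.ElementTree as ET

    record = {"run": 1, "latency_usec": 612.4, "host": "node17"}

    # Pickle: trivial to write and read back, but the on-disk format is
    # tied to the Python data structures; change or rename a class used
    # in the record and old log files may no longer load.
    with open("run1.pkl", "wb") as f:
        pickle.dump(record, f)

    # XML: self-describing; the reader walks the document tree and is
    # decoupled from whatever code produced the log.
    root = ET.Element("run", id=str(record["run"]))
    ET.SubElement(root, "latency", unit="usec").text = str(record["latency_usec"])
    ET.SubElement(root, "host").text = record["host"]
    ET.ElementTree(root).write("run1.xml")

    # Reading the XML back needs no knowledge of the writer's data structures.
    doc = ET.parse("run1.xml")
    print(doc.getroot().get("id"), doc.getroot().find("latency").text)

The usual trade-off: a self-describing format buys the ability to reread years-old logs, at the cost of more parsing code up front.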

Of course, all of these data management approaches assume sound experimental technique. Obvious things include running benchmarks on an "unloaded" machine, doing multiple runs to eliminate measurement error, and using high-resolution timers (such as CPU cycle counters) when possible. However, some of these things are more subtle. I'll never forget my first benchmarking experience as an undergrad at Cornell -- measuring round-trip latency of the U-Net interface that I implemented on top of Fast Ethernet. My initial set of runs said that the RTT was around 6 microseconds -- below the fabled Culler Constant! -- beating the pants off of the previous implementation over ATM. I was ecstatic. Turns out my benchmark code had a small bug and was not doing round-trip ping-pongs but rather having both ends transmit simultaneously, thereby measuring the packet transmission overhead only. Duh. Fortunately, the results were too good to be true, and we caught the bug well before going to press, but what if we hadn't noticed?
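
For illustration only (this is not the original U-Net benchmark; the UDP echo server address, warm-up count, and run count below are all assumptions), a correctly structured round-trip ping-pong measurement stops the timer only after the reply arrives:

    # Sketch of a ping-pong RTT benchmark against a hypothetical UDP echo
    # server at HOST:PORT, with warm-up runs discarded and summary
    # statistics reported over many iterations.
    import socket
    import statistics
    import time

    HOST, PORT = "127.0.0.1", 9999   # hypothetical echo server
    WARMUP, RUNS = 100, 1000

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)

    samples = []
    for i in range(WARMUP + RUNS):
        start = time.perf_counter()      # highest-resolution timer available
        sock.sendto(b"ping", (HOST, PORT))
        data, _ = sock.recvfrom(1024)    # block until the echo comes back
        elapsed = time.perf_counter() - start
        if i >= WARMUP:                  # discard warm-up iterations
            samples.append(elapsed * 1e6)  # microseconds

    print("median RTT: %.1f us, stdev: %.1f us"
          % (statistics.median(samples), statistics.stdev(samples)))

The bug described above amounts to having both ends send at once, so neither side ever blocks waiting for a reply and the loop measures only transmission overhead.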

Should the CS systems community come up with a set of established procedures for running benchmarks and analyzing results? Maybe we need a "lab manual" for new CS grad students, laying out the best practices. What do people think?

7 comments:

  1. I suspect that more thorough and rigorous experimental methodology would arise naturally if computer science had a strong tradition of repeating experiments. But, for a variety of reasons (proprietary code and/or data, complexity, rapid change, etc.), we don't.

    I often wonder what (if any) impact a lack of controlled, repeatable experimentation has had on progress in the field. It's tough to say quantitatively, but qualitatively it has resulted in many derisive comments from my friends in the natural sciences 8).

  2. I really like the "lab manual" idea. Best practices would seem to be a lot easier for the larger community to get behind than any one set of procedures.

    It seems like it would be especially valuable to emphasize to beginning grad students that, whether for the resubmission of a rejected paper or the camera-ready for an accepted one, you're probably going to have to run at least one of your experiments again, so you might as well make your experiments repeatable and document how you did them. I learned that particular lesson the hard way, and I wish someone had drilled it into my head earlier.

  3. I also really like the idea of providing grad students with thorough guidelines on how to run experiments!

    What bothers me a lot is that few authors make the data their papers are based on available. Of course, most will provide you with the data if requested, but I think it should be common procedure that if you publish findings derived from experimentally gathered data, you also publish that data (e.g., by providing a link to a tarball).

    Next, I think any experiments whose results are published need to be reproducible by other researchers. That is, I would like to see every paper give a reference to a tarball including all the scripts for experiment control and data analysis. This doesn't have to be polished, but it should give an external researcher who was not involved in the experiments a good idea of how the experiments were carried out in detail.

    The practice of publishing your methods and tools also facilitates starting up similar research.

    As somebody new to the field, I am wondering why this is not common practice. Are people ashamed of dirty hacks? Afraid of flaws being exposed? Or of others gaining an advantage from the time invested in setting up experiments?

  4. As a start, read Vern Paxson's IMC paper from 2004: Strategies for Sound Internet Measurement.

  5. Another starting point might be How to Do Research at the MIT AI Lab. It is a bit more general than what you propose but similar in spirit.

  6. i would strongly support a course on rigorous experimentation, analysis and validation techniques for every systems ph.d. student. there are papers from paxson, floyd, and willinger which advocate rigorous analysis - something that every systems/networking student should read. unfortunately there are many many papers out there (even in tier-1 conferences) which have sloppy and/or buggy analysis. to make things worse:

    1. it is not uncommon to see people refusing to share their analysis code or simply ignoring requests for code (let alone making it public themselves).

    2. it is not uncommon to see papers with massive proprietary datasets (e.g., from ISPs and datacenters) in tier-1 conferences. obviously, these results cannot be validated, and are taken at face value (they may get in on the basis of the data, since a reviewer may have no clue about the accuracy of the results).

    3. it is not uncommon to see a paper which does not provide enough details to reproduce its results; this is not uncommon in tier-1 conferences either. problem (1) above further complicates things with such work.

    the pressure on a graduate student (to be able to get a job) and tenure track faculty (to get tenure) to publish in tier-1 venues ends up churning out last minute analysis, and thus many of these problems. i'm not claiming to have a solution, and i don't think there is one. in non-CS fields (esp. medicine, where accuracy of results really matters) this may work out better, given that they do not have "yearly deadlines" and can submit to their journals when they are confident that they have finished.

  7. I agree with the idea that computer science has more experimental cycles, that is, repeated experiments; hence a manual is perhaps not advisable to some extent.

