
The CS Grad Student Lab Manual

Should CS grad students be required to receive formal training in lab technique?

In most scientific disciplines, a great deal of attention is paid to proper experimental design, data collection, and analysis. A grad student in chemistry or biology learns to adhere to a fairly rigid set of procedures for running an experiment (and documenting the procedure). In CS, we largely assume that grad students (not to mention professors and undergrads) somehow magically know how to do these things properly.

When I was a grad student, I more or less figured out how to run benchmarks, collect and record data, document experimental setup, analyze the raw data, and produce meaningful figures on my own. Sure, I had some mentorship from the more senior grad students in my group (and no small amount of pushback from my advisor when a graph would not make sense to him). But in reality, there was very little oversight in terms of how I ran my experiments and collected results. I logged benchmark output to various homebrew ASCII file formats and cobbled together Perl scripts to churn the output. This evolved considerably over time, adding support for gzipped log files (when they got too big), automatic generation of gnuplot scripts to graph the results, and elaborate use of Makefiles to automate the benchmark runs. Needless to say, I am absolutely certain that all of my scripts were free of bugs and that the results published in my papers are 100% accurate.

In my experience, grad students tend to come up with their own procedures, and few of them are directly verifiable. Sometimes I find myself digging into scripts written by one of my students to understand how the statistics were generated. As an extreme example, at one point Sean Rhea (whom I went to grad school with) logged all of his benchmark results directly to a MySQL database and used a set of complex SQL queries to crunch the numbers. For our volcano sensor network deployments, we opted to log everything using XML and wrote some fairly hairy Python code to parse the logs and generate statistics. The advantage of XML is that the data is self-describing and can be manipulated programmatically (your code walks the document tree). It also decouples the logic of reading and writing the logs from the code that manipulates the data. More recently, students in my group have made heavy use of Python pickle files for data logging, which have the advantage of being absolutely trivial to use, but the disadvantage that changes to the Python data structures can make old log files unusable.

Of course, all of these data management approaches assume sound experimental technique. Obvious things include running benchmarks on an "unloaded" machine, doing multiple runs to eliminate measurement error, and using high-resolution timers (such as CPU cycle counters) when possible. However, some of these things are more subtle. I'll never forget my first benchmarking experience as an undergrad at Cornell -- measuring round-trip latency of the U-Net interface that I implemented on top of Fast Ethernet. My initial set of runs said that the RTT was around 6 microseconds -- below the fabled Culler Constant! -- beating the pants off of the previous implementation over ATM. I was ecstatic. Turns out my benchmark code had a small bug and was not doing round-trip ping-pongs but rather having both ends transmit simultaneously, thereby measuring the packet transmission overhead only. Duh. Fortunately, the results were too good to be true, and we caught the bug well before going to press, but what if we hadn't noticed?
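To make the "multiple runs" point concrete, here is a generic measurement harness sketch (not the U-Net benchmark, and deliberately simplified): it warms up first, times many runs with the highest-resolution timer Python exposes, and reports the minimum and median rather than a single sample.

```python
import statistics
import time

def measure(op, runs=30, warmup=5):
    """Time op() over several runs; return (min_ns, median_ns).

    Reporting min and median instead of one sample helps separate
    the operation's cost from transient measurement noise.
    """
    for _ in range(warmup):  # warm caches and code paths before timing
        op()
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        op()
        samples.append(time.perf_counter_ns() - start)
    return min(samples), statistics.median(samples)
```

A sanity check belongs in the harness too: if the minimum comes out below a known physical lower bound (wire time, a "Culler Constant"), the benchmark is probably measuring the wrong thing.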

Should the CS systems community come up with a set of established procedures for running benchmarks and analyzing results? Maybe we need a "lab manual" for new CS grad students, laying out the best practices. What do people think?


  1. I suspect that more thorough and rigorous experimental methodology would arise naturally if computer science had a strong tradition of repeating experiments. But, for a variety of reasons (proprietary code and/or data, complexity, rapid change, etc.), we don't.

    I often wonder what (if any) impact a lack of controlled, repeatable experimentation has had on progress in the field. It's tough to say quantitatively, but qualitatively it has resulted in many derisive comments from my friends in the natural sciences 8).

  2. I really like the "lab manual" idea. Best practices would seem to be a lot easier for the larger community to get behind than any one set of procedures.

    It seems like it would be especially valuable to emphasize to beginning grad students that, whether for the resubmission of a rejected paper or the camera-ready for an accepted one, you're probably going to have to run at least one of your experiments again, so you might as well make your experiments repeatable and document how you did them. I learned that particular lesson the hard way, and I wish someone had drilled it into my head earlier.

  3. I also really like the idea of providing grad students with thorough guidelines on how to run experiments!

    What bothers me a lot is that few authors make the data their papers are based on available. Of course, most will provide you with the data if requested, but I think it should be common procedure that if you publish findings derived from experimentally gathered data, you should also publish that data (e.g. by providing a link to a tarball).

    Next, I think any experiments whose results are published need to be reproducible by other researchers. That is, I would like to see every paper give a reference to a tarball including all the scripts for experiment control and data analysis. This doesn't have to be polished, but it should give an external researcher who was not involved in the experiments a good idea of how they were carried out in detail.

    The practice of publishing your methods and tools also facilitates starting up similar research.

    As somebody new to the field, I am wondering why this is not common practice. Are people ashamed of dirty hacks? Afraid of flaws being exposed? Or of others gaining an advantage from the time invested in setting up the experiments?

  4. As a start, read Vern Paxson's IMC paper from 2004: Strategies for sound internet measurement.

  5. Another starting point might be How to do Research At the MIT AI Lab. It is a bit more general than what you propose but similar in spirit.

  6. I would strongly support a course on rigorous experimentation, analysis, and validation techniques for every systems Ph.D. student. There are papers from Paxson, Floyd, and Willinger that advocate rigorous analysis -- something every systems/networking student should read. Unfortunately, there are many, many papers out there (even in tier-1 conferences) with sloppy and/or buggy analysis. To make things worse:

    1. It is not uncommon to see people refuse to share their analysis code, or simply ignore requests for it (let alone make it public on their own).

    2. It is not uncommon to see papers built on massive proprietary datasets (e.g. from ISPs and datacenters), even in tier-1 conferences. Obviously, these results cannot be validated and are taken at face value (the paper may get in on the strength of the data, since a reviewer has no way to judge the accuracy of the results).

    3. It is not uncommon to see a paper that does not provide enough detail to reproduce its results; this happens in tier-1 conferences too. Problem (1) above further complicates things with such work.

    The pressure on graduate students (to get a job) and tenure-track faculty (to get tenure) to publish in tier-1 venues ends up churning out last-minute analysis, and with it many of these problems. I'm not claiming to have a solution, and I don't think there is one. Non-CS fields (especially medicine, where the accuracy of results really matters) may fare better here, given that they do not have yearly deadlines and can submit to their journals once they are confident the work is finished.

  7. I agree that computer science has faster experimental cycles -- experiments get repeated and revised more often -- so a manual is perhaps not advisable, to some extent.

