Should CS grad students be required to receive formal training in lab technique?
In most scientific disciplines, a great deal of attention is paid to proper experimental design, data collection, and analysis. A grad student in chemistry or biology learns to adhere to a fairly rigid set of procedures for running an experiment (and documenting the procedure). In CS, we largely assume that grad students (not to mention professors and undergrads) somehow magically know how to do these things properly.
When I was a grad student, I more or less figured out how to run benchmarks, collect and record data, document experimental setup, analyze the raw data, and produce meaningful figures on my own. Sure, I had some mentorship from the more senior grad students in my group (and no small amount of pushback from my advisor when a graph would not make sense to him). But in reality, there was very little oversight in terms of how I ran my experiments and collected results. I logged benchmark output to various homebrew ASCII file formats and cobbled together Perl scripts to churn the output. This evolved considerably over time, adding support for gzipped log files (when they got too big), automatic generation of gnuplot scripts to graph the results, and elaborate use of Makefiles to automate the benchmark runs. Needless to say, I am absolutely certain that all of my scripts were free of bugs and that the results published in my papers are 100% accurate.
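For concreteness, here is what such a homebrew pipeline might look like -- not my original Perl, but a small hypothetical Python sketch with made-up file names and log format -- reading gzipped ASCII logs, summarizing them, and spitting out a gnuplot script:

    import glob
    import gzip
    import statistics

    # Hypothetical layout: one gzipped ASCII log per benchmark run, with lines
    # of the form "latency_us <value>". Both the paths and the format are invented.
    rows = []
    for path in sorted(glob.glob("logs/run-*.log.gz")):
        latencies = []
        with gzip.open(path, "rt") as f:
            for line in f:
                if line.startswith("latency_us"):
                    latencies.append(float(line.split()[1]))
        if latencies:
            rows.append((path, statistics.mean(latencies)))

    # Emit a data file and a gnuplot script to graph it.
    with open("latency.dat", "w") as out:
        for i, (path, mean_us) in enumerate(rows):
            out.write(f"{i} {mean_us:.2f} # {path}\n")

    with open("latency.gp", "w") as gp:
        gp.write('set ylabel "mean latency (us)"\n')
        gp.write('plot "latency.dat" using 1:2 with linespoints title "mean latency"\n')

Wrap that in a Makefile rule that reruns whenever a new log appears and you have, more or less, the kind of ad hoc apparatus I am describing.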
In my experience, grad students tend to come up with their own procedures, few of which are directly verifiable. Sometimes I find myself digging into scripts written by one of my students to understand how the statistics were generated. As an extreme example, at one point Sean Rhea (whom I went to grad school with) logged all of his benchmark results directly to a MySQL database and used a set of complex SQL queries to crunch the numbers. For our volcano sensor network deployments, we opted to log everything using XML and wrote some fairly hairy Python code to parse the logs and generate statistics. The advantage of XML is that the data is self-describing and can be manipulated programmatically (your code walks the document tree). It also decouples the logic of reading and writing the logs from the code that manipulates the data. More recently, students in my group have made heavy use of Python pickle files for data logging, which have the advantage of being absolutely trivial to use, but the disadvantage that changes to the Python data structures can make old log files unusable.
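To make the trade-off concrete, here is a tiny sketch (field names and file names are invented) of the two ends of the spectrum: pickle, which is one line to write and one to read but is tied to your current Python code, versus a self-describing format like XML, which any later script -- in any language -- can parse without importing the data structures that produced it:

    import pickle
    import xml.etree.ElementTree as ET

    result = {"benchmark": "ping-pong", "run": 3, "rtt_us": 41.7}

    # Pickle: trivially easy, but if this were a custom class and the class
    # definition later changed, old .pkl files might no longer load.
    with open("run3.pkl", "wb") as f:
        pickle.dump(result, f)
    with open("run3.pkl", "rb") as f:
        print(pickle.load(f))

    # XML: more ceremony, but the log file describes itself, and any tool
    # can walk the document tree later.
    root = ET.Element("result", benchmark="ping-pong", run="3")
    ET.SubElement(root, "rtt", unit="us").text = "41.7"
    ET.ElementTree(root).write("run3.xml")

    for rtt in ET.parse("run3.xml").getroot().iter("rtt"):
        print(rtt.get("unit"), rtt.text)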
Of course, all of these data management approaches assume sound experimental technique. Obvious things include running benchmarks on an "unloaded" machine, doing multiple runs to average out measurement noise, and using high-resolution timers (such as CPU cycle counters) when possible. Other pitfalls, however, are more subtle. I'll never forget my first benchmarking experience as an undergrad at Cornell -- measuring the round-trip latency of the U-Net interface that I implemented on top of Fast Ethernet. My initial set of runs said that the RTT was around 6 microseconds -- below the fabled Culler Constant! -- beating the pants off of the previous implementation over ATM. I was ecstatic. It turns out my benchmark code had a small bug: it was not doing round-trip ping-pongs but rather having both ends transmit simultaneously, thereby measuring only the packet transmission overhead. Duh. Fortunately, the results were too good to be true, and we caught the bug well before going to press, but what if we hadn't noticed?
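As an illustration only -- this is not the U-Net benchmark, just a sketch over loopback TCP with invented parameters -- here is what a ping-pong RTT measurement looks like when the timed loop actually waits for the echo before sending again, uses a high-resolution timer, discards warm-up iterations, and reports statistics over many runs:

    import socket
    import statistics
    import threading
    import time

    HOST, PORT = "127.0.0.1", 9999        # made-up port for this sketch
    N_WARMUP, N_RUNS = 100, 1000
    MSG = b"x" * 64

    def echo_server(ready):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            while True:
                data = conn.recv(len(MSG))
                if not data:
                    break
                conn.sendall(data)        # echo back: this closes the loop

    ready = threading.Event()
    threading.Thread(target=echo_server, args=(ready,), daemon=True).start()
    ready.wait()

    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    cli.connect((HOST, PORT))
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # don't batch tiny pings

    samples = []
    for i in range(N_WARMUP + N_RUNS):
        t0 = time.perf_counter_ns()
        cli.sendall(MSG)
        # Wait for the full echo before stopping the timer -- skipping this
        # receive is exactly the "both ends transmit at once" bug described above.
        buf = b""
        while len(buf) < len(MSG):
            buf += cli.recv(len(MSG) - len(buf))
        t1 = time.perf_counter_ns()
        if i >= N_WARMUP:                 # discard warm-up iterations
            samples.append((t1 - t0) / 1000.0)   # microseconds

    cli.close()
    print(f"RTT over {N_RUNS} runs: median {statistics.median(samples):.1f} us, "
          f"min {min(samples):.1f} us, max {max(samples):.1f} us")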
Should the CS systems community come up with a set of established procedures for running benchmarks and analyzing results? Maybe we need a "lab manual" for new CS grad students, laying out the best practices. What do people think?