Skip to main content

On the serendipity of failure

My post on the relative difficulty of sensor nets research versus "traditional" systems research made an allusion to a comment that David Culler once made about the handicap associated with field work. David notes that in a field deployment, unlike a simulation or a lab experiment, if you get data you don't like, you can't just throw it away and run the experiment again, hoping for a better result. All systems researchers tend to tweak and tune their systems until they get the graphs they expect to get; it's one of the luxuries of running experiments in a controlled setting. With sensor network field deployments, however, there are no "do overs". You get whatever the network gives you, and if the network is broken or producing bogus data, that's what you have to go with when it's time to write it up. One of the most remarkable examples I have seen of this is the Berkeley paper on "A Macroscope in the Redwoods." In that case, the data was actually retrieved from the sensor nodes manually, in part because the multihop routing did not function as well as expected in the field. They got lucky by logging all of the sensor data to flash on each mote, making it possible to write that paper despite the failure.

We learned this lesson the hard way with our deployment at Reventador Volcano in 2005: the FTSP time sync protocol broke (badly), causing much of the data to have bogus timestamps, rendering the signals useless from a geophysical monitoring point of view. Note that we had tested our system considerably in the lab, on much larger networks than the one we deployed in the field, but we never noticed this problem until we got to the volcano. What went wrong? Many possibilities: The sparse nature of the field network; the fact that the nodes were powered by real alkaline batteries and not USB hubs; the fact that the time sync bug only seemed to turn up after the network had been operational for several hours. (We had done many hours of testing in the lab, but never continuously for that period of time.)

In our case, we managed to turn lemons into lemonade by designing a scheme to fix the broken timestamps, and then did a fairly rigorous study of its accuracy. That got us into OSDI! It's possible that if the FTSP protocol had worked perfectly we would have had a harder time getting that paper accepted.

I often find the parts of those application papers that talk about what didn't work as expected are more enlightening than the rest of the paper. Lots of things sound like good ideas on paper; it's often not until you try them in the field that you gain some real understanding of the real-world forces at work.

Later on I'll blog about why sensor net application papers face such an uphill battle at most conferences.


Popular posts from this blog

Why I'm leaving Harvard

The word is out that I have decided to resign my tenured faculty job at Harvard to remain at Google. Obviously this will be a big change in my career, and one that I have spent a tremendous amount of time mulling over the last few months.

Rather than let rumors spread about the reasons for my move, I think I should be pretty direct in explaining my thinking here.

I should say first of all that I'm not leaving because of any problems with Harvard. On the contrary, I love Harvard, and will miss it a lot. The computer science faculty are absolutely top-notch, and the students are the best a professor could ever hope to work with. It is a fantastic environment, very supportive, and full of great people. They were crazy enough to give me tenure, and I feel no small pang of guilt for leaving now. I joined Harvard because it offered the opportunity to make a big impact on a great department at an important school, and I have no regrets about my decision to go there eight years ago. But m…

Rewriting a large production system in Go

My team at Google is wrapping up an effort to rewrite a large production system (almost) entirely in Go. I say "almost" because one component of the system -- a library for transcoding between image formats -- works perfectly well in C++, so we decided to leave it as-is. But the rest of the system is 100% Go, not just wrappers to existing modules in C++ or another language. It's been a fun experience and I thought I'd share some lessons learned.

Why rewrite?

The first question we must answer is why we considered a rewrite in the first place. When we started this project, we adopted an existing C++ based system, which had been developed over the course of a couple of years by two of our sister teams at Google. It's a good system and does its job remarkably well. However, it has been used in several different projects with vastly different goals, leading to a nontrivial accretion of cruft. Over time, it became apparent that for us to continue to innovate rapidly wo…

Running a software team at Google

I'm often asked what my job is like at Google since I left academia. I guess going from tenured professor to software engineer sounds like a big step down. Job titles aside, I'm much happier and more productive in my new role than I was in the 8 years at Harvard, though there are actually a lot of similarities between being a professor and running a software team.

I lead a team at Google's Seattle office which is responsible for a range of projects in the mobile web performance area (for more background on my team's work see my earlier blog post on the topic). One of our projects is the recently-announced data compression proxy support in Chrome Mobile. We also work on the PageSpeed suite of technologies, specifically focusing on mobile web optimization, as well as a bunch of other cool stuff that I can't talk about just yet.

My official job title is just "software engineer," which is the most common (and coveted) role at Google. (I say "coveted&quo…