Skip to main content

Reflections on Fast, User-Level Networking

A couple of weeks ago at HotOS, one of the most controversial papers (from Stanford) was entitled "It's Time for Low Latency." The basic premise of the paper is that clusters are stuck using expensive, high-latency network interfaces (generally TCP/IP over some flavor of Ethernet), but it should now be possible to achieve sub-10-microsecond round-trip-times for RPCs. Of course, a tremendous amount of research looked at low-latency, high-bandwidth cluster networking in the mid-1990's, including Active Messages, the Virtual Interface Architecture, and U-Net (which I was involved with as an undergrad at Cornell). A bunch of commercial products were available in this space, including Myrinet (still the best, IMHO) and InfiniBand.

Not much of this work has really taken off in commercial datacenters. John Ousterhout and Steve Rumble argue that this is because the commercial need for low latency networking hasn't been there until now. Indeed, when we were working on this in the 90's, the applications we envisioned were primarily numerical and scientific computing: big matrix multiplies, that kind of thing.

When Inktomi and Google started demonstrating Web search as the "killer app" for clusters, they managed to get away with relatively high-latency, but cheap, Ethernet-based solutions. For these applications, the cluster interconnect was not the bottleneck. Rumble's paper argues that emerging cloud applications are motivating the need for fast intermachine RPC. I'm not entirely convinced of this, but John and Steve and I had a few good conversations about this at HotOS and I've been reflecting on the lessons learned from the "fast interconnect" heyday of the 90's...

Microbenchmarks are evil: There is a risk in focusing on microbenchmarks when working on cluster networking. The standard "ping-pong" latency measurement and bulk transfer throughput measurements rarely reflect the kind of traffic patterns seen in real workloads. Getting something to work on two unloaded machines connected back-to-back says little about whether it will work at a large scale with a complex traffic mix and unexpected load. You often find that real world performance comes nowhere near the ideal two-machine case. For that matter, even "macrobenchmarks" like the infamous NOW-Sort work be misleading, especially when measurements are taken under ideal conditions. Obtaining robust performance under uncertain conditions seems a lot more important than optimizing for the best case that you will never see in real life.

Usability matters:  I'm convinced that one of the reasons that U-Net, Active Messages, and VIA failed to take off is that they were notoriously hard to program to. Some systems, like Fast Sockets, layer a conventional sockets API on top, but often suffered large performance losses as a result, in part because the interface couldn't be tailored for specific traffic patterns. And even "sockets-like" layers often did not work exactly like sockets, being different enough that you couldn't just recompile your application to use them. A common example is not being entirely threadsafe, or not working with mechanisms such as select() and poll(). When you are running a large software stack that depends on sockets, it is not easy to rip out the guts with something that is not fully backwards compatible.

Commodity beats fast: If history has proven anything, there's only so much that systems designers are willing to pay -- in terms of complexity or cost -- for performance. The vast majority of real-world systems are based on some flavor of the UNIX process model, BSD filesystem, relational database, and TCP/IP over Ethernet. These technologies are all commodity and can be found in many (mostly compatible) variants, both commercial and open source; few companies are willing to invest time and money to tailor their design for some funky single-vendor user-level networking solution that might disappear one day.


  1. FWIW Inktomi started with Myrinet and eventually migrated to Ethernet.

  2. Ethernet, as a wire protocol, has benefited from wide adoption and been essentially commoditized. It also provides a nimble deployment model for upgrading to faster (i.e. 1G to 10G) signaling technologies. The *programming model* is what ultimately determines "usability" and this is where cluster networks are not willing to spend the effort unless there is substantial payoff. Sockets Direct Protocol (SDP) offered some initial promise, but in practice has not been used. Most high-performance clusters use MPI (even over ethernet which is still the most common network protocol on the list of the Top500 civilian computers ( I am not arguing for MPI as a programming model, but I am suggesting there is a big gap between TCP/IP (i.e. sockets) and MPI or user-level fast messaging.

    FWIW, the financial community is *very* latency sensitive in their applications for trading and microseconds in the network make a difference of order execution, and potentially millions of dollars.


Post a Comment

Popular posts from this blog

Why I'm leaving Harvard

The word is out that I have decided to resign my tenured faculty job at Harvard to remain at Google. Obviously this will be a big change in my career, and one that I have spent a tremendous amount of time mulling over the last few months.

Rather than let rumors spread about the reasons for my move, I think I should be pretty direct in explaining my thinking here.

I should say first of all that I'm not leaving because of any problems with Harvard. On the contrary, I love Harvard, and will miss it a lot. The computer science faculty are absolutely top-notch, and the students are the best a professor could ever hope to work with. It is a fantastic environment, very supportive, and full of great people. They were crazy enough to give me tenure, and I feel no small pang of guilt for leaving now. I joined Harvard because it offered the opportunity to make a big impact on a great department at an important school, and I have no regrets about my decision to go there eight years ago. But m…

Rewriting a large production system in Go

My team at Google is wrapping up an effort to rewrite a large production system (almost) entirely in Go. I say "almost" because one component of the system -- a library for transcoding between image formats -- works perfectly well in C++, so we decided to leave it as-is. But the rest of the system is 100% Go, not just wrappers to existing modules in C++ or another language. It's been a fun experience and I thought I'd share some lessons learned.

Why rewrite?

The first question we must answer is why we considered a rewrite in the first place. When we started this project, we adopted an existing C++ based system, which had been developed over the course of a couple of years by two of our sister teams at Google. It's a good system and does its job remarkably well. However, it has been used in several different projects with vastly different goals, leading to a nontrivial accretion of cruft. Over time, it became apparent that for us to continue to innovate rapidly wo…

Running a software team at Google

I'm often asked what my job is like at Google since I left academia. I guess going from tenured professor to software engineer sounds like a big step down. Job titles aside, I'm much happier and more productive in my new role than I was in the 8 years at Harvard, though there are actually a lot of similarities between being a professor and running a software team.

I lead a team at Google's Seattle office which is responsible for a range of projects in the mobile web performance area (for more background on my team's work see my earlier blog post on the topic). One of our projects is the recently-announced data compression proxy support in Chrome Mobile. We also work on the PageSpeed suite of technologies, specifically focusing on mobile web optimization, as well as a bunch of other cool stuff that I can't talk about just yet.

My official job title is just "software engineer," which is the most common (and coveted) role at Google. (I say "coveted&quo…