I just got back from HotOS 2013 and, frankly, it was a little depressing. Mind you, the conference was really well organized, with lots of great people, an amazing venue, and fine work by the program committee and chair... but I could not help being left with the feeling that the operating systems community is somewhat stuck in a rut.
It did not help that the first session was about how to make network and disk I/O faster, a topic that has been a recurring theme for as long as "systems" has existed as a field. HotOS is supposed to represent the "hot topics" in the area, but when we're still arguing about problems that are 25 years old, it starts to feel not-so-hot.
Of the 27 papers presented at the workshop, only about 2 or 3 would qualify as bold, unconventional, or truly novel research directions. The rest were basically extended abstracts of conference submissions that are either already in preparation or will be submitted in the next year or so. This is a perennial issue for HotOS; we had the same problem when I chaired it in 2011. So I can't fault the program committee on this one -- they have to work with the submissions they get, and often the "best" and most polished submissions represent the most mature (and hence less speculative) work. (Still, this year there was no equivalent to Dave Ackley's 2011 paper challenging us to "pledge allegiance to the light cone.")
This got me thinking about what research areas I wish the systems research community would spend more time on. I wrote a similar blog post after attending HotMobile 2013, so it's only fair that I would subject the systems community to the same treatment. A few ideas...
Obligatory disclaimer: Everything in this post is my personal opinion and does not represent the view of my employer.
An escape from configuration hell: A lot of research effort is focused on better techniques for finding and mitigating software bugs. In my experience at Google, the vast majority of production failures arise not from bugs in the software, but from bugs in the (often vast and incredibly complex) configuration settings that control the software. A canonical example is when someone bungles an edit to a config file, which then gets rolled out to the fleet and causes jobs to start behaving in new and often undesirable ways. The software is working exactly as intended, but the bad configuration is leading it to do the wrong thing.
This is a really hard problem. A typical Google-scale system involves many interacting jobs running very different software packages, each with its own mechanism for runtime configuration: command-line flags, some kind of special-purpose configuration file (often in a totally custom ASCII format), or a fancy dynamically updated key-value store. The configurations often operate at very different levels of abstraction -- everything from deciding where to route network packets, to Thai and Slovak translations of UI strings seen by users. "Bad configurations" are not just obvious things like syntax errors; they also include unexpected interactions between software components when a new (perfectly valid) configuration is used.
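To make this concrete, here is a minimal sketch of the kind of semantic check that configuration tooling needs to perform beyond syntax validation. All of the names, settings, and thresholds below are invented for illustration, not drawn from any real system:

```python
# Hypothetical sketch: a pre-submit check that enforces an invariant spanning
# two independently-owned configs. Each config parses and validates on its
# own; the problem only appears when they are considered together.

FRONTEND_CONFIG = {
    "max_inflight_requests": 8000,   # recently bumped in a "perfectly valid" edit
    "request_timeout_ms": 500,
}

BACKEND_CONFIG = {
    "max_open_sockets": 4096,        # owned by a different team, rarely revisited
    "threads_per_task": 64,
}

def cross_config_violations(frontend, backend):
    """Return violations of invariants that span both configs."""
    violations = []
    # Each in-flight frontend request may hold a backend socket, so frontend
    # concurrency must not exceed the backend's socket budget.
    if frontend["max_inflight_requests"] > backend["max_open_sockets"]:
        violations.append(
            "frontend may open %d connections but backend allows only %d"
            % (frontend["max_inflight_requests"], backend["max_open_sockets"])
        )
    return violations

if __name__ == "__main__":
    for violation in cross_config_violations(FRONTEND_CONFIG, BACKEND_CONFIG):
        print("CONFIG ERROR:", violation)
```

The hard research question is where invariants like this come from in the first place: nobody writes them down, and they cut across systems owned by different teams.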
There are of course tools for testing configurations, catching problems and rapidly rolling back bad changes, etc. but a tremendous amount of developer and operational energy goes into fixing problems arising due to bad configurations. This seems like a ripe area for research.
Understanding interactions in a large, production system: The common definition of a "distributed system" assumes that the interactions between the individual components of the system are fairly well-defined, and dictated largely by whatever messaging protocol is used (cf. two-phase commit, Paxos, etc.). In reality, the modes of interaction are vastly more complex and subtle than simply reasoning about state transitions and messages, in the abstract way that distributed systems researchers tend to cast things.
Let me give a concrete example. Recently we encountered a problem where a bunch of jobs in one datacenter started crashing due to running out of file descriptors. Since this roughly coincided with a push of a new software version, we assumed that there must have been some leak in the new code, so we rolled back to the old version -- but the crash kept happening. We couldn't just take down the crashing jobs and let the traffic flow to another datacenter, since we were worried that the increased load would trigger the same bug elsewhere, leading to a cascading failure. The engineer on call spent many, many hours trying different things and trying to isolate the problem, without success. Eventually we learned that another team had changed the configuration of their system which was leading to many more socket connections being made to our system, which put the jobs over the default file descriptor limit (which had never been triggered before). The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.
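For what it's worth, the proximate symptom in a case like this is cheap to detect once you know to look for it. Here is a toy, Linux-specific sketch (purely illustrative, and not the tooling we actually used) that compares a process's open file descriptors against its soft limit; the hard part is anticipating that this is the resource a neighboring team's config change will exhaust:

```python
# Illustrative only: report file descriptor usage against the process's soft
# limit on Linux. Knowing to watch this particular limit is the real problem.
import os
import resource

def fd_usage():
    """Return (open_fds, soft_limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft

if __name__ == "__main__":
    used, limit = fd_usage()
    status = "WARNING" if used > 0.8 * limit else "OK"
    print("%s: %d of %d file descriptors in use" % (status, used, limit))
```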
Somehow there needs to be a way to perform offline analysis and testing of large, complex systems so that we can catch these kinds of problems before they crop up in production. Of course we have extensive testing infrastructure, but the "hard" problems always come up when running in a real production environment, with real traffic and real resource constraints. Even integration tests and canarying are a joke compared to how complex production-scale systems are. I wish I had a way to take a complete snapshot of a production system and run it in an isolated environment -- at scale! -- to determine the impact of a proposed change. Doing so on real hardware would be cost-prohibitive (even at Google), so how do you do this in a virtual or simulated setting?
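I don't have an answer, but even a crude offline "what-if" model would beat the guesswork we were doing during the incident above. The toy sketch below (every number is invented) drains one datacenter's traffic into the survivors and asks whether their tasks would blow through a per-task connection limit -- obviously a far cry from replaying a real production snapshot, but it shows the shape of the question such a tool would answer:

```python
# Toy "what-if" model, not a real tool: if we drain one datacenter, do the
# surviving tasks exceed their per-task file descriptor limit? All numbers
# here are invented for illustration.

PER_TASK_FD_LIMIT = 1024   # default per-task file descriptor limit
FDS_PER_QPS = 0.9          # assumed sockets held per unit of offered load

DATACENTERS = {
    "dc-a": {"tasks": 200, "qps": 180000},
    "dc-b": {"tasks": 200, "qps": 150000},
}

def drain(dcs, victim):
    """Return post-drain load (qps per task) for each surviving datacenter."""
    survivors = {name: d for name, d in dcs.items() if name != victim}
    extra_per_dc = dcs[victim]["qps"] / len(survivors)
    return {
        name: (d["qps"] + extra_per_dc) / d["tasks"]
        for name, d in survivors.items()
    }

if __name__ == "__main__":
    for name, qps_per_task in drain(DATACENTERS, "dc-a").items():
        fds = qps_per_task * FDS_PER_QPS
        verdict = "OVER LIMIT" if fds > PER_TASK_FD_LIMIT else "ok"
        print("%s: ~%.0f fds per task (%s)" % (name, fds, verdict))
```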
I'll admit that these are not easy problems for academics to work on. Unless you have access to a real production system, it's unlikely you'll encounter them, and replicating them in an academic environment may be difficult. Doing an internship at a company is a great way to get exposure to this kind of thing.
Pushing the envelope on new computing platforms: I also wish the systems community would come back to working on novel and unconventional computing platforms. The work on sensor networks in the 2000's really challenged our assumptions about the capabilities and constraints of a computer system, and forced us down some interesting paths in terms of OS, language, and network protocol design. In doing these kinds of explorations, we learn a lot about how "conventional" OS concepts map (or don't map) onto the new platform, and the new techniques can often find a home in a more traditional setting: witness how the ideas from Click have influenced all kinds of systems unrelated to its original goals.
I think it is inevitable that in our lifetimes we will have a wearable computing platform that is "truly embedded": either with a neural interface, or with something almost as good (e.g. seamless speech input and visual output in a light and almost-invisible form factor). I wore my Google Glass to HotOS, which stirred up a lot of discussions around privacy issues, what the "killer apps" are, what abstractions the OS should support, and so forth. I would call Google Glass an early example of the kind of wearable platform that may well replace smartphones, tablets, and laptops as the personal computing interface of choice in the future. If that is true, then now is the time for the academic systems community to start working out how we're going to support such a platform. There are vast issues around privacy, energy management, data storage, application design, algorithms for vision and speech recognition, and much more that come up in this setting.
These are all juicy and perfectly valid research problems for the systems community -- if only it is bold enough to work on them.
Temple OS
http://www.templeos.org
There is something similar: http://nixos.org/
I'm almost certain you mean https://code.google.com/p/nix-os/
Hear, hear.
ReplyDeleteWell said Matt! Another excellent post clarifying thoughts that many of us have had at one time or another. I too have been relatively disappointed over recent years by some of the directions in the core systems community, and as a result have personally drifted away from it into more data-driven areas.
I do think, however, that the TPC etc. still hold considerable sway over whether the "bold" papers get in. There is still a lot of built-in skepticism toward anything outside of the currently "hot" topics, where hot is defined as the 2-3 areas covered by papers at the top conference. Polished, bold submissions are few and far between, but they still face significant headwinds when they do show up. I recall a lot of discussion about "scope" on one such paper on on-chip networks at HotNets a couple years back. In that case, the paper made it, and brought some much-needed fresh air to a program filled with more data center and SDN work. But the discussion could have easily turned negative.
I also love that you gave serious thought to the question of what the community should work on. Unfortunately, at least some of the problems you mention (all of which I agree with, btw) are very much problems that typical academics have limited access to. I'll echo Remzi and some others in saying that it would be very helpful to the community if Google (and msft, yhoo, fb etc) were to share more data and traces, so that academics without their own 1000-node cluster can get a handle on these problems and begin to think about them in a concrete way.
You're right that the PC has to work with what they are given. That said, it is their choice to accept the more polished papers, ones with experimental results, etc. My last Hot* submission was rejected for not having an extensive evaluation and related work. While I would have gladly accepted criticism on the ideas, design or boldness, I am bitter about the "extensive eval" part.
Have you checked a mirror? Google is one of (the?) largest systems deployed but I saw no papers from Google. You, yourself, left Harvard to work at Google and haven't published since.
Universities rarely have the resources to even approach "internet scale". At work we give an individual developer 40 servers to work on code development. At a university that is a large cluster. Internet scale involves data centers at least nationwide if not worldwide. Universities are lucky to have two distributed locations (probably AWS).
I actually have published since leaving Harvard for Google.
I will accept the criticism that Google doesn't publish enough papers in this area, although we do publish several hundred research papers a year overall. I don't think it's Google's job to "fix" the academic OS community, though publishing more papers might put some dent in the problem. Hence my recommendation that systems researchers try to do an internship or sabbatical at places like Google, Microsoft, Amazon, Facebook, etc. to learn what the real problems are. It is much better than us just publishing more papers.
I'm not sure that the answer is for faculty to spend time at Google, Facebook, etc. Firstly, there are very few mechanisms to actually do this. Case in point: Google offers visiting faculty positions / grants, but these schemes are often an entirely closed loop: in order to apply *and* succeed you need to know people at Google to champion your proposal, yet it is very often difficult to make these contacts in the first place, so you're back to square one… Secondly, large corporations do not usually want to work with faculty in the first place. There are usually a range of mechanisms to employ graduate students through internships; however, the primary motivation here is for Google etc. to employ these interns if they are any good anyway, reducing overall recruiting fees. Thirdly, even if you are given a sabbatical-type place you are a very small cog in a large machine, often without an actual research focus.
I think that industry-academia collaboration is great in principle but I would suggest taking a sabbatical at a smaller start-up company with a research focus. Ideas could be streamlined from concept to implementation far more rapidly than at companies like Google.
If anyone has better or different ideas here, I'd be interested to hear them. I am an academic at a leading university and am interested in taking my sabbatical in industry, but have found very few companies even willing to discuss the idea in the first place.
This is patently untrue. Almost all of the best systems faculty that I know have spent a sabbatical -- or at least an internship during grad school -- at a company like Microsoft or Google. Maybe that's part of the distinction between what makes for "good" versus "not so good" faculty in this area.
I think that's a little harsh and certainly not true of the people I work with in Europe at top Universities, Cambridge, Bristol etc. Perhaps the US has a different, more open model than Europe.
Still, if anyone has any suggestions about how to take a sabbatical in industry I'd be interested to hear them.
You mentioned you were at an OS conference and were disappointed that people were doing research on things like system I/O. You then propose that OS researchers spend time researching things like software configuration and data center testing tools... Neither of those two things is OS-related. Then you went on a random tangent about hardware, which is independent of systems. In order for OS researchers to do research on wearable hardware, the hardware must be there first. It was a well-written article whose introduction didn't match its body, and whose body didn't match its conclusion.
I disagree. HotOS has "traditionally" been the place to talk about far-reaching research that is outside of the bounds of the conventional OS community. But recently it has become basically an SOSP preview. I think the problem is that the community actually feels that it is more worthwhile to have a venue to float SOSP preprints for comments than a place to talk about far-out research that may not get into SOSP for 5 or 10 years. That's too bad.
HotOS happens twice a year! how "hot" can it be if it is almost two years old?
doh, i meant once every two years! it would actually get some hot research if it happened twice a year.
@Anon @may16:7:45pm -- it's actually the opposite. It stays fresh by NOT exhausting the pool of randomness. Compare HotOS (which is typically a pretty fun, dynamic workshop) with HotNets (which is even more of a "mini-conference"). I think one of the key differences is the every-other-yearness. And maybe that HotOS typically has better wine.
Hot does not necessarily mean current. It means _direction changing_ and thought provoking. Good and truly different ideas don't actually happen that often.
I don't know anything about current academic research in operating systems, but I also see an interesting direction, which doesn't seem to be often taken.
Currently, processes are selfish: they ask for resources, and the OS, like a benevolent dictator, determines whether or not they get them.
However, it seems to me that in some cases the whole system would benefit from some altruism as well. For example, a situation could arise where the OS suddenly finds itself needing to run a critical task, and lacks the resources to do so. It could ask some running processes to release non-critical resources they hold (empty caches, run garbage collection, interrupt some work on locked data, etc.) and in that way scrounge up the resources needed to get the job done. The non-critical tasks could continue to run, albeit in a more limited way, and the critical work would get done faster as well.
This would mean more negotiation over resources between the OS and processes, and in a more dynamic way. Maybe there could even be a credit system, and then the result would be like a market system for resources. (The usual failings of markets wouldn't apply here, unlike on human markets.)
This is a neat idea. It's related to virtual memory ballooning, which is a technique used by a virtual machine monitor to induce a VM to free memory:
http://pubs.vmware.com/vsphere-4-esx-vcenter/index.jsp?topic=/com.vmware.vsphere.resourcemanagement.doc_41/managing_memory_resources/c_memory_balloon_driver.html
Isn't what you describe an extension of what SIGDANGER in AIX is meant to be?
DeleteIn terms of the general short-term focus on academic research, academia is continually getting pushed towards a shorter and shorter term research horizon. Because of the squeezing of basic acadmic funding, there's huge pressures for faculty of all levels to bring in large amounts of external research dollars, and the fact is that essentially all sizeable research grants, be they from NSF, NIH, DOD, DOE, or Industry, strongly favor relatively short term, incremental work. Really innovative things, like the things my colleague Dave Ackley does, can unfortunately be very hard to get funded. What funding agency or corporation is stepping up to fund things like that?
The closed nature of most research communities also contributes to this, but changing this is hard. Program committees and grant panels don't have a lot of incentive to bring in people from outside their communities, so research communities work on what they've always worked on, with occasional course corrections. As an example, I was at OSDI a number of years ago, and a well-known systems researcher commented to me and a friend (paraphrasing from memory): "To write a paper that can get into SOSP, you really have to have been through the (SOSP) shepherding process."
I also think it's an artifact of the peer review process. Proposing incremental improvements to a well-defined problem is a lot easier to get published than a new big idea. Our group recently submitted two papers to a top-tier international software engineering conference. The first addressed one of the problems proposed by Matt Welsh. It was based on 5 years work, it was green fields research, a big new idea - which no one else (in the academic community) has done. The second paper was a trivial extension to a well defined problem, it represented about 4 weeks work. The second got accepted, the first rejected. Big new ideas are hard to position against previous work. The problem is about half of the reviewers will "get it" and the other half won't. You may get raving reviews from the ones who get it, but one negative review will drag down your review score average to below the acceptance threshold. Researchers are measured by number of papers. So the existing system encourages "small target"/incremental research. I'm not sure what the solution is.
Anyone who's ever been on an NSF panel has seen this dynamic in action. I was on one once where, at the start, the program manager exhorted the panel to look for high-risk/high-reward things to fund and not just fund incremental work. At one point, a proposal came up with a modest budget request that a number of us thought was really intriguing. One panelist then said "It's interesting, but I'm not sure it'll work. If you're going to fund it, maybe fund it for 6 months tops. I gave it a 'fair.'" Needless to say, the proposal didn't get funded.
Like democracy, peer review is the worst possible system, except for all of the others. It's important that members of the academic community are aware of its problems so that they can actively work against them.
My PhD was related to 'understanding interactions in a large, production system'. More specifically, replicating application-layer interactions of large scale systems to test behaviour at scale of individual software components. I had real difficulties getting my work published. In my opinion, having researchers tackle such problems is only half the battle. Finding appreciation for that work in the wider academic community is the other half.
Unpublished work for those interested: https://github.com/camhine/ICSE-2013
What do you think are the chances that a professor (or graduate student) who tried to do work in the areas you suggested would (a) be able to get funding for that research or (b) be able to get a paper on that subject accepted to a conference or journal that tenure committees think highly of?
It's one thing if a potential research area is hard for an academic to do work in; there will always be smart people who love challenges. But if they don't get rewarded (or are actively penalized) for doing work in a particular area, that's quite another thing....
Ted, this is a good question -- I don't honestly know. See my earlier post on "academic freedom" which talks a bit about some of the constraints on what academics can practically work on:
http://matt-welsh.blogspot.com/2013/04/the-other-side-of-academic-freedom.html
The answers to these questions are more affected by how the work is done and by whom than by the details of the problems being solved.
If the right person wades into one of these areas, they're going to have excellent industry contacts and a bunch of great students; they'll implement a ton of stuff, throw it away, and then do it right; they'll evaluate it on big, real problems; probably most importantly they'll be able to find the right spin which teases out the research ideas so it doesn't look like a bunch of hacks. They'll know where to submit and when, and when they submit a paper the right people will read it and give it a fair shake. When the paper gets accepted it'll be great and it will get some good buzz at SOSP or whatever and then on some blogs and finally all the grad student seminars will read it forever afterwards and try to emulate the success.
On the other hand, the wrong research group is not going to nail as many of these aspects and the papers and grants will suffer accordingly.
Does there exist a research problem so bad that a top person couldn't tease a couple of good papers out of it if they had to? Probably so, but it would have to be really bad...
I had a version of this comment using people's names; it was more fun to write but probably not that helpful.
Suck a dick Welsh. Everyone knows you're so full of hot air you'd make a hot air balloon look solid.
Wow, such insightful commentary - thanks for the feedback!
Normally I would delete a comment like this, but since you've shown how immature and idiotic you are, I'm going to let it stand. Of course, you're not willing to sign your name, not that it really matters who you are anyway. You're a nobody.
Matt -- Just wondering, do you serve on NSF panels now that you are in industry? Based on my experience (and I've managed to serve on at least one outside of the CISE directorate), this type of perspective is exactly what is needed. (In my experience, panels dominated by researchers from industry and national labs are far more reasonable than panels dominated by academics -- and I am an academic).
I've been asked a few times to serve on panels since leaving academia and am happy to do it, but it's hard to justify the time and travel commitment. If I could review proposals without having to travel to DC it would be much easier to say yes. I do a lot of proposal reviewing for Google's research award program and chair the mobile funding committee, so I try to give back since I think it's incredibly important to support academic research.
DeleteI wanna connect your essay to Ellen Ullman’s from over the weekend:
http://www.nytimes.com/2013/05/19/opinion/sunday/how-to-be-a-woman-programmer.html
which is beautifully written.
“What will save you is tacking into the love of the work, into the desire that brought you there in the first place.”
Some of us are motivated by the desire to have “impact,” however that’s defined. (Many Googlers seem to define “impact” as “anything that will improve my work life at Google.”) But others have internal, almost aesthetic motivations. Academia, wonderfully, has room for both. But the impact people can seem crass to the aesthetes, and the aesthetes can seem irrelevant to the impacted. Oh well!
Admittedly “Hot” conferences are supposed to be about impact. (I don’t often attend.) And many papers fall into the pessimal bin of neither pleasing nor impactful. But that’s collateral damage from our system. We don’t know how to support only great work; we do know how to support a lot of mediocre work and some good work.
A great sequence from Nicholson Baker’s The Anthologist about “waste” in poetry applies way more broadly:
http://books.google.com/books?id=GNZl5dEQ7PEC&pg=PT103&lpg=PT103&ots=octAfBxQK3
“What does it mean to be a great poet? It means that you wrote one or two great poems. Or great parts of poems. That’s all it means. Don’t try to picture the waste or it will alarm you.”
Academics are not entitled to receive any public funding at all, whether or not they work on problems you consider high impact, and many don’t, except for their educational work; the entire setup of your post is that “the operating systems community is somewhat stuck in a rut,” which certainly implies that the whole “community” is producing research of no or little value; high impact and aesthetic work are both super important to me, and to repeat, it’s wonderful there’s room for both.
It is the nature of public research funding, and particularly in science, engineering and medicine, that we do not know how to pick "winners" a priori. Hell, we can't even do that with the published papers (e.g., compare contemporary "best paper" awards with any long-term measure of impact and you'll find a poor correlation at best). That does not mean that impact doesn't happen; it means that it's hard to predict where it comes from. It is not merely the appetite to expand our knowledge that drives broad-based STEM funding (in fact I don't know if we have sensitive enough apparatus to measure how low this is on the legislative priority rankings), but experience that out of this kind of portfolio we end up with some real winners, or combinations of winners, that are transformative.
DeleteRight now we have a fight going on in Congress precisely around this issue, where Lamar Smith's bill would require the NSF director to certify that all funded proposals are clearly in the national interest or groundbreaking. The underlying motivation here is the flawed notion that we can make research arbitrarily more efficient through funding selectivity. This works as well as you can prognosticate. This works poorly over the short term and disastrously over the long term.
Stefan - I agree 100%. My point (which has since been deleted) was simply that academics aren't entitled to public funding to work on anything they want. This only works because our society values open-ended research. This is a good thing, but I think it's important to make sure that the politicians who make these decisions understand that.
The only really radical approach I've seen to O/S in 20 years or so is Singularity (http://research.microsoft.com/en-us/projects/singularity/). You have to commend Microsoft for actually funding this kind of project.
I agree that Singularity is cool, but if you think that's the only radical idea in the last 20 years you aren't reading enough.
I'm with you on the general idea that there seems to be little "innovation" happening these days. But I feel it's a much broader phenomenon than system design alone.
Since the 60s/70s (perhaps 80s) it seems nothing truly "new" has even been "discovered", never mind considered. Not systems, nor hardware, nor even programming in general. Most (if not all) research seems to focus on optimizing and/or enhancing old ideas.
Perhaps it's a situation of the possibilities reaching a plateau and requiring some new breakthrough to start a total new direction before new innovation can occur.
For the last 20 years (at least) the most significant developments seem to be JIT compilers and GC (in programming, that is). The 80's fad for RISC processors is where the current batch of mobile chips (i.e. ARM and the like) originated. As for the OS, not much "new" has been developed -- perhaps more "robust" FS's like ZFS. But all of these are simply implementations using one or more ideas from before. No radically new principle.
Thanks Unknown for the link to Singularity. It does "sound" good, but once you get into it - it's still just an amalgamation of different features. Hopefully it's not similar to its namesake and just a "black hole" sucking everything in but not giving anything back.
Maybe I'm off base and "Hot" doesn't need to be something "out-of-this-world". Perhaps something like Singularity is a good idea: i.e. combine all the developments thus far to try and realise all the best ideals. Or at least all those which can interact without too much detriment to each other's operations.
I'm still a bit sceptical about what "should" form part of the system, and what should be left to the application layer. This write-up seems to indicate one place where it pays to move a normally kernel-level function into the application level:
http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html
Perhaps that could be another research idea: which portions of the OS need to be in a monolithic core, and which might benefit from being "plug-in" ancillaries.
Or maybe these aren't very interesting research questions anymore after all. It feels like there has been an endless stream of academic discussions about layering, abstraction, extensibility, etc. in operating systems but maybe we should just declare victory and move on to other, more pressing problems?
DeleteI would like to see virtualization as a fundamental component, maybe the only essential component, of a new OS.
Jono
Arguably this is already very much the case with pretty much every cloud infrastructure system out there: a virtual machine monitor that hosts potentially many "guest OSs" on top.
Maybe OS systems have become too complicated for us mere mortals; perhaps research should be directed towards OS systems being more self-aware of how they relate to other systems, etc. Intelligence, or at least assisted intelligence?
Probably a step (or many thousand steps) beyond better configuration, but a self-aware OS could have resolved your example itself, or at least indicated the problem.
Perhaps systems should be less cloaked in black magic and configured in more natural language. Although that might result in some IT staff redundancies when the mail room can manage the IT operation during their lunch break!
Well, I guess that given that a sockets allocation was reported as a "file descriptor" constraint, I can guess which OS you were working with.
Many of the problems you describe have to do with metrics in OS internals. Better diagnostics and reporting would be helpful. It does add overhead, but that's a good topic for academic research.
Otherwise these are application design and integration problems.
Matt, have you heard of the word "disruptive" in the context of research? Academia should be working on problems that are 5-10 years forward-looking -- so much so that it *might* (not should) transform or create a new industry. Suggesting that academics solve current problems that you face in the industry, whatever the scale is, is below the belt. Such behavior may be pathological to some Google employees, but "scale" does not always necessarily bring interesting and/or futuristic research problems.
I would *LOVE* it if more academics would focus on problems that are 5-10 years out. Hence my suggestion to work on radically new computing platforms (assuming you made it to the end of the blog post, maybe you didn't).
Regardless, many academics are focused on problems that are 6-12 months out, and as long as they are going to do this, then I think it would be much more interesting for them to work on problems that could have huge impact (rather than quibbling over the best API for disk I/O, a problem which has been around for 20 years and frankly is not something that is holding up human progress).
I do not think that academics should exclusively focus on 5-to-10-year problems. The risk of going too far in that direction is that academia becomes completely irrelevant. I advocate a portfolio of near-term and far-term research; my argument is that the current trend towards near-term focus is often *too* near term and often uninteresting.
Nice stuff, keep posting...