This post is reprinted from Communications of the ACM. The original post by Michael Stonebraker (MIT CSAIL) and Jeremy Kepner (MIT CSAIL, MIT Lincoln Lab, MIT Math Department) can be found here.

Hadoop has spread rapidly in the last few years as a platform for parallel computation in Java. As such, it has achieved the goal of bringing parallel processing to the millions of programmers in the Java community. Prior attempts to do this (Java Grande, JavaHPC) have not succeeded, and we applaud Hadoop for its success in this area, which we believe is due largely to the simplicity and accessibility of its environment.

However, we see that considerable improvement will be required for serious production use, at least in the science domain of a large National Laboratory such as Lincoln Labs, where one of us works. Briefly, the dominant use cases for Hadoop in our environment are:

  • Parallel computation, mostly scientific analytical codes
  • Information aggregation and rollups of data sets

We discuss these two use cases in turn.

Hadoop Computation

Many of our scientific codes organize a collection of nodes into a 2-D (or 3-D or N-D) rectangular partitioned grid.  Then, each node executes the following template:

Until end-condition {
    Local computation on local data partition
    Output state
    Send/receive data to/from a subset of nodes holding other data partitions
}

This template describes most computational fluid dynamics (CFD) codes, essentially all atmospheric and ocean simulation models, linear algebra operations, sparse graph operations, image processing, and signal processing.  When considering this class of problems in Hadoop, the following issues arise:

First, the local computation is very stateful from one iteration to the next, and preserving state across successive Map-Reduce steps requires writing it to the file system, an expensive operation.  Second, these codes require direct node-to-node communication, which is not supported in the Map-Reduce framework.  Third, these codes bind a particular computation to the same node across iterations, which again is not supported in the Map-Reduce model.
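The iterative template above can be sketched in plain Java.  What follows is a minimal single-process simulation of a 1-D grid partitioned across a few "nodes" with halo exchange between neighbors; all class and method names are illustrative, and this is not Hadoop or MPI code.  It exhibits exactly the two properties Map-Reduce lacks: in-memory state that persists across iterations, and direct neighbor-to-neighbor data exchange.

```java
import java.util.Arrays;

// Minimal sketch of the iterative template: each simulated "node" owns one
// partition of a 1-D grid, keeps its state in memory across iterations, and
// exchanges boundary (halo) cells with its neighbors between steps.
public class StencilSketch {

    // One local relaxation step: each interior cell becomes the average of
    // its two neighbors.  Positions 0 and length-1 are ghost cells holding
    // data "received" from neighboring nodes.
    static double[] localStep(double[] withHalo) {
        double[] next = Arrays.copyOf(withHalo, withHalo.length);
        for (int i = 1; i < withHalo.length - 1; i++) {
            next[i] = 0.5 * (withHalo[i - 1] + withHalo[i + 1]);
        }
        return next;
    }

    // Iterate until the end condition (here, a fixed iteration count).
    // State lives in memory the whole time -- nothing is written to a file
    // system between steps -- and halo exchange is direct node-to-node.
    static double[][] run(double[][] parts, int iterations) {
        for (int it = 0; it < iterations; it++) {
            // "Send/receive": gather every partition's halo cells first, so
            // all nodes step on the same iteration's data (Jacobi style).
            double[][] withHalo = new double[parts.length][];
            for (int p = 0; p < parts.length; p++) {
                int n = parts[p].length;
                double[] h = new double[n + 2];
                h[0] = (p > 0) ? parts[p - 1][parts[p - 1].length - 1]
                               : parts[p][0];          // grid boundary
                System.arraycopy(parts[p], 0, h, 1, n);
                h[n + 1] = (p < parts.length - 1) ? parts[p + 1][0]
                               : parts[p][n - 1];      // grid boundary
                withHalo[p] = h;
            }
            // Local computation on the local data partition.
            for (int p = 0; p < parts.length; p++) {
                double[] stepped = localStep(withHalo[p]);
                parts[p] = Arrays.copyOfRange(stepped, 1, stepped.length - 1);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        double[][] grid = { {0.0, 0.0}, {2.0, 2.0} };
        System.out.println(Arrays.deepToString(run(grid, 10)));
    }
}
```

Expressing this in Map-Reduce would force the per-partition state through the file system on every iteration and would route the halo exchange through a global shuffle rather than a point-to-point transfer.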

We estimate that the Map-Reduce model works well for about 5% of Lincoln Labs users.  The other 95% must shoehorn their computation into this model, and pay 1-2 orders of magnitude in performance as a result.  Few scientists are willing to take this hit, except on very small data sets.

Many argue that performance does not matter.  That may be true at the low end; however, for the loads we see at Lincoln Labs as well as the data centers we are familiar with, performance matters a lot, and there are never enough computing resources.  Our institution is leading a collaborative investment of $100M to place our next-generation supercomputing center near a hydroelectric dam to reduce its carbon footprint by an order of magnitude.  The performance loss inherent in Hadoop is an impossible-to-absorb penalty.

Even at lower scale, it is extremely eco-unfriendly to waste power using an inefficient system like Hadoop.

In summary, we see the following steps in Hadoop adoption in the computation arena:

Step 1: Adopt Hadoop for pilot projects

Step 2: Scale Hadoop to production use

Step 3: Hit the wall, as the above problems become big issues

Step 4: Morph to something that deals with our issues.

At Lincoln Labs we have projects at all four steps.  Survival of Hadoop in our environment will require major surgery to the parallel computation model, complementing the current Hadoop work on the task scheduler.  Our expectation is that solving these issues will make current Hadoop unrecognizable in future systems.

It is possible that other shops have a job mix that is more aligned with the current Map-Reduce framework.  However, our expectation is that we are more the norm than the exception.  The evolution of Google away from Map-Reduce to other models lends credence to this supposition.  Hence, we fully expect a dramatic future evolution of the Hadoop computation framework.

Hadoop Data Management

Forty years of DBMS research and enterprise best practice have confirmed the position advocated by Ted Codd in 1970:  programmer and system efficiency are maximized when data management operations are coded in a high-level language rather than a record-at-a-time language.  Although the Map-Reduce model is higher level than a record-at-a-time system, it is obviously far easier to code queries in Hive than by using Map-Reduce directly.  Hence, we see essentially all Hadoop data management moving to higher-level languages, i.e., to SQL and SQL-like languages.

For example, according to David DeWitt (1), the Facebook Hadoop cluster is programmed almost exclusively in Hive, a high-level data access language that looks very much like SQL.  At Lincoln Labs a similar trajectory is occurring, although the higher-level language of choice is not Hive but a high-level sparse linear algebraic interface to the data (2,3).
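To make the gap concrete, here is an illustrative sketch (our own code, not the Hive or Hadoop API): the aggregation a Hive user writes as a one-line GROUP BY must be spelled out as an explicit map/shuffle/reduce when coded by hand.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// The whole program below corresponds, roughly, to one line of Hive:
//   SELECT word, COUNT(*) FROM words GROUP BY word;
// Here the same aggregation is written out by hand over an in-memory
// "input split".  Class and method names are illustrative only.
public class WordCount {

    static Map<String, Integer> mapReduce(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {                      // map: emit (word, 1)
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;
                counts.merge(word, 1, Integer::sum);     // shuffle+reduce: sum by key
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(mapReduce(List.of("a b a", "b a")));
    }
}
```

In a real Hadoop job, the map, shuffle, and reduce phases would be split across Mapper and Reducer classes plus job-configuration boilerplate, which is precisely the code a declarative language generates for you.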

As such, the current Map-Reduce model becomes an interface that is internal to the DBMS.  That is, Hive users do not care what is underneath the Hive query language, and the current Map-Reduce interface disappears into the innards of a DBMS.  After all, how many of you actually care what wire protocol a parallel DBMS uses to communicate with worker code at individual nodes?

One of us has written five parallel DBMSs, and is obviously quite familiar with the protocol required to communicate between a query coordinator and multiple workers at various local nodes.  Moreover, worker nodes must communicate with each other to pass intermediate data around.  The following characteristics are required for a high performance system:

  • Stateful worker nodes, which can retain state across successive steps in a distributed query plan
  • Point-to-point communication
  • Binding query processing to data that is local to the node.

Roughly speaking, a DBMS wants the same kinds of features as the scientific codes noted above.  In summary, Map-Reduce is an internal interface in a parallel DBMS, and one that is not well suited to the needs of a DBMS.

Some of us wrote a paper in 2009 comparing parallel DBMS technology with Hadoop (4).  In round numbers, DBMSs are faster by 1-2 orders of magnitude.  This performance advantage comes from indexing the data, making sure that queries are always sent to the nodes where the data resides and not the other way around, superior compression, and superior protocols between worker nodes.  As near as we can tell, the situation in 2012 is about the same as in 2009; Hadoop is still 1-2 orders of magnitude off the mark.  Anecdotal evidence abounds.  For example, one large web property has a 5 Pbyte Hadoop cluster deployed on 2,700 nodes; a second has a 5 Pbyte instance supported by a commercial DBMS that uses 200 nodes, a factor of 13 fewer.
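The "send queries to the data" point can be illustrated with a toy sketch (our own illustrative code, not any real DBMS's API): when rows are hash-partitioned by key, a point lookup touches exactly one partition, while a placement-oblivious system must ask every partition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of query shipping: rows are hash-partitioned across N simulated
// nodes.  A coordinator that knows the partitioning scheme routes a lookup
// to the single owning node; one that does not must broadcast to all nodes.
public class QueryShipping {
    final List<Map<String, String>> nodes = new ArrayList<>();
    int partitionsScanned = 0;   // instrumentation for the comparison

    QueryShipping(int n) {
        for (int i = 0; i < n; i++) nodes.add(new HashMap<>());
    }

    int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), nodes.size());
    }

    void insert(String key, String value) {
        nodes.get(partitionOf(key)).put(key, value);
    }

    // Parallel-DBMS style: route the lookup to the one node owning the key.
    String lookupRouted(String key) {
        partitionsScanned++;
        return nodes.get(partitionOf(key)).get(key);
    }

    // Placement-oblivious style: every node must be consulted.
    String lookupBroadcast(String key) {
        for (Map<String, String> part : nodes) {
            partitionsScanned++;
            String v = part.get(key);
            if (v != null) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        QueryShipping db = new QueryShipping(200);
        db.insert("sensor-42", "17.3");
        db.lookupRouted("sensor-42");
        System.out.println("partitions touched: " + db.partitionsScanned);
    }
}
```

Indexing and compression compound the same effect: each cuts the work per node, so the routed system needs far fewer nodes for the same workload.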

Therefore, we see the following trajectory in Hadoop data management:

Step 1: Adopt Hadoop for pilot projects

Step 2: Scale Hadoop to production use

Step 3: Observe an unacceptable performance penalty

Step 4: Morph to a real parallel DBMS

Most Hadoop sites are somewhere between Steps 2 and 3, and “hitting the wall” is still in their future.  Either Hadoop will grow into a real parallel DBMS in the necessary time frame, or users will move to other solutions, whether built with replacement parts inserted into Hadoop, with interfaces to Hadoop for ingesting data, or in some other way.  Given the slow progress we have observed in the last three years, our money is on the second outcome.

In summary, the Gartner Group has formulated the well-known “hype cycle” (5) to describe the evolution of a new technology from inception onward.  Current Hadoop is touted by its advocates as the “best thing since sliced bread.”  We hope that its shortcomings can be fixed in time for it to live up to its promise.



(2) Jeremy Kepner et al., “Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System,” Proc. ICASSP, March 25-30, 2012.


(4) Andy Pavlo et al., “A Comparison of Approaches to Large Scale Data Analysis,” Proc. SIGMOD 2009, Providence, RI, June 2009.