Data-parallel pipelines using high-level languages

Microsoft, “DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language,” OSDI, 2008. [PDF]

Google, “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” PLDI, 2010. [LINK]

Background

Data-parallel computing systems expose high-level abstractions that let users reason about distributed computations, while the system handles low-level tasks such as scheduling and automated fault tolerance without any user input. At one end of the spectrum, MapReduce exposes a limited interface consisting of only a map, a shuffle, and a reduce step, which makes it hard to write multi-stage workflows or data-parallel pipelines; a common workaround is to use scripting languages to stitch together a sequence of MapReduce jobs. At the other end of the spectrum lies Dryad, which allows creating arbitrary DAGs to represent long and complex workflows, but without proper language support.
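To make the limitation concrete, here is a minimal single-machine sketch (not either system's actual API) of MapReduce's three fixed steps; any workflow longer than one map-shuffle-reduce round has to be expressed as multiple such jobs chained by an external driver:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy model of MapReduce's fixed map -> shuffle -> reduce interface."""
    # Map: each record emits zero or more (key, value) pairs.
    pairs = [kv for rec in records for kv in map_fn(rec)]
    # Shuffle: group values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce: fold each group into a final value.
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}

# Word count fits in one job; a second stage (e.g., ranking the counts)
# would need another job consuming this job's output, stitched together
# by a driver script.
counts = run_mapreduce(
    ["a b a", "b c"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
```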

The key piece missing from all these frameworks is an even higher-level abstraction that will

  1. be expressive enough to create arbitrary workflows/pipelines,
  2. remove worry about data representations and underlying implementation details,
  3. permit dynamic optimization across multiple MapReduce jobs, and
  4. be accessible to developers by providing integration with familiar sequential languages.

DryadLINQ and FlumeJava

To address these challenges in Dryad and MapReduce, Microsoft and Google introduced DryadLINQ and FlumeJava, respectively. The two are very similar in design and operation, differing mainly in implementation. DryadLINQ is integrated into the .NET family of languages via LINQ, whereas FlumeJava is a library written in Java. In both cases, developers write pipelines that look like normal C# or Java programs but internally operate on distributed datasets through a set of core primitives (including map, shuffle, combine, and reduce). Given a workflow, the system builds a corresponding DAG representation, optimizes that DAG by compacting it into a set of kernels, and runs those kernels efficiently on the underlying Dryad or MapReduce framework.
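The deferred-evaluation-plus-optimization idea can be illustrated with a small sketch. This is not FlumeJava's or DryadLINQ's real API; the `Pipeline` class, its methods, and the fusion rule are illustrative stand-ins. Operators only record a plan, the optimizer fuses adjacent map steps into one kernel (in the spirit of FlumeJava's ParallelDo fusion), and nothing executes until the result is demanded:

```python
class Pipeline:
    """Hedged sketch of a deferred data-parallel pipeline with plan fusion."""

    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []  # list of ("map", fn) or ("filter", fn) steps

    def map(self, fn):
        # Record the step; do not execute it yet.
        return Pipeline(self.data, self.plan + [("map", fn)])

    def filter(self, pred):
        return Pipeline(self.data, self.plan + [("filter", pred)])

    def optimized_plan(self):
        # Fuse consecutive map steps into a single composed kernel.
        fused = []
        for op, fn in self.plan:
            if op == "map" and fused and fused[-1][0] == "map":
                prev = fused[-1][1]
                fused[-1] = ("map", lambda x, f=fn, g=prev: f(g(x)))
            else:
                fused.append((op, fn))
        return fused

    def run(self):
        # Only here does anything execute, over the compacted plan.
        out = list(self.data)
        for op, fn in self.optimized_plan():
            if op == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

p = (Pipeline(range(6))
     .map(lambda x: x + 1)
     .map(lambda x: x * 2)      # fused with the previous map
     .filter(lambda x: x > 6))
result = p.run()                 # three logical steps, two fused kernels
```

In the real systems the same pattern pays off at scale: fusing steps before handing the DAG to Dryad or MapReduce avoids materializing (and shuffling) intermediate datasets between stages.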

Comments

Higher-level abstractions are only going to become more common as data-parallel computing matures. They increase productivity, efficiency, and usability, while decreasing code length and the likelihood of bugs. The success of FlumeJava has given rise to Crunch, its parallel in the open-source world.

Also, these frameworks presented a rare surprise: this time, Google apparently entered the business after Microsoft!
