High-level platforms on top of Hadoop

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins, “Pig Latin: A Not-So-Foreign Language for Data Processing,” SIGMOD, 2008. [PDF]

Facebook Data Team, “Hive: Data Warehousing and Analytics on Hadoop.” [LINK]

Summary

Pig and Hive are higher-level programming interfaces to Hadoop, developed by Yahoo! and Facebook respectively, each with its own data management tools and related optimizations. Pig resembles a scripting language for building workflows composed of multiple MapReduce jobs, and its authors claim it hits a sweet spot between SQL and MapReduce. Hive is closer to SQL in look and feel, and it logically arranges data in schema-defined tables, much like an RDBMS.

Pig vs Hive

In both cases, the end goal is to enable complex workflows that require multiple MapReduce jobs without changing the underlying execution engine (Hadoop) or the common storage system (HDFS). However, Pig seems to have a few more tricks than Hive in terms of query optimization and debugging. A nice side-by-side comparison can be found here (probably a little outdated).
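To make "a workflow that requires multiple MapReduce jobs" concrete, here is a minimal single-process sketch in Python. It is not Pig, Hive, or Hadoop code: `run_mapreduce`, the mapper and reducer functions, and the `visits` data are all hypothetical names invented for illustration. It chains two toy jobs (count visits per URL, then pick the top k), which is roughly the kind of plan a few lines of Pig Latin or a single Hive query with GROUP BY, ORDER BY, and LIMIT would have to be turned into.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory stand-in for one MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    # Map phase: each record emits zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce phase: keys are visited in sorted order for deterministic output.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: count visits per URL.
def count_mapper(record):
    _user, url = record
    yield url, 1


def count_reducer(url, ones):
    yield url, sum(ones)


# Job 2: send every (url, count) pair to one key and keep the k largest.
def topk_mapper(record):
    url, count = record
    yield "all", (count, url)


def make_topk_reducer(k):
    def reducer(_key, pairs):
        # Sort by descending count, breaking ties by URL.
        for count, url in sorted(pairs, key=lambda p: (-p[0], p[1]))[:k]:
            yield url, count
    return reducer


def top_urls(visits, k):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    counts = run_mapreduce(visits, count_mapper, count_reducer)
    return run_mapreduce(counts, topk_mapper, make_topk_reducer(k))


visits = [
    ("amy", "a.com"), ("bob", "b.com"), ("amy", "b.com"),
    ("cat", "c.com"), ("bob", "b.com"), ("cat", "a.com"),
]

if __name__ == "__main__":
    print(top_urls(visits, 2))  # [('b.com', 3), ('a.com', 2)]
```

Writing the mappers, reducers, and the glue between jobs by hand is the tedium that both systems aim to hide, while leaving Hadoop and HDFS underneath unchanged.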

Critique

Pig and Hive are open-source answers to Microsoft’s Dryad, DryadLINQ, and SCOPE for creating extended workflows. Both want to keep the underlying systems, Hadoop and HDFS, untouched, but each makes different tradeoffs and assumptions in its design. Pig assumes its jobs are ad hoc, so it does not emphasize building indices or playing other DB tricks to make repetitive jobs faster. Hive, on the other hand, assumes a large number of small queries on the same data, and hence it builds indices and enforces schemas.

Both systems have been, and continue to be, influential due to their heavy usage at Yahoo! and Facebook. There are still many opportunities for optimization, though, which can inspire and are inspiring academic research. The question is whether we run the risk of reinventing everything that DB people invented a while ago; my guess is (unfortunately) yes.

Mosharaf Chowdhury
