Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Gates, Alan F.; Natkovich, Olga; Chopra, Shubham; Kamath, Pradeep; Narayanamurthy, Shravan M.; Olston, Christopher; Reed, Benjamin; Srinivasan, Santhosh; Srivastava, Utkarsh

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.
Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.
This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
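
To make the point about hand-coded joins concrete, the following is a minimal sketch of a reduce-side equi-join written as explicit map, shuffle, and reduce steps. It is plain Python that simulates the phases in memory, not Hadoop's API and not code from the paper; the function names, the record layout, and the sample data are all illustrative assumptions.

```python
from collections import defaultdict
from itertools import product


def map_phase(tagged_inputs):
    """Emit (join_key, (source_tag, record)) for every input record."""
    for tag, records, key_index in tagged_inputs:
        for record in records:
            yield record[key_index], (tag, record)


def shuffle(pairs):
    """Group mapped values by key; stands in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, left_tag, right_tag):
    """For each key, pair every left record with every right record."""
    for key in sorted(groups):
        left = [rec for tag, rec in groups[key] if tag == left_tag]
        right = [rec for tag, rec in groups[key] if tag == right_tag]
        for l_rec, r_rec in product(left, right):
            yield l_rec + r_rec


def mapreduce_join(left, right, left_key=0, right_key=0):
    """Inner equi-join of two lists of tuples on the given key positions."""
    inputs = [("L", left, left_key), ("R", right, right_key)]
    return list(reduce_phase(shuffle(map_phase(inputs)), "L", "R"))


# Hypothetical sample data: page visits and user ages, joined on user name.
visits = [("alice", "a.com"), ("bob", "b.com"), ("alice", "c.com")]
users = [("alice", 30), ("carol", 41)]
joined = mapreduce_join(visits, users)
```

Even in this toy form, the join logic (tagging records by source, grouping by key, pairing the two sides) has to be rewritten for every new pair of inputs, which is the kind of repeated boilerplate that Pig's high-level constructs are meant to replace.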

VLDB 2009