Increasingly, organizations capture, transform and analyze
enormous data sets. Prominent examples include internet
companies and e-science. The Map-Reduce scalable dataflow
paradigm has become popular for these applications. Its
simple, explicit dataflow programming model is favored by
some over the traditional high-level declarative approach:
SQL. On the other hand, the extreme simplicity of Map-
Reduce leads to much low-level hacking to deal with the
many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such
as join by hand. These practices waste time, introduce bugs,
harm readability, and impede optimizations.
Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style
high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom
Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and
executed in the Hadoop Map-Reduce environment. Both Pig
and Hadoop are open-source projects administered by the
Apache Software Foundation.
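As an illustration of such an explicit dataflow, a short Pig Latin sketch might look as follows; the input files, field names, and output path here are hypothetical, chosen only to show a multi-step program with a built-in join rather than one hand-coded in Map-Reduce:

```pig
-- Load two datasets (illustrative file names and schemas).
visits = LOAD 'visits' AS (user, url, time);
pages  = LOAD 'pages'  AS (url, pagerank);

-- A declarative, SQL-style join: no hand-written join code.
joined = JOIN visits BY url, pages BY url;

-- Group and aggregate; Pig compiles these steps into
-- a sequence of Map-Reduce jobs run on Hadoop.
grouped = GROUP joined BY user;
result  = FOREACH grouped GENERATE group, AVG(joined::pagerank);

STORE result INTO 'avg_pagerank_by_user';
```

Each line names an intermediate relation, so the dataflow is explicit step by step, while operations such as JOIN and GROUP remain high-level constructs.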
This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig
execution and raw Map-Reduce execution.