There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative style of SQL unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse.
We describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation.
We give a few examples of how engineers at Yahoo! are using Pig to dramatically reduce the time required for the development and execution of their data analysis tasks, compared to using Hadoop directly. We also report on a novel debugging environment that comes integrated with Pig and can lead to even higher productivity gains. Pig is an open-source, Apache-incubator project, and is available for general use.
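To illustrate the style of programming the abstract describes, the following is a brief Pig Latin sketch (the data set, field names, and filter condition here are hypothetical, chosen only to show the flavor of the language, not taken from the paper). Each statement names an intermediate result, giving the script a procedural, step-by-step feel, while each individual operation (FILTER, GROUP, FOREACH) remains declarative and is compiled by Pig into map-reduce jobs on Hadoop:

```
-- Hypothetical example: count page visits per URL for .edu sites.
visits    = LOAD 'visits.txt' AS (user, url, time);
edu_only  = FILTER visits BY url MATCHES '.*\\.edu.*';
grouped   = GROUP edu_only BY url;
counts    = FOREACH grouped GENERATE group, COUNT(edu_only);
STORE counts INTO 'edu_visit_counts';
```

Writing the equivalent logic directly against the Hadoop map-reduce API would require custom map and reduce classes plus job-configuration code for each stage, which is the maintenance burden the abstract alludes to.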