Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Gates, Alan F.; Natkovich, Olga; Chopra, Shubham; Kamath, Pradeep; Narayanamurthy, Shravan M.; Olston, Christopher; Reed, Benjamin; Srinivasan, Santhosh; Srivastava, Utkarsh

Increasingly, organizations capture, transform and analyze
enormous data sets. Prominent examples include internet
companies and e-science. The Map-Reduce scalable dataflow
paradigm has become popular for these applications. Its
simple, explicit dataflow programming model is favored by
some over the traditional high-level declarative approach:
SQL. On the other hand, the extreme simplicity of Map-Reduce
leads to much low-level hacking to deal with the
many-step, branching dataflows that arise in practice.
Moreover, users must repeatedly code standard operations such


Some sample programs written in DryadLINQ

Yu, Yuan; Isard, Michael; Fetterly, Dennis; Budiu, Mihai; Erlingsson, Ulfar; Gunda, Pradeep Kumar; Currey, Jon; McSherry, Frank; Achan, Kannan; Poulain, Christophe

The goal of this document is to illustrate the use of the DryadLINQ parallel computation framework through
a set of examples. For each program we present the essential source code and a brief description. This
document does not describe the installation or configuration of DryadLINQ or the configuration
parameters which can be used to influence the compilation and execution. A non-commercial release of
the DryadLINQ research software is available for download at


DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yu, Y; Isard, M; Fetterly, D; Budiu, M; Erlingsson, Ú; Gunda, PK; Currey, J

DryadLINQ is a system and a set of language extensions
that enable a new programming model for large scale
distributed computing. It generalizes previous execution
environments such as SQL, MapReduce, and Dryad in two
ways: by adopting an expressive data model of strongly
typed .NET objects; and by supporting general-purpose
imperative and declarative operations on datasets within
a traditional high-level programming language.
A DryadLINQ program is a sequential program
composed of LINQ expressions performing arbitrary side-
