Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Gates, Alan F.; Natkovich, Olga; Chopra, Shubham; Kamath, Pradeep; Narayanamurthy, Shravan M.; Olston, Christopher; Reed, Benjamin; Srinivasan, Santhosh; Srivastava, Utkarsh

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as …
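The "standard operations" the abstract alludes to can be illustrated with a minimal sketch (plain Python, not Pig Latin; the record field name is invented): a group-and-count that raw Map-Reduce users re-implement by hand, and that a dataflow language like Pig expresses in a single statement.

```python
from collections import defaultdict

# Illustrative sketch of a hand-coded group-and-count, the kind of
# "standard operation" a dataflow language would express declaratively.
# The "url" field is a made-up example.

def map_phase(records):
    # Emit (key, 1) for each record's grouping field.
    for rec in records:
        yield rec["url"], 1

def shuffle(pairs):
    # Group emitted values by key, as the framework's shuffle would.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum the counts per key.
    return {k: sum(vs) for k, vs in groups.items()}

clicks = [{"url": "a"}, {"url": "b"}, {"url": "a"}]
counts = reduce_phase(shuffle(map_phase(clicks)))
# counts == {"a": 2, "b": 1}
```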


Improving MapReduce Performance in Heterogeneous Environments

Zaharia, Matei; Konwinski, Andy; Joseph, Anthony D.; Katz, Randy; Stoica, Ion

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop’s performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that ap…
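The linear-progress assumption the abstract describes can be sketched as a simple heuristic (illustrative Python only, not Hadoop's scheduler code; the threshold and task tuples are invented): estimate each task's remaining time from its observed progress rate, then flag outliers as speculation candidates.

```python
def time_left(progress, elapsed):
    # progress in (0, 1]. Under the linearity assumption the abstract
    # mentions, the rate is constant, so remaining = (1 - p) / rate.
    rate = progress / elapsed
    return (1.0 - progress) / rate

def pick_stragglers(tasks, now, threshold=2.0):
    # tasks: list of (task_id, progress, start_time). Flag a task when
    # its estimated remaining time exceeds `threshold` times the median
    # estimate. Purely illustrative policy.
    estimates = {tid: time_left(p, now - t0) for tid, p, t0 in tasks}
    med = sorted(estimates.values())[len(estimates) // 2]
    return [tid for tid, est in estimates.items() if est > threshold * med]

tasks = [("t1", 0.9, 0.0), ("t2", 0.8, 0.0), ("t3", 0.1, 0.0)]
# at now=10.0 only t3 (estimated 90s remaining vs median 2.5s) is flagged
```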


Efficient Parallel Set-Similarity Joins Using MapReduce

Vernica, Rares; Carey, Michael J.; Li, Chen

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory …
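For reference, the set-similarity condition itself can be sketched with Jaccard similarity (a common choice for such joins; this naive quadratic self-join is what the paper's distributed 3-stage plan is designed to replace, and the records here are invented):

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def setsim_self_join(records, tau):
    # Naive single-machine self-join: output id pairs whose token sets
    # meet the similarity threshold tau. A parallel plan would partition
    # records across nodes instead of comparing all pairs.
    out = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i][1], records[j][1]) >= tau:
                out.append((records[i][0], records[j][0]))
    return out

recs = [(1, ["a", "b", "c"]), (2, ["a", "b", "d"]), (3, ["x", "y"])]
# jaccard of records 1 and 2 is 2/4 = 0.5, so they join at tau = 0.5
```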


Hadoop: The Definitive Guide - MapReduce for the Cloud

White, Tom; Gray, Jonathan; Stack, Michael

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:


Introduction to cloud computing

Lu, Jiaheng

MapReduce and parallel DBMSs: friends or foes?

Stonebraker, Michael; Abadi, Daniel; DeWitt, David J.; Madden, Sam; Paulson, Erik; Pavlo, Andrew; Rasin, Alexander

MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.


How Do Documents and Databases Fit Together? CouchDB as a Convenient REST-Based Database Alternative

Pientka, Frank

As a document-oriented database for the internet, CouchDB differs fundamentally from classical relational databases. It builds consistently on the popular MapReduce algorithm and on internet standards such as the JSON interchange format and the REST protocol. In this article we discuss the background of what a highly scalable data architecture for the web could look like today, and how we can realize one using CouchDB as an example.
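The document-oriented map/reduce view model the article describes can be sketched in Python (CouchDB itself defines views in JavaScript; the documents and fields here are invented): map emits (key, value) pairs per JSON document, and reduce folds the values for each key.

```python
import json

# Illustrative sketch of a document-database view, not CouchDB's API.
docs = [json.loads(s) for s in (
    '{"type": "post", "author": "ada"}',
    '{"type": "post", "author": "bob"}',
    '{"type": "post", "author": "ada"}',
)]

def map_fn(doc):
    # Emit one (author, 1) pair per matching document.
    if doc["type"] == "post":
        yield doc["author"], 1

def reduce_fn(values):
    # Fold the emitted values for one key.
    return sum(values)

view = {}
for doc in docs:
    for key, value in map_fn(doc):
        view.setdefault(key, []).append(value)
result = {k: reduce_fn(vs) for k, vs in view.items()}
# result == {"ada": 2, "bob": 1}
```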


Map-reduce-merge: simplified relational data processing on large clusters

Yang, Hung-chih; Dasdan, Ali; Hsiao, Ruey-Lung; Parker, D. Stott

Map-Reduce is a programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. Through a simple interface with two functions, map and reduce, this model facilitates parallel implementation of many real-world tasks such as data processing for search engines and machine learning. However, this model does not directly support processing multiple related heterogeneous datasets. While processing relational data is a common need, this limitation causes dif…
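The merge extension the title names can be sketched as follows (illustrative Python, not the paper's formal model; the datasets and keys are invented): run map/reduce over two datasets separately, then a merge step joins the two reduced outputs on their keys.

```python
from collections import defaultdict

# Sketch of the Map-Reduce-Merge idea: two independent map/reduce
# passes, followed by a merge that relates their outputs (here an
# equi-join on key; the paper's merge is more general).

def map_reduce(pairs):
    # Group (key, value) pairs and sum per key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

def merge(left, right):
    # Join the two reduced datasets on keys present in both.
    return {k: (left[k], right[k]) for k in left.keys() & right.keys()}

orders = map_reduce([("cust1", 10), ("cust1", 5), ("cust2", 7)])
refunds = map_reduce([("cust1", 2)])
joined = merge(orders, refunds)
# joined == {"cust1": (15, 2)}
```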


Ad-hoc data processing in the cloud

Logothetis, Dionysios; Yocum, Kenneth

Ad-hoc data processing has proven to be a critical paradigm for Internet companies processing large volumes of unstructured data. However, the emergence of cloud-based computing, where storage and CPU are outsourced to multiple third-parties across the globe, implies large collections of highly distributed and continuously evolving data. Our demonstration combines the power and simplicity of the MapReduce abstraction with a wide-scale distributed stream processor, Mortar. While our incremental MapReduce operators avoid data re-processing, the stream processor man…
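The idea of incremental operators that avoid re-processing can be sketched as follows (illustrative Python; Mortar's actual operators are more general): new records update the previously reduced state instead of re-running the job over all data.

```python
from collections import Counter

# Sketch of an incremental MapReduce-style word count: each new batch
# is mapped and combined locally, then its delta is merged into the
# already-reduced state, so old data is never touched again.

def update(state, new_records):
    delta = Counter()
    for rec in new_records:
        for word in rec.split():
            delta[word] += 1          # map + combine on the new batch only
    state.update(delta)               # merge the delta into reduced state
    return state

state = Counter()
update(state, ["a b a"])
update(state, ["b c"])                # only this batch is processed here
# state == Counter({"a": 2, "b": 2, "c": 1})
```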
