MapReduce

CoHadoop: flexible data placement and its exploitation in Hadoop

Authors: 
Eltabakh, MY; Tian, Y; Özcan, F; Gemulla, R; Krettek, A; McPherson, J

Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoHadoop retains the flexibility of Hadoop in that it does not require users to convert their data to a certain format (e.g., a relational …

Year: 
2011
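
Colocation pays off in operations like joins of copartitioned files: when the matching partitions sit on the same node, each partition pair can be joined locally with no shuffle. A minimal plain-Java sketch of such a partition-local merge join, with inline data standing in for two colocated blocks (the file contents and the unique-key assumption are illustrative, not CoHadoop's API):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.StringReader;

    // Toy partition-local merge join. With CoHadoop-style colocation, the
    // matching partitions of two copartitioned files live on the same node,
    // so a join like this runs locally, with no shuffle. Inputs are
    // "key,value" lines sorted by key; the inline data stands in for two
    // colocated HDFS blocks.
    public class LocalMergeJoin {
        public static void main(String[] args) throws IOException {
            BufferedReader orders = new BufferedReader(
                new StringReader("1,o-alpha\n3,o-beta\n5,o-gamma\n"));
            BufferedReader items = new BufferedReader(
                new StringReader("1,i-screw\n1,i-bolt\n4,i-nut\n5,i-washer\n"));
            String o = orders.readLine(), it = items.readLine();
            while (o != null && it != null) {
                int ko = Integer.parseInt(o.split(",")[0]);
                int ki = Integer.parseInt(it.split(",")[0]);
                if (ko == ki) {              // match: emit joined record
                    System.out.println(o + " | " + it);
                    it = items.readLine();   // advance items (order keys are unique)
                } else if (ko < ki) {
                    o = orders.readLine();
                } else {
                    it = items.readLine();
                }
            }
        }
    }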

MapDupReducer: Detecting Near Duplicates over Massive Datasets

Authors: 
Wang, Chaokun; Wang, Jianmin; Lin, Xuemin; Wang, Wei; Wang, Haixun; Li, Hongsong; Tian, Wanpeng; Xu, Jun; Li, Rui

Near duplicate detection benefits many applications, e.g., on-line news selection over the Web by keyword search. The purpose of this demo is to show the design and implementation of MapDupReducer, a MapReduce based system capable of detecting near duplicates over massive datasets efficiently.

Year: 
2010
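
The excerpt does not spell out the algorithm, so here is a hedged sketch of a general pattern such systems use: prefix filtering for set-similarity joins, where two token sets with Jaccard similarity at least T must share a token in a short sorted prefix, so only records sharing a prefix token become candidate pairs. The documents, threshold, and tokenizer are invented for illustration, and this is not necessarily MapDupReducer's exact method:

    import java.util.*;

    // Prefix filtering for near-duplicate detection: two token sets with
    // Jaccard similarity >= T must share at least one token among the first
    // n - ceil(T*n) + 1 of their sorted tokens, so emitting only those prefix
    // tokens as grouping keys prunes most pairs before exact verification.
    public class NearDupSketch {
        static final double T = 0.7;

        static List<String> tokens(String doc) {
            return new ArrayList<>(new TreeSet<>(
                Arrays.asList(doc.toLowerCase().split("\\W+"))));
        }

        static double jaccard(List<String> a, List<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return (double) inter.size() / union.size();
        }

        public static void main(String[] args) {
            List<String> docs = Arrays.asList(
                "the quick brown fox jumps over the lazy dog",
                "the quick brown fox jumped over the lazy dog",
                "an entirely different sentence about databases");
            // "Map" phase: emit (prefix token, document id) pairs.
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (int id = 0; id < docs.size(); id++) {
                List<String> toks = tokens(docs.get(id));
                int prefix = toks.size() - (int) Math.ceil(T * toks.size()) + 1;
                for (String t : toks.subList(0, prefix))
                    buckets.computeIfAbsent(t, k -> new ArrayList<>()).add(id);
            }
            // "Reduce" phase: verify candidate pairs sharing a prefix token.
            Set<String> seen = new HashSet<>();
            for (List<Integer> bucket : buckets.values())
                for (int i = 0; i < bucket.size(); i++)
                    for (int j = i + 1; j < bucket.size(); j++) {
                        int a = bucket.get(i), b = bucket.get(j);
                        if (seen.add(a + "," + b)
                                && jaccard(tokens(docs.get(a)), tokens(docs.get(b))) >= T)
                            System.out.println("near-duplicates: doc " + a + " and doc " + b);
                    }
        }
    }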

Automatic Optimization for MapReduce Programs

Authors: 
Jahani, Eaman; Cafarella, Michael J.; Ré, Christopher

The MapReduce distributed programming framework has become popular, despite evidence that current implementations are inefficient, requiring far more hardware than a traditional relational database to complete similar tasks. MapReduce jobs are amenable to many traditional database query optimizations (B+Trees for selections, column-store-style techniques for projections, etc.), but existing systems do not apply them, substantially because free-form user code obscures the true data operation being performed. For example, a selection in SQL is easily detected, but a selection …

Year: 
2011
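
A concrete mapper shows why detection is hard: the `if` below is semantically a relational selection that an index could serve, but to Hadoop it is just opaque user code. The field layout and predicate are invented for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // A free-form predicate buried in map(): semantically a relational
    // selection, but invisible to the framework, which will scan all input.
    public class SelectionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            // Keep only records whose third field exceeds 100.
            if (fields.length > 3 && Integer.parseInt(fields[2]) > 100) {
                context.write(new Text(fields[0]), value);
            }
        }
    }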

Parallel Sorted Neighborhood Blocking with MapReduce

Authors: 
Kolb, L; Thor, A; Rahm, E

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce jobs or apply a tailored data replication.

Year: 
2011
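
The underlying sequential algorithm is easy to sketch: sort records by a blocking key, then compare only records within a sliding window of size w. A plain-Java illustration with invented records follows; the paper's contribution is parallelizing this, since windows that straddle partition boundaries are exactly what the multiple-job and data-replication variants must handle:

    import java.util.*;

    // Sequential Sorted Neighborhood: sort by a blocking key, then compare
    // each record only against its w-1 successors in sort order.
    public class SortedNeighborhood {
        public static void main(String[] args) {
            List<String> records = new ArrayList<>(Arrays.asList(
                "Smith,John,1980", "Smyth,John,1980",
                "Jones,Mary,1975", "Jonas,Mary,1975"));
            int w = 3; // window size
            records.sort(Comparator.comparing(SortedNeighborhood::key));
            for (int i = 0; i < records.size(); i++)
                for (int j = i + 1; j < Math.min(i + w, records.size()); j++)
                    if (similar(records.get(i), records.get(j)))
                        System.out.println("candidate: "
                            + records.get(i) + " / " + records.get(j));
        }

        // Blocking key: first three letters of surname + birth year (illustrative).
        static String key(String r) {
            String[] f = r.split(",");
            return f[0].substring(0, 3).toUpperCase() + f[2];
        }

        // Placeholder matcher; real entity resolution would use edit
        // distance or a learned classifier here.
        static boolean similar(String a, String b) {
            String[] fa = a.split(","), fb = b.split(",");
            return fa[1].equals(fb[1]) && fa[2].equals(fb[2]);
        }
    }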

MRShare: Sharing Across Multiple Queries in MapReduce

Authors: 
Nykiel, T; Potamias, M; Mishra, C; Kollios, G; Koudas, N

Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long-running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of …

Year: 
2010
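
One concrete sharing opportunity: if two aggregation queries scan the same input, they can be merged into one MapReduce job whose map output tags each key with the query it belongs to, so the input is read only once. A hedged Hadoop-style sketch (the two queries are invented; MRShare's cost-based grouping of jobs is not shown):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One scan serving two aggregation queries: map output keys are tagged
    // with a query id, so a single job replaces two jobs that would each
    // have read the same input in full.
    public class SharedScanMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String s = line.toString();
            // Query 1: word count.
            for (String word : s.split("\\s+"))
                if (!word.isEmpty())
                    context.write(new Text("q1:" + word), ONE);
            // Query 2: lines per leading character.
            if (!s.isEmpty())
                context.write(new Text("q2:" + s.charAt(0)), ONE);
        }
    }

A single summing reducer then aggregates per tagged key, and a final pass splits the results by tag into per-query outputs.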

The Performance of MapReduce: An in-depth Study

Authors: 
Jiang, D; Ooi, BC; Shi, L; Wu, S

Year: 
2010

Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce

Authors: 
Chen, Songting

Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm, called MapReduce, and its open source implementation Hadoop, has been widely adopted due to its impressive scalability and flexibility to handle structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. First, we take a fresh look at the data …

Year: 
2010

Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)

Authors: 
Dittrich, J; Quiané-Ruiz, JA; Jindal, A; Kargin, Y; Setty, V; Schad, J

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop (an open-source implementation of MapReduce) often does not match that of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts …

Year: 
2010
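
The excerpt stops before the mechanism, but the published system's central idea is the Trojan Index: index information is co-located with the data at load time, so a selection can probe a split instead of scanning it, without changing Hadoop itself. A toy plain-Java contrast of the two access paths, with a sorted key array standing in for an indexed block:

    import java.util.Arrays;

    // Scan vs. index lookup on one data block: scan-oriented execution reads
    // every record; an index co-located with the block (sketched here as a
    // sorted key array) answers the same selection with a binary search.
    public class ScanVsIndex {
        public static void main(String[] args) {
            long[] keys = new long[1_000_000];
            for (int i = 0; i < keys.length; i++) keys[i] = 2L * i; // sorted at load time
            long target = 123_456;

            // Scan-oriented access: touch every key.
            int scanHits = 0;
            for (long k : keys) if (k == target) scanHits++;

            // Index access: O(log n) probes on the sorted block.
            int pos = Arrays.binarySearch(keys, target);

            System.out.println("scan: " + scanHits + " hit(s) after "
                + keys.length + " reads");
            System.out.println("index: found position " + pos + " in ~"
                + (int) (Math.log(keys.length) / Math.log(2)) + " probes");
        }
    }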

Runtime Measurements in the Cloud: Observing, Analyzing, and Reducing Variance

Authors: 
Schad, J; Dittrich, J; Quiané-Ruiz, JA

One of the main reasons cloud computing has gained so much popularity is its ease of use and its ability to scale computing resources on demand. As a result, users can now rent computing nodes on large commercial clusters through several vendors, such as Amazon and Rackspace. However, despite the attention paid by Cloud providers, performance unpredictability is a major issue in Cloud computing for (1) database researchers performing wall clock experiments, and (2) database applications providing service-level agreements. In this paper, we carry out a study of the …

Year: 
2010
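
For wall-clock experiments, the practical consequence is to report variance alongside means. A small plain-Java helper computing the mean, standard deviation, and coefficient of variation over repeated measurements (the sample runtimes are invented; the coefficient of variation is a standard way to quantify such unpredictability):

    // Summarizing repeated wall-clock measurements: the coefficient of
    // variation (stddev / mean) makes runs of different magnitude
    // comparable and exposes cloud performance unpredictability.
    public class RuntimeVariance {
        public static void main(String[] args) {
            double[] runtimesSec = {61.2, 59.8, 74.5, 60.3, 88.1, 62.0}; // invented samples
            double mean = 0;
            for (double r : runtimesSec) mean += r;
            mean /= runtimesSec.length;
            double var = 0;
            for (double r : runtimesSec) var += (r - mean) * (r - mean);
            var /= (runtimesSec.length - 1);   // sample variance
            double stddev = Math.sqrt(var);
            System.out.printf("mean=%.1fs stddev=%.1fs cov=%.1f%%%n",
                mean, stddev, 100 * stddev / mean);
        }
    }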

Interpreting the data: Parallel analysis with Sawzall

Authors: 
Pike, R; Dorward, S; Griesemer, R; Quinlan, S

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on.

Year: 
2005
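
The paper expresses such analyses in Sawzall itself; to keep one language across these sketches, here is the same filter-then-aggregate shape in plain Java over an invented log, where the per-host counters play the role of Sawzall's aggregator tables:

    import java.util.*;

    // Filtering + aggregation, the shape of computation Sawzall targets:
    // each input record is processed independently (the distributable part),
    // and results flow into aggregators (here, per-host counts and sums).
    public class LogAggregate {
        public static void main(String[] args) {
            String[] log = {                       // invented "host bytes" records
                "alpha 512", "beta 2048", "alpha 128", "gamma 4096", "beta 64"};
            Map<String, long[]> perHost = new TreeMap<>(); // host -> {count, bytes}
            for (String rec : log) {
                String[] f = rec.split(" ");
                long bytes = Long.parseLong(f[1]);
                if (bytes < 10) continue;          // filter: drop tiny records
                long[] agg = perHost.computeIfAbsent(f[0], k -> new long[2]);
                agg[0] += 1;                       // like: emit count <- 1
                agg[1] += bytes;                   // like: emit total <- bytes
            }
            perHost.forEach((host, a) ->
                System.out.println(host + ": " + a[0] + " records, " + a[1] + " bytes"));
        }
    }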