MRShare: Sharing Across Multiple Queries in MapReduce

Nykiel, T; Potamias, M; Mishra, C; Kollios, G; Koudas, N

Large-scale data analysis lies at the core of modern enterprises
and scientific research. With the emergence of cloud
computing, the use of an analytical query processing infrastructure
(e.g., Amazon EC2) can be directly mapped
to monetary value. MapReduce has been a popular framework
in the context of cloud computing, designed to serve
long-running queries (jobs) that can be processed in batch
mode. Taking into account that different jobs often perform
similar work, there are many opportunities for sharing. In
principle, sharing similar work reduces the overall amount of
work, which can lead to reducing monetary charges incurred
while utilizing the processing infrastructure. In this paper
we propose a sharing framework tailored to MapReduce.
Our framework, MRShare, transforms a batch of queries
into a new batch that will be executed more efficiently, by
merging jobs into groups and evaluating each group as a
single query. Based on our cost model for MapReduce, we
define an optimization problem and we provide a solution
that derives the optimal grouping of queries. Experiments
in our prototype, built on top of Hadoop, demonstrate the
overall effectiveness of our approach and substantial savings.

VLDB 2010
Nykiel2010MRShareSharingAcrossMultipleQueriesinMapReduce.pdf (484.78 KB)