The Performance of MapReduce: An In-depth Study

Jiang, D; Ooi, BC; Shi, L; Wu, S

Large-scale data analysis has become increasingly important for many
enterprises. Recently, a new distributed computing paradigm called
MapReduce, and its open-source implementation Hadoop, have been widely
adopted due to their impressive scalability and flexibility in handling
structured as well as unstructured data. In this paper, we describe our
data warehouse system, called Cheetah, built on top of MapReduce. Cheetah
is designed specifically for our online advertising application to allow
various simplifications and custom optimizations. First, we take a fresh
look at the data


Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)

Dittrich, J; Quiane-Ruiz, J; Jindal, A; Kargin, Y; Setty, V; Schad, J

MapReduce is a computing paradigm that has gained a lot of attention in
recent years from industry and research. Unlike parallel DBMSs, MapReduce
allows non-expert users to run complex analytical tasks over very large
data sets on very large clusters and clouds. However, this comes at a
price: MapReduce processes tasks in a scan-oriented fashion. Hence, the
performance of Hadoop, an open-source implementation of MapReduce, often
does not match that of a well-configured parallel DBMS. In this paper we
propose a new type of system named Hadoop++: it boosts
