Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce

Chen, Songting

Large-scale data analysis has become increasingly important for many enterprises. Recently, a new distributed computing paradigm, called MapReduce, and its open source implementation Hadoop, has been widely adopted due to its impressive scalability and flexibility to handle structured as well as unstructured data. In this paper, we describe our data warehouse system, called Cheetah, built on top of MapReduce. Cheetah is designed specifically for our online advertising application to allow various simplifications and custom optimizations. First, we take a fresh look at the data


Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing)

Dittrich, J; Quiane-Ruiz, J; Jindal, A; Kargin, Y; Setty, V; Schad, J

MapReduce is a computing paradigm that has gained a lot of attention in recent years from industry and research. Unlike parallel DBMSs, MapReduce allows non-expert users to run complex analytical tasks over very large data sets on very large clusters and clouds. However, this comes at a price: MapReduce processes tasks in a scan-oriented fashion. Hence, the performance of Hadoop, an open-source implementation of MapReduce, often does not match the one of a well-configured parallel DBMS. In this paper we propose a new type of system named Hadoop++: it boosts


Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Gates, Alan F.; Natkovich, Olga; Chopra, Shubham; Kamath, Pradeep; Narayanamurthy, Shravan M.; Olston, Christopher; Reed, Benjamin; Srinivasan, Santhosh; Srivastava, Utkarsh

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such


Improving MapReduce Performance in Heterogeneous Environments

Zaharia, Matei; Konwinski, Andy; Joseph, Anthony D.; Katz, Randy; Stoica, Ion

MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop’s performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that ap-


Efficient Parallel Set-Similarity Joins Using MapReduce

Vernica, Rares; Carey, Michael J.; Li, Chen

In this paper we study how to efficiently perform set-similarity joins in parallel using the popular MapReduce framework. We propose a 3-stage approach for end-to-end set-similarity joins. We take as input a set of records and output a set of joined records based on a set-similarity condition. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory


Hadoop: The Definitive Guide - MapReduce for the Cloud

White, Tom; Gray, Jonathan; Stack, Michael

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:


Introduction to cloud computing

Lu, Jiaheng

HDFS Architecture

Borthakur, Dhruba

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high
throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.


Hive - A Warehousing Solution Over a Map-Reduce Framework

Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony, Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.
In this paper, we present Hive, an open-source data ware-
