Improving MapReduce Performance in Heterogeneous Environments

Authors: 
Zaharia, Matei; Konwinski, Andy; Joseph, Anthony D.; Katz, Randy; Stoica, Ion

Abstract
MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop’s performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon’s Elastic Compute Cloud (EC2). We show that Hadoop’s scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.

Year: 
2008
Venue: 
OSDI 2008
URL: 
http://portal.acm.org/citation.cfm?id=1855744
Attachment: 
Zaharia2008Improvingmapreduceperformanceinheterogeneousenvironments.pdf
Size: 
712.33 KB