DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

Yu, Y; Isard, M; Fetterly, D; Budiu, M; Erlingon, Ú; Gunda, PK; Currey, J
Yu, Y
Isard, M
Fetterly, D
Budiu, M
Gunda, P
Currey, J

DryadLINQ is a system and a set of language extensions
that enable a new programming model for large scale dis-
tributed computing. It generalizes previous execution en-
vironments such as SQL, MapReduce, and Dryad in two
ways: by adopting an expressive data model of strongly
typed .NET objects; and by supporting general-purpose
imperative and declarative operations on datasets within
a traditional high-level programming language.
A DryadLINQ program is a sequential program com-
posed of LINQ expressions performing arbitrary side-
effect-free transformations on datasets, and can be writ-
ten and debugged using standard .NET development
tools. The DryadLINQ system automatically and trans-
parently translates the data-parallel portions of the pro-
gram into a distributed execution plan which is passed
to the Dryad execution platform. Dryad, which has been
in continuous operation for several years on production
clusters made up of thousands of computers, ensures ef-
ficient, reliable execution of this plan.
We describe the implementation of the DryadLINQ
compiler and runtime. We evaluate DryadLINQ on a
varied set of programs drawn from domains such as
web-graph analysis, large-scale log mining, and machine
learning. We show that excellent absolute performance
can be attained—a general-purpose sort of 1012 Bytes of
data executes in 319 seconds on a 240-computer, 960-
disk cluster—as well as demonstrating near-linear scal-
ing of execution time on representative applications as
we vary the number of computers used for a job.

Citations range: 
Yu2008DryadLINQASystemforGeneralPurposeDistributed.pdf888.29 KB