SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets

Chaiken, R; Jenkins, B; Larson, PŅ; Ramsey, B; Shakib, D; Weaver, S; Zhou, J
Chaiken, R
Jenkins, B
Larson, P
Ramsey, B
Shakib, D
Weaver, S
Zhou, J

Companies providing cloud-scale services have an increasing
need to store and analyze massive data sets such as search logs
and click streams. For cost and performance reasons, processing is
typically done on large clusters of shared-nothing commodity
machines. It is imperative to develop a programming model that
hides the complexity of the underlying system but provides flex-
ibility by allowing users to extend functionality to meet a variety
of requirements.
In this paper, we present a new declarative and extensible script-
ing language, SCOPE (Structured Computations Optimized for
Parallel Execution), targeted for this type of massive data analy-
sis. The language is designed for ease of use with no explicit par-
allelism, while being amenable to efficient parallel execution on
large clusters. SCOPE borrows several features from SQL. Data is
modeled as sets of rows composed of typed columns. The select
statement is retained with inner joins, outer joins, and aggregation
allowed. Users can easily define their own functions and imple-
ment their own versions of operators: extractors (parsing and con-
structing rows from a file), processors (row-wise processing),
reducers (group-wise processing), and combiners (combining
rows from two inputs). SCOPE supports nesting of expressions
but also allows a computation to be specified as a series of steps,
in a manner often preferred by programmers. We also describe
how scripts are compiled into efficient, parallel execution plans
and executed on large clusters.

VLDB 2008
Citations range: 
Chaiken2008SCOPEEasyandEfficientParallelProcessingofMassiveData.pdf434.2 KB