Adaptively parallelizing distributed range queries

Vigfusson, Ymir; Silberstein, Adam; Cooper, Brian F.; Fonseca, Rodrigo

We consider the problem of how to best parallelize range
queries in a massive scale distributed database. In tradi-
tional systems the focus has been on maximizing paral-
lelism, for example by laying out data to achieve the highest
throughput. However, in a massive scale database such as
our PNUTS system [11] or BigTable [10], maximizing par-
allelism is not necessarily the best strategy: the system has
more than enough servers to saturate a single client by re-
turning results faster than the client can consume them, and
when there are multiple concurrent queries, maximizing par-


Pig Latin: A not-so-foreign language for data processing

C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins

There is a growing need for ad-hoc analysis of extremely
large data sets, especially at internet companies where inno-
vation critically depends on being able to analyze terabytes
of data collected every day. Parallel database products, e.g.,
Teradata, offer a solution, but are usually prohibitively ex-
pensive at this scale. Besides, many of the people who ana-
lyze this data are entrenched procedural programmers, who
find the declarative, SQL style to be unnatural. The success
of the more procedural map-reduce programming model, and

Syndicate content