Hive - A Warehousing Solution Over a Map-Reduce Framework

Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony, Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham

The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. In this paper, we present Hive, an open-source data warehousing solution…
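The abstract's claim that map-reduce is "very low level" can be made concrete with a minimal, in-memory sketch of the programming model (this is an illustrative Python simulation, not Hadoop's actual Java API; the mapper and reducer names are hypothetical):

```python
from collections import defaultdict

# Minimal in-memory sketch of the map-reduce programming model.
# Real Hadoop jobs are written against its Java API and run distributed;
# this only shows the shape of the user-supplied code.

def map_phase(records, mapper):
    """Apply the user's mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user's reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: even this trivial query requires hand-written mapper/reducer code,
# which is the maintenance burden that a warehousing layer like Hive removes.
def word_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(word, counts):
    return sum(counts)

lines = ["hive on hadoop", "hadoop stores data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts["hadoop"])  # 2
```

The same computation is a one-line aggregate query in a SQL-like language, which is the gap Hive targets.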


MapReduce: A major step backwards

DeWitt, D; Stonebraker, M

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.


Implementation Issues of A Cloud Computing Platform

Peng, B; Cui, B; Li, X

Cloud computing is an Internet-based model of system development in which large, scalable computing resources are provided “as a service” over the Internet to users. The concept of cloud computing incorporates web infrastructure, software as a service (SaaS), Web 2.0, and other emerging technologies, and has attracted increasing attention from industry and the research community. In this paper, we describe our experience and lessons learnt in the construction of a cloud computing platform. Specifically, we design a…


Pig Latin: A not-so-foreign language for data processing

C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style to be unnatural. The success of the more procedural map-reduce programming model, and…