Cloud Database

Apache Hadoop Goes Realtime at Facebook

Borthakur, Dhruba; Sarma, Joydeep Sen; Gray, Jonathan; Muthukkaruppan, Kannan; Spiegelberg, Nicolas; Kuang, Hairong; Ranganathan, Karthik; Molkov, Dmytro; Menon, Aravind; Rash, Samuel; Schmidt, Rodrigo; Aiyer, Amitanand

Facebook recently deployed Facebook Messages, its first ever
user-facing application built on the Apache Hadoop platform.
Apache HBase is a database-like layer built on Hadoop designed
to support billions of messages per day. This paper describes the
reasons why Facebook chose Hadoop and HBase over other
systems such as Apache Cassandra and Voldemort and discusses
the application’s requirements for consistency, availability,
partition tolerance, data model and scalability. We explore the
enhancements made to Hadoop to make it a more effective


Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce

Chen, Songting

Large-scale data analysis has become increasingly important
for many enterprises. Recently, a new distributed computing
paradigm, called MapReduce, and its open source
implementation Hadoop, have been widely adopted due to
its impressive scalability and flexibility to handle structured
as well as unstructured data. In this paper, we describe
our data warehouse system, called Cheetah, built on top of
MapReduce. Cheetah is designed specifically for our online
advertising application to allow various simplifications and
custom optimizations. First, we take a fresh look at the data


The Case for Determinism in Database Systems

Thomson, Alexander; Abadi, Daniel J.

Replication is a widely used method for achieving high availability in database systems. Due to the nondeterminism inherent in traditional concurrency control schemes, however, special care must be taken to ensure that replicas don’t
diverge. Log shipping, eager commit protocols, and lazy synchronization protocols are well-understood methods for
safely replicating databases, but each comes with its own cost in availability, performance, or consistency.
In this paper, we propose a distributed database system which combines a simple deadlock avoidance technique with


Hadoop: The Definitive Guide - MapReduce for the Cloud

White, Tom; Gray, Jonathan; Stack, Michael

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:


Introduction to cloud computing

Lu, Jiaheng

HBase-0.20.0 Performance Evaluation

Rao, Anty; Zhang, Schubert

We have been using HBase for around a year in our development and projects, from 0.17.x to
0.19.x. We and all in the community know the critical performance and reliability issues of these
Now, the great news is that HBase-0.20.0 will be released soon. Jonathan Gray from Streamy,
Ryan Rawson from StumbleUpon, Michael Stack from Powerset/Microsoft, Jean-Daniel Cryans
from OpenPlaces, and other contributors had done a great job to redesign and rewrite many


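How Do Documents and Databases Fit Together? CouchDB as a Convenient REST-Based Database Alternative (original German title: Wie passen Dokumente und Datenbanken zusammen? CouchDB als komfortable REST-basierte Datenbankalternative)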

Pientka, Frank

As a document-oriented database for the Internet, CouchDB already differs fundamentally from classical relational databases. It relies consistently on the popular MapReduce algorithm and on Internet standards such as the JSON interchange format and the REST protocol. In this article we discuss the background of what a highly scalable data architecture for the web could look like today, and how we can implement it using CouchDB as an example.


Hive - A Warehousing Solution Over a Map-Reduce Framework

Thusoo, Ashish; Sarma, Joydeep Sen; Jain, Namit; Shao, Zheng; Chakka, Prasad; Anthony, Suresh; Liu, Hao; Wyckoff, Pete; Murthy, Raghotham

The size of data sets being collected and analyzed in the
industry for business intelligence is growing rapidly, making
traditional warehousing solutions prohibitively expensive.
Hadoop [3] is a popular open-source map-reduce implementation
which is being used as an alternative to store
and process extremely large data sets on commodity hardware.
However, the map-reduce programming model is very
low level and requires developers to write custom programs
which are hard to maintain and reuse.
In this paper, we present Hive, an open-source data ware-


Building a database on S3

Brantner, Matthias; Florescu, Daniela; Graf, David; Kossmann, Donald; Kraska, Tim

There has been a great deal of hype about Amazon’s simple storage
service (S3). S3 provides infinite scalability and high availability at
low cost. Currently, S3 is used mostly to store multi-media documents
(videos, photos, audio) which are shared by a community of
people and rarely updated. The purpose of this paper is to demonstrate
the opportunities and limitations of using S3 as a storage system
for general-purpose database applications which involve small
objects and frequent updates. Read, write, and commit protocols
