Nova: Continuous Pig/Hadoop Workflows

Olston, Christopher; Chiou, Greg; Chitnis, Laukik; Liu, Francis; Han, Yiping; Larsson, Mattias; Neumann, Andreas; Rao, Vellanki B. N.; Sankarasubramanian, Vijayanand; Seth, Siddharth; Tian, Chao; ZiCornell, Topher; Wang, Xiaodan

This paper describes a workflow manager developed and deployed at Yahoo called Nova, which pushes continually-arriving data through graphs of Pig programs executing on Hadoop clusters. (Pig is a structured dataflow language and runtime for the Hadoop map-reduce system.)
Nova is like data stream managers in its support for stateful incremental processing, but unlike them in that it deals with data in large batches using disk-based processing. Batched incremental processing is a good fit for a large fraction of Yahoo’s data processing use-cases, which deal with


Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience

Gates, Alan F.; Natkovich, Olga; Chopra, Shubham; Kamath, Pradeep; Narayanamurthy, Shravan M.; Olston, Christopher; Reed, Benjamin; Srinivasan, Santhosh; Srivastava, Utkarsh

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such


Hadoop: The Definitive Guide - MapReduce for the Cloud

White, Tom; Gray, Jonathan; Stack, Michael

Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.

Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:
