This paper describes a workﬂow manager developed and
deployed at Yahoo called Nova, which pushes continually-
arriving data through graphs of Pig programs executing on
Hadoop clusters. (Pig is a structured dataﬂow language and
runtime for the Hadoop map-reduce system.)
Nova is like data stream managers in its support for
stateful incremental processing, but unlike them in that it
deals with data in large batches using disk-based processing.
Batched incremental processing is a good ﬁt for a large frac-
tion of Yahoo’s data processing use-cases, which deal with
continually-arriving data and beneﬁt from incremental algo-
rithms, but do not require ultra-low-latency processing.