The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high
throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
Complex on-demand data retrieval and processing is a characteristic of several applications and com-
bines the notions of querying & search, information ﬁltering & retrieval, data transformation & analysis,
and other data manipulations. Such rich tasks are typically represented by data processing graphs, hav-
ing arbitrary data operators as nodes and their producer-consumer interactions as edges. Optimizing
and executing such graphs on top of distributed architectures is critical for the success of the corre-
We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.