In this post you will learn about the Hadoop HDFS architecture introduction and its design. The Hadoop Distributed File System (HDFS) is a Java based distributed file system, designed to run on commodity hardwares. It has many similarities with existing available distributed file systems.
Hadoop HDFS Architecture Introduction
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. Hadoop HDFS provides high throughput access to application data and is suitable for applications that have large volume of data sets. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting around a billion files and blocks.
Hadoop HDFS NameNode and DataNodes
Hadoop HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. There are number of DataNodes in the cluster, usually one per node in the cluster, which manage storage or disks attached to the nodes that they run on.
When you put file to Hadoop HDFS, internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode or master executes file system operations like opening, closing, and renaming files and HDFS directories. NameNode also determines the mapping of file blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the clients. The DataNodes also perform block creation, deletion, and replication on other DataNodes.
Read:
- Apache Hadoop Security – Hadoop HDFS File Permissions
- Hadoop HDFS Schema Design for ETL Process
- Hadoop Data Warehouse and Design Considerations
- Hadoop Single Node Cluster Setup on Ubuntu
- 7 Best Hadoop Books to Learn Bigdata Hadoop
- Impala Architecture
- Import using Apache Sqoop
- Export using Apache Sqoop
- Sqoop Command with Secure Password
Hadoop HDFS File System Namespace
The NameNode or master node in the cluster maintains the file system namespace. Any change to the file system namespace is recorded by the NameNode. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Hadoop HDFS Data Replication
Hadoop HDFS is designed to store a very large volume of the files (interms of multiple GB’s). It stores each file as a sequence of multiple blocks; all blocks in a file except the last block are the same size. The blocks of a HDFS file are replicated to other DataNodes for fault tolerance. The number of file blocks that NameNode keeps is called replication factor. The block size and replication factor are configurable. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. Multiple data sources cannot write to same HDFS directory.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes available in the Hadoop cluster. Heartbeat indicates that DataNode is working properly. A Blockreport contains a list of all blocks on a DataNode.
NameNode (Filename, numReplicas, Block-IDs,...) /usr/warehouse/data/part-0,r:2,{1,3}, ... --> Replication is 2 and blocks will be copied to node 1 & 3 /usr/warehouse/data/part-1,r:3,{2,3,5}, ... --> Replication is 3 and blocks will be copied to node 2,3 & 5
Space Reclaim
When you delete file, it will not be removed from the HDFS, instead,it will be rename to file in /trash directory. You canrestore the deleted file as long as it is available in /trash. You can configure the retain time, after the expiry date, NameNode deletes the files from /trash and its is lost forever.