Wednesday, April 27, 2011

HDFS Overview

This post is for my friends who got interested in Hadoop after reading my previous post "A Hadoop Use Case". In this blog I will briefly touch upon the two major sub-projects in Apache Hadoop - HDFS and MapReduce - and have therefore split it into two posts. This one talks mainly about HDFS. Most of the information here has been picked up from the official Apache Hadoop site.

The Apache Hadoop project develops open-source software for reliable, scalable distributed computing. It includes the following sub-projects:

a. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

b. MapReduce: A software framework for distributed processing of large data sets on compute clusters.

HDFS is the primary storage system used by Hadoop applications. It is a distributed file system designed to run on commodity servers. It is designed for applications that have large data sets (hundreds of gigabytes or even petabytes) and require high-throughput access to that data. HDFS supports write-once-read-many semantics on files.

HDFS follows a master/slave architecture. An HDFS cluster consists of a single NameNode, which is the coordinator node, and multiple DataNodes.

NameNode manages the file system namespace and regulates access to the files by clients. The DataNodes manage the storage attached to the nodes that they run on. The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software.

HDFS provides a command-line interface called FS shell that lets a user interact with the data in HDFS. To copy a local file to HDFS from the command line you can use the command below:
hdfs dfs -copyFromLocal <localsrc> URI
For more command-line options refer to this link.
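A few other commonly used FS shell commands are sketched below; the HDFS directory /user/me and the file names are just placeholders for illustration:
hdfs dfs -mkdir /user/me/input                            # create a directory in HDFS
hdfs dfs -ls /user/me                                     # list the contents of an HDFS directory
hdfs dfs -cat /user/me/input/data.txt                     # print an HDFS file to stdout
hdfs dfs -copyToLocal /user/me/input/data.txt data.txt    # copy a file from HDFS back to the local file system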
HDFS also provides Java APIs for applications to use. Sample Java code to copy a file from the local file system to HDFS is shown below (srcFile and dstFile are the source and destination paths as strings):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();  // picks up core-site.xml/hdfs-site.xml from the classpath
FileSystem hdfs = FileSystem.get(config);    // handle to the configured (HDFS) file system
Path srcPath = new Path(srcFile);            // source file on the local file system
Path dstPath = new Path(dstFile);            // destination path in HDFS
hdfs.copyFromLocalFile(srcPath, dstPath);    // copy the local file into HDFS
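Reading a file back from HDFS is just as simple. Below is a minimal sketch that reuses the FileSystem handle (hdfs) obtained above and streams a file (the path is a placeholder) to standard output:
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

// open an HDFS file for reading, reusing the hdfs handle from the previous snippet
FSDataInputStream in = hdfs.open(new Path("/user/me/input/data.txt"));
try {
    IOUtils.copyBytes(in, System.out, 4096, false);  // copy the stream to stdout; 4096 is just a buffer size
} finally {
    IOUtils.closeStream(in);                         // always close the stream
}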
When you copy a file to HDFS it is split into one or more blocks (based on the configured block size) and these blocks are stored on multiple DataNodes. The NameNode manages the mapping of blocks to DataNodes. The blocks are also replicated for fault tolerance; the replication factor is again configurable.
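The Java API also exposes this block information. The sketch below (the file path is again a placeholder, and hdfs is the FileSystem handle from the earlier snippet) asks the NameNode which DataNodes hold the blocks of a file and requests a different replication factor for it:
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;

Path file = new Path("/user/me/input/data.txt");  // a file already stored in HDFS
FileStatus status = hdfs.getFileStatus(file);     // file metadata from the NameNode
BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation block : blocks) {
    // each block reports the DataNodes (hosts) that hold a replica of it
    System.out.println(java.util.Arrays.toString(block.getHosts()));
}
hdfs.setReplication(file, (short) 3);             // ask HDFS to keep 3 replicas of this file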

Below is the HDFS Architecture diagram from the HDFS Architecture Guide


HDFS Limitations
a. HDFS has one NameNode per cluster, so the total memory available on the NameNode is the primary scalability limitation. On very large clusters, increasing the average size of the files stored in HDFS helps grow the cluster without increasing the memory requirements on the NameNode.
b. Currently the NameNode is a single point of failure, so if the NameNode goes down the cluster is unavailable. We will not lose any data though, as the NameNode persists all its metadata on disk and can start up again with the latest information. To prevent data loss due to disk failures we can run a Secondary NameNode or save copies of the fsimage and edit log on multiple disks (a sample configuration for the latter is shown below).
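As an illustration of the multiple-disks approach, here is a minimal hdfs-site.xml snippet; the property name shown (dfs.name.dir) is the one used by the Hadoop releases of this era, and the directory paths are placeholders:
<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <!-- comma-separated list: the NameNode keeps a copy of its metadata (fsimage and, by default, the edit log) in every directory listed -->
  <value>/disk1/hadoop/name,/disk2/hadoop/name,/remote/nfs/hadoop/name</value>
</property>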

To start the Hadoop cluster refer to this link.
