This is second blog to our series of blog for more information about Hadoop. Here, we need to consider two main pain point with Big Data as
- Secure storage of the data
- Accurate analysis of the data
Hadoop is designed for parallel processing into a distributed environment, so Hadoop requires such a mechanism which helps users to answer these questions. In 2003 Google has published two white papers Google File System (GFS) and MapReduce framework. Dug Cutting had read these papers and designed file system for hadoop which is known as Hadoop Distributed File System (HDFS) and implemented a MapReduce framework on this file system to process data. This has become the core components of Hadoop.
Hadoop Distributed File System :
HDFS is a virtual file system which is scalable, runs on commodity hardware and provides high throughput access to application data. It is a data storage component of Hadoop. It stores its data blocks on top of the native file system.It presents a single view of multiple physical disks or file systems. Data is distributed across the nodes, node is an individual machine in a cluster and cluster is a group of nodes. It is designed for applications which need a write-once-read-many access. It does not allow modification of data once it is written. Hadoop has a master/slave architecture. The Master of HDFS is known as Namenode and Slave is known as Datanode.
It is a deamon which runs on master node of hadoop cluster. There is only one namenode in a cluster. It contains metadata of all the files stored on HDFS which is known as namespace of HDFS. It maintain two files EditLog, record every change that occurs to file system metadata (transaction history) and FsImage, which stores entire namespace, mapping of blocks to files and file system properties. The FsImage and the EditLog are central data structures of HDFS.
It is a deamon which runs on slave machines of Hadoop cluster. There are number of datanodes in a cluster. It is responsible for serving read/write request from the clients. It also performs block creation, deletion, and replication upon instruction from the Namenode. It also sends a Heartbeat message to the namenode periodically about the blocks it hold. Namenode and Datanode machines typically run a GNU/Linux operating system (OS).
Following are some of the characteristics of HDFS ,
1) Data Integrity :
When a file is created in HDFS, it computes a checksum of each block of the file and stores this checksums in a separate hidden file. When a client retrieves file contents, it verifies that the data it received matches the checksum stored in the associated checksum file.
2) Robustness :
The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures and network partitions.
3) Cluster Rebalancing :
The HDFS is compatible with data re balancing that means it will automatically move the data from one datanode to another, if free space on datanode falls below a certain threshold.
4) Accessibility :
It can be accessed from applications in many different ways. Hadoop provides a Java API for applications to use. An HTTP browser can also be used to browse the files of an HDFS instance using default web interface of hadoop.
5) Re-replication :
When a datanode send heartbeats to namenode and if any block is missing then namenode mark that block as dead. This dead block is re-replicated from the other datanode. Re-replication arise when a datanode become unavailable, a replica is corrupted, a hard disk may fail, or the replication factor value is increased.
In general MapReduce is a programming model which allows to process large data sets with a parallel, distributed algorithm on a cluster. Hadoop uses this model to process data which is stored on HDFS. It splits a task across the processes. Generally, we send data to the process but in MapReduce we send process to the data which decreases network overhead.
MapReduce job is an analysis work that we want to run on data, which is broken down into multiple task because the data is stored on different nodes which can run paralleled. A MapReduce program processes data by manipulating (key/value) pairs in the general form
map: (K1,V1) ? list(K2,V2)
reduce: (K2,list(V2)) ? list(K3,V3)
Following are the phases of MapReduce job ,
1) Map :
In this phase we simultaneously ask our machines to run a computation on their local block of data. As this phase completes, each node stores the result of its computation in temporary local storage, this is called the “intermediate data”. Please note that the output of this phase is written to the local disk, not to the HDFS.
2) Combine :
Sometime we want to perform a local reduce before we transfer result to reduce task. In such scenarios we add combiner to perform local reduce task. It is a reduce task which runs on local data. For example, if the job processes a document containing the word “the” 574 times, it is much more efficient to store and shuffle the pair (“the”, 574) once instead of the pair (“the”, 1) multiple times. This processing step is known as combining.
3) Partition :
In this phase partitioner will redirect the result of mappers to different reducers. When there are multiple reducers, we need some ways to determine the appropriate one to send a (key/value) pair outputted by a mapper.
4) Reduce :
The Map task on the machines have completed and generated their intermediate data. Now we need to gather all of this intermediate data to combine it for further processing such that we have one final result. Reduce task run on any of the slave nodes. When the reduce task receives the output from the various mappers, it sorts the incoming data on the key of the (key/value) pair and groups together all values of the same key.
The Master of MapReduce engine is known as Jobtracker and Slave is known as Tasktracker.
Jobtracker is a coordinator of the MapReduce job which runs on master node. When the client machine submits the job then it first consults Namenode to know about which datanode have blocks of file which is input for the submitted job. The Job Tracker then provides the Task Tracker running on those nodes with the Java code required to execute job.
Tasktracker runs actual code of job on the data blocks of input file. It also sends heartbeats and task status back to the jobtracker.
If the node running the map task fails before the map output has been consumed by the reduce task, then Jobtracker will automatically rerun the map task on another node to re-create the map output that is why it is known as self-hexaling system.
Read More : Apache Hadoop: An Introduction