In the previous article, we had a brief introduction to Big Data. Over the last few years, a number of frameworks and tools have appeared to help organisations take advantage of the opportunities it creates. One of these frameworks is Hadoop, together with the ecosystem of tools built around it.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Apache Hadoop website
The project includes these modules:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Hadoop Distributed File System (HDFS™)
HDFS creates an abstraction that lets users see HDFS logically as a single storage unit, while the data is actually stored across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture where the master is the NameNode and the slaves – there can be multiple slaves – are known as DataNodes.
The NameNode has the following characteristics and functions:
- It is the master daemon that maintains and manages the DataNodes.
- It records the metadata of all the blocks stored in the cluster, e.g. location of blocks stored, size of the files, permissions, hierarchy, etc.
- It records every change that takes place to the file system metadata.
- If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
- It regularly receives a heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
- It keeps a record of all the blocks in HDFS and the DataNodes on which they are stored.
- It has high availability and federation features.
DataNodes have the following characteristics and functions:
- It is the slave daemon which runs on each slave machine.
- The actual data is stored on DataNodes.
- It is responsible for serving read and write requests from the clients (see the sketch after this list).
- It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode.
- It sends heartbeats to the NameNode periodically to report the overall health of HDFS; by default, this frequency is set to 3 seconds.
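As a small illustration of how a client talks to this architecture, the sketch below uses the standard `org.apache.hadoop.fs.FileSystem` Java API to write a file to HDFS and read it back. The NameNode address and the file path are made-up example values; in a real setup they would come from the cluster's configuration files.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example NameNode address; normally this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/example/hello.txt");

            // Write a small file: the client asks the NameNode where to place the blocks,
            // then streams the data directly to the chosen DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back: the NameNode returns the block locations,
            // and the actual bytes are served by the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```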
Some of the problems that HDFS solves are:
- Storage space problems: HDFS provides a distributed way of storing Big Data. The data is stored in blocks across the available DataNodes, and each block is also replicated on different DataNodes. This architecture allows horizontal scaling: extra DataNodes can be added to the HDFS cluster when required, increasing the available storage space.
- Variety of data: HDFS can store all kinds of data whether it is structured, semi-structured or unstructured.
- Access and process velocity: To increase access and processing speed, HDFS moves the processing to the data rather than the data to the processing. Instead of moving data to the master nodes and then processing it, the processing logic is sent to the various DataNodes, where the data is processed in parallel across different nodes. Once they are done, the processed results are sent to the master node, where they are merged and the response is sent back to the client.
Hadoop YARN
YARN performs all the processing activities by allocating resources and scheduling tasks. YARN has a ResourceManager and one or more NodeManagers.
The ResourceManager has the following characteristics and functions:
- It is a cluster-level (one for each cluster) component and runs on the master machine.
- It manages resources and schedules applications running on top of YARN.
- It has two components: Scheduler & ApplicationsManager.
- The Scheduler is responsible for allocating resources to the various running applications.
- The ApplicationsManager is responsible for accepting job submissions and negotiating the first container for executing the application.
- It keeps track of the heartbeats from the NodeManagers.
The NodeManagers have the following characteristics and functions:
- It is a node-level component (one on each node) and runs on each slave machine.
- It is responsible for managing containers and monitoring resource utilisation in each container.
- It also keeps track of node health and log management.
- It continuously communicates with the ResourceManager to remain up-to-date (see the sketch after this list).
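To make the ResourceManager/NodeManager split a bit more concrete, the sketch below uses the `YarnClient` Java API to ask the ResourceManager for the node reports it builds from the NodeManagers' heartbeats. It assumes a `yarn-site.xml` pointing at a running cluster is available on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {

    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath; assumed to point at a running cluster.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        try {
            // The ResourceManager knows about every NodeManager through their heartbeats.
            List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.printf("%s - used: %s, capability: %s%n",
                        node.getNodeId(),
                        node.getUsed(),
                        node.getCapability());
            }
        } finally {
            yarnClient.stop();
        }
    }
}
```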
Despite all the benefits Hadoop offers, it is not a silver bullet and its use needs to be carefully considered. Some of the use cases where it is not recommended are:
- Low latency data access: Queries that must be answered within a short, fixed response time.
- Constant data modifications: Hadoop is a better fit when the primary concern is reading data.
- Lots of small files: While Hadoop can store many small datasets, it is much better suited to scenarios with a small number of large files.
Hadoop MapReduce
Hadoop MapReduce is the core Hadoop ecosystem component that provides data processing. MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in the Hadoop Distributed File System.
MapReduce programs are parallel in nature, and thus are very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.
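The canonical example of a MapReduce program is word count, sketched below in the style of the WordCount example from the Apache Hadoop MapReduce tutorial. The map phase emits a (word, 1) pair for every word in the input, and the reduce phase sums the counts for each word; the input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map phase runs in parallel on the nodes holding the input blocks
    // (the processing moves to the data).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

    // The reduce phase receives all the counts for a given word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job would typically be packaged as a jar and submitted with the `hadoop jar` command, with YARN allocating the containers in which the map and reduce tasks run.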
Hadoop ecosystem
Hadoop was designed as both a computing (MapReduce) and storage (HDFS) platform from the very beginning. With the increasing need for big data analysis, Hadoop has attracted many other tools that solve specific big data problems and that, together, form a Hadoop-centric big data ecosystem. The following diagram gives a brief overview of the Hadoop big data ecosystem:

We will go into more detail on some of the elements in future articles.