Batch Processing — Hadoop Ecosystem
Let’s review the major components of the Hadoop Ecosystem.
Apache Hadoop Project

The Apache Hadoop project consists of the following key parts:
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
Latest Hadoop documentation: http://hadoop.apache.org/docs/current/
Distributed FS

The official documentation states: “The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.”
Key concepts:
- Deployed across multiple servers, called a cluster
- HDFS provides a write-once-read-many access model for files
- Hardware failure is the norm rather than the exception
- Files are spread across multiple nodes for fault tolerance, high availability (HA), and parallel processing
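To make the block idea concrete, here is a small back-of-the-envelope calculation. It assumes the common defaults of 128 MB blocks and a replication factor of 3; both are configurable (dfs.blocksize and dfs.replication), so treat the numbers as illustrative.

```python
import math

# Assumed defaults (hedged): 128 MB blocks, replication factor 3.
BLOCK_SIZE_MB = 128
REPLICATION = 3

file_size_mb = 1024  # a 1 GB file
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)

print(f"blocks: {num_blocks}")                        # 8 blocks
print(f"block replicas: {num_blocks * REPLICATION}")  # 24 replicas spread across DataNodes
```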
Architecture

Hadoop follows a master-slave architecture for data storage and for distributed data processing, using HDFS and MapReduce respectively:
- The master node for data storage in Hadoop HDFS is the NameNode
- The master node for parallel processing of data with Hadoop MapReduce is the JobTracker
- The slave nodes in the Hadoop architecture are the other machines in the cluster, which store data and perform the computations
- Every slave node runs a TaskTracker daemon and a DataNode, which synchronize with the JobTracker and the NameNode respectively
Note that JobTracker and TaskTracker are the classic MapReduce v1 daemon names; under YARN their roles are filled by the ResourceManager, the NodeManagers, and per-application masters (see the MapReduce Framework section below).
NameNode & DataNodes

Critical components of HDFS!
- An HDFS cluster consists of a single NameNode (optionally with one or more standby replicas for HA), a master server that manages the file system namespace and regulates access to files by clients
- In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.
- Basically, application data and metadata are stored separately in HDFS: Master nodes store metadata, and DataNodes store application data
- HDFS exposes a file system namespace and allows user data to be stored in files.
- Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
- The DataNodes are responsible for serving read and write requests from the file system’s clients.
- The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
- The NameNode and DataNode communicate with each other using TCP based protocols.
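As a mental model of the metadata/data split described above, here is a toy sketch. It is purely illustrative and does not reflect the real HDFS protocol or any Hadoop API: the "NameNode" keeps only metadata (which blocks make up a file and which DataNodes hold them), while the block bytes themselves would live on the DataNodes.

```python
# Toy model only -- not the real HDFS protocol or API.
class ToyNameNode:
    def __init__(self):
        self.namespace = {}        # file path -> list of block IDs (metadata only)
        self.block_locations = {}  # block ID -> list of DataNode hosts

    def add_file(self, path, num_blocks, datanodes, replication=3):
        """Record metadata for a new file; the block bytes live on the DataNodes."""
        block_ids = [f"{path}#blk_{i}" for i in range(num_blocks)]
        self.namespace[path] = block_ids
        for i, blk in enumerate(block_ids):
            # Round-robin placement, purely illustrative (real HDFS is rack-aware).
            self.block_locations[blk] = [
                datanodes[(i + r) % len(datanodes)] for r in range(replication)
            ]
        return block_ids


namenode = ToyNameNode()
namenode.add_file("/logs/2024-01-01.log", num_blocks=3,
                  datanodes=["dn1", "dn2", "dn3", "dn4"])

print(namenode.namespace)        # metadata held by the master
print(namenode.block_locations)  # where a client would go for the actual bytes
```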
Client Interaction
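A client first contacts the NameNode to resolve a path into block locations, then reads and writes block data directly from and to the DataNodes; the NameNode itself never ships file data.

From Python, one common way to interact with HDFS is through the WebHDFS REST interface, for example via the third-party hdfs package (`pip install hdfs`). The sketch below assumes a NameNode WebHDFS endpoint on port 9870 (the Hadoop 3 default) and uses illustrative host names, paths, and user.

```python
from hdfs import InsecureClient  # third-party "hdfs" (HdfsCLI) package

# Illustrative endpoint/user -- adjust to your cluster (port 9870 is the Hadoop 3 default).
client = InsecureClient("http://namenode-host:9870", user="hadoop")

print(client.list("/"))  # list the root of the HDFS namespace

# Write a small file (HDFS is write-once-read-many; use overwrite to replace it).
client.write("/tmp/hello.txt", data=b"hello hdfs\n", overwrite=True)

# Read it back.
with client.read("/tmp/hello.txt") as reader:
    print(reader.read())
```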

Complexity
You may be wondering why we need all this complexity for something that can be done easily in Python on your local computer. The reasons are:
- Splitting files into blocks enables parallel processing
- Horizontal scalability
- HA (High Availability)
HDFS Setup
- A virtual machine (VM)
- A managed service
- A manual install on your own Linux server
MapReduce Framework
Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and HDFS run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.
The MapReduce framework consists of:
- A single master ResourceManager
- One slave NodeManager per cluster node
- One MRAppMaster per application
Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.
Job Submission
The Hadoop job client then submits the job (jar/executable, etc.) and its configuration to the ResourceManager, which then takes responsibility for distributing the software and configuration to the slaves, scheduling and monitoring the tasks, and providing status and diagnostic information to the job client.
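As a concrete illustration, the sketch below submits the word-count job from the examples jar that ships with Hadoop, driving the `hadoop jar` command from Python. The jar path, version, and HDFS paths are illustrative assumptions; adjust them to your installation.

```python
import subprocess

# Assumptions (illustrative): the `hadoop` CLI is on PATH, and the bundled
# examples jar lives at this path for your Hadoop version.
EXAMPLES_JAR = "/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar"

subprocess.run(
    [
        "hadoop", "jar", EXAMPLES_JAR, "wordcount",
        "/user/me/input",    # HDFS input directory (must exist)
        "/user/me/output",   # HDFS output directory (must NOT exist yet)
    ],
    check=True,  # raise if the job client reports failure
)
```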
Hadoop Streaming

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java, thanks to Hadoop Streaming.
- Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer
- Hadoop Streaming is a generic API that allows writing Mappers and Reducers in any language, but the basic concept remains the same: Mappers and Reducers receive their input on stdin and emit their output on stdout as (key, value) pairs (see the word-count sketch after this list)
- Apache Hadoop uses streams as per the UNIX standard between your application and the Hadoop system; basically, it uses the 'stdin' and 'stdout' streams
- NOT to be confused with Stream Processing, which is a totally different concept: processing real-time data
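To make this concrete, here is a minimal word-count sketch for Hadoop Streaming. The file names (mapper.py, reducer.py) are illustrative; the only contract is the one described above: read lines on stdin, write tab-separated (key, value) pairs on stdout, and rely on the framework to sort the map output by key before it reaches the reducer.

```python
#!/usr/bin/env python3
# mapper.py -- minimal word-count mapper for Hadoop Streaming (illustrative).
# Reads raw text lines on stdin and emits tab-separated (word, 1) pairs on stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- minimal word-count reducer for Hadoop Streaming (illustrative).
# The framework sorts map output by key, so all counts for a word arrive together.
import sys

current_word, current_count = None, 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

Locally, the pipeline can be smoke-tested with standard UNIX tools (`cat input.txt | ./mapper.py | sort | ./reducer.py`), which mirrors what the framework does at scale. On a cluster, the job is submitted with the hadoop-streaming jar, e.g. `hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /user/me/input -output /user/me/output` (the jar path is installation-specific).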