Hadoop and Mapreduce Tutorial | Hadoop Architecture

Hadoop is part of a larger ecosystem of related technologies:

HDFS – Hadoop Distributed File System
HBase – Column oriented, non-relational, schema-less, distributed database modeled after Google’s BigTable. Promises “Random, real-time read/write access to Big Data”
Hive – Data warehouse system that provides SQL interface. Data structure can be projected ad hoc onto unstructured underlying data
Pig – A platform for manipulating and analyzing large data sets. High level language for analysts
ZooKeeper – A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

At its core, Hadoop comprises four modules:

Hadoop Common – A set of common libraries and utilities used by the other Hadoop modules.
HDFS – The default storage layer for Hadoop.
MapReduce – Executes a wide range of analytic functions by analysing datasets in parallel. “Map” tasks run on the nodes that hold the data and turn each input record into intermediate key-value pairs; the framework groups those pairs by key, and “Reduce” tasks combine the values for each key into the final results.
YARN – Present from version 2.0 onwards, YARN is the cluster management layer of Hadoop. Prior to 2.0, MapReduce was responsible for cluster management as well as processing. With YARN, multiple applications can run on the same cluster (so you are no longer limited to MapReduce), all sharing common cluster management.
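
The map, group-by-key, and reduce steps described for MapReduce can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are made up for illustration, and on a real cluster each split would be processed on a different node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key into one result."""
    return key, sum(values)


def word_count(splits):
    # On a cluster the map calls would run in parallel on different nodes;
    # in this sketch they simply run one after another.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical input: two "splits" standing in for blocks of a larger file.
splits = ["big data big cluster", "data node data"]
counts = word_count(splits)
print(counts)
```

Note that the reduce step yields one result per key (here, one count per word) rather than a single value for the whole job.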

These four components form the basic framework. However, a vast array of other components has emerged, each aiming to improve Hadoop in some way, whether by making it faster, integrating it better with other database solutions, or building in new capabilities. Some of the more well-known components include:

Spark – Used on top of HDFS, Spark promises speeds up to 100 times faster than the two-step MapReduce model in certain applications. It allows data to be loaded in memory and queried repeatedly, making it particularly well suited to machine learning algorithms.
Hive – Originally developed by Facebook, Hive is a data warehouse infrastructure built on top of Hadoop. Hive provides a simple, SQL-like language called HiveQL while maintaining full support for MapReduce. This lets SQL programmers with little prior Hadoop experience use the system more easily, and it provides better integration with certain analytics packages such as Tableau. Hive releases before 3.0 also provide indexes to make querying faster.
HBase – A NoSQL columnar database designed to run on top of HDFS. It is modelled after Google’s BigTable and written in Java. It was designed to provide BigTable-like capabilities, such as the columnar data storage model and storage for sparse data.
Flume – Flume collects data (typically logs) from ‘agents’, then aggregates it and moves it into Hadoop. In essence, Flume takes the data from the source (say, a server or mobile device) and delivers it to the cluster.
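
To make the “storage for sparse data” idea behind HBase more concrete, here is a toy in-memory model in plain Python. It is not the HBase client API: the SparseTable class, its methods, and the sample rows are hypothetical. The only thing borrowed from HBase is the “family:qualifier” style of column name.

```python
from collections import defaultdict


class SparseTable:
    """Toy sparse, column-oriented table: row key -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Only cells that are actually written take up space.
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Missing cells are simply absent; no NULL placeholder is stored.
        return self._rows.get(row_key, {}).get(column, default)

    def cell_count(self):
        """Number of cells physically stored across all rows."""
        return sum(len(columns) for columns in self._rows.values())

    def scan_column(self, column):
        """Return (row_key, value) pairs for rows that have this column."""
        return [
            (row_key, self._rows[row_key][column])
            for row_key in sorted(self._rows)
            if column in self._rows[row_key]
        ]


# Hypothetical data: the second row has no email, so nothing is stored for it.
table = SparseTable()
table.put("user1", "info:name", "Ada")
table.put("user1", "info:email", "ada@example.com")
table.put("user2", "info:name", "Grace")

print(table.cell_count())
print(table.scan_column("info:email"))
```

A relational table with the same two columns would reserve a slot for every row, whereas this layout stores three cells for two rows and nothing for the missing email.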

Apply for Big Data and Hadoop Developer Certification

https://www.vskills.in/certification/certified-big-data-and-apache-hadoop-developer

Team Vskills
