Hadoop and MapReduce Tutorial | Introduction to Apache Hadoop

Apache Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
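
As a concrete illustration of these "simple programming models", below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths are supplied as command-line arguments and are assumptions of the example, not part of the tutorial.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths are passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

When the compiled JAR is submitted with hadoop jar, the framework splits the input across map tasks running on the data nodes and routes all counts for a given word to a single reduce task.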

Advantages & Disadvantages

Advantages of Hadoop

  • Hadoop provides both distributed storage and computational capabilities to address big data challenges.
  • Hadoop is extremely scalable on commodity hardware.
  • HDFS, the storage component of Hadoop, is optimized for high throughput: it uses large block sizes, which suits large files (gigabytes to petabytes).
  • HDFS also provides data replication, storing each file a configurable number of times, and fault tolerance, automatically re-replicating data blocks that were held on failed nodes (see the configuration sketch after this list).
  • Hadoop uses the MapReduce framework, a batch-based, distributed computing framework that parallelizes processing over large amounts of data and hides distributed-system complexity from the developer. MapReduce decomposes a job into Map and Reduce tasks and schedules them for remote execution on the slave (data) nodes of a Hadoop cluster.
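
As a rough sketch of how the block-size and replication settings above surface to a developer, the snippet below reads the cluster defaults through the Hadoop FileSystem API and raises the replication factor of one file. The file path is hypothetical, and the defaults noted in the comments are the usual out-of-the-box values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; replace with a real HDFS path.
    Path file = new Path("/data/example/important.log");

    // Cluster defaults, typically 128 MB blocks and 3 replicas.
    System.out.println("Default block size:  " + fs.getDefaultBlockSize(file) + " bytes");
    System.out.println("Default replication: " + fs.getDefaultReplication(file));

    // Ask the NameNode to keep 5 copies of this file's blocks;
    // re-replication to reach the new target happens asynchronously.
    fs.setReplication(file, (short) 5);

    fs.close();
  }
}
```

The same values can also be fixed cluster-wide in hdfs-site.xml via the dfs.blocksize and dfs.replication properties.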

Disadvantages of Hadoop

The following are the major areas commonly identified as weaknesses of the Hadoop framework:

  • Prior to the Hadoop 2.x release, the HDFS and MapReduce components had single points of failure because each relied on a single master process.
  • Security is disabled by default because it is complex to configure.
  • Hadoop also lacks storage- and network-level encryption, which is needed for data security.
  • HDFS is inefficient at handling small files and lacks transparent compression; because it is optimized for sustained throughput, it does not work well with random reads over many small files.
  • MapReduce is a batch-based architecture and is therefore inefficient for workloads that need real-time data access.

Apply for Big Data and Hadoop Developer Certification

https://www.vskills.in/certification/certified-big-data-and-apache-hadoop-developer
