Need and requirement for Hadoop

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It has become the technology of choice for applications that run petabyte-scale analytics across large numbers of computing nodes. It is:

  • Reliable – The software is fault tolerant; it expects and handles hardware and software failures
  • Scalable – Designed for massive scale-out across processors, memory, and locally attached storage
  • Distributed – Handles replication and offers a massively parallel programming model, MapReduce (see the sketch below)

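To make the MapReduce bullet above concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API: each mapper emits a (word, 1) pair for every word in its input split, and the reducer sums those counts per word. The input and output paths are placeholders supplied on the command line.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the counts emitted for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged into a JAR and submitted to the cluster, the job's map and reduce tasks are scheduled next to the data they read.
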
It is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing; a short HDFS example after the following list illustrates how data is written into such a cluster. It is particularly useful when:

  • Complex information processing is needed
  • Unstructured data needs to be turned into structured data
  • Queries cannot be reasonably expressed using SQL
  • Heavily recursive algorithms are involved
  • Complex but parallelizable algorithms are needed, such as geospatial analysis or genome sequencing
  • Machine learning is applied at scale
  • Data sets are too large for database RAM or disks, or need too many cores (terabytes up to petabytes)
  • The data's value does not justify the expense of constant real-time availability, as with archives or special-interest information, which can be moved to Hadoop and remain available at lower cost
  • Results are not needed in real time
  • Fault tolerance is critical
  • Significant custom coding would be required to handle job scheduling
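
As a rough illustration of how data lands in such a cluster (the HDFS example referred to above), the sketch below writes a file through Hadoop's Java FileSystem API with an explicit block size and replication factor; the NameNode then spreads those blocks and their replicas across the DataNodes. The URI, path, buffer size, block size, and replication values are placeholders, not recommendations.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // "hdfs://namenode-host:8020" stands in for the cluster's NameNode address.
      FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

      // Create a file with a 128 MB block size and 3 replicas (illustrative values).
      // HDFS cuts the stream into blocks and places each replica on a different DataNode.
      Path out = new Path("/data/raw/events.log");
      try (FSDataOutputStream stream =
          fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024)) {
        stream.writeBytes("one record of unstructured input\n");
      }
      fs.close();
    }
  }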

The problem it addresses is the need for analytical platforms that can scale rapidly and provide the following capabilities:

  1. Detailed, interactive, multivariate statistical analysis
  2. Aggregation, correlation, and analysis of historical and current data
  3. Modeling and simulation, what-if analysis, and forecasting of alternate future states
  4. Semantic mining of unstructured data, streaming information, and multimedia

It helps organizations to:

  • Iterate predictive models more rapidly.
  • Run models of increasing complexity.
  • Deliver model-driven decisions to more business processes.

Requirements

Hadoop clusters have two types of machines: masters (the HDFS NameNode and the MapReduce JobTracker) and slaves (the HDFS DataNodes and the MapReduce TaskTrackers). The DataNodes, TaskTrackers, and HBase RegionServers are co-located or co-deployed for optimal data locality.
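
In practice, slaves find their masters through a handful of configuration properties kept in core-site.xml, hdfs-site.xml, and mapred-site.xml on every node. The sketch below sets the same properties programmatically, assuming the classic Hadoop 1.x (MRv1) property names that match the JobTracker/TaskTracker layout above; the hostnames and ports are placeholders.

  import org.apache.hadoop.conf.Configuration;

  public class ClusterConfSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();

      // HDFS master: DataNodes and clients contact the NameNode at this address
      // (classic Hadoop 1.x property name; hostname and port are placeholders).
      conf.set("fs.default.name", "hdfs://namenode-host:8020");

      // MapReduce master: TaskTrackers register with the JobTracker at this address.
      conf.set("mapred.job.tracker", "jobtracker-host:8021");

      // Each HDFS block is kept on three DataNodes (the default replication factor).
      conf.set("dfs.replication", "3");

      System.out.println("NameNode:   " + conf.get("fs.default.name"));
      System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
  }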

Slave nodes make up the majority of the hardware infrastructure. Disk space, I/O bandwidth, and computational power are the crucial factors for hardware sizing.

Start with 1U machines and use the following recommendations:

Two quad-core CPUs, 12 GB to 24 GB of memory, and four to six disk drives of 2 terabyte (TB) capacity per machine are usually enough to start the cluster. The minimum network requirement is 1 GigE connecting all nodes to a Gigabit Ethernet switch.
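
As a rough sizing check (assuming HDFS's default replication factor of 3): six 2 TB drives give about 12 TB of raw storage per slave, which translates to roughly 4 TB of usable HDFS capacity per node before space is set aside for intermediate MapReduce output and operating-system overhead.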
