Need and requirement for Hadoop

Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It has become the technology of choice for applications that run petabyte-scale analytics across large numbers of computing nodes. It is:

  • Reliable – The software is fault tolerant; it expects and handles hardware and software failures
  • Scalable – Designed for massive scale-out across processors, memory, and locally attached storage
  • Distributed – Handles replication and offers a massively parallel programming model, MapReduce (see the sketch below)

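To make the MapReduce bullet above concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API: each mapper emits a (word, 1) pair for every word in its input split, and the reducer sums those counts per word. The input and output paths are placeholders supplied on the command line.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    // Reducer: sums the counts emitted for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (placeholder)
      FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

Packaged into a JAR and submitted to the cluster, the job's map and reduce tasks are scheduled next to the data they read.
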
It is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing; a short HDFS example after the following list illustrates how data is written into such a cluster. It is particularly useful when:

  • Complex information processing is needed
  • Unstructured data needs to be turned into structured data
  • Queries cannot be reasonably expressed using SQL
  • Heavily recursive algorithms are involved
  • Complex but parallelizable algorithms are needed, such as geospatial analysis or genome sequencing
  • Machine learning is applied at scale
  • Data sets are too large for database RAM or disks, or need too many cores (terabytes up to petabytes)
  • The data's value does not justify the expense of constant real-time availability, as with archives or special-interest information, which can be moved to Hadoop and remain available at lower cost
  • Results are not needed in real time
  • Fault tolerance is critical
  • Significant custom coding would be required to handle job scheduling
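
As a rough illustration of how data lands in such a cluster (the HDFS example referred to above), the sketch below writes a file through Hadoop's Java FileSystem API with an explicit block size and replication factor; the NameNode then spreads those blocks and their replicas across the DataNodes. The URI, path, buffer size, block size, and replication values are placeholders, not recommendations.

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // "hdfs://namenode-host:8020" stands in for the cluster's NameNode address.
      FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);

      // Create a file with a 128 MB block size and 3 replicas (illustrative values).
      // HDFS cuts the stream into blocks and places each replica on a different DataNode.
      Path out = new Path("/data/raw/events.log");
      try (FSDataOutputStream stream =
          fs.create(out, true, 4096, (short) 3, 128L * 1024 * 1024)) {
        stream.writeBytes("one record of unstructured input\n");
      }
      fs.close();
    }
  }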

The problem it addresses is the need for analytical platforms that can scale rapidly and provide the following capabilities:

  1. Detailed, interactive, multivariate statistical analysis
  2. Aggregation, correlation, and analysis of historical and current data
  3. Modeling and simulation, what-if analysis, and forecasting of alternate future states
  4. Semantic mining of unstructured data, streaming information, and multimedia

It helps organizations to:

  • Iterate predictive models more rapidly.
  • Run models of increasing complexity.
  • Deliver model-driven decisions to more business processes.

Requirements

Hadoop clusters have two types of machines: masters (the HDFS NameNode and the MapReduce JobTracker) and slaves (the HDFS DataNodes and the MapReduce TaskTrackers). The DataNodes, TaskTrackers, and HBase RegionServers are co-located or co-deployed for optimal data locality.
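
In practice, slaves find their masters through a handful of configuration properties kept in core-site.xml, hdfs-site.xml, and mapred-site.xml on every node. The sketch below sets the same properties programmatically, assuming the classic Hadoop 1.x (MRv1) property names that match the JobTracker/TaskTracker layout above; the hostnames and ports are placeholders.

  import org.apache.hadoop.conf.Configuration;

  public class ClusterConfSketch {
    public static void main(String[] args) {
      Configuration conf = new Configuration();

      // HDFS master: DataNodes and clients contact the NameNode at this address
      // (classic Hadoop 1.x property name; hostname and port are placeholders).
      conf.set("fs.default.name", "hdfs://namenode-host:8020");

      // MapReduce master: TaskTrackers register with the JobTracker at this address.
      conf.set("mapred.job.tracker", "jobtracker-host:8021");

      // Each HDFS block is kept on three DataNodes (the default replication factor).
      conf.set("dfs.replication", "3");

      System.out.println("NameNode:   " + conf.get("fs.default.name"));
      System.out.println("JobTracker: " + conf.get("mapred.job.tracker"));
    }
  }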

Slave nodes make up the majority of the hardware infrastructure. Disk space, I/O bandwidth, and computational power are the crucial factors for hardware sizing.

Start with 1U machines and use the following recommendations:

Two quad-core CPUs, 12 GB to 24 GB of memory, and four to six disk drives of 2 terabyte (TB) capacity per machine are usually enough to start the cluster. The minimum network requirement is 1 GigE connecting all nodes to a Gigabit Ethernet switch.
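
As a rough sizing check (assuming HDFS's default replication factor of 3): six 2 TB drives give about 12 TB of raw storage per slave, which translates to roughly 4 TB of usable HDFS capacity per node before space is set aside for intermediate MapReduce output and operating-system overhead.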
