Need and requirements for Hadoop
Hadoop is open-source software that enables reliable, scalable, distributed computing on clusters of inexpensive servers. It has become the technology of choice for applications that require petabyte-scale analytics across large numbers of computing nodes. It is:
- Reliable – The software is fault tolerant; it expects and handles hardware and software failures
- Scalable – Designed for massive scale of processors, memory, and locally attached storage
- Distributed – Handles data replication across nodes and offers a massively parallel programming model, MapReduce (see the word-count sketch below)
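The MapReduce model mentioned above is easiest to see with the classic word-count job. The sketch below uses the standard Hadoop MapReduce Java API; the input and output locations are assumed to be HDFS paths supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map phase runs in parallel on each input split (block) across the cluster.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for every token
      }
    }
  }

  // The reduce phase aggregates all counts for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The map tasks run in parallel on the nodes that hold the input blocks, and the framework shuffles the intermediate (word, 1) pairs to the reduce tasks that sum them, which is what lets the same code scale from a single machine to a large cluster.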
It is designed to process terabytes and even petabytes of structured and unstructured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing (illustrated in the sketch after the list below). It is particularly useful when:
- Complex information processing is needed
- Unstructured data needs to be turned into structured data
- Queries cannot be reasonably expressed using SQL
- Heavily recursive algorithms are involved
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing
- Machine learning is required
- Data sets are too large to fit into database RAM or disks, or would need too many cores (terabytes up to petabytes)
- Data value does not justify the expense of constant real-time availability; archives or special-interest information can be moved to Hadoop and remain available at lower cost
- Results are not needed in real time
- Fault tolerance is critical
- Significant custom coding would be required to handle job scheduling
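Several of the points above trace back to how HDFS physically splits files into blocks and replicates them across slave nodes, as noted earlier. A minimal sketch follows, assuming the cluster's configuration files (core-site.xml and hdfs-site.xml) are on the classpath and using a placeholder path /demo/sample.dat; it writes a file with an explicit block size and replication factor, then asks the NameNode which hosts hold each block.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockDemo {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/demo/sample.dat");   // placeholder path

    // 128 MB blocks, 3 replicas: every block of this file is stored on three DataNodes.
    long blockSize = 128L * 1024 * 1024;
    short replication = 3;
    try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
      // Roughly 200 MB of sample data, so the file spans more than one 128 MB block.
      for (int i = 0; i < 10_000_000; i++) {
        out.writeBytes("sample record " + i + "\n");
      }
    }

    // Ask the NameNode where the blocks of the file ended up.
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}
```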
The problem it addresses is the need for analytical platforms that can scale rapidly and offer the following capabilities:
- Detailed, interactive, multivariate statistical analysis
- Aggregation, correlation, and analysis of historical and current data
- Modeling and simulation, what-if analysis, and forecasting of alternate future states
- Semantic mining of unstructured data, streaming information, and multimedia
It helps organizations to:
- Iterate predictive models more rapidly.
- Run models of increasing complexity.
- Deliver model-driven decisions to more business processes.
Requirements
Hadoop clusters have two types of machines: masters (the HDFS NameNode and the MapReduce JobTracker) and slaves (the HDFS DataNodes and the MapReduce TaskTrackers). The DataNodes, TaskTrackers, and HBase RegionServers are co-located or co-deployed for optimal data locality.
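To make the master/slave split concrete, the fragment below shows the two client-side settings that point a client at the masters of an MRv1 cluster: fs.defaultFS for the HDFS NameNode and mapred.job.tracker for the JobTracker. The host names and ports are placeholders for this sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClusterClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // HDFS master: the NameNode, which tracks the file-to-block mapping.
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");      // placeholder host and port

    // MapReduce master: the JobTracker, which schedules tasks onto the TaskTrackers.
    conf.set("mapred.job.tracker", "jobtracker-host:8021");     // placeholder host and port

    // Actual reads and writes go to the DataNodes that the NameNode reports.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to " + fs.getUri());
  }
}
```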
Slave nodes make up the majority of the hardware infrastructure. Disk space, I/O bandwidth, and computational power are the crucial factors for hardware sizing. Start with 1U machines and use the following recommendations:
Two quad-core CPUs, 12 GB to 24 GB of memory, and four to six disk drives of 2 terabyte (TB) capacity per node are usually enough to start the cluster. With HDFS's default replication factor of three, six 2 TB drives give each slave roughly 4 TB of usable capacity before space is set aside for intermediate data. The minimum network requirement is 1 GbE, connecting all nodes to a Gigabit Ethernet switch.