Hadoop Cluster Management
A computer cluster consists of a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Unlike grid computers, computer clusters have each node set to perform the same task, controlled and scheduled by software.
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Cluster Planning
Planning the cluster is a complex task, and you might have many questions, such as:
- HDFS replicates data and MapReduce creates files. How can I plan my storage needs?
- How do I plan my CPU needs?
- How do I plan my memory needs? Should I consider different needs on some nodes of the cluster?
- I heard that MapReduce moves its job code to where the data to process is located. What does that involve in terms of network bandwidth?
- At what point, and to what extent, should my planning account for what the end users will actually process on the cluster?
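As a starting point for the storage question above, the arithmetic can be sketched in a few lines of Python. This is a rough sketch, not a sizing rule. The replication factor of 3 is the HDFS default. The allowance for temporary files (`temp_fraction`), the target disk fill level (`max_fill`), and both function names are illustrative assumptions that you should replace with figures from your own environment.

```python
import math


def raw_storage_tb(data_tb, replication=3, temp_fraction=0.25, max_fill=0.75):
    """Estimate the raw disk capacity (in TB) needed for data_tb of user data.

    replication   -- HDFS replication factor (3 is the HDFS default)
    temp_fraction -- ASSUMED extra space for temporary job files
    max_fill      -- ASSUMED fraction of each disk you are willing to fill
    """
    if data_tb < 0 or replication < 1 or temp_fraction < 0:
        raise ValueError("data_tb and temp_fraction must be >= 0, replication >= 1")
    if not 0 < max_fill <= 1:
        raise ValueError("max_fill must be in (0, 1]")
    replicated = data_tb * replication             # every block is stored `replication` times
    with_temp = replicated * (1 + temp_fraction)   # headroom for temporary job output
    return with_temp / max_fill                    # keep disks below the fill target


def nodes_needed(data_tb, disk_tb_per_node, **kwargs):
    """Number of worker nodes needed, given usable disk per node (in TB)."""
    if disk_tb_per_node <= 0:
        raise ValueError("disk_tb_per_node must be > 0")
    return math.ceil(raw_storage_tb(data_tb, **kwargs) / disk_tb_per_node)


if __name__ == "__main__":
    print(raw_storage_tb(100))      # raw TB for 100 TB of user data
    print(nodes_needed(100, 24))    # nodes with 24 TB of disk each
```

With these assumed values, 100 TB of user data becomes 300 TB after replication, 375 TB with the temporary-file allowance, and 500 TB of raw capacity once the fill target is applied.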
Workload Patterns
Disk space, I/O bandwidth (required by Hadoop), and computational power (required for the MapReduce processes) are the most important parameters for accurate hardware sizing. Additionally, if you are installing HBase, you also need to analyze your application and its memory requirements, because HBase is a memory-intensive component. Based on the typical use cases for Hadoop, the following workload patterns are commonly observed in production environments:
- Balanced Workload – If your workloads are distributed equally across the various job types (CPU bound, Disk I/O bound, or Network I/O bound), your cluster has a balanced workload pattern. This is a good default configuration for unknown or evolving workloads.
- Compute Intensive – These workloads are CPU bound and are characterized by the need for a large number of CPUs and large amounts of memory to store in-process data. (This usage pattern is typical for natural language processing or HPCC workloads.)
- I/O Intensive – A typical MapReduce job (like sorting) requires very little compute power. Instead, it relies more on the I/O capacity of the cluster (for example, if you have a lot of cold data). For this type of workload, we recommend investing in more disks per box.
- Unknown or evolving workload patterns – You may not know your eventual workload patterns at the outset, and the first jobs submitted to Hadoop in the early days are usually very different from the jobs you will actually run in your production environment. For these reasons, Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve its structure as you analyze the workload patterns in your environment.
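Once a pilot cluster has produced some job history, the patterns above can be turned into a simple first-pass check. The sketch below is only an illustration. The function name, the input shares (how much of your job time is bottlenecked on CPU, disk I/O, or network I/O), and the 50% `dominance` threshold are all assumptions, not values taken from Hadoop or Hortonworks.

```python
def classify_workload(cpu_share, disk_share, network_share, dominance=0.5):
    """Map a measured job mix to one of the workload patterns described above.

    Each *_share is a non-negative weight (for example, hours of job time)
    for jobs bound by that resource. `dominance` is an ASSUMED threshold:
    a resource that accounts for at least this fraction sets the pattern.
    """
    shares = (cpu_share, disk_share, network_share)
    if any(s < 0 for s in shares):
        raise ValueError("shares must be non-negative")
    total = sum(shares)
    if total == 0:
        # No measurements yet: the text suggests starting from Balanced.
        return "unknown"
    if cpu_share / total >= dominance:
        return "compute-intensive"
    if disk_share / total >= dominance:
        return "io-intensive"
    return "balanced"


if __name__ == "__main__":
    print(classify_workload(1, 1, 1))   # evenly spread job types
    print(classify_workload(8, 1, 1))   # mostly CPU-bound jobs
    print(classify_workload(1, 8, 1))   # mostly disk-bound jobs
```

A result of "unknown" corresponds to the last pattern in the list: start with the Balanced configuration and re-run the check as real job history accumulates.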