Stress testing is a very important step before going live. A good stress test helps us to:
- ensure that the software meets its performance requirements
- ensure that the service delivers a fast response time even under heavy load
- get to know the scalability limits, which in turn is useful for planning the next steps of development
Hadoop is not a web application, a database or a web service. You don’t stress test a Hadoop job with a heavy load. Instead, you need to benchmark the cluster, which means assessing its performance by running a variety of jobs, each focused on a specific field (indexing, querying, predictive statistics, machine learning, …).
Benchmarks
HiBench – Intel has released HiBench, a tool dedicated to running such benchmarks. It is a collection of shell scripts published under the Apache License 2 on GitHub: https://github.com/intel-hadoop/HiBench. It allows you to stress test a Hadoop cluster according to several usage profiles.
WordCount – This test distributes the counting of words in a data source across the cluster. The data source is generated by a HiBench preparation script which relies on Hadoop's RandomTextWriter. This test belongs to a class of jobs which extract a small amount of information from a large data source. It is a CPU bound test.
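For reference, here is a minimal sketch of the kind of job this test runs; it is essentially the canonical WordCount example shipped with Hadoop (input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner reduces shuffle volume
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```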
Sort – This test dispatches the sorting of a data source. The data source is generated by a preparation script which relies on the RandomTextWriter of Hadoop. This test is the simplest one you can imagine: both the Map and Reduce stages are identity functions, and the sorting happens automatically during the Shuffle & Merge stage of MapReduce. It is I/O bound.
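To show how simple this job is, the following sketch configures a MapReduce job with the framework's default (identity) Mapper and Reducer, so the output is simply the input sorted by key. It assumes a SequenceFile of Text keys and values; this is an illustration, not HiBench's actual script:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IdentitySort {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sort");
    job.setJarByClass(IdentitySort.class);
    // The base Mapper and Reducer classes pass records through unchanged;
    // the framework sorts by key between the two stages.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```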
TeraSort – This test also dispatches the sorting of a data source. The data source is generated by the TeraGen job, which by default creates 1 billion 100-byte lines. These lines are then sorted by TeraSort. Unlike Sort, TeraSort provides its own input and output formats, as well as its own Partitioner, which ensures that the keys are evenly distributed across all nodes. It is therefore an improved Sort which aims at spreading the load equally between all nodes during the test. Because of this, the test is CPU bound during the Map stage and I/O bound during the Reduce stage.
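The real TeraSort samples its input to compute balanced partition boundaries (a total-order partitioner). As a simplified illustration of the idea, a custom Partitioner that routes contiguous key ranges to reducers could look like this sketch, which is not TeraSort's actual code:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Simplified illustration of range partitioning: route each key to a reducer
// based on its first byte, so each reducer receives a contiguous and roughly
// equal key range. TeraSort itself samples the input to compute balanced
// split points instead of assuming a uniform key distribution.
public class FirstBytePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int firstByte = key.getLength() > 0 ? (key.getBytes()[0] & 0xFF) : 0;
    return firstByte * numPartitions / 256;
  }
}
```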
Enhanced DFSIO – This test is dedicated to HDFS. It aims at measuring the aggregate I/O rate and throughput of HDFS during reads and writes. During its preparation stage, a data source is generated and put on HDFS. Then, two tests are run: a read of the generated data source, and a write of a large amount of data. The write test is basically the same thing as the preparation stage. This test is I/O bound.
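To give an idea of what is being measured, here is a minimal sketch that times a raw HDFS write and read through the FileSystem API. The path, buffer size and file size are arbitrary placeholders; Enhanced DFSIO runs this kind of measurement in parallel across many map tasks and aggregates the results:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsThroughput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/benchmarks/sample.dat"); // hypothetical path
    byte[] buffer = new byte[1 << 20];              // 1 MB buffer
    long totalBytes = 1L << 30;                     // 1 GB test file

    // Timed write: stream the buffer to HDFS until totalBytes are written.
    long start = System.nanoTime();
    try (FSDataOutputStream out = fs.create(file, true)) {
      for (long written = 0; written < totalBytes; written += buffer.length) {
        out.write(buffer);
      }
    }
    double writeSec = (System.nanoTime() - start) / 1e9;

    // Timed read: drain the file back from HDFS.
    start = System.nanoTime();
    try (FSDataInputStream in = fs.open(file)) {
      while (in.read(buffer) > 0) { /* drain */ }
    }
    double readSec = (System.nanoTime() - start) / 1e9;

    double mb = totalBytes / (1024.0 * 1024.0);
    System.out.printf("write: %.1f MB/s, read: %.1f MB/s%n",
        mb / writeSec, mb / readSec);
  }
}
```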
Nutch indexing – This test focuses on the performance of the cluster when it comes to indexing data. To that end, the preparation stage generates the data to be indexed. Indexing is then performed with Apache Nutch. This test is I/O bound with a high CPU utilization during the Map stage.
Page Rank – This test measures the performance of the cluster for PageRank jobs. The preparation phase generates the data source in the form of a graph which can be processed using the PageRank algorithm. The actual ranking is then performed by a chain of 6 MapReduce jobs. This test is CPU bound.
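To see why several chained jobs are needed, here is a sketch of a single PageRank iteration expressed as MapReduce. The value encoding, input format and damping factor are assumptions for illustration; each iteration is one job, so the chain repeats this until the ranks converge:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One PageRank iteration. Assumes KeyValueTextInputFormat:
// key = page id, value = "rank<TAB>outLink1,outLink2,..."
public class PageRankIteration {

  public static class RankMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    public void map(Text page, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      double rank = Double.parseDouble(parts[0]);
      String[] outLinks = parts.length > 1 ? parts[1].split(",") : new String[0];

      // Re-emit the adjacency list so the graph structure survives the iteration.
      context.write(page, new Text("LINKS\t" + (parts.length > 1 ? parts[1] : "")));

      // Send each neighbour its share of this page's rank.
      for (String target : outLinks) {
        context.write(new Text(target), new Text(Double.toString(rank / outLinks.length)));
      }
    }
  }

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85; // assumed damping factor

    @Override
    public void reduce(Text page, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      String links = "";
      for (Text value : values) {
        String v = value.toString();
        if (v.startsWith("LINKS\t")) {
          links = v.substring("LINKS\t".length());
        } else {
          sum += Double.parseDouble(v);
        }
      }
      // Standard simplified PageRank update; output matches the input format
      // so the next job in the chain can consume it directly.
      double newRank = (1 - DAMPING) + DAMPING * sum;
      context.write(page, new Text(newRank + "\t" + links));
    }
  }
}
```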
Naive Bayes Classifier – This test performs a probabilistic classification on a data source. The preparation stage generates the data source. The test then chains two MapReduce jobs through Mahout: seq2sparse transforms the text data source into vectors, and trainnb computes the final model from those vectors. This test is I/O bound with a high CPU utilization during the Map stage of seq2sparse. When using this test, we didn’t observe a real load on the cluster. It seems necessary to either provide your own data source or to greatly increase the amount of data generated during the preparation stage.
K-Means clustering – This test partitions a data source into several clusters where each element belongs to the cluster with the nearest mean. The preparation stage generates the data source, then the algorithm runs on it through Mahout. The K-Means algorithm is composed of two stages, iteration and clustering, each of which runs MapReduce jobs with a specific usage profile: CPU bound for the iterations and I/O bound for the clustering.
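As a reminder of what each iteration computes, here is a plain Java sketch of the K-Means assignment and update steps on random 2D points. This is the sequential algorithm, not Mahout's distributed implementation; the dimensions, data and iteration count are placeholders:

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double[][] points = new double[1000][2];
    for (double[] p : points) { p[0] = rnd.nextDouble(); p[1] = rnd.nextDouble(); }

    int k = 3;
    double[][] centroids = Arrays.copyOfRange(points, 0, k); // naive init

    for (int iter = 0; iter < 10; iter++) {
      double[][] sums = new double[k][2];
      int[] counts = new int[k];

      // Assignment step: each point joins the cluster with the nearest mean.
      for (double[] p : points) {
        int best = 0;
        for (int c = 1; c < k; c++) {
          if (squaredDistance(p, centroids[c]) < squaredDistance(p, centroids[best])) {
            best = c;
          }
        }
        sums[best][0] += p[0];
        sums[best][1] += p[1];
        counts[best]++;
      }

      // Update step: recompute each mean. Mahout distributes both steps
      // as MapReduce jobs over the cluster.
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          centroids[c] = new double[] { sums[c][0] / counts[c], sums[c][1] / counts[c] };
        }
      }
    }
    System.out.println("Final centroids: " + Arrays.deepToString(centroids));
  }
}
```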
Analytical Query – This class of tests performs queries that correspond to the usage profile of business analysts and other database users. The data source is generated during the preparation stage. Two tables are created: a rankings table and a uservisits table, a common schema found in many web applications. Once the data source has been generated, two Hive queries are performed: a join and an aggregation. These tests are I/O bound.
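As an illustration, queries of this kind could be submitted through Hive's JDBC interface as follows. The host, column names and the statements themselves are assumptions based on the schema described above, not HiBench's actual queries:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AnalyticalQueries {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver; host and database are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn =
             DriverManager.getConnection("jdbc:hive2://namenode:10000/default");
         Statement stmt = conn.createStatement()) {

      // Aggregation: total ad revenue per source IP prefix (hypothetical columns).
      ResultSet agg = stmt.executeQuery(
          "SELECT SUBSTR(sourceIP, 1, 8) AS prefix, SUM(adRevenue) " +
          "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)");
      while (agg.next()) {
        System.out.println(agg.getString(1) + "\t" + agg.getDouble(2));
      }

      // Join: combine page rankings with the visits that reached them.
      ResultSet join = stmt.executeQuery(
          "SELECT r.pageRank, v.sourceIP FROM rankings r " +
          "JOIN uservisits v ON (r.pageURL = v.destURL)");
      while (join.next()) {
        System.out.println(join.getInt(1) + "\t" + join.getString(2));
      }
    }
  }
}
```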