Cluster monitoring using metrics

HBase emits metrics that adhere to the Hadoop Metrics API. Starting with HBase 0.95, HBase is configured to emit a default set of metrics with a default sampling period of 10 seconds. You can use HBase metrics in conjunction with Ganglia. You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.

Metric Setup

For HBase 0.95 and newer, HBase ships with a default metrics configuration, or sink. This includes a wide variety of individual metrics, and emits them every 10 seconds by default. To configure metrics for a given region server, edit the conf/hadoop-metrics2-hbase.properties file. Restart the region server for the changes to take effect.

To change the sampling rate for the default sink, edit the line beginning with *.period. To filter which metrics are emitted or to extend the metrics framework, refer to the documentation for the Hadoop Metrics API.
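As a rough illustration of the *.period setting, the following sketch parses a properties fragment with the JDK's java.util.Properties and reads the sampling period. The fragment text, the class name MetricsPeriodExample, and the fallback value are assumptions made for this example; the file shipped with your HBase version may contain different lines.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: reads the sampling period from a metrics2-style properties fragment. */
final class MetricsPeriodExample {

    // Illustrative fragment only; not a copy of the file shipped with HBase.
    static final String SAMPLE =
        "# default sampling period, in seconds\n"
      + "*.period=30\n";

    /** Returns the value of the "*.period" key, or the fallback if the key is absent. */
    static int samplingPeriodSeconds(String propertiesText, int fallback) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty("*.period");
        return value == null ? fallback : Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        System.out.println("period = " + samplingPeriodSeconds(SAMPLE, 10) + " s");
    }
}
```

A line that is commented out with # is ignored by the parser, so the helper falls back to the supplied default in that case.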

Disabling Metrics

To disable metrics for a region server, edit the conf/hadoop-metrics2-hbase.properties file and comment out any uncommented lines. Restart the region server for the changes to take effect.

Interface ClusterMetrics

Metrics information on the HBase cluster. ClusterMetrics provides clients with information such as:

The count and names of region servers in the cluster.
The count and names of dead region servers in the cluster.
The name of the active master for the cluster.
The name(s) of the backup master(s) for the cluster, if they exist.
The average cluster load.
The number of regions deployed on the cluster.
The number of requests since last report.
Detailed region server loading and resource usage information, per server and per region.
The regions in transition at the master.
The unique cluster ID.

ClusterMetrics.Option provides a way to request only the cluster information you need. By default, all of the cluster information is returned. If information about live servers is all that is wanted, request it as follows:

Admin admin = connection.getAdmin();

ClusterMetrics metrics = admin.getClusterMetrics(EnumSet.of(Option.LIVE_SERVERS));

Units of Measure for Metrics

Different metrics are expressed in different units, as appropriate. Often, the unit of measure is in the name (as in the metric shippedKBs). Otherwise, use the following guidelines. When in doubt, you may need to examine the source for a given metric.

Metrics that refer to a point in time are usually expressed as a timestamp.
Metrics that refer to an age (such as ageOfLastShippedOp) are usually expressed in milliseconds.
Metrics that refer to memory sizes are in bytes.
Sizes of queues (such as sizeOfLogQueue) are expressed as the number of items in the queue. To estimate the size in bytes, multiply the number of items by the block size (the HDFS default is 128 MB in Hadoop 2 and later; older releases used 64 MB).
Metrics that count operations of a given type (such as logEditsRead) are expressed as an integer.
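The unit guidelines above can be sketched as two small conversions: an age reported in milliseconds turned into a java.time.Duration, and a queue length turned into an estimated byte count. The class and method names are hypothetical, and the block size is a parameter because the right value depends on your cluster's configuration.

```java
import java.time.Duration;

/** Hypothetical helpers for interpreting metric units. */
final class MetricUnitsExample {

    /** Converts an age metric reported in milliseconds to a Duration. */
    static Duration ageToDuration(long ageMillis) {
        return Duration.ofMillis(ageMillis);
    }

    /** Estimates queued bytes as item count times block size; throws on overflow. */
    static long estimateQueueBytes(long itemsInQueue, long blockSizeBytes) {
        return Math.multiplyExact(itemsInQueue, blockSizeBytes);
    }

    public static void main(String[] args) {
        // Example value only; substitute your cluster's configured block size.
        long blockSize = 64L * 1024 * 1024;
        System.out.println(ageToDuration(90_000).getSeconds() + " s");
        System.out.println(estimateQueueBytes(3, blockSize) + " bytes");
    }
}
```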

Important Master Metrics

Counts are usually over the last metrics reporting interval.

master.numRegionServers – Number of live regionservers
master.numDeadRegionServers – Number of dead regionservers
master.ritCount – The number of regions in transition
master.ritCountOverThreshold – The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
master.ritOldestAge – The age of the longest region in transition, in milliseconds
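One way these master metrics might feed a basic health check is sketched below. It assumes the values have already been collected into a map keyed by metric name; how they are collected, the class name MasterHealthExample, and the age limit passed in are all assumptions of this example rather than anything HBase provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical health check over a snapshot of master metrics. */
final class MasterHealthExample {

    /** Returns human-readable warnings for the given snapshot; empty if nothing looks wrong. */
    static List<String> warnings(Map<String, Long> snapshot, long maxRitAgeMillis) {
        List<String> out = new ArrayList<>();
        long dead = snapshot.getOrDefault("master.numDeadRegionServers", 0L);
        long stuck = snapshot.getOrDefault("master.ritCountOverThreshold", 0L);
        long oldest = snapshot.getOrDefault("master.ritOldestAge", 0L);
        if (dead > 0) {
            out.add(dead + " dead region server(s)");
        }
        if (stuck > 0) {
            out.add(stuck + " region(s) in transition past the threshold");
        }
        if (oldest > maxRitAgeMillis) {
            out.add("oldest region in transition is " + oldest + " ms old");
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> snapshot = Map.of(
            "master.numDeadRegionServers", 1L,
            "master.ritCountOverThreshold", 0L,
            "master.ritOldestAge", 5_000L);
        // The 60-second limit is an arbitrary value chosen for this example.
        warnings(snapshot, 60_000L).forEach(System.out::println);
    }
}
```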

Important RegionServer Metrics

Counts are usually over the last metrics reporting interval.

regionserver.regionCount – The number of regions hosted by the regionserver
regionserver.storeFileCount – The number of store files on disk currently managed by the regionserver
regionserver.storeFileSize – Aggregate size of the store files on disk
regionserver.hlogFileCount – The number of write ahead logs not yet archived
regionserver.totalRequestCount – The total number of requests received
regionserver.readRequestCount – The number of read requests received
regionserver.writeRequestCount – The number of write requests received
regionserver.numOpenConnections – The number of open connections at the RPC layer
regionserver.numActiveHandler – The number of RPC handlers actively servicing requests
regionserver.numCallsInGeneralQueue – The number of currently enqueued user requests
regionserver.numCallsInReplicationQueue – The number of currently enqueued operations received from replication
regionserver.numCallsInPriorityQueue – The number of currently enqueued priority (internal housekeeping) requests
regionserver.flushQueueLength – Current depth of the memstore flush queue. If increasing, we are falling behind with clearing memstores out to HDFS.
regionserver.updatesBlockedTime – Number of milliseconds updates have been blocked so the memstore can be flushed
regionserver.compactionQueueLength – Current depth of the compaction request queue. If increasing, we are falling behind with storefile compaction.
regionserver.blockCacheHitCount – The number of block cache hits
regionserver.blockCacheMissCount – The number of block cache misses
regionserver.blockCacheExpressHitPercent – The percent of the time that requests with the cache turned on hit the cache
regionserver.percentFilesLocal – Percent of store file data that can be read from the local DataNode, 0-100
regionserver.<op>_<measure> – Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
regionserver.slow<op>Count – The number of operations we thought were slow, where <op> is one of the list above
regionserver.GcTimeMillis – Time spent in garbage collection, in milliseconds
regionserver.GcTimeMillisParNew – Time spent in garbage collection of the young generation, in milliseconds
regionserver.GcTimeMillisConcurrentMarkSweep – Time spent in garbage collection of the old generation, in milliseconds
regionserver.authenticationSuccesses – Number of client connections where authentication succeeded
regionserver.authenticationFailures – Number of client connection authentication failures
regionserver.mutationsWithoutWALCount – Count of writes submitted with a flag indicating they should bypass the write ahead log
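Two of the readings above lend themselves to simple derived checks: an overall hit percentage computed from blockCacheHitCount and blockCacheMissCount, and a test for whether a queue depth such as flushQueueLength or compactionQueueLength keeps growing across samples. The sketch below is a minimal illustration with hypothetical names; it assumes the samples were gathered elsewhere, and the "strictly increasing" rule is just one possible definition of falling behind.

```java
/** Hypothetical helpers for interpreting region server metrics. */
final class RegionServerMetricsExample {

    /** Hit percentage from hit and miss counts; 0 when there were no lookups. */
    static double hitPercent(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : 100.0 * hits / total;
    }

    /** True when every sample is strictly larger than the one before it. */
    static boolean isSteadilyIncreasing(long[] samples) {
        if (samples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("hit percent: " + hitPercent(900, 100));
        long[] flushQueueSamples = {2, 5, 9, 14}; // made-up sample values
        System.out.println("falling behind: " + isSteadilyIncreasing(flushQueueSamples));
    }
}
```

This derived percentage covers all lookups, so it need not match blockCacheExpressHitPercent, which only considers requests with the cache turned on.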