Cluster monitoring using metrics

HBase emits metrics that adhere to the Hadoop Metrics API. Starting with HBase 0.95, HBase is configured to emit a default set of metrics with a default sampling period of 10 seconds. You can use HBase metrics in conjunction with Ganglia. You can also filter which metrics are emitted and extend the metrics framework to capture custom metrics appropriate for your environment.

Metric Setup

For HBase 0.95 and newer, HBase ships with a default metrics configuration, or sink. This includes a wide variety of individual metrics, and emits them every 10 seconds by default. To configure metrics for a given region server, edit the conf/hadoop-metrics2-hbase.properties file. Restart the region server for the changes to take effect.

To change the sampling rate for the default sink, edit the line beginning with *.period. To filter which metrics are emitted or to extend the metrics framework, see the Hadoop Metrics2 API documentation.
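
As an illustration of the sampling-rate change described above, the fragment below raises the period from the default 10 seconds to 30 seconds. The value 30 is only an example; the rest of the file is assumed to be left as shipped.

```properties
# conf/hadoop-metrics2-hbase.properties
# Sampling period in seconds; the shipped default is 10.
# 30 is an illustrative value, not a recommendation.
*.period=30
```

The region server must be restarted before the new period takes effect.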

Disabling Metrics

To disable metrics for a region server, edit the conf/hadoop-metrics2-hbase.properties file and comment out any uncommented lines. Restart the region server for the changes to take effect.

Interface ClusterMetrics

Metrics information on the HBase cluster. ClusterMetrics provides clients with information such as:

  • The count and names of region servers in the cluster.
  • The count and names of dead region servers in the cluster.
  • The name of the active master for the cluster.
  • The name(s) of the backup master(s) for the cluster, if they exist.
  • The average cluster load.
  • The number of regions deployed on the cluster.
  • The number of requests since last report.
  • Detailed region server loading and resource usage information, per server and per region.
  • The regions in transition at the master.
  • The unique cluster ID.

ClusterMetrics.Option lets a client request only the cluster status information it needs. Requesting every option (for example, EnumSet.allOf(Option.class)) returns all of the cluster information. If only information about live servers is wanted, request it as follows:

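// Requires java.util.EnumSet and org.apache.hadoop.hbase.ClusterMetrics.Option.
Admin admin = connection.getAdmin();

// In HBase 2.x, Admin.getClusterMetrics is the method that returns ClusterMetrics.
ClusterMetrics metrics = admin.getClusterMetrics(EnumSet.of(Option.LIVE_SERVERS));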

Units of Measure for Metrics

Different metrics are expressed in different units, as appropriate. Often, the unit of measure is in the name (as in the metric shippedKBs). Otherwise, use the following guidelines. When in doubt, you may need to examine the source for a given metric.

  • Metrics that refer to a point in time are usually expressed as a timestamp.
  • Metrics that refer to an age (such as ageOfLastShippedOp) are usually expressed in milliseconds.
  • Metrics that refer to memory sizes are in bytes.
  • Sizes of queues (such as sizeOfLogQueue) are expressed as the number of items in the queue. To estimate the size in bytes, multiply the item count by the HDFS block size (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
  • Metrics that refer to things like the number of a given type of operations (such as logEditsRead) are expressed as an integer.
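
The unit conventions above can be applied with a few lines of plain Java. The sketch below is illustrative only: the class MetricUnits and its method names are hypothetical and not part of HBase, and the block size is passed in by the caller because it depends on the HDFS configuration.

```java
import java.time.Duration;

/** Hypothetical helper for interpreting raw metric values; not part of HBase. */
final class MetricUnits {
    private MetricUnits() {}

    /** Rough estimate of a queue's size in bytes: item count times block size. */
    static long estimateQueueBytes(long itemsInQueue, long blockSizeBytes) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(itemsInQueue, blockSizeBytes);
    }

    /** Renders an age metric given in milliseconds as hours, minutes, and seconds. */
    static String formatAgeMillis(long ageMillis) {
        Duration d = Duration.ofMillis(ageMillis);
        long hours = d.toHours();
        long minutes = d.toMinutes() % 60;
        long seconds = d.getSeconds() % 60;
        return String.format("%dh %dm %ds", hours, minutes, seconds);
    }
}
```

For example, a sizeOfLogQueue of 3 with a 64 MB block size would be passed as MetricUnits.estimateQueueBytes(3, 64L * 1024 * 1024), and an ageOfLastShippedOp value as MetricUnits.formatAgeMillis(value).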

Important Master Metrics

Counts are usually over the last metrics reporting interval.

  • master.numRegionServers – Number of live regionservers
  • master.numDeadRegionServers – Number of dead regionservers
  • master.ritCount – The number of regions in transition
  • master.ritCountOverThreshold – The number of regions that have been in transition longer than a threshold time (default: 60 seconds)
  • master.ritOldestAge – The age of the longest region in transition, in milliseconds
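
One way to act on these master metrics is a simple health check over a snapshot of their values. The sketch below assumes the snapshot has already been collected into a map keyed by the metric names listed above; how it is collected (JMX, a sink, or otherwise) is out of scope. The class MasterHealthCheck and its warning rules are hypothetical, not part of HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical check over a snapshot of master metrics; rules are illustrative. */
final class MasterHealthCheck {
    private MasterHealthCheck() {}

    /** Returns one human-readable warning per rule that the snapshot violates. */
    static List<String> warnings(Map<String, Long> snapshot) {
        List<String> out = new ArrayList<>();
        // Missing metrics are treated as zero rather than as an error.
        long dead = snapshot.getOrDefault("master.numDeadRegionServers", 0L);
        long stuck = snapshot.getOrDefault("master.ritCountOverThreshold", 0L);
        if (dead > 0) {
            out.add(dead + " dead region server(s)");
        }
        if (stuck > 0) {
            out.add(stuck + " region(s) in transition over threshold");
        }
        return out;
    }
}
```

An empty list means neither rule fired; a real deployment would likely tune these rules to its own tolerance for dead servers and long-running transitions.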

Important RegionServer Metrics

Counts are usually over the last metrics reporting interval.

  • regionserver.regionCount – The number of regions hosted by the regionserver
  • regionserver.storeFileCount – The number of store files on disk currently managed by the regionserver
  • regionserver.storeFileSize – Aggregate size of the store files on disk
  • regionserver.hlogFileCount – The number of write ahead logs not yet archived
  • regionserver.totalRequestCount – The total number of requests received
  • regionserver.readRequestCount – The number of read requests received
  • regionserver.writeRequestCount – The number of write requests received
  • regionserver.numOpenConnections – The number of open connections at the RPC layer
  • regionserver.numActiveHandler – The number of RPC handlers actively servicing requests
  • regionserver.numCallsInGeneralQueue – The number of currently enqueued user requests
  • regionserver.numCallsInReplicationQueue – The number of currently enqueued operations received from replication
  • regionserver.numCallsInPriorityQueue – The number of currently enqueued priority (internal housekeeping) requests
  • regionserver.flushQueueLength – Current depth of the memstore flush queue. If increasing, we are falling behind with clearing memstores out to HDFS.
  • regionserver.updatesBlockedTime – Number of milliseconds updates have been blocked so the memstore can be flushed
  • regionserver.compactionQueueLength – Current depth of the compaction request queue. If increasing, we are falling behind with storefile compaction.
  • regionserver.blockCacheHitCount – The number of block cache hits
  • regionserver.blockCacheMissCount – The number of block cache misses
  • regionserver.blockCacheExpressHitPercent – The percent of the time that requests with the cache turned on hit the cache
  • regionserver.percentFilesLocal – Percent of store file data that can be read from the local DataNode, 0-100
  • regionserver.<op>_<measure> – Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile
  • regionserver.slow<op>Count – The number of operations we thought were slow, where <op> is one of the list above
  • regionserver.GcTimeMillis – Time spent in garbage collection, in milliseconds
  • regionserver.GcTimeMillisParNew – Time spent in garbage collection of the young generation, in milliseconds
  • regionserver.GcTimeMillisConcurrentMarkSweep – Time spent in garbage collection of the old generation, in milliseconds
  • regionserver.authenticationSuccesses – Number of client connections where authentication succeeded
  • regionserver.authenticationFailures – Number of client connection authentication failures
  • regionserver.mutationsWithoutWALCount – Count of writes submitted with a flag indicating they should bypass the write ahead log
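
Some useful indicators are derived from these metrics rather than read directly. The sketch below shows two such derivations in plain Java: a block cache hit ratio computed from blockCacheHitCount and blockCacheMissCount taken from the same sample, and a check for a steadily growing queue such as flushQueueLength or compactionQueueLength. The class RegionServerIndicators is hypothetical and not part of HBase; the "strictly increasing" rule is one simple interpretation of "falling behind".

```java
/** Hypothetical helpers for derived RegionServer indicators; not part of HBase. */
final class RegionServerIndicators {
    private RegionServerIndicators() {}

    /** Fraction of block cache lookups that were hits, or 0.0 if there were none. */
    static double blockCacheHitRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** True if each queue-length sample is strictly greater than the one before it. */
    static boolean isStrictlyIncreasing(long[] samples) {
        if (samples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

Passing the last few flushQueueLength readings, oldest first, to isStrictlyIncreasing gives a crude signal that flushes are not keeping up with writes.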