Splits and Compaction Optimization Techniques

Apache HBase distributes its load through region splitting. HBase stores rows in tables, and each table is split into 'regions'. Those regions are distributed across the cluster and are hosted and made available to client processes by the RegionServer processes. All rows in a table are sorted between a region's start key and end key. Every row belongs to exactly one region, and a region is served by a single region server at any given point in time.

HBase Table Regions

Regions are the physical mechanism used to distribute the write and query load across region servers in HBase. A table in HBase consists of many regions, each assigned to a region server. When a table is created, HBase allocates a single region to it by default. As a result, the initial load on an HBase table does not utilize the entire capacity of the cluster.

Pre-splitting HBase Tables

As mentioned in the previous section, HBase allocates only one region to a new table because it does not know how to split the table into multiple regions on its own. With pre-splitting, you can create an HBase table with many regions by supplying the split points at table creation time.

However, pre-splitting carries a risk: if the split points do not match the actual key distribution, data skew will leave the load unevenly spread across the regions. You should always know the key distribution before pre-splitting a table to avoid data skew.

Calculating Split Points for Tables

You can use the RegionSplitter utility to identify suitable split points for a table. RegionSplitter creates the split points using either the HexStringSplit or the UniformSplit split algorithm.

For example, create table 'table1' with 5 regions:

$ hbase org.apache.hadoop.hbase.util.RegionSplitter table1 HexStringSplit -c 5 -f cf

2018-02-06 12:32:05,595 INFO [main] zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x2accdbb5 connecting to ZooKeeper ensemble=server.example.co.in:2181, server2.example.co.in:2181

2018-02-06 12:32:05,600 INFO [main] zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.6-1245-1, built on 08/26/2016 00:47 GMT

2018-02-06 12:32:08,808 INFO [main] client.HBaseAdmin: Created table1

Pre-splitting HBase Tables: Example

If you already know the split points, you can create the table with them directly from the HBase shell. Below is an example of creating a pre-split HBase table:

hbase(main):003:0> create 'table1', 'cf', SPLITS => ['a', 'b', 'c']

0 row(s) in 2.2840 seconds

=> Hbase::Table - table1
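The same pre-split table can also be created programmatically. Below is a minimal Java sketch using the 0.94-era HBaseAdmin/HTableDescriptor client API that the rest of this post uses; the table name, column family, and split points simply mirror the shell example above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

HTableDescriptor tableDesc = new HTableDescriptor("table1");
tableDesc.addFamily(new HColumnDescriptor("cf"));

// Each split key becomes the start row of a new region, so three split
// points ('a', 'b', 'c') yield four regions.
byte[][] splitKeys = { Bytes.toBytes("a"), Bytes.toBytes("b"), Bytes.toBytes("c") };

admin.createTable(tableDesc, splitKeys);
admin.close();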

Split steps

  1. A region becomes a candidate for splitting when one of its store files grows beyond hbase.hregion.max.filesize, or according to the configured region split policy.
  2. At this point the region server divides the region into two.
  3. The region server creates two reference files for the two daughter regions.
  4. These reference files are stored in a new directory called "splits" under the parent region's directory.
  5. Exactly at this point the parent region is marked as closed/offline, so that no client tries to read from or write to it.
  6. The region server then creates two new directories inside the splits directory, one for each daughter region.
  7. If everything up to step 6 completes successfully, the region server moves both daughter region directories under the table directory.
  8. The META table is informed of the creation of the two new regions, and the parent region's entry is updated to record that it has been split and is now offline (OFFLINE=true, SPLIT=true).
  9. The reference files are very small files containing only the key at which the split happened and whether they represent the top half or the bottom half of the parent region.
  10. A class called "HalfHFileReader" then uses these reference files to read the original data file of the parent region and to decide which half of the file has to be read.
  11. Both daughter regions are now brought online by the region server and start serving client requests.
  12. As soon as the daughter regions come online, a compaction is scheduled that rewrites the parent region's HFiles into two independent HFiles, one for each daughter region.
  13. As the compaction in step 12 completes, the new HFiles cleanly replace their respective reference files. The compaction activity happens under the .tmp directory of the daughter regions.
  14. Once everything up to step 13 has completed successfully, the parent region is removed from META and all of its files and directories are marked for deletion.
  15. Finally, the region server informs the Master about the two newly created regions. The Master then decides whether to let them run on the same region server or to move one of them to another server.

Auto splitting

Regardless of whether pre-splitting is used, once a region reaches a certain size it is automatically split into two regions. If you are using HBase 0.94 (which ships with HDP-1.2), you can configure when HBase decides to split a region, and how it calculates the split points, via the pluggable RegionSplitPolicy API. There are a few predefined region split policies: ConstantSizeRegionSplitPolicy, IncreasingToUpperBoundRegionSplitPolicy, and KeyPrefixRegionSplitPolicy.

The first one is the default and only split policy for HBase versions before 0.94. It splits a region when the total data size of one of its stores (each corresponding to a column family) grows bigger than the configured hbase.hregion.max.filesize, which has a default value of 10GB. This split policy is ideal when you have done pre-splitting and are interested in keeping the number of regions per region server low.

The default split policy for HBase 0.94 and trunk is IncreasingToUpperBoundRegionSplitPolicy, which splits more aggressively based on the number of regions hosted on the same region server. The split policy uses a maximum store file size of min(R^2 * hbase.hregion.memstore.flush.size, hbase.hregion.max.filesize), where R is the number of regions of the same table hosted on the same region server. For example, with the default memstore flush size of 128MB and the default max store size of 10GB, the first region on the region server is split just after the first flush at 128MB. As the number of regions hosted on the region server increases, the policy uses increasing split sizes: 512MB, 1152MB, 2GB, 3.2GB, 4.6GB, 6.2GB, and so on. After reaching 9 regions, the computed split size exceeds the configured hbase.hregion.max.filesize, and the 10GB split size is used from then on. For both of these algorithms, regardless of when splitting occurs, the split point used is the row key that corresponds to the midpoint of the "block index" of the largest store file in the largest store.
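To make the arithmetic concrete, the small Java sketch below simply evaluates the min(R^2 * flush size, max file size) formula from the paragraph above with the default values; it is not HBase code itself, just the calculation.

long flushSize   = 128L * 1024 * 1024;          // hbase.hregion.memstore.flush.size default (128MB)
long maxFileSize = 10L * 1024 * 1024 * 1024;    // hbase.hregion.max.filesize default (10GB)

for (int regions = 1; regions <= 10; regions++) {
    long splitSize = Math.min((long) regions * regions * flushSize, maxFileSize);
    System.out.println(regions + " region(s) -> split at " + (splitSize / (1024 * 1024)) + " MB");
}
// Prints 128, 512, 1152, 2048, 3200, 4608, 6272, 8192 MB, then caps at
// 10240 MB (10GB) once the computed value exceeds the max at 9 regions.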

KeyPrefixRegionSplitPolicy is a curious addition to the HBase arsenal. You configure the length of the prefix used to group your row keys, and this split policy ensures that regions are never split in the middle of a group of rows sharing the same prefix. If you have set prefixes for your keys, you can use this split policy to guarantee that rows with the same row key prefix always end up in the same region. This grouping of records is sometimes referred to as "Entity Groups" or "Row Groups", and it is a key feature to consider if you plan to use the "local transactions" feature in your application design.
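If you want to try this policy, the sketch below shows one way to pin it to a table through the table descriptor. The table name is hypothetical, and the attribute name "KeyPrefixRegionSplitPolicy.prefix_length" should be verified against your HBase release.

HTableDescriptor groupedDesc = new HTableDescriptor("entity-table");   // hypothetical table name
groupedDesc.setValue(HTableDescriptor.SPLIT_POLICY,
        "org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy");
// Rows whose keys share the same first 8 bytes are treated as one group and
// are never separated by a split (attribute name may vary by release).
groupedDesc.setValue("KeyPrefixRegionSplitPolicy.prefix_length", "8");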

You can configure the split policy to be used by setting the configuration property hbase.regionserver.region.split.policy, or by configuring the table descriptor. For you brave souls, you can also implement your own custom split policy and plug it in at table creation time, or by modifying an existing table:

HTableDescriptor tableDesc = new HTableDescriptor("example-table");
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY, AwesomeSplitPolicy.class.getName());
// add column families, etc.
admin.createTable(tableDesc);

If you are doing pre-splitting and want to manage region splits manually, you can also disable automatic splits by setting hbase.hregion.max.filesize to a high value and setting the split policy to ConstantSizeRegionSplitPolicy. However, you should use a safeguard value such as 100GB, so that a region cannot grow beyond what a single region server can handle. You can consider disabling automated splitting and relying on the initial set of regions from pre-splitting if, for example, you are using uniform hashes for your key prefixes and you can ensure that both the read/write load and the size of each region are uniform across the regions of the table.
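A hedged sketch of that safeguard, applied per table through the table descriptor; the table name is illustrative:

HTableDescriptor manualDesc = new HTableDescriptor("presplit-table");   // hypothetical table name
manualDesc.setValue(HTableDescriptor.SPLIT_POLICY,
        "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");
// 100GB safeguard: automatic splits effectively stop, but a runaway region can still split.
manualDesc.setMaxFileSize(100L * 1024 * 1024 * 1024);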

Forced Splits

HBase also enables clients to force-split an online table from the client side. For example, the HBase shell can be used to split all regions of a table, or to split a single region, optionally supplying a split point:

hbase(main):024:0> split 'b07d0034cbe72cb040ae9cf66300a10c', 'b'

0 row(s) in 0.1620 seconds

With careful monitoring of your HBase load distribution, if you see that some regions are receiving uneven load, you may consider manually splitting those regions to even out the load and improve throughput. Another reason to do manual splits is when the initial splits for the table turn out to be suboptimal and you have disabled automated splits. That can happen, for example, if the data distribution changes over time.
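The same forced split can also be issued from Java through the HBaseAdmin API used earlier; a minimal sketch, with an illustrative table name and the split point 'b' from the shell example:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

admin.split("table1");        // ask HBase to split every region of the table
admin.split("table1", "b");   // or split at an explicit row key

admin.close();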

Offline Compaction Tool

CompactionTool provides a way of running compactions (either minor or major) as a process independent of the RegionServer. It reuses the same internal implementation classes executed by the RegionServer compaction feature. However, since it runs in a completely separate Java process, it relieves the RegionServers of the overhead involved in rewriting a set of HFiles, which can be critical for latency-sensitive use cases.

Usage:

$ ./bin/hbase org.apache.hadoop.hbase.regionserver.CompactionTool

Usage: java org.apache.hadoop.hbase.regionserver.CompactionTool \

[-compactOnce] [-major] [-mapred] [-D<property=value>]* files…

Options:

  • mapred     Use MapReduce to run compaction.
  • compactOnce Execute just one compaction step. (default: while needed)
  • major Trigger major compaction.

Note: -D properties will be applied to the conf used.

For example:

To stop delete of compacted file, pass -Dhbase.compactiontool.delete=false

To set tmp dir, pass -Dhbase.tmp.dir=ALTERNATE_DIR

Examples:

To compact the full 'TestTable' using MapReduce:

$ hbase org.apache.hadoop.hbase.regionserver.CompactionTool -mapred hdfs://hbase/data/default/TestTable

To compact column family 'x' of the table 'TestTable' region 'abc':

$ hbase org.apache.hadoop.hbase.regionserver.CompactionTool hdfs://hbase/data/default/TestTable/abc/x

As shown by the usage options above, CompactionTool can run as a standalone client or as a MapReduce job. When running as a MapReduce job, each family directory is handled as an input split and is processed by a separate map task.

The compactOnce parameter controls how many compaction cycles are performed before CompactionTool finishes its work. If omitted, the tool keeps running compactions on each specified family as determined by the configured compaction policy. See the HBase reference guide for more information on compaction policies.

If a major compaction is desired, the -major flag can be specified. If omitted, CompactionTool assumes that a minor compaction is wanted by default.

It also allows configuration overrides with the -D flag. In the usage section above, for example, the -Dhbase.compactiontool.delete=false option instructs the compaction engine not to delete the original files from the temp folder.

Files targeted for compaction must be specified as parent HDFS directories. Multiple directories can be passed, as long as each of them is either a family, a region, or a table directory. If a table or region directory is passed, the program recursively iterates through the related sub-folders, effectively running the compaction for each family found below the table/region level.

Since these directories are nested under the HBase HDFS directory tree, CompactionTool requires HBase superuser permissions in order to access the required HFiles.
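For comparison, compactions can also be requested online through the client API, in which case the RegionServers themselves do the rewriting, which is exactly the overhead CompactionTool is designed to avoid. A minimal sketch, reusing the HBaseAdmin style from the earlier examples and the 'TestTable' name from above:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);

admin.majorCompact("TestTable");   // queue a major compaction, executed by the RegionServers
// admin.compact("TestTable");     // or request a minor compaction instead

admin.close();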
