Schema definition and HBaseAdmin

HBase schemas can be created or updated using the Apache HBase Shell or the Admin interface in the Java API.

Tables must be disabled when making ColumnFamily modifications, for example:

Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();   // Admin is an interface, so obtain it from a Connection
TableName table = TableName.valueOf("myTable");

admin.disableTable(table);

HColumnDescriptor cf1 = …;
admin.addColumn(table, cf1);      // adding new ColumnFamily

HColumnDescriptor cf2 = …;
admin.modifyColumn(table, cf2);    // modifying existing ColumnFamily

admin.enableTable(table);

When changes are made to either Tables or ColumnFamilies (e.g. region size, block size), these changes take effect the next time there is a major compaction and the StoreFiles get re-written.

Table Properties

The table descriptor offers getters and setters for other table options. In practice, many of them are rarely used, but it is worth knowing them all, as they can be used to fine-tune the table's performance.

Name – The constructor already takes the table name as a parameter. The Java API has additional methods to read the name or change it.

byte[] getName();
String getNameAsString();
void setName(byte[] name);

Table Schema Rules Of Thumb

There are many different data sets, with different access patterns and service-level expectations. Therefore, these rules of thumb are only an overview. Read the rest of this chapter to get more details after you have gone through this list.

  • Aim to have regions sized between 10 and 50 GB.
  • Aim to have cells no larger than 10 MB, or 50 MB if you use MOB. Otherwise, consider storing your cell data in HDFS and storing a pointer to the data in HBase.
  • A typical schema has between 1 and 3 column families per table. HBase tables should not be designed to mimic RDBMS tables.
  • Around 50-100 regions is a good number for a table with 1 or 2 column families. Remember that a region is a contiguous segment of a column family.
  • Keep your column family names as short as possible. The column family names are stored for every value (ignoring prefix encoding). They should not be self-documenting and descriptive like in a typical RDBMS.
  • If you are storing time-based machine data or logging information, and the row key is based on device ID or service ID plus time, you can end up with a pattern where older data regions never have additional writes beyond a certain age. In this type of situation, you end up with a small number of active regions and a large number of older regions which have no new writes. For these situations, you can tolerate a larger number of regions because your resource consumption is driven by the active regions only.
  • If only one column family is busy with writes, only that column family accumulates memory. Be aware of write patterns when allocating resources.
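
The numeric rules in the list above can be turned into a quick sanity check for a planned table. The following is only a minimal sketch in plain Java and does not use the HBase API: the class name SchemaRulesOfThumb, the method check, and all of its parameters are hypothetical names chosen for illustration, and the inputs are estimates you would supply for your own data set. It covers region size, cell size, and column family count; the 50-100 regions guideline and the write-pattern advice are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks a planned table layout against the rules of thumb. */
class SchemaRulesOfThumb {

    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /**
     * Returns one warning per violated rule; an empty list means the plan is
     * within the guidelines. All parameters are estimates supplied by the caller.
     */
    static List<String> check(long tableBytes, int regionCount, int columnFamilies,
                              long maxCellBytes, boolean usesMob) {
        List<String> warnings = new ArrayList<>();

        // Rule: aim for regions sized between 10 and 50 GB.
        long avgRegionBytes = tableBytes / Math.max(1, regionCount);
        if (avgRegionBytes > 50 * GB) {
            warnings.add("average region larger than 50 GB; consider more regions");
        } else if (regionCount > 1 && avgRegionBytes < 10 * GB) {
            warnings.add("average region smaller than 10 GB; consider fewer regions");
        }

        // Rule: cells no larger than 10 MB, or 50 MB when MOB is used.
        long cellLimit = (usesMob ? 50 : 10) * MB;
        if (maxCellBytes > cellLimit) {
            warnings.add("largest cell exceeds " + (cellLimit / MB)
                    + " MB; consider storing the data in HDFS with a pointer in HBase");
        }

        // Rule: a typical schema has between 1 and 3 column families.
        if (columnFamilies < 1 || columnFamilies > 3) {
            warnings.add("column family count outside the typical range of 1 to 3");
        }
        return warnings;
    }

    public static void main(String[] args) {
        // Assumed example: 600 GB of data in 4 regions, 5 families, 20 MB cells, no MOB.
        for (String warning : check(600 * GB, 4, 5, 20 * MB, false)) {
            System.out.println("WARNING: " + warning);
        }
    }
}
```

Because these are rules of thumb and not hard limits, the sketch returns warnings and does not throw exceptions. A plan that triggers a warning may still be reasonable, for example the time-series case described above, where most regions are inactive.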