Data import and export

During HBase development you often need to move data around. The easiest way to import and export data is from the command line, using the utilities described below.

Export a table to local filesystem or HDFS

bin/hbase org.apache.hadoop.hbase.mapreduce.Driver export [tbl_name] [/local/export/path | hdfs://node/path]

Importing is just as easy, but the target table must already exist (see the sketch below).

bin/hbase org.apache.hadoop.hbase.mapreduce.Driver import [tbl_name] [/local/export/path]
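The import will not create the table for you. A minimal sketch for preparing the table in the HBase shell before importing, assuming a table named my_table with a single column family cf (both names are placeholders; use the column families present in the exported data):

$ bin/hbase shell
hbase> create 'my_table', 'cf'   # column families must match the exported data
hbase> exit
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Driver import my_table /local/export/path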

Export

Export is a utility that dumps the contents of a table to HDFS as a sequence file. Export can be run via a Coprocessor Endpoint or via MapReduce. Invoke via:

mapreduce-based Export

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

endpoint-based Export

Make sure the Export coprocessor is enabled by adding org.apache.hadoop.hbase.coprocessor.Export to hbase.coprocessor.region.classes.
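As a sketch, the coprocessor can be registered cluster-wide in hbase-site.xml on the RegionServers, using the property name and class mentioned above (if the property already lists other coprocessors, append the class to the comma-separated list; a RegionServer restart is needed for the setting to take effect):

<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.Export</value>
</property>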

$ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

The outputdir is an HDFS directory that must not exist prior to the export. When done, the exported files will be owned by the user invoking the export command.

Comparison of Endpoint-based Export and MapReduce-based Export

  • HBase version requirement: endpoint-based needs 2.0+; MapReduce-based needs 0.2.1+.
  • Maven dependency: endpoint-based uses hbase-endpoint; MapReduce-based uses hbase-mapreduce (2.0+) or hbase-server (prior to 2.0).
  • Requirement before dump: endpoint-based requires the endpoint.Export coprocessor to be mounted on the target table; MapReduce-based requires a deployed MapReduce framework.
  • Read latency: endpoint-based is low (reads data directly from the region); MapReduce-based is normal (a traditional RPC scan).
  • Read scalability: endpoint-based scales with the number of regions; MapReduce-based scales with the number of mappers.
  • Timeout: endpoint-based uses the operation timeout, configured by hbase.client.operation.timeout; MapReduce-based uses the scan timeout, configured by hbase.client.scanner.timeout.period.
  • Permission requirement: endpoint-based needs READ and EXECUTE; MapReduce-based needs READ.
  • Fault tolerance: endpoint-based has none; MapReduce-based relies on MapReduce.

By default, the Export tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace <versions> with the desired number of versions. Caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
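For example, a sketch that exports up to 3 versions of each cell written within a given time window, with scanner caching raised for the job (the table name, output directory, and epoch-millisecond timestamps are placeholders):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
    -Dhbase.client.scanner.caching=100 \
    my_table hdfs:///export/my_table 3 1609459200000 1612137600000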

Import

Import is a utility that will load data that has been exported back into HBase. Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

To see usage instructions, run the command with no options. To import files exported from a 0.94 cluster into a 0.96 or later cluster, set the system property "hbase.import.version" when running the import command, as below:

$ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

ImportTsv

ImportTsv is a utility that loads data in TSV format into HBase. It has two distinct usages: loading data from TSV files in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload utility.

To load data via Puts (i.e., non-bulk loading):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>

To generate StoreFiles for bulk-loading:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>

These generated StoreFiles can be loaded into HBase via completebulkload.
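As a concrete sketch of the column specification, the row key column is named HBASE_ROW_KEY and the remaining columns are given as family:qualifier pairs; the table name, the cf family, the qualifiers, and the input path below are placeholders, and -Dimporttsv.separator is only needed when the input is not tab-separated:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    '-Dimporttsv.separator=,' \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1,cf:col2 \
    my_table hdfs:///data/my_table.csv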

CompleteBulkLoad

The completebulkload utility will move generated StoreFiles into an HBase table. This utility is often used in conjunction with output from importtsv.

There are two ways to invoke this utility: with an explicit classname, or via the driver.

Explicit Classname

$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>

Driver

HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
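Putting the two steps together, a sketch of the full bulk-load pipeline, with the importtsv output feeding completebulkload (the table name, column spec, and paths are placeholders):

# 1. Generate StoreFiles from the TSV input
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,cf:col1 \
    -Dimporttsv.bulk.output=hdfs:///tmp/storefile-output \
    my_table hdfs:///data/my_table.tsv
# 2. Move the generated StoreFiles into the table
$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles hdfs:///tmp/storefile-output my_table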

WALPlayer

WALPlayer is a utility to replay WAL files into HBase.

The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables. WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.

Invoke via:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]

For example:

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2

WALPlayer, by default, runs as a MapReduce job. To run it entirely in a single local process instead of on your cluster, add the flag -Dmapreduce.jobtracker.address=local on the command line.
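For example, the earlier replay, forced to run in a single local process (the input directory and table names are the same placeholders as above):

$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
    -Dmapreduce.jobtracker.address=local \
    /backuplogdir oldTable1,oldTable2 newTable1,newTable2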

RowCounter

RowCounter is a MapReduce job that counts all the rows of a table. It is a good sanity check to ensure that HBase can read all the blocks of a table when there are concerns about metadata inconsistency. It runs the MapReduce job in a single local process, but it will run faster if you have a MapReduce cluster in place for it to exploit. You can limit the time range of data to be scanned with the --starttime=[starttime] and --endtime=[endtime] flags, and limit the scanned data by key with the --range=[startKey],[endKey][;[startKey],[endKey]...] option.

$ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]...]] [<column1> <column2>...]

RowCounter only counts one version per cell.

For better performance, consider using the -Dhbase.client.scanner.caching=100 and -Dmapreduce.map.speculative=false options.
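A sketch combining those flags with a time-range scan (the table name and the epoch-millisecond timestamps are placeholders):

$ bin/hbase rowcounter \
    -Dhbase.client.scanner.caching=100 \
    -Dmapreduce.map.speculative=false \
    my_table --starttime=1609459200000 --endtime=1612137600000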

CellCounter

HBase ships another diagnostic MapReduce job called CellCounter. Like RowCounter, it scans a table, but the statistics it gathers are more fine-grained and include:

  • Total number of rows in the table.
  • Total number of CFs across all rows.
  • Total qualifiers across all rows.
  • Total occurrence of each CF.
  • Total occurrence of each qualifier.
  • Total number of versions of each qualifier.

The program allows you to limit the scope of the run. Provide a row regex or prefix to limit the rows analyzed, and specify a time range with the --starttime=<starttime> and --endtime=<endtime> flags.

Use hbase.mapreduce.scan.column.family to specify scanning a single column family.

$ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]

Note: just like RowCounter, caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
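For instance, a sketch that restricts the count to a single column family and a row prefix, with ";" as the report separator (the table name, output directory, cf family, and row prefix are placeholders):

$ bin/hbase cellcounter \
    -Dhbase.mapreduce.scan.column.family=cf \
    my_table hdfs:///tmp/cellcounter-output ";" row_prefix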
