During HBase development you often need to move data around. The easiest way to import and export data is from the command line.
Export a table to local filesystem or HDFS
bin/hbase org.apache.hadoop.hbase.mapreduce.Driver export [tbl_name] [/local/export/path | hdfs:/node/path]
Importing is just as easy, but the target table MUST EXIST beforehand.
bin/hbase org.apache.hadoop.hbase.mapreduce.Driver import [tbl_name] [/local/export/path]
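For illustration, a full round trip with a placeholder table name and export path (the import step assumes the table still exists on the target cluster):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Driver export mytable /tmp/mytable-export
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Driver import mytable /tmp/mytable-export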
Export
Export is a utility that will dump the contents of a table to HDFS in a sequence file. Export can be run either via a Coprocessor Endpoint or via MapReduce. Invoke via:
mapreduce-based Export
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
endpoint-based Export
Make sure the Export coprocessor is enabled by adding org.apache.hadoop.hbase.coprocessor.Export to hbase.coprocessor.region.classes.
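For example, a minimal sketch of the hbase-site.xml entry (if other region coprocessors are already configured, append the class to the existing comma-separated list instead, and restart the RegionServers afterwards):
<property>
  <name>hbase.coprocessor.region.classes</name>
  <value>org.apache.hadoop.hbase.coprocessor.Export</value>
</property>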
$ bin/hbase org.apache.hadoop.hbase.coprocessor.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
The outputdir is an HDFS directory that must not exist prior to the export. When done, the exported files will be owned by the user who invoked the export command.
Comparison of Endpoint-based Export and MapReduce-based Export
| | Endpoint-based Export | MapReduce-based Export |
| --- | --- | --- |
| HBase version requirement | 2.0+ | 0.2.1+ |
| Maven dependency | hbase-endpoint | hbase-mapreduce (2.0+), hbase-server (prior to 2.0) |
| Requirement before dump | mount the endpoint.Export on the target table | deploy the MapReduce framework |
| Read latency | low, directly read the data from region | normal, traditional RPC scan |
| Read scalability | depends on number of regions | depends on number of mappers |
| Timeout | operation timeout, configured by hbase.client.operation.timeout | scan timeout, configured by hbase.client.scanner.timeout.period |
| Permission requirement | READ, EXECUTE | READ |
| Fault tolerance | no | depends on MapReduce |
By default, the Export tool only exports the newest version of a given cell, regardless of the number of versions stored. To export more than one version, replace <versions> with the desired number of versions. Caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
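For instance, the following sketch (table name, output path, and epoch-millisecond timestamps are placeholders) exports up to 3 versions of each cell written within the given time window, with a larger scanner cache:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.client.scanner.caching=100 mytable /export/mytable 3 1609459200000 1640995200000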
Import
Import is a utility that will load data that has been exported back into HBase. Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
To see usage instructions, run the command with no options. To import files exported from a 0.94 cluster into a 0.96 or later cluster, you need to set the system property "hbase.import.version" when running the import command, as shown below:
$ bin/hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
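Since the table must exist beforehand, a hypothetical sequence (table name, column family, and input path are placeholders) would first create it in the HBase shell and then run the import:
$ echo "create 'mytable', 'd'" | bin/hbase shell
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import mytable /export/mytable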
ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase. It has two distinct usages: loading data from TSV files in HDFS into HBase via Puts, and preparing StoreFiles to be loaded via the completebulkload utility.
To load data via Puts (i.e., non-bulk loading):
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>
To generate StoreFiles for bulk-loading:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>
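As a hypothetical example of the bulk-loading path, the command below writes StoreFiles for a table datatsv (column family d, first field used as the row key via HBASE_ROW_KEY) instead of issuing Puts; all table, column, and path names here are illustrative:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefile-outputdir datatsv /user/hbase/input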
These generated StoreFiles can be loaded into HBase via completebulkload.
CompleteBulkLoad
The completebulkload utility will move generated StoreFiles into an HBase table. This utility is often used in conjunction with output from importtsv.
There are two ways to invoke this utility, with explicit classname and via the driver:
Explicit Classname
$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
Driver
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar completebulkload <hdfs://storefileoutput> <tablename>
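Continuing the hypothetical ImportTsv example above (the output directory and table name are the same placeholders), the generated StoreFiles could then be moved into the table with, e.g., the explicit classname form:
$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles hdfs://storefile-outputdir datatsv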
WALPlayer
WALPlayer is a utility to replay WAL files into HBase.
The WAL can be replayed for a set of tables or all tables, and a time range can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables. WALPlayer can also generate HFiles for later bulk importing; in that case, only a single table can be specified and no mapping is allowed.
Invoke via:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] <wal inputdir> <tables> [<tableMappings>]
For example:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
WALPlayer, by default, runs as a mapreduce job. To NOT run WALPlayer as a mapreduce job on your cluster, force it to run entirely in the local process by adding the flag -Dmapreduce.jobtracker.address=local on the command line.
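For example, to replay the same WALs as above entirely within the local process, using the flag just mentioned:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer -Dmapreduce.jobtracker.address=local /backuplogdir oldTable1,oldTable2 newTable1,newTable2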
RowCounter
RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit. It is possible to limit the time range of data to be scanned by using the --starttime=[starttime] and --endtime=[endtime] flags. The scanned data can be limited based on keys using the --range=[startKey],[endKey][;[startKey],[endKey]…] option.
$ bin/hbase rowcounter [options] <tablename> [--starttime=<start> --endtime=<end>] [--range=[startKey],[endKey][;[startKey],[endKey]…]] [<column1> <column2>…]
RowCounter only counts one version per cell.
For performance, consider using the -Dhbase.client.scanner.caching=100 and -Dmapreduce.map.speculative=false options.
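Putting these options together, a sketch with a placeholder table name and time range (epoch milliseconds) might look like:
$ bin/hbase rowcounter -Dhbase.client.scanner.caching=100 -Dmapreduce.map.speculative=false mytable --starttime=1609459200000 --endtime=1640995200000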
CellCounter
HBase ships another diagnostic mapreduce job called CellCounter. Like RowCounter, it scans the table, but the statistics it gathers are more fine-grained and include:
- Total number of rows in the table.
- Total number of CFs across all rows.
- Total qualifiers across all rows.
- Total occurrence of each CF.
- Total occurrence of each qualifier.
- Total number of versions of each qualifier.
The program allows you to limit the scope of the run. Provide a row regex or prefix to limit the rows to analyze. Specify a time range to scan the table by using the --starttime=<starttime> and --endtime=<endtime> flags.
Use hbase.mapreduce.scan.column.family to specify scanning a single column family.
$ bin/hbase cellcounter <tablename> <outputDir> [reportSeparator] [regex or prefix] [--starttime=<starttime> --endtime=<endtime>]
Note: just like RowCounter, caching for the input Scan is configured via hbase.client.scanner.caching in the job configuration.
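For example, to restrict the count to a single column family (the table name, output directory, and family d are placeholders; the property is passed as a -D option on the assumption that the generic options are forwarded to the job):
$ bin/hbase cellcounter -Dhbase.mapreduce.scan.column.family=d mytable /tmp/cellcounter-output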