Prerequisites
Java
HBase Version | JDK 6 | JDK 7 | JDK 8 |
1.2 | Not Supported | Yes | Yes |
1.1 | Not Supported | Yes | Running with JDK 8 will work but is not well tested. |
1.0 | Not Supported | Yes | Running with JDK 8 will work but is not well tested. |
0.98 | Yes | Yes | Running with JDK 8 works but is not well tested. Building with JDK 8 would require removal of the deprecated remove() method of the PoolMap class and is under consideration. |
0.94 | Yes | Yes | N/A |
In HBase 0.98.5 and newer, you must set JAVA_HOME on each node of your cluster. hbase-env.sh provides a handy mechanism to do this.
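For example, a single line in conf/hbase-env.sh is enough (the JDK path shown is only an illustration; substitute the path for your own installation):
# In conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk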
Hadoop – Hadoop 2.x is faster and includes features, such as short-circuit reads, which will help improve your HBase random read profile. Hadoop 2.x also includes important bug fixes that will improve your overall HBase experience. HBase 0.98 drops support for Hadoop 1.0, deprecates use of Hadoop 1.1+, and HBase 1.0 will not support Hadoop 1.x. The following table summarizes the versions of Hadoop supported with each version of HBase.
Hadoop Version | HBase-0.94.x | HBase-0.98.x (Support for Hadoop 1.1+ is deprecated.) | HBase-1.0.x (Hadoop 1.x is NOT supported) | HBase-1.1.x | HBase-1.2.x |
Hadoop-1.0.x | X | X | X | X | X |
Hadoop-1.1.x | S | NT | X | X | X |
Hadoop-0.23.x | S | X | X | X | X |
Hadoop-2.0.x-alpha | NT | X | X | X | X |
Hadoop-2.1.0-beta | NT | X | X | X | X |
Hadoop-2.2.0 | NT | S | NT | NT | NT |
Hadoop-2.3.x | NT | S | NT | NT | NT |
Hadoop-2.4.x | NT | S | S | S | S |
Hadoop-2.5.x | NT | S | S | S | S |
Hadoop-2.6.0 | X | X | X | X | X |
Hadoop-2.6.1+ | NT | NT | NT | NT | S |
Hadoop-2.7.0 | X | X | X | X | X |
Hadoop-2.7.1+ | NT | NT | NT | NT | S |
Hadoop version support matrix details –
- “S” = supported
- “X” = not supported
- “NT” = not tested
ssh – HBase uses the Secure Shell (ssh) command and utilities extensively to communicate between cluster nodes. Each server in the cluster must be running ssh so that the Hadoop and HBase daemons can be managed. You must be able to connect to all nodes via SSH, including the local node, from the Master as well as any backup Master, using a shared key rather than a password.
DNS – HBase uses the local hostname to self-report its IP address. Both forward and reverse DNS resolution must work in HBase versions prior to 0.92.0. The hadoop-dns-checker tool can be used to verify that DNS is working correctly on the cluster.
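As a quick manual spot check (the hostname and IP below are placeholders for your own values), you can verify that forward and reverse lookups agree:
$ getent hosts node-a.example.com    # forward lookup: hostname to IP
$ getent hosts 10.0.0.11             # reverse lookup: IP back to hostname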
Loopback IP – Prior to hbase-0.96.0, HBase only used the IP address 127.0.0.1 to refer to localhost, and this could not be configured.
NTP – The clocks on cluster nodes should be synchronized. A small amount of variation is acceptable, but larger amounts of skew can cause erratic and unexpected behavior. Time synchronization is one of the first things to check if you see unexplained problems in your cluster. It is recommended that you run a Network Time Protocol (NTP) service, or another time-synchronization mechanism, on your cluster, and that all nodes look to the same service for time synchronization.
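One way to spot-check synchronization is to compare the clocks and query the time service status on each node (commands vary by distribution; these assume a systemd-based system running ntpd or chronyd):
$ date                   # run on every node and compare the output
$ timedatectl status     # reports whether the system clock is synchronized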
Limits on Number of Files and Processes (ulimit) – Apache HBase is a database. It requires the ability to open a large number of files at once. Many Linux distributions limit the number of files a single user is allowed to open to 1024 (or 256 on older versions of OS X). You can check this limit on your servers by running the command ulimit -n when logged in as the user which runs HBase.
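To check the current limit and raise it persistently for the user that runs HBase (shown here as a hypothetical user named hadoop), a common approach on Linux is an entry in /etc/security/limits.conf:
$ ulimit -n
1024

# /etc/security/limits.conf (example values; adjust to your workload)
hadoop  -  nofile  32768
hadoop  -  nproc   32000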
Linux Shell – All of the shell scripts that come with HBase rely on the GNU Bash shell.
Windows – Prior to HBase 0.96, testing for running HBase on Microsoft Windows was limited. Running on Windows nodes is not recommended for production systems.
ZooKeeper Requirements – ZooKeeper 3.4.x is required as of HBase 1.0.0. HBase makes use of the multi functionality that is only available in ZooKeeper 3.4.0 and later (the useMulti configuration option defaults to true in HBase 1.0.0).
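On versions older than 1.0.0 you can enable the multi feature explicitly in hbase-site.xml; the snippet below is illustrative, so verify the option against the documentation for your release:
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>true</value>
</property>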
Standalone Mode Installation
The steps for a standalone local install are as follows:
- Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This takes you to a mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local filesystem. Prior to 1.x versions, be sure to choose the release that corresponds with the version of Hadoop you are likely to use later (in most cases, you should choose the file for Hadoop 2, which will be called something like hbase-0.98.13-hadoop2-bin.tar.gz).
- Extract the downloaded file, and change to the newly-created directory.
$ tar xzvf hbase-<?eval ${project.version}?>-bin.tar.gz
$ cd hbase-<?eval ${project.version}?>/
- For HBase 0.98.5 and later, you are required to set the JAVA_HOME environment variable before starting HBase. Prior to 0.98.5, HBase attempted to detect the location of Java if the variable was not set. You can set the variable via your operating system’s usual mechanism, but HBase provides a central mechanism, conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system. The JAVA_HOME variable should be set to a directory which contains the executable file bin/java. Most modern Linux operating systems provide a mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently switching between versions of executables such as Java. In this case, you can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.
JAVA_HOME=/usr
These instructions assume that each node of your cluster uses the same configuration. If this is not the case, you may need to set JAVA_HOME separately for each node.
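If you are unsure where Java lives, one quick check (the output shown is only an example and will vary by system) is to resolve the symbolic link behind the java on your PATH:
$ readlink -f "$(which java)"
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Set JAVA_HOME to a parent directory of the resolved path that contains bin/java.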
- Edit conf/hbase-site.xml, which is the main HBase configuration file. At this time, you only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. By default, a new directory is created under /tmp. Many servers are configured to delete the contents of /tmp upon reboot, so you should store the data elsewhere. The following configuration will store HBase’s data in the hbase directory, in the home directory of the user called testuser. Paste the <property> tags beneath the <configuration> tags, which should be empty in a new HBase install. An example hbase-site.xml for standalone HBase follows.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>
You do not need to create the HBase data directory. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.
- The bin/start-hbase.sh script is provided as a convenient way to start HBase. Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully. You can use the jps command to verify that you have one running process called HMaster. In standalone mode HBase runs all daemons within this single JVM, i.e. the HMaster, a single HRegionServer, and the ZooKeeper daemon.
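A typical standalone start and check might look like this (process IDs and log paths are illustrative and will differ on your system):
$ ./bin/start-hbase.sh
starting master, logging to logs/hbase-testuser-master-localhost.out
$ jps
20143 HMaster
20215 Jps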
Java needs to be installed and available. If you get an error indicating that Java is not installed, but it is on your system, perhaps in a non-standard location, edit the conf/hbase-env.sh file and modify the JAVA_HOME setting to point to the directory that contains bin/java on your system.
Starting HBase
- Connect to HBase – Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install. In this example, some usage and version information that is printed when you start HBase Shell has been omitted. The HBase Shell prompt ends with a > character.
$ ./bin/hbase shell
hbase(main):001:0>
- Display HBase Shell Help Text – Type help and press Enter to display some basic usage information for HBase Shell, as well as several example commands. Notice that table names, rows, and columns must all be enclosed in quote characters.
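For example, to display the help text for a single command, quote its name:
hbase(main):002:0> help 'create'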
Pseudo-Distributed Local Install
You can re-configure locally installed HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process. By default, unless you configure the hbase.rootdir property, your data is still stored in /tmp/. In this walk-through, we store your data in HDFS instead, assuming you have HDFS available. You can skip the HDFS configuration to continue storing your data in the local filesystem. This procedure assumes that you have configured Hadoop and HDFS on your local system and/or a remote system, and that they are running and available. It also assumes you are using Hadoop 2.
- Stop HBase if it is running – If you have just finished quickstart and HBase is still running, stop it. This procedure will create a totally new directory where HBase will store its data, so any databases you created before will be lost.
- Configure HBase – Edit the hbase-site.xml configuration. First, add the following property, which directs HBase to run in distributed mode, with one JVM instance per daemon.
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
Next, change the hbase.rootdir from the local filesystem to the address of your HDFS instance, using the hdfs:// URI syntax. In this example, HDFS is running on the localhost at port 8020.
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:8020/hbase</value>
</property>
You do not need to create the directory in HDFS. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.
- Start HBase – Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running.
- Check the HBase directory in HDFS – If everything worked correctly, HBase created its directory in HDFS. In the configuration above, it is stored in /hbase/ on HDFS. You can use the hadoop fs command in Hadoop’s bin/ directory to list this directory.
$ ./bin/hadoop fs -ls /hbase
Found 7 items
drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/.tmp
drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/WALs
drwxr-xr-x   - hbase users          0 2014-06-25 18:48 /hbase/corrupt
drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/data
-rw-r--r--   3 hbase users         42 2014-06-25 18:41 /hbase/hbase.id
-rw-r--r--   3 hbase users          7 2014-06-25 18:41 /hbase/hbase.version
drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/oldWALs
- Create a table and populate it with data – You can use the HBase Shell to create a table, populate it with data, and scan and get values from it.
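A minimal session might look like the following; the table name test and column family cf are arbitrary examples:
hbase(main):001:0> create 'test', 'cf'
hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'
hbase(main):003:0> scan 'test'
hbase(main):004:0> get 'test', 'row1'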
Fully Distributed Mode Install
In a distributed configuration, the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes. This procedure adds two more nodes to your cluster. The architecture will be as follows:
Node Name | Master | ZooKeeper | RegionServer |
node-a.example.com | yes | yes | no |
node-b.example.com | backup | yes | yes |
node-c.example.com | no | yes | yes |
The procedure assumes that each node is a virtual machine and that they are all on the same network. It builds upon the previous Pseudo-Distributed Local Install, assuming that the system you configured in that procedure is now node-a. Stop HBase on node-a before continuing.
Be sure that all the nodes have full access to communicate, and that no firewall rules are in place which could prevent them from talking to each other. If you see any errors like no route to host, check your firewall.
- Configure Passwordless SSH Access – node-a needs to be able to log into node-b and node-c (and to itself) in order to start the daemons. The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from node-a to each of the others.
On node-a, generate a key pair. While logged in as the user who will run HBase, generate an SSH key pair using the following command:
$ ssh-keygen -t rsa
If the command succeeds, the location of the key pair is printed to standard output. The default name of the public key is id_rsa.pub.
Create the directory that will hold the shared keys on the other nodes – On node-b and node-c, log in as the HBase user and create a .ssh/ directory in the user’s home directory, if it does not already exist. If it already exists, be aware that it may already contain other keys.
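For example, on node-b and node-c (the permissions shown are the ones SSH normally expects):
$ mkdir -p ~/.ssh
$ chmod 700 ~/.ssh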
Copy the public key to the other nodes – Securely copy the public key from node-a to each of the nodes, using scp or some other secure means. On each of the other nodes, create a new file called .ssh/authorized_keys if it does not already exist, and append the contents of the id_rsa.pub file to the end of it. Note that you also need to do this for node-a itself.
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
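One way to do the copy and the append from node-a, assuming the HBase user is hbuser (substitute your own username; you will still be prompted for a password, since key-based login is not yet set up):
$ scp ~/.ssh/id_rsa.pub hbuser@node-b.example.com:/tmp/node-a.pub
$ ssh hbuser@node-b.example.com 'cat /tmp/node-a.pub >> ~/.ssh/authorized_keys'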
Test password-less login – If you performed the procedure correctly, you should not be prompted for a password when you SSH from node-a to either of the other nodes using the same username.
Since node-b will run a backup Master, repeat the procedure above, substituting node-b everywhere you see node-a. Be sure not to overwrite your existing .ssh/authorized_keys files, but concatenate the new key onto the existing file using the >> operator rather than the > operator.
- Prepare node-a – node-a will run your primary master and ZooKeeper processes, but no RegionServers. Stop the RegionServer from starting on node-a.
Edit conf/regionservers and remove the line which contains localhost. Add lines with the hostnames or IP addresses for node-b and node-c. Even if you did want to run a RegionServer on node-a, you should refer to it by the hostname the other servers would use to communicate with it. In this case, that would be node-a.example.com. This enables you to distribute the configuration to each node of your cluster without any hostname conflicts. Save the file.
Configure HBase to use node-b as a backup master. Create a new file in conf/ called backup-masters, and add a new line to it with the hostname for node-b. In this demonstration, the hostname is node-b.example.com.
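After these two edits, the contents of the two files would look like this:
conf/regionservers:
node-b.example.com
node-c.example.com

conf/backup-masters:
node-b.example.com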
Configure ZooKeeper. In reality, you should carefully consider your ZooKeeper configuration. You can find out more about configuring ZooKeeper in the ZooKeeper section of the HBase documentation. This configuration will direct HBase to start and manage a ZooKeeper instance on each node of the cluster. On node-a, edit conf/hbase-site.xml and add the following properties.
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeper</value>
</property>
Everywhere in your configuration that you have referred to node-a as localhost, change the reference to point to the hostname that the other nodes will use to refer to node-a. In these examples, the hostname is node-a.example.com.
- Prepare node-b and node-c – node-b will run a backup master server and a ZooKeeper instance.
Download and unpack HBase – Download and unpack HBase to node-b, just as you did for the standalone and pseudo-distributed quickstarts.
Copy the configuration files from node-a to node-b and node-c – Each node of your cluster needs to have the same configuration information. Copy the contents of the conf/ directory to the conf/ directory on node-b and node-c.
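For example, from node-a, assuming HBase is unpacked at the same path on every node and the HBase user is hbuser (both are illustrative; use your own path and username):
$ scp -r conf/ hbuser@node-b.example.com:/home/hbuser/hbase-0.98.3-hadoop2/
$ scp -r conf/ hbuser@node-c.example.com:/home/hbuser/hbase-0.98.3-hadoop2/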
- Start the cluster – Be sure HBase is not running on any node. If you forgot to stop HBase from previous testing, you will have errors. Check to see whether HBase is running on any of your nodes by using the jps command. Look for the processes HMaster, HRegionServer, and HQuorumPeer. If they exist, kill them. On node-a, issue the start-hbase.sh command. Your output will be similar to that below.
$ bin/start-hbase.sh
node-c.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.example.com.out
node-a.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.example.com.out
node-b.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.example.com.out
starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.example.com.out
node-c.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.example.com.out
node-b.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.example.com.out
node-b.example.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-b.example.com.out
ZooKeeper starts first, followed by the master, then the RegionServers, and finally the backup masters.
- Verify that the processes are running – On each node of the cluster, run the jps command and verify that the correct processes are running on each server. You may see additional Java processes running on your servers as well, if they are used for other purposes.
node-a jps Output
$ jps
20355 Jps
20071 HQuorumPeer
20137 HMaster
node-b jps Output
$ jps
15930 HRegionServer
16194 Jps
15838 HQuorumPeer
16010 HMaster
node-c jps Output
$ jps
13901 Jps
13639 HQuorumPeer
13737 HRegionServer
The HQuorumPeer process is a ZooKeeper instance which is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper is run outside of HBase, the process is called QuorumPeer.
Browse to the Web UI – In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer. If everything is set up correctly, you should be able to use a web browser to connect to the UI for the Master at http://node-a.example.com:16010/ or for the secondary master at http://node-b.example.com:16010/. If you can connect via localhost but not from another host, check your firewall rules. You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.
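A quick way to confirm the Master UI is reachable from another host without a browser, assuming curl is installed (the hostname is the example used above), is to request the page and check that an HTTP response comes back:
$ curl -I http://node-a.example.com:16010/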