It means protecting data, such as a database, from destructive forces and from the unwanted actions of unauthorized users.
By default, Hadoop runs in non-secure mode, in which no actual authentication is required. By configuring Hadoop to run in secure mode, each user and service must be authenticated by Kerberos in order to use Hadoop services.
Security features of Hadoop consist of authentication, service level authorization, authentication for web consoles, and data confidentiality.
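Secure mode itself is switched on in core-site.xml. The following is a minimal sketch using the standard property names; a real deployment usually needs further settings (principal-to-user mapping rules, SSL, and so on):

<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- "simple" is the non-secure default -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- turn on service level authorization -->
</property>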
Authentication is the process in which the credentials provided are compared to those on file in a database of authorized users' information, on a local operating system or within an authentication server. If the credentials match, the process is completed and the user is granted authorization for access.
When service level authentication is turned on, end users using Hadoop in secure mode need to be authenticated by Kerberos. The simplest way to authenticate is with the kinit command of Kerberos.
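For example, an end user would typically obtain a ticket-granting ticket before running any Hadoop command. The principal alice and the realm below are hypothetical placeholders:

$ kinit [email protected]     # prompts for the user's Kerberos password and caches a TGT
$ klist                          # verify that the ticket cache now holds a valid ticket
$ hdfs dfs -ls /                 # subsequent Hadoop commands authenticate with the cached ticket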
User Accounts for Hadoop Daemons
Ensure that the HDFS and YARN daemons run as different Unix users, e.g. hdfs and yarn. Also, ensure that the MapReduce JobHistory server runs as a different user, such as mapred.
It is recommended to have them share a Unix group, e.g. hadoop (see the example commands after the table).
User:Group    | Daemons
hdfs:hadoop   | NameNode, Secondary NameNode, JournalNode, DataNode
yarn:hadoop   | ResourceManager, NodeManager
mapred:hadoop | MapReduce JobHistory Server
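One possible way to create these accounts on a Linux host is sketched below; the commands are plain groupadd/useradd invocations, and the exact flags, shells and home directories depend on your distribution and deployment tooling:

$ sudo groupadd hadoop
$ sudo useradd -r -g hadoop hdfs      # -r creates a system account
$ sudo useradd -r -g hadoop yarn
$ sudo useradd -r -g hadoop mapred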
Kerberos principals for Hadoop Daemons and Users
To run the Hadoop service daemons in secure mode, Kerberos principals are required. Each service reads its authentication information from a keytab file with appropriate permissions. The HTTP web consoles should be served by a principal different from the RPC principal.
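For illustration, with an MIT Kerberos KDC the NameNode principals and keytab could be created roughly as follows. The kadmin.local commands are standard, but the host name, realm and keytab path are placeholders to adapt to your environment:

$ kadmin.local -q "addprinc -randkey nn/[email protected]"
$ kadmin.local -q "addprinc -randkey host/[email protected]"
$ kadmin.local -q "ktadd -k /etc/security/keytab/nn.service.keytab nn/[email protected] host/[email protected]"
$ chown hdfs:hadoop /etc/security/keytab/nn.service.keytab
$ chmod 400 /etc/security/keytab/nn.service.keytab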
HDFS – The NameNode keytab file, on the NameNode host, should look like the following:
$ klist -e -k -t /etc/security/keytab/nn.service.keytab
Keytab name: FILE:/etc/security/keytab/nn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The Secondary NameNode keytab file, on that host, should look like the following:
$ klist -e -k -t /etc/security/keytab/sn.service.keytab
Keytab name: FILE:/etc/security/keytab/sn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 sn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The DataNode keytab file, on each host, should look like the following:
$ klist -e -k -t /etc/security/keytab/dn.service.keytab
Keytab name: FILE:/etc/security/keytab/dn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 dn/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
YARN – The ResourceManager keytab file, on the ResourceManager host, should look like the following:
$ klist -e -k -t /etc/security/keytab/rm.service.keytab
Keytab name: FILE:/etc/security/keytab/rm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 rm/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
The NodeManager keytab file, on each host, should look like the following:
$ klist -e -k -t /etc/security/keytab/nm.service.keytab
Keytab name: FILE:/etc/security/keytab/nm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nm/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
MapReduce JobHistory Server – The MapReduce JobHistory Server keytab file, on that host, should look like the following:
$ klist -e -k -t /etc/security/keytab/jhs.service.keytab
Keytab name: FILE:/etc/security/keytab/jhs.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 jhs/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/[email protected] (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/[email protected] (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/[email protected] (ArcFour with HMAC/md5)
Mapping from Kerberos principal to OS user account
Hadoop maps a Kerberos principal to an OS user account using the rules specified by hadoop.security.auth_to_local, which work in the same way as auth_to_local in the Kerberos configuration file (krb5.conf). In addition, Hadoop's auth_to_local mapping supports the /L flag, which lowercases the returned name.
By default, it picks the first component of the principal name as the user name if the realm matches the default_realm (usually defined in /etc/krb5.conf). For example, host/[email protected] is mapped to host by the default rule.
Custom rules can be tested using the hadoop kerbname command. This command lets you specify a principal and apply Hadoop's current auth_to_local ruleset to it; the output is the identity that Hadoop will use.
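As an illustrative ruleset (the realm EXAMPLE.COM and the target accounts are examples; the RULE syntax is the standard auth_to_local format), the daemon principals could be mapped to the hdfs, yarn and mapred accounts in core-site.xml:

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1/$2@$0]([ndj]n/.*@EXAMPLE\.COM)s/.*/hdfs/
    RULE:[2:$1/$2@$0]([rn]m/.*@EXAMPLE\.COM)s/.*/yarn/
    RULE:[2:$1/$2@$0](jhs/.*@EXAMPLE\.COM)s/.*/mapred/
    DEFAULT
  </value>
</property>

With this ruleset, hadoop kerbname nn/[email protected] should print hdfs.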
Mapping from user to group
Though files on HDFS are associated with an owner and a group, Hadoop does not have its own definition of groups. Mapping from user to group is done by the OS or by LDAP.
You can change the mapping mechanism by specifying the name of a mapping provider as the value of hadoop.security.group.mapping. In practice, you need to manage an SSO environment using Kerberos with LDAP for Hadoop in secure mode.
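For example, to resolve groups against an LDAP directory instead of the operating system, the bundled LdapGroupsMapping provider can be selected; the additional LDAP connection properties it needs (server URL, bind user, search bases) are omitted from this sketch:

<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>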
Some products, such as Apache Oozie, access Hadoop services on behalf of end users and therefore need to be able to impersonate them.
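Impersonation is granted through proxyuser settings in core-site.xml. A sketch for a hypothetical Oozie superuser account named oozie, restricted to one host and one group, might look like this:

<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozie.example.com</value> <!-- hosts from which oozie may impersonate users -->
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value> <!-- groups whose members oozie may impersonate -->
</property>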
Secure DataNode
Because the data transfer protocol of the DataNode does not use Hadoop's RPC framework, the DataNode must authenticate itself by binding to privileged ports, which are specified by dfs.datanode.address and dfs.datanode.http.address. This authentication is based on the assumption that an attacker will not be able to obtain root privileges.
When you execute the hdfs datanode command as root, the server process binds the privileged ports first, then drops privileges and runs as the user account specified by HADOOP_SECURE_DN_USER. This startup process uses jsvc installed at JSVC_HOME. You must specify HADOOP_SECURE_DN_USER and JSVC_HOME as environment variables at startup (in hadoop-env.sh).
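A minimal hadoop-env.sh sketch for this jsvc-based startup, assuming the hdfs account from the table above and a hypothetical jsvc installation path, would be:

export HADOOP_SECURE_DN_USER=hdfs
export JSVC_HOME=/usr/lib/jsvc    # hypothetical path; point this at your jsvc installation

The DataNode is then started as root with the hdfs datanode command, as described above.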
As of version 2.6.0, SASL can be used to authenticate the data transfer protocol. In this configuration, it is no longer required for secured clusters to start the DataNode as root using jsvc and bind to privileged ports. To enable SASL on data transfer protocol, set dfs.data.transfer.protection in hdfs-site.xml, set a non-privileged port for dfs.datanode.address, set dfs.http.policy to HTTPS_ONLY and make sure the HADOOP_SECURE_DN_USER environment variable is not defined. Note that it is not possible to use SASL on data transfer protocol if dfs.datanode.address is set to a privileged port. This is required for backwards-compatibility reasons.
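Concretely, an hdfs-site.xml fragment enabling SASL for the data transfer protocol might look as follows. The value authentication is one of the allowed protection levels (alongside integrity and privacy), and the port number is an arbitrary non-privileged example:

<property>
  <name>dfs.data.transfer.protection</name>
  <value>authentication</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:10019</value> <!-- non-privileged port -->
</property>
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>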
In order to migrate an existing cluster that used root authentication to start using SASL instead, first ensure that version 2.6.0 or later has been deployed to all cluster nodes as well as any external applications that need to connect to the cluster. Only versions 2.6.0 and later of the HDFS client can connect to a DataNode that uses SASL for authentication of data transfer protocol, so it is vital that all callers have the correct version before migrating. After version 2.6.0 or later has been deployed everywhere, update configuration of any external applications to enable SASL. If an HDFS client is enabled for SASL, then it can connect successfully to a DataNode running with either root authentication or SASL authentication. Changing configuration for all clients guarantees that subsequent configuration changes on DataNodes will not disrupt the applications. Finally, each individual DataNode can be migrated by changing its configuration and restarting. It is acceptable to have a mix of some DataNodes running with root authentication and some DataNodes running with SASL authentication temporarily during this migration period, because an HDFS client enabled for SASL can connect to both.