Wednesday, March 27, 2013

HADOOP Security Configuration




Authentication is the mechanism that assures Hadoop that the user seeking to perform an operation on the cluster is who she claims to be and can therefore be trusted.

HDFS file permissions provide only a mechanism for authorization, which controls what a particular user can do to a particular file.

However, authorization is not enough by itself, because the system is still open to abuse via spoofing by a malicious user who can gain network access to the cluster.


In 2009, Yahoo! put a team of engineers to work on implementing secure authentication for Hadoop. In their design, Hadoop itself does not manage user credentials; instead, it relies on Kerberos, a mature open-source network authentication protocol, to authenticate the user. In turn, Kerberos doesn’t manage permissions. Kerberos says that a user is who he says he is; it’s Hadoop’s job to determine whether that user has permission to perform a given action.


There are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server:

1. Authentication: The client authenticates itself to the Authentication Server and receives a timestamped Ticket-Granting Ticket (TGT).

2. Authorization: The client uses the TGT to request a service ticket from the Ticket-Granting Server.

3. Service request: The client uses the service ticket to authenticate itself to the server that is providing the service the client is using. In the case of Hadoop, this might be the namenode or the jobtracker.
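
The three steps above can be sketched as a toy model in Python. This is not how Kerberos is implemented: real Kerberos never sends the password over the network and encrypts tickets rather than signing them. The keys, the user table, and every function name below are made up for illustration; the only point is to show which party issues which ticket and which party checks it.

```python
import hashlib
import hmac
import json

# Hypothetical long-term keys and users; a real KDC keeps these in its database.
TGS_KEY = b"tgs-secret"
SERVICE_KEYS = {"namenode": b"namenode-secret"}
USER_PASSWORDS = {"hadoop-user": "password"}

TGT_LIFETIME_SECONDS = 10 * 60 * 60  # assumed 10-hour ticket lifetime


def _seal(key, payload):
    """Serialize a ticket and attach an HMAC that only a holder of `key` can check."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(key, body, hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def _open(key, ticket, now):
    """Return the ticket payload if its HMAC is valid and it has not expired."""
    expected = hmac.new(key, ticket["body"], hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["tag"]):
        raise PermissionError("ticket failed verification")
    payload = json.loads(ticket["body"])
    if payload["expires"] <= now:
        raise PermissionError("ticket expired")
    return payload


def authenticate(user, password, now):
    """Step 1: the Authentication Server checks the user and issues a TGT."""
    if USER_PASSWORDS.get(user) != password:
        raise PermissionError("authentication failed")
    return _seal(TGS_KEY, {"user": user, "expires": now + TGT_LIFETIME_SECONDS})


def request_service_ticket(tgt, service, now):
    """Step 2: the Ticket-Granting Server swaps a valid TGT for a service ticket."""
    payload = _open(TGS_KEY, tgt, now)
    return _seal(
        SERVICE_KEYS[service],
        {"user": payload["user"], "service": service, "expires": payload["expires"]},
    )


def service_request(ticket, service, now):
    """Step 3: the service verifies the ticket with its own key and learns the user."""
    return _open(SERVICE_KEYS[service], ticket, now)["user"]
```

Notice that the service in step 3 never talks to the Authentication Server; it trusts the ticket because only the Ticket-Granting Server could have sealed it with the service's key.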





The authorization and service request steps are not user-level actions; the client performs these steps on the user’s behalf. The authentication step, however, is normally carried out explicitly by the user using the kinit command, which will prompt for a password. However, this doesn’t mean you need to enter your password every time you run a job or access HDFS, since TGTs last for 10 hours by default (and can be renewed for up to a week). It’s common to automate authentication at operating system login time, thereby providing single sign-on to Hadoop. 

In cases where you don’t want to be prompted for a password (for running an unattended MapReduce job, for example), you can create a Kerberos keytab file using the ktutil command. A keytab is a file that stores the keys derived from passwords and may be supplied to kinit with the -t option (with MIT Kerberos, together with the -k option).
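
For an unattended job, the keytab login could be wrapped in a small Python helper like the sketch below. It assumes the MIT Kerberos kinit, where -t is combined with -k to read the key from a keytab. The principal and keytab path in the usage note are hypothetical, and the function names are mine.

```python
import subprocess


def build_kinit_command(principal, keytab_path):
    """Return the argument list for a password-free kinit using a keytab."""
    # -k tells kinit to take the key from a keytab; -t names the keytab file.
    return ["kinit", "-k", "-t", keytab_path, principal]


def login_from_keytab(principal, keytab_path):
    """Run kinit and report whether it succeeded; this needs a reachable KDC."""
    try:
        result = subprocess.run(
            build_kinit_command(principal, keytab_path),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        return False  # kinit is not installed on this machine
    return result.returncode == 0
```

A job launcher might call login_from_keytab("hadoop-user@LOCALDOMAIN", "/path/to/hadoop-user.keytab") before submitting work and stop early if it returns False.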


With Kerberos authentication turned on, let’s see what happens when we try to copy a local file to HDFS:

% hadoop fs -put quangle.txt .

10/07/03 15:44:58 WARN ipc.Client: Exception encountered while connecting to the server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:8020 failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]




The operation fails because we don’t have a Kerberos ticket. We can get one by authenticating to the KDC, using kinit:

% kinit
Password for hadoop-user@LOCALDOMAIN: password
% hadoop fs -put quangle.txt .
% hadoop fs -stat %n quangle.txt
quangle.txt

And we see that the file is successfully written to HDFS. Notice that even though we carried out two filesystem commands, we only needed to call kinit once, since the Kerberos ticket is valid for 10 hours (use the klist command to see the expiry time of your tickets and kdestroy to invalidate your tickets). After we get a ticket, everything works just as it normally would.
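
A script that runs Hadoop commands can check for a ticket up front instead of waiting for the "Failed to find any Kerberos tgt" error. The Python sketch below assumes a klist that supports the -s option (as the MIT Kerberos one does), which prints nothing and reports through its exit status whether valid credentials exist. The function name is hypothetical.

```python
import subprocess


def has_valid_ticket():
    """Return True if `klist -s` finds a usable, unexpired ticket cache."""
    try:
        # -s runs silently; exit status 0 means valid credentials were found.
        result = subprocess.run(["klist", "-s"], capture_output=True)
    except FileNotFoundError:
        return False  # klist is not installed on this machine
    return result.returncode == 0


if __name__ == "__main__":
    if has_valid_ticket():
        print("Kerberos ticket found, safe to run hadoop commands")
    else:
        print("No valid Kerberos ticket, run kinit first")
```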


Instead of using the three-step Kerberos ticket exchange protocol to authenticate each call, which would present a high load on the KDC on a busy cluster, Hadoop uses delegation tokens to allow later authenticated access without having to contact the KDC again. Delegation tokens are created and used transparently by Hadoop on behalf of users, so there’s no action you need to take as a user beyond using kinit to sign in, but it’s useful to have a basic idea of how they are used.


A delegation token is generated by the server (the namenode in this case) and can be thought of as a shared secret between the client and the server. On the first RPC call to the namenode, the client has no delegation token, so it uses Kerberos to authenticate, and as a part of the response it gets a delegation token from the namenode. In subsequent calls, it presents the delegation token, which the namenode can verify (since it generated it using a secret key), and hence the client is authenticated to the server. When it wants to perform operations on HDFS blocks, the client uses a special kind of delegation token, called a block access token, that the namenode passes to the client in response to a metadata request. 
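
The idea that the namenode can verify a token "since it generated it using a secret key" can be sketched with a keyed hash. This is only an illustration in Python, not Hadoop's actual token format: the class, the identifier layout, and the choice of HMAC-SHA256 are all assumptions, and the initial Kerberos-authenticated call is not modelled.

```python
import hashlib
import hmac
import os


class ToyNamenode:
    """Issues and verifies tokens using a secret key that only it knows."""

    def __init__(self):
        self._secret_key = os.urandom(32)

    def _sign(self, identifier):
        return hmac.new(self._secret_key, identifier, hashlib.sha256).digest()

    def issue_token(self, user):
        """Called once the first call has been authenticated with Kerberos."""
        # A random suffix makes every issued identifier unique.
        identifier = user.encode() + b":" + os.urandom(8).hex().encode()
        return identifier, self._sign(identifier)

    def verify_token(self, identifier, password):
        """Recompute the keyed hash locally; no KDC round trip is needed."""
        return hmac.compare_digest(self._sign(identifier), password)
```

On later calls the client would present the (identifier, password) pair, and verify_token accepts it only if it was produced with this namenode's key.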

The client uses the block access token to authenticate itself to datanodes. This is possible only because the namenode shares its secret key used to generate the block access token with datanodes (which it sends in heartbeat messages), so that they can verify block access tokens. Thus, an HDFS block may be accessed only by a client with a valid block access token from a namenode. This closes the security hole in unsecured Hadoop where only the block ID was needed to gain access to a block.
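
The role of the shared key can be sketched the same way in Python. In this toy version the namenode binds a user to one block ID with a keyed hash, and a datanode holding the same key recomputes it. The function names, the message layout, and the example block ID are hypothetical, and the heartbeat that distributes the key is reduced to passing the same variable to both sides.

```python
import hashlib
import hmac
import os


def make_block_token(shared_key, user, block_id):
    """Namenode side: bind a user to a single block ID with a keyed hash."""
    message = f"{user}:{block_id}".encode()
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def datanode_allows(shared_key, user, block_id, token):
    """Datanode side: recompute the hash with the key shared by the namenode."""
    if token is None:
        return False  # knowing only the block ID is not enough
    expected = make_block_token(shared_key, user, block_id)
    return hmac.compare_digest(expected, token)


# The namenode creates the key; here it is simply handed to the datanode.
shared_key = os.urandom(32)
token = make_block_token(shared_key, "hadoop-user", 1073741825)
```

A token issued for one block does not open another, because the block ID is part of what was hashed.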











