Heterogeneous Storage in HDFS

From Hadoop 2.3.0 onwards, HDFS supports heterogeneous storage. What is heterogeneous storage, and what are the advantages of using it?

Hadoop started as a scalable batch system for storing and processing huge amounts of data, but it has since become the data lake platform for enterprises. In large enterprises, various types of data need to be stored and processed for advanced analytics. Some of this data is required frequently, some infrequently, and some very rarely. Storing all of it on the same class of hardware is expensive. For example, consider a cluster in AWS with EC2 instances as cluster nodes. EC2 uses EBS and ephemeral storage, and the cost varies with the type of storage: S3 is cheaper than EBS, but access is slower; Glacier is cheaper than S3, but data retrieval takes even longer. Similarly, if we want to keep data in different storage types depending on priority and requirement, we can use this feature of hadoop. It was not available in earlier versions; from version 2.3.0 onwards, a datanode can be defined as a collection of storages, where each storage has a type such as DISK, SSD, ARCHIVE or RAM_DISK. The storage policies available in hadoop are Hot, Warm, Cold, All_SSD, One_SSD and Lazy_Persist.

  • Hot – for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
  • Cold – only for storage with limited compute. The data that is no longer being used, or data that needs to be archived is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
  • Warm – partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
  • All_SSD – for storing all replicas in SSD.
  • One_SSD – for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
  • Lazy_Persist – for writing blocks with a single replica in memory. The replica is first written to RAM_DISK and then lazily persisted to DISK.
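
A storage policy can be applied per file or directory with the hdfs storagepolicies subcommand (available in later Hadoop 2.x releases; the path below is a placeholder):

# List the policies supported by the cluster
hdfs storagepolicies -listPolicies

# Mark a directory as cold and verify
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/archive

# Move the already-written blocks to storage media matching the new policy
hdfs mover -p /data/archive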

Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you are getting the following error in hue,

Unauthorized connection for super-user: hue from IP “x.x.x.x”

Add the following properties to the core-site.xml of your hadoop cluster and restart the cluster.

<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
</property>

You may face a similar error with oozie also. In that case, add a similar configuration for the oozie user in the core-site.xml.

<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>
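
Note that the value * opens the proxy privilege to all hosts and groups. In production you may want to restrict it to the actual Hue/Oozie server hosts and a specific group. For example (the host and group names below are only illustrative):

<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>hue-server.example.com</value>
</property>

<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>hadoop-users</value>
</property>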

Enabling Log Aggregation in YARN

While checking the details of a YARN application, you may see a message similar to “Log Aggregation not enabled”. You can follow the steps below to enable it. This issue is common in EMR, because log aggregation is not enabled by default in most of the AMIs. Enabling it is very simple: add the following configuration to the yarn-site.xml of all the yarn hosts and restart the cluster. (A full cluster restart is not required; restarting all the nodemanagers is enough.)

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
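
Once aggregation is enabled, the logs of completed applications are stored under the yarn.nodemanager.remote-app-log-dir location in HDFS and can be fetched with the yarn CLI. For example (the application id below is just a placeholder):

yarn logs -applicationId application_1400565956231_0001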

Undeleting and purging KeyTrustee Key Provider keys via the REST interface

HDFS data encryption is an excellent feature that came recently. With it, we can encrypt the data in HDFS and create multiple encryption zones with different encryption keys, securing the data properly.

I am using a cluster installed with CDH. I created some encryption keys and zones.
The commands I used for creating a key and an encryption zone are given below.

# As the normal user, create a new encryption key
hadoop key create amalKey
 

# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /user/amal
hdfs crypto -createZone -keyName amalKey -path /user/amal
 

# chown it to the normal user
hadoop fs -chown amal:hadoop /user/amal
 

# As the normal user, put a file in, read it out
hadoop fs -put test.txt /user/amal/
hadoop fs -cat /user/amal/test.txt
 

After some days, I deleted the encryption zone and the encryption key.
The command I used for deleting the encryption key is given below.

hadoop key delete <key-name>

After the deletion, I tried creating a key with the same name, but I got an exception saying that the key still exists in a disabled state. When I list the keys, I am not able to see it. The exception I got is given below.

amalKey has not been created. java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:545)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:503)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKeyInternal(KMSClientProvider.java:676)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKey(KMSClientProvider.java:684)
at org.apache.hadoop.crypto.key.KeyShell$CreateCommand.execute(KeyShell.java:483)
at org.apache.hadoop.crypto.key.KeyShell.run(KeyShell.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.crypto.key.KeyShell.main(KeyShell.java:515)

In the error log, it says to use the purge option to permanently delete the key and the undelete option to recover it, but I was not able to find these options in the hadoop key command. I googled it and couldn't figure out the issue. Finally, an engineer from Cloudera guided me to execute the purge and undelete operations through the REST API of KeyTrustee and gave a nice explanation of my issue. I am briefly describing the solution for this exception below.

The delete operation on the Trustee key provider is a “soft delete”, meaning that is possible to “undelete” the key. It is also possible to “purge” the key to delete it permanently. Because these operations are not part of the standard Hadoop key provider API, they are not currently exposed through Hadoop KeyShell (hadoop key). However, you can call these operations directly via the Trustee key provider REST API.

See the examples below.

Use KeyShell to list existing keys

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use KeyShell to delete an existing key

$ ./bin/hadoop key delete amal-testkey-1 -provider kms://http@localhost:16000/kms
 
Deleting key: amal-testkey-1 from KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1 has been successfully deleted.
KMSClientProvider[http://localhost:16000/kms/v1/] has been updated.

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to undelete the deleted key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

Use KeyShell to verify the key was restored

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use the KeyTrustee key provider REST API to purge the restored key

$ curl -L -d "trusteeOp=purge" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=purge"

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to attempt to undelete the purged key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

{
  "RemoteException" : {
    "message" : "Key with name amal-testkey-1 not found in com.cloudera.keytrustee.TrusteeKeyProvider@6d88562",
    "exception" : "IOException",
    "javaClassName" : "java.io.IOException"
  }
}

Configure ACLs for KeyTrustee undelete, purge and migrate operations

ACLs for the KeyTrustee-specific undelete, purge and migrate operations are configured in kts-acls.xml. Place this file in the same location as your kms-acls.xml file. See the example below.

<property>
    <name>keytrustee.kms.acl.UNDELETE</name>
    <value>*</value>
    <description>
        ACL for undelete-key operations.
    </description>
</property>

<property>
    <name>keytrustee.kms.acl.PURGE</name>
    <value>*</value>
    <description>
        ACL for purge-key operations.
    </description>
</property>

<property>
    <name>keytrustee.kms.acl.MIGRATE</name>
    <value>*</value>
    <description>
        ACL for migrate-key operations.
    </description>
</property>
 

Note: In kerberized environments, the requests will be a little different; they will be in the following format:

Eg:
curl -L --negotiate -u [username] -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=[username]&trusteeOp=undelete"

Configuring Fair Scheduler in Hadoop Cluster

Hadoop comes with various scheduling algorithms such as FIFO, Capacity, Fair and DRF. Here I am briefly explaining how to set up the Fair Scheduler. This can be done in any distribution of hadoop. By default, hadoop comes with the FIFO scheduler; some distributions ship with the Capacity Scheduler as the default. In multiuser environments, a scheduler other than FIFO is definitely required, because FIFO makes every job wait in a single queue in the order of submission. Creating multiple job queues, assigning a portion of the cluster capacity to each, and adding users to these queues will help us manage and utilize the cluster resources properly.
For setting up the fair scheduler manually, we have to make some changes on the resource manager node: a change in the yarn-site.xml and the addition of a new configuration file, fair-scheduler.xml.
The configurations for a basic setup are given below.

Step 1:
Specify the scheduler class in the yarn-site.xml. If this property already exists, replace its value with the one below; otherwise, add the property to the yarn-site.xml.

<property>
   <name>yarn.resourcemanager.scheduler.class</name>
   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

Step 2:
Specify the Fair Scheduler allocation file. This property has to be set in the yarn-site.xml, and the value should be the absolute path of the fair-scheduler.xml file. The file should be present on the local filesystem of the resource manager node.

<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/etc/hadoop/conf/fair-scheduler.xml</value>
</property>

Step 3:
Create the allocation configuration file.
A sample allocation file with a basic set of configurations is given below; more advanced configurations are also possible in this file.
There are five types of elements which can be set up in an allocation file:

Queue elements :– representing queues. Each queue element can have the following properties:

  • minResources — sets the minimum resources of a queue
  • maxResources — sets the maximum resources of a queue
  • maxRunningApps — sets the maximum number of apps from a queue to run at once
  • weight — shares the cluster non-proportionally with other queues. Defaults to 1
  • schedulingPolicy — values are “fair”/”fifo”/”drf” or any class that extends org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy
  • aclSubmitApps — lists the users who can submit apps to the queue. If specified, other users will not be able to submit apps to the queue.
  • minSharePreemptionTimeout — specifies the number of seconds the queue is under its minimum share before it tries to preempt containers to take resources from other queues.

User elements :– representing user behavior. A user element can contain a maxRunningApps property to set the maximum number of running apps for a particular user.

userMaxAppsDefault element :– setting the default running-app limit for any user whose limit is not otherwise specified.

fairSharePreemptionTimeout element :– setting the number of seconds a queue is under its fair share before it tries to preempt containers to take resources from other queues.

defaultQueueSchedulingPolicy element :– specifying the default scheduling policy for queues; overridden by the schedulingPolicy element in each queue if specified.

<?xml version="1.0"?>
<allocations>

    <queue name="queueA">
        <minResources>1000 mb, 1 vcores</minResources>
        <maxResources>5000 mb, 1 vcores</maxResources>
        <maxRunningApps>10</maxRunningApps>
        <aclSubmitApps>hdfs,amal</aclSubmitApps>
        <weight>2.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>

    <queue name="queueB">
        <minResources>1000 mb, 1 vcores</minResources>
        <maxResources>2500 mb, 1 vcores</maxResources>
        <maxRunningApps>10</maxRunningApps>
        <aclSubmitApps>hdfs,sahad,amal</aclSubmitApps>
        <weight>1.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>

    <queue name="queueC">
        <minResources>1000 mb, 1 vcores</minResources>
        <maxResources>2500 mb, 1 vcores</maxResources>
        <maxRunningApps>10</maxRunningApps>
        <aclSubmitApps>hdfs,sree</aclSubmitApps>
        <weight>1.0</weight>
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>

    <user name="amal">
        <maxRunningApps>10</maxRunningApps>
    </user>

    <user name="hdfs">
        <maxRunningApps>5</maxRunningApps>
    </user>

    <user name="sree">
        <maxRunningApps>8</maxRunningApps>
    </user>

    <user name="sahad">
        <maxRunningApps>2</maxRunningApps>
    </user>

    <userMaxAppsDefault>5</userMaxAppsDefault>
    <fairSharePreemptionTimeout>30</fairSharePreemptionTimeout>
</allocations>

Here we created three queues, queueA, queueB and queueC, and mapped users to these queues. While submitting a job, the user should specify the queue name, and only users who have access to a queue (as defined in its acls) can submit jobs to it. Another option is scheduling rules: if we specify scheduling rules, jobs from a particular user will be directed automatically to a particular queue based on the rule. I am not covering scheduling rules here.

After making these changes, restart the resource manager. 

Now go to the resource manager web ui. On the left side of the UI, you can see a section named Scheduler. Click on it and you will be able to see the newly created queues.

Now submit a job specifying a queue name. You can use the option below, which will submit the job to queueA. All the queues that we created are sub-pools of the root queue, so we have to specify the queue name in the format parentQueue.subQueue.

-Dmapred.job.queue.name=root.queueA

Eg:  hadoop jar hadoop-examples.jar wordcount -Dmapred.job.queue.name=root.queueA  <input-location>  <output-location>

If you are running a hive query, you can set this property in the format below. It should be set at the top of the script.

set mapred.job.queue.name=root.queueA;
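
Note: mapred.job.queue.name is the old MRv1 property name. On newer Hadoop versions the property is mapreduce.job.queuename; the old name should still work through hadoop's deprecated-property mapping, but it is safer to use the new one on recent clusters.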

Installing Cloudera Manager in an existing hadoop cluster

Cloudera Manager is an infrastructure management and monitoring tool provided by Cloudera. It has become an excellent tool for managing bigdata infrastructure and greatly reduces the pain of administrators: almost everything an administrator needs is integrated into this software, and it is very user friendly. Cloudera Manager reached this level of maturity only recently, so a lot of existing clusters are still running without it. If you want to manage an existing cluster using Cloudera Manager, the following steps may help you. For this, you have to completely uninstall the existing hadoop setup. No data loss will happen, because we are not touching any data, and the configurations will also remain the same. These are just pointers.

1) Stop all the services
2) Back up the hive metastore, the namenode metadata and all the other required metastores (Eg hue, oozie), as sketched after this list
3) Back up all the configurations
4) Note down the existing storage directories
5) Uninstall all the hadoop services (Never touch the data)
6) Install Cloudera Manager Server and Agent
7) Install all the services (they should be the same versions as before to make the installation smoother)
8) Add the configurations (use the same configurations as before; there is an option to add XML configs in CM)
9) Point the cloudera manager configurations to the existing storage directories
10) Point the new installation to the existing metastore (hive, oozie, hue etc)
11) Start all the services (Don’t format the namenode)
12) Test the cluster
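
A minimal sketch of the backup step (step 2), assuming a MySQL-backed hive metastore; the database name, paths and credentials are placeholders:

# Back up the hive metastore database
mysqldump -u root -p metastore > /backup/hive-metastore.sql

# Back up the namenode metadata (use the value of dfs.namenode.name.dir)
tar -czf /backup/nn-metadata.tar.gz /data/dfs/nn

# Back up the configuration files
tar -czf /backup/hadoop-conf.tar.gz /etc/hadoop/conf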

Notification on completion of Mapreduce jobs

Heavy mapreduce jobs may run for several hours, and there can be several of them; checking the status of such jobs manually is a boring task, and I don't like it :). Wrapping java programs in a monitoring script is not a clean approach either; I consider such designs among the worst.

My requirement was to get a notification on completion of mapreduce jobs. These are critical jobs, and I don't want to frequently check their status and wait for their completion.

Hadoop provides a useful configuration that solves this problem, and it takes just a few lines of code. Add the following three lines to the Driver class:

conf.set("job.end.notification.url", "http://myserverhost:8888/notification/jobId=$jobId?status=$jobStatus"); // $jobId and $jobStatus are replaced with the actual values
conf.setInt("job.end.retry.attempts", 3);    // retry the notification up to 3 times on failure
conf.setInt("job.end.retry.interval", 1000); // wait 1000 ms between retries

With these properties set, hadoop sends an HTTP request to the URL given in job.end.notification.url when the job completes, with the variables $jobId and $jobStatus replaced by the actual values. We only need a small webservice that accepts this request and sends an email; I used python for this because of its simplicity. Once a request comes in, the service parses the arguments and calls the email sending module. This is a very simple example: instead of email, we could send other kinds of notifications such as sms or a phone call, or trigger some other application. The property job.end.notification.url is very helpful for tracking asynchronous mapreduce jobs. This is a clean approach because we are not running any extra script to poll for the job status; the job itself reports its status, and the python program only collects it and sends the notification.

The python code for the webservice and email notification is outlined below.
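
The original service is not reproduced here, so the following is only a minimal sketch using the python standard library. The SMTP host, port and email addresses are placeholders, and it assumes hadoop delivers the notification as a plain HTTP GET.

import smtplib
from email.mime.text import MIMEText
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

SMTP_HOST = "localhost"           # assumption: a local mail relay
FROM_ADDR = "hadoop@example.com"  # placeholder address
TO_ADDR = "admin@example.com"     # placeholder address

def send_mail(job_id, status):
    # Build and send a plain-text notification mail
    msg = MIMEText("Job %s finished with status %s" % (job_id, status))
    msg["Subject"] = "MapReduce job %s: %s" % (job_id, status)
    msg["From"] = FROM_ADDR
    msg["To"] = TO_ADDR
    with smtplib.SMTP(SMTP_HOST) as server:
        server.send_message(msg)

class NotificationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The notification URL was configured as
        # /notification/jobId=$jobId?status=$jobStatus, so the job id
        # arrives in the path and the status in the query string.
        parsed = urlparse(self.path)
        job_id = parsed.path.rsplit("jobId=", 1)[-1]
        status = parse_qs(parsed.query).get("status", ["UNKNOWN"])[0]
        send_mail(job_id, status)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Listen on the port used in job.end.notification.url
    HTTPServer(("", 8888), NotificationHandler).serve_forever()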

Migrating Namenode from one host to another host

The namenode is the heart of the hadoop cluster, so it is usually installed on a machine of better quality than the other nodes. If we want to migrate the namenode from one host to another, the following steps are required. This is a rare scenario.

Manual Approach

Method 1: (By migrating the harddrive)

  • Stop all the running jobs in the cluster
  • Enter namenode safemode
    • hdfs dfsadmin -safemode enter
  • Execute the following command to save the current namespace to the storage directories and reset the edit log:
    • hdfs dfsadmin -saveNamespace
  • Stop the entire cluster
  • Remove the hard disk from the old namenode host and attach it to the new namenode host
  • Release the ipaddress from the old namenode host and assign it to the new namenode host
  • Start the new namenode (DO NOT PERFORM FORMAT)
  • Start all the services

Method 2: (New Harddrive)

  • Stop all the running jobs in the cluster
  • Enter namenode safemode
    • hdfs dfsadmin -safemode enter
  • Execute the following command to save the current namespace to the storage directories and reset the edit log:
    • hdfs dfsadmin -saveNamespace
  • Stop the entire cluster
  • Login to the namenode host.
  • Navigate to the namenode storage directories.
  • Copy the namenode metadata. It is always better to keep this as a compressed file. Note down the folder and file permissions and ownership.
  • Take a back up of the configuration files.
  • Install namenode of the same version as that of the existing system to the new machine.
  • Ensure that the ipaddress of the old host is taken and assigned to the new host.
  • Copy the configuration files and metadata to the new namenode host
  • Create namenode storage directory structure in the new host.
  • Maintain the same folder permissions and ownership in the new host also.
  • If there are any changes in namenode directory structure, make the corresponding changes in config files.
  • In case of a kerberized cluster, create appropriate principals for the new host and place the proper keytabs.
  • Start the new namenode. (DO NOT PERFORM FORMAT)
  • Start the remaining services.
  • Test the working of the cluster by executing file system operations as well as MR operations, as sketched below.
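
A quick sanity check after the migration could look like the following; the paths and jar name are placeholders:

# All datanodes should register with the new namenode
hdfs dfsadmin -report

# Metadata should be readable and writable
hadoop fs -ls /
hadoop fs -put test.txt /tmp/

# MapReduce should run end to end
hadoop jar hadoop-examples.jar wordcount /tmp/test.txt /tmp/wordcount-out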

Automated Approach in a cluster managed using Cloudera Manager (CM 5.4 or above)

If you are using cloudera manager 5.4 or above, there is a feature known as Namenode Role Migration that helps us migrate the namenode from one host to another. This requires HDFS HA to be enabled.

Load Balancers – HA Proxy and ELB

Earlier I wondered how sites like Google handle the large number of requests reaching them. Later I came to know about the concept of load balancing. Using load balancing, we can keep multiple servers in the back end and route the incoming requests to them, which ensures faster responses as well as high availability, so load balancers play a very important role. There are a lot of opensource load balancers as well as paid services. HAProxy is one of the opensource load balancers, and Amazon provides load balancing as a service known as Elastic Load Balancer (ELB). Using a load balancer, we can handle a very large number of requests in a reliable and optimal way. We can use a load balancer with Impala to balance the requests hitting the impala servers: for on-premise environments, we can configure HAProxy, and for cloud environments, we can use ELB. ELB is a ready-to-use service; we just have to add the details of the ports to be forwarded and the listener machines. HAProxy is a very simple application available in the linux repositories, and it is very easy to configure, as the sketch below shows.
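
As an illustration, a minimal HAProxy configuration for balancing impala-shell traffic might look like the following. The host names are placeholders; 21000 is the default impala-shell port (JDBC/ODBC clients use 21050 and would need a similar listen section).

# Minimal haproxy.cfg sketch; host names are placeholders
defaults
    mode tcp                  # Impala client traffic is plain TCP
    timeout connect 5s
    timeout client 3600s      # generous timeouts for long-running queries
    timeout server 3600s

listen impala-shell
    bind :21000               # default impala-shell port
    balance leastconn         # send new connections to the least loaded node
    server impalad1 impala-host-1:21000 check
    server impalad2 impala-host-2:21000 check
    server impalad3 impala-host-3:21000 check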