Programmatic way to identify the status of the namenode in an HA-enabled hadoop cluster

In a namenode HA enabled hadoop cluster, one of the namenodes will be active and the other will be standby. If you want to perform operations on HDFS programmatically, some libraries or packages need the details of the active namenode (for example, some python packages cannot work with the nameservice and need the active namenode explicitly). In this case, the easiest way to get the status is to issue a GET request similar to the one given below against each of the namenodes. This will tell us the status of each namenode.

GET REQUEST

curl 'http://namenode.1.host:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'

SAMPLE OUTPUT

{
  "beans" : [ {
    "name" : "Hadoop:service=NameNode,name=NameNodeStatus",
    "modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
    "State" : "active",
    "SecurityEnabled" : false,
    "NNRole" : "NameNode",
    "HostAndPort" : "namenode.1.host:8020",
    "LastHATransitionTime" : 0
  } ]
}
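
A small python sketch that uses this endpoint to pick the active namenode. The hostnames, the requests dependency and the 50070 port are assumptions; adjust them for your cluster.

import requests

# Placeholder hostnames; replace with your namenode hosts
NAMENODES = ["namenode.1.host", "namenode.2.host"]
JMX_URL = "http://{host}:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"

def get_active_namenode():
    """Return the host whose NameNodeStatus bean reports State 'active'."""
    for host in NAMENODES:
        try:
            beans = requests.get(JMX_URL.format(host=host), timeout=5).json()["beans"]
        except requests.RequestException:
            continue  # namenode unreachable; try the next one
        if beans and beans[0].get("State") == "active":
            return host
    raise RuntimeError("No active namenode found")

if __name__ == "__main__":
    print(get_active_namenode())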

Python code to list all the running EC2 instances across all regions in an AWS account

This code snippet will help you to get the list of all running EC2 instances across all regions in an AWS account. I have used the python boto3 package for developing the code. The code dynamically picks up all the AWS EC2 regions, so it will keep working without any modification even if a new region gets added to AWS.

Note: Only the basic API calls needed to list the instance details are included in this program. Proper coding conventions are not followed. 🙂
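
A minimal sketch along these lines, assuming boto3 is installed and AWS credentials are already configured (pagination is skipped for brevity):

import boto3

# Fetch the region list dynamically so that new regions are picked up automatically
regions = [r["RegionName"] for r in boto3.client("ec2").describe_regions()["Regions"]]

for region in regions:
    ec2 = boto3.client("ec2", region_name=region)
    # Ask only for instances in the 'running' state
    response = ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for reservation in response["Reservations"]:
        for instance in reservation["Instances"]:
            print(region, instance["InstanceId"], instance["InstanceType"])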

Add partitions to hive table with location as S3

Recently I tried to add a partition to a hive table with S3 as the storage. The command I tried is given below.

ALTER table mytable ADD PARTITION (testdate='2015-03-05') location 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05';

I got the following exceptions

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.SentryFilterDDLTask. MetaException(message:Got exception: org.apache.hadoop.fs.FileAlreadyExistsException Can't make directory for path 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable' since it is a file.) (state=08S01,code=1)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:s3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05 is not a directory or unable to create one)

Solution:

Use s3n instead of s3a and it will work. So the S3 URL should be:

s3n://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05
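
Putting it together, the corrected statement is the same command as before with only the scheme changed:

ALTER table mytable ADD PARTITION (testdate='2015-03-05') location 's3n://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05';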


Heterogeneous storages in HDFS

From hadoop 2.3.0 onwards, hdfs supports heterogeneous storage. What is heterogeneous storage, and what are the advantages of using it?

Hadoop started out as a scalable batch system for storing and processing huge volumes of data, but it has since become the data lake platform for enterprises. In a large enterprise, various types of data need to be stored and processed for advanced analytics. Some of this data is needed frequently, some only occasionally, and some very rarely. Storing all of it on the same platform or hardware is costly. For example, consider a cluster running on AWS: the cluster nodes are EC2 instances, which use EBS and ephemeral storage, and the cost varies with the storage type. S3 is cheaper than EBS but slower to access, and Glacier is cheaper still, but data retrieval takes time.

In the same way, if we want to keep data on different storage types depending on priority and requirement, we can use this feature in hadoop. It was not available in earlier versions and exists from hadoop 2.3.0 onwards. A datanode can now be defined as a collection of storages. The storage policies available in hadoop are Hot, Warm, Cold, All_SSD, One_SSD and Lazy_Persist, described below; an example of applying a policy follows the list.

  • Hot – for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
  • Cold – only for storage with limited compute. The data that is no longer being used, or data that needs to be archived is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
  • Warm – partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
  • All_SSD – for storing all replicas in SSD.
  • One_SSD – for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
  • Lazy_Persist – for writing blocks with single replica in memory. The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
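
As a quick sketch of how these policies are used, assuming a Hadoop release that ships the hdfs storagepolicies subcommand (the path /data/archive is only a placeholder):

# List the storage policies supported by the cluster
hdfs storagepolicies -listPolicies

# Apply the COLD policy to a directory
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD

# Verify the policy applied on the directory
hdfs storagepolicies -getStoragePolicy -path /data/archive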

ORA-01045:user name lacks CREATE SESSION privilege; logon denied

After creating a user in an oracle database, I tried to log in using SQL Developer and got the error “ORA-01045: user name lacks CREATE SESSION privilege; logon denied”.

The reason for this error was insufficient privileges.

I solved this issue by granting the following privilege.

grant create session to "<user-name>";
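
To confirm that the privilege is in effect, you can query the standard SESSION_PRIVS data dictionary view after logging in as that user; CREATE SESSION should appear in the output.

-- Run as the newly created user
SELECT * FROM session_privs;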


Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you are getting the following error in hue,

Unauthorized connection for superuser: hue from IP “x.x.x.x”

Add the following properties to the core-site.xml of your hadoop cluster and restart the cluster

<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
</property>

You may face a similar error with oozie also. In that case, add a similar configuration for the oozie user in the core-site.xml

<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>

Enabling Log Aggregation in YARN

While checking the details of a YARN application, you may see a message similar to “Log Aggregation not enabled”. You can follow the steps below to enable it. This issue occurs in EMR because in most of the AMIs log aggregation is not enabled by default. It is very simple to enable: add the following configuration to the yarn-site.xml of all the yarn hosts and restart the cluster (a full cluster restart is not required; restarting all the nodemanagers is enough).

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
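
Once aggregation is enabled, the logs of a completed application can be fetched with the yarn logs command. The application id below is just a placeholder; use the id shown in the ResourceManager UI.

yarn logs -applicationId application_1234567890123_0001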

Undeleting and purging KeyTrustee Key Provider methods via the REST interface

HDFS data encryption is an excellent feature that was added recently. With this we can encrypt the data in hdfs. We can create multiple encryption zones with different encryption keys, and in this way secure the data in hdfs properly. For more details, you can visit these websites: Reference1, Reference2

I am using a cluster installed with CDH. I created some encryption keys and zones.
The command I used for creating a key is given below.

# As the normal user, create a new encryption key
hadoop key create amalKey
 

# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /user/amal
hdfs crypto -createZone -keyName amalKey -path /user/amal
 

# chown it to the normal user
hadoop fs -chown amal:hadoop /user/amal
 

# As the normal user, put a file in, read it out
hadoop fs -put test.txt /user/amal/
hadoop fs -cat /user/amal/test.txt
 

After some days, I deleted the encryption zone and then deleted the encryption key as well. The command I used for deleting the encryption key is given below.

hadoop key delete <key-name>

After the deletion, I tried creating a key with the same name, but I got an exception saying that the key is still present in a disabled state. When I list the keys, I cannot see the key. The exception I got is given below.

amalKey has not been created. java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:545)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:503)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKeyInternal(KMSClientProvider.java:676)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKey(KMSClientProvider.java:684)
at org.apache.hadoop.crypto.key.KeyShell$CreateCommand.execute(KeyShell.java:483)
at org.apache.hadoop.crypto.key.KeyShell.run(KeyShell.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.crypto.key.KeyShell.main(KeyShell.java:515)

The error message says to use the purge option to permanently delete the key and the undelete option to recover a deleted key, but I was not able to find these options in the hadoop key command, and googling did not help either. Finally, someone from Cloudera guided me to execute the purge and undelete operations through the REST API of KeyTrustee and gave a nice explanation of the issue. I am briefly putting the solution for this exception below.

The delete operation on the Trustee key provider is a “soft delete”, meaning that it is possible to “undelete” the key. It is also possible to “purge” the key to delete it permanently. Because these operations are not part of the standard Hadoop key provider API, they are not currently exposed through Hadoop KeyShell (hadoop key). However, you can call these operations directly via the Trustee key provider REST API.

See the examples below.

Use KeyShell to list existing keys

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use KeyShell to delete an existing key

$ ./bin/hadoop key delete amal-testkey-1 -provider kms://http@localhost:16000/kms
 
Deleting key: amal-testkey-1 from KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1 has been successfully deleted.
KMSClientProvider[http://localhost:16000/kms/v1/] has been updated.

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to undelete the deleted key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

Use KeyShell to verify the key was restored

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use the KeyTrustee key provider REST API to purge the restored key

$ curl -L -d "trusteeOp=purge" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=purge"

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to attempt to undelete the purged key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

{
  "RemoteException" : {
    "message" : "Key with name amal-testkey-1 not found in com.cloudera.keytrustee.TrusteeKeyProvider@6d88562",
    "exception" : "IOException",
    "javaClassName" : "java.io.IOException"
  }
}

Configure ACLs for KeyTrustee undelete, purge and migrate operations

ACLs for the KeyTrustee-specific undelete, purge and migrate operations are configured in kts-acls.xml. Place this file in the same location as your kms-acls.xml file. See the example below.

<property>
    <name>keytrustee.kms.acl.UNDELETE</name>
    <value>*</value>
    <description>
        ACL for undelete-key operations.
    </description>
</property>

<property>
    <name>keytrustee.kms.acl.PURGE</name>
    <value>*</value>
    <description>
        ACL for purge-key operations.
    </description>
</property>

<property>
    <name>keytrustee.kms.acl.MIGRATE</name>
    <value>*</value>
    <description>
        ACL for migrate-key operations.
    </description>
</property>
 

Note: In kerberized environments, the requests will be a little different. They will be in the following format:

Eg:
curl -L --negotiate -u [username]  -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name={username}&trusteeOp=undelete"

Changing the data type mapping in sqoop

Sqoop is very helpful in importing data from an RDBMS to hadoop. The hive import feature will create a hive table corresponding to the RDBMS table and import the data. By default, sqoop creates the hive table based on the predefined data type conversion logic built into sqoop. We have the option to change the default conversion and explicitly specify the data type required in the hive table. This is possible by adding an extra option as below.

--map-column-java <mapping> Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping> Override mapping from SQL to Hive type for configured columns.

For example, if we have a field called ‘id‘ in an SQL table that is of integer data type and we want it as a string column in hive, we can do the following.

sqoop import ... --map-column-hive id=string
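
A fuller sketch of where this option sits in an import command; the connection string, credentials and table name below are placeholders, not from the original post:

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table mytable \
  --hive-import \
  --map-column-hive id=string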

Compressing the output of sqoop

The output of a sqoop job can be compressed directly. A sqoop job is a mapreduce job, so by setting the mapreduce output compression codec we can get the output of sqoop compressed. It is very simple; just add an argument to the sqoop command.

--compression-codec <compression codec>

For snappy compressed output the argument will be as below.

--compression-codec org.apache.hadoop.io.compress.SnappyCodec

For Gzip compression

--compression-codec org.apache.hadoop.io.compress.GzipCodec

For Bzip compression

--compression-codec org.apache.hadoop.io.compress.BZip2Codec
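
For example, a snappy-compressed import could look like the sketch below; the connection details are placeholders. The --compress flag explicitly enables compression alongside the codec option.

sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table mytable \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.SnappyCodec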