How to set up Delta Lake in Apache Spark?

Delta Lake is supported in recent versions of Apache Spark (2.4.2 and above). It is open sourced under the Apache 2.0 license, so it is free to use. It is very easy to set up and does not require any admin skills to configure. Delta Lake is available by default in Databricks, so no installation or configuration is needed to use it there.

To try out a basic example, launch pyspark or spark-shell with the delta package added. No additional installation is needed; just use one of the following commands.

For pyspark:

pyspark --packages io.delta:delta-core_2.11:0.4.0

For spark-shell:

bin/spark-shell --packages io.delta:delta-core_2.11:0.4.0

The above commands add the delta package to the session and enable Delta Lake. You can try out the following basic example in the pyspark shell.
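A minimal sketch of such an example is shown below (assuming the shell was launched with the delta package as above; the path /tmp/delta-table is just an example):

# Create a small DataFrame
data = spark.range(0, 5)

# Write it out in Delta format
data.write.format("delta").save("/tmp/delta-table")

# Read it back and display it
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()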

Delta Lake – The New Generation Data Lake

Delta Lake is the need of the present era. For the past several years we have been hearing about Data Lakes, and I have worked on several Data Lake implementations myself. But the previous generation of Data Lakes was not a complete solution. One of the main shortcomings of the old-generation Data Lake is the difficulty of handling ACID transactions.

Delta Lake brings ACID transactions to the storage layer and thus makes the system more robust and efficient. One of my recent projects was to build a Data Lake for one of India's largest ride-sharing companies. Their requirements included handling CDC (Change Data Capture) in the lake. Their customers take several rides per day, and there are a lot of transactions and changes happening across the various entities associated with the platform, such as debiting money from a wallet, crediting money to a wallet, creating a ride, deleting a ride, updating a ride, updating a user profile and so on.

The initial version of the Lake that I designed was capable of recording only the latest values of each of these entities. But that was not a proper solution, as it did not provide full analytics capability. So I then came up with a design using Delta that can handle the CDC. This way we are able to track all the changes happening to the data, and instead of updating records in place, we also keep the historic data in the system.

[Image: Delta Lake architecture]

Image Credits: Delta Lake 

The Delta format is the main magic behind Delta Lake. It is open sourced by Databricks and is available with Apache Spark.

Some of the key features of Delta Lake are listed below.

  1. Support for ACID transactions. It is very tedious to ensure data integrity in a conventional Data Lake, because transaction handling was missing in the old generation of Data Lakes. With support for transactions, Delta Lake becomes more efficient and reduces the workload of data engineers.
  2. Data versioning: Delta Lake supports time travel. This helps with rollback, auditing, version control, etc. Old records are not deleted; they are kept as versions.
  3. Support for Merge, Update and Delete operations.
  4. No major change is required in the existing system to implement Delta Lake. Delta Lake is 100% open source and 100% compatible with the Spark APIs. Delta Lake uses the Apache Parquet format to store the data. The following snippet shows how to save data in Delta format. It is very simple: just use "delta" instead of "parquet".
dataframe
   .write
   .format("parquet")
   .save("/dataset")
dataframe
   .write
   .format("delta")
   .save("/dataset")

To try out Delta in detail, use the Community Edition of Databricks.

A sample code snippet for trying out these features is given below.
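A minimal sketch in pyspark, assuming the shell was launched with the delta package as shown earlier; the path /tmp/delta-demo and the values used are just examples:

from delta.tables import DeltaTable

# Create a Delta table (this becomes version 0 of the table)
spark.range(0, 10).write.format("delta").mode("overwrite").save("/tmp/delta-demo")

# Update and delete records in place using the Python DeltaTable API
deltaTable = DeltaTable.forPath(spark, "/tmp/delta-demo")
deltaTable.update(condition="id % 2 == 0", set={"id": "id + 100"})
deltaTable.delete(condition="id < 5")

# Time travel: read the table as it was at version 0
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-demo").show()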

Programmatic way to identify the status of the namenode in an HA enabled hadoop cluster

In a NameNode HA enabled Hadoop cluster, one of the namenodes will be active and the other will be standby. If you want to perform operations on HDFS programmatically, some libraries or packages need the details of the active namenode (for example, some Python packages need the active namenode directly and do not support the nameservice). In this case, the easiest way to get the status is to issue a GET request similar to the one given below against each of the namenodes. This helps us identify the status of each namenode.

GET REQUEST

curl 'http://namenode.1.host:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus'

SAMPLE OUTPUT

{
"beans" : [ {
"name" : "Hadoop:service=NameNode,name=NameNodeStatus",
"modelerType" : "org.apache.hadoop.hdfs.server.namenode.NameNode",
"State" : "active",
"SecurityEnabled" : false,
"NNRole" : "NameNode",
"HostAndPort" : "namenode.1.host:8020",
"LastHATransitionTime" : 0
} ]
}
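As a rough sketch, this check can be automated in Python; the host names below are placeholders, and using the requests library is just one option:

import requests

# Placeholder namenode hosts; replace with your own
NAMENODES = ["namenode.1.host", "namenode.2.host"]
JMX_QUERY = "/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"

def get_active_namenode(namenodes, port=50070):
    """Return the host of the active namenode, or None if no active namenode is found."""
    for host in namenodes:
        try:
            resp = requests.get("http://{0}:{1}{2}".format(host, port, JMX_QUERY), timeout=5)
            resp.raise_for_status()
            state = resp.json()["beans"][0]["State"]
        except (requests.RequestException, KeyError, IndexError, ValueError):
            # Namenode unreachable or unexpected response; try the next one
            continue
        if state == "active":
            return host
    return None

print(get_active_namenode(NAMENODES))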

Python code to list all the running EC2 instances across all regions in an AWS account

This code snippet will help you get the list of all running EC2 instances across all regions in an AWS account. I have used the Python boto3 package to develop the code. The code dynamically picks up all the AWS EC2 regions, so it will keep working without any modification even if a new region is added to AWS.

Note: Only the basic API calls needed to list the instance details are included in this program. Proper coding conventions are not followed. 🙂
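A minimal sketch of such a script using boto3 is given below; it discovers the regions dynamically and prints a few basic attributes of each running instance:

import boto3

def list_running_instances():
    # Discover all EC2 regions dynamically
    ec2_client = boto3.client("ec2", region_name="us-east-1")
    regions = [r["RegionName"] for r in ec2_client.describe_regions()["Regions"]]

    for region in regions:
        ec2 = boto3.resource("ec2", region_name=region)
        # Keep only instances in the 'running' state
        running = ec2.instances.filter(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for instance in running:
            print(region, instance.id, instance.instance_type,
                  instance.public_ip_address, instance.private_ip_address)

list_running_instances()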

Add partitions to hive table with location as S3

Recently I tried to add a partition to a Hive table with S3 as the storage. The command I tried is given below.

ALTER table mytable ADD PARTITION (testdate='2015-03-05') location 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05';

I got the following exceptions

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.SentryFilterDDLTask. MetaException(message:Got exception: org.apache.hadoop.fs.FileAlreadyExistsException Can't make directory for path 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable' since it is a file.) (state=08S01,code=1)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:s3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05 is not a directory or unable to create one)

Solution:

Use s3n instead of s3a; it will work. So the S3 URL should be:

s3n://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05

Heterogeneous storages in HDFS

From Hadoop 2.3.0 onwards, HDFS supports heterogeneous storage. What is heterogeneous storage, and what are the advantages of using it?

Hadoop started as a scalable batch system for storing and processing huge volumes of data, but it has now become the platform for enterprise Data Lakes. In large enterprises, various types of data need to be stored and processed for advanced analytics. Some of this data is required frequently, some infrequently and some very rarely. If we store all of it on the same platform or hardware, the cost will be high. For example, consider a cluster in AWS whose nodes are EC2 instances using EBS and ephemeral storage: the cost varies with the type of storage. S3 is cheaper than EBS, but access is slower; Glacier is cheaper than S3, but data retrieval takes even longer. In the same way, if we want to keep data in different storage types depending on priority and requirement, we can use this feature in Hadoop. It was not available in earlier versions and is available from Hadoop 2.3.0 onwards. A datanode can now be defined as a collection of storages. The storage policies available in Hadoop are Hot, Warm, Cold, All_SSD, One_SSD and Lazy_Persist.

  • Hot – for both storage and compute. The data that is popular and still being used for processing will stay in this policy. When a block is hot, all replicas are stored in DISK.
  • Cold – only for storage with limited compute. The data that is no longer being used, or data that needs to be archived is moved from hot storage to cold storage. When a block is cold, all replicas are stored in ARCHIVE.
  • Warm – partially hot and partially cold. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are stored in ARCHIVE.
  • All_SSD – for storing all replicas in SSD.
  • One_SSD – for storing one of the replicas in SSD. The remaining replicas are stored in DISK.
  • Lazy_Persist – for writing blocks with single replica in memory. The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
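As an illustration, in recent Hadoop releases a storage policy can be applied to a directory with the hdfs storagepolicies command; the path and policy below are only examples:

hdfs storagepolicies -listPolicies
hdfs storagepolicies -setStoragePolicy -path /data/archive -policy COLD
hdfs storagepolicies -getStoragePolicy -path /data/archive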

ORA-01045:user name lacks CREATE SESSION privilege; logon denied

After creating a user in an Oracle database, I tried to log in using SQL Developer and got the error "ORA-01045: user name lacks CREATE SESSION privilege; logon denied".

The reason for this error was insufficient privileges.

I solved this issue by granting the following privilege:

grant create session to "<user-name>";

Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you are getting the following error in Hue,

Unauthorized connection for superuser: hue from IP “x.x.x.x”

Add the following properties to the core-site.xml of your Hadoop cluster and restart the cluster.

<property>
    <name>hadoop.proxyuser.hue.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.hue.hosts</name>
    <value>*</value>
</property>

You may face a similar error with Oozie as well. In that case, add a similar configuration for the oozie user in core-site.xml.

<property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
</property>

<property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
</property>

Enabling Log Aggregation in YARN

While checking the details of a YARN application, if you see a message similar to "Log Aggregation not enabled", you can follow the steps below to enable it. This issue occurs in EMR because in most of the AMIs log aggregation is not enabled by default. It is very simple to enable: add the following configuration to the yarn-site.xml of all the YARN hosts and restart the cluster. (A full cluster restart is not required; restarting all the nodemanagers is enough.)

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
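Once log aggregation is enabled, the aggregated logs of a finished application can be fetched with the yarn logs command, for example:

yarn logs -applicationId <application-id>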

Undeleting and purging KeyTrustee Key Provider methods via the REST interface

HDFS data encryption is an excellent feature that was added recently. With it, we can encrypt the data in HDFS and create multiple encryption zones with different encryption keys. In this way, we can secure the data in HDFS properly. For more details, you can visit these websites: Reference1, Reference2

I am using a cluster installed with CDH. I created some encryption keys and zones.
The command I used for creating a key is given below.

# As the normal user, create a new encryption key
hadoop key create amalKey

# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /user/amal
hdfs crypto -createZone -keyName amalKey -path /user/amal

# chown it to the normal user
hadoop fs -chown amal:hadoop /user/amal

# As the normal user, put a file in, read it out
hadoop fs -put test.txt /user/amal/
hadoop fs -cat /user/amal/test.txt

After some days, I deleted the encryption zone, and I deleted the encryption key as well.
The command I used for deleting the encryption key is given below.

hadoop key delete <key-name>

After the deletion, I tried creating a key with the same name, but I got an exception saying that the key is still present in a disabled state. When I list the keys, I am not able to see the key. The exception I got is given below.

amalKey has not been created. java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:545)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:503)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKeyInternal(KMSClientProvider.java:676)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKey(KMSClientProvider.java:684)
at org.apache.hadoop.crypto.key.KeyShell$CreateCommand.execute(KeyShell.java:483)
at org.apache.hadoop.crypto.key.KeyShell.run(KeyShell.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.crypto.key.KeyShell.main(KeyShell.java:515)

The error says to use the purge option to permanently delete the key and the undelete option to recover a deleted key, but I was not able to find these options in the hadoop key command. I googled it and couldn't figure out the issue. Finally I got guidance from someone at Cloudera to execute the purge and undelete commands through the REST API of KeyTrustee, and he gave a nice explanation of my issue. The solution for this exception is briefly described below.

The delete operation on the Trustee key provider is a "soft delete", meaning that it is possible to "undelete" the key. It is also possible to "purge" the key to delete it permanently. Because these operations are not part of the standard Hadoop key provider API, they are not currently exposed through the Hadoop KeyShell (hadoop key). However, you can call these operations directly via the Trustee key provider REST API.

See the examples below.

Use KeyShell to list existing keys

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use KeyShell to delete an existing key

$ ./bin/hadoop key delete amal-testkey-1 -provider kms://http@localhost:16000/kms
 
Deleting key: amal-testkey-1 from KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1 has been successfully deleted.
KMSClientProvider[http://localhost:16000/kms/v1/] has been updated.

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to undelete the deleted key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

Use KeyShell to verify the key was restored

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use the KeyTrustee key provider REST API to purge the restored key

$ curl -L -d "trusteeOp=purge" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=purge"

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to attempt to undelete the purged key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

{
"RemoteException" : {
"message" : "Key with name amal-testkey-1 not found in com.cloudera.keytrustee.TrusteeKeyProvider@6d88562",
"exception" : "IOException",
"javaClassName" : "java.io.IOException"
}
}

Configure ACLs for KeyTrustee undelete, purge and migrate operations

ACLs for the KeyTrustee specific undelete, purge and migrate operations are configured in kts-acls.xml. Place this file in the same location as your kms-acls.xml file. See example below.

<property>
    <name>keytrustee.kms.acl.UNDELETE</name>
    <value>*</value>
    <description>ACL for undelete-key operations.</description>
</property>

<property>
    <name>keytrustee.kms.acl.PURGE</name>
    <value>*</value>
    <description>ACL for purge-key operations.</description>
</property>

<property>
    <name>keytrustee.kms.acl.MIGRATE</name>
    <value>*</value>
    <description>ACL for migrate-key operations.</description>
</property>

Note: In kerberized environments, the requests will be slightly different. They will be in the following format:

curl -L --negotiate -u [username] -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name={username}&trusteeOp=undelete"