Delta Science – The art of designing new generation Data Lake

When we hear about Delta Lake, the first question that comes to mind is:

“What is Delta Lake and how does it work?”

“Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads”

But how is it possible to maintain transactions in the Big Data world? The answer is simple: Delta Lake uses the Delta format.

Delta Lake stores data in the Delta format, which is versioned Parquet together with scalable metadata. The data itself is stored as Parquet files, and every change made to the data is recorded in a transaction log kept alongside them (the _delta_log directory of the table). The metadata therefore grows along with the data.
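
To make the idea of "Parquet data files plus a log of changes" concrete, here is a minimal sketch in Python. It is a toy model only, not Delta Lake's actual log format or API: the TOY_LOG structure, the file names and the files_at_version function are all hypothetical and exist purely for illustration. Each commit is a new version that adds or removes data files, and replaying the commits up to a given version tells a reader which files make up the table at that version.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy commit log (hypothetical, NOT the real Delta log layout):
# version number -> list of (action, data file name) pairs.
TOY_LOG: Dict[int, List[Tuple[str, str]]] = {
    0: [("add", "part-000.parquet"), ("add", "part-001.parquet")],     # initial load
    1: [("add", "part-002.parquet")],                                  # append
    2: [("remove", "part-000.parquet"), ("add", "part-003.parquet")],  # rewrite of one file
}


def files_at_version(
    log: Dict[int, List[Tuple[str, str]]],
    version: Optional[int] = None,
) -> Set[str]:
    """Replay commits in order and return the files that are live at `version`.

    If `version` is None, the latest version in the log is used.
    """
    if not log:
        return set()
    if version is None:
        version = max(log)

    live: Set[str] = set()
    for v in sorted(log):
        if v > version:
            break
        for action, path in log[v]:
            if action == "add":
                live.add(path)
            elif action == "remove":
                live.discard(path)
            else:
                raise ValueError(f"unknown action: {action!r}")
    return live


if __name__ == "__main__":
    print(sorted(files_at_version(TOY_LOG, 1)))  # table as of version 1
    print(sorted(files_at_version(TOY_LOG)))     # latest version of the table
```

Because old data files are only marked as removed in the log rather than overwritten in place, an earlier version of the table can still be reconstructed by replaying fewer commits. That is the intuition behind versioning in this toy model.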

The Delta format solves several major challenges in the data lake world. Some of them are listed below:

  1. Transaction management
  2. Versioning
  3. Incremental Load
  4. Indexing
  5. UPSERT and DELETE operations
  6. Schema Enforcement and Schema Evolution

In this post I will elaborate on each of the above features and explain more about the internals of Delta Lake.

How to configure Delta Lake on EMR?

EMR versions 5.24.x and higher ship with Apache Spark 2.4.2 or higher, so Delta Lake can be used on EMR 5.24.x and above. Delta Lake is not enabled on EMR by default, but it is easy to enable.
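
If cluster provisioning is scripted, the version rule above can be checked before launching anything. The following is a minimal sketch in Python, assuming release labels of the form emr-X.Y.Z (as in the emr-5.25.0 label used later in this post). The names MIN_EMR_RELEASE, parse_release_label and meets_minimum_release are hypothetical helpers, not part of any AWS library, and the check only encodes the "5.24.x and above" rule stated in the text.

```python
import re
from typing import Tuple

# Minimum release taken from the text: EMR 5.24.x and above.
MIN_EMR_RELEASE: Tuple[int, int, int] = (5, 24, 0)


def parse_release_label(label: str) -> Tuple[int, int, int]:
    """Turn a label such as 'emr-5.25.0' into a comparable tuple (5, 25, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized EMR release label: {label!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def meets_minimum_release(label: str) -> bool:
    """Return True if the release label is at or above MIN_EMR_RELEASE."""
    return parse_release_label(label) >= MIN_EMR_RELEASE


if __name__ == "__main__":
    for label in ("emr-5.23.0", "emr-5.24.0", "emr-5.25.0"):
        print(label, meets_minimum_release(label))
```

Comparing tuples of integers avoids the classic pitfall of comparing version strings, where "5.9.0" would sort after "5.24.0".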

We just need to add the Delta jar to the Spark jars. This can be done manually or, more easily, with a custom bootstrap script. A sample script is given below. Upload the delta-core jar to an S3 bucket, and the shell script downloads it from there to the Spark jars folder. The delta-core jar can be downloaded from the Maven repository, or you can build it yourself from the source code, which is available on GitHub.

Adding the script as a bootstrap action performs this step automatically while the cluster is being provisioned. Keep the script below in an S3 location and pass it as the bootstrap script.


#!/bin/bash
aws s3 cp s3://mybucket/delta/delta-core_2. /usr/lib/spark/jars/

You can launch the cluster from either the AWS web console or the AWS CLI.

aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
--use-default-roles --ec2-attributes KeyName=myDeltaKey \
--applications Name=Hive Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://mybucket/bootstrap/"