EMR versions 5.24.x and higher ship Apache Spark 2.4.2 or higher, so Delta Lake can be enabled on EMR 5.24.x and above. Delta Lake is not enabled in EMR by default, but enabling it is easy.
We just need to add the Delta jar to the Spark jars. We can add it manually, or do it more easily with a custom bootstrap script. A sample script is given below. Upload the delta-core jar to an S3 bucket and download it to the Spark jars folder using the shell script below. The delta-core jar can be downloaded from the Maven repository. You can also build it yourself; the source code is available on GitHub.
Adding the script as a bootstrap action performs the copy automatically while the cluster is being provisioned. Keep the script below in an S3 location and pass it as the bootstrap script.
copydeltajar.sh
#!/bin/bash
aws s3 cp s3://mybucket/delta/delta-core_2.11.0.4.0.jar /usr/lib/spark/jars/
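Before the cluster is launched, both the jar and the script have to be in S3. A minimal sketch of that staging step follows, assuming Delta Lake 0.4.0 built for Scala 2.11 and the standard Maven Central layout for the `io.delta` group. The function names are illustrative, and `mybucket` is the same placeholder bucket used above. Maven names the file `delta-core_2.11-0.4.0.jar`, so the destination key is passed explicitly and should match whatever name copydeltajar.sh copies.

```shell
#!/bin/bash
# Sketch: stage the delta-core jar and the bootstrap script in S3.
# Function names are illustrative, not part of any tool.

# Print the Maven Central URL for a delta-core jar
# (assumes the standard repository layout for group io.delta).
delta_jar_url() {
  local scala_version="$1" delta_version="$2"
  local artifact="delta-core_${scala_version}"
  echo "https://repo1.maven.org/maven2/io/delta/${artifact}/${delta_version}/${artifact}-${delta_version}.jar"
}

# Download the jar, then upload it and the bootstrap script to S3.
# $1 = S3 URI for the jar, $2 = S3 URI for the script
stage_delta_files() {
  local jar_s3_uri="$1" script_s3_uri="$2"
  local tmp_jar
  tmp_jar="$(mktemp)" || return 1
  curl -fsSL -o "$tmp_jar" "$(delta_jar_url 2.11 0.4.0)" || return 1
  aws s3 cp "$tmp_jar" "$jar_s3_uri" || return 1
  aws s3 cp copydeltajar.sh "$script_s3_uri" || return 1
  rm -f "$tmp_jar"
}

# Example call (needs network access and AWS credentials):
# stage_delta_files s3://mybucket/delta/delta-core_2.11.0.4.0.jar \
#                   s3://mybucket/bootstrap/copydeltajar.sh
```

Passing the destination keys as arguments keeps the sketch independent of any particular bucket layout.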
You can launch the cluster either from the AWS web console or with the AWS CLI.
aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
  --use-default-roles --ec2-attributes KeyName=myDeltaKey \
  --applications Name=Hive Name=Spark \
  --instance-count 3 --instance-type m5.xlarge \
  --bootstrap-actions Path="s3://mybucket/bootstrap/copydeltajar.sh"
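Once the cluster is running, it may be worth confirming on the master node that the jar actually ended up in the Spark jars folder. The check below is a small sketch; `has_delta_jar` is a hypothetical helper, not an EMR or Spark command, and it defaults to the `/usr/lib/spark/jars` path used by the bootstrap script.

```shell
#!/bin/bash
# Sketch: check whether a delta-core jar is present in a Spark jars directory.
# has_delta_jar is an illustrative helper, not an EMR or Spark command.
has_delta_jar() {
  local jars_dir="${1:-/usr/lib/spark/jars}"
  # ls exits non-zero when the glob matches nothing or the directory is missing.
  ls "$jars_dir"/delta-core_*.jar >/dev/null 2>&1
}

# Example, run on the master node over SSH:
# if has_delta_jar; then echo "delta-core jar found"; else echo "delta-core jar missing"; fi
```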
Hey, nice tip. Unfortunately, I don’t think it works on newer EMR versions. I tried it on 5.32.0 today, and I think the spark jars overwrote my bootstrap jar.
I think you have to use this technique as well now:
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
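As a rough sketch of what that technique might look like (this is an interpretation, not a copy of the linked article): keep the jar in a directory that the Spark installation does not touch, then add that directory to `spark.driver.extraClassPath` and `spark.executor.extraClassPath` through a `spark-defaults` configuration. The helper name and the `/home/hadoop/extrajars` directory are assumptions made for illustration. The existing classpath value varies by EMR release, so it is taken as an argument rather than guessed.

```shell
#!/bin/bash
# Sketch: generate an EMR configurations file that appends a custom jar
# directory to the Spark driver and executor classpaths.
# write_spark_classpath_config is an illustrative helper name.

# $1 = directory holding the extra jars on every node (hypothetical path)
# $2 = the release's existing extraClassPath value, e.g. copied from
#      /etc/spark/conf/spark-defaults.conf on a running cluster
#      (assumes driver and executor share the same default)
# $3 = output file
write_spark_classpath_config() {
  local jar_dir="$1" default_cp="$2" out_file="$3"
  cat > "$out_file" <<EOF
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "${default_cp}:${jar_dir}/*",
      "spark.executor.extraClassPath": "${default_cp}:${jar_dir}/*"
    }
  }
]
EOF
}

# Example:
# write_spark_classpath_config /home/hadoop/extrajars "<existing classpath>" spark-classpath.json
```

The generated file could then be passed to `aws emr create-cluster` with `--configurations file://spark-classpath.json`, with the bootstrap script copying the jar into the chosen directory instead of `/usr/lib/spark/jars/`.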