Advertisements

How to configure Delta Lake on EMR ?

EMR versions 5.24.x and higher versions has Apache Spark version 2.4.2 and higher. So Delta Lake can be enabled in EMR versions 5.24.x and above. By default Delta Lake is not enabled in EMR. It is easy to enable Delta Lake in EMR.

We just need to add the delta jar to the spark jars. We can either add it manually or can be performed easily by using a custom bootstrap script. A Sample script is given below. Upload the delta-core jar to an S3 bucket and download it to the spark jars folder using the below shell script. The delta core jar can be downloaded from maven repository. You can even build it yourselves also. The source code is available in github.

Adding this as a bootstrap action will automatically perform this activity while provisioning the cluster. Keep the below script in an S3 location and pass it as bootstrap script.

copydeltajar.sh

#!/bin/bash

aws s3 cp s3://mybucket/delta/delta-core_2.11.0.4.0.jar /usr/lib/spark/jars/

You can launch the cluster either by using the aws web console or by using the aws cli.

aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
--use-default-roles --ec2-attributes KeyName=myDeltaKey \
--applications Name=Hive Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://mybucket/bootstrap/copydeltajar.sh"

 

Advertisements

How to add EPEL Repository in Linux ?

Linux is my favourite operating system. I like windows for multimedia activities. But when it comes to work and experiments, I like linux. Linux gives us the flexibility to perform all operations and it is a vast ocean to explore. Most of us might have heard about EPEL. We used to download lot of packages from EPEL.

But did anyone knows what is EPEL ??
EPEL stands for Extra Packages for Enterprise Linux. It is an opensource repository maintained by the community which contains lot of useful software packages for Redhat, CentOS and Scientific Linux. We can find packages for almost everything as per our needs from this repository.

  • EPEL repository is 100% opensource and is free to use.
  • No extra effort is required to install these packages.
  • Version specific packages are available depending upon the OS version. So this will not cause any conflicts with existing packages in the OS.
  • Can be simply installed using yum

By default the epel repository will not be added in the linux. We have to add it explicitly. We have to download the epel repo and add it to the repositories. This can be simply done by installing an rpm. The following steps help you in adding the epel repository to your CentOS/Redhat machine.

RHEL/CentOS 7 64-Bit

## RHEL/CentOS 7 64-Bit ##
# wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
# rpm -ivh epel-release-7-5.noarch.rpm

RHEL/CentOS 6 32-Bit

## RHEL/CentOS 6 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

RHEL/CentOS 6 64-Bit

## RHEL/CentOS 6 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

RHEL/CentOS 5 32-Bit

## RHEL/CentOS 5 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm

RHEL/CentOS 5 64-Bit

## RHEL/CentOS 5 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm

RHEL/CentOS 4 32-Bit

## RHEL/CentOS 4 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/i386/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm

RHEL/CentOS 4 64-Bit

## RHEL/CentOS 4 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/x86_64/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm