How to find the IP address of a Linux server?

To check the IP address of a Linux server, type the following command in the terminal/command line.

ifconfig

The command below will also show the IP address. It is part of the iproute2 package and is the preferred tool on newer distributions, where ifconfig may not be installed by default.

ip addr

CDH cluster installation failing in the “Distributing” stage – Failure due to stall on seeded torrent

I faced this issue while distributing the downloaded packages in Cloudera Manager.

The solution that worked for me was to add the IP address to hostname mapping to the /etc/hosts file on the Cloudera Manager server and on all the agent nodes.

/etc/hosts

192.168.0.101   cdhdatanode1

ERROR Failed to collect NTP metrics – Cloudera Manager Agent

If you are facing an error like “Failed to collect NTP metrics”, the following solution might help. The error occurs because no NTP service is running on the host. The commands below work on CentOS/RHEL systems; NTP keeps the system clock in sync with network time servers.

yum install ntp

systemctl enable ntpd

systemctl restart ntpd

Delta Science – The art of designing the new generation Data Lake

When we hear about Delta Lake, the first question that comes to our mind is

“What is Delta Lake and how does it work?”

“Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads”

But how is it possible to maintain transactions in the Big Data world? The answer is very simple: the Delta format.

Delta Lake stores data in the Delta format. The Delta format is essentially versioned Parquet together with scalable metadata: the data itself is stored as Parquet files, and every change to the data is tracked in a metadata (transaction log) file. So the metadata also grows along with the data.
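To make this concrete, here is a minimal PySpark sketch (assuming a local Spark session launched with the delta-core package on the classpath; the path is only an example) that writes a tiny Delta table and then lists the Parquet data files and the _delta_log commit files it produces.

# Minimal sketch: write a Delta table and inspect the files it creates.
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-format-demo").getOrCreate()

# Write a tiny dataframe in Delta format (the path is just an example).
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("overwrite").save("/tmp/delta-demo")

# The directory now contains Parquet data files plus a _delta_log folder;
# every commit adds a new JSON file (00000000000000000000.json, ...).
print(os.listdir("/tmp/delta-demo"))
print(os.listdir("/tmp/delta-demo/_delta_log"))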

The Delta format solves several major challenges in the Big Data Lake world. Some of them are listed below.

  1. Transaction management
  2. Versioning
  3. Incremental Load
  4. Indexing
  5. UPSERT and DELETE operations
  6. Schema Enforcement and Schema Evolution

I will expand this post by explaining each of the above features and by going deeper into the internals of Delta Lake.
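As a small preview of the UPSERT capability, here is a minimal PySpark sketch. It assumes Delta Lake 0.4.0 or later, where the Python DeltaTable API is available; the table path and column names are only illustrative.

# UPSERT (merge) sketch: update matching rows, insert new ones, in one transaction.
from delta.tables import DeltaTable

# Existing Delta table (the path is an example).
target = DeltaTable.forPath(spark, "/tmp/delta/rides")

# New and changed records arriving from the source system.
updates = spark.createDataFrame(
    [(1, "COMPLETED"), (3, "CREATED")], ["ride_id", "status"])

(target.alias("t")
    .merge(updates.alias("u"), "t.ride_id = u.ride_id")
    .whenMatchedUpdate(set={"status": "u.status"})
    .whenNotMatchedInsertAll()
    .execute())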

How to configure Delta Lake on EMR?

EMR release 5.24.x and higher ships with Apache Spark 2.4.2 or later, so Delta Lake can be enabled on EMR 5.24.x and above. Delta Lake is not enabled by default on EMR, but enabling it is easy.

We just need to add the Delta jar to the Spark jars directory. This can be done manually, or more conveniently through a custom bootstrap script. A sample script is given below: upload the delta-core jar to an S3 bucket and copy it into the Spark jars folder using the shell script. The delta-core jar can be downloaded from the Maven repository, or you can build it yourself; the source code is available on GitHub.

Adding this as a bootstrap action performs the copy automatically while the cluster is being provisioned. Keep the script below in an S3 location and pass it as a bootstrap action.

copydeltajar.sh

#!/bin/bash
# Copy the delta-core jar (uploaded beforehand to S3) into the Spark jars directory.
aws s3 cp s3://mybucket/delta/delta-core_2.11-0.4.0.jar /usr/lib/spark/jars/

You can launch the cluster either from the AWS web console or using the AWS CLI.

aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
--use-default-roles --ec2-attributes KeyName=myDeltaKey \
--applications Name=Hive Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://mybucket/bootstrap/copydeltajar.sh"
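Once the cluster is up, a quick sanity check (this is just a suggested verification, not part of the official setup) is to open the pyspark shell on the master node and write a small Delta table; the path below is only an example.

# Run inside the pyspark shell on the EMR master node.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-check")
spark.read.format("delta").load("/tmp/delta-check").show()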

 

How to set up Delta Lake in Apache Spark?

Delta Lake is supported in recent versions of Apache Spark (2.4.2 and above). It is open sourced under the Apache 2.0 license, so it is free to use. It is very easy to set up and does not require any admin skills to configure. Delta Lake is also available by default in Databricks, so no installation or configuration is needed there.

To try out a basic example, launch pyspark or spark-shell with the Delta package added. No additional installation is needed; just use the following commands.

For pyspark

pyspark --packages io.delta:delta-core_2.11:0.4.0

For spark-shell

bin/spark-shell --packages io.delta:delta-core_2.11:0.4.0

The above commands add the Delta package to the Spark session and enable Delta Lake. You can try out the following basic example in the pyspark shell.
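This is a minimal sketch; the output path is only an example.

# Create a small dataframe and write it in Delta format.
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

# Read the Delta table back and display it.
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()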

Delta Lake – The New Generation Data Lake

Delta Lake is the need of the present era. For the past several years we have been hearing about Data Lakes, and I have worked on several Data Lake implementations myself. But the previous generation of Data Lakes was not a complete solution; one of its main shortcomings was the difficulty of handling ACID transactions.

Delta Lake brings ACID transactions to the storage layer and thus makes the system more robust and efficient. One of my recent projects was to build a Data Lake for one of India’s largest ride-sharing companies. Their requirements included handling CDC (Change Data Capture) in the lake. Their customers take several rides per day, so there are many transactions and changes happening across the entities associated with the platform: debiting and crediting wallets, creating, updating and deleting rides, updating user profiles and so on.

The initial version of the lake that I designed recorded only the latest value of each of these entities. That was not a proper solution, as it did not provide complete analytics capability. So I came up with a design using Delta that can handle CDC: we can track all the changes happening to the data, and instead of updating records in place, we keep the historic data in the system as well.

[Figure: Delta Lake architecture]

Image Credits: Delta Lake 

The Delta format is the main magic behind Delta Lake. It was open sourced by Databricks and is available with Apache Spark.

Some of the key features of Delta Lake are listed below.

  1. Support for ACID transactions: it is very tedious to maintain data integrity in a conventional Data Lake, because transaction handling was missing in the old generation. With transaction support, Delta Lake becomes more reliable and reduces the workload of data engineers.
  2. Data versioning: Delta Lake supports time travel, which helps with rollback, auditing, version control and so on. Old records are not deleted; they are versioned.
  3. Support for merge, update and delete operations (a small update/delete sketch follows the code snippet below).
  4. No major change is required in an existing system to adopt Delta Lake. Delta Lake is 100% open source and fully compatible with the Spark APIs, and it uses the Apache Parquet format to store the data. The following snippet shows how to save data in Delta format; it is as simple as using “delta” instead of “parquet”.
dataframe
   .write
   .format("parquet")
   .save("/dataset")
dataframe
   .write
   .format("delta")
   .save("/dataset")

To try out Delta in detail, use the community edition of Databricks.

A sample code snippet for trying it out is attached below.
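This is only a rough sketch: it writes a Delta table twice and then uses time travel (versionAsOf) to read the first version back; the path is just an example.

# Version 0: write the initial data.
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta-table")

# Version 1: overwrite with new data; the old version is kept, not deleted.
spark.range(5, 10).write.format("delta").mode("overwrite").save("/tmp/delta-table")

# Read the current version.
spark.read.format("delta").load("/tmp/delta-table").show()

# Time travel: read version 0 back.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table").show()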

CentOS 8 released last week

CentOS 8 was released last week, and I just downloaded the ISO file. It is around 6.6 GB.

[Screenshot: CentOS 8 ISO download]

There are a lot of exciting updates in this new release. I will share my experience after installing it.

 

RJDBC java.lang.OutOfMemoryError

You might see the error below while making JDBC connections from R programs.

java.lang.OutOfMemoryError: Java heap space

If you face Java heap space exceptions like the one above in RJDBC connections, simply increase the Java heap size from your R program. Note that the option must be set before the JVM is started, i.e. before loading RJDBC/rJava. A sample snippet is given below.

options(java.parameters = "-Xmx8048m")
library("RJDBC")

or

options(java.parameters = "-Xmx8g")
library("RJDBC")

Hope this helps you.

How to extract a tar.gz file quickly in Linux

Recently I got a tar.gz file of around 30 GB that expands to approximately 4 TB on extraction. I wanted to speed up the process, as a normal extraction was taking approximately a day. After a lot of searching, I finally figured out a solution.

The solution was pigz, a parallel implementation of gzip. Decompression itself still happens in a single thread, but pigz uses additional threads for reading, writing and checksum calculation, so the overall performance is far better than standard gzip.

The command to install pigz on CentOS or RHEL is given below. Ensure the EPEL repository is enabled on your system.

yum install pigz

The command to extract a tar.gz file using pigz is given below.

pigz -dc compressed.tar.gz | tar xf -

If you want to see the progress of the extraction, use Pipe Viewer (pv). pv is a tool for monitoring the progress of data through a pipeline. It can be inserted into any normal pipeline between two processes to give a visual indication of how quickly data is passing through, how long it has taken, how near to completion it is, and an estimate of how long it will take to finish.

Pipe viewer can be installed in CentOS or RHEL using the following command

yum install pv

Using pv, we can monitor the progress of the decompression:

pigz -dc compressed.tar.gz | pv | tar xf -