dependency xml is not available

The error “dependency xml is not available” can be resolved by installing the libxml2 packages listed below.

For CentOS/RHEL

yum install libxml2 libxml2-devel

For Ubuntu

apt-get install libxml2-dev
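
After installation, you can quickly confirm that the development files are visible (xml2-config is shipped with the devel package):

xml2-config --version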

How to check the performance of DNS in your network?

I was looking for tools to benchmark the performance of the DNS servers in my network. The reason behind this performance test was to identify the root cause of internet slowness within my network. One of the good free tools I found online is the DNS Benchmark tool.

This is a very lightweight and portable tool. It is only about 180 KB and helps us perform a DNS speed test. With this tool I even spotted an anonymous DNS server running on an individual's laptop.


I found this tool to be a useful utility.
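
If you just want a quick manual data point without installing anything special, dig (from the bind-utils/dnsutils package) reports the query time of a single lookup against any resolver; the resolver and domain below are only examples:

dig @8.8.8.8 example.com | grep "Query time"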

How to split a list into chunks using Python

To split a large list into smaller lists, you can use the following code snippet.

This can be done easily with numpy. numpy.array_split also handles lists whose length is not evenly divisible by the number of chunks.

import numpy

num_splits = 3
large_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
              14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

# array_split returns numpy arrays; convert each back to a plain list
split_lists = numpy.array_split(large_list, num_splits)
for split in split_lists:
    print(list(split))
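
If pulling in numpy feels heavy for this, a plain-Python sketch using list slicing does the same job for a fixed chunk size (the chunk size of 9 below is arbitrary):

def chunk(lst, size):
    # yield consecutive slices of at most `size` elements
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

print(list(chunk(large_list, 9)))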


How to find the IP address of a Linux server?

To check the IP address of a Linux server, type the following command in the terminal/command line.

ifconfig

The below command will also help in finding the IP address; it is the modern replacement on newer distributions where ifconfig is no longer installed by default.

ip addr
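
If you only want the addresses without the interface details, the following variations also work; hostname -I prints every address assigned to the host, and the -4 flag restricts ip to IPv4:

hostname -I

ip -4 addr show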

CDH cluster installation failing in “distributing” stage – Failure due to stall on seeded torrent

I faced this issue while Cloudera Manager was distributing the downloaded packages.

The solution that worked for me was to add the IP address to hostname mapping to /etc/hosts on the Cloudera Manager server and on all the agent nodes.

/etc/hosts

192.168.0.101   cdhdatanode1
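
A fuller sketch of how the file might look (the manager address and second data node below are hypothetical; every node should carry an entry for every other node in the cluster):

# Cloudera Manager server
192.168.0.100   cdhmanager
# Agents
192.168.0.101   cdhdatanode1
192.168.0.102   cdhdatanode2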

ERROR Failed to collect NTP metrics – Cloudera Manager Agent

If you are facing an error like “Failed to collect NTP metrics”, the following solution might help you. It occurs because no NTP service is running on the host. The steps below work for CentOS/RHEL systems. NTP keeps the system time in sync with network time servers.

yum install ntp

systemctl enable ntpd

systemctl restart ntpd
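
Once the daemon is up, you can verify that it is actually syncing with its peers; ntpq is installed along with the ntp package:

ntpq -p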

Delta Science – The art of designing the new-generation Data Lake

When we hear about Delta Lake, the first question that comes to mind is

“What is Delta Lake and how does it work?”

“Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads”

But how is it possible to maintain transactions in the Big Data world? The answer is very simple: by using the Delta format.

Delta Lake stores data in the Delta format. The Delta format is a versioned Parquet format combined with scalable metadata. It stores the data internally as Parquet files and tracks every change to the data in the metadata file, so the metadata grows along with the data.

The Delta format solved several major challenges in the Big Data Lake world. Some of them are listed below:

  1. Transaction management
  2. Versioning
  3. Incremental Load
  4. Indexing
  5. UPSERT and DELETE operations
  6. Schema Enforcement and Schema Evolution

I will expand this post by explaining each of the above features and going deeper into the internals of Delta Lake.
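
To make the “versioned Parquet” idea concrete, here is a minimal PySpark sketch; the path is made up, and it assumes the delta-core jar is already on the Spark classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Every write in Delta format records a new version in the transaction log
spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/demo")
spark.range(5, 10).write.format("delta").mode("overwrite").save("/tmp/delta/demo")

# Reading without options returns the latest version
spark.read.format("delta").load("/tmp/delta/demo").show()

# Time travel: read the first version back from the metadata
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo").show()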

How to configure Delta Lake on EMR?

EMR versions 5.24.x and above ship with Apache Spark 2.4.2 or higher, so Delta Lake can be enabled on those versions. Delta Lake is not enabled in EMR by default, but enabling it is easy.

We just need to add the delta-core jar to the Spark jars. We can add it manually, or more easily with a custom bootstrap script; a sample script is given below. Upload the delta-core jar to an S3 bucket, and the script will download it into the Spark jars folder. The delta-core jar can be downloaded from the Maven repository, or you can build it yourself; the source code is available on GitHub.

Adding this as a bootstrap action will perform the copy automatically while the cluster is being provisioned. Keep the below script in an S3 location and pass it as a bootstrap script.

copydeltajar.sh

#!/bin/bash

# Copy the delta-core jar from S3 into the Spark jars directory on the node
aws s3 cp s3://mybucket/delta/delta-core_2.11-0.4.0.jar /usr/lib/spark/jars/

You can launch the cluster either from the AWS web console or with the AWS CLI.

aws emr create-cluster --name "Test cluster" --release-label emr-5.25.0 \
--use-default-roles --ec2-attributes KeyName=myDeltaKey \
--applications Name=Hive Name=Spark \
--instance-count 3 --instance-type m5.xlarge \
--bootstrap-actions Path="s3://mybucket/bootstrap/copydeltajar.sh"
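
Once the cluster is up, a quick way to confirm the jar was picked up is to SSH to the master node, open a pyspark shell, and round-trip a tiny table in Delta format; the HDFS path here is just an example:

# In a pyspark shell on the master node
spark.range(0, 5).write.format("delta").save("/tmp/delta-check")
spark.read.format("delta").load("/tmp/delta-check").show()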