Find the file name corresponding to a record in Hive

Every table in Hive has two virtual columns:

  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE

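INPUT__FILE__NAME gives the name of the input file, reported with its full path, from which the record was read.

BLOCK__OFFSET__INSIDE__FILE is the current global file position. For a block-compressed file, it is the file offset of the current block's first byte.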
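For an uncompressed text file, the offset is generally the byte position at which the record's line starts. The sketch below is only a local illustration of that idea using awk; it is not something Hive runs. The helper name line_offsets and the demo data are made up for the example.

```shell
#!/bin/sh
# Illustration only: print the starting byte offset of every line in a
# text file, followed by the line itself.
# line_offsets is a hypothetical helper name, not part of Hive.
line_offsets() {
    # LC_ALL=C makes length() count bytes rather than characters.
    LC_ALL=C awk 'BEGIN { offset = 0 }
        { printf "%d\t%s\n", offset, $0; offset += length($0) + 1 }' "$1"
}

# Demo with a throwaway two-line file (made-up data).
demo=$(mktemp)
printf 'a,1\nbb,22\n' > "$demo"
line_offsets "$demo"
rm -f "$demo"
```

In the demo, the second line starts at offset 4 because the first line occupies three bytes plus one newline.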

To find the name of the file that each record in a table came from, use the INPUT__FILE__NAME column. This feature is available in Hive 0.8.0 and later. A small example is given below.

Table Customer Details

create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';

Sample data set

datafile1.csv

amal,9876543210
sreesankar,8976543210
sahad,7896543210

datafile2.csv

sivaram,76896543210
rupak,7568943210
venkatesh,9087654321

Query

select INPUT__FILE__NAME, name from customer_details;

This returns the file name corresponding to each record. To list the files that make up a Hive table, use the query below.

select distinct(INPUT__FILE__NAME) from customer_details;
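The example assumes the two data files have already been loaded into the table. A minimal sketch of that step follows, assuming the files sit on the local filesystem of the machine where the hive command runs. The helper name make_hql is hypothetical; it only prints HiveQL and does not call Hive itself.

```shell
#!/bin/sh
# Sketch: print a HiveQL script that loads each given local file into a
# table and then lists the source file of every record.
# make_hql is a hypothetical helper name; the SELECT assumes the table
# has a column called "name", as customer_details does.
make_hql() {
    table=$1
    shift
    for f in "$@"; do
        printf "LOAD DATA LOCAL INPATH '%s' INTO TABLE %s;\n" "$f" "$table"
    done
    printf 'SELECT INPUT__FILE__NAME, name FROM %s;\n' "$table"
}

make_hql customer_details datafile1.csv datafile2.csv
```

Redirecting the output to a file such as load_and_query.hql and running it with hive -f load_and_query.hql should load both files and then show which file each name came from.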

Utility to get the complete details of a Linux system

This is a small shell script that captures almost all the necessary details of a Linux system. I tested this script on CentOS and Red Hat operating systems. You can access this script directly from GitHub.

How to add the EPEL repository in Linux?

Linux is my favourite operating system. I like Windows for multimedia activities, but for work and experiments I prefer Linux. It gives us the flexibility to perform almost any operation, and it is a vast ocean to explore. Most of us have heard of EPEL, and many of us download a lot of packages from it.

But does everyone know what EPEL is?
EPEL stands for Extra Packages for Enterprise Linux. It is an open source repository maintained by the community that contains many useful software packages for Red Hat, CentOS and Scientific Linux. Packages for almost every need can be found in this repository.

  • The EPEL repository is 100% open source and free to use.
  • No extra effort is required to install these packages.
  • Version-specific packages are available for each OS version, so they do not conflict with the existing packages in the OS.
  • Packages can be installed simply with yum.

By default, the EPEL repository is not configured on a Linux system, so we have to add it explicitly. This is done by downloading the epel-release package and installing it as an rpm, which adds EPEL to the list of repositories. The following steps show how to add the EPEL repository to a CentOS/Red Hat machine.

RHEL/CentOS 7 64-Bit

## RHEL/CentOS 7 64-Bit ##
# wget http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
# rpm -ivh epel-release-7-5.noarch.rpm

RHEL/CentOS 6 32-Bit

## RHEL/CentOS 6 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

RHEL/CentOS 6 64-Bit

## RHEL/CentOS 6 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
# rpm -ivh epel-release-6-8.noarch.rpm

RHEL/CentOS 5 32-Bit

## RHEL/CentOS 5 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm

RHEL/CentOS 5 64-Bit

## RHEL/CentOS 5 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
# rpm -ivh epel-release-5-4.noarch.rpm

RHEL/CentOS 4 32-Bit

## RHEL/CentOS 4 32-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/i386/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm

RHEL/CentOS 4 64-Bit

## RHEL/CentOS 4 64-Bit ##
# wget http://download.fedoraproject.org/pub/epel/4/x86_64/epel-release-4-10.noarch.rpm
# rpm -ivh epel-release-4-10.noarch.rpm
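
The per-release commands above differ only in the download URL, so the choice can be sketched as a small lookup. The function name epel_rpm_url is hypothetical, and the URLs and package versions are simply the ones listed above; they may since have been superseded on the mirror.

```shell
#!/bin/sh
# Sketch: print the epel-release URL listed above for a given
# RHEL/CentOS major version and architecture (i386 or x86_64).
# epel_rpm_url is a hypothetical helper name.
epel_rpm_url() {
    major=$1
    arch=$2
    case "$major/$arch" in
        7/x86_64)
            echo "http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm" ;;
        6/i386|6/x86_64)
            echo "http://download.fedoraproject.org/pub/epel/6/$arch/epel-release-6-8.noarch.rpm" ;;
        5/i386|5/x86_64)
            echo "http://download.fedoraproject.org/pub/epel/5/$arch/epel-release-5-4.noarch.rpm" ;;
        4/i386|4/x86_64)
            echo "http://download.fedoraproject.org/pub/epel/4/$arch/epel-release-4-10.noarch.rpm" ;;
        *)
            # Anything not listed in the article is reported as an error.
            echo "no EPEL package listed for release $major on $arch" >&2
            return 1 ;;
    esac
}

epel_rpm_url 6 x86_64
```

The printed URL can then be passed to wget and the downloaded file to rpm -ivh, as in the steps above. The major version can be read from /etc/redhat-release. Note that uname -m usually reports i686 rather than i386 on 32-bit systems, so it would need mapping before being passed in.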