Find the file name corresponding to a record in hive

Every table in hive has two virtual columns. They are

  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE

INPUT__FILE__NAME give the name of the file.

BLOCK__OFFSET__INSIDE__FILE is the current global file position.

Suppose if we want to find the name of the file corresponding to each record in a file. We can use the INPUT__FILE__NAME column. This feature is available from hive versions above 0.8. A small example is given below.

Table Customer Details

create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';

Sample data set

datafile1.csv

amal,9876543210
sreesankar,8976543210
sahad,7896543210

datafile2.csv

sivaram76896543210
rupak,7568943210
venkatesh,9087654321

Query

select INPUT__FILE__NAME, name from customer_data;

This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.

select distinct(INPUT__FILE__NAME) from customer_data;
Advertisements

About amalgjose
I am an Electrical Engineer by qualification, now I am working as a Software Engineer. I am very much interested in Electrical, Electronics, Mechanical and now in Software fields. I like exploring things in these fields. I like travelling, long drives and very much addicted to music.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: