Every table in hive has two virtual columns. They are
- INPUT__FILE__NAME
- BLOCK__OFFSET__INSIDE__FILE
INPUT__FILE__NAME give the name of the file.
BLOCK__OFFSET__INSIDE__FILE is the current global file position.
Suppose if we want to find the name of the file corresponding to each record in a file. We can use the INPUT__FILE__NAME column. This feature is available from hive versions above 0.8. A small example is given below.
Table Customer Details
create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';
Sample data set
datafile1.csv
amal,9876543210 sreesankar,8976543210 sahad,7896543210
datafile2.csv
sivaram76896543210 rupak,7568943210 venkatesh,9087654321
Query
select INPUT__FILE__NAME, name from customer_data;
This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.
select distinct(INPUT__FILE__NAME) from customer_data;