Find the file name corresponding to a record in hive

Every table in hive has two virtual columns. They are

  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE

INPUT__FILE__NAME give the name of the file.

BLOCK__OFFSET__INSIDE__FILE is the current global file position.

Suppose if we want to find the name of the file corresponding to each record in a file. We can use the INPUT__FILE__NAME column. This feature is available from hive versions above 0.8. A small example is given below.

Table Customer Details

create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';

Sample data set

datafile1.csv

amal,9876543210
sreesankar,8976543210
sahad,7896543210

datafile2.csv

sivaram76896543210
rupak,7568943210
venkatesh,9087654321

Query

select INPUT__FILE__NAME, name from customer_data;

This will give us the file name corresponding to each record. If you want to get the file names corresponding to a hive table, the below query will help you.

select distinct(INPUT__FILE__NAME) from customer_data;

Custom Text Input Format Record Delimiter for Hadoop

By default mapreduce program accepts text file and it reads line by line. Technically speaking the default input format is text input format and the default delimiter is ‘/n’ (new line).

In several cases, we need to override this property. For example if you have a large text file and you want to read the contents between ‘.’  in each read.

In this case, using the default delimiter will be difficult.

For example.

If you have a file like this


Ice cream (derived from earlier iced cream or cream ice) is a frozen  title="Dessert"  usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.

And we want each record as


1) Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours

2) Most varieties contain sugar, although some are made with other sweeteners

3) In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients

4) The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming

5) The result is a smoothly textured semi-solid foam that is malleable and can be scooped

This we can do it by overriding one property textinputformat.record.delimiter

We can either set this property in the driver class or just changing the value of delimiter in the TextInputFormat class.

The first method is the easiest way.

Setting the textinputformat.record.delimiter in Driver class

The format for setting it in the program (Driver class)  is


conf.set(“textinputformat.record.delimiter”, “delimiter”)

The value you are setting by this method is ultimately going into the TextInputFormat class. This is explained below.

Editting the TextInputFormat class

.

Default TextInputFormat class

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
// By default,textinputformat.record.delimiter = ‘/n’(Set in configuration file)
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

Editted TextInputFormat class


public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {

// Hardcoding this value as “.”
// You can add any delimiter as your requirement

    String delimiter = “.”;
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}