Custom Text Input Format Record Delimiter for Hadoop

By default, a MapReduce program reads a text file one line at a time. Technically speaking, the default input format is TextInputFormat and the default record delimiter is the newline character ('\n').

In several cases we need to override this behaviour. For example, you may have a large text file and want each read to return the text between two full stops ('.') instead of a single line.

Using the default delimiter makes this difficult.

For example, suppose you have a file like this:


Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.

We want each sentence to arrive as a separate record:


1) Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours

2) Most varieties contain sugar, although some are made with other sweeteners

3) In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients

4) The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming

5) The result is a smoothly textured semi-solid foam that is malleable and can be scooped
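
To make the goal concrete, here is a small plain-Java sketch (no Hadoop involved) of the splitting we are after. The class name DelimiterSplitSketch and the method splitRecords are made up for this illustration, and the sample text is a shortened version of the paragraph above. It only mimics the idea of cutting text at a literal delimiter; it is not how Hadoop implements it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: cuts a string into records at a literal delimiter.
class DelimiterSplitSketch {

    // Returns the text between consecutive occurrences of the delimiter.
    // The delimiter itself is dropped; a non-empty tail after the last delimiter is kept.
    static List<String> splitRecords(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            records.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < text.length()) {
            records.add(text.substring(start));
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened sample text, used only for this illustration
        String text = "Most varieties contain sugar. In some cases, colourings are used. "
                + "The result is a foam.";
        for (String record : splitRecords(text, ".")) {
            // trim() removes the space that followed each full stop in the input
            System.out.println("[" + record.trim() + "]");
        }
    }
}
```

Note that the split is purely literal: the space after each full stop stays at the start of the next record unless you trim it, and a '.' inside an abbreviation or a number would also end a record.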

We can do this by overriding a single property, textinputformat.record.delimiter.

We can either set this property in the driver class or change the delimiter value inside the TextInputFormat class itself.

The first method is the easiest.

Setting textinputformat.record.delimiter in the driver class

The format for setting it in the program (driver class) is


conf.set("textinputformat.record.delimiter", "delimiter");

Replace "delimiter" with the delimiter you actually need; for the example above it is ".". The value set this way is ultimately read inside the TextInputFormat class, as explained below.

Editing the TextInputFormat class

Default TextInputFormat class

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    // Unless textinputformat.record.delimiter is set, delimiter is null here
    // and LineRecordReader splits on line endings ('\n')
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
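
To see what the recordDelimiterBytes handed to LineRecordReader are used for, here is a simplified, Hadoop-free sketch of a reader that cuts a byte array wherever the delimiter bytes occur. The names DelimitedRecordSketch, readRecords and matchesAt are hypothetical, and treating a null delimiter as a plain "\n" is an assumption made to keep the sketch short. The real reader works on streams and input splits, which this does not attempt to model.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a record reader; it is not Hadoop's LineRecordReader.
class DelimitedRecordSketch {

    // Cuts data wherever the delimiter bytes occur and returns the pieces as strings.
    // Assumption for this sketch: a null delimiter falls back to "\n".
    static List<String> readRecords(byte[] data, byte[] recordDelimiterBytes) {
        byte[] delim = (recordDelimiterBytes != null)
                ? recordDelimiterBytes
                : "\n".getBytes(StandardCharsets.UTF_8);
        if (delim.length == 0) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> records = new ArrayList<>();
        int start = 0; // first byte of the record being collected
        int i = 0;     // current scan position
        while (i + delim.length <= data.length) {
            if (matchesAt(data, i, delim)) {
                records.add(new String(data, start, i - start, StandardCharsets.UTF_8));
                i += delim.length; // skip the delimiter, it is not part of any record
                start = i;
            } else {
                i++;
            }
        }
        if (start < data.length) {
            // Bytes after the last delimiter form a final record
            records.add(new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    // True if the delimiter bytes appear in data starting at offset.
    private static boolean matchesAt(byte[] data, int offset, byte[] delim) {
        for (int k = 0; k < delim.length; k++) {
            if (data[offset + k] != delim[k]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "first|en\nsecond|en".getBytes(StandardCharsets.UTF_8);
        byte[] custom = "|en".getBytes(StandardCharsets.UTF_8);
        // Print each record with newlines made visible
        for (String record : readRecords(data, custom)) {
            System.out.println("custom:  [" + record.replace("\n", "\\n") + "]");
        }
        for (String record : readRecords(data, null)) {
            System.out.println("default: [" + record.replace("\n", "\\n") + "]");
        }
    }
}
```

One thing the sketch makes visible: once a custom delimiter is in use, the newline is no longer special, so it stays inside whichever record it happens to fall in.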

Edited TextInputFormat class


public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {

    // Hard-coding the delimiter as "."
    // Use whichever delimiter your data requires

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

About amalgjose
I am an Electrical Engineer by qualification, now I am working as a Software Engineer. I am very much interested in Electrical, Electronics, Mechanical and now in Software fields. I like exploring things in these fields. I like travelling, long drives and very much addicted to music.

7 Responses to Custom Text Input Format Record Delimiter for Hadoop

  1. Dev says:

    Hi,

    Not sure what I may be doing wrong but this code does not work.

    My input looks like

    improvement is a relative word|en
    multiline
    improvement|en

    ——-
    I am expecting that, by setting the delimiter to |en, the mapper should read these as 2 records and not 3. But the output is always 3 records.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DelimTest {
    public static class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
    context.write(value, one);
    }
    }
    public static void main(String[] args) throws Exception {

    // Create a new job

    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "|en");
    Job job = new Job(conf);

    // Set job name to locate it in the distributed environment
    job.setJarByClass(DelimTest.class);
    job.setJobName("Word Count");

    // Set input and output Path, note that we use the default input format
    // which is TextInputFormat (each record is a line of input)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Set Mapper and Reducer class
    job.setMapperClass(WordCountMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    // Set Output key and value
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
    }

  2. Umamahesh says:

    It's a good example, working fine.

  3. Ceyhun Karimov says:

    Thank you for your explanation. My question is: is there a way to keep the delimiter in the records as well?

  4. SMjee says:

    Hello, I am using your input text and coded the problem as per your suggestion. However, I am still not getting each sentence as an output from the map task. Here is my code:

    package aamend;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    /**
    * Created by antoine on 05/06/14.
    */
    public class Delimeter {

    public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "delimiter");
    Job job = new Job(conf, "Delimiter");
    job.setJarByClass(Delimeter.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(DelimiterMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
    TaskAttemptContext context) {

    // Hardcoding this value as "."
    // You can add any delimiter as your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
    recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
    new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
    }
    }

    public static class DelimiterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text TEXT = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

    // Should be a 2 lines key value
    TEXT.set("{" + value.toString() + "}");
    context.write(NullWritable.get(), TEXT);
    }
    }

    }

    And my output looks like:

    {Ice cream (derived from earlier iced cream or cream ice) is a frozen title=”Dessert” usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.}

    Please suggest!

    • amalgjose says:

      Hi,
      I suggested two methods in my post, and you have added both of them to your program.
      You can follow either the first approach or the second.

      conf.set("textinputformat.record.delimiter", "delimiter");

      The word "delimiter" here should be replaced by your actual delimiter. In your case, that is '.'.
      So it will look like this:
      conf.set("textinputformat.record.delimiter", ".");

      This is one method. Editing the TextInputFormat class is the second method.
      There is no need to use both; use either one. In your case, using
      conf.set("textinputformat.record.delimiter", ".");
      will be fine. Sorry for the delay in responding.
