Custom Text Input Format Record Delimiter for Hadoop

By default, a MapReduce program accepts text files and reads them line by line. Technically speaking, the default input format is TextInputFormat and the default record delimiter is '\n' (new line).

In several cases, we need to override this property. For example, suppose you have a large text file and you want each read to return the contents between '.' characters.

In this case, using the default delimiter will be difficult.

For example, if you have a file like this:


Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.

And we want each record to be:


1) Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours

2) Most varieties contain sugar, although some are made with other sweeteners

3) In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients

4) The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming

5) The result is a smoothly textured semi-solid foam that is malleable and can be scooped

We can do this by overriding one property: textinputformat.record.delimiter

We can either set this property in the driver class or change the value of the delimiter in the TextInputFormat class.

The first method is the easier way.

Setting textinputformat.record.delimiter in the Driver class

The format for setting it in the program (Driver class) is:


conf.set("textinputformat.record.delimiter", "delimiter")

The value you are setting by this method is ultimately going into the TextInputFormat class. This is explained below.
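As an illustration, here is a minimal, self-contained driver sketch; the class name DelimiterDriver, the mapper, the job name and the paths are placeholders and not part of the original post. It sets the delimiter to '.' before the Job object is created (this matters, because Job takes a copy of the configuration at construction time) and runs a map-only job that simply writes out each record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DelimiterDriver {

  // Map-only job: each incoming value is one record, as split by the custom delimiter.
  public static class RecordMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Set the delimiter before creating the Job, because Job copies the configuration.
    conf.set("textinputformat.record.delimiter", ".");

    Job job = new Job(conf);
    job.setJarByClass(DelimiterDriver.class);
    job.setJobName("Custom Record Delimiter");

    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(RecordMapper.class);
    job.setNumReduceTasks(0); // map-only, just to inspect the records

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each output line should then contain one of the sentences shown above (possibly with leading whitespace left over from the split).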

Editing the TextInputFormat class


Default TextInputFormat class

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
// By default, textinputformat.record.delimiter = '\n' (set in the configuration)
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

Edited TextInputFormat class


public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {

// Hardcoding this value as "."
// You can use any delimiter as per your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
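If you prefer not to modify Hadoop's own source, the same edited code can live in your own project as a separate input format class that you register in the driver. Below is a sketch of that variant; the class name DotDelimitedInputFormat is only an assumed name for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DotDelimitedInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Every '.' ends a record instead of every newline.
    return new LineRecordReader(".".getBytes());
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Same behaviour as the default TextInputFormat: do not split compressed files.
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

In the driver, you would then register it with job.setInputFormatClass(DotDelimitedInputFormat.class); instead of the default TextInputFormat.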


11 Responses to Custom Text Input Format Record Delimiter for Hadoop

  1. Dev says:

    Hi,

    Not sure what I may be doing wrong but this code does not work.

    My input looks like

    improvement is a relative word|en
    multiline
    improvement|en

    ——-
    I am expecting that by setting the delimiter to |en, the mapper should read these as 2 records and not 3. But the output is always 3 records.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DelimTest {

        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable one = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, one);
            }
        }

        public static void main(String[] args) throws Exception {

            // Create a new job
            Configuration conf = new Configuration();
            conf.set("textinputformat.record.delimiter", "|en");
            Job job = new Job(conf);

            // Set job name to locate it in the distributed environment
            job.setJarByClass(DelimTest.class);
            job.setJobName("Word Count");

            // Set input and output Path, note that we use the default input format
            // which is TextInputFormat (each record is a line of input)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Set Mapper class
            job.setMapperClass(WordCountMapper.class);
            job.setInputFormatClass(TextInputFormat.class);

            // Set output key and value
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  2. Umamahesh says:

    It's a good example, working fine.

  3. Abhishek Agrawal says:

    Hi,

    Below is my driver code.

    public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration(true);
    Job job = new Job(conf);
    job.setJobName("wordcount");
    conf.set("textinputformat.record.delimiter", "\n\n");
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path("/home/abhishek_linuxhadoop/workspace/MR_WordCount/SampleDataSet.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/home/abhishek_linuxhadoop/workspace/MR_WordCount/OUTPUT_Word_Count"));

    job.setJarByClass(WordCount.class);
    job.submit();

    }

    The records in the sample data set are separated by a blank line, hence the \n\n usage.

    I am relatively new to the MapReduce module.

    Can you please help me out here as to where I am going wrong?

    The value in my mapper code, while debugging, is still showing the first line and not the bunch of lines as it should. 😦

    The sample data set is –

    # Full information about Amazon Share the Love products
    Total items: 548552

    Id: 0
    ASIN: 0771044445
    discontinued product

    Id: 1
    ASIN: 0827229534
    title: Patterns of Preaching: A Sermon Sampler
    group: Book
    salesrank: 396585
    similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
    categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
    reviews: total: 2 downloaded: 2 avg rating: 5
    2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
    2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5

    Id: 2
    ASIN: 0738700797
    title: Candlemas: Feast of Flames
    group: Book
    salesrank: 168596
    similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
    categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
    reviews: total: 12 downloaded: 12 avg rating: 4.5
    2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
    2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
    2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
    2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
    2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
    2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
    2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
    2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
    2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
    2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
    2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
    2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1

    Id: 3
    ASIN: 0486287785
    blah blah blah

    I want to get the records from one Id to the next Id in the mapper function, one at a time.

  4. Ceyhun Karimov says:

    Thank you for your explanation. My question is: is there a way to keep the delimiter in the records as well?

  5. SMjee says:

    Hello, I am using your input text and coded the problem as per your suggestion. However, I am still not getting each sentence as an output from the map task. Here is my code:

    package aamend;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    /**
    * Created by antoine on 05/06/14.
    */
    public class Delimeter {

    public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "delimiter");
    Job job = new Job(conf, "Delimiter");
    job.setJarByClass(Delimeter.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(DelimiterMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
    TaskAttemptContext context) {

    // Hardcoding this value as "."
    // You can add any delimiter as your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
    recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader();
    }
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
    new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
    }
    }

    public static class DelimiterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text TEXT = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

    // Should be a 2 lines key value
    TEXT.set("{" + value.toString() + "}");
    context.write(NullWritable.get(), TEXT);
    }
    }

    }

    And my output looks like:

    {Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.}

    Please suggest!

    • amalgjose says:

      Hi,
      I have suggested two methods in my post, and you have added both of the methods in your program.
      You can either follow the first approach or use the second approach.

      conf.set("textinputformat.record.delimiter", "delimiter");

      The delimiter mentioned here should be replaced by your delimiter. In your case, it will be '.'.
      So it will look like this:
      conf.set("textinputformat.record.delimiter", ".");

      This is one method. Editing the TextInputFormat class is the 2nd method.
      No need to use both. Use either of these. In your case, using
      conf.set("textinputformat.record.delimiter", ".");
      will be fine. Sorry for the delay in response.

  6. Nejav says:

    May I use wild-card characters to match a pattern as the delimiter in the method conf.set("textinputformat.record.delimiter", "delimiter")? I have a text file which contains chapters as CHAPTER (chapter number in roman) (chapter title). I want this entire line as the delimiter; how can I do that?

  7. K C says:

    Hi Amal,
    Thanks for the post. I was able to read the file with a custom record delimiter. Is there a way to write a file with a custom record delimiter?

    Your help is much appreciated.
    Thanks

  8. Ravi says:

    hey,

    Is it necessary to set the input format of the job with job.setInputFormatClass(TextInputFormat.class) after setting the delimiter with conf.set("textinputformat.record.delimiter", "—END—")?

    I followed the first approach. The mapper program is still getting input line by line. Any help is appreciated, thanks.
