Custom Text Input Format Record Delimiter for Hadoop

By default, a MapReduce program accepts text files and reads them line by line. Technically speaking, the default input format is TextInputFormat and the default record delimiter is '\n' (new line).

In several cases, we need to override this property. For example, suppose you have a large text file and you want each read to return the contents between '.' characters.

In this case, using the default delimiter will be difficult.

For example, if you have a file like this:


Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.

And we want each record to be:


1) Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours

2) Most varieties contain sugar, although some are made with other sweeteners

3) In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients

4) The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming

5) The result is a smoothly textured semi-solid foam that is malleable and can be scooped

We can do this by overriding one property: textinputformat.record.delimiter

We can either set this property in the driver class or change the value of the delimiter in the TextInputFormat class.

The first method is the easier way.

Setting textinputformat.record.delimiter in the Driver class

The format for setting it in the program (Driver class) is:


conf.set("textinputformat.record.delimiter", "delimiter")

The value you are setting by this method is ultimately going into the TextInputFormat class. This is explained below.
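As an illustration, here is a minimal, self-contained driver sketch; the class name DelimiterDriver, the mapper, the job name and the paths are placeholders and not part of the original post. It sets the delimiter to '.' before the Job object is created (this matters, because Job takes a copy of the configuration at construction time) and runs a map-only job that simply writes out each record.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DelimiterDriver {

  // Map-only job: each incoming value is one record, as split by the custom delimiter.
  public static class RecordMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Set the delimiter before creating the Job, because Job copies the configuration.
    conf.set("textinputformat.record.delimiter", ".");

    Job job = new Job(conf);
    job.setJarByClass(DelimiterDriver.class);
    job.setJobName("Custom Record Delimiter");

    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(RecordMapper.class);
    job.setNumReduceTasks(0); // map-only, just to inspect the records

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each output line should then contain one of the sentences shown above (possibly with leading whitespace left over from the split).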

Editing the TextInputFormat class


Default TextInputFormat class

public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
// By default, textinputformat.record.delimiter = '\n' (set in the configuration)
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

Edited TextInputFormat class


public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {

// Hardcoding this value as "."
// You can use any delimiter as per your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader(recordDelimiterBytes);
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}
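If you prefer not to modify Hadoop's own source, the same edited code can live in your own project as a separate input format class that you register in the driver. Below is a sketch of that variant; the class name DotDelimitedInputFormat is only an assumed name for illustration.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class DotDelimitedInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Every '.' ends a record instead of every newline.
    return new LineRecordReader(".".getBytes());
  }

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Same behaviour as the default TextInputFormat: do not split compressed files.
    CompressionCodec codec =
        new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
  }
}

In the driver, you would then register it with job.setInputFormatClass(DotDelimitedInputFormat.class); instead of the default TextInputFormat.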


11 Responses to Custom Text Input Format Record Delimiter for Hadoop

  1. Dev says:

    Hi,

    Not sure what I may be doing wrong but this code does not work.

    My input looks like

    improvement is a relative word|en
    multiline
    improvement|en

    ——-
    I am expecting that by setting the delimiter to |en, the mapper should read these as 2 records and not 3. But the output is always 3 records.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DelimTest {

        public static class WordCountMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable one = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, one);
            }
        }

        public static void main(String[] args) throws Exception {

            // Create a new job
            Configuration conf = new Configuration();
            conf.set("textinputformat.record.delimiter", "|en");
            Job job = new Job(conf);

            // Set job name to locate it in the distributed environment
            job.setJarByClass(DelimTest.class);
            job.setJobName("Word Count");

            // Set input and output Path, note that we use the default input format
            // which is TextInputFormat (each record is a line of input)
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Set Mapper class
            job.setMapperClass(WordCountMapper.class);
            job.setInputFormatClass(TextInputFormat.class);

            // Set output key and value
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

  2. Umamahesh says:

    It's a good example, working fine.

  3. Abhishek Agrawal says:

    Hi,

    Below is my driver code.

    public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration(true);
    Job job = new Job(conf);
    job.setJobName("wordcount");
    conf.set("textinputformat.record.delimiter", "\n\n");
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.setInputPaths(job, new Path("/home/abhishek_linuxhadoop/workspace/MR_WordCount/SampleDataSet.txt"));
    FileOutputFormat.setOutputPath(job, new Path("/home/abhishek_linuxhadoop/workspace/MR_WordCount/OUTPUT_Word_Count"));

    job.setJarByClass(WordCount.class);
    job.submit();

    }

    The records in the sample data set are separated by a blank line, hence the \n\n usage.

    I am relatively new to the MapReduce module.

    Can you please help me out here as to where I am going wrong?

    The value in my mapper code, while debugging, is still showing the first line and not the bunch of lines as it should. 😦

    The sample data set is –

    # Full information about Amazon Share the Love products
    Total items: 548552

    Id: 0
    ASIN: 0771044445
    discontinued product

    Id: 1
    ASIN: 0827229534
    title: Patterns of Preaching: A Sermon Sampler
    group: Book
    salesrank: 396585
    similar: 5 0804215715 156101074X 0687023955 0687074231 082721619X
    categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
    reviews: total: 2 downloaded: 2 avg rating: 5
    2000-7-28 cutomer: A2JW67OY8U6HHK rating: 5 votes: 10 helpful: 9
    2003-12-14 cutomer: A2VE83MZF98ITY rating: 5 votes: 6 helpful: 5

    Id: 2
    ASIN: 0738700797
    title: Candlemas: Feast of Flames
    group: Book
    salesrank: 168596
    similar: 5 0738700827 1567184960 1567182836 0738700525 0738700940
    categories: 2
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
    |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
    reviews: total: 12 downloaded: 12 avg rating: 4.5
    2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
    2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
    2002-1-24 cutomer: A13SG9ACZ9O5IM rating: 5 votes: 8 helpful: 8
    2002-1-28 cutomer: A1BDAI6VEYMAZA rating: 5 votes: 4 helpful: 4
    2002-2-6 cutomer: A2P6KAWXJ16234 rating: 4 votes: 16 helpful: 16
    2002-2-14 cutomer: AMACWC3M7PQFR rating: 4 votes: 5 helpful: 5
    2002-3-23 cutomer: A3GO7UV9XX14D8 rating: 4 votes: 6 helpful: 6
    2002-5-23 cutomer: A1GIL64QK68WKL rating: 5 votes: 8 helpful: 8
    2003-2-25 cutomer: AEOBOF2ONQJWV rating: 5 votes: 8 helpful: 5
    2003-11-25 cutomer: A3IGHTES8ME05L rating: 5 votes: 5 helpful: 5
    2004-2-11 cutomer: A1CP26N8RHYVVO rating: 1 votes: 13 helpful: 9
    2005-2-7 cutomer: ANEIANH0WAT9D rating: 5 votes: 1 helpful: 1

    Id: 3
    ASIN: 0486287785
    blah blah blah

    I want to get the records from one Id to the next Id in the mapper function, one at a time.

  4. Ceyhun Karimov says:

    Thank you for your explanation. My question is: is there a way to keep the delimiter in the records as well?

  5. SMjee says:

    Hello, I am using your input text and coded the problem as per your suggestion. However, I am still not getting each sentence as an output from the map task. Here is my code:

    package aamend;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    /**
    * Created by antoine on 05/06/14.
    */
    public class Delimeter {

    public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "delimiter");
    Job job = new Job(conf, "Delimiter");
    job.setJarByClass(Delimeter.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(DelimiterMapper.class);
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public class TextInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
    TaskAttemptContext context) {

    // Hardcoding this value as "."
    // You can add any delimiter as your requirement

    String delimiter = ".";
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
    recordDelimiterBytes = delimiter.getBytes();
    return new LineRecordReader();
    }
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec =
    new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
    return codec == null;
    }
    }

    public static class DelimiterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final Text TEXT = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

    // Should be a 2 lines key value
    TEXT.set("{" + value.toString() + "}");
    context.write(NullWritable.get(), TEXT);
    }
    }

    }

    And my output looks like:

    {Ice cream (derived from earlier iced cream or cream ice) is a frozen dessert usually made from dairy products, such as milk and cream and often combined with fruits or other ingredients and flavours. Most varieties contain sugar, although some are made with other sweeteners. In some cases, artificial flavourings and colourings are used in addition to, or instead of, the natural ingredients. The mixture of chosen ingredients is stirred slowly while cooling, in order to incorporate air and to prevent large ice crystals from forming. The result is a smoothly textured semi-solid foam that is malleable and can be scooped.}

    Please suggest!

    • amalgjose says:

      Hi,
      I have suggested two methods in my post, and you have added both of the methods in your program.
      You can either follow the first approach or use the second approach.

      conf.set("textinputformat.record.delimiter", "delimiter");

      The delimiter mentioned here should be replaced by your delimiter. In your case, it will be '.'.
      So it will look like this:
      conf.set("textinputformat.record.delimiter", ".");

      This is one method. Editing the TextInputFormat class is the 2nd method.
      No need to use both. Use either of these. In your case, using
      conf.set("textinputformat.record.delimiter", ".");
      will be fine. Sorry for the delay in response.

  6. Nejav says:

    May I use wild-card characters to match a pattern as the delimiter in the method conf.set("textinputformat.record.delimiter", "delimiter")? I have a text file which contains chapters as CHAPTER (chapter number in roman) (chapter title). I want this entire line as the delimiter; how can I do that?

  7. K C says:

    Hi Amal,
    Thanks for the post. I was able to read the file with a custom record delimiter. Is there a way to write a file with a custom record delimiter?

    Your help is much appreciated.
    Thanks

  8. Ravi says:

    hey,

    Is it necessary to set the input format of the job with job.setInputFormatClass(TextInputFormat.class) after setting the delimiter with conf.set("textinputformat.record.delimiter", "—END—")?

    I followed the first approach. The mapper program is still getting input line by line. Any help is appreciated, thanks.
