Simple PDF to Text conversion

Here is a small program that extracts the text from simple PDF files and prints it. This code does not handle images or tables inside the PDF.
My aim in writing this simple code is not to explain PDF parsing in general, but to set up a use case in Hadoop MapReduce programming. My next post will build on this with Hadoop. 🙂 🙂

You can extend this code with more features. I used the Apache PDFBox library for the PDF-to-text parsing.

Visit http://pdfbox.apache.org/ for more details.

package com.amal.pdf;

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfReader {

	public static void main(String[] args) throws IOException {

		File file = new File("TestPdf.pdf");

		// Load the PDF document from disk
		PDDocument pdf = PDDocument.load(file);
		try {
			// Extract the plain text of all pages
			PDFTextStripper stripper = new PDFTextStripper();
			String parsedText = stripper.getText(pdf);
			System.out.println(parsedText);
		} finally {
			// Release the resources held by the document
			pdf.close();
		}
	}
}
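
The program above only prints the extracted text to the console. If you want an actual text file, the remaining step might look like the rough sketch below. It uses only the standard Java library. TextFileWriter, textPathFor and writeText are names I made up for illustration, they are not part of PDFBox, and the sample string in main is just a stand-in for the value returned by stripper.getText(pdf).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper (not part of PDFBox): saves already-extracted text to a .txt file.
class TextFileWriter {

	// Derives "name.txt" from "name.pdf"; appends ".txt" if there is no .pdf extension.
	static Path textPathFor(Path pdfPath) {
		String name = pdfPath.getFileName().toString();
		String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
				? name.substring(0, name.length() - 4)
				: name;
		return pdfPath.resolveSibling(base + ".txt");
	}

	// Writes the text as UTF-8 next to the PDF, creating or overwriting the target file.
	static Path writeText(Path pdfPath, String parsedText) throws IOException {
		Path target = textPathFor(pdfPath);
		Files.write(target, parsedText.getBytes(StandardCharsets.UTF_8));
		return target;
	}

	public static void main(String[] args) throws IOException {
		// Stand-in for the string returned by stripper.getText(pdf)
		String parsedText = "Sample text extracted from a PDF";
		Path written = writeText(Paths.get("TestPdf.pdf"), parsedText);
		System.out.println("Wrote " + written);
	}
}
```

In PdfReader, the System.out.println(parsedText) line could then be replaced with a call such as TextFileWriter.writeText(Paths.get("TestPdf.pdf"), parsedText). I chose UTF-8 explicitly so the output does not depend on the platform default encoding.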
