Simple PDF to Text conversion

In this post I write a small program that converts simple PDF files to text. The code does not handle images or tables inside the PDF.
My aim here is not to explain PDF parsing in general, but to set up a use case in Hadoop MapReduce programming. My next post will build on this with Hadoop. 🙂

You can extend this code with more features. Here I use the Apache PDFBox library for PDF-to-text parsing.

Visit the Apache PDFBox website for more details.

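package com.amal.pdf;

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
// PDFBox 1.8.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper.
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfReader {

	public static void main(String[] args) throws IOException {

		File file = new File("TestPdf.pdf");

		// Load the PDF and extract its plain text.
		PDDocument pdf = PDDocument.load(file);
		try {
			PDFTextStripper stripper = new PDFTextStripper();
			String parsedText = stripper.getText(pdf);
			// The original listing ends at getText; printing the result is an assumed final step.
			System.out.println(parsedText);
		} finally {
			// Release the resources held by the document.
			pdf.close();
		}
	}
}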
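
The listing above leaves the extracted text in the parsedText string. If you want an actual text file, one possible approach is to write that string out with the standard java.nio.file API. The following is only a minimal sketch: the TextFileWriter class, the writeText method and the TestPdf.txt output name are hypothetical names chosen for illustration, and the sample string stands in for the value returned by stripper.getText(pdf).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post or of PDFBox.
class TextFileWriter {

	// Writes the given text to the target file as UTF-8,
	// creating the file or replacing its existing content.
	static Path writeText(String text, Path target) throws IOException {
		return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
	}

	public static void main(String[] args) throws IOException {
		// Stand-in for the string returned by stripper.getText(pdf).
		String parsedText = "Sample text extracted from a PDF";
		// "TestPdf.txt" is an assumed output name mirroring the input file name.
		Path out = writeText(parsedText, Paths.get("TestPdf.txt"));
		System.out.println("Wrote " + out.toAbsolutePath());
	}
}
```

In the PDF program you would pass parsedText to writeText right after the getText call. UTF-8 is used here as a reasonable default; pick another charset if your downstream tools expect one.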

About amalgjose
I am an Electrical Engineer by qualification and now work as a Software Architect. I am very interested in the electrical, electronics and mechanical fields, and now in software as well. I like exploring things in these fields. I love travelling, long drives and music.

2 Responses to Simple PDF to Text conversion

  1. Pingback: Pdf Input Format implementation for Hadoop Mapreduce | Amal G Jose

  2. G.HARISH says:

    Cannot find PDFTextStripper when I install pdfbox
