PDFTextStripper (PDFBox-0.7.3-dev API)

org.pdfbox.util
Class PDFTextStripper

java.lang.Object
  org.pdfbox.util.PDFStreamEngine
      org.pdfbox.util.PDFTextStripper

Direct Known Subclasses:: PDFHighlighter, PDFText2HTML, PDFTextStripperByArea, PrintTextLocations

public class PDFTextStripper
extends PDFStreamEngine

This class takes a PDF document and extracts all of its text, ignoring the formatting.

Version:: $Revision: 1.62 $
Author:: Ben Litchfield (ben@benlitchfield.com)

Field Summary
`protected Vector`	`charactersByArticle` The charactersByArticle is used to extract text by article divisions.
`protected Writer`	`output` The stream to write the output to.

Constructor Summary
`PDFTextStripper()` Instantiate a new PDFTextStripper object.

Method Summary
`protected void`	`endDocument(PDDocument pdf)` This method is available for subclasses of this class.
`protected void`	`endPage(PDPage page)` End a page.
`protected void`	`endParagraph()` End a paragraph.
`protected void`	`flushText()` This will print the text to the output stream.
`protected List`	`getCharactersByArticle()` Character strings are grouped by articles.
`protected int`	`getCurrentPageNo()` Get the current page number that is being processed.
`PDOutlineItem`	`getEndBookmark()` Get the bookmark where text extraction should end, inclusive.
`int`	`getEndPage()` This will get the last page that will be extracted.
`String`	`getLineSeparator()` This will get the line separator.
`protected Writer`	`getOutput()` The output stream that is being written to.
`String`	`getPageSeparator()` This will get the page separator.
`PDOutlineItem`	`getStartBookmark()` Get the bookmark where text extraction should start, inclusive.
`int`	`getStartPage()` This is the page that the text extraction will start on.
`String`	`getText(COSDocument doc)` Deprecated.
`String`	`getText(PDDocument doc)` This will return the text of a document.
`String`	`getWordSeparator()` This will get the word separator.
`protected void`	`processPage(PDPage page, COSStream content)` This will process the contents of a page.
`protected void`	`processPages(List pages)` This will process all of the pages and the text that is in them.
`void`	`setEndBookmark(PDOutlineItem aEndBookmark)` Set the bookmark where the text extraction should stop.
`void`	`setEndPage(int endPageValue)` This will set the last page to be extracted by this class.
`void`	`setLineSeparator(String separator)` Set the desired line separator for output text.
`void`	`setPageSeparator(String separator)` Set the desired page separator for output text.
`void`	`setShouldSeparateByBeads(boolean aShouldSeparateByBeads)` Set if the text stripper should group the text output by a list of beads.
`void`	`setSortByPosition(boolean newSortByPosition)` The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
`void`	`setStartBookmark(PDOutlineItem aStartBookmark)` Set the bookmark where text extraction should start, inclusive.
`void`	`setStartPage(int startPageValue)` This will set the first page to be extracted by this class.
`void`	`setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)` By default the text stripper will attempt to remove duplicate text that overlaps.
`void`	`setWordSeparator(String separator)` Set the desired word separator for output text.
`boolean`	`shouldSeparateByBeads()` This will tell if the text stripper should separate by beads.
`boolean`	`shouldSortByPosition()` This will tell if the text stripper should sort the text tokens before writing to the stream.
`boolean`	`shouldSuppressDuplicateOverlappingText()`
`protected void`	`showCharacter(TextPosition text)` This will add a character to the list of characters to be printed to the text file.
`protected void`	`startDocument(PDDocument pdf)` This method is available for subclasses of this class.
`protected void`	`startPage(PDPage page)` Start a new page.
`protected void`	`startParagraph()` Start a new paragraph.
`protected void`	`writeCharacters(TextPosition text)` Write the string to the output stream.
`void`	`writeText(COSDocument doc, Writer outputStream)` Deprecated.
`void`	`writeText(PDDocument doc, Writer outputStream)` This will take a PDDocument and write the text of that document to the given writer.

Methods inherited from class org.pdfbox.util.PDFStreamEngine

getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

charactersByArticle

protected Vector charactersByArticle

The charactersByArticle is used to extract text by article divisions. For example, for a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are:

  1. Text before first article
  2. First article text
  3. Text between first article and second article
  4. Second article text
  5. Text after second article

Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
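
The division layout described above can be sketched in plain Java. This is an illustrative model only, not PDFBox code: the class `ArticleDivisionsSketch` and the method `divisionLabels` are hypothetical names, and the general rule of one slot before, between, and after the beads (2n + 1 divisions for n beads) is an assumption extrapolated from the two cases stated above (2 beads give 5 entries, no beads give 1).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; the class and method names are hypothetical.
final class ArticleDivisionsSketch {

    // Labels the divisions for a page with the given number of beads,
    // assuming one slot before, between, and after the beads (2n + 1 total).
    static List<String> divisionLabels(int beadCount) {
        if (beadCount < 0) {
            throw new IllegalArgumentException("beadCount must not be negative");
        }
        List<String> labels = new ArrayList<String>();
        if (beadCount == 0) {
            // No beads: everything falls into a single entry.
            labels.add("all text");
            return labels;
        }
        labels.add("text before article 1");
        for (int i = 1; i <= beadCount; i++) {
            labels.add("article " + i + " text");
            if (i < beadCount) {
                labels.add("text between article " + i + " and article " + (i + 1));
            }
        }
        labels.add("text after article " + beadCount);
        return labels;
    }

    public static void main(String[] args) {
        // Two-column newspaper example: prints the five divisions in order.
        for (String label : divisionLabels(2)) {
            System.out.println(label);
        }
    }
}
```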

output

protected Writer output

The stream to write the output to.

Constructor Detail

PDFTextStripper

public PDFTextStripper()
                throws IOException

Instantiate a new PDFTextStripper object. This object will load properties from Resources/PDFTextStripper.properties.
Throws:: IOException - If there is an error loading the properties.

Method Detail

getText

public String getText(PDDocument doc)
               throws IOException

This will return the text of a document. See writeText.
NOTE: The document must not be encrypted when coming into this method.

Parameters:: doc - The document to get the text from.
Returns:: The text of the PDF document.
Throws:: IOException - if the doc state is invalid or it is encrypted.

getText

public String getText(COSDocument doc)
               throws IOException

Deprecated.

Parameters:: doc - The document to extract the text from.
Returns:: The document text.
Throws:: IOException - If there is an error extracting the text.
See Also:: getText( PDDocument )

writeText

public void writeText(COSDocument doc,
                      Writer outputStream)
               throws IOException

Deprecated.

Parameters:: doc - The document to extract the text.; outputStream - The stream to write the text to.
Throws:: IOException - If there is an error extracting the text.
See Also:: writeText( PDDocument, Writer )

writeText

public void writeText(PDDocument doc,
                      Writer outputStream)
               throws IOException

This will take a PDDocument and write the text of that document to the given writer.

Parameters:: doc - The document to get the data from.; outputStream - The location to put the text.
Throws:: IOException - If the doc is in an invalid state.

processPages

protected void processPages(List pages)
                     throws IOException

This will process all of the pages and the text that is in them.

Parameters:: pages - The pages object in the document.
Throws:: IOException - If there is an error parsing the text.

startDocument

protected void startDocument(PDDocument pdf)
                      throws IOException

This method is available for subclasses of this class. It will be called before processing of the document starts.

Parameters:: pdf - The PDF document that is being processed.
Throws:: IOException - If an IO error occurs.

endDocument

protected void endDocument(PDDocument pdf)
                    throws IOException

This method is available for subclasses of this class. It will be called after processing of the document finishes.

Parameters:: pdf - The PDF document that is being processed.
Throws:: IOException - If an IO error occurs.

processPage

protected void processPage(PDPage page,
                           COSStream content)
                    throws IOException

This will process the contents of a page.

Parameters:: page - The page to process.; content - The contents of the page.
Throws:: IOException - If there is an error processing the page.

startParagraph

protected void startParagraph()
                       throws IOException

Start a new paragraph. Default implementation is to do nothing. Subclasses may provide additional information.

Throws:: IOException - If there is any error writing to the stream.

endParagraph

protected void endParagraph()
                     throws IOException

End a paragraph. Default implementation is to do nothing. Subclasses may provide additional information.

Throws:: IOException - If there is any error writing to the stream.

startPage

protected void startPage(PDPage page)
                  throws IOException

Start a new page. Default implementation is to do nothing. Subclasses may provide additional information.

Parameters:: page - The page we are about to process.
Throws:: IOException - If there is any error writing to the stream.

endPage

protected void endPage(PDPage page)
                throws IOException

End a page. Default implementation is to do nothing. Subclasses may provide additional information.

Parameters:: page - The page that has just been processed.
Throws:: IOException - If there is any error writing to the stream.

flushText

protected void flushText()
                  throws IOException

This will print the text to the output stream.

Throws:: IOException - If there is an error writing the text.

writeCharacters

protected void writeCharacters(TextPosition text)
                        throws IOException

Write the string to the output stream.

Parameters:: text - The text to write to the stream.
Throws:: IOException - If there is an error when writing the text.

showCharacter

protected void showCharacter(TextPosition text)

This will add a character to the list of characters to be printed to the text file.

Overrides:: showCharacter in class PDFStreamEngine

Parameters:: text - The description of the character to display.

getStartPage

public int getStartPage()

This is the page that the text extraction will start on. The pages start at page 1. For example in a 5 page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.

Returns:: Value of property startPage.

setStartPage

public void setStartPage(int startPageValue)

This will set the first page to be extracted by this class.

Parameters:: startPageValue - New value of property startPage.

getEndPage

public int getEndPage()

This will get the last page that will be extracted. This is inclusive: for example, in a 5 page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.

Returns:: Value of property endPage.
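
The 1-based, inclusive page range described for startPage and endPage can be modelled with a few lines of plain Java. This is a sketch of the documented semantics only, not the PDFBox implementation; `PageRangeSketch` and `pagesToExtract` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; the class and method names are hypothetical.
final class PageRangeSketch {

    // Returns the 1-based page numbers that fall inside [startPage, endPage].
    // Looping over the real page count avoids overflow when endPage is
    // Integer.MAX_VALUE (the documented default).
    static List<Integer> pagesToExtract(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<Integer>();
        for (int page = 1; page <= pageCount; page++) {
            if (page >= startPage && page <= endPage) {
                pages.add(page);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Defaults: startPage = 1, endPage = Integer.MAX_VALUE.
        System.out.println(pagesToExtract(5, 1, Integer.MAX_VALUE)); // prints [1, 2, 3, 4, 5]
        System.out.println(pagesToExtract(5, 4, Integer.MAX_VALUE)); // prints [4, 5]
        System.out.println(pagesToExtract(5, 1, 2));                 // prints [1, 2]
    }
}
```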

setEndPage

public void setEndPage(int endPageValue)

This will set the last page to be extracted by this class.

Parameters:: endPageValue - New value of property endPage.

setLineSeparator

public void setLineSeparator(String separator)

Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.

Parameters:: separator - The desired line separator string.
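
The fallback rule stated above (use the explicitly set separator, otherwise the line.separator system property) might look like the following in plain Java. This is a sketch under the assumption that "not set" is represented by null; `SeparatorDefaultSketch` and `effectiveSeparator` are hypothetical names and are not part of PDFBox.

```java
// Illustrative model only; the class and method names are hypothetical.
final class SeparatorDefaultSketch {

    // Returns the explicit separator if one was set (non-null, by assumption),
    // otherwise the platform default from the line.separator system property.
    static String effectiveSeparator(String explicitSeparator) {
        if (explicitSeparator != null) {
            return explicitSeparator;
        }
        return System.getProperty("line.separator");
    }

    public static void main(String[] args) {
        // An explicit separator wins over the platform default.
        System.out.println(effectiveSeparator("<br>")); // prints <br>
        // With nothing set, the platform default is used.
        System.out.println(effectiveSeparator(null).equals(System.getProperty("line.separator"))); // prints true
    }
}
```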

getLineSeparator

public String getLineSeparator()

This will get the line separator.

Returns:: The desired line separator string.

setPageSeparator

public void setPageSeparator(String separator)

Set the desired page separator for output text. The line.separator system property is used if the page separator preference is not set explicitly using this method.

Parameters:: separator - The desired page separator string.

getWordSeparator

public String getWordSeparator()

This will get the word separator.

Returns:: The desired word separator string.

setWordSeparator

public void setWordSeparator(String separator)

Set the desired word separator for output text. The PDFBox text extraction algorithm will output a space character if there is enough space between two words. By default a space character is used. If you need an accurate count of characters that are found in a PDF document then you might want to set the word separator to the empty string.

Parameters:: separator - The desired word separator string.
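
To see why an empty word separator gives a more accurate character count, consider the following plain-Java sketch. It only models joining already-detected words with a separator; `WordSeparatorSketch` and `join` are hypothetical names, and the word detection that PDFBox performs is not shown.

```java
// Illustrative model only; the class and method names are hypothetical.
final class WordSeparatorSketch {

    // Joins words with the given separator, the way extracted output would
    // contain one separator between each pair of adjacent words.
    static String join(String[] words, String wordSeparator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(wordSeparator);
            }
            out.append(words[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] words = {"PDF", "text"};
        // Default separator: the inserted space is counted as a character.
        System.out.println(join(words, " ").length()); // prints 8
        // Empty separator: only characters present in the PDF are counted.
        System.out.println(join(words, "").length());  // prints 7
    }
}
```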

getPageSeparator

public String getPageSeparator()

This will get the page separator.

Returns:: The page separator string.

shouldSuppressDuplicateOverlappingText

public boolean shouldSuppressDuplicateOverlappingText()

Returns:: The value of suppressDuplicateOverlappingText.

getCurrentPageNo

protected int getCurrentPageNo()

Get the current page number that is being processed.

Returns:: A 1 based number representing the current page.

getOutput

protected Writer getOutput()

The output stream that is being written to.

Returns:: The stream that output is being written to.

getCharactersByArticle

protected List getCharactersByArticle()

Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects; the inner lists contain TextPosition objects.

Returns:: A double List of TextPositions for all text strings on the page.

setSuppressDuplicateOverlappingText

public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)

By default the text stripper will attempt to remove duplicate text that overlaps. For example, Word paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but performance will be better.

Parameters:: suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
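
The idea of suppressing duplicate overlapping text can be illustrated with a small plain-Java model. This is not PDFBox's algorithm: the `Glyph` type, the `suppressOverlapping` method, and the fixed position tolerance are all assumptions made for the sketch. A glyph is dropped when an identical character has already been kept at nearly the same position.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; all names and the tolerance rule are hypothetical.
final class OverlapSuppressionSketch {

    // A painted character and the position where it was drawn.
    static final class Glyph {
        final String character;
        final float x;
        final float y;

        Glyph(String character, float x, float y) {
            this.character = character;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps a glyph unless the same character was already kept within
    // 'tolerance' units in both x and y.
    static List<Glyph> suppressOverlapping(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph candidate : glyphs) {
            boolean duplicate = false;
            for (Glyph existing : kept) {
                if (existing.character.equals(candidate.character)
                        && Math.abs(existing.x - candidate.x) <= tolerance
                        && Math.abs(existing.y - candidate.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Concatenates the characters of the glyphs in list order.
    static String text(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        for (Glyph glyph : glyphs) {
            out.append(glyph.character);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> painted = new ArrayList<Glyph>();
        painted.add(new Glyph("B", 10.0f, 20.0f));
        painted.add(new Glyph("B", 10.2f, 20.0f)); // fake-bold repaint, slightly offset
        painted.add(new Glyph("o", 16.0f, 20.0f));
        System.out.println(text(painted));                            // prints BBo
        System.out.println(text(suppressOverlapping(painted, 0.5f))); // prints Bo
    }
}
```

In this model every glyph is compared against the glyphs already kept, which hints at why skipping the check is faster; the actual cost in PDFBox depends on its own implementation.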

shouldSeparateByBeads

public boolean shouldSeparateByBeads()

This will tell if the text stripper should separate by beads.

Returns:: If the text will be grouped by beads.

setShouldSeparateByBeads

public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)

Set if the text stripper should group the text output by a list of beads. The default value is true.

Parameters:: aShouldSeparateByBeads - The new grouping of beads.

getEndBookmark

public PDOutlineItem getEndBookmark()

Get the bookmark where text extraction should end, inclusive. Default is null.

Returns:: The ending bookmark.

setEndBookmark

public void setEndBookmark(PDOutlineItem aEndBookmark)

Set the bookmark where the text extraction should stop.

Parameters:: aEndBookmark - The ending bookmark.

getStartBookmark

public PDOutlineItem getStartBookmark()

Get the bookmark where text extraction should start, inclusive. Default is null.

Returns:: The starting bookmark.

setStartBookmark

public void setStartBookmark(PDOutlineItem aStartBookmark)

Set the bookmark where text extraction should start, inclusive.

Parameters:: aStartBookmark - The starting bookmark.

shouldSortByPosition

public boolean shouldSortByPosition()

This will tell if the text stripper should sort the text tokens before writing to the stream.

Returns:: true if the text tokens will be sorted before being written.

setSortByPosition

public void setSortByPosition(boolean newSortByPosition)

The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, emitting all bold or larger text first and then making a second pass to write out the normal text. A PDF writer could also choose to write each character in a different order.

The default is to not sort by position: for performance reasons, PDFBox does not sort the text tokens before processing them.

Parameters:: newSortByPosition - Tell PDFBox to sort the text positions.
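
The difference between stream order and visual order can be sketched in plain Java. This model is not how PDFBox sorts internally: the `Token` type, the `sortByPosition` method, and the coordinate convention (y measured downward from the top of the page, so smaller y is higher up) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Illustrative model only; all names and the coordinate convention are hypothetical.
final class SortByPositionSketch {

    // A text token with the position of its start point on the page.
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Returns a copy of the tokens ordered top to bottom, then left to right.
    static List<Token> sortByPosition(List<Token> streamOrder) {
        List<Token> sorted = new ArrayList<Token>(streamOrder);
        Collections.sort(sorted, new Comparator<Token>() {
            public int compare(Token a, Token b) {
                int byRow = Float.compare(a.y, b.y);
                return byRow != 0 ? byRow : Float.compare(a.x, b.x);
            }
        });
        return sorted;
    }

    // Joins the token texts with " | " in list order.
    static String describe(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0) {
                out.append(" | ");
            }
            out.append(tokens.get(i).text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A writer that emits all bold text first, then the normal text.
        List<Token> streamOrder = new ArrayList<Token>();
        streamOrder.add(new Token("Title", 50f, 10f));   // bold
        streamOrder.add(new Token("Summary", 50f, 60f)); // bold
        streamOrder.add(new Token("body", 50f, 30f));    // normal, second pass
        System.out.println(describe(streamOrder));                 // prints Title | Summary | body
        System.out.println(describe(sortByPosition(streamOrder))); // prints Title | body | Summary
    }
}
```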
