java.lang.Object
  org.pdfbox.util.PDFStreamEngine
    org.pdfbox.util.PDFTextStripper
This class takes a PDF document and extracts all of its text, ignoring the formatting.
Field Summary
protected Vector | charactersByArticle | The charactersByArticle is used to extract text by article divisions.
protected Writer | output | The stream to write the output to.
Constructor Summary
PDFTextStripper() | Instantiate a new PDFTextStripper object.
Method Summary
protected void | endDocument(PDDocument pdf) | This method is available for subclasses of this class.
protected void | endPage(PDPage page) | End a page.
protected void | endParagraph() | End a paragraph.
protected void | flushText() | This will print the text to the output stream.
protected List | getCharactersByArticle() | Character strings are grouped by articles.
protected int | getCurrentPageNo() | Get the current page number that is being processed.
PDOutlineItem | getEndBookmark() | Get the bookmark where text extraction should end, inclusive.
int | getEndPage() | This will get the last page that will be extracted.
String | getLineSeparator() | This will get the line separator.
protected Writer | getOutput() | The output stream that is being written to.
String | getPageSeparator() | This will get the page separator.
PDOutlineItem | getStartBookmark() | Get the bookmark where text extraction should start, inclusive.
int | getStartPage() | This is the page that the text extraction will start on.
String | getText(COSDocument doc) | Deprecated.
String | getText(PDDocument doc) | This will return the text of a document.
String | getWordSeparator() | This will get the word separator.
protected void | processPage(PDPage page, COSStream content) | This will process the contents of a page.
protected void | processPages(List pages) | This will process all of the pages and the text that is in them.
void | setEndBookmark(PDOutlineItem aEndBookmark) | Set the bookmark where the text extraction should stop.
void | setEndPage(int endPageValue) | This will set the last page to be extracted by this class.
void | setLineSeparator(String separator) | Set the desired line separator for output text.
void | setPageSeparator(String separator) | Set the desired page separator for output text.
void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) | Set if the text stripper should group the text output by a list of beads.
void | setSortByPosition(boolean newSortByPosition) | The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen.
void | setStartBookmark(PDOutlineItem aStartBookmark) | Set the bookmark where text extraction should start, inclusive.
void | setStartPage(int startPageValue) | This will set the first page to be extracted by this class.
void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) | By default the text stripper will attempt to remove text that overlaps.
void | setWordSeparator(String separator) | Set the desired word separator for output text.
boolean | shouldSeparateByBeads() | This will tell if the text stripper should separate by beads.
boolean | shouldSortByPosition() | This will tell if the text stripper should sort the text tokens before writing to the stream.
boolean | shouldSuppressDuplicateOverlappingText() |
protected void | showCharacter(TextPosition text) | This will add a character to the list of characters to be printed to the text file.
protected void | startDocument(PDDocument pdf) | This method is available for subclasses of this class.
protected void | startPage(PDPage page) | Start a new page.
protected void | startParagraph() | Start a new paragraph.
protected void | writeCharacters(TextPosition text) | Write the string to the output stream.
void | writeText(COSDocument doc, Writer outputStream) | Deprecated.
void | writeText(PDDocument doc, Writer outputStream) | This will take a PDDocument and write the text of that document to the supplied Writer.
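Several of the protected methods listed above (startDocument, endDocument, startPage, endPage) are hooks intended for subclasses. The following is a minimal sketch of that pattern, not the PDFBox implementation: SketchStripper and RecordingStripper are hypothetical stand-ins, a page is modeled as a plain String, and the call order shown is an assumption rather than something this page states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the stripper; this is NOT the PDFBox class.
// A "page" is just a String here so the sketch needs no external library.
class SketchStripper {
    // Drives the hooks in the order this sketch assumes.
    final void run(List<String> pages) {
        startDocument();
        for (String page : pages) {
            startPage(page);
            endPage(page);
        }
        endDocument();
    }

    // Default hooks do nothing; subclasses override the ones they need.
    protected void startDocument() { }
    protected void endDocument() { }
    protected void startPage(String page) { }
    protected void endPage(String page) { }
}

// Example subclass that records which hooks fired, in order.
class RecordingStripper extends SketchStripper {
    final List<String> events = new ArrayList<String>();

    @Override protected void startDocument() { events.add("startDocument"); }
    @Override protected void endDocument() { events.add("endDocument"); }
    @Override protected void startPage(String page) { events.add("startPage:" + page); }
    @Override protected void endPage(String page) { events.add("endPage:" + page); }
}
```

A subclass of the real class would override the same-named methods, which take PDDocument and PDPage arguments and may throw IOException.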
Methods inherited from class org.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
protected Vector charactersByArticle
protected Writer output
Constructor Detail
public PDFTextStripper() throws IOException
IOException
- If there is an error loading the properties.
Method Detail
public String getText(PDDocument doc) throws IOException
doc
- The document to get the text from.
IOException
- If the doc state is invalid or it is encrypted.
public String getText(COSDocument doc) throws IOException
doc
- The document to extract the text from.
IOException
- If there is an error extracting the text.
See also: getText( PDDocument )
public void writeText(COSDocument doc, Writer outputStream) throws IOException
doc
- The document to extract the text from.
outputStream
- The stream to write the text to.
IOException
- If there is an error extracting the text.
See also: writeText( PDDocument, Writer )
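getText returns the extracted text as a String, while writeText sends it to a caller-supplied Writer. One common way to relate two such methods is to have the String variant write into an in-memory StringWriter. The sketch below illustrates that idea only: WriterSketch is a hypothetical stand-in, the "document" is an array of page strings, and whether PDFBox implements getText this way is an assumption.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical stand-in; the "document" is just an array of page strings.
class WriterSketch {
    // Streams every page to the supplied Writer, one page per line.
    static void writeText(String[] pages, Writer out) throws IOException {
        for (String page : pages) {
            out.write(page);
            out.write('\n');
        }
        out.flush();
    }

    // Collects the same output in memory and returns it as a String.
    static String getText(String[] pages) {
        StringWriter buffer = new StringWriter();
        try {
            writeText(pages, buffer);
        } catch (IOException e) {
            // A StringWriter does not fail on write; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```

Writing to a Writer avoids holding the whole text in memory for large documents; the String variant is a convenience on top of it.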
public void writeText(PDDocument doc, Writer outputStream) throws IOException
doc
- The document to get the data from.
outputStream
- The location to put the text.
IOException
- If the doc is in an invalid state.
protected void processPages(List pages) throws IOException
pages
- The pages object in the document.
IOException
- If there is an error parsing the text.
protected void startDocument(PDDocument pdf) throws IOException
pdf
- The PDF document that is being processed.
IOException
- If an IO error occurs.
protected void endDocument(PDDocument pdf) throws IOException
pdf
- The PDF document that is being processed.
IOException
- If an IO error occurs.
protected void processPage(PDPage page, COSStream content) throws IOException
page
- The page to process.
content
- The contents of the page.
IOException
- If there is an error processing the page.
protected void startParagraph() throws IOException
IOException
- If there is any error writing to the stream.
protected void endParagraph() throws IOException
IOException
- If there is any error writing to the stream.
protected void startPage(PDPage page) throws IOException
page
- The page we are about to process.
IOException
- If there is any error writing to the stream.
protected void endPage(PDPage page) throws IOException
page
- The page we are about to process.
IOException
- If there is any error writing to the stream.
protected void flushText() throws IOException
IOException
- If there is an error writing the text.
protected void writeCharacters(TextPosition text) throws IOException
text
- The text to write to the stream.
IOException
- If there is an error when writing the text.
protected void showCharacter(TextPosition text)
Overrides: showCharacter in class PDFStreamEngine
text
- The description of the character to display.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue
- New value of property startPage.
public int getEndPage()
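The startPage and endPage properties bound the range of pages that is extracted. The sketch below shows one way such a range filter can work. PageRangeSketch is a hypothetical stand-in, and the 1-based page numbering, the inclusive bounds, and the default values are assumptions of this sketch, not statements from this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that models only the page-range properties.
class PageRangeSketch {
    private int startPage = 1;                  // assumed default: first page
    private int endPage = Integer.MAX_VALUE;    // assumed default: no upper limit

    void setStartPage(int startPageValue) { startPage = startPageValue; }
    void setEndPage(int endPageValue) { endPage = endPageValue; }
    int getStartPage() { return startPage; }
    int getEndPage() { return endPage; }

    // Returns the pages whose 1-based number lies in [startPage, endPage].
    List<String> select(List<String> pages) {
        List<String> selected = new ArrayList<String>();
        for (int i = 0; i < pages.size(); i++) {
            int pageNo = i + 1; // assumption: pages are numbered from 1
            if (pageNo >= startPage && pageNo <= endPage) {
                selected.add(pages.get(i));
            }
        }
        return selected;
    }
}
```

Under these assumptions, setting startPage to 2 and endPage to 3 on a four-page input selects the second and third pages.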
public void setEndPage(int endPageValue)
endPageValue
- New value of property endPage.
public void setLineSeparator(String separator)
separator
- The desired line separator string.
public String getLineSeparator()
public void setPageSeparator(String separator)
separator
- The desired page separator string.
public String getWordSeparator()
public void setWordSeparator(String separator)
separator
- The desired word separator string.
public String getPageSeparator()
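The word, line, and page separators control how extracted text is joined in the output. The following sketch shows one plausible way the three separators combine. SeparatorSketch is a hypothetical stand-in, the input is pre-split into pages, lines, and words, and the exact placement of each separator (between words, after each line, after each page) is an assumption of this sketch.

```java
// Hypothetical stand-in that models only the three separator properties.
class SeparatorSketch {
    private String wordSeparator = " ";   // assumed default
    private String lineSeparator = "\n";  // assumed default
    private String pageSeparator = "\n";  // assumed default

    void setWordSeparator(String separator) { wordSeparator = separator; }
    void setLineSeparator(String separator) { lineSeparator = separator; }
    void setPageSeparator(String separator) { pageSeparator = separator; }

    // Input layout: pages[page][line][word].
    String join(String[][][] pages) {
        StringBuilder sb = new StringBuilder();
        for (String[][] page : pages) {
            for (String[] line : page) {
                for (int i = 0; i < line.length; i++) {
                    if (i > 0) {
                        sb.append(wordSeparator); // between words only
                    }
                    sb.append(line[i]);
                }
                sb.append(lineSeparator); // after every line
            }
            sb.append(pageSeparator); // after every page
        }
        return sb.toString();
    }
}
```

With the separators set to "_", "|" and "#", a first page holding the lines "a b" and "c" plus a second page holding "d" joins to "a_b|c|#d|#" in this sketch.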
public boolean shouldSuppressDuplicateOverlappingText()
protected int getCurrentPageNo()
protected Writer getOutput()
protected List getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue
- The suppressDuplicateOverlappingText to set.
public boolean shouldSeparateByBeads()
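Suppressing duplicate overlapping text means dropping a character that is drawn on top of an identical one, as can happen when a PDF renders the same glyph twice at nearly the same position. The sketch below shows the general idea only. Glyph and OverlapSketch are hypothetical stand-ins, the tolerance parameter is an assumption, and the heuristic PDFBox actually uses is not described on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a positioned character.
class Glyph {
    final String text;
    final float x;
    final float y;

    Glyph(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class OverlapSketch {
    // Keeps the first occurrence of each glyph and drops later glyphs that
    // have the same text and lie within `tolerance` of a kept one on both axes.
    static List<Glyph> suppress(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<Glyph>();
        for (Glyph g : glyphs) {
            boolean duplicate = false;
            for (Glyph k : kept) {
                if (k.text.equals(g.text)
                        && Math.abs(k.x - g.x) <= tolerance
                        && Math.abs(k.y - g.y) <= tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(g);
            }
        }
        return kept;
    }
}
```

This quadratic scan is adequate for a sketch; a real implementation would likely index glyphs by position to avoid comparing every pair.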
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads
- The new grouping of beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark
- The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark
- The starting bookmark.
public boolean shouldSortByPosition()
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition
- Tell PDFBox to sort the text positions.
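Sorting by position addresses the case where text tokens are stored in the PDF in a different order than they appear on screen. A minimal sketch of the idea follows. Token and SortSketch are hypothetical stand-ins, and the coordinate convention (y grows downward, so smaller y means higher on the page) and the simple row-then-column ordering are assumptions; the actual PDFBox ordering logic may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a positioned piece of text.
class Token {
    final String text;
    final float x;
    final float y;

    Token(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

class SortSketch {
    // Concatenates the tokens, optionally reordering them top-to-bottom and
    // then left-to-right. The input list is left unmodified.
    static String render(List<Token> tokens, boolean sortByPosition) {
        List<Token> ordered = new ArrayList<Token>(tokens);
        if (sortByPosition) {
            Collections.sort(ordered, new Comparator<Token>() {
                @Override
                public int compare(Token a, Token b) {
                    int byRow = Float.compare(a.y, b.y); // assumption: y grows downward
                    return byRow != 0 ? byRow : Float.compare(a.x, b.x);
                }
            });
        }
        StringBuilder sb = new StringBuilder();
        for (Token t : ordered) {
            sb.append(t.text);
        }
        return sb.toString();
    }
}
```

Leaving sorting off preserves the order of the content stream, which is cheaper and is often already the reading order.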