2005-10-10 created
2007-03-27 updated on tagged PDF and ABBYY recommendation
2008-04-06 updated on SVG and Mars
NOTE: In late 2006, Adobe Labs introduced their own XML vocabulary for PDF: Mars. This obviates the need to introduce any competing XML vocabulary.
Instead of graphic command streams, Mars uses SVG. For areas where PDF exceeds SVG, Adobe has extended SVG with private namespace extensions. Also, SVG's support for animation and scripting is not used; the PDF scripting object model is still used, so only static SVG can be brought into a Mars PDF.
Acrobat 8 or Reader 8 with the Mars plugin can perform bi-directional interconversion of Mars XML and PDF (they have the same infoset). The plugin is currently only available for Intel-based Macs and Windows. There is currently no sign of any command-line or server-based tools.
Before going any further, let me clarify something about PDF, because I've gotten multiple queries from people on the net who have stumbled onto this page and who seem to be under the impression that PDF has useful semantic information in it. In general it doesn't. There is the ability to embed actual XML (see XMP at http://www.adobe.com/products/xmp/standards.html ). And some PDF authoring programs support "tagged PDF". Tagged PDF is the moral equivalent of the "alt" attribute in HTML -- it indicates the logical structure of the pages for alternate reader software, for accessibility purposes. This structure uses a vocabulary unique to PDF ("Art" for article, "Sect" for section, etc.). It is not stored in the PDF format as XML (unlike XMP metadata), but as a content stream.
But a random PDF file quite likely has none of this. Instead it just has the equivalent of dumbed-down PostScript. That means that even the concept of a word is not always apparent, given that some PDF generation software will absolutely position individual characters. Reconstructing a paragraph would require heuristics to guess where columns are, and then undoing any hyphenation that was done. For an excellent article on that challenge (written by some engineers at a small firm acquired by Apple), see http://www.idealliance.org/papers/dx_xml03/papers/05-03-03/05-03-03.html . Also see Tamir Hassan's project report at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pdf as well as his other publications at www.tamirhassan.com .
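To give a feel for the kind of heuristics involved, here is a minimal sketch in Java. It is not taken from any of the tools or papers mentioned on this page. The Glyph type, the gapFactor threshold, and the dehyphenation rule are all assumptions made up for illustration, and a real extractor would also have to deal with columns, rotated text, and font metrics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordGrouper {
    /** One absolutely positioned glyph; a hypothetical type, not from any PDF library. */
    static final class Glyph {
        final char ch;
        final double x;      // left edge of the glyph on the baseline
        final double width;  // advance width of the glyph
        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Splits one line of glyphs (already sorted by x) into words wherever the gap is wide. */
    static List<String> words(List<Glyph> line, double gapFactor) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                // Assumed rule: a gap wider than a fraction of the previous glyph is a space.
                if (gap > gapFactor * prev.width) {
                    out.add(cur.toString());
                    cur.setLength(0);
                }
            }
            cur.append(g.ch);
            prev = g;
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    /** Joins line texts into one paragraph, merging a word hyphenated across a line break. */
    static String joinLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            String t = line.trim();
            if (t.isEmpty()) {
                continue;
            }
            int n = sb.length();
            boolean hyphenated = n >= 2
                    && sb.charAt(n - 1) == '-'
                    && Character.isLetter(sb.charAt(n - 2))
                    && Character.isLowerCase(t.charAt(0));
            if (hyphenated) {
                // Naive: this also merges genuine compounds that happen to break at the hyphen.
                sb.setLength(n - 1);
            } else if (n > 0) {
                sb.append(' ');
            }
            sb.append(t);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph('a', 0, 5), new Glyph('b', 5, 5),
                new Glyph('c', 14, 5), new Glyph('d', 19, 5));
        System.out.println(words(line, 0.3));
        System.out.println(joinLines(Arrays.asList("recon-", "struction of", "text")));
    }
}
```

Even this toy version shows why the results are fragile: the right gap threshold depends on the font and on how the generating software kerned the text, and the hyphenation rule cannot tell a line-break hyphen from a real one without a dictionary.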
If you happen to get a "tagged PDF" and in fact it has tagged all the content (which is difficult to verify) then you are home free. But don't assume you'll find that.
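The embedded XMP mentioned above is at least easy to probe for. XMP packets are wrapped in "<?xpacket begin ... ?>" and "<?xpacket end ... ?>" processing instructions, so a crude check is to scan the raw bytes of the file for those markers. The sketch below does only that, and it assumes the metadata stream is stored uncompressed and unencrypted, which is common but not guaranteed. The class and method names are made up for illustration. It says nothing about whether the file is tagged.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class XmpScan {
    /** Returns the first XMP packet visible in the raw bytes, or null if none is found. */
    static String findXmpPacket(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char, so string offsets equal byte offsets.
        String raw = new String(pdf, StandardCharsets.ISO_8859_1);
        int begin = raw.indexOf("<?xpacket begin");
        if (begin < 0) {
            return null;
        }
        int end = raw.indexOf("<?xpacket end", begin);
        if (end < 0) {
            return null;
        }
        int close = raw.indexOf("?>", end);
        if (close < 0) {
            return null;
        }
        // Assumption: the packet is UTF-8, the usual encoding for XMP embedded in PDF.
        return new String(pdf, begin, close + 2 - begin, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        String packet = findXmpPacket(pdf);
        System.out.println(packet == null ? "no XMP packet visible" : packet);
    }
}
```

A null result does not prove the file has no metadata, since the packet may sit inside a compressed or encrypted stream that only a real PDF parser can open.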
To be clear, you can do the following with free software:
Beyond that, the pickings are meager.
There is pdftohtml.sourceforge.net, untouched since 2003. It is a C++ GPL tool using xpdf (an old 2.x version). It extracts text, and puts all the images in each page into a single background image. It doesn't extract anything else. The generated HTML uses absolute positioning on a line-by-line basis (when the -c option is used).
There is Tamir Hassan's pdf2html at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/index.html . It relies on an obsolete version of JPedal. It does not extract images, does not support recent PDF format versions, and messed up the character sets on the files I tried.
There is pdftoxml.sourceforge.net, also untouched since 2003. It is an LGPL Java tool, using an old snapshot of JPedal. It has zero documentation, and no released files, but it does have CVS. It built fine using ant, but barfed with a NumberFormatException on the first PDF file I handed to it. From browsing the source code, it looks like it iterates through the Page list and for each page dumps Lines, Polygons, and Boxes.
pstoedit has a plugin with the ability to convert PDF to SVG. This plugin is closed source shareware (though the base pstoedit is open source); available for Windows or Linux. Useful certainly, but it doesn't extract everything, and since no source is available, it is hard to do more with it.
pdf2svg claims to convert PDF to SVG using Poppler (a PDF rendering library based on xpdf) and Cairo (a 2D vector graphics output library). I have not tested it. Not to be confused with other identically named projects/products.
Adobe InDesign 2.0 does have some limited support for XML (basically, importing/exporting PDF tags).
Adobe hosts an online conversion service at http://www.adobe.com/products/acrobat/access_onlinetools.html .
There are several third-party commercial tools such as these that convert PDF to various XML vocabularies:
There are other commercial tools that go from some XML vocabulary to PDF:
The result of running "java org.pdfbox.AsXML foo.pdf foo.xml" gives a foo.xml file that looks something like:
<document>
  <header> %PDF-1.4 </header>
  <body>
    <Dictionary key='1' gen='0'>
      <dict>
        <dictentry name='Pages'><ref key='2' gen='0'/></dictentry>
        <dictentry name='Type'><name>Catalog</name></dictentry>
        <dictentry name='PageLabels'><ref key='3' gen='0'/></dictentry>
        <dictentry name='Metadata'><ref key='4' gen='0'/></dictentry>
      </dict>
    </Dictionary>
    <Dictionary key='5' gen='0'>
      <dict>
        <dictentry name='ModDate'><string>D:20050910211227-04'00'</string></dictentry>
        <dictentry name='CreationDate'><string>D:20050910211227-04'00'</string></dictentry>
        <dictentry name='Title'><string>Unknown</string></dictentry>
        <dictentry name='Creator'><string>QuarkXPress: pictwpstops filter 1.0</string></dictentry>
        <dictentry name='Producer'><string>Acrobat Distiller 6.0.1 for Macintosh</string></dictentry>
        ...
This is really just a proof of concept. There are a whole bunch of problems:
Also, for all I know this might not have been the right interception point in the bowels of pdfbox. For example, a trace function could be introduced at parsing time to do this on the fly. Also, this could be done instead by directly using the XML SAX2 APIs (either at parse or at save time).
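As a rough illustration of the SAX2 alternative, the sketch below emits one dictionary as SAX events and lets the JDK's identity transformer serialize them. It does not use pdfbox at all: a plain Map stands in for a PDF dictionary, every value is written as a <string> for simplicity (the real dump also has refs, names, and so on), and the class and method names are made up. The element names are copied from the sample output above.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SaxDictDump {
    /** Emits one dictionary as SAX events; the Map is a stand-in for a PDF dictionary. */
    static void emitDictionary(ContentHandler h, int key, int gen, Map<String, String> entries)
            throws SAXException {
        AttributesImpl objAttrs = new AttributesImpl();
        objAttrs.addAttribute("", "key", "key", "CDATA", Integer.toString(key));
        objAttrs.addAttribute("", "gen", "gen", "CDATA", Integer.toString(gen));
        h.startElement("", "Dictionary", "Dictionary", objAttrs);
        h.startElement("", "dict", "dict", new AttributesImpl());
        for (Map.Entry<String, String> e : entries.entrySet()) {
            AttributesImpl entryAttrs = new AttributesImpl();
            entryAttrs.addAttribute("", "name", "name", "CDATA", e.getKey());
            h.startElement("", "dictentry", "dictentry", entryAttrs);
            // Simplification: every value is treated as a PDF string.
            h.startElement("", "string", "string", new AttributesImpl());
            char[] text = e.getValue().toCharArray();
            h.characters(text, 0, text.length); // the serializer handles escaping
            h.endElement("", "string", "string");
            h.endElement("", "dictentry", "dictentry");
        }
        h.endElement("", "dict", "dict");
        h.endElement("", "Dictionary", "Dictionary");
    }

    /** Serializes one dictionary to an XML string through an identity TransformerHandler. */
    static String toXml(int key, int gen, Map<String, String> entries) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));
            handler.startDocument();
            emitDictionary(handler, key, gen, entries);
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("Title", "Unknown");
        info.put("Producer", "Acrobat Distiller 6.0.1 for Macintosh");
        System.out.println(toXml(5, 0, info));
    }
}
```

The attraction of going through a ContentHandler is that escaping and well-formedness are handled by the serializer rather than by string concatenation, and the same events could be fed to any other SAX consumer instead of a file.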
But the main problem is that this particular XML export is too low level for most purposes. It is useful for some things -- and it certainly ensures roundtripping with perfect fidelity -- but most of the time you don't want a dump at the level of the primitive PDF atoms like Array, Dictionary, Float, String, etc. Rather, what you want are objects like Text, Image, Font, Script, Page, Graphic, etc. You also want:
At the moment I'm not planning on pushing this much further. I was just scratching an itch...
-mda
if (value != null) { // this is purely defensive, if entry is set to null instead of removed
if( value != null ) {