Converting PDF to XML

Mark D. Anderson (mda@discerning.com)
  2005-10-10 created
  2007-03-27 updated on tagged PDF and ABBYY recommendation
  2008-04-06 updated on SVG and Mars

NOTE: In late 2006, Adobe Labs introduced their own XML vocabulary for PDF: Mars. This obviates the need to introduce any contending alternate XML vocabulary.

Instead of graphic command streams, Mars uses SVG. For areas where PDF exceeds SVG, Adobe has extended SVG with private namespace extensions. Also, SVG's support for animation and scripting is not used; the PDF scripting object model is still used, so only static SVG can be brought into a Mars PDF.

Acrobat 8 or Reader 8 with the Mars plugin can perform bi-directional interconversion of Mars XML and PDF (they have the same infoset). The plugin is currently only available for Intel-based Macs and Windows. There is currently no sign of any command-line or server-based tools.

Introduction

Being able to convert PDF files to some sort of XML would have all sorts of uses: There are a gazillion tools for slicing and dicing XML. You can read XML in a text editor. It allows for an approach to PDF modification that doesn't entail learning somebody's PDF API. It can become the basis for interconversion among document formats.

Before going any further, let me clarify something about PDF, because I've gotten multiple queries from people on the net who have stumbled onto this page and who seem to be under the impression that PDF has useful semantic information in it. In general it doesn't. There is the ability to embed actual XML (see XMP at http://www.adobe.com/products/xmp/standards.html ). And some PDF authoring programs support "tagged PDF". Tagged PDF is the moral equivalent of the "alt" attribute in HTML -- it indicates the logical structure of the pages for alternate reader software, for accessibility purposes. This structure uses a vocabulary unique to PDF ("Art" for article, "Sect" for section, etc.). It is not stored in the PDF format as XML (unlike XMP metadata), but as a content stream.

But a random PDF file quite likely has none of this. Instead it just has the equivalent of dumbed-down PostScript. That means that even the concept of a word is not always apparent, given that some PDF generation software will absolutely position individual characters. Reconstructing a paragraph would require heuristics to guess where columns are, and then undoing any hyphenation that was done. For an excellent article on that challenge (written by some engineers at a small firm acquired by Apple), see http://www.idealliance.org/papers/dx_xml03/papers/05-03-03/05-03-03.html . Also see Tamir Hassan's project report at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/final.pdf as well as his other publications at www.tamirhassan.com
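To make the flavor of those heuristics concrete, here is a minimal Java sketch. It is not taken from any of the tools or papers mentioned here; the Glyph class, the TextHeuristics class, and the gapFactor threshold are all hypothetical names invented for illustration. It assumes the glyphs of one line have already been extracted and sorted left to right, and it ignores column detection entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** A positioned character: left edge x and advance width (hypothetical type). */
final class Glyph {
    final char ch;
    final double x;
    final double width;

    Glyph(char ch, double x, double width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

final class TextHeuristics {
    /**
     * Groups the glyphs of one line into words. A new word starts when the
     * horizontal gap after the previous glyph exceeds gapFactor times that
     * glyph's width. The threshold is an assumption, not a known-good value.
     */
    static List<String> groupIntoWords(List<Glyph> line, double gapFactor) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width && current.length() > 0) {
                    words.add(current.toString());
                    current.setLength(0);
                }
            }
            current.append(g.ch);
            prev = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    /**
     * Joins lines into one paragraph. A line ending in letter + '-' followed
     * by a line starting with a lowercase letter is treated as a hyphenated
     * word and rejoined; everything else is joined with a single space.
     */
    static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            int last = out.length() - 1;
            boolean brokenWord = last >= 1
                    && out.charAt(last) == '-'
                    && Character.isLetter(out.charAt(last - 1))
                    && Character.isLowerCase(line.charAt(0));
            if (brokenWord) {
                out.setLength(last); // drop the trailing hyphen
            } else if (out.length() > 0) {
                out.append(' ');
            }
            out.append(line);
        }
        return out.toString();
    }
}
```

Note that joinLines will wrongly merge a genuine compound such as "well-" / "known" into "wellknown"; a real implementation would need a dictionary check or similar, which is part of why this problem is hard.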

If you happen to get a "tagged PDF" and in fact it has tagged all the content (which is difficult to verify) then you are home free. But don't assume you'll find that.

Free Solutions

Given how obviously useful this would be, you might think that there would be a whole bunch of free tools for doing this. Well, there aren't any. Zippo. There are at least a half-dozen active Java-based PDF library projects, and none of them support this. There is an equal number of free libraries in other programming languages (combined), and they don't do it either.

To be clear, you can do the following with free software:

Beyond that, the pickings are meager.

There is pdftohtml.sourceforge.net, untouched since 2003. It is a C++ GPL tool using xpdf (an old 2.x version). It extracts text, and puts all the images in each page into a single background image. It doesn't extract anything else. The generated HTML uses absolute positioning on a line-by-line basis (when the -c option is used).

There is Tamir Hassan's pdf2html at http://www.dbai.tuwien.ac.at/staff/hassan/pdf2html/index.html . It is reliant on an obsolete version of JPedal. It does not extract images, does not support recent PDF format versions, and messed up the character sets on the files I tried.

There is pdftoxml.sourceforge.net, also untouched since 2003. It is an LGPL Java tool, using an old snapshot of JPedal. It has zero documentation, and no released files, but it does have CVS. It built fine using ant, but barfed with a NumberFormatException on the first PDF file I handed to it. From browsing the source code, it looks like it iterates through the Page list and for each page dumps Lines, Polygons, and Boxes.

pstoedit has a plugin with the ability to convert PDF to SVG. This plugin is closed source shareware (though the base pstoedit is open source); available for Windows or Linux. Useful certainly, but it doesn't extract everything, and since no source is available, it is hard to do more with it.

pdf2svg claims to convert PDF to SVG using Poppler (a PDF rendering library based on xpdf) and Cairo (a 2D vector graphics output library). I have not tested it. Not to be confused with other identically named projects/products.

Commercial Solutions

On the commercial side, Acrobat 6.0 has a "Save As XML". But it is a joke, saving even less information than Acrobat's "Save as HTML" -- which is itself quite poor, doing worse than even the free tools like pdftohtml. The various Acrobat "Save As XML" and "Save as HTML" options basically save a collection of images. Acrobat does seem to do a better job than, for example, Apple Preview at saving a page as an image -- if that is what you want. ImageMagick "convert" is also good at converting a PDF page to a single image.

Adobe InDesign 2.0 does have some limited support for XML (basically, importing/exporting PDF tags).

Adobe hosts an online conversion service at http://www.adobe.com/products/acrobat/access_onlinetools.html .

There are several third-party commercial tools such as these that convert PDF to various XML vocabularies:

I don't know how well they do except for the ABBYY product -- and I can attest to it doing a decent job. It will stitch together columns, and will undo hyphenation. They leverage their OCR engine to recognize columns. It is only available for Windows, and does not appear to be scriptable in a server application. They do separately sell an SDK which runs on unix-like systems (abbyy.com/sdk/) but I don't know what pricing is like.

There are other commercial tools that go from some XML vocabulary to PDF:

What I did

I grabbed the latest CVS of pdfbox.org and hacked up COSWriter.java. I did not modify any files; I only added files:

The result of running "java org.pdfbox.AsXML foo.pdf foo.xml" gives a foo.xml file that looks something like:

<document>
 <header>
%PDF-1.4
 </header>
 <body>
  <Dictionary key='1' gen='0'>
   <dict>
    <dictentry name='Pages'><ref key='2' gen='0'/></dictentry>
    <dictentry name='Type'><name>Catalog</name></dictentry>
    <dictentry name='PageLabels'><ref key='3' gen='0'/></dictentry>
    <dictentry name='Metadata'><ref key='4' gen='0'/></dictentry>

   </dict>

  </Dictionary>
  <Dictionary key='5' gen='0'>
   <dict>
    <dictentry name='ModDate'><string>D:20050910211227-04'00'</string></dictentry>
    <dictentry name='CreationDate'><string>D:20050910211227-04'00'</string></dictentry>
    <dictentry name='Title'><string>Unknown</string></dictentry>
    <dictentry name='Creator'><string>QuarkXPress: pictwpstops filter 1.0</string></dictentry>
    <dictentry name='Producer'><string>Acrobat Distiller 6.0.1 for Macintosh</string></dictentry>

...

This is really just a proof of concept. There are a whole bunch of problems:

Some of these problems could be addressed by an XML transformation after the fact. The ones having to do with encodings are pretty crippling though.
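Any after-the-fact processing starts with locating entries in the dump, and the XPath support that ships with the JDK is enough for that. The following is a minimal sketch, assuming a well-formed dump in the vocabulary shown above; the DumpQuery class and its firstStringEntry method are hypothetical names, not part of pdfbox.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class DumpQuery {
    /**
     * Returns the text of the first string-valued dictentry with the given
     * name anywhere in the dump, or "" if there is none.
     */
    static String firstStringEntry(String xml, String entryName) {
        // The name is spliced into the XPath expression, so this sketch
        // only accepts plain alphanumeric names.
        if (!entryName.matches("[A-Za-z0-9]+")) {
            throw new IllegalArgumentException("unsupported entry name: " + entryName);
        }
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//dictentry[@name='" + entryName + "']/string";
        try {
            // evaluate() with an InputSource parses the XML and returns the
            // string value of the first matching node
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("could not query dump", e);
        }
    }
}
```

For example, calling DumpQuery.firstStringEntry(xml, "Producer") on a complete dump would pull out the producer string without touching any PDF API.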

Also, for all I know this might not have been the right interception point in the bowels of pdfbox. For example, a trace function could be introduced at parsing time to do this on the fly. Also, this could be done instead by directly using the XML SAX2 APIs (either at parse or at save time).
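As a rough idea of what the SAX2 route might look like, here is a minimal sketch that emits one Dictionary element in the vocabulary shown earlier by firing SAX events into a JAXP TransformerHandler. It is not what my hack does. The SaxDictWriter class and its writeDictionary method are hypothetical, and it only handles name-valued entries; a real version would be driven by the pdfbox object model rather than a Map.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class SaxDictWriter {
    /** Serializes one indirect dictionary whose entries are all PDF names. */
    static String writeDictionary(int key, int gen, Map<String, String> nameEntries) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            // Output properties are set before the result is attached.
            handler.getTransformer()
                   .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            handler.startDocument();
            AttributesImpl objAttrs = new AttributesImpl();
            objAttrs.addAttribute("", "key", "key", "CDATA", Integer.toString(key));
            objAttrs.addAttribute("", "gen", "gen", "CDATA", Integer.toString(gen));
            handler.startElement("", "Dictionary", "Dictionary", objAttrs);
            handler.startElement("", "dict", "dict", new AttributesImpl());
            for (Map.Entry<String, String> entry : nameEntries.entrySet()) {
                AttributesImpl entryAttrs = new AttributesImpl();
                entryAttrs.addAttribute("", "name", "name", "CDATA", entry.getKey());
                handler.startElement("", "dictentry", "dictentry", entryAttrs);
                handler.startElement("", "name", "name", new AttributesImpl());
                char[] text = entry.getValue().toCharArray();
                handler.characters(text, 0, text.length); // serializer escapes as needed
                handler.endElement("", "name", "name");
                handler.endElement("", "dictentry", "dictentry");
            }
            handler.endElement("", "dict", "dict");
            handler.endElement("", "Dictionary", "Dictionary");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("could not serialize dictionary", e);
        }
    }
}
```

One advantage of going through SAX events is that escaping and output encoding are left to the serializer instead of being done by hand with string concatenation.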

But the main problem is that this particular XML export is too low level for most purposes. It is useful for some things -- and it certainly ensures roundtripping with perfect fidelity -- but most of the time you don't want a dump at the level of the primitive PDF atoms like Array, Dictionary, Float, String, etc. Rather, what you want are objects like Text, Image, Font, Script, Page, Graphic, etc. You also want:

At the moment I'm not planning on pushing this much further. I was just scratching an itch...

-mda


P.S.: there is a bug in COSWriter.java where one of these tests should be on name, not on value:
                if (value != null)
                {
                    // this is purely defensive, if entry is set to null instead of removed
                    if( value != null )
                    {