Microsoft Word

Many users seemingly know how to use only one tool, Microsoft Word. Note that, in general, a Word export cannot retain semantic information that was never there. Many users of Microsoft Word never use a single named style (such as "Heading 1"), but keep the entire document in one style and override its formatting everywhere.

There are several commercial tools that either act as a plugin on the client side to improve the HTML generated by Word, or clean it up on the server side. For example, many "ttw" (through-the-web) editors support cleaning of pasted Microsoft Word content.

* Plugins/Addins/Macros

There are a variety of plugins/addins available for Word. Of course, you or your users have to install them.

http://www.yawcpro.com/
  Saves better HTML than Microsoft does, and also saves as XML from older versions of Word.
  Saves to HTML or XML (DocBook) from Microsoft Word (97, 2000 and XP).
  There is an associated online service at http://www.yawconline.com

TonesNotes http://workspaces.gotdotnet.com/tonesnotes
  A Word AddIn for blogging (shared source license, C#).

WordToWiki
  http://tikiwiki.org/tiki-index.php?page=WordToWiki_swythan
  http://confluence.atlassian.com/display/CONFEXT/Word+to+Confluence+Converter
  Visual Basic script

* Using WordprocessingML

Starting with Word 2003, Word can save as XML in the WordprocessingML format.
  http://www.microsoft.com/office/xml/default.mspx
  http://msdn.microsoft.com/office/understanding/xmloffice/documentation/default.aspx
  http://www.microsoft.com/downloads/details.aspx?familyid=FE118952-3547-420A-A412-00A2662442D9&displaylang=en
  http://rep.oio.dk/Microsoft.com/officeschemas/welcome.htm

WMF graphics are exported as VML.

Microsoft offers wmlview.exe, which converts Word 2003 XML into HTML using XSLT, at:
  http://www.microsoft.com/downloads/details.aspx?familyid=19676b18-1bcd-4852-93ba-0b5a203ea731&displaylang=en

Older versions of Word (Word X for Mac, etc.) do not support WordprocessingML.
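
As a rough illustration of why named styles matter once a document is saved as WordprocessingML, the following Python sketch pulls each paragraph's style name and text out of a Word 2003 XML string. The sample document and the function name `paragraphs` are made up for illustration. The namespace URI is the one Word 2003 writes, but real files contain far more markup (sections, properties, VML), so treat this as a sketch under those assumptions rather than a converter.

```python
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 WordprocessingML files.
W = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical, hand-written sample; a real export is much larger.
SAMPLE = f"""<?xml version="1.0"?>
<w:wordDocument xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
      <w:r><w:t>Introduction</w:t></w:r>
    </w:p>
    <w:p>
      <w:r><w:t>Plain text, </w:t></w:r>
      <w:r><w:rPr><w:b/></w:rPr><w:t>manually bolded</w:t></w:r>
    </w:p>
  </w:body>
</w:wordDocument>"""


def paragraphs(xml_text):
    """Yield (style_name_or_None, text) for each w:p element."""
    root = ET.fromstring(xml_text)
    for p in root.iter(f"{{{W}}}p"):
        # A named style, if any, is recorded as w:pPr/w:pStyle/@w:val.
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        name = style.get(f"{{{W}}}val") if style is not None else None
        # Paragraph text is spread over one or more w:t elements.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield name, text


for name, text in paragraphs(SAMPLE):
    print(name, repr(text))
```

The second paragraph comes back with no style name: its bold run is only a formatting override, which is exactly the kind of document where no semantic information exists to be recovered.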
An open source project to transform Microsoft Word's XML into something useful is:
  http://www.oscom.org/projects/wordxml/

* Using exported HTML

To clean up the messy HTML that Word produces from "Save as Web Page", there are numerous tools, including:

  tidy http://tidy.sourceforge.net/docs/quickref.html#word-2000
  Word Unmunger http://luke.francl.org/software/word-unmunger/
  http://www.wordcleaner.com/ (commercial)
  http://textism.com/wordcleaner/ (web service, closed source)
  cut and paste into Dreamweaver MX2004 (special support for fixing Word HTML)
  http://owl.english.purdue.edu/lab/manuals/worddoc.html
  open in OO and save again
  Microsoft Office 2000 HTML Filter 2.0
    http://www.microsoft.com/downloads/details.aspx?FamilyID=209ADBEE-3FBD-482C-83B0-96FB79B74DED&displaylang=EN
    http://office.microsoft.com/en-us/assistance/HA010549981033.aspx (good explanation of what it has to do)

Microsoft Word 2003 has "Web page, filtered", which is more minimal.

* Parsing .doc

There are several independent open source projects that parse Microsoft Word files.

abiword http://www.abisource.com
  GPL, C/C++.
  command line:
    abiword -to txt file.doc
    abiword -to html file.doc
  While AbiWord uses wv and libwmf2, it supplies command-line conversion that is superior to wv, at least for HTML. For example, it generates HTML using CSS, while wvHtml does not. It also supports footnotes, endnotes, headers, footers, and floating text frames:
    http://www.abisource.com/lxr/source/abi/src/wp/impexp/xp/ie_imp_MsWord_97.cpp
    http://www.abisource.com/lxr/source/abi/src/wp/impexp/xp/ie_exp_HTML.cpp
  Note that AbiSource has its own CVS:
    http://www.abisource.com/lxr/source/wv/
    http://www.abisource.com/bonsai/cvsquery.cgi

wvware http://wvware.sourceforge.net/
  GPL, C (libwmf is LGPL).
  Started by Caolan McNamara ( http://www.skynet.ie/~caolan/ http://blogs.linux.ie/caolan/ ), who now works for RedHat and formerly worked for Sun on StarOffice.
  Now maintained by Dominic Lachowicz ( http://www.advogato.org/person/cinamod/ ), who works for AbiSource and formerly worked for Appligent.
  Several different libraries:
    wv is a library and command-line utility used by abiword.
      It includes a copy of libole2: http://cvs.sourceforge.net/viewcvs.py/wvware/wv/libole2/
    wv2 is a forked library (with no command-line utility) used only by KWord/KOffice.
      It relies on libole2: gnome libole2 http://cvs.gnome.org/viewcvs/libole2/
    libwmf is used by abiword; it reads vector graphics out of .doc and converts them to PS, SVG, FIG, PNG, or JPEG.
      WMF = Windows Meta File. EMF = Enhanced Meta File, which is 32-bit. Strictly speaking, WMF is only 16-bit (Windows 3.x can't do EMF). In addition to being 32-bit, EMF has some extra commands.
      (In CVS it is "libwmf2", maintained by AbiSource; the original "libwmf" written by Caolan is no longer maintained.)
  The wvWare command-line utilities work by having an external template file (in XML) that is used to look up macros as the utility iterates over the data structure.
    http://www.abisource.com/lxr/source/wv/wvWare.c
    http://www.abisource.com/lxr/source/wv/xml/
    http://cvs.sourceforge.net/viewcvs.py/wvware/wv/wvWare.c
    http://cvs.sourceforge.net/viewcvs.py/wvware/wv/xml/

antiword http://www.winfield.demon.nl/
  GPL, C.
  Command-line text extraction only.

Open Office http://www.openoffice.org/
  LGPL, C/C++.
  source code of filter: http://ooo.ximian.com/lxr/source/sw/sw/source/filter/
  command line: a bit awkward, requiring UNO: http://udk.openoffice.org/
  http://sourceforge.net/projects/joott is a Java project that uses UNO to remote-control OO.
  An example of a CGI that uses pyuno to convert MS docs to PDF: http://www.skynet.ie/~caolan/Fragments/ooo-cgi.html

POI http://jakarta.apache.org/poi/
  Apache license, Java.

Most of these have a means of extracting text from a Word .doc file via the command line.

* Saving as text

Word itself also allows saving as text in several text formats. As far as I can tell, here is how the text formats differ.
This is based on minimal testing, but already contradicts the official documentation:

  Format                         EOL     Charset  Comment
  Text Only                      system  ascii    single-line paragraphs
  Text Only with Line Breaks     system  system   wraps at 110
  MS-DOS Text                    DOS     ascii    single-line paragraphs
  MS-DOS Text with Line Breaks   DOS     ascii    wraps at 110

The "system EOL" is \r\n on Windows and \r on Mac. The "system charset" is Cp1252 on Windows and MacRoman (aka charset=macintosh) on Mac. For example, in Cp1252, an ndash (U+2013) is octal 226 = hex 96 = decimal 150. In MacRoman, an ndash is octal 320 = hex D0 = decimal 208.

Some Microsoft Word versions also offer:

  Text with Layout - ans or asc
  Unicode Text - system line ending, no line breaks, UTF-16 encoding

Regardless, paragraphs are marked with extra blank lines. Bullet characters and smart quotes are among the "funny" characters you have to look out for.

See: "WD2000: What Happens to Fields When You Save As Text" http://support.microsoft.com/kb/q211688/

* Saving as RTF

You can also make Word save as RTF, then convert the RTF to HTML separately.

* Scripting of Word

This is Windows only, and Internet Explorer only.

Open a client-side file:

Use a spell checker:
  // http://msdn.microsoft.com/library/en-us/script56/html/wscondrivingapplications.asp
  // Dan Rollins

If the MIME type is application/octet-stream, the document will always open externally.
If the MIME type is application/msword, it may or may not, depending on Tools -> Folder Options -> File Types -> ".doc" -> Advanced -> "Browse in Same Window" (in at least XP).

The LaunchinIE control will allow scripting without confirmation:
  http://www.whirlywiryweb.com/q/launchinie.asp

Clipboard access from IE:
  http://msdn.microsoft.com/workshop/author/datatransfer/overview.asp

Embedding Microsoft Word in IE:
  http://west-wind.com/weblog/posts/1299.aspx
  http://www.cse.msu.edu/~merrickm/wordauto.html
  http://www.webreference.com/js/column55/index.html
  http://msdn.microsoft.com/library/default.asp?url=/library/en-us/script56/html/wscondrivingapplications.asp
  http://www.pcmag.com/article2/0,4149,428392,00.asp
  http://fall.cerrocoso.edu/studenthelp/tutorials/spellchecker.htm

Scripting Word on the server side:
  Perl Win32::OLE http://aspn.activestate.com/ASPN/Mail/Message/perl-win32-web/2284879

TBD: what can actually be done purely on the client side.
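
Returning to the text formats under "Saving as text": the byte values quoted there for the ndash can be checked, and the system-specific line endings and "funny" characters normalized, with a few lines of Python. This is only a sketch. The function name `clean_word_text`, the replacement table, and the sample byte strings are all invented for illustration, and a real export may contain other characters worth handling.

```python
# Assumed mapping of "funny" characters to ASCII stand-ins; extend as needed.
ASCII_FALLBACKS = {
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2018": "'",    # left single smart quote
    "\u2019": "'",    # right single smart quote
    "\u201c": '"',    # left double smart quote
    "\u201d": '"',    # right double smart quote
    "\u2022": "*",    # bullet
}


def clean_word_text(raw, charset):
    """Decode a Word text export, normalize EOLs, flatten smart punctuation."""
    text = raw.decode(charset)
    # Windows exports use \r\n and Mac exports use \r; map both to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return "".join(ASCII_FALLBACKS.get(ch, ch) for ch in text)


# Hypothetical exports of the same two paragraphs on each system.
windows_export = b"pages 1\x962\r\n\r\n\x93quoted\x94\r\n"
mac_export = b"pages 1\xd02\r\r\xd2quoted\xd3\r"

print(repr(clean_word_text(windows_export, "cp1252")))
print(repr(clean_word_text(mac_export, "mac_roman")))
```

Both calls should print the same cleaned string, since byte 0x96 in Cp1252 and byte 0xD0 in MacRoman both decode to U+2013. The caller still has to know which system produced the file, because nothing in the bytes themselves records the charset.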