A clever tool that works as advertised
Definitely give this app a try. You'll need to do some initial
configuration to teach it, for example, which elements in your XML
files are inline elements and which are block elements, which
elements need to be handled as "verbatim" elements, and which you
want it to whitespace-normalize. But once you've done that initial
configuration, I think you'll find it works as expected --
including handling mixed content correctly.
In testing it with a number of files of 15,000+ lines, I never
found a single instance of it adding whitespace where none should
have been added, deleting whitespace that should have been
preserved, or wrapping or indenting anything I didn't ask it to.
For a few more details, see the
xmlhack item (http://xmlhack.com/read.php?item=2154) I
wrote about it.
Important tool with some current limitations
This is an important addition to the DocBook
toolchain -- at least as important for its ability to
do "standalone" HTML->DocBook conversion as it is for
its ability to produce DocBook from Java source
documentation. And as far as I know there are no other
open-source tools available for converting HTML to DocBook.
As of the 0.29 release, however, I think you can't yet
expect it to always produce valid DocBook; the output may
need some manual cleanup (though it does always generate
clean, well-formed XML -- nicely indented, even).
The validity limitations I've seen relate mostly to
the fact that HTML permits certain kinds of markup
that aren't really complete, even though they are
valid against the HTML DTD. When that markup gets
converted to DocBook, which requires more complete
structures, the result may not be valid.
For example, in HTML, it's valid for a definition
list (dl element) to contain only a term (dt) with no
corresponding description (dd). But the DocBook Doclet
will convert that to a Variablelist whose Varlistentry
contains a Term but no associated Listitem (the equivalent
of dd). This generates validity errors because the
Varlistentry content model requires a Listitem.
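To make the problem concrete, here is a minimal sketch of how such
entries could be located in converted output. It is my own
illustration, not part of DocBook Doclet. It assumes the output uses
the lowercase XML element names (variablelist, varlistentry, term,
listitem) with no namespace; the sample document and the function
name entries_missing_listitem are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted output: the second entry came from a dt with no dd.
SAMPLE = """<variablelist>
  <varlistentry><term>alpha</term><listitem><para>First.</para></listitem></varlistentry>
  <varlistentry><term>beta</term></varlistentry>
</variablelist>"""


def entries_missing_listitem(root):
    """Return the term text of each varlistentry that has no listitem child."""
    missing = []
    for entry in root.iter("varlistentry"):
        if entry.find("listitem") is None:
            term = entry.find("term")
            text = "".join(term.itertext()).strip() if term is not None else ""
            missing.append(text)
    return missing


root = ET.fromstring(SAMPLE)
print(entries_missing_listitem(root))  # prints ['beta']
```

Running a check like this before validating gives a quick list of the
terms that need a description added by hand.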
But validity errors like that are fairly easy to
find and clean up manually, so it's not that big of a
limitation. For future releases, it would be very
useful to have some logic in DocBook Doclet to detect
and automatically correct certain instances like that,
so that they don't need to be corrected manually.
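One way such automatic correction might look, sketched as a
post-processing step rather than as anything DocBook Doclet actually
does: give every varlistentry that lacks a listitem a placeholder
listitem holding an empty para. This assumes lowercase, un-namespaced
DocBook element names; the function name add_placeholder_listitems
and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET


def add_placeholder_listitems(root):
    """Append <listitem><para/></listitem> to entries that lack a listitem.

    Returns the number of entries that were patched.
    """
    patched = 0
    # Copy to a list first so the tree is not modified while iterating it.
    for entry in list(root.iter("varlistentry")):
        if entry.find("listitem") is None:
            listitem = ET.SubElement(entry, "listitem")
            ET.SubElement(listitem, "para")  # listitem needs block content
            patched += 1
    return patched


doc = ET.fromstring(
    "<variablelist><varlistentry><term>beta</term></varlistentry></variablelist>"
)
print(add_placeholder_listitems(doc))  # prints 1
print(ET.tostring(doc, encoding="unicode"))
```

An empty para satisfies the content model but carries no meaning, so
a real fix would probably also log each patched entry for later
review.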