TagSoup is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty, and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
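Because the parser exposes the standard SAX interface, handler code written against `org.xml.sax.XMLReader` does not need to know which parser is feeding it. The following is a minimal sketch of that idea, not code from the TagSoup distribution: `LinkCollector` and its `collect` method are hypothetical names introduced here for illustration, and the demo uses the JDK's built-in SAX parser on well-formed input so that it runs without extra dependencies.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of TagSoup): collects the href values
// of <a> elements reported by any SAX XMLReader.
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName when the reader is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses the markup with the supplied reader and returns the links found.
    static List<String> collect(XMLReader reader, String markup) throws Exception {
        LinkCollector handler = new LinkCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // JDK parser used here; it accepts only well-formed input.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        String page = "<html><body><a href=\"a.html\">A</a><a href=\"b.html\">B</a></body></html>";
        System.out.println(collect(reader, page));
    }
}
```

With the TagSoup jar on the classpath, the reader in `main` could instead be created as `new org.ccil.cowan.tagsoup.Parser()`, which also implements `XMLReader`; the handler would stay unchanged while the input could then be real-world, non-well-formed HTML.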
|Tags||Text Processing, Markup, HTML/XHTML|
|Operating Systems||OS Independent|
Release Notes: The main issue was with HTML comments, which were very badly broken: any > character would terminate one, so commenting out elements did not work properly. Comments should now be handled correctly. Everyone who can update should do so. Additionally, &#Xnnnn; (with capital X) now works, some debugging code was removed from PYXWriter, a Unicode BOM at the beginning of a document is skipped, and the new version of Saxon is supported as an XSLT processor. Documentation has been added on SAX features and properties specific to TagSoup.
Release Notes: A DOCTYPE declaration will be output if there is one in the input. The --ignorable switch was added to preserve whitespace in element content. The --output-encoding switch was added to specify output encoding. The default values for html/@version were removed. Various minor bugs were fixed.
Release Notes: All known bugs are fixed and all features considered appropriate have been added. This release is ready for full production use.
No changes have been submitted for this release.