In fact, most people working with XML actually need something from the second group, if they can find it. After all, if you needed to install a Web browser, would you start looking for HTML parsers and TCP/IP stacks to download? The trouble is that it hardly makes sense to write about the second group in a single "XML" article because the range of applications is so broad. Robin Cover lists over 500 different XML-based formats and projects, ranging from the American Iron and Steel Institute XML Workgroup to the VISA XML Invoice Specification to the Mind Reading Markup Language, and there are hundreds (perhaps thousands) more XML-based formats not in the list, not to mention the dozens (or hundreds) of applications like AbiWord, Gnumeric, and Flight Gear that happen to use proprietary XML-based save or configuration formats.
So, if you're looking for information on setting up distributed computing with SOAP and .NET, handling VISA transactions, or interpreting mind-reading data sheets, you should start by going to sites that deal specifically with those areas. If they're actually using XML (rather than just talking about it), they should have links to any available software and tutorials that you can use.
If you're still reading, I will assume that you could not find any high-level software, and that you have decided to build your own system from some of the low-level XML building blocks that follow.
From the start, the best support for XML has been in Java. Many of the original XML designers and implementors were Open Source Java programmers, and both Java and XML are Unicode-based. (XML causes headaches for languages that use 8-bit character sets.)
The Perl and Python communities, however, have worked very hard to catch up. Python can wrap any Java-based XML library, so Python programmers can (technically) claim that they can do anything the Java users can do, while announcements are constantly appearing for new XML-related Perl modules.
Support for XML and XML-related applications in C and C++ is growing, but is still weak relative to Java, Python, and Perl. XML-based projects sometimes end up shifting from C++ to Java to take advantage of better support (and to avoid the character-size problem).
There is at least some XML software for nearly every programming language currently in use (and some that are not); if in doubt, try a freshmeat search in the XML category.
An XML parser is a tool to read raw XML in text form and convert it into tokens or a parse tree (much as the parser in a C compiler does). Parsers are usually available as libraries that you can link into your application, and some are also available as standalone applications for error-checking documents in batch mode.
Many of the more naive XML books and articles will try to convince you that the parser is the starting point for all XML work. Technically, that's true (there's a parser hidden at the bottom of almost all XML software), but parsers operate at a low level and require advanced knowledge of XML. There are two good reasons that you should download and install a parser:
If neither of these applies to you, skip the rest of this section for now.
In the XML community, there are two standard interfaces that nearly all parsers implement for passing information to an application: The Simple API for XML (SAX) and the Document Object Model (DOM). SAX is a streaming, event-based interface, while DOM is a random-access, tree-based interface. XML programmers usually code against these interfaces rather than the parsers' own native interfaces, so it will be easy to change to a different parser library later (in Java, users can even change parser libraries at runtime). If you need help choosing between DOM and SAX, take a look at my old Events vs. Trees paper at the SAX site. It is best to look for parsers that support SAX2 and DOM2 rather than the original SAX and DOM interfaces.
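To make the difference concrete, here is a minimal sketch in Python, whose standard library ships both a SAX-style interface (xml.sax) and a small DOM implementation (xml.dom.minidom). The inventory document, the SkuCollector class, and the two helper functions are made up for illustration. Both helpers extract the same attribute values, once by reacting to streamed events and once by walking an in-memory tree.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, invented for this illustration.
DOC = "<inventory><item sku='a1'>bolt</item><item sku='b2'>nut</item></inventory>"


class SkuCollector(xml.sax.ContentHandler):
    """SAX style: the parser calls us back as each element streams past."""

    def __init__(self):
        super().__init__()
        self.skus = []

    def startElement(self, name, attrs):
        # Nothing is retained unless the handler chooses to keep it.
        if name == "item":
            self.skus.append(attrs["sku"])


def skus_via_sax(text):
    handler = SkuCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.skus


def skus_via_dom(text):
    """DOM style: build the whole tree first, then query it at random."""
    doc = minidom.parseString(text)
    return [el.getAttribute("sku") for el in doc.getElementsByTagName("item")]


if __name__ == "__main__":
    print(skus_via_sax(DOC))
    print(skus_via_dom(DOC))
```

The SAX version never holds more than the current event in memory, which is why it suits large documents. The DOM version is shorter and lets you revisit any node, at the cost of loading everything first.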
In the Java world, the Apache XML Project's Xerces parser is probably the most popular and full-featured, and it includes SAX2 and DOM2 support. Where size matters (as in Java applets), the AElfred parser (now allegedly bundled in the GNU JAXP project) has a very small footprint and SAX2 support.
Most Perl XML packages rely on the XML::Parser module, which you will need to install before you can use anything else.
In the C/C++ world, the SAX and DOM interfaces are not standardized as well as they are for Java, so some parser dependencies are inevitable. Apache's C++ version of Xerces is becoming stable and more feature-rich, though it usually lags behind the Java version. For a smaller footprint, you can use the C-based Expat parser, which provides a streaming interface similar (but not identical) to SAX. Expat provides the core XML support for both Perl and Mozilla, so it is very stable and well-supported. For other Open Source projects (including most Gnome software), the C-based XML parser of choice is libxml. Like Expat, libxml has been heavily tested in demanding environments.
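Expat's callback style is easy to try without writing any C, because Python's standard library wraps Expat as xml.parsers.expat. The sketch below is only an illustration: the helper names element_names and is_well_formed are my own. It registers a start-element callback and uses the parser's error reporting as a crude well-formedness check.

```python
import xml.parsers.expat


def element_names(text):
    """Return element names in document order, collected via an Expat callback."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat handlers are plain callables assigned to parser attributes,
    # rather than methods on a SAX ContentHandler object.
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(text, True)  # True marks this as the final chunk of input
    return names


def is_well_formed(text):
    """Crude batch-style check: did Expat get through the document without error?"""
    try:
        element_names(text)
        return True
    except xml.parsers.expat.ExpatError:
        return False


if __name__ == "__main__":
    print(element_names("<a><b/><c>x</c></a>"))
    print(is_well_formed("<a><b></a>"))
```

Note that this checks well-formedness only. Expat is a non-validating parser, so it will not tell you whether a document conforms to a DTD.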
A transformation engine is a library or program that can modify an XML document automatically according to a set of rules, much as sed or awk do for ordinary text. Instead of regular expressions, most transformation engines read rules from a separate XML file using a format called XSLT.
Some XSLT engines are available as libraries that you can link into your applications, and most are available standalone (for use in batch scripts). Unfortunately, XSLT libraries have no standard interfaces like SAX and DOM for XML parsers, so once you've coded for one XSLT library, switching to another might require a fair bit of refactoring. You need to spend more time up front ensuring that you have the right library for your needs.
In the Java world, the two most popular XSLT engines are Michael Kay's SAXON and the Apache XML Project's Xalan. James Clark's XT, the original reference XSLT engine, is not as full-featured or actively maintained as the others, but is still solid.
C++-based XSLT engines can run slightly faster than their Java-based cousins, but, in general, the C/C++ versions are not as far developed or debugged, so crashes or missing features are much more common. Apache's Xalan-C++ is probably the most stable of them.
In Perl or Python, your best bet is to invoke either a Java- or C++-based XSLT engine as an external process (or to wrap one in Python), though Perl does have a native XML::XSLT module for those who want to try it.
Note that XSLT is not your only option for transformations, and is often not the best one. Since XSLT has a tree-based data model (usually in-memory), it can be extremely slow and resource-hungry for larger XML documents, making it inappropriate for high-demand server-side use. Sometimes there is no alternative but to custom-code your transformation in Java, C/C++, Perl, or Python, working directly with the streaming parser output. SAXON has some support for on-the-fly streaming transformations, so it might be worth investigating if speed or resource usage is an issue.
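As a rough illustration of the custom-coded route, the following Python sketch performs a trivial transformation (renaming one element) directly on the streaming parser output, without ever building a tree. It uses XMLFilterBase and XMLGenerator from the standard library's xml.sax.saxutils. The RenameElement class, the rename_stream helper, and the sample element names are hypothetical, and a real transformation would need more logic than this.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameElement(XMLFilterBase):
    """SAX filter that renames one element and passes every other event through."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old, self._new = old, new

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


def rename_stream(text, old, new):
    """Parse text, rename old -> new on the fly, and return the serialized result."""
    out = io.StringIO()
    filt = RenameElement(xml.sax.make_parser(), old, new)
    # XMLGenerator turns the filtered events back into XML text.
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.BytesIO(text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(rename_stream("<catalog><item>bolt</item></catalog>", "item", "entry"))
```

Because each event is written out as soon as it arrives, memory use stays flat no matter how large the input is. The price is that the filter cannot look ahead or back in the document the way an XSLT template can.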
An XML browser is a viewer that can display an arbitrary XML document in a way optimized for human users. Since XML data can represent just about anything, that's not a trivial problem. There are two main types of general-purpose XML browsers:
Tree-oriented browsers have some (limited) value for technical specialists navigating through a tree of hierarchical data structures, but their main advantage is that they are easy to write (just hook a tree widget up to the DOM), so there are many of them available. If you have a free evening, you can roll your own with (say) the Java Swing JTree widget.
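To show how little code the tree-oriented approach needs, here is a text-only sketch in Python that walks a DOM and prints one indented line per node. It stands in for a real tree widget (it is not a Swing JTree), and the tree_lines and render_tree names are my own invention.

```python
from xml.dom import minidom


def tree_lines(node, depth=0):
    """Return one indented line per element or non-blank text node under node."""
    lines = []
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        attrs = "".join(f' {k}="{v}"' for k, v in node.attributes.items())
        lines.append(f"{indent}<{node.tagName}{attrs}>")
        for child in node.childNodes:
            lines.extend(tree_lines(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Whitespace-only text nodes are skipped to keep the display readable.
        lines.append(indent + repr(node.data.strip()))
    return lines


def render_tree(text):
    """Parse text into a DOM and render it as an indented outline."""
    return "\n".join(tree_lines(minidom.parseString(text).documentElement))


if __name__ == "__main__":
    print(render_tree("<report id='7'><title>Budget</title><body/></report>"))
```

A graphical version would replace the list of strings with calls that add nodes to a tree widget, but the recursive walk over the DOM stays the same.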
For most serious work, you will need a document-oriented browser. The bad news is that they are very hard to write and maintain. The good news is that the two major ones are also two of the world's best-known software applications: the closed source Microsoft Internet Explorer (version 5+) and the Open Source Mozilla (or Netscape version 6+). Both can take an arbitrary XML document, together with a stylesheet in XSLT or CSS format, and render it so that it looks like a regular HTML page.
XML editors face the same problem as XML browsers: There is no single, obvious editing interface for all types of XML data. What is appropriate for an XML-based government report, for example, is probably not appropriate for an XML-based vector graphic.
As with browsers, there are two major types of general-purpose XML editors:
In nearly all cases, you will want either a document-based editor (possibly with a tree-based sidebar to help with navigation) or a customized form interface (in which case you may need to build from the parser level up). Some editors allow the use of DTDs or Schemas for guided authoring, which can be extremely valuable.
The one robust, production-grade, general-purpose, Open Source XML editor is the PSGML package for Emacs and XEmacs. PSGML uses DTDs for guided authoring and autocompletion, but it has extremely limited visual capabilities (mainly just markup highlighting). Unfortunately, it has no facility for turning off DTD validation, so it produces many annoying errors and warnings when working with XML documents that have no fixed DTD.
Another fairly robust (but restricted and closed source) XML editor is Henry Thompson's XED, which uses heuristics rather than DTD validation for guided authoring.
Regular expressions and proximity tests are useful for full-text searches on prose documents, while tightly-structured SQL queries are useful for searches through relational data tables. XML falls somewhere between, and does not yet have a finalized query language of its own (the XML Query language is still under development).
As a result, the most common way to query an XML document programmatically (without setting up a full-fledged XML-oriented database and search engine) is through XPath, a simple language for specifying a location in an XML document. Many XML-based tools, including all XSLT engines, have XPath support built in. If you are designing an application that requires standalone XPath support, a good bet is Jaxen.
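For a feel of what an XPath query looks like in practice, here is a small Python sketch. It uses xml.etree.ElementTree from the standard library, which understands only a limited subset of XPath (it is not a full XPath engine, and it is not Jaxen). The library document and the titles_for_year helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
DOC = """<library>
  <book year="1999"><title>XML Basics</title></book>
  <book year="2001"><title>XSLT in Practice</title></book>
</library>"""


def titles_for_year(text, year):
    """Return titles of books whose year attribute equals the given value."""
    root = ET.fromstring(text)
    # Path expression: child <book> elements filtered by an attribute
    # predicate, then their <title> children.
    return [t.text for t in root.findall(f"./book[@year='{year}']/title")]


if __name__ == "__main__":
    print(titles_for_year(DOC, "2001"))
```

The path expression names a location in the document rather than describing how to walk there, which is what makes XPath convenient for simple queries. For anything beyond this subset (functions, arbitrary axes), you would need a complete XPath implementation.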
Open Source XML search engines should all be considered experimental at this point, since none has been widely implemented, and (as mentioned above) there is not yet a finished XML Query language.