JCPP is a complete, compliant, standalone, pure Java implementation of the C preprocessor. It is intended to be of use to people writing C-style compilers in Java using tools like sablecc, antlr, JLex, CUP, and so forth. It has been used to successfully preprocess much of the source code of the GNU C library.
GNU m4 is an implementation of the traditional Unix macro processor. It is mostly SVR4 compatible, although it has some extensions (for example, handling more than 9 positional parameters to macros). GNU m4 also has built-in functions for including files, running shell commands, doing arithmetic, etc. Autoconf needs GNU m4 for generating `configure' scripts, but not for running them.
pyPEG is a quick and easy solution for creating a parser in Python programs. pyPEG uses a PEG language in Python data structures to parse, so it can be used dynamically to parse nearly every context free language. The output is a plain Python data structure called pyAST, or, as an alternative, XML.
SILVERCODERS DocToText is a powerful utility which can convert documents in many formats to plain text. It includes a console application and C/C++ library, which allows embedding text extraction mechanisms into other applications. It supports MS Office binary formats (MS Word (DOC), MS Excel (XLS, XLSB), MS PowerPoint (PPT), and Rich Text Format (RTF)), OpenDocument formats (text documents (ODT), spreadsheets (ODS), presentations (ODP) and graphics (ODG)), Office Open XML formats (MS Word (DOCX), MS Excel (XLSX), and MS PowerPoint (PPTX)), iWork formats (PAGES, NUMBERS, KEYNOTE), OpenDocument Flat XML formats (FODP, FODS, FODT), Portable Document Format (PDF), Email files (EML), and HyperText Markup Language (HTML). DocToText can extract text not only from the document body but also from annotations (comments) embedded in odt, doc, docx, or rtf files and read metadata like author, last modification date, or number of pages. It can be used as a fast console viewer, and is able to convert corrupted OpenDocument and Office Open XML documents. It can be used to recover text even if other recovery methods failed.
HTMLDOC converts HTML files and Web pages into indexed HTML, PostScript, and PDF files suitable for online viewing and printing. It can be used as a standalone GUI application, in a batch document processing environment, as a Web-based report generation application, or in embedded environments to support printing of HTML content. It runs on all Unix platforms as well as Mac OS X and Windows 2000 and higher.
Mini-XML is a small XML parsing library that you can use to read XML and XML-like data files in your application without requiring large non-standard libraries. It only requires an ANSI C compatible compiler (GCC works, as do most vendors' ANSI C compilers) and a "make" program. It supports reading of UTF-8 and UTF-16 and writing of UTF-8 encoded XML strings and files, and provides a hierarchical view of the file via a linked-list tree structure of typed nodes and functions for managing, traversing, indexing, and searching the tree.
libextractor is a library used to extract meta-data from files of arbitrary type. It is designed to use helper-libraries to perform the actual extraction, and to be trivially extendable by linking against external extractors for additional file types. The goal is to provide developers of file-sharing networks, file managers, and WWW-indexing bots with a universal library to obtain meta-data about files. It includes a shell-command and bindings for Java (JNI) and Python.