cpdetector is a small yet clever framework for codepage detection that integrates different strategies. It may be used as a library by third-party software that accesses textual data over a network. It also includes a best-practice implementation in the form of a command line tool that allows sorting and transforming large collections of documents based on their codepage. Available strategies include jchardet (exclusion, frequency analysis, and guessing), detection of the HTML charset property, and detection of the XML encoding declaration.
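The idea of integrating several strategies can be sketched with the JDK alone. The following is a minimal illustration, not the cpdetector API: the class `DetectionChainSketch`, the `Strategy` interface, and both regular expressions are hypothetical names and simplifications chosen for this example. Each strategy either returns a charset or declines, and the first one that decides wins. Only the XML encoding declaration and the HTML charset property are covered; statistical guessing as done by jchardet is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: all names here are hypothetical, not cpdetector classes.
class DetectionChainSketch {

    /** One detection strategy: returns a charset if it can decide, otherwise empty. */
    interface Strategy {
        Optional<Charset> detect(byte[] head);
    }

    // Simplified patterns (assumption): real documents may need more lenient parsing.
    private static final Pattern XML_DECL = Pattern.compile(
        "^\\s*<\\?xml[^>]*encoding\\s*=\\s*[\"']([^\"']+)[\"']");
    private static final Pattern HTML_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Searches the leading bytes for a declared charset name and resolves it. */
    private static Optional<Charset> lookup(Pattern pattern, byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup stays readable.
        Matcher m = pattern.matcher(new String(head, StandardCharsets.ISO_8859_1));
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Declared name is illegal or unsupported: let the next strategy try.
            return Optional.empty();
        }
    }

    /** Strategies in priority order: XML declaration first, then HTML charset. */
    static final List<Strategy> DEFAULT_CHAIN = Arrays.<Strategy>asList(
        head -> lookup(XML_DECL, head),
        head -> lookup(HTML_CHARSET, head));

    /** Asks each strategy in turn and returns the first definite answer. */
    static Optional<Charset> detect(byte[] head) {
        for (Strategy strategy : DEFAULT_CHAIN) {
            Optional<Charset> result = strategy.detect(head);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass the first few kilobytes of a document to `DetectionChainSketch.detect` and fall back to a default charset when the result is empty.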
Tags: Communications, Information Management, Internet, Web, Indexing/Search, Software Development, Internationalization, Libraries, Java Libraries
Release Notes: This release fixes a crash in command line mode when an invalid declared charset (the "" charset) was found. The command line tool (CodepageProcessor) no longer returns 0 when an error occurs. A bug that broke the ability to reset input streams after detection was fixed.
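Two of the issues mentioned in this release can be illustrated with plain JDK calls. This is a hedged sketch and not the project's actual fix: the class `ResettableDetectionSketch`, its methods, and the 4096-byte limit are assumptions made for the example. It shows reading a bounded prefix of a stream and rewinding it with `mark`/`reset`, and resolving a declared charset name without crashing on an illegal name such as the empty string.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.Arrays;

// Illustrative sketch only: hypothetical names, not cpdetector classes.
class ResettableDetectionSketch {

    /** Assumed upper bound on how many bytes a detector may inspect. */
    static final int PEEK_LIMIT = 4096;

    /** Reads up to PEEK_LIMIT bytes for inspection, then rewinds the stream. */
    static byte[] peek(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(PEEK_LIMIT);
            byte[] buffer = new byte[PEEK_LIMIT];
            int total = 0;
            int n;
            // Stop at the limit so the mark stays valid, or at end of stream.
            while (total < buffer.length
                    && (n = in.read(buffer, total, buffer.length - total)) > 0) {
                total += n;
            }
            in.reset(); // the caller sees the stream from its original position
            return Arrays.copyOf(buffer, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Resolves a declared charset name, falling back instead of throwing. */
    static Charset charsetOrDefault(String declared, Charset fallback) {
        try {
            return Charset.forName(declared);
        } catch (IllegalArgumentException e) {
            // Covers illegal names (such as ""), unsupported names, and null.
            return fallback;
        }
    }
}
```

Streams that do not support marking, such as a raw `FileInputStream`, can be wrapped in a `BufferedInputStream` before calling `peek`.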
Release Notes: This major bugfix release fixes two issues in command line batch mode. The switch to skip moving undetected documents works again. No attempt is made to transcode undetected documents anymore (this had caused exceptional program flow).
Release Notes: This version is a stability release that fixes byte order mark detection and an incompatibility with OpenJDK. It now requires Java 1.5.
Release Notes: The release structure has been changed: cpdetector.jar does not contain 3rd party library files anymore. Public functions that were missing are included again. The proguard shrinker has been updated from version 3.8 to 4.2.
Release Notes: The proguard shrinker is now used, so the cpdetector jar is now more than ten times smaller. System.out is no longer used for logging in JChardetFacade. All packages were renamed with the prefix "info.monitorenter".