webbase

webbase is an Internet crawler. It can crawl and maintain millions of URLs and store information about them in a MySQL database. It is used either as a command-line program or as a C library, and it provides hooks for plugging in a full-text indexing database.

Recent releases

  •  10 Sep 2001 13:58

    Release Notes: The exploration will now stop after or before loading a URL only - touch to force loading even if the content of the URL is in the database. Fixed an index updating bug that removed documents from the index when the crawler found them Not Modified. Upgraded the md5 code to GPL, and made other small utility fixes.

  •  15 Jan 2001 02:46

    Release Notes: -version now shows the version number. An allocation error when updating the full-text index has been fixed, and the handling of name server timeout conditions has been optimized. /etc/my.cnf, ~/.my.cnf, and datadir/my.cnf are now used instead of ~/.my.cnf alone.

  •  28 Dec 2000 22:48

    Release Notes: Dynamic updating of the full-text index was implemented. Fixes were made for a last-modified time update bug, a namespace conflict with mysql-3.23.19a-gamma, and a bug that artificially left the start point in a virgin state.

  •  23 Dec 2000 19:54

    Release Notes: This release adds a -crawlers option to run simultaneous crawlers, a signal handling function for graceful interruption of the crawlers, and the ability for the url, url_complete, and url_content tables to grow beyond 4GB. The hook library is dynamically loadable with the -hook option so that specific full-text indexing strategies can be implemented as plugins. The -where_url option is taken into account when rebuilding the full-text index with -rebuild. Extensions and MIME types have been added to the list of known MIME types. The auth field of the start table was removed because it was not used.

  •  27 Oct 2000 16:03

    Release Notes: The crawler manual page was completely reviewed for correctness. Bug fixes were made in the mifluz interface. The -agent option was implemented. The -show option family was added to display all URL information from an exploration starting point. The configuration script was improved. Major leaks and concurrency problems were fixed in the langrec interface. The scope of the allow/disallow comparison was widened to include CGI parameters. Code to use .my.cnf files (if any) was restored.
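
The index-update fix noted for the 10 Sep 2001 release can be pictured with a small sketch. This is not webbase's code (webbase is a C program and library); the function name `index_action` and the action labels are hypothetical, and only the HTTP status codes are standard. The idea is that a Not Modified reply means the stored copy is still valid, so the document should stay in the index.

```python
from http import HTTPStatus


def index_action(status: int) -> str:
    """Map the HTTP status of a re-crawl to a hypothetical index action."""
    if status == HTTPStatus.NOT_MODIFIED:
        # 304: the stored copy is still current, so keep the index entry.
        return "keep"
    if status == HTTPStatus.OK:
        # 200: fresh content was loaded, so the entry should be re-indexed.
        return "update"
    if status in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        # 404/410: the document no longer exists, so drop it from the index.
        return "remove"
    # Assumption of this sketch: any other status leaves the entry untouched.
    return "keep"
```

Under these assumptions, the bug described in the notes would correspond to returning the removal action for status 304 instead of keeping the entry.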
