Release Notes: This version improves crawl stability and has been used in a crawl of one third of a billion pages. The indexing plugin API was improved to allow plugins to have configuration screens. A new example Word Filter plugin has been added. Yioop can now crawl Tor networks. A Manage Groups pane has been added to Yioop.
Release Notes: This version includes a new hybrid inverted index/suffix tree indexing scheme that should make calculating search results from future crawls faster (it does not affect old crawls). It can make use of HTTP ETag and Expires header information when deciding whether to download a URL it has seen before. It also supports the creation of classifiers using active learning. These can be used to label and add scoring information to documents during a crawl. This release includes improvements to the RSS feed news_updater and a segmenter for Chinese.
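As an illustration of how ETag and Expires information can inform a re-download decision, here is a minimal Python sketch. It is not Yioop's code: the function name should_redownload and the dictionary of cached headers are hypothetical, and the logic shown (skip while Expires is in the future, otherwise revalidate with If-None-Match) is only one plausible policy.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def should_redownload(cached_headers, now):
    """Decide whether to fetch a previously seen URL again.

    cached_headers: headers stored from the last download (hypothetical store).
    Returns (fetch, extra_request_headers).
    """
    expires = cached_headers.get("Expires")
    if expires:
        try:
            # Still fresh: the cached copy has not expired yet.
            if parsedate_to_datetime(expires) > now:
                return False, {}
        except (TypeError, ValueError):
            pass  # unparsable or naive date: treat as already expired
    etag = cached_headers.get("ETag")
    if etag:
        # Let the server answer 304 Not Modified if the content is unchanged.
        return True, {"If-None-Match": etag}
    return True, {}


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_redownload({"ETag": '"abc"'}, now))
```

A crawler using this policy would send the returned headers with its request and keep the cached copy whenever the server responds with status 304.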
Release Notes: This release adds a simple language called Page Rules for controlling how data is extracted from webpages during the summary creation phase of indexing. It also adds the ability to index records coming from a database query and adds a generic text importer which works on plain text, gzipped, and bzipped text records. Other features in this version of Yioop are Atom support as a News Feed Search Source, a dedicated process news_updater.php for handling news updates, and a better algorithm for distributing archive data during an archive crawl. Many other minor improvements have been made.
Release Notes: This release supports materializing query-based combinations of old search indexes (crawl mixes) as new indexes. This should make query performance of crawl mixes much better. Cache pages of search results now have a new history UI which allows you to search cache pages in all indexes you have, much as the Internet Archive does. Yioop now supports spell corrections on searches after they have been performed, and it has an API for transliterating between Roman and other scripts. Query performance has been improved over previous versions, and many minor bugs have been fixed.
Release Notes: This release adds an activity to manage search media sources. For now, one can add Video and RSS sources. When configured, RSS feeds are downloaded hourly and integrated into search results. Also new is a command line tool for configuring Yioop in VPS settings. An Italian stemmer has been added, as well as more translations. This version implements some important bug fixes in robot handling, along with unit tests for them. Yioop! now works in PHP 5.4 as well as PHP 5.3, and works better with more recent versions of XAMPP on Windows.
Release Notes: This release adds image and video subsearch abilities and improves the formatting of Yioop on smartphones. Exact phrase searching is now done using only posting list information, rather than regular expressions. Archive crawls can now, like Web crawls, exploit the presence of multiple queue_servers. Index iterators and summary lookup operations have been factored apart to better optimize Yioop search speed. Improvements have also been made to keyword suggestions.
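To show what exact phrase matching from posting list information can look like, here is a small Python sketch. It assumes a positional index stored as a dictionary mapping each term to the documents and word positions where it occurs. That layout and the function name phrase_docs are illustrative assumptions, not Yioop's actual data structures.

```python
def phrase_docs(index, phrase):
    """Return ids of documents containing the exact phrase.

    index: term -> {doc_id: sorted list of word positions} (assumed layout).
    """
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc in sorted(candidates):
        later = [set(index[t][doc]) for t in terms[1:]]
        # The phrase occurs if, for some start position of the first term,
        # each following term appears at the next consecutive position.
        if any(all(start + i + 1 in pos for i, pos in enumerate(later))
               for start in index[terms[0]][doc]):
            hits.append(doc)
    return hits


if __name__ == "__main__":
    demo = {"open": {1: [0]}, "source": {1: [1]}}
    print(phrase_docs(demo, "open source"))
```

No document text is scanned here. Only term positions are compared, which is the point of replacing a regular expression pass over the documents.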
Release Notes: This release adds initial support for word suggestions as a user types in queries. The bigramming used to speed up common two-word queries now works with n-word grams. N-word gram filter files can now be created using Wikipedia raw page count dumps. This version adds support for * and $ in allowed- and disallowed-to-crawl sites. Using this, the user can crawl sites to a fixed depth. Robots.txt processing now supports * and $ in robots.txt paths. Support for NOSNIPPET, NOARCHIVE, and X-Robots-Tag HTTP headers has also been implemented. A tool for editing search summaries after a crawl has also been added.
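The * and $ syntax in robots.txt paths can be illustrated with a short Python sketch. It assumes the commonly used semantics: * matches any run of characters, a trailing $ anchors the end of the path, and a pattern without $ is a prefix match. The helper names are hypothetical and this is not Yioop's implementation.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def path_matches(pattern, path):
    """True if the URL path is covered by the robots.txt pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(path_matches("/private/*.php$", "/private/a/b.php"))
```

Under these assumptions, a rule such as `Disallow: /*.gif$` blocks paths that end in .gif while leaving a path like /img/a.gift crawlable.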
Release Notes: The crawler now has its own DNS caching mechanism independent of cURL's. Yioop now has a detection mechanism for when websites are becoming congested. The user can also set a quota on the number of URLs downloaded per hour from sites. A web crawl statistics page can now be generated for a crawl. Bugs in robots.txt handling and in archive handling, which were introduced in 0.82, have been fixed. The demo site now features an example crawl of 100 million pages crawled with the previous version of the software.
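A per-site hourly download quota of the kind described above could be tracked roughly as follows. This Python sketch uses fixed one-hour buckets per host. The class name HourlyQuota and the bucket scheme are assumptions chosen for illustration, not a description of how Yioop enforces its quota.

```python
class HourlyQuota:
    """Allow at most limit_per_hour downloads per host in each clock hour."""

    def __init__(self, limit_per_hour):
        self.limit = limit_per_hour
        self.counts = {}  # host -> (hour_bucket, downloads_in_bucket)

    def allow(self, host, now_seconds):
        """Record and permit a download unless the host's quota is used up."""
        bucket = int(now_seconds // 3600)
        last_bucket, count = self.counts.get(host, (bucket, 0))
        if last_bucket != bucket:
            count = 0  # a new hour has started, so the count resets
        if count >= self.limit:
            return False
        self.counts[host] = (bucket, count + 1)
        return True


if __name__ == "__main__":
    quota = HourlyQuota(2)
    print([quota.allow("example.com", t) for t in (0, 10, 20, 3600)])
```

A fetcher would call allow() before each request and defer the URL to a later hour when it returns False.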
Release Notes: This release improves scalability by allowing multiple machines to maintain portions of the "to crawl next" queue. Query processing can also be split amongst machines, with different machines being responsible for documents of a given hash. Yioop! now supports mirroring of machines. Two-word phrases as determined by an XML file such as a Wikipedia URL dump can now be treated as a logical unit. The Yioop! model-view-controller framework has been made easier to extend and documentation for it has been added to the website.
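Assigning documents to machines by hash can be sketched in a few lines of Python. The choice of SHA-256, the use of the document URL as the key, and the function name machine_for are all assumptions for illustration. The notes above do not say which hash or key Yioop uses.

```python
import hashlib


def machine_for(doc_key, num_machines):
    """Map a document key (for example its URL) to a machine index."""
    digest = hashlib.sha256(doc_key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce modulo the machine count.
    return int.from_bytes(digest[:8], "big") % num_machines


if __name__ == "__main__":
    for url in ("http://example.com/a", "http://example.com/b"):
        print(url, "->", machine_for(url, 4))
```

Because the mapping is deterministic, the same document is always routed to the same machine, both when it is indexed and when a query needs to be answered for it.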
Release Notes: This version supports starting, stopping, and viewing log files of the queue server and fetchers from a Web interface. One can now inject new URLs into an active crawl via a Web interface. This version of Yioop! supports re-crawling of pages after a fixed number of days. Also, the file extensions that are crawled, the number of bytes downloaded per page, and how Yioop! weighs different page components can now all be controlled through a Web interface rather than just the config.php file. Improvements have also been made to how the HTML Processor extracts text to index.