The most visible change is the switch of the embedded database from GDBM to QDBM.
QDBM's advantages are:
The gatherer is now much faster on systems that gather over fast links such as a local network. In this scenario, the bottleneck still seems to be database access, especially when gathering large amounts (several GB) of data on current hardware (ix86, 650 MHz, IDE hard disks).
While QDBM increased the gathering speed considerably in this scenario, the gatherer still suffers from periods of heavy hard disk activity during which no collection actually takes place. This happens much later than with GDBM, but it still needs investigation. It seems to be a filesystem buffer issue in the underlying OS, which may be why some databases access the disk raw, bypassing the filesystem layer of the OS.
Of course, the improved speed won't be noticeable in situations where bandwidth is the main limiting factor, and there is still some work to be done.
In a further attempt to speed up gathering, the default configuration was changed to not sort keywords, which saves two forks (sort and uniq) for each document. This makes summaries larger, but I have to check whether I can get rid of the "keywords" attribute altogether or have to write a function to create the word list. However, this will depend on how we map SOIF to XML.
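As a rough illustration of the "function to create the word list" option, the sketch below does the work of sort and uniq in-process, so no child processes are forked. It is not Harvest code: the name make_word_list, the lowercasing, and the whitespace tokenization are assumptions made for this example, and Python is used only for brevity.

```python
def make_word_list(text):
    """Return the unique words of `text` in sorted order, without forking.

    Hypothetical helper: the name and the tokenization rules are
    assumptions for illustration, not part of Harvest.
    """
    # Lowercase and split on whitespace; a real summarizer may
    # tokenize differently (punctuation, stop words, and so on).
    words = text.lower().split()
    # set() drops duplicates (the job of uniq),
    # sorted() orders the result (the job of sort).
    return sorted(set(words))


if __name__ == "__main__":
    sample = "harvest gatherer QDBM harvest broker qdbm"
    print(" ".join(make_word_list(sample)))
```

Whether this is cheaper overall than leaving keywords unsorted would still have to be measured against the larger summaries.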
I will also get rid of the URI attribute in summaries, which was just introduced to improve the search results when using Glimpse.
The current integration of QDBM into Harvest is ugly but makes it easy to tweak the parameters. It will be cleaned up once the QDBM parameters have been fine-tuned.