Right now we are trying to build our indices with the stock gw that came with our commercial Webinator, and the process is too slow. From this discussion archive we learned that it is not really possible to run multiple gw's against a single database reliably. With a single gw (using the -5 option, which for some reason yields at most 2 simultaneous walker processes) we can index 7-10 thousand pages a day (200-300 MB/day) on a 1 Mbps+ link. At this speed it would take us more than 100 days to index our total domain sample, which is not acceptable for our purposes.
A scripted walker looked like a possible answer to this problem and to some of our other requirements (such as pre-screening pages before they go into the database), but it is effectively unusable for us under the current limit on table size.
I understand that the answer may simply be that we must buy full Texis to do a Webinator-class job, but we hope a less drastic solution is possible.
You must have other issues constraining Webinator's speed here. It's not that slow, and never has been. On almost any reasonable hardware platform available today, Webinator 2 has been able to acquire more than 1000 pages/minute.
* Check your DNS. Make sure that it's local and resolving names quickly.
* Make sure you have enough free RAM. You should have at least 25-30 MB free for each thread you want to run (this may be why you could only run 2 walkers when you asked for 5). In addition, you should have at least 50 MB free for the OS so that it can perform adequate disk caching.
* Don't use network mounted drives. Use only local disks, and preferably ones that have superior seek and read/write performance.
* Check your CPU to ensure some other process is not consuming too much CPU time or I/O channel bandwidth.
* Check your network to ensure that it is performing up to spec. Routing delays, packet loss, and saturation can really slow you down. Incidentally, you can test Webinator's true speed by indexing a local machine.
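The DNS item in the checklist above is easy to sanity-check outside of gw. The sketch below is a generic Python diagnostic (not part of Webinator); it times repeated hostname lookups, and consistently slow results point at a remote or overloaded resolver that will throttle any crawler.

```python
import socket
import time

def dns_latency_ms(hostname, samples=5):
    """Time repeated forward lookups of `hostname`; return the average in ms.

    A healthy local resolver answers cached queries in well under a few
    milliseconds; tens of milliseconds per lookup adds up fast for a
    crawler resolving thousands of hosts.
    """
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        socket.gethostbyname(hostname)  # one forward DNS lookup
        total += time.perf_counter() - start
    return (total / samples) * 1000.0

if __name__ == "__main__":
    # "localhost" resolves locally; substitute a host you actually crawl.
    print("avg lookup: %.2f ms" % dns_latency_ms("localhost"))
```

Run it once against a host you crawl and once against a well-known host; if both are slow, the resolver (not the target site) is the bottleneck.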
Webinator 2-3 spends most of its time recording links to the TODO table and checking the HTML table for pre-existing URLs. Running more than 5 crawler threads on a normal machine usually saturates the I/O channel, but that is only when you are acquiring new pages at an extreme rate.
Webinator 4 does not rely on the TODO table to the extent that 2-3 does. It does most of its accounting work in RAM. This allowed us to devote most of the disk channel to writing newly acquired pages, but at the cost of more RAM per process.
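The accounting difference described above can be modeled outside Texis. The sketch below is a hypothetical illustration only, using Python's sqlite3 as a stand-in for an on-disk table (it is not the Texis storage engine): the 2-3-style check hits the disk table once per discovered link, while the 4-style approach keeps the seen-URL set in RAM and touches disk only when storing pages.

```python
import sqlite3

# Stand-in for an on-disk HTML/TODO table; sqlite3 merely models
# "a table that costs a disk lookup per query".
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE html (url TEXT PRIMARY KEY)")

def seen_on_disk(url):
    """Webinator 2-3 style: one table lookup per discovered link."""
    row = db.execute("SELECT 1 FROM html WHERE url = ?", (url,)).fetchone()
    return row is not None

seen_in_ram = set()  # Webinator 4 style: dedup accounting held in RAM

def record(url):
    """Record a newly acquired page in both models."""
    db.execute("INSERT OR IGNORE INTO html (url) VALUES (?)", (url,))
    seen_in_ram.add(url)

record("http://example.com/")
# Both models give the same answer; the in-RAM check simply avoids
# the per-link disk hit, at the cost of more memory per process.
print(seen_on_disk("http://example.com/"),
      "http://example.com/" in seen_in_ram)  # → True True
```

The trade-off is exactly the one stated above: the disk channel is freed for writing newly acquired pages, but each crawl process needs enough RAM to hold its URL accounting.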
It seems that the bulk of the time is actually spent in
> gw ... -wipetodo -noindex
which we run after walking each site.
I have made a posting with details on this at http://thunderstone.master.com/texis/ma ... =39edd1693
> We're working on a solution to your current issue (30M). We're discussing it
> and will get back to you very soon.
Are there any updates on this? (30M limit on Vortex-accessed .tbl files)
Your slowness issues with gw should have been resolved by our last response in the other thread (-O and gw -st "delete from options where Name='URL'"), which should also have eliminated your need for the scripted walker.
We will be announcing the Webinator 4 release and license levels in the coming weeks.