Slow gw -wipetodo

valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Slow gw -wipetodo

Post by valery »

Hi,

We are using Commercial Webinator to index several thousand websites in our link collection. After running gw on each website, we do
> gw -wipetodo -noindex
This is the snippet from gw.log:
_________________________________
2001/04/29 02:40:01 End (6417) Visited 8 pages total
2001/04/29 02:40:01 Begin (18305) /home/httpd/html/BIOZAK/webinator/bin/gw -d/home/httpd/html/BIOZAK/webinator/db -wipetodo -noindex
2001/04/29 02:52:30 End (18305) Visited 0 pages total
-----------------------------

That is gw -wipetodo -noindex taking 12 minutes 29 seconds on a 700 MHz Pentium III with 792 MB RAM. CPU utilization was virtually 0%, so I am not sure where it spends all this time; it shouldn't be doing any networking while it is wipetodo-ing, right?
From the log file, you can see that the previous site contributed only 8 pages to the database, so its todo list can't be very large.
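
For anyone who wants to check the timing on their own gw.log, here is a small Python sketch that pairs Begin and End lines by the process ID in parentheses and prints the elapsed time. It assumes the log lines look exactly like the snippet above; the function name run_durations is made up for illustration and is not part of Webinator.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2001/04/29 02:40:01 Begin (18305) /path/to/gw ... -wipetodo -noindex
#   2001/04/29 02:52:30 End (18305) Visited 0 pages total
LINE_RE = re.compile(
    r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (Begin|End) \((\d+)\)"
)


def run_durations(lines):
    """Return {pid: timedelta} for every Begin/End pair found in the lines."""
    started = {}
    durations = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a Begin/End line
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        event, pid = match.group(2), int(match.group(3))
        if event == "Begin":
            started[pid] = stamp
        elif pid in started:
            # End line for a run whose Begin we have seen
            durations[pid] = stamp - started.pop(pid)
    return durations


sample = [
    "2001/04/29 02:40:01 End (6417) Visited 8 pages total",
    "2001/04/29 02:40:01 Begin (18305) /home/httpd/html/BIOZAK/webinator/bin/gw"
    " -d/home/httpd/html/BIOZAK/webinator/db -wipetodo -noindex",
    "2001/04/29 02:52:30 End (18305) Visited 0 pages total",
]
for pid, elapsed in run_durations(sample).items():
    print(pid, elapsed)  # 18305 0:12:29
```

The End line for process 6417 is skipped because its Begin line is not in the snippet.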

Any ideas?

Also, there is another peculiarity we noticed:
We run the walker on batches of 200-500 sites at a time. At the start of one of the batches it gave us
_______________________________
2001/04/29 02:25:59 Creating Unique index on Non-unique data
----------------------------
even though we never run gw without -unique on this database.
How could this happen?

Thanks for your support. Please tell us if you need any additional information.

Valery
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Slow gw -wipetodo

Post by bart »

This is probably due to lock contention for the TODO table by other running Webinator processes.

Judging from this and other postings, and from looking at your company website, it appears that the manner in which you are operating Webinator is outside both the program's intended scope of operation and its license limitations.

Webinator 2 is (intentionally) missing the components required to make it effective as a global web crawler. Commercial Webinator's intended use is to allow corporations to index information owned or controlled by their enterprise.

You are having so many issues because of the absent pieces of software that turn Webinator into a global indexer. When Thunderstone software is used to index large numbers of domains, a dispatcher process coordinates the efforts of many individual crawlers so that they operate efficiently and with a minimum of overlapping effort. One of our customers maintains an index of 9-10 million domains on a single Sun machine, with single-site latency averaging 45 days.

License fees for software like this are not measured in hundreds of dollars though. This is not an effort to sell you a Texis license, but if you really want to get the job done that you're describing you'll definitely have to spend more than $700. Before you get irritated with us, call around to the other companies who sell comparable software; on average they'll charge substantially more than we do and their software will require a lot more hardware to do the same job.

I'd suggest that you call Thunderstone Sales at +216-631-8544 for more info. Give them the background on what you're trying to do, how many hits/day you expect against the database, and how many records/pages will be indexed. They'll provide you with a quick estimate of both license fees and how long it will take to deploy the software that'll do your job correctly.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Slow gw -wipetodo

Post by bart »

BTW: try deleting SYSLOCKS.* to resolve the lock-contention issue. Make sure no other Webinators are running when you do this.
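
A cautious way to do that check-then-delete step might look like the Python sketch below. It is not part of Webinator and rests on assumptions: that the SYSLOCKS.* files sit in the database directory passed to gw with -d, and that a running Webinator shows up in 'ps -ef' with an executable named gw. The helper names (gw_processes, lock_files, clear_locks) are made up for illustration. By default it only lists the files and deletes nothing.

```python
import glob
import os


def gw_processes(ps_output):
    """Return the `ps -ef` lines whose executable is named gw (assumed name)."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces)
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        executable = fields[7].split()[0]
        if os.path.basename(executable) == "gw":
            hits.append(line)
    return hits


def lock_files(db_dir):
    """List files matching SYSLOCKS.* in db_dir (assumed location)."""
    return sorted(glob.glob(os.path.join(db_dir, "SYSLOCKS.*")))


def clear_locks(db_dir, ps_output, dry_run=True):
    """Refuse to act while gw is running; otherwise list or delete lock files."""
    if gw_processes(ps_output):
        raise RuntimeError("gw is still running; leaving lock files alone")
    files = lock_files(db_dir)
    if not dry_run:
        for path in files:
            os.remove(path)
    return files


# Example use (commented out so nothing is touched by accident):
# import subprocess
# ps = subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
# print(clear_locks("/home/httpd/html/BIOZAK/webinator/db", ps))
```

Run it once with the default dry_run=True to see which files would go, then again with dry_run=False once you are sure no walker is active.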
valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Slow gw -wipetodo

Post by valery »

Hi,

Thank you for your comments. I would rather we not get lost in a legal debate, though, and attend to the technical matter at hand instead (gw -wipetodo).

However, since Thunderstone decided to raise the legal side, I'm going to reply below:

> it appears that manner in which you are operating Webinator is both outside
> the scope of the program's intended operational efficacy as well as its
> license limitations.
> ...
> Commercial Webinator's intended use is to allow corps. to index information
> owned or controlled by their enterprise.
Then please make changes to the license agreement located at
http://www.thunderstone.com/texis/site/ ... mmlic.html:

> ATTACHMENT A: ACCEPTABLE USE of LICENSED PROGRAM
>
> The general purpose of the Licensed Program is to create and provide
> access to searchable indexes of textual materials
> which are accessible via Internet Web Servers.
Nothing is mentioned about INTERNAL documents only.

> During indexing the
> Licensed Program will access documents via the
> HTTP 1.0 protocol. After acquisition and indexing the Licensed Program will
> provide a search interface to the index through an HTTP server.
> You may use the Licensed Program in a manner consistent with its general
> purpose pursuant to the following limits of Acceptable Use:
>
> 1.Client modifications to the Licensed Program's search interface may be
> made to improve styling compatibility with respect to the rest of the Client's
> Web Server.
no violation here

> 2.Ancillary functionality may be added by the Client to increase the efficacy
> of the Licensed Program's search interface with respect to the rest of the
> Client's Web Server.
no violation here

> 3.No modification or added ancillary functionality may be made by the Client
> which would have the effect of increasing the scope of the Licensed Program
> beyond its intended general purpose.
no violation here

> 4.The Licensed Program may not be used to index textual data referenced by
> or residing in any third party relational database owned, managed, or
> controlled by the Client unless the sum of textual data from third party
> database amounts to no more than thirty percent of any index created
> by the Licensed Program.
no violation here. While we do have a medium-sized third-party database on our site, we do not use Texis to access it in any way. The Webinator purchase was made specifically to run general tests of your software and decide whether it is suitable for search functions on our proprietary database.

> 5.The Licensed Program may not be used in any manner which would
> enhance or supplant the capabilities of any third party database program.
no violation here. The Licensed Program does indeed enhance the capabilities of our SERVICE to our clients, but this is what the Licensed Program is supposed to do.

> 6.The Licensed Program must serve as a supportive entity on the Client's
> Web Site as opposed to the primary function or purpose of the Client's Web
> Site.
No violation here. Moreover, none of the Texis product interface is available for public use on our website. As I mentioned above, the purpose of the Webinator license purchase is to test your products, and access to its interfaces is restricted to our internal staff. However, if you believe that you were able to access said interfaces, I would appreciate your sending us the exact address.

Please correct me if I am mistaken in any of the above.

Overall, I understand your position and underlying assumptions.
However, falsely accusing people of license violations does not help
> gw -wipetodo -noindex
run faster.
> This is probably due to lock contention for the TODO table by other running
> Webinator processes.
This was the only running gw process according to 'ps -ef'. The corresponding command runs only after the previous 'gw ... -5 -noindex <website>' has finished. Can the processes leave the locks behind after they have finished/been killed? How can one clear the locks if they do?
Please consider answering these technical questions.

Thanks,
Valery.
valery
Posts: 26
Joined: Thu Mar 15, 2001 9:24 pm

Slow gw -wipetodo

Post by valery »

Sorry, didn't see your last message about SYSLOCKS table. I'll try that, thanks!

Valery.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Slow gw -wipetodo

Post by bart »

There is already a new License Agreement and ATTACHMENT A in the works which will help clarify some of the more liberal interpretations of the current agreement. You're right; my job isn't to debate the legal matters, just to identify where they may occur and attempt to resolve them pre-arbitration. So, I won't digress into a hair-splitting debate over the agreement here.

The matter at hand is really your technical problem. The only reason I mentioned the legal aspect is that your technical problem is caused mostly by using the program in a manner not intended by its design or its acceptable use.

Since all subsequent updates and versions of the $700 Webinator will necessarily be bound by license language that specifically constrains its usage to the purposes we really intended, I felt it might be best to let you know how we felt before you spent a bunch of time trying to work around the programmatic limitations.

My point is a financial one:

We do have software to do what you desire. It's more expensive than C-Webinator, but pretty much off-the-shelf, and you could be running with a fully populated database within days. It should also be more cost-effective than any other software solution you can purchase, or than the cost of lawyers wrangling over C-Webinator's limitations of acceptable use.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Slow gw -wipetodo

Post by mark »