Search Limits?

michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

We are running Webinator 4.3.7 and I am trying to spider a very large set of files on a web-enabled file share. There are approximately 22k files (over 9 GB: Word, Excel, PDF, PPT, ...). Webinator makes it through about 6,500 files (about 1.5 GB). Is there a limit to the amount of content that Webinator can spider?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

When you say "makes it through", what do you mean? Does it get that far and get stuck? Or does it stop as if there were no more? Or something else? What does the walk status say?

There's your license limit, which controls how many pages you can index.

If you have a 32-bit version, you're limited to 2 GB of data in a table file. Check the size of the files in your database.

Check your license with
texis -license
Check your version with
texis -version
michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

It stops as if there is no more.

The Walk Summary says:
Walk Summary ? Last complete walk: 2005-06-29 12:30:58 (took 3 hours 55 minutes 45 seconds)
Success. 6,868 pages (-1,539,673,580 bytes) (936 errors) (994 duplicates)

Here is my License/Version information:
Texis Web Script (Vortex) Copyright (c) 1996-2003 Thunderstone - EPI, Inc.
Commercial Webinator Version 4.02.1042579553 of Jan 14, 2003 (i686-intel-winnt-32)

License Information

Current time is: Jun 29 2005 16:26:54 Eastern Daylight Time
Init since boot: Jun 25 2005 02:04:25 Eastern Daylight Time
License created: Mar 17 2003 11:24:33 Eastern Standard Time
License expires: never
License verified: Oct 2 2003 14:41:22 Eastern Daylight Time
Last verify try: Jun 29 2005 16:14:00 Eastern Daylight Time
Maximum version: Aug 8 2003 23:59:59 Eastern Daylight Time
Serial number: 7430
Current hits a day: 25 since Jun 29 2005 16:08:30 Eastern Daylight Time
Previous days: Jun 28: 885 Jun 27: 724 Jun 26: 655 Jun 25: 51 Jun 24: 12 Jun 23: 451 Jun 22: 604 Jun 21: 679 Jun 20: 789 Jun 19: 355
Highest hits a day: 999
Maximum hits a day: 10000
Current table rows: 0
Maximum table rows: 20000
Current table size: 0
Maximum table size: unlimited
Current database rows: 0
Maximum database rows: unlimited
Current database size: 0
Maximum database size: unlimited
Current total rows: 0
Maximum total rows: unlimited
Current total size: 0
Maximum total size: unlimited
Flags: Non-intranetable Auto growth
Texis flags: Index Trig Del Upd Ins Sel Grant Revoke Viol. no create SSL
Vortex flags: Texis name Env check Comment (c) Visible (c) Create Webinator tables
Texis monitor process: pid 2408
Highest concurrent users since init: 5 at Jun 29 2005 10:36:15 Eastern Daylight Time
Concurrent users in the past minute: 2

Defaults

Equivs path: builtin
User equivs: F:\Thunderstone Software\Webinator/eqvsusr
Vortex log: F:\Thunderstone Software\Webinator/texis/vortex.log
Database: F:\Thunderstone Software\Webinator/texis/testdb
Vortex script: F:\Thunderstone Software\Webinator/texis/testdb/index

F:\iPlanet\Servers\docs\enet\cgiexe>texis -version
Texis Web Script (Vortex) Copyright (c) 1996-2003 Thunderstone - EPI, Inc.
Commercial Webinator Version 4.02.1042579553 of Jan 14, 2003 (i686-intel-winnt-32)
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Look at the Walk Status from the left side menu of the interface, not the summary. It's a whole page unto itself and has much more detail about the process.

You are running a 32-bit version. Check the sizes of the files in the database directory and make sure none are near 2 GB.
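
For anyone who would rather script that size check than eyeball the directory, here is a minimal Python sketch. It is not part of Webinator or Texis; the function name `files_near_limit`, the 90% threshold, and the idea of running it against the database directory are all assumptions made for illustration.

```python
import os

# Largest size a signed 32-bit file offset can represent (2 GiB - 1).
LIMIT = 2**31 - 1


def files_near_limit(directory, fraction=0.9, limit=LIMIT):
    """Return (name, size) pairs for regular files whose size is at least
    fraction * limit bytes, largest first. Hypothetical helper."""
    threshold = int(limit * fraction)
    hits = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                size = entry.stat().st_size
                if size >= threshold:
                    hits.append((entry.name, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys

    # Pass the database directory as the first argument; defaults to ".".
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, size in files_near_limit(directory):
        print(f"{size:>15,}  {name}")
```

An empty result means no file in that directory is within 10% of the 2 GB mark, which would point away from the table-size limit as the cause.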
michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

OK, I checked the db2 directory; the largest file is 500 MB (html.tbl). Here are the results from the status page:
Webinator Walk Report for innovationcenter

Creating database F:\Thunderstone Software\Webinator/texis/innovation/db2...Done.
Walk started at 2005-06-30 04:00:01 (by schedule)
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://xxxxxx/default.asp
Start fetching at http://xxxxxx/projects/
Ignore urls containing any of the following:
/cgi-bin/
~
?

started 1 (3708) on http://xxxxxx/default.asp
started 2 (4256) on http://xxxxxx/projects/
247 pages fetched (71,986,094 bytes) from http://xxxxxx/default.asp
6405 pages fetched (2,147,483,647 bytes) from http://xxxxxx/projects/
6652 pages fetched (-1,610,788,191 bytes) Total
957 errors Total
801 duplicate pages Total

Creating search index on fetched pages...Done.
Verifying usability of new walk.

Walk finished at 2005-06-30 06:38:37 (took 2 hours 36 minutes 8 seconds)

---------------
Looks like the 6405 pages from the projects directory crossed that 2 GB mark...

The rest of the page talks about the errors (duplicate pages, broken hyperlinks, file sizes too big). Let me know if you want to see that also.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Yeah, the Download size rolls over at 2 GB, but that doesn't matter. It's not a limit, just a display issue.
It looks like it thinks it's finished. I guess it's mainly an issue of why your count disagrees with Webinator's. Do you have all the desired extensions listed in your extensions list? Take a look under list/edit urls to see if you can spot something that's missing. If so, find the page that links to the missing file in list/edit urls and click its "Children" link to see whether the missing file is listed and whether there's an error next to it. If it's listed with no error, it was probably rejected due to the settings. Set the verbosity to 4 and do a rewalk-type "new" walk. It will then report the reason for each URL that's discarded.
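
To see why a rolled-over byte count shows up as a negative number, here is a small Python sketch of the arithmetic. It assumes the total is kept in a signed 32-bit two's-complement counter that wrapped exactly once; the thread does not confirm how Webinator stores it, and the helper names are made up for illustration.

```python
def as_signed32(value):
    """Interpret an integer as a signed 32-bit two's-complement value."""
    value %= 2**32
    return value - 2**32 if value >= 2**31 else value


def as_unsigned32(value):
    """Recover the unsigned 32-bit value behind a wrapped signed display.
    Only correct if the counter wrapped at most once (an assumption)."""
    return value % 2**32


if __name__ == "__main__":
    # Totals as displayed in the walk status and walk summary in this thread.
    for shown in (-1_610_788_191, -1_539_673_580):
        print(f"{shown:>16,} -> {as_unsigned32(shown):,} bytes")
    # One byte past the signed 32-bit maximum wraps to the most negative value.
    print(as_signed32(2**31 - 1 + 1))
```

Under that assumption the displayed totals correspond to roughly 2.7 GB fetched. The per-URL figure of 2,147,483,647 bytes is exactly 2^31 - 1, which suggests that particular counter is clamped at its maximum rather than wrapped.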
michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

Thanks for all the feedback. It looks like I was missing a couple of extensions, and there are a bunch of executables in there that probably can't be searched.
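
A quick way to spot missing extensions is to count the file types on the share and compare them with the walk's extensions list. The Python sketch below assumes the share can be read as a local or mounted directory; the function names and the example extension list are hypothetical and not taken from Webinator.

```python
import os
from collections import Counter


def count_extensions(root):
    """Count files under root by lower-cased extension ('(none)' if absent)."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "(none)"] += 1
    return counts


def missing_extensions(counts, configured):
    """Return {extension: file count} for extensions not in the configured list."""
    wanted = {ext.lower() for ext in configured}
    return {ext: n for ext, n in counts.items() if ext not in wanted}


if __name__ == "__main__":
    import sys

    # Hypothetical extensions list; replace with the one from your walk settings.
    configured = [".doc", ".xls", ".pdf", ".ppt", ".htm", ".html"]
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    skipped = missing_extensions(count_extensions(root), configured)
    for ext, n in sorted(skipped.items(), key=lambda item: -item[1]):
        print(f"{n:>8}  {ext}")
```

The output lists each extension that is on the share but not in the configured list, with the number of files affected, which helps explain a gap between the expected file count and the pages walked.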
michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

One more question: for binary files (jpg, exe, ...), will Webinator at least index the filename?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Yes, if they are walked. Make sure you include "URL" in the "Index Fields" setting under all walk settings.
michaelbarton
Posts: 13
Joined: Mon May 23, 2005 3:27 pm

Post by michaelbarton »

I don't see an "Index Fields" setting in my version (4.02).