Table html too big while processing url

evought
Posts: 7
Joined: Tue Aug 08, 2000 11:39 am

Post by evought »

Hi,

When indexing our site with the dowalk script, we are encountering the following error statement:

100 dowalk(procpage) 433: Table html too big while processing url

The html.tbl file has reached over 30 MB.

Any ideas as to what may be causing this and how to resolve it would be very much appreciated.

Erik
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The Texis that comes with Webinator 2 limits tables loaded via Vortex to 30 MB. Webinator 4 addresses that issue. You can see the licensing levels and try the beta at
http://www.thunderstone.com/texis/site/ ... ator4.html

The full version will be released "very soon now".
evought
Posts: 7
Joined: Tue Aug 08, 2000 11:39 am

Post by evought »

Thanks, but is there any reason why that table is so large? On other indexes I have developed of much larger sites, that table is much, much smaller.

Am I doing something wrong? Is there any way to reduce this table size in the future?

Thanks and cheers,
Erik
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

That table contains all of the text from the walked pages. If you have more pages, or more actual content (as opposed to HTML coding) per page, the table will be larger. If you've deleted a lot of data, there's free space in the table file that should get reused.
evought
Posts: 7
Joined: Tue Aug 08, 2000 11:39 am

Post by evought »

My understanding was that the html.tbl file contains the following:

Field   Description
Url     The URL of the HTML page
Ref     The URL of a reference (link) on the HTML page
id      Unique record id

and that the file html.blb contained the content.

Is this incorrect?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The default schema places the page text into the .blb file, so the .tbl file will not contain that part. There are more fields in the html table than you describe, though none of them is large. So if you're using the standard schema with the Body in a blob, the bulk of the data will be in the .blb file unless the pages are mostly empty.
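
One quick way to see where the bulk of the data sits is to compare the sizes of the two files directly. The following is only a minimal Python sketch, not part of Texis or Webinator: the function names, the database directory argument, and the exact byte value used for the 30 MB limit are all assumptions made for illustration.

```python
import os

# Assumption: the 30 MB limit is treated as 30 * 1024 * 1024 bytes.
# The exact threshold Texis applies is not stated in this thread.
LIMIT_BYTES = 30 * 1024 * 1024


def table_file_sizes(db_dir, table="html"):
    """Return {extension: size in bytes, or None if the file is missing}."""
    sizes = {}
    for ext in (".tbl", ".blb"):
        path = os.path.join(db_dir, table + ext)
        sizes[ext] = os.path.getsize(path) if os.path.exists(path) else None
    return sizes


def report(db_dir, table="html"):
    """Build one human-readable line per file, flagging files at or over the limit."""
    lines = []
    for ext, size in table_file_sizes(db_dir, table).items():
        if size is None:
            lines.append(f"{table}{ext}: missing")
        else:
            flag = " (at or over limit)" if size >= LIMIT_BYTES else ""
            lines.append(f"{table}{ext}: {size / (1024 * 1024):.1f} MB{flag}")
    return lines
```

Calling `report()` with the path to the database directory on each machine and printing the returned lines shows whether html.blb exists at all and whether the page text has ended up in html.tbl instead.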
evought
Posts: 7
Joined: Tue Aug 08, 2000 11:39 am

Post by evought »

I just ran an index of the same site from our own machine (the other process was on the client machine) and had no problems on our side. The final html.tbl file on our side is about 1 MB. However, when the same process is run from the client machine, the file ends up being over 30 MB.

Any ideas what may cause the difference?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I'm assuming you started with a new empty database or wiped it first.
Are the following the same?
dowalk script
number of urls retrieved
database schema (html.tbl and html.blb both exist)
texis version
operating system
evought
Posts: 7
Joined: Tue Aug 08, 2000 11:39 am

Post by evought »

In both cases we started with a new database. The other elements are basically the same, with the exception of the OS: WinNT vs. WinNT Small Business Server.

I presume the Texis version is the same, as the two copies of the software were purchased within a couple of months of each other, but I can double-check this. How do I check the version number?

Regarding the number of urls, how can I extract this from the table or log to compare the two?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

To check the version:
texis -version

To count URLs:
gw -st "select count(Url) from html"

To see the list of URLs:
gw -st "select Url from html"
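
To compare the two machines, the URL lists can be saved to text files and diffed. The following is a rough Python sketch under some assumptions: the output of the last command has been redirected to a file on each machine, each URL appears on its own line, and any other lines in the output (headers, status messages) do not start with http:// or https://. The file names and function names are hypothetical.

```python
def load_urls(path):
    """Read a saved URL list, keeping only lines that look like URLs.

    Assumption: one URL per line; anything else in the saved
    output is ignored by the prefix check.
    """
    urls = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(("http://", "https://")):
                urls.append(line)
    return urls


def compare_url_lists(ours_path, client_path):
    """Summarise how two saved URL lists differ."""
    ours = load_urls(ours_path)
    client = load_urls(client_path)
    ours_set, client_set = set(ours), set(client)
    return {
        "ours_rows": len(ours),            # total rows, duplicates included
        "client_rows": len(client),
        "only_ours": sorted(ours_set - client_set),
        "only_client": sorted(client_set - ours_set),
        "common": len(ours_set & client_set),
    }
```

For example, `compare_url_lists("ours.txt", "client.txt")` returns the row counts for each side plus the URLs that appear on only one machine, which should make any difference between the two walks easy to spot.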