disk usage by webinator index



Post by Thunderstone »



I have been testing the free version of Webinator with a view
to purchasing the commercial version to index our site. What I would
like to know is how much disk space the index uses.

Our site contains approximately 110,000 HTML documents. We have
approximately 870 MB of HTML files, with a similar amount of GIFs and
JPGs. When you quote your index sizes as 15% of the document
collection, does this figure mean that the index is 15% of the size of
the HTML files, or 15% of all the files used, i.e. including GIFs
etc.? I need to know this to work out how much space is required on
our web server.

Also, I have been experimenting with reducing the retrieved page
size with the -z option. This produced a smaller index, since less body
text is stored in the index. Are there any reasons why I shouldn't
make this value smaller than the default, and is there a recommended
size? The default value seems rather large to me.


Dr Chris Barran
Special Projects
Corporate Information & Computing Services
University of Sheffield




Post by Thunderstone »



Images, such as .gif and .jpg, are not text and are not indexed by
Webinator, so they don't count. Index size is based on the size
of the stored data, which in Webinator's case means the fetched and
parsed pages. Also, since Webinator stores the parsed text and not the
full HTML in the database, the disk usage is lower still. Parsed text
is usually MUCH smaller than the source HTML.

So Webinator's storage of the data is generally between 20% and 50% of
the size of the original HTML. The search index on that data is then
about 20%-30% of the stored data. Those two figures together make up
the total database size.
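
As a rough worked example, the percentages above can be applied to the
870 MB of HTML quoted in the question. This is only a back-of-envelope
sketch: the function name and variables are made up for illustration,
and the real figure depends on how much markup your pages carry.

```python
def estimate_database_size(html_mb, stored_fraction, index_fraction):
    """Return (stored data, search index, total) in MB.

    stored_fraction: parsed text as a fraction of the source HTML size.
    index_fraction:  search index as a fraction of the stored data.
    """
    stored_mb = html_mb * stored_fraction
    index_mb = stored_mb * index_fraction
    return stored_mb, index_mb, stored_mb + index_mb


HTML_MB = 870  # size of the HTML files only; images are not counted

# Low end: 20% stored data, index is 20% of that.
low = estimate_database_size(HTML_MB, 0.20, 0.20)
# High end: 50% stored data, index is 30% of that.
high = estimate_database_size(HTML_MB, 0.50, 0.30)

for label, (stored, index, total) in (("low", low), ("high", high)):
    print(f"{label}: stored {stored:.1f} MB + index {index:.1f} MB"
          f" = {total:.1f} MB")
```

On those assumptions the whole database would land somewhere between
roughly 210 MB and 570 MB.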

There's a very good reason not to reduce the page size with -z.
It truncates your pages at the point you set. You lose everything
beyond that, so you're not indexing all of your data.
Naturally indexing less data takes less space, but do you really
want to be missing data?
The purpose of -z is simply to prevent being dumped on by huge pages.
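
To make the cost of truncation concrete, here is a small illustration.
It is not Webinator code and does not model how -z measures its limit;
the helper names and the 40-character cutoff are invented for the
example. It only shows that words past the cutoff never reach the
index, so searches for them cannot match the page.

```python
def truncate_page(text, max_chars):
    """Keep only the first max_chars characters of a page (hypothetical)."""
    return text[:max_chars]


def index_terms(text):
    """Very crude stand-in for indexing: the set of lowercase words."""
    return set(text.lower().split())


# A page whose distinctive content sits near the end.
page = "campus map and directions " * 3 + "library opening hours"

full_terms = index_terms(page)
cut_terms = index_terms(truncate_page(page, 40))

# Words present in the full page but missing after truncation.
lost_terms = full_terms - cut_terms
print(sorted(lost_terms))
```

Here the words "library", "opening" and "hours" are lost, so a search
for any of them would no longer find this page.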



