Database size

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »




I have a question about the Webinator database size. I am indexing a very large number of documents that total about 4 GB. I have Webinator set to retrieve only the first 1 MB of each file (which in most cases is the entire file). I expect that my database will be about the same size as the original data (correct me if I am wrong). When I update the index, I have Webinator look for new or changed files with the command line options: -e"- 0 days" -V -X. Will this keep adding every entry to the database again, so that the database keeps growing, or will it replace existing entries as appropriate and add new entries only for new files?

I know this sounds like a silly question, but I want to be sure before I run out of disk space on my Webinator machine, so that I can figure out some other way to do this if I need to.
--------------
Daniell Freed (dxf@dewittross.net)
Compstaff
Dewitt, Ross & Stevens


Post by Thunderstone »



The ratio varies a lot depending on how heavily marked up the HTML is,
but the overall database size is generally much smaller than the
original HTML dataset. (Note that individual table files may not exceed
2 GB on 32-bit systems.)
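
If you want to keep an eye on that per-file limit, something like the
following could be run against the database directory from time to
time. It is only a rough sketch: the function name, the 80% warning
threshold, and the directory you pass in are assumptions for
illustration, not anything Webinator provides.

```python
import os

# 2 GB per-file limit mentioned above for 32-bit systems (taken here as 2 GiB).
TWO_GIB = 2 * 1024 ** 3


def files_near_limit(db_dir, limit=TWO_GIB, warn_fraction=0.8):
    """Return (name, size) pairs for files at or above warn_fraction of limit.

    db_dir is a hypothetical path to the database directory; the 0.8
    default is an arbitrary early-warning margin, not a documented value.
    """
    threshold = limit * warn_fraction
    flagged = []
    for name in sorted(os.listdir(db_dir)):
        path = os.path.join(db_dir, name)
        if os.path.isfile(path):
            size = os.path.getsize(path)
            if size >= threshold:
                flagged.append((name, size))
    return flagged
```

An empty list means no file in the directory has reached the warning
threshold yet.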

The -e option updates the existing records for the URLs it refetches.
It adds a new record only when the page is new.

You should not use a time period as short as "-0 days". You could get
stuck in a loop where the walk never finishes. A good rule of thumb is
to make the time period longer than the time it takes to do the entire
walk.
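
As a rough illustration of that rule of thumb, the sketch below turns a
measured walk time into a refresh period in whole days. The function
name and the 1.5x safety margin are assumptions for the example; how
you time the walk and how you pass the result to -e is up to you.

```python
import math

SECONDS_PER_DAY = 86400


def min_refresh_days(walk_seconds, safety_factor=1.5):
    """Return a refresh period in whole days that exceeds the walk time.

    walk_seconds is how long one complete walk took, measured by you.
    safety_factor is an assumed margin (must be > 1) so the period stays
    longer than the walk even if the next walk runs slower.
    """
    if walk_seconds < 0:
        raise ValueError("walk_seconds must be non-negative")
    if safety_factor <= 1:
        raise ValueError("safety_factor must be greater than 1")
    days = walk_seconds * safety_factor / SECONDS_PER_DAY
    # Never go below one day, and round up to a whole number of days.
    return max(1, math.ceil(days))
```

For example, a walk that takes two full days gives a period of three
days with the default margin.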



