new user questions

Post by Thunderstone »



I just downloaded the eval version of Webinator and so far I like the ease of
installation and setup a lot. I am a little concerned, however, about its
reliance on HTTP for building the database. Some questions:

Will there be a FreeBSD version?

Is there a way to make gw build the database by walking a local filesystem
tree instead of going through the (exponentially slower) web server?

Also in the speed area, I would like to know whether subsequent walk operations
are faster, or does it just re-walk every single page every time?

Basically I will need to be rebuilding several very large databases every two
days or so, and if it's going to take 8 hours each time, we will probably have
to go with another product.

Is it possible to merge databases? Ideally I'd envision running a filesystem
build on several machines and then merging their databases together to form one
uber-DB. This way I could minimize my DB-building time while intelligently
spreading out the load over several machines. (We have many high-powered
machines; it seems a shame to have one machine be the bottleneck.)

Jon Drukman jsd@gamespot.com
-------------------------------------------------------------------------
System Administrator SpotMedia Communications
Post by Thunderstone »




> Will there be a FreeBSD version?

I believe that you can use the BSDI version.


Webinator doesn't walk the local filesystem for several very good reasons:

A: What a website presents via the webserver is not always the same
as the files within the HTDOCS tree, e.g. dynamic HTML,
server-side includes, parsed HTML, protected documents, and
framed docs.

B: Large sites making use of several machines would have problems with
remote filesystems.

C: It's very error-prone. There are a lot of cases where files exist within
the HTML tree that have not been exported and linked in, and may never be.
The end user might select a doc from the Webinator index that you
never intended to publish live.

Access via HTTP is not _exponentially_ slower. Only on very small machines
would you be able to notice the speed differential between direct file access
and HTTP access.

Perhaps what you are actually seeing as a speed issue is the default
waiting period between page fetches. Try using "gw -w0" and see if that
doesn't speed things up for you.


> Also in the speed area, I would like to know whether subsequent walk
> operations are faster, or does it just re-walk every single page every time?

This really depends on your strategy. If you're adding pages, then the answer
is no. If you changed a lot of hyperlinks, then you probably have to re-walk.


Thunderstone's site is 1200 pages; on a Pentium 200 it takes 79 seconds to
walk it. With only one gw running at a time, you can acquire 50,000 pages
in an hour, and 1.2 million in a day. (This assumes that you have a lot
of bandwidth to the server.)

Since you are using the free Webinator, which is limited to 10,000 pages,
there's no way it should take more than 20 minutes to hit that limit, even
on a slow machine.
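
As a quick back-of-the-envelope check on those rates (just arithmetic on the
figures above, nothing new being measured):

    # Arithmetic check on the figures quoted above: 1200 pages in 79 seconds.
    pages, seconds = 1200, 79
    rate = pages / float(seconds)      # ~15.2 pages per second
    print(round(rate * 3600))          # ~54,684 pages/hour (quoted: 50,000)
    print(round(rate * 3600 * 24))     # ~1.3 million/day   (quoted: 1.2 million)
    print(round(10000 / rate / 60))    # ~11 minutes to reach the 10,000-page limit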

Things that could cause the slowness you describe:

1: The subject webserver is already being beaten to death.
2: You're not using the -w0 option.
3: The network.
   Local: Ethernet has more collisions than a DWI convention.
   Remote: ISP needs to feed the hamster in the Cisco.


> Is it possible to merge databases? Ideally I'd envision running a filesystem
> build on several machines and then merging their databases together to form
> one uber-DB.

Yes, it is possible to merge databases and distribute loads, though you will
need a T1000 Webinator license to do this. None of this should be required,
however. We have some very heavily hit web sites using Thunderstone, and
unless you intend to handle millions of searches per day you should not need
to consider these measures. A single reasonably configured machine should
be adequate.

Thunderstone



Post by Thunderstone »



Thanks for the answers... I have some more questions though. :)

On 07-Feb-97 Thunderstone - EPI wrote:

> I believe that you can use the BSDI version.

This worked great. (Just wanted to say that so that if someone else searches
the mailing list archives, they will see this!)


> Try using "gw -w0" and see if that doesn't speed things up for you.

That helped. Why does it default to 5 seconds? That seems awfully wasteful.
(Footnote: SGI's crappy NFS didn't help at all. Always index locally, it
seems!)

> This really depends on your strategy. If you're adding pages, then the
> answer is no. If you changed a lot of hyperlinks, then you probably have
> to re-walk.

My site is very large and pages change all over the place via automatic
systems. There is no record of what is changing, so there is no way for me to
know which pages need to be re-examined. I suppose I could write a script that
went through the site, figured out which pages needed updating, and then
piped that into gw; a rough sketch of the idea follows.
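
Something like this is what I have in mind (the document root, the site URL,
and the idea that gw will accept a list of URLs to re-fetch are all my own
assumptions here, not things I've confirmed):

    # Rough sketch: print the URLs of pages modified since the last walk,
    # one per line, so the list could be handed to gw for re-fetching.
    # Paths and the URL mapping below are hypothetical.
    import os

    DOCROOT = "/usr/local/htdocs"          # hypothetical document root
    BASEURL = "http://www.example.com"     # hypothetical site URL
    STAMP   = "/var/tmp/last_walk.stamp"   # file touched after each full walk

    last_walk = os.path.getmtime(STAMP) if os.path.exists(STAMP) else 0

    for dirpath, dirnames, filenames in os.walk(DOCROOT):
        for name in filenames:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_walk:
                rel = os.path.relpath(path, DOCROOT)
                print(BASEURL + "/" + rel.replace(os.sep, "/"))

    # Afterwards, touch the stamp file so the next run has a new baseline:
    # open(STAMP, "w").close()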


We are going to be indexing several dozen fairly large sites, several of which
do get millions of hits per day. We want a person to be able to search for a
term in all of them from one box on one form. The mega-DB option seems like it
will handle this from a functionality standpoint; I just have this nightmare of
the search machine constantly thrashing (already heavily loaded) web servers to
keep the index up to date.

More questions that I could not find the answers to in the man pages:

Can I specify a regular expression for filenames to EXCLUDE from gw? I don't
want to look at any file named "*_lb.html", for example.
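
If gw can't do that directly, I could at least filter a URL list myself before
feeding it in; a minimal sketch of the kind of pattern I mean (doing the
filtering outside gw is purely my own workaround idea, not a gw feature):

    import re

    # Exclude any page whose filename ends in "_lb.html".
    exclude = re.compile(r"_lb\.html$")

    urls = [
        "http://www.example.com/index.html",       # hypothetical URLs
        "http://www.example.com/banner_lb.html",
    ]
    for url in urls:
        if not exclude.search(url):
            print(url)    # only index.html survives the filter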

Can gw be told to NOT incorporate the text found in ALT="..." tags? As it is,
all our pages start with ad banners with alt text, so when someone does a
query, all the page abstracts start with the ad text! Not ideal.
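
Failing an option in gw itself, I suppose we could strip the ALT text out of
what gets indexed; a crude sketch of the idea (purely my own workaround notion,
and a real HTML parser would be safer than this regex):

    import re

    # Blank out ALT="..." attributes so banner text never reaches the
    # abstracts.  Only a rough sketch, not a gw feature.
    def strip_alt(html):
        return re.sub(r'\balt\s*=\s*"[^"]*"', 'alt=""', html, flags=re.IGNORECASE)

    page = '<img src="/ads/banner.gif" ALT="Buy stuff now!"> <p>Real content.</p>'
    print(strip_alt(page))
    # -> <img src="/ads/banner.gif" alt=""> <p>Real content.</p>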