Questions when indexing pubpages.unh.edu

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hi,

As a lurker on this list for six months or more, I want
to say how impressed I am by the support Mark, Bart, and
others provide by means of this list.

Now a few questions from my first serious use of
Webinator. I'm indexing the server where we allow
students and others to have personal Web pages, so I
expect there will be some pages with horrible syntax and
grossly incorrect content. There are at least 15K pages
on the server.

1. I left it at the default level 2 message reporting,
and as it chugs through the URLs it reports two
numbers, e.g.
2113/20831
I think the left-hand number is the number of URLs or
pages processed, but I'm not sure what the right-hand
number is.

2. My run went for about two hours and then seemed to
hang on the two numbers above. I gave it another
half hour, but the process seemed to be doing nothing
(on a DEC Unix system), so I killed it with
CTRL-C. The wg.log file says it
got signal 2 and attempted to quit nicely. The last
URL entered before that is
http://pubpages.unh.edu/~dmks/?S=A
and when I look at that page I can see one .htm
file in the directory, but it is mostly binary
stuff, even though named as HTML.

Have I just hit an occasional hazard that Webinator
won't be able to deal with? Is the solution to
exclude that page and resume walking? In fact, at
this point do I want to specify a rewalk to continue
walking (and indexing) the remaining pages? I could turn
on a higher level of message reporting, but I wonder if
that would really get me anything.

3. I went on to index after stopping the walk and
the index seems OK. I can search and find the
sorts of things I would expect, EXCEPT that I can't
get any regular expressions to work. The index is at:
http://unhinfo.unh.edu/cgi-bin/texis/webinator/search/
Since I can find the name "sand" with a normal search,
it seems I should be able to do something like any of these
/s.nd
/[A-Z]and
Do I have to activate REX searching in some way I did not
notice?

- Jim Cerny, Computing & Information Services, Univ.NH
jim.cerny@unh.edu



Post by Thunderstone »




Thanks, it's nice to know we're appreciated.


Free Webinator will only index 10,000 pages, but I assume you
already know that.



The first number is how many pages have been crawled;
the second is the number of URLs it has discovered and possibly
inserted into the refs table.
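
To make the two counters concrete, here is a minimal Python sketch that splits a progress value such as the 2113/20831 reported above. The function name parse_progress is hypothetical and is not part of Webinator. Since discovered URLs are only "possibly" inserted into the refs table, the difference between the two numbers is best read as a rough upper bound on the work remaining, not an exact count.

```python
def parse_progress(text):
    """Split a 'crawled/discovered' progress counter into two integers.

    Hypothetical helper for illustration only; not a Webinator function.
    """
    crawled_str, discovered_str = text.strip().split("/")
    return int(crawled_str), int(discovered_str)


crawled, discovered = parse_progress("2113/20831")

# Rough upper bound on URLs still waiting to be fetched: some
# discovered URLs may never have been inserted into the refs table.
pending = discovered - crawled

print(f"crawled={crawled} discovered={discovered} pending<={pending}")
```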


When I checked, the Web server on pubpages.unh.edu was corrupted
and was returning garbage in the headers. You might want to
restart the daemon. However, this will not affect Webinator too much.

You might want to use the "-w0" option to gw to remove the delay between
page fetches.


Nope, just restart gw with the "-w0" option and it will continue.



Webinator will not allow an all-linear search; this is to protect
the server from nasty queries. Add a keyword to your query and it
will do what you expected.
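
The reasoning behind that restriction can be sketched in a few lines of Python. This is only an illustration of the general idea, not Webinator's implementation: the function keyword_then_pattern and the sample pages are hypothetical, and Python's re syntax (with case ignored) stands in for REX, whose syntax differs. A cheap keyword test narrows the candidates first, so the expensive pattern scan never has to run over every page.

```python
import re


def keyword_then_pattern(pages, keyword, pattern):
    """Return URLs of pages that contain `keyword` and also match `pattern`.

    Hypothetical illustration; Python `re` stands in for REX here.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    keyword = keyword.lower()
    hits = []
    for url, text in pages.items():
        # Cheap keyword test first; a real engine would use its index here.
        if keyword not in text.lower():
            continue
        # The linear pattern scan runs only on the narrowed candidate set.
        if regex.search(text):
            hits.append(url)
    return hits


# Made-up sample data, purely for demonstration.
pages = {
    "page-a": "Geology notes: sand and gravel deposits",
    "page-b": "A list of band rehearsal times",
    "page-c": "Geology field trip schedule",
}

print(keyword_then_pattern(pages, "geology", r"s.nd"))
```

With a pattern alone, every page would have to be scanned from start to end, which is the kind of load the restriction guards against.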

