Adding multiple pages/metadata/reindexing/memory fault

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hi,
I have four problems with Webinator 2.0 that maybe someone can help me with.

ADDING MULTIPLE PAGES
I am having trouble adding multiple pages to an existing index. I am
using -g (without the -a) to get a page and all of its links.

gw -d/www/doc/webinator/newindex -g http://www.itc.virginia.edu/progress/97/toc.html

However, gw only indexes the first page.

Getting http://128.143.22.53/robots.txt...Not there...Ok.
http://www.itc.virginia.edu/progress/97/toc.html
1/23
Visited 1 pages total
Indexing new pages

How do I get it to add all 23 of the links from this page to the index? In
the documentation it says if you omit the -a option it will get all the
URLs from the page. However, whenever I run it, it is only indexing the
single page not the rest of the section (everything in the /progress/97/
subdirectory that is linked in from the starting page). When I run it with
the -a option I get the exact same results.

METADATA
It does not seem like Webinator is counting metadata when it does relevancy
ranking. I used the -meta=keywords,description,author option when creating
my index. I get all types of messages. Sometimes it says "You must rebuild
the database to enable meta data storage." Does this mean I have to wipe my
index and start over? Why is that? How can I make sure the metadata is
being indexed?

REINDEXING ON A SCHEDULE
I have not had any luck with this option (gw -rewalk="every day at 1am").
How can I make sure that when my index is updated each night, the same
rules apply as when I first created the index? I created the index with at
least 10 options and then added some individual pages after that. When the
update happens each night, will it still follow the indexing behavior I
specified when I originally created the index?

MEMORY FAULT
When I add too many options to the gw command (more than 8), I usually get a
memory fault error. When I tried to specify the options in a separate
file using the -m option, it did not work: I have a list of -j URLs, and
gw quits and says the URL is not allowed (it seems to be confusing -j
with -x). Since I cannot use the -m option, and I get a memory fault when
adding the list of -j URLs on the command line, I am not able to create
the index I want.

Thanks,
Lara





Post by Thunderstone »




-g does just what the manual says: "Get just the single page specified by URL and quit."
(http://www.thunderstone.com/gw2man/node23.html)


You don't want to fetch just the single page, you want to walk the page.
Assuming your todo list is empty, just walk the page normally (without -g).
If you don't want to get documents beyond the second level, use the maxdepth
setting, -D1 (http://www.thunderstone.com/gw2man/node22.html), to walk just
the first 2 levels. You may also want to use the -j option to prevent fetching
of any pages outside of the given directory.
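
As a rough sketch of the advice above, the walk could look something like the
following. It reuses the database path and URLs from the question and only the
options named in this thread (-d, -D1, -j). Attaching the URL directly to -j is
an assumption based on the option-file form "jhttp://..."; check the manual
for the exact spelling. The GW variable is a hypothetical dry-run switch, not
part of Webinator: by default the command is only printed.

```shell
#!/bin/sh
# Dry run by default: GW expands to "echo gw", so the command is printed.
# Set GW=gw in the environment to actually run the walk.
GW=${GW:-echo gw}

DB=/www/doc/webinator/newindex                           # database from the question
START=http://www.itc.virginia.edu/progress/97/toc.html   # page to start walking from
PREFIX=http://www.itc.virginia.edu/progress/97/          # directory to stay inside

walk_section() {
    # No -g here: walk from START, limit depth with -D1,
    # and use -j to keep the walk inside PREFIX.
    $GW -d"$DB" -D1 -j"$PREFIX" "$START"
}

walk_section
```

Once the printed command looks right, running the script with GW=gw performs
the real walk.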


You need to use the -meta option when performing the walk, so that Webinator
knows to store the meta data. You can't index or search data that has not
been stored.
(http://www.thunderstone.com/gw2man/node21.html)
Also, if you are using a Webinator 1.x database, there is no support for meta
data in the database. You have to start with a new database created (-create)
with version 2 to use meta data.

Look at your meta data with a command something like this:
gw -s "select Url,Meta from html"


Make sure you are using -rewalk on a database created with version 2.
-rewalk will rewalk all of the URLs you have specified in the past with
the options last used (or those recalled with -recall).

-rewalk works best where all of the URLs in a given database are walked
using the same set of options. If you have a complicated procedure that
indexes different URLs with different option sets, you will have to use
a manual procedure that you could put into a small shell script.
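
A small script for that manual procedure might be sketched as below. It is
only an illustration: the grouping of URLs is made up, the second URL is
hypothetical, and it assumes that repeating a walk against an existing
database refreshes its pages, which you should confirm in the manual. Only
options mentioned in this thread are used, and the GW variable is a
hypothetical dry-run switch that prints the commands instead of running them.

```shell
#!/bin/sh
# Dry run by default: GW expands to "echo gw", so commands are printed.
# Set GW=gw in the environment to run them for real.
GW=${GW:-echo gw}

DB=/www/doc/webinator/newindex
META=-meta=keywords,description,author   # same meta fields for every group

reindex() {
    # Group 1: walk one section two levels deep, staying inside its directory.
    $GW -d"$DB" "$META" -D1 -jhttp://www.itc.virginia.edu/progress/97/ \
        http://www.itc.virginia.edu/progress/97/toc.html

    # Group 2: a page that was added individually (hypothetical URL), fetched with -g.
    $GW -d"$DB" "$META" -g http://www.mysite.com/theplace/extra.html
}

reindex
```

A script like this could then be started each night by cron or a similar
scheduler, with one command per option set.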


It would appear that the long command line is causing some kind of
problem on your platform. Option files can be arbitrarily large and
work just as if the options came from the command line. Make sure
you only use one option per line and leave off the leading -, but not
the option letter. Also make sure there are no extra leading or trailing
blanks on the line.
(http://www.thunderstone.com/gw2man/node18.html)

e.g.:

jhttp://www.mysite.com/theplace/
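
If the list of -j URLs is long, the option file can be generated rather than
typed by hand. The helper below is a sketch, not part of Webinator:
make_optfile is a made-up name, and the input URLs are examples. It applies
the rules above: one option per line, option letter kept, no leading dash, no
leading or trailing blanks, no empty lines.

```shell
#!/bin/sh
# make_optfile: read URLs on stdin, print one "j<URL>" option line per URL.
make_optfile() {
    # Trim leading and trailing blanks, drop empty lines,
    # then prefix each remaining line with the option letter j.
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' \
        -e 's/^/j/'
}

# Example input with stray blanks and an empty line (URLs are illustrative).
printf '%s\n' \
    '  http://www.mysite.com/theplace/ ' \
    '' \
    'http://www.mysite.com/another/' | make_optfile
```

Redirect the output to a file and pass that file to gw with the -m option
mentioned in the question; the manual page linked above has the exact syntax.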