getting results of site indexing

Post by Thunderstone »

The end of gw.log will show how many pages were visited during the last
walk. Counting rows in the html table will show how many were
_successfully_ walked:

gw -s "select count(id) from html"

If you have multiple walks in the same database, you can count just the
pages walked, say, today:

gw -s "select count(id) from html where id > 'start of today'"

-Kai


Post by Thunderstone »

On Mon, 2 Aug 1999, Rafe Colburn wrote:


There is a log file generated every time you reindex a site. You can use the
Unix command wc -l gw.log (or whatever the log file is called) to count
the number of lines in the file. Not all of the lines pertain to a successfully
retrieved page; there are error lines and so on. You can pipe a "grep" into
"wc -l" to focus on counting just the error lines.
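
For instance (a rough sketch only; the exact wording of the log lines varies,
so check your own gw.log for patterns worth matching, such as the "Retrieving"
lines used later in this thread):

wc -l gw.log                      # total lines in the log
grep Retrieving gw.log | wc -l    # lines recording a page fetch
grep -i error gw.log | wc -l      # rough error count; adjust the pattern to your log's wording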

--
E. Loren Buhle, Jr. Ph.D.    INTERNET: buhle@carelife.com
P.O. Box 218                 Phone:    610-622-4293
Lansdowne, PA 19050          FAX:      610-622-1343



Post by Thunderstone »

Kai's response didn't really expose the key concept:
It's SQL... just ask it anything you want to know.
(Study the database structure in the gw docs.)

Try:
gw -d. -st "select count(Url) from html"
gw -d. -st "select count(Url),Reason from error group by Reason"
gw -d. -st "select Depth,count(Url) from html group by Depth
order by Depth"
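
If you also want the raw number of links the walker recorded (this assumes
the refs table holds one row per reference, which is what the join query
below leans on), the same idea applies:

gw -d. -st "select count(Url) from refs"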

Querying the error table is often instructive:
Start with
gw -d. -st "select Url,Reason from error"
and refine your query as you feel motivated.
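
For example, once the group-by above shows which Reasons occur, an obvious
refinement is to pull the URLs for a single Reason (reusing the
'Document not found' value from the next query):

gw -d. -st "select Url from error where Reason='Document not found'"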

A clever query is to grab the URLs of the "404 Document not found"
errors and then query the refs table to see which pages had the
bad links:
gw -d. -st "select refs.Url,refs.Ref,error.Reason from error,refs
where refs.Url=error.Url and Reason='Document not found'
order by refs.Url"

A brute force "analysis" of gw.log can also be instructive.
Assuming you're running Unix, try:
grep Retrieving gw.log | cut -d" " -f1,2 | sed 's/:.*$/:xx/' | uniq -c
That gives you an hourly report on the progress of the walk.
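
A per-day variant of the same pipeline (assuming, as the command above does,
that the first space-separated field of each Retrieving line is the date):

grep Retrieving gw.log | cut -d" " -f1 | uniq -c    # pages retrieved per day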

Gary Alderman
(just another customer, not a Thunderstone employee)




