getting results of site indexing

Post by Thunderstone »

The end of gw.log will show how many pages were visited during the last
walk. Counting rows in the html table will show how many were
_successfully_ walked:

gw -s "select count(id) from html"

If you have multiple walks in the same database, you can count just the
pages walked, say, today:

gw -s "select count(id) from html where id > 'start of today'"

-Kai


Post by Thunderstone »

On Mon, 2 Aug 1999, Rafe Colburn wrote:


There is a log file generated every time you reindex a site. You can use the
Unix command wc -l gw.log (or whatever the log file is called) to count
the number of lines in the file. Not all of the lines pertain to a successfully
retrieved page; there are error lines and so on. You can pipe a "grep" into
"wc -l" to focus on counting just the error lines.
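
For instance (a rough sketch only; the exact wording of the log lines varies,
so check your own gw.log for patterns worth matching, such as the "Retrieving"
lines used later in this thread):

wc -l gw.log                      # total lines in the log
grep Retrieving gw.log | wc -l    # lines recording a page fetch
grep -i error gw.log | wc -l      # rough error count; adjust the pattern to your log's wording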

--
E. Loren Buhle, Jr. Ph.D.    INTERNET: buhle@carelife.com
P.O. Box 218                 Phone:    610-622-4293
Lansdowne, PA 19050          FAX:      610-622-1343



Post by Thunderstone »

Kai's response didn't really expose the key concept:
It's SQL... just ask it anything you want to know.
(Study the database structure in the gw docs.)

Try:
gw -d. -st "select count(Url) from html"
gw -d. -st "select count(Url),Reason from error group by Reason"
gw -d. -st "select Depth,count(Url) from html group by Depth
order by Depth"
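
If you also want the raw number of links the walker recorded (this assumes
the refs table holds one row per reference, which is what the join query
below leans on), the same idea applies:

gw -d. -st "select count(Url) from refs"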

Querying the error table is often instructive:
Start with
gw -d. -st "select Url,Reason from error"
and refine your query as you feel motivated.
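
For example, once the group-by above shows which Reasons occur, an obvious
refinement is to pull the URLs for a single Reason (reusing the
'Document not found' value from the next query):

gw -d. -st "select Url from error where Reason='Document not found'"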

A clever query is to grab the URLs of the "404 Document not found"
errors and then query the refs table to see which pages had the
bad links:
gw -d. -st "select refs.Url,refs.Ref,error.Reason from error,refs
where refs.Url=error.Url and Reason='Document not found'
order by refs.Url"

A brute force "analysis" of gw.log can also be instructive.
Assuming you're running Unix, try:
grep Retrieving gw.log | cut -d" " -f1,2 | sed 's/:.*$/:xx/' | uniq -c
That gives you an hourly report on the progress of the walk.
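
A per-day variant of the same pipeline (assuming, as the command above does,
that the first space-separated field of each Retrieving line is the date):

grep Retrieving gw.log | cut -d" " -f1 | uniq -c    # pages retrieved per day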

Gary Alderman
(just another customer, not a Thunderstone employee)




