Wipe-to-do

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Wipe-to-do

Post by Thunderstone »



idas:~/wrweb/webinator/artists{525}$ /w/home/jrota/wrweb/webinator/bin/gw -d/w/home/jrota/wrweb/webinator/test -wipetodo
Wiping todo list
Getting http://208.129.255.81/robots.txt...Not there...Ok.
000 Can't get address for host `www.juststings.com': Unknown error
000 Can't get address for host `www.juststings.com': Unknown error
Getting http://207.141.233.50/robots.txt...Got it...Ok.
Getting http://216.87.33.130/robots.txt...Not there...Ok.
Getting http://207.155.248.72/robots.txt...Got it...Ok.
000 Can't get address for host `www.infini.com': Unknown error
000 Can't get address for host `www.infini.com': Unknown error
000 Can't get address for host `www.roncentrola.com': Unknown error
000 Can't get address for host `www.roncentrola.com': Unknown error
000 Can't get address for host `www.vocelinc.com': Unknown error
000 Can'


Then, when it tries to index a new site, it still runs into these
"should-be-wiped" domains.

Could it be that I switched to using the -r flag some time ago?

Jim




Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Wipe-to-do

Post by Thunderstone »



-wipetodo clears the list of yet-to-be-fetched pages.
It does not clear the remembered list of starting URLs that
you have specified on the command line (or in a list file).
gw doesn't necessarily fetch those URLs again, but it does try to
do a name lookup on them, which is what you're seeing. To clear that list, run:
gw -st "delete from options where Name='URL'"
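
For example, against the database from the session at the top of the thread (paths taken from that session, and assuming -st is pointed at the right database with -d the same way the other switches are), the full invocation would look something like:

/w/home/jrota/wrweb/webinator/bin/gw -d/w/home/jrota/wrweb/webinator/test -st "delete from options where Name='URL'"

After that, only the starting URLs supplied on subsequent walks should end up back in that list.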





Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Wipe-to-do

Post by Thunderstone »



Jumping in here, because I have a similar problem. I have a rather large
list of URLs that we specify in a list file, and several of them are
invalid. I'd like to cut down the time it takes to walk/rewalk this
database. I have a few questions:

1. If I type the command you specified above, Mark, will it delete
*all* the URLs from the "remembered" list, or do I have to specify each
URL in place of 'URL' in your command?

2. I built this list from an existing database by querying the URL field
of its html table (a rough sketch of such a query follows these questions).
If a URL cannot be retrieved, does it get entered into the database?

3. In general, what's the best way to remove dead links from the
database to keep them from getting "walked"?
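
As a rough sketch of the kind of query mentioned in question 2 (the tsql invocation, the database path, and the Url column name in the html table are all assumptions here, not something spelled out in this thread):

# placeholder path and assumed column name:
tsql -d /path/to/old/database "select Url from html" > urllist.txt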


--
Mark Miller
mark@chalkboardcom.com
Chalkboard Communications (206) 459-5577
http://www.chalkboardcom.com




Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Wipe-to-do

Post by Thunderstone »




1. Enter the SQL exactly as given; do not substitute a specific URL.
It will delete all URLs from the remembered list.


2. If a URL cannot be retrieved, it does not get added to the database.
I'm not sure why you would feed it the complete list from the previous
database instead of simply supplying the same starting point(s) again,
or simply using -rewalk.
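
For instance, a rewalk of an existing database, relying on the remembered starting points rather than a dumped list, would look something like:

# /path/to/your/database is a placeholder for your own walk directory
gw -d/path/to/your/database -rewalk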


3. There's no way to identify a dead link without trying to fetch it.
If you give only starting points, instead of every URL from the previous
database, it will only attempt to fetch pages that are linked in;
the same is true for -rewalk.
There's also the -e option, used with -X and -V, which will refetch URLs already in the
current database and delete any that are not found. See the manual for usage.



