restart gw with pinged links and grab remaining pages

staylor00
Posts: 11
Joined: Thu Mar 29, 2001 2:31 pm

restart gw with pinged links and grab remaining pages

Post by staylor00 »

Hello.

I am walking a site, and after 10 hours the message 'gw killed' was displayed. The walk had not completed, since there are more than 1000 pages and hyperlinks in that subdirectory.

1) a) What would cause this abort? b) Does this walking drive up bandwidth on the site running 'gw' considerably during this 10-hour period?

2) What does the second number in 870/21045 mean during the walk? I believe the 870 is the number of the page (link) gw is working on.

3) During this walk, and without using the -o option, the hyperlinks were still 'pinged' to make sure they were active. This is a great feature. When I restarted the walk using 'gw -moptions.set' without the domain on the command line, it just went through the pages in the todo list without pinging the links on those pages. I have tried this command in the past, and the only way I could get those links pinged was to use the '-Force' option. I don't want to re-ping the sites that were already pinged, especially since it takes so long to walk this site. (a) How can I get gw to ping the links on those pages without starting from scratch, and (b) how can I grab/walk the remaining pages from that subdirectory that did not get added to the todo list before the 'gw killed' message?

Thank you.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

restart gw with pinged links and grab remaining pages

Post by mark »

"gw killed" is not a message of ours. It must be from your shell or kernel. Some person or the kernel must have killed gw for some reason.

That many pages should take minutes, not hours, unless you're using large -w times or have very slow DNS or web servers.

The second number is how many links of all kinds have been seen so far.

I don't know what you mean by pinging. You're not using it in any traditional sense, and it is not a webinator term. If you run gw with the exact command line you used the first time, it will pick up where it left off. The one caveat is that the hyperlinks and/or offsite URLs from the page that was current when gw was killed may not have been completed. If that's the case, you could delete the page it was working on when it was killed from the database, then add that URL to the gw command line when you restart it.
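
A rough sketch of that restart (the options file name and URLs below are placeholders for whatever you actually used; this only illustrates the idea, it is not an exact recipe):

# rerun the same command line you used for the original walk:
gw -moptions.set http://www.example.com/subdir/

# or, if the page that was in progress when gw was killed has been
# deleted from the database, put its URL on the restart command line too:
gw -moptions.set http://www.example.com/subdir/ http://www.example.com/subdir/unfinished-page.html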
staylor00
Posts: 11
Joined: Thu Mar 29, 2001 2:31 pm

restart gw with pinged links and grab remaining pages

Post by staylor00 »

What I mean by 'ping' is in the same sense as the command that is somewhat similar to 'tracert'. It sends a signal from the user, via the ISP, to the specified web server to see whether the path has a broken connection en route, and whether the domain or IP address exists and is working properly.

In 'gw.log', entries such as "2001/04/12 05:35:00 Can't get address for host `www.entrepreneurs.net': Host name lookup failure" are what I mean by 'pinging' the web server of a hyperlink off of the page being indexed. This lets me know which hyperlinks may no longer exist so I can take them off the page.

When gw is restarted, it no longer provides this information. It just adds the pages it is working on to the database, and in that mode it only takes minutes to add a large number of pages to the database.

The -w I am using is the default value of 30 seconds. When gw can't locate a hyperlink's web server, it moves on after 30 seconds (although I have seen it go back again and again to retry). This makes the walk very time-consuming for pages with a lot of hyperlinks.

1) Would it be less work if I deleted the todo list using the '-wipetodo' option and then ran 'gw -moptions.set' without the domain URL, so that it picks up where it left off with the same level of detail (hyperlink pinging info) as the previous (original) walk and, hopefully, finishes grabbing the remaining pages to add to the todo list?

2) Is there a way, when I want it, to prevent 'gw' from pinging the servers of hyperlinks and just add the pages from the main site's server to the database? If so, would this be done via -w, such as -w0? If done via -w0, is this a 'double-edged sword' whereby it could overlook (not include) pages from the main server if the response is not immediate during higher-traffic periods?

3) Also, does 'walking' drive up bandwidth on the server running gw enough during this 10-hour period to make the server's admin kill the gw program?

Thank you.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

restart gw with pinged links and grab remaining pages

Post by mark »

I know what ping is. It has nothing to do with DNS lookup, and nothing gw does causes a "ping". Actually, it's very likely that a DNS lookup never hits the machine being looked up at all. It hits its nameserver, which is usually a completely different host. And even that says nothing about the page specified by the URL, only the host.

It may be that your/their DNS is slow and doesn't answer within the normal name lookup timeout, but by the time you get around to running gw again it has answered and is in your local nameserver cache.
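
If you want to see that kind of failure outside of gw, an ordinary DNS query will show it. nslookup is just one common tool for this (it is not part of webinator, and the hostname is simply the one from your log):

nslookup www.entrepreneurs.net

That query goes to your configured nameserver, not to the host itself, which is the same thing gw's lookup does.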

30 is not the default for -w. The default is 0. In old versions, before 2.5, it was 5 seconds.

Using -wipetodo before restarting will cause many pages to be lost and never walked. The todo table is how gw is able to pick up where it left off.

-w controls how long to wait *between* fetches, not how long to wait for an answer. Use -w0 unless the webserver you're walking can't handle rapid fetches. -t controls how long to wait for an answer. Its default is 30.

Perhaps you want to use -L if your nameservers are so slow.

Walking is like surfing the site with images and java turned off. The bandwidth usage is similar except that gw with -w0 will go faster than a user clicking. -w30 will go slower than a user.
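
Putting those options together, a restart could look something like this (a sketch only; the options file name and URL are placeholders, and -t30 just restates the default):

gw -moptions.set -w0 -t30 -L http://www.example.com/subdir/

Here -w0 means no delay between fetches, -t30 gives each fetch 30 seconds to answer, and -L skips the DNS lookups on remote links so dead offsite hosts don't stall the walk.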
staylor00
Posts: 11
Joined: Thu Mar 29, 2001 2:31 pm

restart gw with pinged links and grab remaining pages

Post by staylor00 »

Thanks again.

The default of -w0 is what I used and the 30 seconds is the -t value. I referred to it incorrectly.

(1) How can I walk a site without gw trying to look at the hyperlinks on each page? Will the -L option prevent this? The walk slows down considerably when it gets to a hyperlink and encounters messages such as "Unknown host", "Host name lookup failure", or "No address associated with name" caused by, e.g., a dead link (this is a large database, and it takes a lot of work to find out what these dead links are and have them removed). Therefore, most of the time I just want it to look at the pages on the main site and ignore the existence of hyperlinks. I would only want to see hyperlink information when I am using the -o option, not otherwise. How do I restrict the walk to pages on the main server only?

(2) Is there a downside to using the -L option? Is there information that will be excluded if this option is used?

Thank you.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

restart gw with pinged links and grab remaining pages

Post by mark »

You can't walk a site without looking at hyperlinks. -L will prevent DNS lookups on them so offsite URLs don't slow you down.

I can't say much more than the manual does about -L. See http://www.thunderstone.com/site/gw25man/node65.html
It's pretty safe to use.
staylor00
Posts: 11
Joined: Thu Mar 29, 2001 2:31 pm

restart gw with pinged links and grab remaining pages

Post by staylor00 »

Hello.

When using the -L option, will 'gw' still be able to locate those websites that have a dynamic IP address?

Are there any options that, when used, can't locate dynamic IP addresses?

Thank you.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

restart gw with pinged links and grab remaining pages

Post by mark »

Yes.

No.