Robot pass / homepage pass ???

jasondwitt
Posts: 7
Joined: Tue Jul 17, 2001 4:19 pm

Robot pass / homepage pass ???

Post by jasondwitt »

All I want to do is scan a file-based list of URLs and
pull just the home page of each into the database. I can't
tell what's going on, and things seem to be going
very slowly (2+ hours to scan just 500 URLs).
Here's the command line I'm using:

bin/gw
-v100 -L -N -R -z65000 -t10 -p500 -r -D0 -w0
-meta=keywords,description
-ddb23178_tmp "&/tmp/url-list.txt"

Watching the log, I see this kind of thing:

200 Database created

Adding todo: http://cyberspaceprogramming.com/ 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0

178 Trying to insert duplicate value (cyberspaceprogramming.com/) in index /web/texis/www/webinator/db23178_tmp/xtodourl.btr

Adding todo: http://cyberspiderwebdesign.com/
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0

Getting http://64.217.39.131/robots.txt... 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Disallowed MIME type (r)
Not there...Ok.


1) I've used -r to ignore robots.txt to see if that
will speed things up, but it's still pulling
robots.txt. Does ignoring it still fetch the file?

2) I've let gw run for about 2 hours and it's just
so slow, and when I look in the database directory
I never see any of the database files growing in
size. Yet scanning a single large site works fine:
things happen, the database grows to 12-15 MB, and
it takes about 3-4 minutes.

3) The gw log always seems to say 'Adding todo'.
If I'm just pulling the single home page, it seems
it should add the page right THEN and not wait
for a LATER "todo".

I guess my whole issue is that things seem really
slow, I don't see the database growing, and all I see
is the "todo" output, as if it's scanning all the
robots.txt files and then wanting to go back for
another pass at the sites, when all that's needed
is a single pass pulling the home page.

I've written my own Perl spider and it just tears
through the list, so I don't think it's a machine or
network issue.

Yes, I have started with a fresh database, with
no residual 'todo' entries floating around.

A little frustrated... and maybe a little tired too.

Any comments to help me understand what's
happening, or how to speed things up, would be
appreciated.

thanks
-jason
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Robot pass / homepage pass ???

Post by mark »

Except when using -g, gw loads *all* URLs specified on the command line and in list files into the todo list. This also involves doing DNS lookups on them. Then it starts walking what's in the todo list and stops at the desired count. If you only want to walk 500 top-level URLs, put only 500 in the list file. How many do you have?

-g will fetch pages without placing them into todo first.
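
For example, something along these lines should do a single-pass fetch of just the listed pages (a rough sketch only; it assumes -g accepts the same list-file argument shown in your command line, and it reuses your database, robots, and depth options):

bin/gw -ddb23178_tmp -v100 -r -D0 -w0 -g "&/tmp/url-list.txt"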

As for the robots, the -r should prevent fetching of those. Did the output show
Option: "-r"
at the beginning with the other options?
jasondwitt
Posts: 7
Joined: Tue Jul 17, 2001 4:19 pm

Robot pass / homepage pass ???

Post by jasondwitt »

All I have is the 500 URLs, and they're all in the
url-list.txt file. The lowercase "-r" is shown in
the startup diagnostics from gw, but it still seems to
be requesting the robots.txt files. Also, I've used
the -p option to limit the pages walked, and a depth
of 0. This is an automated script, so I'm trying to
avoid the -g approach because I need to be able to adjust
the depth programmatically (a sketch of the wrapper is below, after the log).

Spider Start
========================================================
Option: "-ddb23178_tmp"
Option: "-v100"
Option: "-L"
Option: "-N"
Option: "-R"
Option: "-z65000"
Option: "-t10"
Option: "-p500"
Option: "-r"
Option: "-D0"
Option: "-w0"
Option: "-meta=keywords,description"
Option: "-M/Mozilla/4.0 (compatible; MSIE 5.01;)"
Reporting myself as "Mozilla/4.0 (compatible; MSIE 5.01;)"
Reporting myself as "Mozilla/4.0 (compatible; MSIE 5.01;)"
Option: "-ddb23178_tmp"
100 User _SYSTEM has been added without a password.
100 User PUBLIC has been added without a password.
200 Database created
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Disallowed MIME type (r)
Not there...Ok.
Adding todo: http://cyberspaceprogramming.com/
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0
178 Trying to insert duplicate value (cyberspaceprogramming.com/) in index /home/hayden/texis/www/webinator/db23178_tmp/xtodourl.btr
Getting http://207.150.192.12/robots.txt... 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Disallowed MIME type (r)
Not there...Ok.
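
The wrapper the automation runs is essentially just the command line above with the depth filled in; a minimal sketch (hypothetical paths and variable names, using only the options already shown):

#!/bin/sh
# Hypothetical wrapper: DEPTH and DBDIR are filled in by the automation.
DEPTH=0
DBDIR=db23178_tmp
bin/gw -v100 -L -N -R -z65000 -t10 -p500 -r -D$DEPTH -w0 \
    -meta=keywords,description \
    -d$DBDIR "&/tmp/url-list.txt"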
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Robot pass / homepage pass ???

Post by mark »

I'm not able to replicate the behavior you're seeing. What's your gw version (gw -version) and platform?

Also, depending on what you're doing you might be able to do it all in a vortex script using dowalk_beta as a starting point, instead of driving gw from another program.
jasondwitt
Posts: 7
Joined: Tue Jul 17, 2001 4:19 pm

Robot pass / homepage pass ???

Post by jasondwitt »

Here's the version info, and my development box is
a Compaq ProLiant running Red Hat Linux 7.1.

Webinator WWW Site Indexer Version 2.56 (Free)
Copyright(c) 1995,1996,1997,1998,1999,2000 Thunderstone EPI Inc.
Release: 20010621

Before I upload things and maybe waste your time, I think
I'll create a small set of URLs to play with (25) and
also create a separate test script to see what
happens. If the resulting log still doesn't look right,
I'll upload it to the FTP site for you to take a look
at.

Thanks
-jason
jasondwitt
Posts: 7
Joined: Tue Jul 17, 2001 4:19 pm

Robot pass / homepage pass ???

Post by jasondwitt »

When I extracted the bin/gw commands into a separate
test script, things went much better. Rather than a wipe
command to delete the old database, we were actually
supposed to be deleting the whole temp directory
structure... and that wasn't happening! Also, I found several
other bin/gw processes running in the background.
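
In case it helps anyone else, the per-run cleanup the script should have been doing looks roughly like this (a sketch with a hypothetical database path; adjust for your install):

# Remove the whole temp database directory, not just wipe the tables,
# and make sure no leftover gw processes are still running from a prior run.
rm -rf /web/texis/www/webinator/db23178_tmp
ps ax | grep 'bin/gw'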

So, basically, a multitude of little issues. Things
are going much better now. Thanks for the effort, and
sorry to be so much trouble.

-jason