All I want to do is scan a list of file-based URLs
and pull each home page into the database. I can't
tell what's going on, and things are going very
slowly (2+ hours) to scan just 500 URLs.
Here's the command line I'm using:
bin/gw
-v100 -L -N -R -z65000 -t10 -p500 -r -D0 -w0
-meta=keywords,description
-ddb23178_tmp "&/tmp/url-list.txt"
Watching the log, I see this kind of thing:
200 Database created
Adding todo: http://cyberspaceprogramming.com/ 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0
178 Trying to insert duplicate value (cyberspaceprogramming.com/) in index /web/texis/www/webinator/db23178_tmp/xtodourl.btr
Adding todo: http://cyberspiderwebdesign.com/ 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0
Getting http://64.217.39.131/robots.txt... 0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Disallowed MIME type (r)
Not there...Ok.
1) I've used -r to ignore robots.txt to see if that
would speed things up, but it's still pulling
robots.txt. Does ignoring it still fetch the file?
2) I've let gw run for about 2 hours and it's just
so slow, and when I look in the database directory
I never really see any of the database files
growing in size. Yet scanning a single large
site works fine: things happen, the database grows
to 12-15 MB, and it takes about 3-4 minutes.
3) The gw logs always seem to say 'Adding todo'.
If I'm just pulling the single home page, it seems
like it should add it right THEN and not wait
for a LATER "todo".
I guess my whole issue is that things seem really
slow, I don't see the database growing, and all I
see is the "todo" output, as if it's scanning all
the robots.txt files first and then planning to go
back for another pass at the sites, when all that's
needed is a single pass pulling each home page.
I've written my own Perl spider and it just tears
through the list, so I don't think it's a machine
or network issue.
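
Just to illustrate what I mean by a single pass, here's roughly
the loop my spider does (sketched in Python rather than my actual
Perl; the file path and the 10-worker count are just placeholders
mirroring the list and -t10 I'm already passing to gw, and the
database insert is stubbed out as a print):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen
from urllib.error import URLError

def fetch(url):
    # Pull just the home page; no robots.txt, no link-following.
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.read()
    except (URLError, OSError):
        return url, None

with open("/tmp/url-list.txt") as f:   # same list I feed to gw
    urls = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=10) as pool:   # roughly matches -t10
    for url, body in pool.map(fetch, urls):
        if body is None:
            print("failed:", url)
        else:
            print(url, len(body), "bytes")   # real version inserts into the DB here

A loop like that gets through 500 URLs in a few minutes on the
same box, which is why I don't suspect the machine or the network.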
Yes, I have started with a fresh database, with
no residual 'todo' entries floating around.
A little frustrated... and maybe a little tired too.
Any comments to help me understand what's
happening, or how to speed things up, would be
appreciated.
thanks
-jason