options preventing crawling

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hello,

I'm having trouble crawling a site (http://www.santosha.com/moksha/) when
using the following options file. With this file, gw doesn't pick up any
links at the above URL; without it, it does. gw's -v9 output is below.

Thanks!

Otis

# avoid ns lookups for finding server aliases
L
# do not store refs in refs table
R
# crawl breadth first
b
# collect META keywords, description
meta=description
meta=keywords
# accept files with extensions
fasp
fshtml
fphtml
fjhtml
fhts
fhtx
# remove pages that no longer exist from Db
X
# use If-Modified-Since conditional GET
V
# don't report Mozilla user-agent but rather Webinator
M
# don't index after crawling (done periodically from cron)
noindex
# skip Urls with strings that are known not to be good Urls
x/RealMedia/Ads
x/event\.ng
x//printme/
x//email_display/
x//article_print/
x/_flat\.html/
x//wwwboard/
x/\.cgi
x/\.pl


# gw m/tmp/options -Q -jhttp://www.santosha.com/moksha
http://www.santosha.com/moksha
Getting http://216.22.163.138/robots.txt...Not there...Ok.
Using meta data field
Adding todo: http://www.santosha.com/moksha
Saving options and URLs to lastrun
http://www.santosha.com/moksha
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Retrieving
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Off site
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0
0: TotLinks: 0, Links: 0/ 0, Good: 0, New: 0 Delaying 2
Visited 1 pages
Visited 0 pages
Visited 1 pages total
Remember to run "gw -index" to update the index when you finish a batch
Host: 216.22.163.138:80 (1) www.santosha.com
getip() called 1 times. 1 hits
gethostbyname() called 0 times




Post by Thunderstone »



www.santosha.com is an alias for santosha.com, and the server redirects
requests to santosha.com. Since you are supplying the -L option, gw does
not do a name lookup, so it does not learn that the two names are the same
server; it treats the redirect as off site and will not follow it. You
should use http://santosha.com/moksha/ as your initial URL.
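
To illustrate why the redirect is rejected, here is a minimal Python sketch of an off-site check that compares host names with and without an alias table. It is not gw's actual code: the function names, the alias table, and the exact redirect target URL are assumptions made up for the example.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def is_off_site(start_url, redirect_url, aliases=None):
    """Decide whether redirect_url leaves the site that start_url is on.

    aliases maps an alias host name to its canonical name. Passing None
    models a crawl with alias lookups disabled: hosts are compared as
    plain strings.
    """
    aliases = aliases or {}
    start = host_of(start_url)
    target = host_of(redirect_url)
    # Map each host to its canonical name when an alias entry is known.
    start = aliases.get(start, start)
    target = aliases.get(target, target)
    return start != target


start = "http://www.santosha.com/moksha"
redirect = "http://santosha.com/moksha/"  # assumed redirect target
known_aliases = {"www.santosha.com": "santosha.com"}  # hypothetical table

print(is_off_site(start, redirect))                 # alias lookups disabled
print(is_off_site(start, redirect, known_aliases))  # aliases known
print(is_off_site(redirect, redirect))              # start at canonical name
```

With no alias table the two host names differ as strings, so the redirect counts as off site. Starting from the canonical name avoids the problem without needing any lookup.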
