I attempted to crawl Walmart.com and it only accessed 33 pages - is there a setting I need to change? I tried both a wildcard (*.*) for the extensions as well as the default, and removed all exclusions.
I tried crawling http://www.budget.com/budgetWeb/home/home.ex but it only crawled 211 pages. I set Strip Query to N, allowed All Extensions, removed all exclusions, and changed Max Redirects to 0 (as I'm only interested in redirects). What other settings can I change to allow it to crawl the entire site?
Did you remove ? from the exclusions?
Did you turn off Stay Under, since many of the links aren't "under" budgetWeb/home? (See the sketch below these suggestions for a rough illustration of what "under" means.)
View the "Walk Status" and look for errors.
Go to "List/Edit URLs" and submit then click on your Base Url. From there click on "Children". Then you will see what hyperlinks were found on the page. Ones that are clickable were indexed. Ones that are not clickable were not indexed. If Webinator attempted to index it but couldn't the error will be listed after the url. If there is no error listed, Webinator did not try to index it. Pages don't get indexed if they are on a different site, not under the base url on the same site, or don't qualify based on the settings (such as robots.txt, Extensions, Exclusions, etc). If you can't figure out why particular urls are rejected turn verbosity up to 4 and do a new walk. Then go back to Children page under List/Edit URLs to find the reason.
If you think there are pages that should be indexed but aren't, please provide an example: the URL of a page that you think should be indexed, and the URL of an indexed page that links to it.
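Also, since you mentioned you're mainly interested in redirects: a quick way to spot-check what a given URL redirects to, independent of Webinator, is something like the following (again just a Python sketch; http.client never follows redirects on its own):

    import http.client
    from urllib.parse import urlparse

    def check_redirect(url):
        # Issue a single GET and report the status plus any Location header,
        # without following the redirect.
        parts = urlparse(url)
        conn = http.client.HTTPConnection(parts.netloc)
        conn.request("GET", parts.path or "/")
        resp = conn.getresponse()
        print(url, "->", resp.status, resp.getheader("Location") or "(no redirect)")
        conn.close()

    check_redirect("http://www.budget.com/budgetWeb/home/home.ex")

That will tell you whether a given page answers directly or sends the crawler somewhere else, which is useful when deciding what Max Redirects should be.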