website crawl not as expected

Post by **KMandalia** » Mon Jul 26, 2004 8:30 am

I have about 21 websites that I wish to crawl. I list all of them as Base URL strings under all walk settings. I have specified the exclusions as full paths (http://www.mysite.com/myfolder/). However, after a successful walk the total number of pages is almost half of what I expected. In the base URL I list each website as http://www.mysite.com; does having a '/' at the end make any difference?
I am only specifying the base URL and exclusions (no REX, additional exclusions, etc.). Any help?

Post by **mark** » Mon Jul 26, 2004 10:43 am

At the top level the trailing slash is less important. It will work either way but you may get duplicate warnings about the / and non-/ versions.
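To illustrate why the two forms end up as the same page at the top level, here is a minimal Python sketch. It is not the crawler's own code; `normalize_base` is a hypothetical helper that applies the usual rule that an http URL with an empty path is equivalent to one whose path is `/`.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_base(url):
    """Hypothetical helper: treat an empty path as '/' and lowercase the host."""
    parts = urlsplit(url)
    path = parts.path or "/"  # "" and "/" name the same top-level resource
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       parts.query, parts.fragment))

a = normalize_base("http://www.mysite.com")
b = normalize_base("http://www.mysite.com/")
print(a == b)  # both become http://www.mysite.com/

# Below the top level the slash is left alone, because there it can matter.
print(normalize_base("http://www.mysite.com/myfolder"))
```

Note that the sketch only touches the empty top-level path. A folder URL with and without the trailing slash stays distinct, which is why exclusions written as full paths should match the form the site actually links to.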

Review the walk errors to see if there's anything unexpected. Errors on a page will generally prevent following links from that page. Review your settings to see if there's anything that would prevent desired URLs from being processed. For some sites you need to remove ? from the exclusions and turn off "strip queries".
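As a rough illustration of how those two settings can shrink the page count on a site that uses query strings, the Python sketch below models them in a simplified way. This is an assumption about the general behaviour, not the crawler's actual logic: `strip_query`, `walk`, the substring matching of exclusions, and the sample URLs are all made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Drop everything after the path (hypothetical model of 'strip queries')."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def walk(urls, exclusions, strip_queries):
    """Return the set of distinct URLs that would be kept (simplified model)."""
    kept = set()
    for url in urls:
        # Assumption: an exclusion rejects any URL that contains it.
        if any(pattern in url for pattern in exclusions):
            continue
        if strip_queries:
            url = strip_query(url)
        kept.add(url)
    return kept

links = [
    "http://www.mysite.com/article.php?id=1",
    "http://www.mysite.com/article.php?id=2",
    "http://www.mysite.com/article.php?id=3",
    "http://www.mysite.com/about.html",
]

print(len(walk(links, ["?"], strip_queries=False)))  # 1: every query URL is excluded
print(len(walk(links, [], strip_queries=True)))      # 2: the three articles collapse into one
print(len(walk(links, [], strip_queries=False)))     # 4: every page is kept
```

If a site serves distinct pages from the same script via the query string, either setting alone is enough to lose most of them.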

Find a page that you expect to be in the database but isn't. Determine the page that links to that page (its parent). Find the parent in List/Edit URLs. Click "Children" to see what links were found on that page. Look for the missing one. See if there's any error listed next to it. You can turn verbosity up to 4 to generate "errors" for all hyperlinks discarded due to rules such as bad extension, etc.
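If you want to double-check the parent page outside the admin interface, a small Python sketch like the one below lists the links found in its HTML so you can see whether the missing URL is really there. It is an independent sanity check, not how the crawler extracts links; `LinkCollector`, `children_of`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the parent page's URL.
                self.links.append(urljoin(self.base_url, href))

def children_of(parent_url, html):
    collector = LinkCollector(parent_url)
    collector.feed(html)
    return collector.links

# Sample parent page; in practice you would paste in the saved page source.
sample_html = '<a href="/myfolder/page1.html">One</a> <a href="page2.html">Two</a>'
children = children_of("http://www.mysite.com/index.html", sample_html)

missing = "http://www.mysite.com/myfolder/page3.html"
print(children)
print(missing in children)  # False here: the parent never links to it
```

If the link does not appear in the plain HTML at all (for example because it is generated by a script), no exclusion setting is to blame.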