website crawl not as expected

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

I have about 21 websites that I wish to crawl. I am listing all of them in the Base URL field of the all walk settings. I have specified exclusions as full paths (http://www.mysite.com/myfolder/). However, the total number of pages after a successful walk is far lower than I expected, almost half. In the Base URL field I am listing websites as (http://www.mysite.com). Does it make any difference whether there is a '/' at the end?
I am only specifying the base URLs and exclusions (no REX, additional exclusions, etc.). Any help?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

At the top level the trailing slash is less important. It will work either way but you may get duplicate warnings about the / and non-/ versions.
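To see why the two forms are treated as the same page, here is a minimal sketch in plain Python (not the walker's own code) that normalizes a top-level URL so an empty path becomes "/". The helper name `normalize_base` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_base(url: str) -> str:
    """Return the URL with an empty path replaced by '/' (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # "http://host" has an empty path
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       parts.query, parts.fragment))

# The slash and non-slash forms of the example site from the question.
bases = ["http://www.mysite.com", "http://www.mysite.com/"]
unique = sorted({normalize_base(b) for b in bases})
print(unique)  # both entries collapse to a single URL
```

Running something like this over your list of 21 base URLs is a quick way to spot entries that differ only by the trailing slash.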

Review the walk errors to see if there's anything unexpected. Errors on a page will generally prevent following links from that page. Review your settings to see if there's anything that would prevent desired URLs from being processed. For some sites you need to remove ? from the exclusions and turn off "strip queries".
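As a rough illustration of how query handling can shrink the page count, the sketch below models the two effects in plain Python. It is an assumption about the general behavior, not the walker's actual implementation, and the `page.php?id=` URLs are invented examples.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Drop the query string and fragment from a URL (illustrative model only)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Three distinct pages on a hypothetical site that uses query strings.
links = [
    "http://www.mysite.com/page.php?id=1",
    "http://www.mysite.com/page.php?id=2",
    "http://www.mysite.com/page.php?id=3",
]

# With queries stripped, all three links collapse into one URL.
kept_with_stripping = {strip_query(u) for u in links}

# With "?" in the exclusions, none of the three links survive.
kept_with_question_mark_excluded = [u for u in links if "?" not in u]

print(len(kept_with_stripping), len(kept_with_question_mark_excluded))
```

If the missing pages on your sites are mostly reached through links like these, that would account for a large drop in the total.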

Find a page that you expect to be in the database but isn't. Determine the page that links to that page (its parent). Find the parent in List/Edit URLs. Click "Children" to see what links were found on that page. Look for the missing one. See if there's any error listed next to it. You can turn verbosity up to 4 to generate "errors" for all hyperlinks discarded due to rules such as bad extension, etc.
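If you want to double-check a parent page outside the walker, a small standalone script can list its links and compare them against your exclusions. This is only a sketch using the Python standard library. The parent URL, the HTML snippet, and the prefix-matching rule in `classify` are all assumptions for illustration and may not match how the walker applies exclusions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href> tags of one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the parent page's URL.
                self.links.append(urljoin(self.base_url, href))

def classify(url, exclusions):
    """Assumed rule: a URL is excluded if it starts with an exclusion prefix."""
    for prefix in exclusions:
        if url.startswith(prefix):
            return "excluded by " + prefix
    return "kept"

# Hypothetical parent page and the exclusion from the original question.
parent_url = "http://www.mysite.com/index.html"
parent_html = '<a href="/myfolder/a.html">A</a> <a href="news/b.html">B</a>'
exclusions = ["http://www.mysite.com/myfolder/"]

collector = LinkCollector(parent_url)
collector.feed(parent_html)
report = {url: classify(url, exclusions) for url in collector.links}
for url, status in report.items():
    print(status, url)
```

Comparing this list with what "Children" shows for the same parent can help narrow down whether a link was never found or was found and then discarded.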