latest dowalk doesn't crawl pages without extensions

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

Website pages such as http://www.somesite.com/xyz?123+345 are not getting crawled. This behavior has changed: I have an old 5.0.10 database in which these pages were crawled. I have no restrictions for this website; it is listed in the base url section as http://www.somesite.com and in one of the categories as http://www.somesite.com/*. I am not walking all extensions, because tech support told me that web pages without any extension are always walked.

What's wrong with the new script, or is it me?
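
To make "a page without an extension" concrete, here is a minimal Python sketch. It is not Webinator's own logic, and the helper name `url_extension` is hypothetical. It takes the extension from the URL's path only, ignoring the query string, so the example URL above has none.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the URL's path (e.g. '.html'), or '' if none."""
    # Only the path counts; the query string ("?123+345") is not part of it.
    path = urlsplit(url).path
    return posixpath.splitext(path)[1]


if __name__ == "__main__":
    for url in (
        "http://www.somesite.com/xyz?123+345",
        "http://www.somesite.com/news/index.html",
    ):
        print(url, "->", repr(url_extension(url)))
```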
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If you have verbosity at 4, what is the reason listed for why that URL is not followed?
John Turnbull
Thunderstone Software
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

Post by hiti »

John,
I am facing the same problem. I was crawling a site whose pages do not have any extension. The URLs are probably made dynamic through that site's .htaccess file. Webinator is now showing all failures in the walk status.
I just want to know what settings I need to change so that pages with no extension get crawled.
Please help.

Post by John »

Pages with no extension should be crawled by default. There is probably some other setting that is causing the problem. What message is given for why the page is not indexed?
John Turnbull
Thunderstone Software

Post by hiti »

John,
I am getting two types of messages in the error log:
1. Off site Link
2. Unwanted Prefix

The site contains advertisements and sponsor ads along with the product details. I have set the verbosity to 4.
Please let me know what specific settings I need to change in Webinator.

Post by John »

The offsite link message indicates that the link is to a different server, and won't be followed. If there are specific other servers you want indexed you can add those.

The unwanted prefix message generally indicates that your base url has several components to it, and you have "Stay Under" set, so it will only follow links with the same prefix. For example, if your base url is http://somesite/news/index.html, then only links starting with http://somesite/news/ will be indexed if Stay Under is set to Y.
John Turnbull
Thunderstone Software
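
The two messages described above can be illustrated with a rough Python sketch. This is an assumption-laden model of the behavior as explained in this thread, not Webinator's actual code: the function name `classify_link` and the parameters `stay_under` and `extra_hosts` are hypothetical stand-ins for the Stay Under setting and any additional servers you have enabled.

```python
from urllib.parse import urlsplit


def classify_link(link, base_url, stay_under=True, extra_hosts=()):
    """Return 'Off site Link', 'Unwanted Prefix', or 'followed' for a link.

    Hypothetical model: a link is off-site when its host is neither the
    base url's host nor one of extra_hosts; with stay_under on, a link on
    the base host must also start with the base url's directory prefix.
    """
    base = urlsplit(base_url)
    target = urlsplit(link)

    allowed_hosts = {base.hostname, *extra_hosts}
    if target.hostname not in allowed_hosts:
        return "Off site Link"

    if stay_under and target.hostname == base.hostname:
        # Directory prefix: base path up to and including its last "/".
        prefix = base.path.rsplit("/", 1)[0] + "/"
        if not (target.path or "/").startswith(prefix):
            return "Unwanted Prefix"

    return "followed"


if __name__ == "__main__":
    base = "http://somesite/news/index.html"
    for link in (
        "http://somesite/news/today.html",
        "http://somesite/sports/scores.html",
        "http://othersite/news/today.html",
    ):
        print(link, "->", classify_link(link, base))
```

With the base url from the example, a link under http://somesite/news/ is followed, a link elsewhere on the same host is an unwanted prefix, and a link to another server is off-site.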

Post by hiti »

Thanks, John, for the swift reply. I have already set Stay Under to Y.
The site I am having this problem with is shopping.com. If you view this site in a browser, you will find that none of its pages has any kind of extension.

This is my customized code for shopping.com, which has never succeeded.

<if $baseUrl eq "http://shopping.com"><!-- Not Working -->
<rex '>><div class\="contentIndent">\P=!</h1>+\F</h1>' $rawdoc><$StoryTitle=$ret>
<rex '>><div class\="prodImage">\P=!</div>+\F</div>' $rawdoc><$ImgRowData=$ret>
<rex '>><div id\="long" style\="display: block;">\P=!</div>+\F</div>' $rawdoc><$StoryRowDescription=$ret>
<sandr '>><div class\="boxMid">=!<div class\="boxBtmRt">+<div class\="boxBtmRt">' '' $rawdoc><$rawdoc=$ret>
<sandr '>><div id\="saiArea">=!</iframe>+</iframe>' '' $rawdoc><$rawdoc=$ret>
<$ImgRowData=$StoryRowDescription>
<$SiteName ="Shopping">
<filterStory>
</if>

Please help
Thanks in advance

Post by John »

It looks as if shopping.com redirects to www#.shopping.com, where # is some number. Those hosts will be treated as off-site unless you have enabled them. I'm also not sure where in the code your script is, or where $baseUrl is getting set.
John Turnbull
Thunderstone Software

Post by hiti »

Yes, John, I completely agree with you. The code I gave in my earlier post is what I added to the dowalk script.
Please let me know if there is any flaw in the code, and what specific settings I need to change in Webinator to get this site crawled.
Please help.
Thanks in advance.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Add the www#.shopping.com variations to the extra domains field or use one of them as your base url instead of just "shopping.com".
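
To find which www#.shopping.com variations actually appear, one option is to scan the URLs reported as off-site in the walk log. The Python sketch below is only an illustration of that check: the pattern and the helper name `is_shopping_host` are assumptions, and it does not guess which numbered hosts exist or how Webinator matches the extra domains field.

```python
import re
from urllib.parse import urlsplit

# Matches "shopping.com" and "www#.shopping.com", where # is zero or more digits.
SHOPPING_HOST = re.compile(r"(?:www\d*\.)?shopping\.com")


def is_shopping_host(url: str) -> bool:
    """True if the URL's host is shopping.com or a www#.shopping.com variation."""
    host = urlsplit(url).hostname or ""
    # fullmatch rejects look-alikes such as "www.shopping.com.example".
    return SHOPPING_HOST.fullmatch(host) is not None


if __name__ == "__main__":
    for url in (
        "http://shopping.com",
        "http://www3.shopping.com/xyz",
        "http://ads.example.com/banner",
    ):
        print(url, "->", is_shopping_host(url))
```

Hosts that pass this check are candidates for the extra domains field or for use as the base url.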