latest dowalk doesn't crawl pages without extensions

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

latest dowalk doesn't crawl pages without extensions

Post by KMandalia »

Website pages like http://www.somesite.com/xyz?123+345 are not getting crawled. This behavior has changed, since I have an old 5.0.10 database that does have these pages crawled. I have no restrictions for this website, and it is listed in the base URL section as http://www.somesite.com and in one of the categories as http://www.somesite.com/*. I am not walking all extensions, as tech support told me that web pages without any extensions are always walked.
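In Python terms, my understanding of the rule tech support quoted is roughly the following. This is only an illustration of the idea; the function name and extension list are my own assumptions, not Webinator's code.

from urllib.parse import urlparse

# Illustrative only: an extension filter that always passes extensionless URLs.
ALLOWED_EXTENSIONS = {".html", ".htm", ".txt"}  # assumed walk list

def should_fetch(url):
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True  # no extension at all: always walked
    ext = "." + last_segment.rsplit(".", 1)[-1].lower()
    return ext in ALLOWED_EXTENSIONS

print(should_fetch("http://www.somesite.com/xyz?123+345"))  # True: no extension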

What's wrong with the new script, or is it me?
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

John
I am facing the same problem. I was crawling a site whose pages have no extensions; the URLs are probably made dynamic through the site's .htaccess file. Now Webinator is showing all failures in the walk status.
I just want to know what settings I should use so that pages without an extension get crawled.
Please help.
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

John
There are two types of messages I am getting in the error log:
1. Off site link
2. Unwanted prefix

The site contains advertisements and sponsor ads alongside the product details. I have set the verbosity to 4.
Please let me know what specific settings I need to make in Webinator.
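As far as I can tell, those two messages mean roughly the following. This is a Python sketch of the idea, not the actual dowalk logic, and the exclusion patterns shown are an assumption.

from urllib.parse import urlparse

BASE_HOST = "www.somesite.com"
EXCLUSIONS = ["?", "~"]  # assumed default-style exclusion patterns

def classify(link):
    if urlparse(link).netloc != BASE_HOST:
        return "Off site link"    # different host, e.g. ad and sponsor servers
    if any(p in link for p in EXCLUSIONS):
        return "Unwanted prefix"  # matches a configured exclusion
    return "ok"

print(classify("http://ads.example.net/banner"))        # Off site link
print(classify("http://www.somesite.com/xyz?123+345"))  # Unwanted prefix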
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Thanks, John, for the swift reply. I have already set StayUnder to Y.
The site for which I am getting this problem is shopping.com. If you view the site in a browser you will see that none of its pages has any kind of extension.

This is my customised code for shopping.com, which never succeeds:

<if $baseUrl eq "http://shopping.com"><!-- Not Working -->
  <!-- Pull the title: the text between <div class="contentIndent"> and the next </h1> -->
  <rex '>><div class\="contentIndent">\P=!</h1>+\F</h1>' $rawdoc><$StoryTitle=$ret>
  <!-- Pull the product image block between <div class="prodImage"> and </div> -->
  <rex '>><div class\="prodImage">\P=!</div>+\F</div>' $rawdoc><$ImgRowData=$ret>
  <!-- Pull the long description block -->
  <rex '>><div id\="long" style\="display: block;">\P=!</div>+\F</div>' $rawdoc><$StoryRowDescription=$ret>
  <!-- Strip the boxMid sidebar and the saiArea iframe out of the raw document -->
  <sandr '>><div class\="boxMid">=!<div class\="boxBtmRt">+<div class\="boxBtmRt">' '' $rawdoc><$rawdoc=$ret>
  <sandr '>><div id\="saiArea">=!</iframe>+</iframe>' '' $rawdoc><$rawdoc=$ret>
  <!-- Note: this overwrites the image data captured above with the description -->
  <$ImgRowData=$StoryRowDescription>
  <$SiteName="Shopping">
  <filterStory>
</if>

Please help
Thanks in advance
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Yes, John, I completely agree with you. The code I gave you in my earlier post is the one I have added to the dowalk script.
Please let me know if there is any flaw in the code, and also what specific settings I need to make in Webinator to get this site crawled.
Please help.
Thanks in advance.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

latest dowalk doesn't crawl pages without extensions

Post by mark »

Add the www#.shopping.com variations to the extra domains field, or use one of them as your base URL instead of just "shopping.com".
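Conceptually, that entry widens the off-site check to accept the numbered hosts, along the lines of this Python sketch (my illustration here, not Webinator's pattern syntax):

import re

# www#.shopping.com rendered as a regex: www followed by an optional number.
EXTRA_DOMAIN = re.compile(r"^www\d*\.shopping\.com$")

for host in ("www.shopping.com", "www10.shopping.com", "shopping.com"):
    print(host, "in scope" if EXTRA_DOMAIN.match(host) else "off site")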
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

I tried both ways. The error I got is somewhat like this:

Not in requirements http://www10.shopping.com/

I didn't get any success either way; I am only getting failed pages, and the crawl stops automatically. Many of the other error messages were:
The link: http://www.shopping.com/xCH-home_and_garden
Had this error: Unwanted prefix
Referenced by: http://www.shopping.com/
http://www.shopping.com/xPC-Hasbro-Tran ... r-Megatron

I urgently need this site crawled. Please help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

latest dowalk doesn't crawl pages without extensions

Post by mark »

I just created a walk using the latest scripts and all default settings except for
Base URL: http://www.shopping.com/
Exclusions: remove ? and ~ from exclusions
Strip Queries: N
and was able to crawl them just fine. I stopped after 100 pages so as to not bug them.
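Strip Queries has to be N here because those pages are distinguished only by their query strings; with stripping on you would effectively get this (a conceptual Python sketch, not dowalk code):

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    scheme, netloc, path, _query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", fragment))

print(strip_query("http://www.somesite.com/xyz?123+345"))
# -> http://www.somesite.com/xyz  (the query that selects the page is lost)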

I'm not sure where you got http://shopping.com or http://www10.shopping.com . Anything other than http://www.shopping.com just redirects back to www.shopping.com when I try.

I also notice that the website is presented differently to Webinator than to my browser. Instead of the /xYADDA links it presents all links with query strings and category numbers. Setting the user agent to an IE6 agent string makes it give the /xYADDA links, which I was also able to crawl OK. That way I got the page you mention as rejected above.
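The user agent setting is just a request header; in Python terms it is the equivalent of the following (illustrative; the IE6 string shown is a typical one, not necessarily the exact value I used):

import urllib.request

IE6_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

req = urllib.request.Request("http://www.shopping.com/",
                             headers={"User-Agent": IE6_UA})
with urllib.request.urlopen(req) as resp:
    html = resp.read()
print(len(html), "bytes fetched with an IE6 agent string")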
neetu
Posts: 9
Joined: Wed Aug 22, 2007 1:07 am

latest dowalk doesn't crawl pages without extensions

Post by neetu »

I wanted to know how I can get a walk with no error pages. Is it possible to get only successes? Also, should I set robots.txt to No to make certain sites crawl?
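As I understand it, the robots.txt setting controls whether disallowed URLs get skipped, roughly like this Python check (an illustration only, not Webinator's implementation; the agent name is assumed):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("http://www.somesite.com/robots.txt")
rp.read()
# Disallowed URLs are skipped by the walker and so never become successes.
print(rp.can_fetch("Webinator", "http://www.somesite.com/xyz?123+345"))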

Please reply.
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Mark
I followed your message, but this time got only one success and three failures. However, I need to ask you one thing: in the Useragent field I have entered "Internet Explorer version 6.0". Is that correct, or should I write only "IE6"?
One more thing: does Webinator crawl sites differently depending on the browser string? Do we need to crawl the sites as IE only?

Thanks in advance