latest dowalk doesn't crawl pages without extensions

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

latest dowalk doesn't crawl pages without extensions

Post by KMandalia »

Website pages like http://www.somesite.com/xyz?123+345 are not getting crawled. This behavior has changed, since I have an old 5.0.10 database that does have these pages crawled. I have no restrictions for this website, and it is listed in the base URL section as http://www.somesite.com and in one of the categories as http://www.somesite.com/*. I am not walking all extensions, as tech support told me that web pages without any extensions are always walked.
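In Python terms, my understanding of the rule tech support quoted is roughly the following. This is only an illustration of the idea; the function name and extension list are my own assumptions, not Webinator's code.

from urllib.parse import urlparse

# Illustrative only: an extension filter that always passes extensionless URLs.
ALLOWED_EXTENSIONS = {".html", ".htm", ".txt"}  # assumed walk list

def should_fetch(url):
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True  # no extension at all: always walked
    ext = "." + last_segment.rsplit(".", 1)[-1].lower()
    return ext in ALLOWED_EXTENSIONS

print(should_fetch("http://www.somesite.com/xyz?123+345"))  # True: no extension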

What's wrong with the new script, or is it me?
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

John
I am facing the same problem. I was crawling a site whose pages have no extensions; the URLs are probably made dynamic through the site's .htaccess file. Now Webinator is showing all failures in the walk status.
I just want to know what settings I should use so that pages without an extension get crawled.
Please help.
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

John
There are two types of messages I am getting in the error log:
1. Off site link
2. Unwanted prefix

The site contains advertisements and sponsor ads alongside the product details. I have set the verbosity to 4.
Please let me know what specific settings I need to make in Webinator.
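As far as I can tell, those two messages mean roughly the following. This is a Python sketch of the idea, not the actual dowalk logic, and the exclusion patterns shown are an assumption.

from urllib.parse import urlparse

BASE_HOST = "www.somesite.com"
EXCLUSIONS = ["?", "~"]  # assumed default-style exclusion patterns

def classify(link):
    if urlparse(link).netloc != BASE_HOST:
        return "Off site link"    # different host, e.g. ad and sponsor servers
    if any(p in link for p in EXCLUSIONS):
        return "Unwanted prefix"  # matches a configured exclusion
    return "ok"

print(classify("http://ads.example.net/banner"))        # Off site link
print(classify("http://www.somesite.com/xyz?123+345"))  # Unwanted prefix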
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Thanks, John, for the swift reply. I have already set StayUnder to Y.
The site for which I am getting this problem is shopping.com. If you view the site in a browser you will see that none of its pages has any kind of extension.

This is my customised code for shopping.com, which never succeeds:

<if $baseUrl eq "http://shopping.com"><!-- Not Working -->
  <!-- Pull the title: the text between <div class="contentIndent"> and the next </h1> -->
  <rex '>><div class\="contentIndent">\P=!</h1>+\F</h1>' $rawdoc><$StoryTitle=$ret>
  <!-- Pull the product image block between <div class="prodImage"> and </div> -->
  <rex '>><div class\="prodImage">\P=!</div>+\F</div>' $rawdoc><$ImgRowData=$ret>
  <!-- Pull the long description block -->
  <rex '>><div id\="long" style\="display: block;">\P=!</div>+\F</div>' $rawdoc><$StoryRowDescription=$ret>
  <!-- Strip the boxMid sidebar and the saiArea iframe out of the raw document -->
  <sandr '>><div class\="boxMid">=!<div class\="boxBtmRt">+<div class\="boxBtmRt">' '' $rawdoc><$rawdoc=$ret>
  <sandr '>><div id\="saiArea">=!</iframe>+</iframe>' '' $rawdoc><$rawdoc=$ret>
  <!-- Note: this overwrites the image data captured above with the description -->
  <$ImgRowData=$StoryRowDescription>
  <$SiteName="Shopping">
  <filterStory>
</if>

Please help
Thanks in advance
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Yes, John, I completely agree with you. The code I gave you in my earlier post is the one I have added to the dowalk script.
Please let me know if there is any flaw in the code, and also what specific settings I need to make in Webinator to get this site crawled.
Please help.
Thanks in advance.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

latest dowalk doesn't crawl pages without extensions

Post by mark »

Add the www#.shopping.com variations to the extra domains field, or use one of them as your base URL instead of just "shopping.com".
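Conceptually, that entry widens the off-site check to accept the numbered hosts, along the lines of this Python sketch (my illustration here, not Webinator's pattern syntax):

import re

# www#.shopping.com rendered as a regex: www followed by an optional number.
EXTRA_DOMAIN = re.compile(r"^www\d*\.shopping\.com$")

for host in ("www.shopping.com", "www10.shopping.com", "shopping.com"):
    print(host, "in scope" if EXTRA_DOMAIN.match(host) else "off site")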
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

I tried both ways. The error I got is somewhat like this:

Not in requirements http://www10.shopping.com/

I didn't get any success either way; I am only getting failed pages, and the crawl stops automatically. Many of the other error messages were:
The link: http://www.shopping.com/xCH-home_and_garden
Had this error: Unwanted prefix
Referenced by: http://www.shopping.com/
http://www.shopping.com/xPC-Hasbro-Tran ... r-Megatron

I urgently need this site crawled. Please help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

latest dowalk doesn't crawl pages without extensions

Post by mark »

I just created a walk using the latest scripts and all default settings except for
Base URL: http://www.shopping.com/
Exclusions: remove ? and ~ from exclusions
Strip Queries: N
and was able to crawl them just fine. I stopped after 100 pages so as to not bug them.
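Strip Queries has to be N here because those pages are distinguished only by their query strings; with stripping on you would effectively get this (a conceptual Python sketch, not dowalk code):

from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    scheme, netloc, path, _query, fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", fragment))

print(strip_query("http://www.somesite.com/xyz?123+345"))
# -> http://www.somesite.com/xyz  (the query that selects the page is lost)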

I'm not sure where you got http://shopping.com or http://www10.shopping.com . Anything other than http://www.shopping.com just redirects back to www.shopping.com when I try.

I also notice that the website is presented differently to Webinator than to my browser. Instead of the /xYADDA links it presents all links with query strings and category numbers. Setting the user agent to an IE6 agent string makes it give the /xYADDA links, which I was also able to crawl OK. That way I got the page you mention as rejected above.
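The user agent setting is just a request header; in Python terms it is the equivalent of the following (illustrative; the IE6 string shown is a typical one, not necessarily the exact value I used):

import urllib.request

IE6_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

req = urllib.request.Request("http://www.shopping.com/",
                             headers={"User-Agent": IE6_UA})
with urllib.request.urlopen(req) as resp:
    html = resp.read()
print(len(html), "bytes fetched with an IE6 agent string")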
neetu
Posts: 9
Joined: Wed Aug 22, 2007 1:07 am

latest dowalk doesn't crawl pages without extensions

Post by neetu »

I wanted to know how I can get a walk with no error pages. Is it possible to get only successes? Also, should I set robots.txt to No to make certain sites crawl?
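As I understand it, the robots.txt setting controls whether disallowed URLs get skipped, roughly like this Python check (an illustration only, not Webinator's implementation; the agent name is assumed):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("http://www.somesite.com/robots.txt")
rp.read()
# Disallowed URLs are skipped by the walker and so never become successes.
print(rp.can_fetch("Webinator", "http://www.somesite.com/xyz?123+345"))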

Please reply.
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

latest dowalk doesn't crawl pages without extensions

Post by hiti »

Mark
I followed your message, but this time got only one success and three failures. However, I need to ask you one thing: in the Useragent field I have entered "Internet Explorer version 6.0". Is that correct, or should I write only "IE6"?
One more thing: does Webinator crawl sites differently depending on the browser string? Do we need to crawl the sites as IE only?

Thanks in advance