latest dowalk doesn't crawl pages without extensions

Post by **hiti** » Wed Aug 22, 2007 1:48 am

I tried both ways. The error I got is something like this:

Not in requirements http://www10.shopping.com/

Neither way succeeded. I am only getting failed pages, and the crawl stops automatically. Many of the other error messages looked like this:
The link : http://www.shopping.com/xCH-home_and_garden
Had this error: Unwanted prefix
Referenced by : http://www.shopping.com/
http://www.shopping.com/xPC-Hasbro-Tran ... r-Megatron

I urgently need to get this site crawled. Please help.

Post by **mark** » Wed Aug 22, 2007 10:31 am

I just created a walk using the latest scripts and all default settings except for
Base URL: http://www.shopping.com/
Exclusions: remove ? and ~ from exclusions
Strip Queries: N
and was able to crawl them just fine. I stopped after 100 pages so as to not bug them.

I'm not sure where you got http://shopping.com or http://www10.shopping.com . Anything other than http://www.shopping.com just redirects back to www.shopping.com when I try.

I also notice that the website is presented differently to Webinator than to my browser. Instead of the /xYADDA links, it presents all links with query strings and category numbers. Setting the user agent to an IE6 agent string makes it give the /xYADDA links, which I was also able to crawl OK. This way I got the page you mention as rejected above.
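
A rough illustration of why removing `?` from the exclusions matters for a site that serves query-string links. This sketch assumes exclusions behave as plain substrings, which may not match Webinator's actual rules; the `is_excluded` helper and both sample lists (including `/cgi-bin/`) are hypothetical and are not Webinator code.

```python
def is_excluded(url, exclusions):
    """Return True if any exclusion substring occurs anywhere in the URL."""
    return any(pattern in url for pattern in exclusions)


# Hypothetical exclusion lists: one that still contains ? and ~,
# and one with those two entries removed.
with_query_chars = ["?", "~", "/cgi-bin/"]
without_query_chars = ["/cgi-bin/"]

urls = [
    "http://www.shopping.com/?PG=13",
    "http://www.shopping.com/xCH-home_and_garden",
]

for url in urls:
    # Print whether each list would reject the URL.
    print(url, is_excluded(url, with_query_chars), is_excluded(url, without_query_chars))
```

Under that assumption, every link carrying a query string is rejected as long as `?` stays in the list, which would leave very little to crawl when the site presents only query-string links.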

Post by **neetu** » Thu Aug 23, 2007 1:59 am

How can I get a crawl to succeed without any error pages? Is it possible to get only successful pages? Also, should I set robots.txt to No in order to crawl certain sites?

Please reply.

Post by **hiti** » Thu Aug 23, 2007 8:40 am

Mark
I followed your message, but this time I got only one success and three failures. One question: in the User Agent setting I entered "Internet Explorer version 6.0". Is that correct, or should I write only "IE6"?
One more thing: does Webinator crawl a site differently depending on the browser? Do we need to crawl sites as IE only?

Thanks in advance

Post by **mark** » Thu Aug 23, 2007 11:38 am

neetu, I'm not sure what you're asking.
Turning verbosity down to 2 will eliminate all of the merely informational "error" messages.
Nothing will eliminate all errors unless there are no errors fetching and processing pages from the site.

Please read the docs on using robots.txt in the Webinator manual, as well as the general discussion of what robots.txt is that is linked to from the Webinator docs. Then decide if you want to have Webinator respect robots.txt or not. You probably need to decide that on a site by site basis.
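
For anyone unsure what "respecting robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Webinator implements the check; the robots.txt content, the `ExampleCrawler` agent name, the example.com URLs, and the `can_fetch` wrapper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch this
# from the site's /robots.txt before crawling.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that respects robots.txt skips the first URL and fetches the second.
print(can_fetch(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/private/page.html"))
print(can_fetch(ROBOTS_TXT, "ExampleCrawler", "http://www.example.com/index.html"))
```

A crawler told to ignore robots.txt simply skips this check, which is why the choice is best made per site.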

Post by **mark** » Thu Aug 23, 2007 11:43 am

hiti, the useragent string should be a copy of the string sent by the IE browser so that the site thinks it's being accessed by IE instead of an unknown crawler. I used

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

as taken from an actual web access log entry from someone using IE.

Webinator doesn't "use" any browser. Webinator is, in effect, a browser. The User Agent setting just lets you fake the server into thinking that Webinator is IE so that it presents pages in the same manner as it would to someone using IE.

All of this only applies to sites that present pages differently to different browsers. You need to determine that on a site by site basis. You may be able to get away with always using an IE agent string as above.
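
To make the idea concrete outside of Webinator, here is a small sketch of the same technique with Python's standard `urllib.request`: the client sends the IE string quoted above as its User-Agent header. The `build_request` helper is hypothetical, and the sketch only builds the request without sending it.

```python
from urllib.request import Request

# The IE6 agent string quoted in the post above.
IE6_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def build_request(url, user_agent=IE6_USER_AGENT):
    """Build a request that identifies itself with the given User-Agent string."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.shopping.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually send it, pass the request to urllib.request.urlopen(req).
```

Writing "IE6" or "Internet Explorer version 6.0" in that field would not work the same way, because a server that varies its pages by browser is matching against the full string a real browser sends.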

Post by **neetu** » Tue Aug 28, 2007 3:40 am

When I crawl a site, the images do not appear with the links; the site gets crawled without images.
The images are present on the crawled site, but they are served from another site, whose link is http://eur.news1.yimg.com/eur.yimg.com/ ... 0VPJBPfw--
and the link of the page I crawled is http://uk.news.yahoo.com/afp/20070828/t ... 640_1.html

I also tried crawling the site with the image link in the extra domain field of Webinator, but it was again crawled without images.

Please reply as soon as possible.

Post by **mark** » Tue Aug 28, 2007 11:08 am

Not sure what you're getting at. Images aren't normally crawled. Why would you want to index images? If you really want to you can add .jpg to the extensions list. Hyperlinks aren't domains. The domain of the above image link would be eur.news1.yimg.com . Maybe what you want is the "offsite-pages" option?
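
To show the difference between a hyperlink and its domain, here is a short sketch using Python's standard `urllib.parse`. The `domain_of` helper is hypothetical, and the paths in the sample URLs are placeholders rather than the real (truncated) links from the post above.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name part of a URL, which is what a domain field expects."""
    return urlsplit(url).hostname


# Placeholder paths; only the host names come from the links in the thread.
print(domain_of("http://eur.news1.yimg.com/example/image.jpg"))   # eur.news1.yimg.com
print(domain_of("http://uk.news.yahoo.com/example/story.html"))   # uk.news.yahoo.com
```

Only the host name belongs in a domain setting; the rest of the link identifies one specific file on that host.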

If you'd like Thunderstone to configure your crawls for you you should open a tech support ticket to request consulting time.

Post by **hiti** » Wed Aug 29, 2007 9:32 am

Mark
I am still not able to crawl the site http://www.shopping.com/ . I created a new profile and crawled it again with the settings you mentioned, but still didn't get any pages crawled. I also put www10.shopping.com in the extra domains. I know you were able to crawl this site successfully, but so far I have not had a single successful page.
Most of the error messages were Offsite. However, I also got this message:
The link : http://www.shopping.com/xCH-baby_care
Had this error: Will not allocate 1702 bytes of memory: JavaScript exceeded scriptmem limit at http://img.shopping.com/jfe/JavaFrontEn ... ure.js:170
Referenced by : http://www.shopping.com/
http://www.shopping.com/?PG=13
http://www.shopping.com/?whatsHotGrp=1
http://www.shopping.com/?whatsHotGrp=4
http://www.shopping.com/xCC--creative_labs
http://www.shopping.com/?whatsHotGrp=5
http://www.shopping.com/?whatsHotGrp=6
http://www.shopping.com/aa21
http://www.shopping.com/?whatsHotGrp=9
http://www.shopping.com/top_searches

Please help.

Post by **mark** » Wed Aug 29, 2007 10:10 am

If you'd like Thunderstone to configure your crawls for you you should open a tech support ticket to request consulting time.