latest dowalk doesn't crawl pages without extensions

Post by **mark** » Thu Aug 23, 2007 11:38 am

neetu, I'm not sure what you're asking.
Turning verbosity down to 2 will eliminate all of the merely informational "error" messages.
Nothing will eliminate all errors unless there are no errors fetching and processing pages from the site.

Please read the section on robots.txt in the Webinator manual, as well as the general discussion of what robots.txt is that the Webinator docs link to. Then decide whether you want Webinator to respect robots.txt or not. You will probably need to decide that on a site-by-site basis.
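
For readers unfamiliar with what "respecting robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how Webinator implements the check; the rules in `ROBOTS_TXT`, the `ExampleCrawler` agent name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this
# from the site root instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleCrawler" is an invented agent name for this sketch.
    for url in (
        "http://www.example.com/index.html",
        "http://www.example.com/private/report.html",
    ):
        print(url, allowed(ROBOTS_TXT, "ExampleCrawler", url))
```

A crawler that respects robots.txt skips any URL for which this kind of check fails; one that ignores it fetches everything, which is the choice to make per site.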

Post by **mark** » Thu Aug 23, 2007 11:43 am

hiti, the useragent string should be a copy of the string sent by the IE browser so that the site thinks it's being accessed by IE instead of an unknown crawler. I used

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

as taken from an actual web access log entry from someone using IE.

Webinator doesn't "use" any browser. Webinator is, in effect, a browser. The User Agent setting just lets you fake the server into thinking that Webinator is IE so that it presents pages in the same manner as it would to someone using IE.

All of this only applies to sites that present pages differently to different browsers. You need to determine that on a site by site basis. You may be able to get away with always using an IE agent string as above.
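
To illustrate the general idea outside of Webinator, the sketch below builds an HTTP request in Python that carries the IE agent string quoted above. This only shows what "sending a User-Agent" means at the HTTP level; the `build_request` helper and the example.com URL are assumptions for the example, not Webinator settings.

```python
from urllib.request import Request

# The IE 6 agent string quoted earlier in this post.
IE_USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def build_request(url: str, user_agent: str = IE_USER_AGENT) -> Request:
    """Build a request whose User-Agent header mimics a browser."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Hypothetical URL; nothing is sent over the network here.
    req = build_request("http://www.example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

A server that varies its output by browser would see this request as coming from IE 6 and respond accordingly.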

Post by **neetu** » Tue Aug 28, 2007 3:40 am

While crawling a site, the images do not appear with the links; the site gets crawled without its images.
The images are shown on the crawled site, but they are served from another site, at http://eur.news1.yimg.com/eur.yimg.com/ ... 0VPJBPfw--
The page I crawled is http://uk.news.yahoo.com/afp/20070828/t ... 640_1.html

I also tried crawling the site with the image link entered in Webinator's extra domain field, but it was again crawled without the images.

Please reply ASAP.

Post by **mark** » Tue Aug 28, 2007 11:08 am

Not sure what you're getting at. Images aren't normally crawled. Why would you want to index images? If you really want to, you can add .jpg to the extensions list. Hyperlinks aren't domains. The domain of the above image link would be eur.news1.yimg.com. Maybe what you want is the "offsite-pages" option?
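
As a quick illustration of the link-versus-domain distinction, this Python sketch pulls the host out of a URL with the standard `urllib.parse` module. Only the visible prefixes of the two truncated links above are used; the `host_of` helper is a name invented for this example.

```python
from urllib.parse import urlparse

# Visible prefixes of the two (truncated) links from the question.
IMAGE_URL_PREFIX = "http://eur.news1.yimg.com/eur.yimg.com/"
PAGE_URL_PREFIX = "http://uk.news.yahoo.com/afp/20070828/"


def host_of(url: str) -> str:
    """Return the host (domain) part of a URL."""
    return urlparse(url).hostname


if __name__ == "__main__":
    print(host_of(IMAGE_URL_PREFIX))  # host serving the image
    print(host_of(PAGE_URL_PREFIX))   # host serving the page
```

The two hosts differ, which is why the image counts as offsite relative to the crawled page, and why a domain field expects the host name rather than the full link.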

If you'd like Thunderstone to configure your crawls for you, you should open a tech support ticket to request consulting time.

Post by **hiti** » Wed Aug 29, 2007 9:32 am

Mark,
I am still not able to crawl the site http://www.shopping.com/. I created a new profile and crawled it again using the settings you mentioned, but still no pages were crawled. I also put www10.shopping.com in the extra domain field. I know you were able to crawl this site successfully, but so far I have not managed to crawl a single page.
Most of the error messages were "Offsite". However, I also got this message:
The link : http://www.shopping.com/xCH-baby_care
Had this error: Will not allocate 1702 bytes of memory: JavaScript exceeded scriptmem limit at http://img.shopping.com/jfe/JavaFrontEn ... ure.js:170
Referenced by : http://www.shopping.com/
http://www.shopping.com/?PG=13
http://www.shopping.com/?whatsHotGrp=1
http://www.shopping.com/?whatsHotGrp=4
http://www.shopping.com/xCC--creative_labs
http://www.shopping.com/?whatsHotGrp=5
http://www.shopping.com/?whatsHotGrp=6
http://www.shopping.com/aa21
http://www.shopping.com/?whatsHotGrp=9
http://www.shopping.com/top_searches

Please help.

Post by **mark** » Wed Aug 29, 2007 10:10 am

If you'd like Thunderstone to configure your crawls for you, you should open a tech support ticket to request consulting time.

Post by **hiti** » Fri Aug 31, 2007 9:16 am

Mark,
Please let me know how to open a tech support ticket. One more thing: I was crawling the site http://linux.slashdot.org and added http://slashdot.org/ as an extra domain, but it did not succeed.
Can you help me with that?
Thanks,
Hiti

Post by **hiti** » Tue Sep 04, 2007 9:41 am

I am writing customised code for all of my sites. Is there any way to represent a single quote in a regular expression? I have come across many sites that use single quotes in their HTML, so I want to know how to use single quotes in a regular expression.

Would a statement like the one below work?

<rex '>><h1 class\='newsheadlinearticle'>\P=!</h1>+\F</h1>' $rawdoc><$StoryTitle=$ret>

Post by **hiti** » Wed Sep 05, 2007 9:25 am

OK, is the statement below correct?

<rex '>><h1 class\=\x27'newsheadlinearticle\x27'>\P=!</h1>+\F</h1>' $rawdoc><$StoryTitle=$ret>

I wrote the expression to match this markup: <h1 class='newsheadlinearticle'>Test Title</h1>
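
For comparison only, here is how the same idea (writing the single quote as its hex code `\x27` so it never appears literally in the pattern) looks in Python's `re` module. This is not Texis REX syntax and says nothing about whether the Vortex statement above is correct; the `story_title` helper is a name made up for this sketch.

```python
import re

# \x27 is the hex escape for a single quote, so the pattern itself
# contains no literal quote characters.
TITLE_RE = re.compile(r"<h1 class=\x27newsheadlinearticle\x27>(.*?)</h1>")


def story_title(rawdoc: str):
    """Return the headline text, or None if the heading is absent."""
    match = TITLE_RE.search(rawdoc)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = "<h1 class='newsheadlinearticle'>Test Title</h1>"
    print(story_title(sample))
```

Run against the sample markup, the capture group holds the text between the opening and closing `h1` tags.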