
Crawling Issues in Few Sites

Posted: Tue Aug 21, 2007 12:34 pm
by mark

Crawling Issues in Few Sites

Posted: Wed Aug 22, 2007 1:27 am
by neetu
I have crawled a social networking site, and all of its pages (approx. 10,000+) failed; most of the errors are for offsite links. I did not get even a single success.
As far as my settings are concerned, my verbosity is 4.
I have set the Max frame to 20, as I was getting the error "Too Many IFRAMES".

Whereas some other sites were crawled easily by Webinator.

Please help me with this. How should I set my settings to get this site crawled?

Crawling Issues in Few Sites

Posted: Wed Aug 22, 2007 10:03 am
by mark
Can't begin to guess at correct settings without knowing anything about the specific site. Maybe turn verbosity down to 2 so you're not flooded with messages and can see the real errors.

Crawling Issues in Few Sites

Posted: Thu Aug 23, 2007 2:55 am
by neetu
I have crawled the site http://polishlinux.org/ but I did not get any successes, and the error message appeared as:

The link : http://polishlinux.org/category/linux/pclinuxos/
Had this error: Offsite
Referenced by : http://www.polishlinux.org/author/wiezyr/
http://www.polishlinux.org/author/Riklaunim/
http://www.polishlinux.org/author/riklaunim/

Please reply ASAP.

Crawling Issues in Few Sites

Posted: Thu Aug 23, 2007 9:52 am
by John
www.polishlinux.org is a different site than polishlinux.org (different hostname). You should probably give http://www.polishlinux.org/ as the base URL.

Crawling Issues in Few Sites

Posted: Fri Aug 24, 2007 12:11 am
by neetu
As far as we have checked, both URLs, www.polishlinux.org and polishlinux.org, point to the same server. I don't think the two URLs are different. Does Webinator interpret the two URLs in different ways?

Thank you,
Neetu

Crawling Issues in Few Sites

Posted: Fri Aug 24, 2007 10:13 am
by mark
Yes, as John said, the string "polishlinux.org" is different from "www.polishlinux.org" and is therefore a distinct hostname and possibly a distinct web server. The crawler can't know they are the same host. Use the site's preferred name or add the other name as an extra domain.
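
To illustrate the point, here is a minimal Python sketch of an offsite check based purely on hostname string comparison. It is not Webinator's actual code; the function name `is_offsite` and the `extra_domains` parameter are made up for this example. It only shows why a link on www.polishlinux.org looks offsite when the base URL is http://polishlinux.org/, and how listing the other name as an allowed domain changes the result.

```python
from urllib.parse import urlsplit


def is_offsite(link, base_url, extra_domains=()):
    """Return True when the link's hostname is neither the base URL's
    hostname nor one of the explicitly allowed extra domains.

    Hypothetical helper: hostnames are compared as plain strings, so
    "polishlinux.org" and "www.polishlinux.org" count as different hosts.
    """
    link_host = (urlsplit(link).hostname or "").lower()
    allowed = {(urlsplit(base_url).hostname or "").lower()}
    allowed.update(domain.lower() for domain in extra_domains)
    return link_host not in allowed


base = "http://polishlinux.org/"
link = "http://www.polishlinux.org/author/wiezyr/"

# Hostnames differ as strings, so the link is treated as offsite.
print(is_offsite(link, base))  # True

# Allowing the other name as an extra domain brings the link back on site.
print(is_offsite(link, base, extra_domains=["www.polishlinux.org"]))  # False
```

No DNS lookup is done here, which matches the point above: two names can resolve to the same server and still be different hostnames as far as a string comparison is concerned.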

Crawling Issues in Few Sites

Posted: Mon Sep 17, 2007 9:23 am
by hiti
There are a few sites that do not give any successes when crawled. http://gigaom.com/ is one of them. I have tried changing the settings and the code but had no luck.
Can you please crawl this site and let me know whether it gets crawled and gives successes? Why does this site result only in failed pages? Is the code responsible for it, or are some special settings required for this site?
Thanks in advance.

Crawling Issues in Few Sites

Posted: Mon Sep 17, 2007 11:05 am
by mark
That site crawls using standard scripts and default settings. If you have specific questions please post them. Otherwise please open a ticket to arrange paid consulting for Thunderstone to set up your crawls.

Crawling Issues in Few Sites

Posted: Tue Sep 18, 2007 7:24 am
by hiti
My question is: when I put this site up for crawling, I get only offsite links in the error log. The regular expression that I am using to get the story description is:

<rex '>><div class=\x27cont\x27>\P=!</p>+\F</p>' $rawdoc><$StoryRowDescription=$ret>

as the whole story is in the div, as below:
<div class='cont'>story......</p>

Is the above regular expression correct for the above div?
Any suggestions will be welcome.
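
For comparison, the intended extraction (everything between the opening `<div class='cont'>` and the first `</p>`) can be sketched with Python's `re` module. This is not rex syntax and says nothing about how Vortex evaluates the expression above; the function name `story_description` and the sample HTML are made up for illustration.

```python
import re

# Capture everything after <div class='cont'> up to (not including) the
# first </p>. DOTALL lets the story text span multiple lines.
CONT_RE = re.compile(r"<div class='cont'>(.*?)</p>", re.DOTALL)


def story_description(rawdoc):
    """Return the text between the 'cont' div and the first </p>,
    or None when the page does not contain that pattern."""
    match = CONT_RE.search(rawdoc)
    return match.group(1).strip() if match else None


# Hypothetical page fragment shaped like the div described above.
sample = "<html><body><div class='cont'>Story text here.</p></div></body></html>"
print(story_description(sample))  # Story text here.
```

Checking the pattern against a saved copy of one page like this helps separate extraction problems from crawl problems such as the offsite errors.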