Crawling Issues in Few Sites

neetu
Posts: 9
Joined: Wed Aug 22, 2007 1:07 am

Crawling Issues in Few Sites

Post by neetu »

I crawled a social networking site and all of its pages (roughly 10,000+) failed; most of the errors are offsite links. I did not get a single success.
As far as my settings are concerned, my verbosity is 4.
I have set Max frame to 20 because I was getting the error "Too Many IFRAMES".

Some other sites, however, were crawled easily by Webinator.

Please help me with this. How should I configure my settings to get this site crawled?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Crawling Issues in Few Sites

Post by mark »

Can't begin to guess at correct settings without knowing anything about the specific site. Maybe turn verbosity down to 2 so you're not flooded with messages and can see the real errors.
neetu
Posts: 9
Joined: Wed Aug 22, 2007 1:07 am

Crawling Issues in Few Sites

Post by neetu »

John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Crawling Issues in Few Sites

Post by John »

John Turnbull
Thunderstone Software
neetu
Posts: 9
Joined: Wed Aug 22, 2007 1:07 am

Crawling Issues in Few Sites

Post by neetu »

We have checked that both URLs, www.polishlinux.org and polishlinux.org, point to the same server. I don't think the two URLs are different. Does Webinator interpret the two URLs differently?

Thank you,
Neetu
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Crawling Issues in Few Sites

Post by mark »

Yes, as John said, the string "polishlinux.org" is different from "www.polishlinux.org" and is therefore a distinct hostname and possibly a distinct web server. The crawler can't know they are the same host. Use the site's preferred name or add the other name as an extra domain.
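
To illustrate why links end up classified as offsite, here is a minimal Python sketch of an exact hostname comparison. It is not Webinator's actual code; the function names, the `extra_domains` parameter, and the `/about/` path are hypothetical and exist only for this example.

```python
from urllib.parse import urlsplit


def hostname(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_offsite(link, base_url, extra_domains=()):
    """Report a link as offsite unless its hostname exactly matches the
    base URL's hostname or one of the explicitly allowed extra names.
    (Hypothetical helper; a real crawler may apply different rules.)"""
    allowed = {hostname(base_url)} | {d.lower() for d in extra_domains}
    return hostname(link) not in allowed


base = "http://polishlinux.org/"
link = "http://www.polishlinux.org/about/"  # illustrative path only

# Without the extra name, the "www." link counts as offsite.
print(is_offsite(link, base))
# Listing the other name as an extra domain makes the link onsite.
print(is_offsite(link, base, extra_domains=["www.polishlinux.org"]))
```

The comparison is purely on strings, so two names that resolve to the same server are still treated as different hosts unless both are listed.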
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

Crawling Issues in Few Sites

Post by hiti »

There are a few sites that do not give any successes when crawled. http://gigaom.com/ is one of them. I have tried changing the settings and the code but had no luck.
Can you please crawl this site and let me know whether it crawls and gives successes? Why does this site result only in failed pages? Is the code responsible, or does this site require some special settings?
Thanks in advance.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Crawling Issues in Few Sites

Post by mark »

That site crawls using standard scripts and default settings. If you have specific questions, please post them. Otherwise, please open a ticket to arrange paid consulting for Thunderstone to set up your crawls.
hiti
Posts: 26
Joined: Tue Aug 07, 2007 3:37 am

Crawling Issues in Few Sites

Post by hiti »

My question is: when I submit this site for crawling, I get only offsite links in the error log. The regular expression that I am using to get the story description is:

<rex '>><div class=\x27cont\x27>\P=!</p>+\F</p>' $rawdoc><$StoryRowDescription=$ret>

because the whole story is in a div, as below:
<div class='cont'>story......</p>

Is the above regular expression correct for the above div?
Any suggestions are welcome.
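
For comparison, the intended extraction (everything between the opening `<div class='cont'>` and the first following `</p>`) can be sketched with Python's `re` module. This is not REX syntax and does not confirm whether the Vortex expression above is correct; the names `STORY_RE` and `story_description` are hypothetical, and the wrapper tags in the sample string are assumed for illustration.

```python
import re

# Capture the text after <div class='cont'> up to the first </p>.
# Either quote style is accepted around the class value, since a pattern
# written only for single quotes would miss a page that uses double quotes.
STORY_RE = re.compile(r"""<div class=['"]cont['"]>(.*?)</p>""", re.DOTALL)


def story_description(raw_html):
    """Return the story text, or None when the div is not found."""
    match = STORY_RE.search(raw_html)
    return match.group(1) if match else None


# Sample built from the div shown in the post; the wrapper tags are assumed.
sample = "<html><div class='cont'>story......</p></div></html>"
print(story_description(sample))
```

A quick check like this against the fetched page source can show whether the markup really matches the assumed form before the pattern is blamed for the failures.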