Strange walk error

michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Post by michel.weber »

I have a profile that behaves very strangely (seen on both V6.3.4 and 7.0.1).

The walk starts off normally. When I check the walk status after a few minutes, it says:

0 pages in todo
1,276 pages scheduled to be refreshed in the next hour
3,259 pages visited in the last hour (1,276 success/1,983 failed)
1,276 pages in index
0 items in replication queue.

Pages recently walked

1276 pages (132,704 bytes) so far.
1983 errors so far.
0 duplicate pages so far.

When I come back later, it says the walk failed with no usable index.
If I do a 'pause walk and live' partway through, I get the same error.

I created the profile from scratch and changed very little from the default settings.
If you want to reproduce the problem, it is a public web site.

Base URL : http://www.coe.int/t/congress/newssearc ... 31/12/2029
Required REX : /newssearch/
Meta Tags :
    author
    copyright
    dimSector
    dimSectorLevel
    dimDocType
    dimLanguage
    dimEvent
    dimEntity
    dimTheme
    dimGeo
    dimSecurity
    dimFilingPlan
Keep Tags ?
    Begin             End
    <div id="data"    </div id="data"
    <meta name=       >

Word Definition : [\alnum\'\x80-\xff]{1,70}

Any ideas are welcome.

------------------------------------------------------

Search Appliance Walk Report for www_Congress_News_S_Liv

Creating database /usr/local/morph3/texis/wwwCongressNewsSLiv.4a5724ea7/db2...Done.
Walk started at 2009-07-10 13:28:33 (by user)
Verbosity set to 4
JavaScript walking enabled
HTTPS walking enabled
Start fetching at http://www.coe.int/t/congress/newssearc ... 31/12/2029
http://www.coe.int/t/congress/newssearc ... 31/12/2029
Ignore urls containing any of the following:
~
/_vti
/cgi-bin/
/Wires/
/worklist.asp?
.asp?link=http
.asp?MenuL=E
.asp?MenuL=F
.asp?MenuL=GER
.asp?MenuL=ITA
.asp?MenuL=RUS
administration/
ToPrint=yes
WcdDoc.asp
WCDsearch.asp

2009-07-10 13:28:33 started 1 new (4149) on http://www.coe.int/t/congress/newssearc ... 31/12/2029
Using primer: http://www.coe.int/t/congress/newssearc ... 31/12/2029
2009-07-10 13:37:09 Process memory limit exceeded (current: 55,390,208; limit: 50,000,000) (4149)
1619 pages fetched (168,376 bytes) from http://www.coe.int/t/congress/newssearc ... 31/12/2029 took 9 minutes 15 seconds
2009-07-10 13:37:48 started 1 (5561) Resume 4a5725e11b
Using primer: http://www.coe.int/t/congress/newssearc ... 31/12/2029
2009-07-10 13:46:30 Process memory limit exceeded (current: 50,028,544; limit: 50,000,000) (5561)
1980 pages fetched (252,688 bytes) from http://www.coe.int/t/congress/newssearc ... 31/12/2029 took 8 minutes 42 seconds
2009-07-10 13:46:31 started 1 (6282) Resume 4a5725e11b
Using primer: http://www.coe.int/t/congress/newssearc ... 31/12/2029
2009-07-10 13:56:26 Process memory limit exceeded (current: 50,012,160; limit: 50,000,000) (6282)
1971 pages fetched (204,984 bytes) from http://www.coe.int/t/congress/newssearc ... 31/12/2029 took 9 minutes 55 seconds
2009-07-10 13:56:26 started 1 (7187) Resume 4a5725e11b
2009-07-10 14:04:34 Nothing to refresh at /=http=-post?>>\L://www.coe.int/\L (7187)
Using primer: http://www.coe.int/t/congress/newssearc ... 31/12/2029
1845 pages fetched (193,400 bytes) from http://www.coe.int/t/congress/newssearc ... 31/12/2029 took 8 minutes 8 seconds
7415 pages fetched (771,160 bytes) Total
6587 errors Total
0 duplicate pages Total

Creating search index on fetched pages...Done.
Creating spell-checker dictionaries...Done.
2009-07-10 14:04:38 0 Extra Indexes done
Done.
Verifying usability of new walk.
Abandoning new walk. Cannot generate test query: No usable terms in index xh_TiDsKyMtBy_ViMoDpPp.
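
One thing the report above already hints at is how little text survives per page. The following is a minimal sketch, not anything the appliance provides: it takes the "pages fetched" lines copied from the report (URLs left out) and divides bytes by pages. The function name `bytes_per_page` is made up for this illustration.

```python
import re

# "pages fetched" lines copied from the walk report above (URLs omitted).
REPORT = """\
1619 pages fetched (168,376 bytes)
1980 pages fetched (252,688 bytes)
1971 pages fetched (204,984 bytes)
1845 pages fetched (193,400 bytes)
7415 pages fetched (771,160 bytes) Total
"""

# Captures the page count and the comma-grouped byte count on each line.
PATTERN = re.compile(r"^(\d+) pages fetched \(([\d,]+) bytes\)", re.MULTILINE)


def bytes_per_page(report):
    """Return (pages, bytes, average bytes per page) for each report line."""
    rows = []
    for pages, size in PATTERN.findall(report):
        n = int(pages)
        b = int(size.replace(",", ""))
        rows.append((n, b, b / n))
    return rows


for n, b, avg in bytes_per_page(REPORT):
    print(f"{n:5d} pages, {b:8,d} bytes, {avg:6.1f} bytes/page")
```

The totals line works out to 104 bytes per page, which is very little for a news page. That is consistent with most of each page being discarded before indexing, although the number alone does not say which setting is responsible.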
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

I was able to crawl the pages just fine, using the settings you specified.

If you create a new profile, set that as the Base URL, and add .asp to extensions, does it get content?
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It looks like your Keep Tags are stripping all of the content except for some metadata, so there's nothing to index.
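
To illustrate the idea, here is a toy model in Python. It is only a sketch under loud assumptions: it treats a keep rule as a pair of literal begin/end strings, which may not be how the appliance actually matches them, and the sample page, `keep_between`, and `indexable_words` are all hypothetical. The word pattern is a rough Python stand-in for the profile's Word Definition, not the appliance's own expression syntax.

```python
import re

# Rough Python stand-in for the profile's Word Definition
# [\alnum\'\x80-\xff]{1,70}; the appliance's own syntax differs.
WORD = re.compile(r"[A-Za-z0-9'\x80-\xff]{1,70}")


def keep_between(html, begin, end):
    """Toy model: keep only text between literal begin/end markers."""
    kept = []
    pos = 0
    while True:
        start = html.find(begin, pos)
        if start == -1:
            break
        stop = html.find(end, start + len(begin))
        if stop == -1:
            break
        kept.append(html[start + len(begin):stop])
        pos = stop + len(end)
    return " ".join(kept)


def indexable_words(html, keep_rules):
    """Apply every keep rule, strip leftover markup, and split into words."""
    text = " ".join(keep_between(html, b, e) for b, e in keep_rules)
    text = re.sub(r"<[^>]*>", " ", text)  # drop tags left inside kept text
    return WORD.findall(text)


# Hypothetical page: the div carries an extra attribute, so the literal
# begin marker from the profile never appears in it.
PAGE = ('<html><body><div class="news" id="data">'
        'Congress news item</div></body></html>')

PROFILE_RULE = [('<div id="data"', '</div id="data"')]
WIDER_RULE = [('<body>', '</body>')]

print(indexable_words(PAGE, PROFILE_RULE))  # no marker match, nothing kept
print(indexable_words(PAGE, WIDER_RULE))    # body text survives
```

In this model, any page where the markers do not occur in exactly the expected form contributes no words at all, so the index ends up with no usable terms even though the pages were fetched successfully.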
michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Post by michel.weber »

Thanks, mark, for spotting that.