Cannot completely convert US-ASCII to UTF-8

kliu
Posts: 2
Joined: Wed Jun 25, 2008 7:12 pm

Post by kliu »

Hi there,

I ran into the following problem during a walk. Do you know why it happens and how I can fix it?

started 1 new (25780) on http://lists.helixcommunity.org/pipermail/audio-cvs/

000 /usr/local/morph3/texis/scripts/dowalk(doprimer) 281: Cannot completely convert US-ASCII to UTF-8: Out-of-range character sequence at source offset 55802 in the function htusascii_to_utf8
0 pages fetched (56,038 bytes) from http://lists.helixcommunity.org/pipermail/audio-cvs/
1 errors
0 duplicate pages
No pages fetched. Search not updated.


The link : http://lists.helixcommunity.org/pipermail/audio-cvs/
Had this error: Cannot completely convert US-ASCII to UTF-8: Out-of-range character sequence at source offset 55802



Thanks,

Kinson
John
Site Admin
Posts: 2597
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

The main issue is that the page declares itself to be US-ASCII but has a UTF-8 encoded copyright symbol at the bottom, so the page should really declare itself as UTF-8. However, the crawl should still continue, and a test I ran here using that address as the base URL fetched many more pages before I stopped it.
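
To illustrate the mismatch described above, here is a minimal Python sketch (not the appliance's own code). It strictly decodes a byte string as US-ASCII and reports the offset of the first byte that falls outside the ASCII range, then shows that the same bytes decode cleanly as UTF-8. The function name `first_non_ascii_offset` and the sample page body are hypothetical, made up for this example.

```python
from typing import Optional


def first_non_ascii_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first non-US-ASCII byte, or None if all bytes are ASCII."""
    try:
        data.decode("ascii")  # strict decoding: any byte >= 0x80 raises
    except UnicodeDecodeError as err:
        return err.start  # offset of the first offending byte
    return None


# Hypothetical page body: plain ASCII markup followed by a UTF-8 encoded
# copyright symbol (U+00A9, which is the two bytes 0xC2 0xA9 in UTF-8).
prefix = b"<html><body>Archive index "
page = prefix + "\u00a9".encode("utf-8") + b" 2008</body></html>"

offset = first_non_ascii_offset(page)
print("first non-ASCII byte at offset:", offset)

# The same bytes are valid UTF-8, so declaring the page as UTF-8 avoids the error.
print(page.decode("utf-8"))
```

A check like this can be run against a saved copy of the page to confirm which bytes contradict the declared charset.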

Which version of the appliance software are you using?
John Turnbull
Thunderstone Software
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

Are you setting anything that would limit the walk to just that page?

The page has <META NAME="robots" CONTENT="noindex,follow">, so the crawl will not store content for the page regardless of the charset error.

But it should get links from that page (and does in our tests).
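
As a rough illustration of what the robots meta tag above asks a crawler to do, the following Python sketch reads the directives with the standard library's `html.parser`. It reflects the conventional meaning of `noindex` and `follow`, not the appliance's internal logic; `RobotsMetaParser` and `robots_policy` are hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_policy(html: str) -> dict:
    """Return whether the page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


print(robots_policy('<META NAME="robots" CONTENT="noindex,follow">'))
```

With `noindex,follow`, the page content is not stored, but the links on the page remain eligible for crawling.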
kliu
Posts: 2
Joined: Wed Jun 25, 2008 7:12 pm

Post by kliu »

John,

Here is the version info. Let me know if it's not what you wanted:

Version: Search Appliance Server Version 5.01.1123091539 20050803 (i686-unknown-linux2.4.9-64-32)
Scripts Version: 5.4.15
Details: dowalk: 5.4.15/2.364 dowalk: 5.4.15/2.294 appliance: 5.4.15/1.106 search: 5.4.15/2.206

Jason112,

"Are you setting anything that would limit the walk to just that page?"

No, and it crawls the main site and the other subdomains fine.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

I'd definitely recommend updating to the latest packages; there have been 63 script updates alone, fixing many issues and adding many features since 5.4.15.