French accented character problem

Posted: Mon Jan 08, 2007 1:25 pm
by edev
Hi,

I crawl pages in French and noticed a problem displaying accented French characters. When accented characters are entered directly instead of as HTML entities, for example é instead of "&eacute;", they display incorrectly.

One page has a French title that is not HTML-entity encoded, like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<!-- START HEAD -->
<head>
<title>Le Comité consultatif de rédaction</title>
...

And the search result comes as

Le Comit&#65533; consultatif de r&#65533;daction

for the title field.

All pages are UTF-8 encoded.

The titles of these pages are defined in XML, so we cannot use HTML entity encoding for French accented characters there, whereas the rest of the website uses HTML entity encoding for all accented characters. Is there a way to avoid this?

Thank you in advance!

Posted: Mon Jan 08, 2007 1:27 pm
by edev
I just posted the message and noticed that the forum rendered the search result as a character reference. On the actual search page you see a question mark inside a square, not "&#65533;".

Posted: Mon Jan 08, 2007 5:57 pm
by Kai
Is there an example public URL where this search can be seen (with an example query)? I'm assuming a Content-Type of text/html is being returned by the web server for those XML pages, so that Webinator can crawl them.

Posted: Mon Jan 08, 2007 9:53 pm
by edev
Hi Kai,

Yes that's correct...the webpage is rendered by an application so Webinator crawls the content of the page. The URL to see the search result on the Webinator screen is:
http://205.193.6.125/texis.exe/webinato ... mit=Submit

and our production site which integrates webinator is:
http://www.culture.ca/process-search-e. ... 45a25dcbb3

Thank you Kai!

Posted: Wed Jan 10, 2007 11:55 am
by Kai
The second result in that example declares its charset to be UTF-8, but the content is actually ISO-8859-1; hence the title `Le Comit...' has an incorrect last character. Either the content or the Content-Type header must be changed to conform to the other, and then the site must be rewalked.
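
The mismatch described above can be reproduced outside Webinator. The following is only a minimal Python sketch, not what Webinator does internally: it assumes the title bytes were written as ISO-8859-1 and then read by something that trusts the UTF-8 declaration. The helper name `decodes_as` is made up for this illustration.

```python
title = "Le Comité consultatif de rédaction"

# Bytes as an ISO-8859-1 page would contain them: each "é" is the single byte 0xE9.
latin1_bytes = title.encode("iso-8859-1")

# Reading those bytes as UTF-8: a lone 0xE9 is not a valid UTF-8 sequence, so with
# errors="replace" each one becomes U+FFFD, REPLACEMENT CHARACTER (decimal 65533).
misread = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the charset the bytes are really in recovers the original title.
recovered = latin1_bytes.decode("iso-8859-1")
assert recovered == title


def decodes_as(data: bytes, charset: str) -> bool:
    """Hypothetical helper: True if data is a valid byte sequence in charset."""
    try:
        data.decode(charset)  # strict decoding raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False


print(ascii(misread))                     # shows the \ufffd escapes explicitly
print(decodes_as(latin1_bytes, "utf-8"))  # the declared charset does not fit the bytes
```

U+FFFD is the character behind both the "&#65533;" in the forum post and the question mark inside a square on the search page. A strict-decoding check like `decodes_as` is only useful in one direction: every byte sequence decodes without error as ISO-8859-1, so it can show that content is not valid UTF-8 but cannot confirm that it is ISO-8859-1.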

Posted: Wed Jan 10, 2007 2:54 pm
by edev
That's what the problem is! Thank you!