French accented character problem

edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Hi,

I crawl pages in French and noticed a problem displaying accented French characters. When accented characters are entered directly instead of as HTML entities, for example é instead of "&eacute;", they display incorrectly in the search results.

One page has a French title that is not HTML entity encoded, like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
<!-- START HEAD -->
<head>
<title>Le Comité consultatif de rédaction</title>
...

And the search result comes as

Le Comit&#65533; consultatif de r&#65533;daction

for the title field.

All pages are UTF-8 encoded.

Because the titles of these pages are defined in XML, we cannot use HTML entity encoding for the accented characters in them, whereas the rest of the website uses HTML entity encoding for all accented characters. Is there a way to avoid this problem?

Thank you in advance!
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

I just posted the message and noticed that the forum translated the search result into an entity encoding. On the actual search page you see a question mark inside a square, not "&#65533;".
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Is there an example public URL where this search can be seen (with an example query)? I'm assuming a Content-Type of text/html is being returned by the web server for those XML pages, so that Webinator can crawl them.
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

The second result in that example declares its charset to be UTF-8, but the content is actually ISO-8859-1; hence the title `Le Comit...' has an incorrect last character. Either the content or the Content-Type header must be changed to conform to the other, and the site rewalked.
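
As a rough illustration of that mismatch, here is a minimal Python sketch. It is not part of Webinator, and the helper names `looks_like_utf8` and `latin1_to_utf8` are hypothetical. It assumes the title bytes were stored as ISO-8859-1 while being declared as UTF-8. Decoding them as UTF-8 turns each é into U+FFFD (decimal 65533), the replacement character seen in the search result.

```python
# Sketch only: helper names are hypothetical, not a Webinator API.

def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 so they match a UTF-8 declaration."""
    return raw.decode("iso-8859-1").encode("utf-8")


title = "Le Comité consultatif de rédaction"

# Assumption: the page's bytes were ISO-8859-1, although it declared UTF-8.
as_latin1 = title.encode("iso-8859-1")
as_utf8 = title.encode("utf-8")

# A consumer that trusts the UTF-8 declaration replaces each invalid byte
# with U+FFFD, the replacement character.
mis_decoded = as_latin1.decode("utf-8", errors="replace")

print(looks_like_utf8(as_latin1))   # ISO-8859-1 bytes are not valid UTF-8 here
print(looks_like_utf8(as_utf8))     # genuine UTF-8 bytes decode cleanly
print(mis_decoded)
print(latin1_to_utf8(as_latin1) == as_utf8)
```

Running a check like `looks_like_utf8` on the raw bytes of a page is one way to confirm which side is wrong before either re-encoding the content or correcting the declared charset.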
edev
Posts: 127
Joined: Wed Sep 14, 2005 5:10 pm

Post by edev »

That's what the problem is! Thank you!