Keywords are not encoded properly

Bastian
Posts: 10
Joined: Wed Nov 21, 2007 10:16 am

Post by Bastian »

Hi there,
I'm new to Webinator and have run into a seriously strange problem.

The special characters in French (êéèà) or German (öäü) are encoded without problems in the body. But in the keywords the special characters are stored as strange symbols like ê or é.

After combing through the dowalk script I ended up here, and I hope that someone has an idea or a solution for this problem.

We are running Webinator 5.1.50 on Windows, and the charsets are set to ISO-8859-1 for both the storage charset and the default source charset.

Thanks in advance

bastian.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

Where are you seeing these improper keyword values (list/edit URLs, match info, etc)?

Is there a public page that is experiencing this problem that we could try? Or if it's not publicly accessible, would you be able to provide one to us somehow?

The default storage charset is UTF-8, so highbit characters (like é) _should_ be stored as a two-byte sequence; it's likely that the keywords simply aren't being decoded properly at some point, while the body is.
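
As a quick illustration of the two-byte point, here is a small Python sketch. Python is used only for illustration and is not part of Webinator or the dowalk script. It shows that é becomes two bytes in UTF-8 but one byte in ISO-8859-1, and what happens if the UTF-8 bytes are read as ISO-8859-1 without being decoded first.

```python
# Illustration only: how a highbit character is stored in each charset.
text = "\u00e9"  # the character é

utf8_bytes = text.encode("utf-8")         # two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")  # one byte: E9

# Reading the UTF-8 bytes as if they were ISO-8859-1 turns the single
# character into two unrelated characters.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex(), latin1_bytes.hex(), len(misread))
```

If keywords show up as two odd characters where one accented character is expected, this kind of missed decoding step is a likely cause.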

Post by Bastian »

Thanks for your reply.

Ah, pardon, I forgot. The improper values appear in the list/edit URLs and in the Match Info.

Searching for words with special characters only brings up results that have those characters in the body, and none that have them only in the keywords.

You can access an example page here:
http://www.vol.be.ch/site/fr/home/lanat ... nu_februar

The keywords of this page contain the word "pêche", but the body does not. When I search for this word, the page doesn't show up in the search results.

Here is the direct link to the search page, to show you how the engine behaves:
http://195.141.109.133/scripts/texis.ex ... mit=Submit

I hope this information will help a bit.

So I also think that the highbit characters are handled differently in the body than in the keywords, but I didn't stumble over any special handling while trying to understand the dowalk script.

Post by jason112 »

When crawling the page you mentioned, I can see that highbit characters are not encoded properly for the keywords. This only happens when the storage charset is ISO-8859-1, and is ok when it's UTF-8. I'll look into this.

Highlighting of géantes in the results works fine when I try it.
> We are running Webinator 5.1.50 on Windows

We always recommend updating to the latest Webinator scripts (5.1.65) as a general tip, which can be grabbed from:
http://www.thunderstone.com/texis/site/ ... ripts.html

Post by Bastian »

That's the point: I can't set the storage charset to UTF-8, because the output is then rendered in UTF-8 too. And that would have side effects on special characters as well.

I tested the 5.1.65 scripts (locally on our development server) and they behave the same way.

So there should be a way to have everything set to ISO-8859-1 and still get clean results.

I personally think the mistake is made in the <a name=findmeta name> section of the dowalk script. When I manually insert some special characters, e.g. <$data="äüöüéàè">, just before <urlinfo header $name> starts, they are handled correctly. So I think the <urlinfo header $foo> script command doesn't work correctly with special characters.

Post by jason112 »

> when i manually insert some special chars e.g.
> <$data="äüöüéàè"> just before <urlinfo header $name>
> starts. They will be handled correctly. So i think the
> <urlinfo header $foo> script-command doesn't work
> correct with special chars.

Somewhat; it looks like the problem is that the <urlinfo> functions in <findmeta> always return UTF-8, and that needs to be converted to the storage charset.

A call to <charsetconv from="UTF-8"> after collecting the data fixes the problem. A script update should be available later today or tomorrow that provides this fix.
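
To make the idea behind that fix concrete, here is a rough Python sketch of the conversion step. It is not the Vortex code from the script update: the function name `to_storage_charset` and the sample keyword value are made up for illustration. It only assumes what is described above, namely that the collected data arrives as UTF-8 and that the storage charset is ISO-8859-1.

```python
def to_storage_charset(data: bytes, storage_charset: str = "iso-8859-1") -> bytes:
    """Convert UTF-8 bytes to the storage charset (hypothetical helper)."""
    text = data.decode("utf-8")
    # Characters that do not exist in the storage charset become "?".
    return text.encode(storage_charset, errors="replace")


# Hypothetical keyword value ("pêche") as it would arrive in UTF-8.
keywords_utf8 = "p\u00eache".encode("utf-8")
stored = to_storage_charset(keywords_utf8)

# A search term encoded in the storage charset only matches the
# keyword bytes after they have been converted.
query = "p\u00eache".encode("iso-8859-1")
print(query in keywords_utf8, query in stored)
```

This also suggests why the search for "pêche" came up empty before the fix: at the byte level, the unconverted keyword simply does not contain the ISO-8859-1 form of the word.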