Keywords are not encoded properly
Posted: Wed Nov 21, 2007 10:44 am
by Bastian
Hi there,
I'm new to Webinator and have run into a seriously strange problem.
The special characters in French (êéèà) or German (öäü) are encoded without problems in the body. But in the keywords, the special characters are stored as strange symbols like ê or é.
After combing through the dowalk script, I ended up here and hope that someone has an idea or a solution for this problem.
We are running Webinator 5.1.50 on Windows, and the charsets are set to ISO-8859-1 for both the storage charset and the default source charset.
Thanks in advance
bastian.
Posted: Wed Nov 21, 2007 10:56 am
by jason112
Where are you seeing these improper keyword values (list/edit URLs, match info, etc)?
Is there a public page that is experiencing this problem that we could try? Or if it's not publicly accessible, would you be able to provide one to us somehow?
The default storage charset is UTF-8, so highbit characters (like é) _should_ be stored as a two-byte sequence; it's likely that the keywords simply aren't being decoded properly at some point, while the body is.
Posted: Wed Nov 21, 2007 11:41 am
by Bastian
Thanks for your reply.
Ah, sorry, I forgot. The improper values appear in the list/edit URLs and in the Match Info.
Searches for the special characters only return results that have those characters in the body, and none that have them in the keywords.
You can access an example page here:
http://www.vol.be.ch/site/fr/home/lanat ... nu_februar
This page has the word "pêche" in its keywords but not in the body. When I search for this word, the page doesn't show up in the search results.
Here is the direct link to the search page, to show you how the engine behaves:
http://195.141.109.133/scripts/texis.ex ... mit=Submit
I hope this information will help a bit.
So I also think that the highbit characters are handled differently in the keywords than in the body, but I didn't come across any special handling while trying to understand the dowalk script.
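One plausible reading of the missed search for "pêche": if the keyword was stored as unconverted UTF-8 bytes while the query is interpreted in ISO-8859-1, the two byte sequences never match. A rough Python sketch of that assumption (not how Webinator's search is actually implemented; `keyword_matches` is a hypothetical helper):

```python
def keyword_matches(stored: bytes, query: str, charset: str = "iso-8859-1") -> bool:
    """Naive byte-level substring match of a query against stored keyword bytes."""
    return query.encode(charset) in stored


if __name__ == "__main__":
    # Assumed situation: keyword bytes left in UTF-8 in an ISO-8859-1 store.
    stored_wrong = "pêche".encode("utf-8")
    # What the store would hold if the keyword had been converted.
    stored_right = "pêche".encode("iso-8859-1")

    print(keyword_matches(stored_wrong, "pêche"))
    print(keyword_matches(stored_right, "pêche"))
```

A query without highbit characters would match either way, which fits the observation that only the special characters cause trouble.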
Posted: Wed Nov 21, 2007 12:44 pm
by jason112
When crawling the page you mentioned, I can see that highbit characters are not encoded properly for the keywords. This only happens when the storage charset is ISO-8859-1, and is ok when it's UTF-8. I'll look into this.
Highlighting of géantes in results works fine when I try it.
> We are running Webinator 5.1.50 on Windows
We always recommend updating to the latest Webinator scripts (5.1.65) as a general tip, which can be grabbed from:
http://www.thunderstone.com/texis/site/ ... ripts.html
Posted: Fri Nov 23, 2007 5:42 am
by Bastian
That's the point: I can't set the storage charset to UTF-8, because the output is then rendered in UTF-8 too, and that would have side effects on special characters as well.
I tested the 5.1.65 scripts (locally on our development server) and they behave the same way.
So there should be a way to have everything set to ISO-8859-1 and still get clean results.
I personally think the mistake is made in the <a name=findmeta name> section of the dowalk script. When I manually insert some special characters, e.g. <$data="äüöüéàè">, just before <urlinfo header $name> starts, they are handled correctly. So I think the <urlinfo header $foo> script command doesn't work correctly with special characters.
Posted: Mon Nov 26, 2007 12:23 pm
by jason112
> when i manually insert some special chars e.g.
> <$data="äüöüéàè"> just before <urlinfo header $name>
> starts. They will be handled correctly. So i think the
> <urlinfo header $foo> script-command doesn't work
> correct with special chars.
Somewhat; it looks like the problem is that the <urlinfo> functions in <findmeta> always return UTF-8, and that needs to be converted to the storage charset.
A call to <charsetconv from="UTF-8"> after collecting the data fixes the problem. A script update should be available later today or tomorrow that provides this fix.
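As a rough illustration of the idea behind that fix, here is a Python analogue of converting a value that always arrives as UTF-8 into the configured storage charset. This is not the Vortex script code from the update; `to_storage_charset` and its parameters are hypothetical names chosen for the sketch:

```python
def to_storage_charset(utf8_value: bytes, storage_charset: str = "iso-8859-1") -> bytes:
    """Convert bytes that are known to be UTF-8 into the storage charset."""
    text = utf8_value.decode("utf-8")
    # Characters the target charset cannot represent are replaced with "?".
    return text.encode(storage_charset, errors="replace")


if __name__ == "__main__":
    meta_value = "pêche".encode("utf-8")             # as returned, always UTF-8
    print(to_storage_charset(meta_value))            # single byte for ê
    print(to_storage_charset(meta_value, "utf-8"))   # unchanged for a UTF-8 store
```

When the storage charset is already UTF-8 the conversion is a no-op, which is consistent with the problem only appearing for ISO-8859-1 storage.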
Posted: Thu Dec 06, 2007 3:02 pm
by jason112
The scripts on our website have been updated to contain
this fix; they can be downloaded from here:
http://www.thunderstone.com/texis/site/ ... ripts.html
There's also a similar fix applied in the <collectallmeta>
function, which is used by the "All Meta" setting.