encoding problem

Posted: Thu Oct 11, 2007 5:40 am
by wanah
Hi,

When I performed a search on our website, I noticed that some characters are not displayed correctly.

Example:

1: MiWorld - Send A Message - MMS Services - MMS Phone Models
V690, Motorola V80, Motorola V878, Motorola V3, Motorola A668, Motorola E680 • Nokia - Nokia 3100, Nokia 3200, Nokia 3300, Nokia 3530, Nokia 3650, Nokia 3660, ...
http://www.miworld.com.sg/message/miwor ... smodel.jsp

I have checked the encoding, and nothing seems abnormal.

Any ideas? Thanks!

Posted: Thu Oct 11, 2007 10:46 am
by jason112
8226 is the "dot" manually placed on the left-hand side of each paragraph (rather than using normal <li> elements).

It looks like Webinator is not interpreting this character properly for storage. We'll look into it.

Posted: Thu Oct 11, 2007 11:18 am
by mark
That page claims to be ISO-8859-1, but &#8226; is not a valid character in that set.

Posted: Thu Oct 11, 2007 11:43 am
by jason112
It looks like it's still a problem. http://eureka/test/encoding.html is that page with UTF-8 declared in a meta tag; the dots are still stored as <B7> in the database, which is invalid UTF-8, for both the ISO-8859-1 and UTF-8 versions. See the "jasonBoard" profile on crown.

Oddly enough, if I do <geturl> on the original URL, it shows a proper UTF-8 byte sequence in the text (<E2><80><A2>). Any ideas?
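
To make the byte values above concrete, here is a minimal Python sketch (not Webinator or Vortex code) that checks them. The helper names `is_valid_utf8` and `fits_latin1` are made up for illustration.

```python
# U+2022 BULLET is the code point that the reference &#8226; names.
bullet = chr(8226)

# Its UTF-8 encoding is the three-byte sequence E2 80 A2.
utf8_bytes = bullet.encode("utf-8")
print(utf8_bytes.hex())  # e280a2


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def fits_latin1(text: str) -> bool:
    """Return True if the text can be encoded as ISO-8859-1 (illustrative helper)."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# A lone 0xB7 byte is a UTF-8 continuation byte with no lead byte, so it is invalid UTF-8.
stored = b"\xb7"
print(is_valid_utf8(stored))  # False

# Read as ISO-8859-1, the same byte is U+00B7 MIDDLE DOT, a different character.
middle_dot = stored.decode("iso-8859-1")

# ISO-8859-1 has no slot for U+2022, so the bullet cannot be stored in that charset as a raw byte.
print(fits_latin1(bullet))  # False
```

This only shows why <B7> is suspicious as stored UTF-8; it does not say where in the crawl the substitution happens.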

Posted: Thu Oct 11, 2007 1:11 pm
by jason112
We've confirmed there's a problem with 8226 references, and will be releasing a scripts update that fixes it soon.

The problem is specific to that individual character; are there any other problems you're observing?

Posted: Thu Oct 11, 2007 8:40 pm
by wanah
OK, when I view the source from the browser, &#8226; has actually become &amp;#8226;, and that's why it is displayed as the literal text &#8226; in the browser. This may be because Webinator saves the website's content into the database incorrectly when crawling.

I think all non-ASCII characters will face this problem. Hope this helps.
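
One way a numeric reference can end up displayed literally is double escaping: text that still contains the reference gets HTML-escaped a second time on output. The following Python sketch illustrates that mechanism only; it assumes the stored text keeps the literal reference, which has not been confirmed for Webinator, and the sample string is made up.

```python
import html

# Assumption: the crawler stored the reference itself rather than the decoded character.
stored = "Motorola E680 &#8226; Nokia"

# Escaping on output turns the leading "&" into "&amp;" ...
escaped_again = html.escape(stored)
print(escaped_again)  # Motorola E680 &amp;#8226; Nokia

# ... so a browser, which decodes only once, shows the reference as literal text.
rendered = html.unescape(escaped_again)
print(rendered)  # Motorola E680 &#8226; Nokia

# Decoding the reference once before storage and escaping once on output avoids this.
decoded_first = html.unescape(stored)
safe_output = html.escape(decoded_first)
```

Here `safe_output` contains the actual bullet character, which the page's output charset then has to be able to represent.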

Posted: Fri Oct 12, 2007 8:32 am
by jason112
That's different from the behavior we're seeing. If you go to the list/edit URLs page for that URL, what does the body of the page look like? I.e., does it show &#8226; on the page, or some invalid character?

Posted: Fri Oct 12, 2007 9:45 am
by jason112
Also, what are the storage charset, source default charset (both from All Walk Settings) and display charset (from Search Settings) set to?

Posted: Fri Oct 12, 2007 9:54 am
by Kai
FYI, HTML entities (&...;) are always Unicode, regardless of Content-Type charset, because the official HTML charset is always Unicode; Content-Type charset is merely the "transfer encoding" for raw chars/non-entities.

Of course, sometimes authors nonetheless use entities that refer to the Content-Type charset instead of Unicode (I've wanted a <fetch> option to handle that), but that is incorrect.
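
As a rough illustration of both points, this Python sketch decodes a numeric reference as a Unicode code point and then shows the incorrect-but-common case of a reference that was meant as a byte in the page's charset. The helper `charref_via_charset` is hypothetical (it is not an existing <fetch> option), and Windows-1252 is only an example charset.

```python
import html

# A numeric reference names a Unicode code point, whatever charset the page declares.
assert html.unescape("&#8226;") == "\u2022"  # U+2022 BULLET


def charref_via_charset(code: int, charset: str) -> str:
    """Hypothetical helper: read a numeric reference below 256 as a byte in `charset`.

    Falls back to the plain Unicode code point if the byte is undefined there.
    """
    if code < 256:
        try:
            return bytes([code]).decode(charset)
        except UnicodeDecodeError:
            pass
    return chr(code)


# Taken strictly as Unicode, 149 is U+0095, a C1 control character.
strict = chr(149)

# An author thinking in Windows-1252 meant byte 0x95, which is the bullet.
lenient = charref_via_charset(149, "cp1252")
print(lenient == "\u2022")  # True

# Python's html.unescape follows the HTML5 parsing rules, which remap this range the same way.
print(html.unescape("&#149;") == "\u2022")  # True
```

Whether a crawler should apply such a remapping is a policy choice; the sketch only shows what the two interpretations produce.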