encoding problem

wanah
Posts: 15
Joined: Mon Jul 30, 2007 2:29 am

encoding problem

Post by wanah »

Hi,

When I ran a search on our website, I noticed that some characters are not displayed correctly.

Example:

1: MiWorld - Send A Message - MMS Services - MMS Phone Models
V690, Motorola V80, Motorola V878, Motorola V3, Motorola A668, Motorola E680 • Nokia - Nokia 3100, Nokia 3200, Nokia 3300, Nokia 3530, Nokia 3650, Nokia 3660, ...
http://www.miworld.com.sg/message/miwor ... smodel.jsp

I checked the encoding and nothing seems abnormal.

Any idea? Thanks!
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

encoding problem

Post by jason112 »

8226 is the "dot" (bullet) manually placed on the left-hand side of each paragraph (rather than using normal <li> elements).

It looks like Webinator is not interpreting this character properly for storage. We'll look into it.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

encoding problem

Post by mark »

That page claims to be ISO-8859-1, but &#8226; is not a valid character in that set.
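
To illustrate the point, here is a minimal sketch in Python (chosen only for illustration; it has nothing to do with how Webinator works internally). It checks whether the character that &#8226; refers to, U+2022, can be encoded in ISO-8859-1 at all. The helper name fits_in_charset is made up for this example.

```python
# U+2022 (decimal 8226) is the BULLET character that &#8226; refers to.
bullet = chr(8226)


def fits_in_charset(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


in_latin1 = fits_in_charset(bullet, "iso-8859-1")
in_utf8 = fits_in_charset(bullet, "utf-8")
print("ISO-8859-1:", in_latin1)
print("UTF-8:", in_utf8)
```

The ISO-8859-1 check fails because that set only covers code points up to 255, so a page in that charset can only carry the bullet as a character reference.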
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

encoding problem

Post by jason112 »

It looks like it's still a problem. http://eureka/test/encoding.html is that page with UTF-8 declared in a meta tag; the dots are still stored as <B7> in the database, which is invalid UTF-8, for both the ISO-8859-1 and UTF-8 versions. See the "jasonBoard" profile on crown.

Oddly enough, if I do <geturl> on the original URL, it shows a proper UTF-8 byte sequence in the text (<E2><80><A2>). Any ideas?
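
For reference, the two byte sequences mentioned above can be compared with a short Python sketch. This is only an illustration of the encodings involved, not of Webinator's storage code, and is_valid_utf8 is a hypothetical helper name.

```python
bullet = "\u2022"  # the character that &#8226; refers to

# The proper UTF-8 encoding of the bullet is three bytes: E2 80 A2.
utf8_bytes = bullet.encode("utf-8")

# The single byte reported as stored in the database.
stored = b"\xb7"


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex())
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(stored))

# Read as ISO-8859-1, the same byte is U+00B7 MIDDLE DOT,
# which is a different character from the bullet.
as_latin1 = stored.decode("iso-8859-1")
print(as_latin1 == "\u00b7")
```

A lone B7 is a UTF-8 continuation byte with no lead byte, which is why it is invalid on its own.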
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

encoding problem

Post by jason112 »

We've confirmed there's a problem with 8226 references, and will be releasing a scripts update that fixes it soon.

The problem is specific to that individual character; are there any other problems you're observing?
wanah
Posts: 15
Joined: Mon Jul 30, 2007 2:29 am

encoding problem

Post by wanah »

OK, when I view the source from the browser, &#8226; has actually become &amp;#8226;, and that's why it is displayed as the literal text &#8226; in the browser. This may be because Webinator is saving the website's content into the database incorrectly while crawling.

I think all non-ASCII characters will have this problem. Hope this helps.
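
The kind of double escaping described above can be reproduced with Python's standard html module. This is only a sketch of the symptom, under the assumption that the reference's ampersand gets escaped a second time somewhere along the way; it does not show where in the crawl or display path that would happen.

```python
import html

original = "&#8226;"  # what the page author wrote

# Escaping the reference a second time turns its "&" into "&amp;".
double_escaped = html.escape(original)

# A browser decodes only one level, so the reader sees the literal text.
shown = html.unescape(double_escaped)

# Decoding the original reference directly gives the intended bullet.
intended = html.unescape(original)

print(double_escaped)
print(shown)
print(intended == "\u2022")
```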
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

encoding problem

Post by jason112 »

That's different from the behavior we're seeing. If you go to the list/edit URLs page for that URL, what does the body of the page look like? I.e., does it show &#8226; on the page, or some invalid character?
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

encoding problem

Post by jason112 »

Also, what are the storage charset, source default charset (both from All Walk Settings) and display charset (from Search Settings) set to?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

encoding problem

Post by Kai »

FYI, HTML entities (&...;) are always Unicode, regardless of Content-Type charset, because the official HTML charset is always Unicode; Content-Type charset is merely the "transfer encoding" for raw chars/non-entities.

Of course, sometimes authors nonetheless use entities that refer to the Content-Type charset instead of Unicode (I've wanted a <fetch> option to handle that), but that is incorrect.
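
As a rough illustration of both cases, here is a sketch using Python's standard html module (not Texis/Vortex, and not the <fetch> option wished for above). The &#149; example is an assumption about a typical misuse: 149 is the bullet's byte value in Windows-1252, whereas the Unicode code point 149 is a control character.

```python
import html

# A numeric reference names a Unicode code point; the page's
# Content-Type charset plays no part in decoding it.
correct = html.unescape("&#8226;")

# Misused reference: the author meant byte 0x95 of Windows-1252.
strict = chr(149)                       # literal reading: U+0095, a C1 control
lenient = html.unescape("&#149;")       # Python applies the HTML5 remapping
cp1252_bullet = bytes([149]).decode("windows-1252")

print(correct == "\u2022")
print(strict == "\u0095")
print(lenient == cp1252_bullet)
```

Python's html.unescape remaps references in the 128-159 range the way HTML5 parsers do, which is one way of tolerating the incorrect usage.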