encoding problem

Post by **wanah** » Thu Oct 11, 2007 5:40 am

Hi,

When I performed a search on our website, I noticed that some characters are not displayed correctly.

Example:

1: MiWorld - Send A Message - MMS Services - MMS Phone Models
V690, Motorola V80, Motorola V878, Motorola V3, Motorola A668, Motorola E680 &#8226; Nokia - Nokia 3100, Nokia 3200, Nokia 3300, Nokia 3530, Nokia 3650, Nokia 3660, ...
http://www.miworld.com.sg/message/miwor ... smodel.jsp

I checked the encoding and nothing seems abnormal.

Any idea? Thanks!

Post by **jason112** » Thu Oct 11, 2007 10:46 am

8226 is the "dot" manually placed on the left hand side of each paragraph (rather than using normal <li> elements).

It looks like Webinator is not interpreting this character properly for storage; we'll look into it.

Post by **mark** » Thu Oct 11, 2007 11:18 am

That page claims to be ISO8859-1 but • is not a valid character in that set.

Post by **jason112** » Thu Oct 11, 2007 11:43 am

It looks like it's still a problem. http://eureka/test/encoding.html is that page with UTF-8 declared in the meta tag; the dots are still stored as <B7> in the database for both the ISO-8859-1 and UTF-8 versions, and a lone <B7> byte is invalid UTF-8. See the "jasonBoard" profile on crown.

Oddly enough, if I do <geturl> on the original URL, it shows a proper UTF-8 byte sequence in the text (<E2><80><A2>). Any ideas?
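To make the byte values above concrete, here is a minimal Python sketch (not Webinator or Vortex code; the helper name `is_valid_utf8` is made up for illustration). It shows that the bullet U+2022 encodes to <E2><80><A2> in UTF-8, and that a lone <B7> byte is not valid UTF-8, although it is the middle dot in ISO-8859-1.

```python
import unicodedata

bullet = "\u2022"  # BULLET, the character that &#8226; refers to

# UTF-8 encoding of the bullet: three bytes, E2 80 A2
utf8_bytes = bullet.encode("utf-8")
print(utf8_bytes.hex())


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\xe2\x80\xa2"))  # the sequence <geturl> showed
print(is_valid_utf8(b"\xb7"))          # the byte stored in the database

# Interpreted as ISO-8859-1, the same single byte is a different character
middle_dot = b"\xb7".decode("iso-8859-1")
print(unicodedata.name(middle_dot))
```

A lone 0xB7 is a UTF-8 continuation byte with no lead byte before it, which is why it fails to decode.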

Post by **jason112** » Thu Oct 11, 2007 1:11 pm

We've confirmed there's a problem with 8226 references, and will be releasing a scripts update that fixes it soon.

The problem is specific to that individual character; are there any other problems you're observing?

Post by **wanah** » Thu Oct 11, 2007 8:40 pm

OK, when I view the source in the browser, &#8226; has actually become &amp;#8226;, which is why it is displayed as the literal text &#8226; in the browser. This may be because Webinator saves the website's content into the database incorrectly while crawling.
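The effect described here can be reproduced outside Webinator. The following Python sketch is only an illustration of double escaping, under the assumption that the ampersand of the character reference was escaped a second time somewhere between crawl and display; it says nothing about where in Webinator that might happen.

```python
import html

original = "&#8226;"  # numeric character reference for the bullet, as written in the page

# Escaping text that already contains a reference escapes its ampersand
double_escaped = html.escape(original)
print(double_escaped)

# A browser decodes references once, so the reader sees the literal reference text
shown_to_user = html.unescape(double_escaped)
print(shown_to_user)

# Decoding the original reference once gives the intended character, U+2022
intended = html.unescape(original)
print(intended == "\u2022")
```

If the page source of the search results contains the double-escaped form, the storage or display step is escaping text that still contains references.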

I think all non-ASCII characters will have this problem. Hope this helps.

Post by **jason112** » Fri Oct 12, 2007 8:32 am

That's different from the behavior we're seeing. If you go to the list/edit URLs page for that URL, what does the body of the page look like? That is, does it show &#8226; on the page, or some invalid character?

Post by **jason112** » Fri Oct 12, 2007 9:45 am

Also, what are the storage charset and source default charset (both from All Walk Settings) and the display charset (from Search Settings) set to?

Post by **Kai** » Fri Oct 12, 2007 9:54 am

FYI, HTML entities (&...) are always Unicode, regardless of Content-Type charset, because the official HTML charset is always Unicode; Content-Type charset is merely the "transfer encoding" for raw chars/non-entities.

Of course, sometimes authors nonetheless use entities that refer to the Content-Type charset instead of Unicode (I've wanted a <fetch> option to handle that), but that is incorrect.
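A small Python sketch of the distinction, for illustration only: the helper names `encode_or_none` and `misused_reference` are hypothetical, and &#149; is an assumed example of a reference written against the windows-1252 charset rather than Unicode. It is not how `<fetch>` works.

```python
import html

# A numeric reference names a Unicode code point, whatever charset the page declares
bullet = html.unescape("&#8226;")
print(bullet == "\u2022")


def encode_or_none(ch: str, charset: str):
    """Raw bytes for ch in charset, or None if the charset lacks it (hypothetical helper)."""
    try:
        return ch.encode(charset)
    except UnicodeEncodeError:
        return None


# The raw ("transfer encoding") form of the same character differs per charset
for charset in ("utf-8", "cp1252", "iso-8859-1"):
    print(charset, encode_or_none(bullet, charset))


def misused_reference(number: int, charset: str) -> str:
    """Read a reference number as a byte in the page charset (the incorrect usage)."""
    return bytes([number]).decode(charset)


# An author who writes &#149; and means the windows-1252 bullet (byte 0x95)
print(misused_reference(149, "cp1252") == "\u2022")
```

ISO-8859-1 has no bullet at all, so on a page with that charset the character can only be written as a reference such as &#8226;.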