Thunderstone Support Forums

Posted: **Thu Oct 11, 2007 5:40 am**

Hi,

When I ran a search on our website, I noticed that some characters are not shown correctly.

Example:

1: MiWorld - Send A Message - MMS Services - MMS Phone Models
V690, Motorola V80, Motorola V878, Motorola V3, Motorola A668, Motorola E680 • Nokia - Nokia 3100, Nokia 3200, Nokia 3300, Nokia 3530, Nokia 3650, Nokia 3660, ...
http://www.miworld.com.sg/message/miwor ... smodel.jsp

I have checked the encoding and nothing seems abnormal.

Any idea? Thanks!

Posted: **Thu Oct 11, 2007 10:46 am**

&#8226; is the "dot" manually placed on the left-hand side of each paragraph (rather than using normal <li> elements).

It looks like Webinator is not interpreting this character properly for storage. We'll look into it.

Posted: **Thu Oct 11, 2007 11:18 am**

That page claims to be ISO-8859-1 but • is not a valid character in that set.
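
A quick way to check that claim is sketched below in Python. This is not from the thread; the helper name `encodable` is made up for illustration. It tests whether the bullet, U+2022, can be represented in a given charset:

```python
def encodable(ch: str, charset: str) -> bool:
    """Return True if `ch` can be encoded in `charset`."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


BULLET = "\u2022"  # the code point that the reference &#8226; names

print(encodable(BULLET, "iso-8859-1"))  # False
print(encodable(BULLET, "utf-8"))       # True
```

So a page that is really ISO-8859-1 cannot carry the bullet as a raw character and has to use a character reference for it.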

Posted: **Thu Oct 11, 2007 11:43 am**

It looks like it's still a problem. http://eureka/test/encoding.html is that page with UTF-8 declared in the meta tag; for both the ISO-8859-1 and UTF-8 versions, the dots are still stored in the database as <B7>, which is invalid UTF-8. See the "jasonBoard" profile on crown.

Oddly enough, if I do <geturl> on the original URL, it shows a proper UTF-8 byte sequence in the text (<E2><80><A2>). Any ideas?
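
For reference, the two byte sequences mentioned above can be compared with a short Python sketch. It is only an illustration of the encodings involved, not of anything Webinator does internally, and the helper `is_valid_utf8` is a made-up name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


BULLET = "\u2022"

# The bullet encodes to three bytes in UTF-8.
print(BULLET.encode("utf-8").hex(" "))  # e2 80 a2

print(is_valid_utf8(b"\xe2\x80\xa2"))  # True
print(is_valid_utf8(b"\xb7"))          # False: a lone continuation byte

# In ISO-8859-1 the single byte B7 is U+00B7 MIDDLE DOT.
print(b"\xb7".decode("iso-8859-1") == "\u00b7")  # True
```

That B7 is the middle dot in ISO-8859-1 may hint that the bullet is being mapped to a Latin-1 look-alike before storage, but that is only a guess.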

Posted: **Thu Oct 11, 2007 1:11 pm**

We've confirmed there's a problem with &#8226; references, and will be releasing a scripts update that fixes it soon.

The problem is specific to that individual character; are there any other problems you're observing?

Posted: **Thu Oct 11, 2007 8:40 pm**

OK, when I view the source from the browser, &#8226; had actually become &amp;#8226;, and that's why it is displayed as the literal text &#8226; in the browser. This may be because Webinator saves the website's content into the database incorrectly when crawling.

I think all non-ASCII characters will face this problem. Hope this helps.

Posted: **Fri Oct 12, 2007 8:32 am**

That's different from the behavior we're seeing. If you go to the list/edit URLs page for that URL, what does the body of the page look like? I.e., does it show &#8226; on the page, or some invalid character?

Posted: **Fri Oct 12, 2007 9:45 am**

Also, what are the storage charset and source default charset (both from All Walk Settings) and the display charset (from Search Settings) set to?

Posted: **Fri Oct 12, 2007 9:54 am**

FYI, HTML entities (&...) are always Unicode, regardless of Content-Type charset, because the official HTML charset is always Unicode; Content-Type charset is merely the "transfer encoding" for raw chars/non-entities.

Of course, sometimes authors nonetheless use entities that refer to the Content-Type charset instead of Unicode (I've wanted a <fetch> option to handle that), but that is incorrect.
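
The difference between the two readings can be sketched in Python. The standard-library `html.unescape` is real; `strict_reference` and `charset_relative_reference` are hypothetical helpers written only for this illustration, and windows-1252 is just an assumed example of a Content-Type charset:

```python
import html


def strict_reference(number: int) -> str:
    """Read a numeric character reference as a Unicode code point."""
    return chr(number)


def charset_relative_reference(number: int, charset: str) -> str:
    """Hypothetical helper: read the number as a byte value in `charset`.

    This mimics the incorrect author usage, not standard HTML behaviour.
    """
    return bytes([number]).decode(charset)


# &#8226; names U+2022 BULLET whatever charset the page declares.
print(html.unescape("&#8226;") == strict_reference(8226) == "\u2022")  # True

# 149 is the bullet's byte value in windows-1252,
# but read as Unicode it is the control character U+0095.
print(ascii(strict_reference(149)))                      # '\x95'
print(ascii(charset_relative_reference(149, "cp1252")))  # '\u2022'
```

A fetch option of the kind wished for above would presumably have to choose the second reading on purpose, since the first is the correct one.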

encoding problem
