HTML entities being stripped to ? on walk

Post by **rob133** » Thu May 26, 2005 8:22 pm

Hi, I'm running Webinator 5.1.10-Unix. HTML entities such as `&#151;` are turning into question marks in my results. I believe the munging is occurring on storage, not display, because I went into the search script and did a <send $Body> immediately after the SQL query, and still saw the question marks.

When you click through to the result page, you see the proper data. The data is UTF-8 and my servers (both the server with the page and the search server) output UTF-8 headers by default. I've tried setting Storage Charset to UTF-8 and leaving blank; same with Source Default Charset. I can't figure it out!

Here is an example:

http://search.dvcotechnology.com/cgi-bi ... =September

Any idea what's happening here?
Thanks

Post by **Kai** » Fri May 27, 2005 10:55 am

Set XML UTF-8 to N in All Walk Settings and rewalk. Unicode chars U+0080 through U+00FF are replaced with question marks when that setting is active, because those characters cause the entire page to error in XML mode on many browsers.

In version 5.1.15 of the scripts, XML UTF-8 defaults to N, as this "fixup" is only done at search time in XML mode, so those characters are not stripped when stored. (You would still need to turn it off and rewalk though.)

Post by **rob133** » Fri May 27, 2005 1:34 pm

Thanks for your response. I set XML UTF-8 to N and rewalked. Now instead of question marks the entities show up as that little box (on win IE) or that weird capital A thing (on mac Firefox):

"PALO ALTO, Calif.  September 30, 2004  A leading..."

http://search.dvcotechnology.com/cgi-bi ... =September

What do you suggest?
Thanks

Post by **Kai** » Fri May 27, 2005 5:24 pm

Your look and feel should set the charset to UTF-8 (assuming Storage Charset and Display Charset are empty (the default) or UTF-8). Webinator 5.x stores and displays pages in UTF-8 by default for uniformity.

Post by **rob133** » Fri May 27, 2005 5:53 pm

The look and feel is set to UTF-8. The server sends an HTTP header to that effect, and I have now even updated the search settings on my test to include the standard HTML header, which includes the UTF-8 META tag. Storage Charset, Source Default Charset, and Display Charset are all set to UTF-8. Still I get the wrong characters.

FWIW, here is a test version of the search script, in which I execute <send $Body> immediately after the SQL query. The characters are already wrong at this point.

http://search.dvcotechnology.com/cgi-bi ... =September

I am completely baffled.

Post by **Kai** » Tue May 31, 2005 3:25 pm

The problem is that the source HTML page's &#NNN; entities use code points from the Windows-1252 character set, not Unicode. The HTML document character set -- which &#NNN; entities are supposed to be in -- is always Unicode, regardless of the document's character *encoding* (the header/<META> Content-Type charset). The encoding refers only to the byte-by-byte mapping of the content, not the logical interpretation of entities.

Many browsers work around documents that do not use Unicode for entities by silently guessing the intended charset, which is why the page displays correctly in IE and Firefox. Webinator, however, currently expects entities to be Unicode code points, as per the HTML 4 standard. In Unicode, the code points these entities use (U+0080 - U+009F) are control characters, so they do not display.
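To make the difference concrete, here is a minimal Python sketch (not part of Webinator; Python's `cp1252` codec stands in for the browser's guess, and the `describe` helper is hypothetical). It compares the strict Unicode reading of a code point with its Windows-1252 reading:

```python
import unicodedata

def describe(code_point):
    """Return (Unicode general category of the code point as written,
    Unicode name of the character a Windows-1252 reading would give)."""
    strict = unicodedata.category(chr(code_point))      # what HTML 4 says &#NNN; means
    guessed = bytes([code_point]).decode("cp1252")      # what the page author meant
    return strict, unicodedata.name(guessed)

# &#151; read strictly is a control character ("Cc");
# read as Windows-1252 it is an em dash.
print(describe(151))
```

The "Cc" category is why nothing sensible is displayed once the entity has been stored as a literal Unicode character.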

You'll need to change the entities in the source HTML pages to Unicode names, e.g. `&rsquo;` for `&#146;`, `&ldquo;` for `&#147;`, `&rdquo;` for `&#148;`, and `&mdash;` for `&#151;`.

Alternatively, you might be able to edit the search script to re-HTML-escape such entities, which would let a browser try to guess the charset as it does for the original URL. This will not work if other entities outside U+0080 - U+009F are used in the site. Edit the search script, and change the <a name=setupresult> function such that this line:

<strfmt $fmt $mmfmt $dabstract><$dabstract=$ret>

becomes:

<strfmt "%!V" $dabstract>
<strfmt "%mbllH" $mmfmt $dabstract>
<$dabstract = $ret>

Post by **rob133** » Tue May 31, 2005 5:27 pm

Ahh, that's it. Stupid Windows-1252 entities. Converting them to Unicode-safe equivalents did the trick. Thanks for your help!

For the benefit of other readers, I used this table for my conversion:

http://intertwingly.net/stories/2004/04 ... ingWindows
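For anyone who would rather script the conversion than work through a table by hand, the following is a rough Python sketch, not the method used in this thread. It assumes the pages contain decimal references only (hex forms such as `&#x97;` are not handled) and emits numeric Unicode references rather than named entities. The function name `fix_cp1252_entities` is made up for illustration.

```python
import re

# Matches decimal numeric character references such as &#151;
_NUMERIC_ENTITY = re.compile(r"&#(\d+);")

def fix_cp1252_entities(html):
    """Rewrite &#128;..&#159; as the Unicode code point of the
    Windows-1252 character the author most likely intended."""
    def replace(match):
        value = int(match.group(1))
        if not 128 <= value <= 159:
            return match.group(0)      # outside the problem range: leave alone
        try:
            char = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)      # 129, 141, 143, 144, 157 are unassigned in cp1252
        return "&#%d;" % ord(char)
    return _NUMERIC_ENTITY.sub(replace, html)

print(fix_cp1252_entities("PALO ALTO, Calif. &#151; September 30, 2004"))
```

Run it over a copy of the source pages first and compare the results, since it rewrites every matching reference it finds, including any inside scripts or comments.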