Additional Field Encoding

fowlers
Posts: 25
Joined: Tue Feb 06, 2007 10:53 am

Post by fowlers »

I recently added an additional field for the meta description tag to one of our search profiles. In the Data From Field section, I set the REX Search to .+, the Replace to \1, the From Field to Description, and the To Field to my additional field, MetaDes. I do this so that I can use the meta description when it's there and fall back to the query abstract otherwise (we crawl many sites, and not all documents have a meta description).

However, there seems to be an encoding issue with something as simple as the registered trademark symbol. When I crawl this page - http://iwww.plasticsportal.com/products/ultraform.html - I get the following in the MetaDes additional field:
<MetaDes>Acetal polyoxymethylene (POM) copolymer products under the tradename Ultraform&#65533;...</MetaDes>
The unknown character should be a ®. Then, when I load the XML document from the appliance, it fails with "Invalid character in the given encoding." After setting Output to No on the additional field, it works fine. Any ideas?
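
For readers following along, here is a minimal Python sketch of one way an XML loader can reject a document over a single ® character. It is only an illustration under an assumption: that the raw ISO-8859-1 byte 0xAE ends up inside a document declared as UTF-8. The thread does not confirm that this is what the appliance emitted, and the `try_parse` helper and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="UTF-8"?>'

# Hypothetical document: the registered sign as a raw ISO-8859-1 byte (0xAE)
# inside XML that claims to be UTF-8. 0xAE alone is not valid UTF-8.
bad_doc = DECL + b"<MetaDes>Ultraform\xae</MetaDes>"

# The same text with the registered sign correctly encoded as UTF-8 (0xC2 0xAE).
good_doc = DECL + "<MetaDes>Ultraform\u00ae</MetaDes>".encode("utf-8")


def try_parse(doc: bytes):
    """Return the element text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


print(try_parse(bad_doc))   # parser refuses the stray 0xAE byte
print(try_parse(good_doc))  # parses normally
```

If the loader on your side behaves like this, checking the raw bytes of the response (rather than what a browser renders) would show whether the field contains 0xAE, 0xC2 0xAE, or a numeric character reference.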
Kai
Site Admin
Posts: 1271
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

That works here, with the latest scripts (texisScripts-6.3.0). Are you sure something downstream (the browser?) isn't mapping the U+00AE (registered sign) character to U+FFFD (&#65533;)? I can't find anything in the Appliance that would do that kind of mapping on that URL's text, given its labelled charset of ISO-8859-1.

Do you have Storage Charset or Source Default Charset set?
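
As a rough illustration of the kind of mapping in question, the Python sketch below shows how the ISO-8859-1 byte for ® turns into U+FFFD when some step decodes it as UTF-8 with replacement. This is an assumption about a possible cause, not a description of what the Appliance does; the `xml_char_ref` helper is hypothetical.

```python
REGISTERED = "\u00ae"  # the registered sign

# In ISO-8859-1 the registered sign is the single byte 0xAE.
latin1_bytes = REGISTERED.encode("iso-8859-1")

# Decoding with the labelled charset recovers the character.
correct = latin1_bytes.decode("iso-8859-1")

# Decoding the same byte as UTF-8 fails; with errors="replace" the decoder
# substitutes U+FFFD (REPLACEMENT CHARACTER) for the invalid byte.
wrong = latin1_bytes.decode("utf-8", errors="replace")


def xml_char_ref(text: str) -> str:
    """Hypothetical helper: write non-ASCII characters as decimal references."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


print(xml_char_ref(correct))  # &#174;
print(xml_char_ref(wrong))    # &#65533;
```

Seeing &#65533; in the field therefore suggests that, somewhere along the way, the page bytes were read with a charset other than the one they were written in.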
fowlers
Posts: 25
Joined: Tue Feb 06, 2007 10:53 am

Post by fowlers »

Storage Charset is set to UTF-8.
Source Default Charset is ISO-8859-1.

I haven't installed the newest scripts yet, but I'll do that in a minute.