UTF8 Query Parameters

sroth
Posts: 44
Joined: Mon Jul 23, 2007 11:21 am

Post by sroth »

I pass this as a query parameter: t=%C3%93lafur

It should decode as: Ólafur

The Vortex decoding results in this: ". lafur" (without the quotes)

The decoded string is incorrect in Vortex. Is there a way to correct this?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

As a query string, that text should be URL-decoded to UTF-8 automatically. Are you using any special decoding statement in Vortex (e.g. `<strfmt "%...">')? If so, which one?

It may be that the web server is altering the encoding before it reaches Vortex. Try printing the variable re-URL-encoded, so we can see what bytes it contains: `<fmt "t=[%U]\n" $t>'. That should print `t=[%C3%93lafur]'; if not, the web server might be altering `t' first.
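
For comparison outside Vortex, here is a minimal Python sketch of the same round-trip check: percent-decode the value to raw bytes, then re-encode those bytes and compare. It only illustrates the idea; the variable names are made up for the example, and it says nothing about how Vortex decodes internally.

```python
from urllib.parse import quote, unquote_to_bytes

raw = "%C3%93lafur"              # query value as it appears in the URL

data = unquote_to_bytes(raw)     # percent-decoding yields raw bytes
text = data.decode("utf-8")      # those bytes are UTF-8 for the name

# Re-encode every non-alphanumeric byte to see exactly what was received.
round_trip = quote(data, safe="")

print(data)
print(text)
print(round_trip)
```

If the re-encoded value differs from what was sent, something upstream changed the bytes before they reached the script.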

Post by sroth »

The result of the fmt statement is: t=[%C3%93lafur]

I'm running under Windows and using the built-in web server.

It looks like it's getting munged when I run any <sandr> on the $t variable.

Post by Kai »

What is the <sandr> you're using? Are you verifying the munge by printing the result with `<fmt "[%U]" $ret>' to see the hex bytes?

Post by sroth »

Looks like it's the punctuation character class that munges it. Is \punct valid?

<sandr '\punct' ' ' $t>

Post by sroth »

I switched the regex to the following and now I'm OK:
<sandr '[^a-zA-Z0-9\x80-\xff ]' ' ' $t>


I think I need to add an index expression to handle this search:

Query `zoë saldana' would require post-processing: Index expression(s) only partially prefix-match term `zoë'

What should I add to the index to support the extended characters?

Post by Kai »

It could be the class. The character classes depend on the current locale, which may not be set or obvious. In any event, REX and its classes are byte-based, so they may not work as expected on UTF-8 data.
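
To illustrate the byte-based problem, here is a rough Python sketch (not Vortex, and not REX's actual implementation). It assumes a single-byte Windows-1252 locale, which the thread does not confirm, and uses a hypothetical helper `strip_punct_bytewise` to mimic a per-byte punctuation class. Under that assumption the second byte of the UTF-8 sequence for Ó (0x93) looks like a curly quote, so it gets replaced and the character is broken.

```python
import re
import unicodedata

data = "Ólafur".encode("utf-8")   # b'\xc3\x93lafur'


def strip_punct_bytewise(buf: bytes, codepage: str = "cp1252") -> bytes:
    """Replace each byte that the single-byte codepage treats as punctuation."""
    out = bytearray()
    for b in buf:
        # Interpret the lone byte as one character of the assumed codepage.
        ch = bytes([b]).decode(codepage, errors="replace")
        is_punct = unicodedata.category(ch).startswith("P")
        out.append(0x20 if is_punct else b)
    return bytes(out)


# Per-byte punctuation class: splits the two-byte sequence for Ó.
munged = strip_punct_bytewise(data)

# Explicit byte ranges, as in the working <sandr>: high-bit bytes are kept.
kept = re.sub(rb"[^a-zA-Z0-9\x80-\xff ]", b" ", data)

print(munged)
print(kept.decode("utf-8"))
```

The explicit range `\x80-\xff` avoids the problem because it never asks the locale what a high-bit byte means; it simply keeps all of them.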

For that warning, add something like this index expression:

[\alnum\x80-\xff]{1,70}

It needs to include high-bit characters; the default expression of `\alnum{2,99}' only includes 7-bit ASCII.
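
As a rough illustration of why the warning appears, this Python sketch tokenizes the UTF-8 bytes of the query with two byte-based patterns. It approximates `\alnum` with `[a-zA-Z0-9]`, which is an assumption for the example; the real class depends on the locale, and this is not how the Texis indexer is implemented.

```python
import re

query = "zoë saldana".encode("utf-8")   # ë is the two bytes 0xC3 0xAB

# ASCII-only expression: the term is cut off at the first high-bit byte.
ascii_only = re.findall(rb"[a-zA-Z0-9]{2,99}", query)

# Expression that also accepts high-bit bytes: the whole term is one token.
with_high_bit = re.findall(rb"[a-zA-Z0-9\x80-\xff]{1,70}", query)

print(ascii_only)
print(with_high_bit)
```

With the ASCII-only pattern the index would only hold `zo` for that word, which matches the "only partially prefix-match" wording of the warning.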

Post by sroth »

Thanks, everything is working as expected now.