Indexing Unicode and ANSI

Post by **kevin31** » Tue Jul 15, 2003 6:03 pm

I am trying to make text that is either Unicode or that contains extended ASCII (ANSI) characters searchable. I am using the <timport> function in Vortex. Here is the schema definition for reading the file containing the data:
<$schema = "
# take multiple records
multiple
recdelim \x1A
# name type tag
field FormID varchar /\x1\x30\P=[\alnum]+ UNKNOWN
field FormType varchar /\x1\x31\P=[\alnum]+ UNKNOWN
field TextBlob varchar /\x1\x32\P=[^\x1]+ UNKNOWN
field ActionFlag varchar /\x1\x33\P=[\alnum]+ UNKNOWN
">

and here is code that builds the metamorph index:

<SQL "set keepnoise=1"></SQL>
<SQL "set delexp=0"></SQL>
<SQL "set addexp='[\alnum\x80-\xFF]{2,99}'"></SQL>
<SQL "CREATE METAMORPH INVERTED INDEX IXT_tForms on tForms (TextBlob, FormType, FormID)"></SQL>

The data is correct in the file, but I am not sure it is correct in the Texis db. When I run a query on text containing extended ASCII characters it will not return results. Can you see a problem with my REX expressions perhaps? Thanks.

Post by **mark** » Tue Jul 15, 2003 6:17 pm

In your timport expressions you probably want
[\alnum\x80-\xFF]
rather than just [\alnum]

Just do a straight select from the table to see what you're getting loaded. Once the correct data is loaded then worry about the indexing and search.
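To illustrate why the narrower class stops short: the sketch below uses Python's `re` module on bytes as a stand-in for REX, assuming `\alnum` covers only ASCII letters and digits (REX's actual behavior may depend on locale). The sample word and variable names are hypothetical.

```python
import re

# Hypothetical sample: "café" in ISO-8859-1, where é is the single byte 0xE9.
sample = "café".encode("latin-1")

# ASCII-only class, standing in for [\alnum] (an assumption about REX).
ascii_only = re.compile(rb"[A-Za-z0-9]+")
# Same class widened with the high-bit range, mirroring [\alnum\x80-\xFF].
with_high_bytes = re.compile(rb"[A-Za-z0-9\x80-\xFF]+")

ascii_match = ascii_only.match(sample).group()
wide_match = with_high_bytes.match(sample).group()

print(ascii_match)  # the match ends before the 0xE9 byte
print(wide_match)   # the match includes the 0xE9 byte
```

Under that assumption, a field matched with the narrow class would be truncated at the first extended character, which is why checking the loaded table with a plain select comes first.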

Post by **kevin31** » Tue Jul 15, 2003 6:32 pm

The timport expression in the schema currently reads
field TextBlob varchar /\x1\x32\P=[^\x1]+ UNKNOWN

Shouldn't the [^\x1] match on everything except the one character?

Post by **mark** » Tue Jul 15, 2003 11:14 pm

I was referring to the other expressions in case those fields were relevant.

Just do a straight select from the table to see what you're getting loaded. Once the correct data is loaded then worry about the indexing and search.

What's the sql for your query?
Do you get any error/warning messages within html comments when you execute your query?

Post by **mark** » Tue Jul 15, 2003 11:15 pm

Oh, and the \x1 should be \x01

Post by **kevin31** » Wed Jul 16, 2003 1:43 pm

When I do a straight select from sql the extended ASCII characters do not look correct at the TSQL command line, although they are correct in the file being imported.
Here is a simple query:
"select * from tForms where TextBlob likep 'hello'"
The resulting output should contain extended ASCII characters, which I cannot paste here because they do not display correctly on your site, unfortunately. It appears that a single extended ASCII character is getting split into two characters.

I tried changing the [^\x1] to [^\x01] as you suggested. Is there something else the timport function requires to read extended ASCII?

Post by **mark** » Wed Jul 16, 2003 2:51 pm

Timport and Texis won't modify your 8-bit bytes, so if the bytes are in there you probably have a display problem on your terminal. Try dumping the output of tsql to a file and dumping that file with the "hex" command to verify its contents.

What specifically is incorrect?
What does your sql insert statement look like?
Do you manipulate the fields from timport at all before inserting them?

Post by **kevin31** » Wed Jul 16, 2003 4:21 pm

Part of my problem is on my end in the writing of the data file which is being read by timport. If you can answer these questions it might help me.
- Assuming I produce a valid Unicode data file, can timport read that?
- What would the schema look like, including the REX expressions on text and integer fields, for reading a Unicode file?

Post by **Kai** » Wed Jul 16, 2003 6:29 pm

If the encoding for the Unicode data does not contain nul bytes -- eg. UTF-8, ISO-8859 -- then it can be timported. (UTF-16 may present a problem due to nuls, however some pre-/post-processing with Vortex should be able to handle it, eg. <strfmt "%v">.)

In your schemas, make sure to change \x1 to \x01 for a proper escape. Also, any fields that could be Unicode should use [\alnum\x80-\xFF] instead of \alnum, to pick up the extended chars.

The fact that you're seeing a single-byte extended-ASCII character (ISO-8859?) apparently being translated to 2 bytes likely indicates some display or mapping issue: timport/Texis will copy a varchar field as-is to the table, with no encoding translation. The way to be sure what's in the table is to redirect tsql output to a file and hex dump it (tsql -d "..." >filename; hex filename).
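A rough sketch of what such a hex dump can reveal, written in Python rather than with the tsql and hex tools named above. The sample strings and the `hex_dump` helper are hypothetical; the point is only that the same character occupies one byte in ISO-8859-1 and two in UTF-8, and that UTF-16 introduces nul bytes.

```python
def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

accented = "\u00e1"  # a-acute, the character with code 0xE1 in ISO-8859-1

latin1_bytes = accented.encode("latin-1")  # one byte
utf8_bytes = accented.encode("utf-8")      # two bytes
utf16_bytes = "hello".encode("utf-16-le")  # every other byte is nul

print(hex_dump(latin1_bytes))
print(hex_dump(utf8_bytes))
print(hex_dump(utf16_bytes))

# Viewing UTF-8 bytes through an ISO-8859-1 display yields two characters,
# one plausible cause of a character that appears to be "split".
misread = utf8_bytes.decode("latin-1")
print(len(misread))
```

If the dump shows `e1` the table holds single-byte data and the terminal is the likely culprit; if it shows `c3 a1` the file was probably written as UTF-8 before timport ever saw it.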

Post by **kevin31** » Thu Jul 17, 2003 1:54 pm

For the moment I am skipping the Unicode requirement and indexing extended ASCII characters only. My understanding is that this covers most western European countries. I have been able to get Texis to read and index this data. However, I access my Vortex script via CGI, which means that the search term is URL encoded on the query string, and when extended chars are URL encoded a char such as the letter 'a' with an accent (Alt-225 or E1 in hex) will wind up looking like "%ffffffe1". This is because as a signed byte it is represented as -31. The Vortex script will interpret "%e1" correctly, but not the former. Is there something I can do in the Vortex script to solve this?
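The sign-extension effect described above can be reproduced outside Vortex. The Python sketch below is only an illustration under assumptions: `encode_signed`, `encode_masked` and `normalize_query` are hypothetical helpers, not Vortex or Texis functions, and the encoder that produces the query string is assumed to format a signed char as a 32-bit hex value.

```python
import re

def encode_signed(byte_value: int) -> str:
    """Mimic an encoder that treats the byte as a signed char (assumed bug)."""
    signed = byte_value - 256 if byte_value > 127 else byte_value
    # A negative value formatted as 32-bit hex picks up leading ff bytes.
    return "%" + format(signed & 0xFFFFFFFF, "02x")

def encode_masked(byte_value: int) -> str:
    """Mask to 8 bits before formatting, which keeps the escape two digits."""
    return "%" + format(byte_value & 0xFF, "02x")

def normalize_query(query: str) -> str:
    """Collapse sign-extended escapes such as %ffffffe1 back to %e1."""
    return re.sub(r"%[fF]{6}([0-9a-fA-F]{2})", r"%\1", query)

broken = encode_signed(0xE1)
fixed = encode_masked(0xE1)
repaired = normalize_query("q=" + broken)

print(broken)
print(fixed)
print(repaired)
```

Masking in the encoder is the cleaner option where the client code can be changed. Rewriting the query string afterwards is ambiguous, since a legitimate `%ff` followed by the literal text `ffff` and two more hex digits would be rewritten as well.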