gw stalled on Japanese document

Post by **mikep** » Fri Sep 10, 2004 9:42 am

I tried using gw version 2.7 to index a web site and it stalled on a specific document. I opened this file in IE, and it is entirely Japanese with no English at all.

1) Does indexing with gw 2.7 support the same character translation as the dowalk script?

2) Will gw use all the Windows character sets I have installed? I.e., do I just need to add support for Japanese at the Windows level, or does gw use its own routines?

3) My index expression is currently set to \alnum{2,30}. Will changing this make a difference?

Post by **Kai** » Fri Sep 10, 2004 6:28 pm

What release of gw is this (printed by gw -version)? Certain releases before 20040223 could hang on truncated hi-bit documents like Japanese.

1) Yes, with the exception of the Storage and Display Charset settings, which are handled by the dowalk and search scripts. gw will use the default of UTF-8.

2) Webinator uses its own routines for character set translation, as native/system support varies by platform. So installing character sets in Windows will not affect Webinator.

3) Your index expression will need to include hi-bit bytes as well for the non-ASCII entities; you could use an index expression such as [\alnum\x80-\xff]{1,40} for this.
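To see why the hi-bit range matters, here is a rough Python sketch (not Webinator or REX itself) that runs byte-level regexes over UTF-8 text. It assumes `\alnum` behaves like ASCII `[A-Za-z0-9]`, and the names `ASCII_ONLY`, `WITH_HIBIT` and `index_terms` are made up for the illustration; REX matching rules may differ in detail from Python's `re`.

```python
import re

# Approximation of \alnum{2,30}: ASCII letters and digits only (assumption).
ASCII_ONLY = re.compile(rb"[A-Za-z0-9]{2,30}")
# Approximation of [\alnum\x80-\xff]{1,40}: also accepts hi-bit bytes.
WITH_HIBIT = re.compile(rb"[A-Za-z0-9\x80-\xff]{1,40}")


def index_terms(pattern, text):
    """Return the byte runs the pattern would pick up as index terms."""
    raw = text.encode("utf-8")
    return [m.decode("utf-8", errors="replace") for m in pattern.findall(raw)]


sample = "Hello 日本語 world"
print(index_terms(ASCII_ONLY, sample))  # Japanese bytes are skipped entirely
print(index_terms(WITH_HIBIT, sample))  # Japanese run becomes one term
```

With the ASCII-only pattern the Japanese text never reaches the index at all; with the hi-bit range it is indexed, but as one long term per unbroken run.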

Additional notes on indexing Japanese:

Since Japanese characters, unlike European ones, are essentially words and run together with no spaces between them, a search for an individual word-character may not match everything (i.e., in English it would be like searching for `Test' in a document with the text `ThisIsATestOfThings': it's a substring and won't match). So you'll probably want to add another index expression that treats each UTF-8 character as a separate word for Japanese:

[\xc0-\xfd]=[\x80-\xbf]+
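As a sanity check on that expression, the following Python sketch applies the same byte ranges (a lead byte in `\xc0-\xfd` followed by one or more continuation bytes in `\x80-\xbf`) to UTF-8 text. It is only an approximation using Python's `re`, not REX, and `PER_CHAR` and `split_chars` are hypothetical names.

```python
import re

# One multi-byte UTF-8 character: lead byte, then continuation bytes.
PER_CHAR = re.compile(rb"[\xc0-\xfd][\x80-\xbf]+")


def split_chars(text):
    """Return each multi-byte UTF-8 character as its own term."""
    raw = text.encode("utf-8")
    return [m.decode("utf-8") for m in PER_CHAR.findall(raw)]


print(split_chars("日本語"))       # each character is a separate term
print(split_chars("abc 日本 xyz"))  # plain ASCII is ignored by this pattern
```

ASCII words are not matched here, which is why this would be an additional index expression alongside the alphanumeric one rather than a replacement for it.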

Then at search time, you'll need to turn each run of contiguous UTF-8 entities in the query into a phrase (e.g. query `MyTest' would become `"My Test"'). You can use this Vortex code:

<!-- Double-quote each group of UTF-8 entities to make a phrase: -->
<sandr "[\x80-\xfd]+" '\x22\1\x22' $query>
<!-- Separate each UTF-8 entity with space to make each a word: -->
<sandr "[\xc0-\xfd]=[\x80-\xbf]+" "\x20\1\2" $ret>
<!-- Remove leading spaces from phrases: " x" becomes "x": -->
<sandr '\x22\P=\x20=\F[\xc0-\xfd]=' "" $ret>
<$query = $ret>
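If you want to try the rewrite outside Vortex, the three steps can be mimicked in Python roughly as below. This is a sketch under assumptions, not a translation guaranteed to match `sandr` in every case: it works on the UTF-8 bytes of the query, uses Python's `re` in place of REX, and `phrase_utf8_runs` is a made-up name.

```python
import re


def phrase_utf8_runs(query):
    """Quote each run of multi-byte UTF-8 characters and space them apart."""
    data = query.encode("utf-8")
    # Step 1: double-quote each run of hi-bit bytes to make a phrase.
    data = re.sub(rb"[\x80-\xfd]+", rb'"\g<0>"', data)
    # Step 2: put a space before each multi-byte character.
    data = re.sub(rb"([\xc0-\xfd])([\x80-\xbf]+)", rb" \1\2", data)
    # Step 3: drop the space that now follows each opening quote.
    data = re.sub(rb'(?<=") (?=[\xc0-\xfd])', b"", data)
    return data.decode("utf-8")


print(phrase_utf8_runs("foo 日本語"))
```

For the query `foo 日本語' this gives `foo "日 本 語"', so the Japanese part is searched as a phrase of single-character words while the ASCII term is left alone.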