Cyrillic HTML pages

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Andy Savchenkov said:

When you index, you will need to pass the -k option to gw so that these
other characters are included in the index. Normally gw only indexes the
letters used in English. You will need to do something like the following:

gw -d- -unindex
gw -d- -k"[\alpha\x80-\xff]{2,99}" -index

to create an index that includes all the characters with the eighth (high)
bit set. If you only want the characters that are typically displayable,
change the \x80 to \xa0. I am unsure which of those characters are used
in Cyrillic, so I'm not sure whether \xa0 would be sufficient.
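
One way to settle the \x80 versus \xa0 question is to check which byte values the Russian letters occupy in the encoding the pages use. The following is a rough sketch in Python, not part of Webinator; the list of encodings is an assumption, since the thread does not say which one the site is written in, and byte_range is a hypothetical helper name.

```python
# Sketch: find which byte values Russian letters occupy in several
# common 8-bit Cyrillic encodings. Which encoding the site actually
# uses is an assumption; the thread does not say.

# Unicode U+0410..U+044F covers the basic Russian alphabet, both cases;
# the letter pair Ё/ё sits outside that block and is added by hand.
CYRILLIC = [chr(c) for c in range(0x0410, 0x0450)] + ["Ё", "ё"]


def byte_range(encoding):
    """Return (lowest, highest) byte value used by Russian letters."""
    values = [ch.encode(encoding)[0] for ch in CYRILLIC]
    return min(values), max(values)


for enc in ("koi8_r", "cp1251", "iso8859_5", "cp866"):
    lo, hi = byte_range(enc)
    print(f"{enc}: \\x{lo:02x}-\\x{hi:02x}")
```

For KOI8-R and Windows-1251 every letter comes out at or above \xa0, so the narrower range would appear to be enough for those two. For cp866 the letters start at \x80, so there the wider range from the example above would be needed.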

John Turnbull
-------------
Thunderstone Technical Support

Post by Thunderstone »



Hello everybody.

I'm using a free copy of Webinator to index and search my web site. The
problem is that all pages on this site are written in Russian.
At first Webinator indexed the pages but did not display Russian words
properly.
I solved that problem by using the HTML2TXT lexical filter, which strips out
HTML tags and does not alter the text:

gw -n"text/html,htm,html2txt"

Now I can see Cyrillic letters on the screen, but I cannot search for Russian
words. With English words everything is OK.
When I use a direct URL to Webinator:

http://www.rcnet.ru/cgi-bin/webinator?a ... fixproc=on&
thesaurus=0&cmd=+Go+&cmd=find&db=db&disp=all&grsz=10

I can see normal search results, but if I type arg=компания everything
goes wrong: Webinator cannot find anything.
The word компания does exist in db/html.tbl.
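
To make the byte-level picture concrete, here is a small Python sketch of what a query word such as компания looks like on the wire. It assumes the pages are in KOI8-R, which the thread does not state outright. Percent-encoding is the standard way to carry 8-bit bytes in a URL, but whether that alone fixes the search is not established here.

```python
from urllib.parse import quote

word = "компания"

# Bytes as a KOI8-R page or form would send them (assumed encoding).
raw = word.encode("koi8_r")
print(raw.hex())

# The same bytes misread as Latin-1: this is how the word shows up
# when the viewer assumes a Western European character set.
print(raw.decode("latin-1"))

# Percent-encoded form, safe to place in a query string after "arg=".
print(quote(raw))
```

All eight bytes fall in the \xc1-\xd1 range, so none of them are letters that an English-only index expression would match.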

Please advise me how I can solve this problem with a national language.

-----------------------------------------------
Andy Savchenkov
Race Communications USA, Inc.
