RE(2): Cyrillic HTML pages

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm

Post by Thunderstone »



Hi All,

Using the -n option sounded like a neat idea for processing foreign characters
in HTML documents. The only problem I'm having is that the -n option
turns off the internal gw HTML parser. In plainer English, after I convert
the foreign characters with my external "plugin", Webinator won't parse
hyperlinks <A HREF=....>.

Is there a way to have an external plugin, but still let Webinator parse
the links? How did Andy Savchenkov do it?

Thanks,

-Kevin McCarthy
AMD
kevin.mccarthy@amd.com

---- webinator(a)thunderstone.com's Message ----

Andy Savchenkov said:

You will need to use the -k option to gw when you index to include these
other characters in the index. Normally gw only indexes the letters used
in English. You will need to do something like the following:

gw -d- -unindex
gw -d- -k"[\alpha\x80-\xff]{2,99}" -index

to create an index that includes all characters with the eighth (high) bit set.
If you want only the characters typically used for display, change the \x80
to \xa0. I am unsure which of these characters are used in Cyrillic, so I
can't say whether \xa0 would be sufficient.
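
Whether \xa0 is a sufficient lower bound depends on the 8-bit encoding of the pages, which this thread does not state. As a rough check outside of gw, the Python sketch below finds the lowest byte used by the Russian alphabet in four common Cyrillic code pages (the list of encodings is an assumption) and mimics the -k expression with an ordinary Python regex, using A-Za-z as a stand-in for \alpha. It is not REX syntax and does not claim to show how gw tokenizes internally.

```python
import re

# Russian alphabet: U+0410..U+044F plus the letters U+0401 and U+0451.
RUSSIAN_LETTERS = "".join(chr(c) for c in range(0x0410, 0x0450)) + "\u0401\u0451"

# Assumed candidates; the thread does not say which encoding the pages use.
ENCODINGS = ["cp1251", "koi8_r", "iso8859_5", "cp866"]


def lowest_letter_byte(encoding):
    """Return the smallest byte value any Russian letter maps to."""
    return min(RUSSIAN_LETTERS.encode(encoding))


def word_pattern(floor):
    """Approximate [\\alpha\\xNN-\\xff]{2,99} as a Python bytes regex."""
    return re.compile(b"[A-Za-z\\x%02x-\\xff]{2,99}" % floor)


def tokens(text, encoding, floor):
    """Encode text and return the byte runs the pattern would treat as words."""
    return word_pattern(floor).findall(text.encode(encoding))


for enc in ENCODINGS:
    low = lowest_letter_byte(enc)
    print(f"{enc}: lowest letter byte 0x{low:02x}, \\xa0 floor sufficient: {low >= 0xA0}")
```

Under these assumptions, Windows-1251, KOI8-R and ISO-8859-5 keep every Russian letter at or above \xa1, so the \xa0 variant would cover them. CP866 starts its capital letters at \x80, so there a \xa0 floor would cut words apart and the \x80 form is the safer choice. Cyrillic letters outside the Russian alphabet are not checked here.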

John Turnbull
-------------
Thunderstone Technical Support

Post by Thunderstone »



Egads! This is getting complex.

The thought behind the Plugins was/is:


                                  + yes--> HTML Parser ---\
Fetched URL -> Webinator -> HTML? |                        ->Database
                                  + no---> Mime Plugin ---/

There isn't currently the structure for preprocessing the HTML
with a Plugin. I suppose it could be done though.
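
For illustration only, here is a minimal Python sketch of the flow in the diagram, with an optional preprocessing hook placed in front of the HTML parser. None of this is Webinator code: process, parse_html, mime_plugin and the preprocess argument are hypothetical names, and the Latin-1 fallback is an assumption. It only shows that a character-set conversion step can run first while link extraction still happens afterwards.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and visible text from HTML markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def parse_html(markup):
    """The 'yes' branch: parse HTML into text plus links."""
    parser = LinkAndTextParser()
    parser.feed(markup)
    parser.close()
    return {"text": " ".join(parser.text), "links": parser.links}


def mime_plugin(raw):
    """The 'no' branch: stand-in for an external converter (hypothetical)."""
    return {"text": raw.decode("latin-1"), "links": []}


def process(content_type, raw, preprocess=None):
    """Route a fetched document; preprocess (if given) runs before HTML parsing."""
    if content_type.startswith("text/html"):
        markup = preprocess(raw) if preprocess else raw.decode("latin-1")
        return parse_html(markup)
    return mime_plugin(raw)


# Usage: a KOI8-R page is converted first, and its link is still found.
page = '<A HREF="/doc.html">\u041f\u0440\u0438\u0432\u0435\u0442</A>'.encode("koi8_r")
print(process("text/html", page, preprocess=lambda raw: raw.decode("koi8_r")))
```

The point of the hook is ordering: conversion happens on the raw bytes, and the parser sees the converted markup, so hyperlinks are not lost the way they are when the parser is switched off entirely.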

Are there other uses for this beyond language translation?

Thanks,
Bart



Post by Thunderstone »



Thunderstone - EPI said:

The need for this should be reduced with the upcoming version of Webinator,
which will not translate foreign characters. It should leave foreign text
alone, which should be correct in most cases.

John Turnbull
-------------
Thunderstone Technical Support