We have to create a multilingual search engine on a company portal that supports four languages. We are using code based on the dowalk sample script to index our pages.
We found that <urlcp 8bithtml no>, which transforms accented characters to their unaccented equivalents when using <fetch>, <urltext>, etc., was quite handy.
However, we don't know how to apply this same logic when performing a LIKEP search. In other words, how do we transform a search query string to its unaccented equivalent with the same ease as 8bithtml?
You can use an often overlooked feature of <fetch> for that: it lets you hand <fetch> the HTML yourself, and it will act as if it had fetched it, e.g. <fetch "http://localhost/index.html" $html>, where $html contains the text to translate.
It works, but here's a caveat for anyone following along: the string you pass as the first parameter must be a URL that would return either a valid page or, as with the URL given here, a 404 error page. Don't try passing an empty string, "about:blank", etc. It has to be a well-formed http URL.
Otherwise the text is returned, but WITHOUT the character translation. That's kinda strange behavior for <fetch>, no? Making an http: request for nothing? Or deciding not to make the translation because a page that would never be reached isn't a well-formed URL?
I forgot to make that point explicitly, although that is why I used that URL. <fetch> uses the URL to decide how to handle the data: since it has a .html extension, the data is treated as HTML; otherwise <fetch> can't tell whether it should be treated as HTML or plain text.
No, it doesn't make the HTTP call. It just uses the URL to decide how to treat the data (for example, its MIME type) and as the base for making relative URLs absolute.
Because I can't stop pestering you on this topic: what is <urltext> translating the HTML-encoded characters (e.g. &uuml;) to? I get a "division sign" in the output to the DOS box, for example.
If <urlcp 8bithtml> is on (the default), known HTML escape sequences such as &uuml; will be replaced with their single 8-bit character equivalent (decimal code 252 in this case). If <urlcp 8bithtml> is off, the nearest US-ASCII character is substituted instead (decimal code 117, i.e. "u", in this case).
If you're getting a division sign, it's probably because the font doesn't display 8-bit ISO-8859-1 characters correctly, or some intermediate process is garbling them. Print the output of <urltext> directly to a file and examine the relevant section with a hex editor. Or print the output of <urltext> with <fmt "%U" $ret> and look for the relevant character to come out as %FC -- the URL encoding of character code 252.
Ok, now our problem is getting uglier: the search is performed from a DLL written in C that hooks into the Texis API, and Texis doesn't appear to have this functionality (why should it? It just deals with the database!).
In other words, while we encode and index the body of the HTML page in unaccented form, the search is done from within the DLL, so the <fetch> trick won't work: the API will still perform the search with the accented characters.
Does anyone have ideas on how to emulate the 8-to-7-bit translation from within the API, making sure it does the same character mapping?
Texis still has the 8bithtml functionality, as Vortex uses Texis. You can use the C call htmakepage() to get the behavior of <fetch "http://localhost/index.html" $html>:
#include "http.h"

HTOBJ *obj;
HTPAGE *pg = HTPAGEPN;
char *myhtmlbuffer = "some <B>HTML</B> to test";

obj = openhtobj();                    /* open an object */
htsetflags(obj, HTSF_DO8BIT, 0);      /* turn off the 8-bit HTML flag */
pg = htmakepage(obj, "http://localhost/index.html", myhtmlbuffer);
if (pg != HTPAGEPN)
{
    if (htformatpage(pg))             /* form the <urltext> data */
    {
        /* formatted text is in the pg->txt buffer, with length pg->txtsz */
    }
    pg = closehtpage(pg);             /* free the page */
}
obj = closehtobj(obj);