We have to create a multilingual search engine on a company portal that supports four languages. We are using code based on the dowalk sample script to index our pages.
We found that <urlcp 8bithtml no>, which transforms accented characters to their unaccented equivalents when using <fetch>, <urltext>, etc., was quite handy.
However, we don't know how to apply this same logic when performing a LIKEP search. In other words, how do we transform a search query string to its unaccented equivalent with the same ease as 8bithtml?
You can use an often overlooked feature of <fetch> for that: it lets you hand <fetch> the HTML yourself, and it will act as if it had fetched it, e.g. <fetch "http://localhost/index.html" $html>, where $html contains the text to translate.
It works, but here's a caveat for anyone following along: the string you pass as the first parameter must be a URL that would return either a valid page or, as with the URL given here, a 404 error page. Don't try passing an empty string, "about:blank", etc. It has to be a well-formed http URL.
Otherwise the text is returned, but WITHOUT the character translation. That's kinda strange behavior for <fetch>, no? Making an http: request for nothing? Or deciding not to make the translation because a page that would never be reached isn't a well-formed URL?
I forgot to make that point explicitly, although that is why I used that URL. <fetch> uses the URL to decide how to handle the data: since it has a .html extension, the data is treated as HTML; otherwise <fetch> can't tell whether it should be treated as HTML or plain text.
No, it doesn't make the HTTP call. It just uses the URL to decide how to treat the data (for example, its MIME type) and as the base for making relative URLs absolute.
Because I can't stop pestering you on this topic: what is <urltext> translating the HTML-encoded characters (e.g. &uuml;) to? I get a "division sign" in the output to the DOS box, for example.
If <urlcp 8bithtml> is on (the default), known HTML escape sequences such as &uuml; will be replaced with their single 8-bit character equivalent (decimal code 252 in this case). If <urlcp 8bithtml> is off, the nearest US-ASCII character is substituted instead (decimal code 117, i.e. "u", in this case).
If you're getting a division sign, it's probably because the font doesn't display 8-bit ISO-8859-1 characters correctly, or some intermediate process is garbling them. Print the output of <urltext> directly to a file and examine the relevant section with a hex editor. Or print the output of <urltext> with <fmt "%U" $ret> and look for the relevant character to come out as %FC -- the URL encoding of character code 252.
Ok, now our problem is getting uglier: the search is performed from a DLL written in C that hooks into the Texis API, and Texis doesn't appear to have this functionality (why should it? It just deals with the database!).
In other words, while we encode and index the body of the HTML page in unaccented form, the search is done from within the DLL, so the <fetch> trick won't work: the API will still perform the search with the accented characters.
Does anyone have ideas on how to emulate the 8-to-7-bit translation from within the API, making sure it does the same character mapping?
Texis still has the 8bithtml functionality, as Vortex uses Texis. You can use the C call htmakepage() to get the behavior of <fetch "http://localhost/index.html" $html>:
#include "http.h"

HTOBJ *obj;
HTPAGE *pg = HTPAGEPN;
char *myhtmlbuffer = "some <B>HTML</B> to test";

obj = openhtobj();                    /* open an object */
htsetflags(obj, HTSF_DO8BIT, 0);      /* turn off the 8-bit HTML flag */
pg = htmakepage(obj, "http://localhost/index.html", myhtmlbuffer);
if (pg != HTPAGEPN)
{
    if (htformatpage(pg))             /* form the <urltext> data */
    {
        /* formatted text is in the pg->txt buffer, with length pg->txtsz */
    }
    pg = closehtpage(pg);             /* free the page */
}
obj = closehtobj(obj);