8bithtml behavior for LIKEP?

Crash.Alpha
Posts: 9
Joined: Sun Sep 23, 2001 1:26 pm

Post by Crash.Alpha »

We have to create a multilingual search engine on a company portal that supports four languages. We are using code based on the dowalk sample script to index our pages.

We found that <urlcp 8bithtml no>, which transforms accented characters to unaccented ones when using <fetch>, <urltext>, etc., was quite nice.

However, we don't know how to implement this same logic when performing a LIKEP search. In other words, how do we transform a search query string to its unaccented equivalent with the same ease as 8bithtml?

Thanks!

Carlo
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

You can use an often-overlooked feature of <fetch> for that: it lets you hand <fetch> the HTML directly, and it will act as if it had fetched it, e.g.

<urlcp 8bithtml no>
<fetch http://localhost/dummy.html $query>
<urltext>

might do what you want.
John Turnbull
Thunderstone Software
Crash.Alpha
Posts: 9
Joined: Sun Sep 23, 2001 1:26 pm

Post by Crash.Alpha »

It works, but here's a caveat to anyone watching: you have to pass a string (as the first parameter) that will return either a valid page, or in the case of the URL given here, a 404 error page. Don't try passing it an empty string, or just "about:blank", etc. It has to be a well-formed http URL.

Otherwise, the text is returned - but WITHOUT the character translation. This is kinda strange behavior for <fetch>, no? Is it making an http: request for nothing? Or deciding not to make the translation because the URL of a page that would never be reached is not well-formed?

It DOES work though... thanks, John.

Carlo
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

I forgot to make that point explicitly, although that is why I used that URL. <fetch> uses the URL to decide how to handle the data. Since the URL has a .html extension, the data is treated as HTML; otherwise it can't tell whether the data should be treated as HTML or just plain text.
John Turnbull
Thunderstone Software
Crash.Alpha
Posts: 9
Joined: Sun Sep 23, 2001 1:26 pm

Post by Crash.Alpha »

Does it actually make the http call? PLEASE say it doesn't! Then I can say that I've found a solution...

Carlo
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

No, it doesn't make the HTTP call. It just uses the URL to decide how to treat the data (for example, which MIME type to assume) and as the base for making relative URLs absolute.
John Turnbull
Thunderstone Software
Crash.Alpha
Posts: 9
Joined: Sun Sep 23, 2001 1:26 pm

Post by Crash.Alpha »

Because I can't stop pestering you on this topic - what is <urltext> translating the HTML-encoded characters (e.g. &uuml;) to? I get a "division sign" in the output to the DOS box, for example.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

If <urlcp 8bithtml> is on (the default), known HTML escape sequences such as &uuml; will be replaced with their 8-bit single character equivalent (decimal code 252 in this case). If <urlcp 8bithtml> is off, then the nearest US-ASCII character is substituted (decimal code 117 or "u" in this case).
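
To make the "nearest US-ASCII character" idea concrete, here is a minimal C sketch of that kind of fold for ISO-8859-1 bytes. The helper names and the table are my own assumptions for illustration, not Texis code, and the table is not guaranteed to match the mapping Texis applies when 8bithtml is off:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the Texis API: fold one ISO-8859-1
 * byte to a nearby US-ASCII letter. The table is illustrative only and
 * may differ from the mapping Texis uses with 8bithtml off. */
static unsigned char fold_latin1_byte(unsigned char c)
{
    /* One entry per code 0xC0..0xFF. '?' means "no single-letter
     * equivalent in this sketch", in which case the byte is kept. */
    static const char map[] =
        "AAAAAA?C" "EEEEIIII"   /* 0xC0-0xCF */
        "DNOOOOO?" "OUUUUY?s"   /* 0xD0-0xDF */
        "aaaaaa?c" "eeeeiiii"   /* 0xE0-0xEF */
        "dnooooo?" "ouuuuy?y";  /* 0xF0-0xFF */
    char m;

    if (c < 0xC0)
        return c;               /* ASCII and 0x80-0xBF pass through */
    m = map[c - 0xC0];
    return (m == '?') ? c : (unsigned char)m;
}

/* Fold a NUL-terminated ISO-8859-1 string in place. */
static void fold_latin1(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++)
        *s = (char)fold_latin1_byte((unsigned char)*s);
}
```

With this sketch, code 252 (&uuml;) folds to code 117 ("u"), as in the example above. Bytes below 0xC0 and letters without a one-character equivalent (such as 0xC6) are left unchanged.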

If you're getting a division sign, it's probably because the font doesn't display 8-bit ISO-8859-1 characters correctly, or some intermediate process is garbling the characters. Print the output of <urltext> directly to a file and examine the relevant section with a hex editor. Or print the output of <urltext> with <fmt "%U" $ret> and look for the relevant section to be encoded as %FC -- the URL encoding for character code 252 (hex FC).
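
If the text is being inspected from C rather than from Vortex, a similar check can be done by hand. The following is a rough sketch of a debugging helper (the name percent_dump is hypothetical, and it is not claimed to match the exact output of %U): it copies ASCII letters and digits as-is and shows every other byte as %XX, so byte 252 shows up as %FC regardless of the console font.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical debugging helper, NOT a Texis function: copy src to dst,
 * writing every byte that is not an ASCII letter or digit as %XX.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if dst is too small. */
static int percent_dump(const char *src, char *dst, size_t dstsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (c < 0x80 && isalnum(c)) {
            if (n + 1 >= dstsz)     /* need room for char + NUL */
                return -1;
            dst[n++] = (char)c;
        } else {
            if (n + 3 >= dstsz)     /* need room for %XX + NUL */
                return -1;
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        }
    }
    if (n >= dstsz)
        return -1;
    dst[n] = '\0';
    return (int)n;
}
```

Passing the text buffer through this and printing the result makes it easy to see whether the byte really is FC (8bithtml on) or a plain "u" (8bithtml off).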
Crash.Alpha
Posts: 9
Joined: Sun Sep 23, 2001 1:26 pm

Post by Crash.Alpha »

Ok, now our problem is getting uglier: the search is being performed from a DLL written in C that hooks into the Texis API. Texis doesn't appear to have this functionality (why should it? It just deals with the database!)

In other words, while we are encoding and indexing the body of the html page in unaccented format, the search is being done from within a DLL, so the <fetch> trick won't work... as the API will still be performing the search with accented characters.

Does anyone have any ideas on how to emulate the 8-bit to 7-bit translation within the API, making sure that it does the same type of character mapping?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Texis still has the 8bithtml functionality, as Vortex uses Texis. You can use the C call htmakepage() to get the behavior of <fetch "http://localhost/index.html" $html>:

#include "http.h"

HTOBJ *obj;
HTPAGE *pg = HTPAGEPN;
char *myhtmlbuffer = "some <B>HTML</B> to test";

obj = openhtobj();                 /* open an object */
htsetflags(obj, HTSF_DO8BIT, 0);   /* turn off 8-bit HTML flag */
pg = htmakepage(obj, "http://localhost/index.html", myhtmlbuffer);
if (pg != HTPAGEPN)
{
    if (htformatpage(pg))          /* form <urltext> data */
    {
        /* formatted text is pg->txt buffer, with length pg->txtsz */
    }
    pg = closehtpage(pg);          /* free the page */
}
obj = closehtobj(obj);             /* free the object */