HTML entities and search

Post by **bhiggins** » Wed Mar 08, 2006 5:06 pm

We have a database in which special characters are saved as HTML entities. I'll use Guantanamo as my example. In the database it's saved as Guant&aacute;namo. Of course, a search for Guantanamo won't find it.

There are any number of ways to approach this, but I'm wondering if there is something we can do from the search side rather than doing a lot of reworking of the DB, which is huge.

If you believe sandr or regex work on the DB is required I'd appreciate any suggestions on that end as well.

(FYI: This is full-blown Texis running on a Unix server. I'm in the unfortunate position of picking up support from a guy who has left the company.)

Thanks for any and all input.

Bob

Post by **John** » Wed Mar 08, 2006 5:30 pm

If you have a limited and known number of words with HTML entities you might be able to make use of the thesaurus to aid with the search.

Do you know roughly what proportion of the records have HTML entities? A Vortex script to process the table is probably the best plan: either update the existing table in place if the number of records to edit is small, or create a replacement table if you will be changing most of the table. The Vortex fetch and strfmt functions can help with decoding HTML.
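
As a rough illustration of the decoding step only, here is a minimal sketch in Python rather than Vortex. It assumes the rows have already been read out of the table as (id, text) pairs; `decode_entities` and `rows_needing_update` are hypothetical helper names, not part of Texis. Note that `html.unescape` also decodes `&amp;` and `&lt;`, which may or may not be wanted if the stored text contains markup.

```python
import html


def decode_entities(text: str) -> str:
    """Replace named and numeric HTML entities with the characters they stand for."""
    return html.unescape(text)


def rows_needing_update(rows):
    """Yield (row_id, decoded_text) only for rows whose text actually changes.

    `rows` is assumed to be an iterable of (row_id, text) pairs fetched
    from the table by some other means.
    """
    for row_id, text in rows:
        decoded = decode_entities(text)
        if decoded != text:
            yield row_id, decoded


if __name__ == "__main__":
    sample = [(1, "Detainees at Guant&aacute;namo"), (2, "No entities here")]
    for row_id, fixed in rows_needing_update(sample):
        print(row_id, fixed)
```

Yielding only the changed rows matches the "update in place if the number of records is small" option; for the replacement-table option every row would be written out, decoded or not.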

Post by **bhiggins** » Wed Mar 08, 2006 6:06 pm

I only have to deal with 10% of the database; the rest was handled differently. Of that 10% I'd estimate a third of them have accents.

Some background: This is a database of newspaper stories. For the older records the developer had a separate field in the DB where he stored the non-accented version of each accented word (resume, cliche, etc.). The search would then find either version of the word. When HTML entities were introduced about a year ago it broke that functionality. So that's what I'm trying to remedy.

Meanwhile, on equivalence: There's something I'm not doing right, apparently. I created a very simple eqvsusr.lst file with just one line:

guantanamo,guant&aacute;namo

Ran backref, no errors reported. Equivs are on in the search page. But search results are no different. What might I be missing?

Thanks,
Bob
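
For reference, the "non-accented version" field described in the post above could be derived mechanically once the text holds real characters instead of entities. The following is a minimal Python sketch, not the original developer's method; `fold_accents` is a hypothetical name, and the approach (Unicode decomposition, then dropping combining marks) is an assumption about what "non-accented" should mean.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with accents removed, e.g. for a parallel search field.

    NFD normalization splits an accented letter into its base letter plus
    combining marks; the marks are then dropped. Letters that have no
    decomposition (such as "ø") pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_accents("Guantánamo résumé cliché"))
```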

Post by **mark** » Thu Mar 09, 2006 11:03 am

The ; is special in the equiv file. Escape it with \
guantanamo,guant&aacute\;namo

So you're saying all new data has the entities? I'd take John's suggestion and get rid of the entities by replacing them with their equivalent characters at import time, as well as fix up the small fraction of records that already have the entities. Then the search will continue to work as before.
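
If the equiv route is kept as a stopgap, lines in the escaped form shown in this post could be generated rather than typed by hand. This is a minimal Python sketch under stated assumptions: the only escaping rule applied is the backslash before `;` mentioned above (any other special characters in the equiv format are not handled), input words are assumed to be lowercased already, and `plain_form` and `equiv_line` are hypothetical helper names.

```python
import html
import unicodedata


def plain_form(entity_word: str) -> str:
    """Decode HTML entities, then strip accents to get the plain search word."""
    decoded = html.unescape(entity_word)
    decomposed = unicodedata.normalize("NFD", decoded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equiv_line(entity_word: str) -> str:
    """Build one 'plain,entity' line with each ';' escaped by a backslash."""
    escaped = entity_word.replace(";", "\\;")
    return f"{plain_form(entity_word)},{escaped}"


if __name__ == "__main__":
    for word in ["guant&aacute;namo", "r&eacute;sum&eacute;"]:
        print(equiv_line(word))
```

The generated lines would still need to be compiled with backref, as described earlier in the thread.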

Post by **bhiggins** » Thu Mar 09, 2006 12:30 pm

I'll try out your suggestions, thanks.

I escaped the ; in the equivalence file, but results are the same on search. Any further suggestions welcomed.

Bob

Post by **mark** » Thu Mar 09, 2006 2:02 pm

Do you have all the requisite settings to use and allow the equivs?
<apicp eqprefix /full/path/to/your/equiv/file>
<apicp keepeqvs 1>
<apicp alequivs 1>
Make sure the equiv file is readable by the texis user.
Check the source of your results page and/or vortex.log to see if there are any errors generated.