I could really use some help in creating a routine similar to the existing ignore tags / keep tags options, but simpler.
Basically I need a new option that takes a series of space-delimited words, and if ANY of them are found in the walked page, their occurrences should be deleted. If I can already do this trickery with the existing 'ignore tags' setup, I'd love to know! They're not tags per se, and I don't have a start/end delimiter, just a series of individual words I want totally removed from the content.
Looking at where noisechar is processed is probably the way to go; that code removes specific characters from the content. The main differences would be splitting the field into words and then building an expression that matches the words only on word boundaries, not as substrings embedded in other words, if that is what you want.
Something like this would split your setting into multiple words, assuming that the "words" don't contain any REX special characters:
<split nonempty "\space+" $SSc_removewords>
<loop $ret>
<strfmt ">>=%s=>>=" $ret>
<$removewords=$removewords $ret>
</loop>
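For anyone following along without Vortex to hand, the splitting step can be sketched in Python. This is only an illustration of the logic, not Webinator code: `parse_remove_words` is a made-up name, and unlike the snippet above it escapes each word, so the "no special characters" assumption isn't needed in this version.

```python
import re

def parse_remove_words(setting):
    """Split a space-delimited setting into one compiled pattern per word.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    words = setting.split()  # splits on runs of whitespace, drops empty items
    # re.escape neutralises regex metacharacters, so a "word" such as "c++"
    # is matched literally. Callers use fullmatch() so that a pattern only
    # matches a token that is exactly that word.
    return [re.compile(re.escape(word)) for word in words]

patterns = parse_remove_words("  foo   bar c++ ")
print(len(patterns))
```

Using `fullmatch` against each token plays roughly the same role as anchoring the expression at both ends: "foo" matches the token "foo" but not "food".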
Something like this would break the text up, if you consider words to be strings of alphanumeric characters:
<$ret="[\alnum]+" "[^\alnum]+">
<rex $ret $page>
<$pageitems=$ret>
Then comes the processing: blank out any matching words, then put the text back together:
<sandr $removewords '' $pageitems>
<sum "%s" $ret>
<$page=$ret>
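The tokenise, blank, and rejoin idea above can be sketched in Python as follows. Again this is an illustration rather than Vortex: `remove_words` is a hypothetical name, matching is case-sensitive, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

def remove_words(text, words):
    """Delete whole-word occurrences of `words` from `text` (case-sensitive)."""
    unwanted = set(words)
    # Alternating runs of alphanumerics and non-alphanumerics; joined back
    # together, the tokens reproduce the input exactly.
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)
    # Blank every token that is exactly an unwanted word, keep the rest.
    kept = ["" if tok in unwanted else tok for tok in tokens]
    return "".join(kept)

print(remove_words("the cat sat on the category", ["cat", "on"]))
```

Because whole tokens are compared, "category" survives even though it contains "cat". The separators on either side of a removed word are left in place, so doubled spaces can remain in the output.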
For the second one you might want to take the added leading and trailing spaces back off. Add a couple of expressions to the end of the removewords list after the words list:
<$removewords=$removewords ">>= " " =>>=">
No. You don't want to work on the html page. You need to work on the extracted text which happens after manglepage. Also manglepage only applies to html files, not other formats. A good place would be in procpage, maybe right before the <datafromfield> call.
Well, I don't want this processed on the extracted text. I want it processed on all non-binary downloaded files, so that the cached content is also adjusted (not just the index or the resulting text).
Perhaps applying the sandr to $htmlpage inside the procpage function then?
I also don't think the pre-processing of the user input data needs to happen on every page, so I think that should go near the discard/keep html setup in the applysettings function, just before the <local hb he> line.
Then, yes, it needs to be done to $htmlpage in manglepage and $needmangle should be set if the remove words option is set. Whether before or after the keep/ignore processing is debatable. I'd lean towards after.
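The split being discussed (prepare the word list once when settings are applied, then apply it to each raw page and flag that mangling is needed) might look like this in outline. It is a Python sketch under assumptions, not the Webinator script: `RemoveWords`, `need_mangle` and `mangle` are hypothetical names, and the words are assumed to be plain alphanumeric strings so that `\b` boundaries behave as expected.

```python
import re

class RemoveWords:
    """Hypothetical stand-in for the settings/mangle split discussed above."""

    def __init__(self, setting):
        words = setting.split()
        # Built once, when settings are applied, not once per page.
        self.pattern = (
            re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
            if words
            else None
        )

    @property
    def need_mangle(self):
        # Mirrors the idea of setting $needmangle only when the option is set.
        return self.pattern is not None

    def mangle(self, html_page):
        # Applied to the raw page, so cached content is adjusted as well.
        if self.pattern is None:
            return html_page
        return self.pattern.sub("", html_page)

rw = RemoveWords("foo bar")
print(rw.need_mangle)
```

One thing to keep in mind when working on the raw page rather than the extracted text: a pattern like this also hits tag and attribute names that happen to equal one of the words.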
I don't know what you mean by "user input data" or how it applies to this discussion.