new generic filter routine in dowalk

Post by **gillsr** » Wed Apr 02, 2008 5:12 am

I could really use some help in creating a routine similar to the existing ignore tags / keep tags options, but simpler.

Basically I need a new option that takes a series of space-delimited words, and if ANY of them are found in the walked page, their occurrences should be deleted. If I can already do this trickery with the existing 'ignore tags' setup, I'd love to know! They're not tags per se, and I don't have a start/end delimiter, just a series of individual words I want totally removed from the content.

Many thanks!

Post by **John** » Wed Apr 02, 2008 7:39 am

Looking at where noisechar is processed is probably the way to go; that code removes specific characters from the content. The main differences would be splitting the field into words, and then building an expression that makes sure the words are matched on word boundaries rather than as substrings embedded in other words, if that is what you want.
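To make the substring versus word-boundary distinction concrete, here is a rough sketch in Python rather than Vortex. The function name `build_pattern` and the sample strings are made up for illustration, and Python's `\b` is only an approximation of whatever boundary test you end up writing in REX.

```python
import re

def build_pattern(setting: str) -> re.Pattern:
    """Turn a space-delimited setting into one word-boundary pattern."""
    words = setting.split()                     # split on whitespace, skipping empties
    escaped = [re.escape(w) for w in words]     # neutralise regex special characters
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b")

text = "the cat sat on the category mat"

# Plain substring removal also damages "category".
naive = text.replace("cat", "")

# Boundary-aware removal only touches whole words.
bounded = build_pattern("cat mat").sub("", text)

print(repr(naive))    # 'the  sat on the egory mat'
print(repr(bounded))  # 'the  sat on the category '
```

Note that the surrounding spaces are left behind in both cases; whether to collapse them afterwards is a separate decision.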

Post by **gillsr** » Wed Apr 02, 2008 11:31 am

Could you post any examples of building the expression with word boundaries from a list of words grabbed from the UI?

Post by **mark** » Wed Apr 02, 2008 12:10 pm

Something like this to split your setting into multiple words assuming that "words" don't contain any REX special characters:
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<strfmt ">>=%s=>>=" $ret>
<$removewords=$removewords $ret>
</loop>

Something like this to break the text up if you consider words to be strings of alpha-numeric:
<$ret="[\alnum]+" "[^\alnum]+">
<rex $ret $page>
<$pageitems=$ret>

Then processing. Blank out any matching words then put the text back together:
<sandr $removewords '' $pageitems>
<sum "%s" $ret>
<$page=$ret>
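For readers who do not have a Vortex setup handy, the same three steps (split the setting, break the page into alphanumeric and non-alphanumeric runs, blank out matching runs and rejoin) can be sketched in Python. This is an illustration of the idea only, not a translation of the Vortex above: `remove_words_tokenised` is a hypothetical name, "alphanumeric" is assumed to mean ASCII letters and digits, and the comparison is case-sensitive.

```python
import re

def remove_words_tokenised(page: str, setting: str) -> str:
    """Blank out whole tokens that appear in the space-delimited setting."""
    remove = set(setting.split())

    # Alternating runs of alphanumerics and everything else; joining the
    # pieces back together reproduces the original page exactly.
    pieces = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", page)

    # A piece is dropped only if it equals a listed word in full, so
    # "cheaply" survives when "cheap" is listed.
    kept = ["" if piece in remove else piece for piece in pieces]
    return "".join(kept)

print(repr(remove_words_tokenised("buy cheap pills, cheaply!", "cheap pills")))
# 'buy  , cheaply!'
```

Because the page is tokenised first, no boundary expression is needed: a word can only match a complete token.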

Post by **mark** » Wed Apr 02, 2008 12:13 pm

An alternative approach:

<$removewords=>
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<strfmt "[^\\alnum]\\P=>>%s=\\F[^\\alnum]" $ret>
<$removewords=$removewords $ret>
</loop>
<$removewords=$removewords ">>= " " =>>=">
<strfmt " %s " $page>
<sandr $removewords '' $ret>
<$page=$ret>

Post by **mark** » Wed Apr 02, 2008 12:17 pm

For the second one you might want to take the added leading and trailing spaces back off. Add a couple of expressions to the end of the removewords list after the words list:
<$removewords=$removewords ">>= " " =>>=">

I've edited the above to include that option.
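As a rough Python analogue of this second approach: pad the text with a space on each side so every word has a non-alphanumeric neighbour, remove listed words only when both neighbours are non-alphanumeric, then take the padding back off. The look-behind and look-ahead here stand in for the `\P` and `\F` context markers on the assumption that those exclude the neighbouring characters from the match; `remove_words_padded` is a hypothetical name.

```python
import re

def remove_words_padded(page: str, setting: str) -> str:
    """Remove listed words that have a non-alphanumeric character on both sides."""
    words = [re.escape(w) for w in setting.split()]
    if not words:
        return page                      # nothing configured, nothing to do

    # The context classes need a real character to match, so a word at the
    # very start or end of the page would be missed without padding.
    padded = " " + page + " "

    pattern = re.compile(
        r"(?<=[^A-Za-z0-9])(?:" + "|".join(words) + r")(?=[^A-Za-z0-9])"
    )
    cleaned = pattern.sub("", padded)

    # The neighbours are never consumed, so the padding is still in place.
    return cleaned[1:-1]

print(repr(remove_words_padded("cheap pills and cheaply", "cheap pills")))
# '  and cheaply'
```

Unlike the tokenising version, this makes a single pass over the text with one combined expression.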

Post by **gillsr** » Wed Apr 02, 2008 7:42 pm

Very helpful, thanks!

To put this into practice, is it reasonable to put this before the <local hb he> line (where the discard/keep html stuff happens):

<$s_removewords=>
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<$needmangle=Y>
<strfmt "[^\\alnum]\\P=>>%s=\\F[^\\alnum]" $ret>
<$s_removewords=$s_removewords $ret>
</loop>
<$s_removewords=$s_removewords ">>= " " =>>=">

Then, update the manglepage routine to add this near the top:

<if $s_removewords ne "">
<sandr $s_removewords "" $htmlpage>
<$htmlpage=$ret>
</if>

Post by **mark** » Thu Apr 03, 2008 10:14 am

No. You don't want to work on the html page. You need to work on the extracted text, which is produced after manglepage. Also, manglepage only applies to html files, not other formats. A good place would be in procpage, maybe right before the <datafromfield> call.

Post by **gillsr** » Thu Apr 03, 2008 3:38 pm

Well, I don't want this processed on the extracted text. I want it processed on all non-binary downloaded files so that the cached content is also adjusted (not just the index or the resulting text).

Perhaps applying the sandr to $htmlpage inside the procpage function then?

I also don't think the pre-processing of the user input data needs to happen on every page, so I think that should go near the discard/keep html setup in the applysettings function, just before the <local hb he> line.

Post by **mark** » Thu Apr 03, 2008 4:04 pm

Then, yes, it needs to be done to $htmlpage in manglepage and $needmangle should be set if the remove words option is set. Whether before or after the keep/ignore processing is debatable. I'd lean towards after.

I don't know what you mean by "user input data" or how it applies to this discussion.
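For reference, the arrangement discussed in the last few posts (prepare the word list once when settings are applied, then apply it to each raw page only if the option is set) can be sketched in Python. This is not the dowalk code: `compile_removewords` and `mangle_page` are hypothetical stand-ins for the applysettings and manglepage steps, and the keep/ignore processing is only indicated by a comment.

```python
import re
from typing import Optional

def compile_removewords(setting: str) -> Optional[re.Pattern]:
    """Run once per walk: build the expression, or None if the option is empty."""
    words = [re.escape(w) for w in setting.split()]
    if not words:
        return None
    # Negative look-arounds also succeed at the start and end of the text,
    # so no padding is required in this variant.
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + "|".join(words) + r")(?![A-Za-z0-9])"
    )

def mangle_page(raw: str, pattern: Optional[re.Pattern]) -> str:
    """Run once per page, on the raw downloaded content."""
    # Keep/ignore tag processing would run here first.
    if pattern is None:
        return raw                       # option not set: leave the page alone
    return pattern.sub("", raw)

pattern = compile_removewords("foo bar")            # once
pages = ["<p>foo and food</p>", "bar"]
cleaned = [mangle_page(p, pattern) for p in pages]  # per page

print(cleaned)  # ['<p> and food</p>', '']
```

One thing to watch when filtering raw HTML this way: the expression cannot tell text from markup, so a listed word such as `p` or `div` would also be stripped out of the tags themselves.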