new generic filter routine in dowalk

gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

I could really use some help in creating a routine similar to the existing ignore tags / keep tags options, but simpler.

Basically, I need a new option that takes a space-delimited series of words; if ANY of them are found in the walked page, every occurrence should be deleted. If I can already do this with the existing 'ignore tags' setup, I'd love to know! They're not tags per se, and I don't have a start/end delimiter, just a series of individual words I want removed entirely from the content.

Many thanks!
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Looking at where noisechar is processed is probably the way to go; that code removes specific characters from the content. The main differences would be splitting the field into words and then building an expression that matches the words only on word boundaries, not embedded as substrings in other words, if that is what you want.
John Turnbull
Thunderstone Software
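
For readers who want to see the word-boundary idea outside Vortex, here is a minimal Python sketch of it. It is not the dowalk code and does not use REX: the function names are hypothetical, "word character" is assumed to mean ASCII letters and digits, and matching is case-sensitive.

```python
import re

def build_remove_pattern(setting):
    """Compile one regex that matches any listed word as a whole word."""
    words = setting.split()  # split on runs of whitespace, dropping empties
    if not words:
        return re.compile(r"(?!)")  # empty setting: a pattern that never matches
    # Escape each word so punctuation in it is taken literally.
    alternation = "|".join(re.escape(w) for w in words)
    # The lookarounds reject matches embedded inside a longer alphanumeric run.
    return re.compile(r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])")

def remove_words(text, setting):
    """Delete every whole-word occurrence of the listed words from text."""
    return build_remove_pattern(setting).sub("", text)
```

With the setting "cat", this removes the standalone word from "the cat sat" but leaves "category" alone.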
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

Could you post an example of building the expression with word boundaries from a list of words grabbed from the UI?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Something like this would split your setting into multiple words, assuming that the "words" don't contain any REX special characters:
<$removewords=>
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<strfmt ">>=%s=>>=" $ret>
<$removewords=$removewords $ret>
</loop>

Something like this would break the text up, if you consider words to be strings of alphanumeric characters:
<$ret="[\alnum]+" "[^\alnum]+">
<rex $ret $page>
<$pageitems=$ret>

Then the processing: blank out any matching words, then put the text back together:
<sandr $removewords '' $pageitems>
<sum "%s" $ret>
<$page=$ret>
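
The tokenize, blank, and rejoin steps above can be sketched in Python for comparison. This is an analogue under stated assumptions, not a translation of the REX expressions: the function name is hypothetical, words are runs of ASCII letters and digits, and comparison is case-sensitive.

```python
import re

def remove_by_tokens(page, setting):
    """Split page into alphanumeric and non-alphanumeric runs, drop listed words, rejoin."""
    remove = set(setting.split())
    # Alternating runs cover every character, so joining the tokens restores the page.
    tokens = re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", page)
    # A token is blanked only when the whole run equals a listed word.
    return "".join("" if tok in remove else tok for tok in tokens)
```

Because each token is compared as a whole, "cheap" is removed while "cheaper" survives, and the surrounding punctuation and spacing are left as they were.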
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

An alternative approach:

<$removewords=>
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<strfmt "[^\\alnum]\\P=>>%s=\\F[^\\alnum]" $ret>
<$removewords=$removewords $ret>
</loop>
<$removewords=$removewords ">>= " " =>>="><!-- optional -->
<strfmt " %s " $page>
<sandr $removewords '' $ret>
<$page=$ret>
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

For the second one, you might want to take the added leading and trailing spaces back off. Add a couple of expressions to the end of the removewords list, after the words themselves:
<$removewords=$removewords ">>= " " =>>=">

I've edited the above to include that option.
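
The padding trick in the second approach can be illustrated with a short Python sketch: pad the page with one space on each side so that words at the very start or end still have a non-alphanumeric neighbor, remove the words, then take the padding back off. The function name is hypothetical, and ASCII alphanumerics and case-sensitive matching are assumptions of this sketch rather than properties of the Vortex version.

```python
import re

def remove_with_padding(page, setting):
    """Remove listed words that have a non-alphanumeric character on both sides."""
    padded = " " + page + " "  # guarantees a neighbor at both ends of the page
    for word in setting.split():
        # The neighbors are only checked, never consumed, so they stay in the text.
        pattern = r"(?<=[^A-Za-z0-9])" + re.escape(word) + r"(?=[^A-Za-z0-9])"
        padded = re.sub(pattern, "", padded)
    return padded[1:-1]  # strip the two padding spaces again
```

Since only the word itself is deleted, the padding spaces are still the first and last characters afterwards, which is why slicing them off is safe.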
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

Very helpful, thanks!

To put this into practice, is it reasonable to put the following before the <local hb he> line (where the discard/keep HTML processing happens)?

<$s_removewords=>
<split nonempty "\space+" $SSc_removewords></split>
<loop $ret>
<$needmangle=Y><!-- indicate that raw html needs preprocessing -->
<strfmt "[^\\alnum]\\P=>>%s=\\F[^\\alnum]" $ret>
<$s_removewords=$s_removewords $ret>
</loop>
<$s_removewords=$s_removewords ">>= " " =>>="><!-- optional -->

Then, update the manglepage routine to add this near the top:

<if $s_removewords ne "">
<sandr $s_removewords "" $htmlpage>
<$htmlpage=$ret>
</if>
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

No. You don't want to work on the HTML page. You need to work on the extracted text, which is produced after manglepage. Also, manglepage only applies to HTML files, not other formats. A good place would be in procpage, maybe right before the <datafromfield> call.
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

Well, I don't want this processed on the extracted text. I want it processed on all non-binary downloaded files, so that the cached content is also adjusted (not just the index or the resulting text).

Perhaps applying the sandr to $htmlpage inside the procpage function then?

I also don't think the pre-processing of the user input data needs to happen on every page, so I think it should go near the discard/keep HTML setup in the applysettings function, just before the <local hb he> line.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Then, yes, it needs to be done to $htmlpage in manglepage, and $needmangle should be set if the remove words option is set. Whether it runs before or after the keep/ignore processing is debatable. I'd lean towards after.

I don't know what you mean by "user input data" or how it applies to this discussion.
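
The split discussed in this thread (build the expression once when settings are applied, flag that pages need preprocessing, then apply it to each raw page) might look like the following Python sketch. It only mirrors the structure; WalkSettings and mangle_page are hypothetical stand-ins for applysettings and manglepage, and the ASCII, case-sensitive word-boundary rule is an assumption.

```python
import re

class WalkSettings:
    """Hypothetical stand-in for the one-time settings step."""

    def __init__(self, remove_words):
        words = remove_words.split()
        self.remove_pattern = None
        if words:
            alternation = "|".join(re.escape(w) for w in words)
            # Compiled once here, not once per page.
            self.remove_pattern = re.compile(
                r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])"
            )
        # Mirrors setting $needmangle only when the option is non-empty.
        self.need_mangle = self.remove_pattern is not None

def mangle_page(settings, raw_page):
    """Hypothetical per-page step: strip the words from the raw page text."""
    if not settings.need_mangle:
        return raw_page  # nothing configured, leave the page untouched
    return settings.remove_pattern.sub("", raw_page)
```

Note that this sketch works on the raw markup as text, so a listed word would also be removed if it appeared inside a tag or attribute.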