new generic filter routine in dowalk

mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

There's a db1.long or db2.long file in the dataspace directory, which is $logfile.
Or just change the write so it writes to some other file:

<write /tmp/debug.log>
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

So instead of the above <pre>...</pre> around <manglepage> what would it be? The existing snippet messes up the GUI.

In any case, if I look at the logfile, which is readable (thanks for the filename tip), I see that the sandr doesn't appear to be doing any good, as my remove word is still there :(.

<pre>s_removewords[0]="[^\alnum]\P=>>MYWORDHERE=\F[^\alnum]"
s_removewords[1]=">>= "
s_removewords[2]=" =>>="
html before=...
etc.

Hopefully it's just a small tweak to the regex?
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

OK, I think I figured it out. This line:

<local hb he>
<$needmangle=N><!-- do we need to process HTML before extracting

came AFTER my code and was resetting needmangle back to N, so my code wasn't getting called in <manglepage>. I moved my (your!) code to after that section, which ends in </local>, and it appears to do the trick!
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

Ok, one related question... I tried using the 'ignore tags' feature for this but it didn't quite work as expected, and ended up wiping huge chunks of my resulting HTML.

If I want to remove all data between this:

Start:
<b>This is my start: ...

End:
... my ending</a>

What would be the easiest way of doing that? I want to delete the smallest match possible anywhere it exists on the page, so it doesn't get too greedy by accident.

Thanks much!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Ignore tags will delete everything between and including the strings you specify; they need not be just HTML tags.
The removal is not greedy in the sense of taking the largest match: it takes the first match. But it is greedy in the sense that if the end tag doesn't exist, it will remove to the end of the file. You can change that by changing the line
<strfmt ">>%s=!%s*%s?" $hb $he $he>
to
<strfmt ">>%s=!%s*%s" $hb $he $he>
Then it will remove only if both begin and end tag exist.
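For anyone who wants to see the two behaviours side by side outside of Vortex, here is a rough Python sketch. It is not the dowalk code and does not use REX syntax; the function name `remove_between` and the `require_end` flag are made up for illustration. It only mimics the semantics described above: take the first (shortest) begin...end span, and either remove to end of file or leave the text alone when the end string is missing.

```python
import re


def remove_between(html: str, begin: str, end: str, require_end: bool = True) -> str:
    """Delete each begin...end span (markers included), shortest match first.

    Hypothetical helper for illustration only; not part of dowalk.
    require_end=True  -> remove only when both begin and end are present.
    require_end=False -> if end is missing, remove from begin to end of text.
    """
    b, e = re.escape(begin), re.escape(end)
    if require_end:
        pattern = f"{b}.*?{e}"
    else:
        # Lazy match up to the first end marker, or to end of text if none.
        pattern = f"{b}.*?(?:{e}|\\Z)"
    return re.sub(pattern, "", html, flags=re.DOTALL)


page = ("keep1 <b>This is my start: junk my ending</a> keep2 "
        "<b>This is my start: more junk my ending</a> keep3")
cleaned = remove_between(page, "<b>This is my start:", "my ending</a>")

unterminated = "keep <b>This is my start: junk with no end"
strict = remove_between(unterminated, "<b>This is my start:", "my ending</a>")
loose = remove_between(unterminated, "<b>This is my start:", "my ending</a>",
                       require_end=False)
```

With `require_end=True` the unterminated page comes back unchanged, while `require_end=False` cuts everything from the start string onward, which is the "wiping huge chunks" effect when an end string never shows up.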
gillsr
Posts: 14
Joined: Wed Apr 02, 2008 5:06 am

Post by gillsr »

Great tip, thank you!