new generic filter routine in dowalk

Post by **mark** » Tue Apr 22, 2008 2:14 pm

There's a db1.long or db2.long file in the dataspace directory which is $logfile.
Or just change the write so that it goes to some other file:

<write /tmp/debug.log>

Post by **gillsr** » Tue Apr 22, 2008 10:18 pm

So instead of the above <pre>...</pre> around <manglepage> what would it be? The existing snippet messes up the GUI.

In any case, if I look at the logfile, which is readable (thanks for the filename tip), I see that the sandr doesn't appear to be doing any good, as my remove word is still there.

<pre>s_removewords[0]="[^\alnum]\P=>>MYWORDHERE=\F[^\alnum]"
s_removewords[1]=">>= "
s_removewords[2]=" =>>="
html before=...
etc.
</pre>

Hopefully it's just a small tweak to the regex?

Post by **gillsr** » Tue Apr 22, 2008 10:26 pm

OK, I think I figured it out. This line:

<local hb he>
<$needmangle=N><!-- do we need to process HTML before extracting

was AFTER my code, so it was resetting needmangle back to N, and thus the code wasn't getting called in <manglepage>. I moved my (your!) code to after that section ending in </local>, and it appears to do the trick!

Post by **gillsr** » Tue Apr 22, 2008 10:40 pm

Ok, one related question... I tried using the 'ignore tags' feature for this but it didn't quite work as expected, and ended up wiping huge chunks of my resulting HTML.

If I want to remove all data between this:

Start:
<b>This is my start: ...

End:
... my ending</a>

What would be the easiest way of doing that? I want to delete the smallest possible match anywhere it exists on the page, so it doesn't get too greedy by accident.

Thanks much!

Post by **mark** » Wed Apr 23, 2008 9:55 am

Ignore tags will delete everything between and including the strings you specify; they need not be just HTML tags.
The removal is not greedy in the sense of taking the largest match: it takes the first match. But it is greedy in the sense that if the end tag doesn't exist, it will remove to the end of the file. You can change that by changing the line
<strfmt ">>%s=!%s*%s?" $hb $he $he>
to
<strfmt ">>%s=!%s*%s" $hb $he $he>
Then it will remove only if both the begin and end tags exist.
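
For anyone who wants to see the two behaviors side by side outside of Vortex, here is a rough Python sketch. It is only an illustration of the semantics described above, not the dowalk code: Python's `re` syntax is not REX, and the function name `remove_span` and its `require_end` flag are made up for this example.

```python
import re


def remove_span(text: str, begin: str, end: str, require_end: bool = True) -> str:
    """Remove each shortest span from `begin` through `end`, inclusive.

    Hypothetical helper for illustration only.
    require_end=True  : remove only when both begin and end are present.
    require_end=False : if end is missing, remove from begin to end of text.
    """
    b, e = re.escape(begin), re.escape(end)
    if require_end:
        # Lazy ".*?" stops at the first end string after each begin string.
        pattern = b + r".*?" + e
    else:
        # Same, but fall back to end-of-text when no end string follows.
        pattern = b + r".*?(?:" + e + r"|\Z)"
    return re.sub(pattern, "", text, flags=re.DOTALL)


if __name__ == "__main__":
    start = "<b>This is my start:"
    stop = "my ending</a>"
    page = f"keep1 {start} junk {stop} keep2 {start} more {stop} keep3"
    print(remove_span(page, start, stop))

    broken = f"keep {start} junk with no end"
    print(remove_span(broken, start, stop, require_end=True))
    print(remove_span(broken, start, stop, require_end=False))
```

With `require_end=True` a page that has the start string but no end string is left untouched, which corresponds to the second `<strfmt>` line; with `require_end=False` everything from the start string onward is dropped, which corresponds to the first.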

Post by **gillsr** » Wed Apr 23, 2008 1:03 pm

Great tip, thank you!