Parser Troubles.

Posted: Tue May 19, 2009 2:00 pm
by gerry.odea
I'm trying to build a parser for this:

<a href="/search?hl=en&q=sports+cars&revid=238886396&ei=7usSSvz_BZiu8QSH7aSQBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=1"><b>sports</b> cars</a>

but I don't want to base it on

>><a=!href\=+href\=="?[^" >]+[^>]*>=[^<\x0a]+</a>=\x0a=

because there will be other <a href></a> elements that get pulled in. I only want to pull in the <a href>'s that have revisions_inline in the URL string of the <a href>.

Posted: Tue May 19, 2009 2:47 pm
by mark
The most reliable approach might be to make a second pass over the list returned by your first expression, keeping only the entries that contain revisions_inline:

<rex ".*>>revisions_inline=.*" $ret>
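This is not Vortex, but for anyone following along in another language, the same two-pass idea might look roughly like this in Python. The regex is only a loose translation of the first expression, and the second sample link is made up for contrast:

```python
import re

# Sample input: the first anchor is shortened from the post above,
# the second is a hypothetical unrelated link.
html = (
    '<a href="/search?hl=en&q=sports+cars&oi=revisions_inline&cd=1"><b>sports</b> cars</a>\n'
    '<a href="/about?hl=en">About</a>\n'
)

# Pass 1: pull in every <a href=...>...</a> element (deliberately broad).
ANCHOR = re.compile(r'<a\s+href="?[^" >]+[^>]*>.*?</a>', re.IGNORECASE | re.DOTALL)
anchors = ANCHOR.findall(html)

# Pass 2: keep only anchors whose opening tag mentions revisions_inline.
related = [a for a in anchors if "revisions_inline" in a.split(">", 1)[0]]

print(related)
```

Keeping the two passes separate means the broad expression stays simple, and the filter can change without touching it.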

Posted: Tue May 19, 2009 2:49 pm
by mark
Or you could use <fetch> to parse for you, and use <urlinfo links> instead of your expression to get the list of urls on the page. Then rex for revisions_inline in that list.
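
As a rough analogy only (this is Python's standard html.parser, not <fetch> or <urlinfo>), letting an HTML parser collect the links and then filtering that list could be sketched like this. The LinkCollector class and the sample page are hypothetical:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page: one related-search link and one unrelated link.
page = (
    '<a href="/search?hl=en&q=dog+breeds&oi=revisions_inline&cd=1"><b>dog breeds</b></a>'
    '<a href="/about?hl=en">About</a>'
)

collector = LinkCollector()
collector.feed(page)

# Filter the parsed link list instead of pattern-matching the raw HTML.
revision_links = [url for url in collector.links if "revisions_inline" in url]

print(revision_links)
```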

Posted: Tue May 19, 2009 3:13 pm
by gerry.odea
I'm doing this instead, but it won't bring in the title. Can you tell me why?

<a name = GETRELATED>
<$searchurl = "http://www.domain.com/search?q=xyzzy">
<$imports='
recdelim >><table class\="ts std"
firstmatch
field Title varchar(40) />><a>\P=!</a>+
field Title2 varchar(40) />><b>\P=!</b>+
'>
</a>

Here is the HTML I'm trying to parse:



<table class="ts std" id=brs style="padding:0 0 1em"><caption class="med nobr" style="padding-bottom:6px;text-align:left">Searches related to: <b>dogs</b></caption><tr><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=dog+breeds&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=1"><b>dog breeds</b></a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=pictures+of+dogs&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=2"><b>pictures of</b> dogs</a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=dogs+types&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=3">dogs <b>types</b></a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=information+about+dogs&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=4"><b>information about</b> dogs</a><tr><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=dogs+health&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=5">dogs <b>health</b></a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=dog+games&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=6"><b>dog games</b></a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=dog+names&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=7"><b>dog names</b></a><td style="padding:0 0 7px;padding-right:34px;vertical-align:top"><a href="/search?hl=en&q=adopt+a+dog&revid=416594276&ei=eQQTSumHLeewtgfqvfWSBA&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=8"><b>adopt a dog</b></a></table>

Posted: Tue May 19, 2009 3:21 pm
by mark
field Title varchar(40) />><a>\P=!</a>+

There is no literal "<a>" in the text; every anchor in your sample carries an href attribute. Perhaps you meant:
field Title varchar(40) />><a=[^>]*>\P=!</a>+
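
The same pitfall shows up in ordinary regex engines too. A small Python illustration (not REX syntax, and the sample anchor is shortened from the table above): a pattern that expects a bare <a> finds nothing, while one that allows attributes before the closing > does match.

```python
import re

# Anchor shortened from the sample table; note it is "<a href=...>", never "<a>".
sample = '<a href="/search?q=dog+breeds&oi=revisions_inline"><b>dog breeds</b></a>'

# Expects a bare "<a>" tag, which never occurs in the sample.
literal = re.search(r'<a>(.*?)</a>', sample)

# Allows any attributes between "<a" and ">".
with_attrs = re.search(r'<a[^>]*>(.*?)</a>', sample)

print(literal)              # None
print(with_attrs.group(1))  # <b>dog breeds</b>
```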

Posted: Tue May 19, 2009 3:23 pm
by gerry.odea
Yes, that helped a bit. Now I'm stuck on this not matching:

recdelim >><table class\="ts std"

for

<table class="ts std" id=brs style="padding:0 0 1em">

Posted: Tue May 19, 2009 3:24 pm
by gerry.odea
Do I need to add something between "ts" and "std"?

Posted: Tue May 19, 2009 4:39 pm
by mark
Can there be multiple <table class="ts std" sections in the data? And would you want the first item from each of those? If not then maybe you don't want a recdelim at all.