MetaTrouble

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm


Post by Thunderstone »



Hey,

I ran into a pretty tricky problem while setting up
a metacrawler script. The trouble is that whenever I search
for a term using the metacrawler, e.g. beer, my results look something like:


Lycos:
www.beer.com
www.beer.de
www.beer.com/brew/index.htm
www.beer.org
Yahoo:
www.beer.com
www.beer.net
www.beer.org

and so on, so I get loads of duplicate search results.
I want, nice and neat, ONE result per domain: e.g. I want beer.com once,
rather than beer.com from Lycos, Yahoo, and all the others again.

Is there an easy way to cross-check the domains of the results, so
that each result is displayed only if its domain has not
already appeared in an earlier result?

Appreciate your help,


Robert Zrim



Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm


Post by Thunderstone »




You could keep the domains shown so far in an xtree, and check each
new one as you get 'em. Around the <timport> loop that displays results,
try something like this (note that this is untested example code):

<$n = 0>                                    <!-- reset count per engine -->
<timport ...>                               <!-- loop for each result -->
  <rex '>>http://\P=[^>"\space/]+' $Link>   <!-- get the domain -->
  <lower $ret>
  <$domain = $ret>
  <xtree search $domain domains>            <!-- was it shown already? -->
  <IF "" eq $ret>                           <!-- nope -->
    <xtree insert $domain domains>
    <$n = ($n + 1)>                         <!-- count it -->
    ... display result ...
  </IF>
</timport>

Note that since some results are now dropped, you have to count the
displayed results yourself: use $n instead of $next.
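
If it helps to see the idea outside Vortex, here is a rough Python sketch of
the same approach: remember every domain shown so far in a set, and skip any
result whose domain is already in it. It is not Vortex code, and the names
domain_of, dedupe_by_domain and results are made up for this illustration.
It also assumes bare links like www.beer.com should be treated as http links.

```python
from urllib.parse import urlsplit


def domain_of(link):
    """Return the lowercased host part of a link (hypothetical helper)."""
    if "://" not in link:
        # Assumption: results without a scheme, like www.beer.com, are http links.
        link = "http://" + link
    return urlsplit(link).hostname or ""


def dedupe_by_domain(results_by_engine):
    """Keep only the first result seen for each domain, across all engines."""
    seen = set()      # domains shown so far (plays the role of the xtree)
    deduped = {}
    for engine, links in results_by_engine.items():
        kept = []
        for link in links:
            domain = domain_of(link)
            if domain in seen:
                continue          # already shown by an earlier result
            seen.add(domain)
            kept.append(link)
        deduped[engine] = kept
    return deduped


# Sample data taken from the question above.
results = {
    "Lycos": ["www.beer.com", "www.beer.de",
              "www.beer.com/brew/index.htm", "www.beer.org"],
    "Yahoo": ["www.beer.com", "www.beer.net", "www.beer.org"],
}

for engine, links in dedupe_by_domain(results).items():
    print(engine + ":")
    # Number the results that were actually kept, like $n in the Vortex code.
    for n, link in enumerate(links, start=1):
        print(n, link)
```

With the sample data, Lycos keeps beer.com, beer.de and beer.org, and Yahoo
is left with only beer.net. Note that this compares full host names, so
www.beer.com and beer.com would count as two different domains.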

-Kai

