MetaTrouble

Post by **Thunderstone** » Tue Jun 22, 1999 10:05 am

Hey,

I ran into a pretty tricky problem whilst setting up
a metacrawler script. Whenever I search for a term using
the metacrawler, e.g. beer, my results look something like:

Lycos:
www.beer.com
www.beer.de
www.beer.com/brew/index.htm
www.beer.org
Yahoo:
www.beer.com
www.beer.net
www.beer.org

and so on, so I get loads of duplicate search results.
I want -nice and neat- ONE result per domain, e.g. I want beer.com once as
opposed to beer.com from Lycos, Yahoo, and all the others again.

Is there any easy way to cross-check the domains of the results, so
that every result would only get displayed if the domain has not
yet been used in an earlier result?

Appreciate your help,

Robert Zrim

Post by **Thunderstone** » Tue Jun 22, 1999 11:02 am

You could keep the domains shown so far in an xtree, and check each
new one as you get 'em. Around the <timport> loop that displays results,
try something like this (note that this is untested example code):

<$n = 0>
<timport ...>
<!-- extract the host part of $Link into $ret -->
<rex '>>http://\P=[^>"\space/]+' $Link>
<lower $ret>
<$domain = $ret>
<!-- $ret is empty if $domain is not in the tree yet -->
<xtree search $domain domains>
<IF "" eq $ret>
<xtree insert $domain domains>
<$n = ($n + 1)>
... display result ...
</IF>
</timport>

Note that some results are dropped now, so you have to count the
displayed results yourself: use $n instead of $next.
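
If it helps to see the idea outside Vortex, here is the same logic as a rough Python sketch. It is not Vortex and not part of the script above: the dedupe_by_host name and the (engine, url) result shape are made up for illustration, and a plain set stands in for the xtree. Like the Vortex code, it compares full host names, so www.beer.com and beer.com would count as two different entries.

```python
from urllib.parse import urlsplit


def dedupe_by_host(results):
    """Keep only the first result seen for each host, preserving order.

    `results` is an iterable of (engine, url) pairs; that shape is an
    assumption made for this sketch.
    """
    seen = set()  # stands in for the "domains" xtree
    kept = []
    for engine, url in results:
        # Accept URLs written without a scheme, e.g. "www.beer.com/brew".
        if "://" not in url:
            url = "http://" + url
        host = (urlsplit(url).hostname or "").lower()
        # Skip results with no usable host, and hosts already shown.
        if not host or host in seen:
            continue
        seen.add(host)
        kept.append((engine, url))
    return kept


results = [
    ("Lycos", "http://www.beer.com"),
    ("Lycos", "http://www.beer.de"),
    ("Lycos", "http://www.beer.com/brew/index.htm"),
    ("Lycos", "http://www.beer.org"),
    ("Yahoo", "http://www.beer.com"),
    ("Yahoo", "http://www.beer.net"),
    ("Yahoo", "http://www.beer.org"),
]

unique = dedupe_by_host(results)
n = len(unique)  # count the displayed results yourself, like $n above
for engine, url in unique:
    print(engine, url)
```

With the sample list from the question, this keeps beer.com, beer.de and beer.org from Lycos and only beer.net from Yahoo.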

-Kai