Prevent duplicates but not always

thomas.amsler
Posts: 18
Joined: Mon Nov 19, 2007 6:17 am

Prevent duplicates but not always

Post by thomas.amsler »

Our setup is a bit special, so this is difficult to explain. We have a CMS hosting more than 200 country sites that share some pages and PDFs in different languages. Since it is not feasible to maintain a separate index for each and every site (more and more sites are added), we have one huge index, and the search filters the results depending on the site it was called from. Our problem is that this shared data has to be indexed multiple times, each time with different parameters for the filtering. For HTML pages I simply added a meta tag with a description and set Duplicate Check Fields to Body and Description. This works great, but a PDF has no meta tag to tell the index that it is not a duplicate when it is crawled on a site other than the one where it was first indexed. Is there a way I could achieve the needed behavior?

The parameters that distinguish the country sites are "virtual" folders in the URL; the CMS ignores them unless they are specially configured:
/l2/g0/s1961/
l# stands for the language id (internal id from the CMS)
g# stands for a CUG (0 = not protected)
s# stands for the site id (internal id from the CMS)
If we could identify these parts and not consider a PDF or HTML page a duplicate when they differ from those of an earlier indexed document with the same body, our problem would be solved.
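The virtual-folder layout above can be parsed with a simple pattern. This is an illustrative sketch in Python (the helper name `site_params` and the regex are my own, not part of the CMS or Webinator):

```python
import re

# Matches the /l#/g#/s#/ virtual-folder segment described above:
# l# = language id, g# = CUG (0 = not protected), s# = site id.
VIRTUAL_FOLDERS = re.compile(r"/l(\d+)/g(\d+)/s(\d+)/")

def site_params(url):
    """Return (language_id, cug_id, site_id), or None if the URL
    carries no virtual-folder segment."""
    m = VIRTUAL_FOLDERS.search(url)
    return tuple(int(x) for x in m.groups()) if m else None

# site_params("http://example.com/l2/g0/s1961/doc.pdf") -> (2, 0, 1961)
# site_params("http://example.com/doc.pdf")             -> None
```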
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Prevent duplicates but not always

Post by mark »

Besides the described case, do these sites have a lot of aliases, such that different URLs point to the same content that you don't want stored? Perhaps the simplest solution is to just turn off duplicate prevention.

Otherwise you could modify the dowalk script to look for that URL pattern. If it is present, include the URL in the hash.
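The idea can be sketched as follows, in Python standing in for the Vortex logic in dowalk (the function name `duplicate_hash` and the choice of MD5 are assumptions for illustration, not Webinator's actual implementation). When the virtual-folder segment is present, it is mixed into the hash, so per-site copies of the same body no longer collide; URLs without the segment hash on body alone as before:

```python
import hashlib
import re

# The /l#/g#/s#/ virtual-folder pattern from the original post.
VIRTUAL_FOLDERS = re.compile(r"/l\d+/g\d+/s\d+/")

def duplicate_hash(body, url):
    """Hash the page body for duplicate checking; if the URL carries
    the virtual-folder segment, fold that segment into the hash so
    the same document indexed under different sites is kept."""
    h = hashlib.md5(body.encode())
    m = VIRTUAL_FOLDERS.search(url)
    if m:
        h.update(m.group(0).encode())
    return h.hexdigest()
```

With this, `duplicate_hash(body, ".../l2/g0/s1961/doc.pdf")` and `duplicate_hash(body, ".../l3/g0/s42/doc.pdf")` differ even for identical bodies, while two plain URLs with the same body still hash identically.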
thomas.amsler
Posts: 18
Joined: Mon Nov 19, 2007 6:17 am

Prevent duplicates but not always

Post by thomas.amsler »

Turning off duplicate prevention caused even more problems (see my other post about the max depth problem), and we would again exceed the license limit. Otherwise there should not be many aliases, if any at all.

I was a bit wary of modifying the dowalk script, as I am not used to the Vortex script language, but it seems I was able to change the computehash method, and the calls to it, to achieve exactly this. I'm currently running a test walk to see whether my changes did the trick.