It is a bit difficult to explain, but our setup is somewhat special. We have a CMS holding more than 200 country sites that share some pages and PDFs in different languages. As it is not feasible to maintain a separate index for each site (more and more sites are added), we have one huge index and the search filters the results depending on the site it was called from. Our problem is that this shared content has to be indexed multiple times, with different parameters, for the filtering to work.

For HTML pages I simply added a meta tag with a per-site description and set Duplicate Check Fields to Body and Description. This works great, but for PDFs I cannot add a meta tag, so there is no way to tell the index that a PDF is not a duplicate when it is crawled from a different site than the one it was first indexed on. Is there a way I could achieve the needed behaviour?
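To illustrate what I mean (this is not the product's actual duplicate-check logic, just a minimal sketch of the idea, with hypothetical field names `body` and `description`): the per-site meta description makes otherwise identical HTML pages look distinct, while PDFs with identical text collapse to one document because there is no second field to differentiate them.

```python
import hashlib

def duplicate_key(doc, check_fields=("body", "description")):
    """Build a duplicate-check key from the configured fields (hypothetical names)."""
    parts = [doc.get(field, "") for field in check_fields]
    return hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()

# HTML pages: the per-site meta description makes the keys differ, so both are kept.
html_site_a = {"body": "shared content", "description": "site A"}
html_site_b = {"body": "shared content", "description": "site B"}
print(duplicate_key(html_site_a) != duplicate_key(html_site_b))  # True

# PDFs: no description field, so the same file crawled from two sites collides.
pdf_site_a = {"body": "shared pdf text"}
pdf_site_b = {"body": "shared pdf text"}
print(duplicate_key(pdf_site_a) == duplicate_key(pdf_site_b))    # True -> treated as duplicate
```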
The parameters that distinguish the country sites are "virtual" folders in the URL; the CMS ignores them unless they are specially configured:
/l2/g0/s1961/
l# stands for the language id (internal id from the CMS)
g# stands for a CUG (0 = not protected)
s# stands for the site id (internal id from the CMS)
If the search could identify these parts and not treat a PDF or HTML page as a duplicate when they differ from an earlier indexed document with the same body content, our problem would be solved.
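Extracting those parts from the URL is straightforward; here is a minimal sketch (my own illustration, not something the search product offers out of the box) of pulling the language, CUG and site IDs out of the virtual folder prefix so they could be combined with the body in a duplicate check:

```python
import re

# Matches the "virtual" folder prefix, e.g. /l2/g0/s1961/...
VIRTUAL_PREFIX = re.compile(r"^/l(?P<lang>\d+)/g(?P<cug>\d+)/s(?P<site>\d+)/")

def site_parts(path):
    """Return (language id, CUG, site id) from the virtual folders, or None if absent."""
    m = VIRTUAL_PREFIX.match(path)
    if not m:
        return None
    return m.group("lang"), m.group("cug"), m.group("site")

print(site_parts("/l2/g0/s1961/docs/brochure.pdf"))  # ('2', '0', '1961')
```

If the duplicate check could take the site ID extracted this way into account alongside the body, the same PDF crawled under /s1961/ and under another site ID would no longer be considered a duplicate.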