Not Duplicate but appliance says they are.

Post by **jgdoke** » Fri Jan 21, 2005 10:17 am

The link : http://www.ab.com/networks/ethernet.html
Is a duplicate of: http://www.ab.com/abjournal/
Referenced by : http://www.ab.com/
http://www.ab.com/logix/softlogix/

I just opened them and they are not the same page.

?????

John

Post by **mark** » Fri Jan 21, 2005 10:22 am

Go to list/edit urls and see what text was extracted from each. It's probably the same.

Post by **jgdoke** » Fri Jan 21, 2005 12:21 pm

http://www.ab.com/networks/ethernet.html

Is NOT in the list.

I'm guessing the appliance drops the page because it considers it a duplicate. Looking at the source code of each page, there are no similarities.

Post by **mark** » Fri Jan 21, 2005 1:44 pm

I tried a crawl of just those 2 pages with dups off. Both seem to come up with no body text. Not sure why. It will require more study of the HTML on those pages.

Post by **jgdoke** » Fri Jan 21, 2005 2:05 pm

You are correct. The URL list shows zero bytes of text from the page.

ABjournal is one of our high-traffic areas, so please let me know an answer ASAP.

Thank you
John

Post by **mark** » Fri Jan 21, 2005 2:24 pm

Those pages are returning different content based on user-agent. Adjust your user-agent to something the webserver likes. Maybe something like this will make it behave:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)
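One way to check whether a server really varies its response by user-agent is to request the same URL twice with different `User-Agent` headers and compare the bodies. The following is only a rough sketch using Python's standard library, not anything the appliance itself runs; the helper names (`build_request`, `fetch_body`, `differs_by_user_agent`) and the `ExampleCrawler/1.0` string in the usage note are made up for illustration.

```python
import urllib.request

# User-agent string quoted in the post above.
BROWSER_UA = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
    ".NET CLR 1.0.3705; .NET CLR 1.1.4322)"
)


def build_request(url, user_agent):
    """Return a GET request that presents the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_body(url, user_agent, timeout=10):
    """Fetch the raw response body while presenting the given User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def differs_by_user_agent(url, ua_a, ua_b, fetch=fetch_body):
    """Return True if the two user agents receive different response bytes.

    `fetch` is injectable so the comparison can be exercised offline.
    """
    return fetch(url, ua_a) != fetch(url, ua_b)
```

Calling `differs_by_user_agent("http://www.ab.com/abjournal/", BROWSER_UA, "ExampleCrawler/1.0")` would then show whether the two clients see the same bytes. Pages with timestamps or other dynamic content can differ between requests for unrelated reasons, so a `True` result is a hint rather than proof.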

Post by **mark** » Mon Jan 24, 2005 12:14 pm

Not sure what I was looking at before, but looking at this again, it appears that the problem is not client related. Both of those pages have no text content, only . The appliance can find the links to the desired pages, http://www.ab.com/abjournal/nov2004/index.html and http://www.ab.com/networks/ethernet/index.html, but won't try to follow any links on the duplicate (empty) page.
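To illustrate why two visibly different pages can be flagged as duplicates, here is a minimal sketch that assumes duplicates are detected by comparing a hash of the extracted text. The thread does not say how the appliance actually does it, so the hashing scheme is an assumption, and `PAGE_A` and `PAGE_B` are hypothetical stand-ins for the real markup, which is not shown here.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def text_fingerprint(html):
    """Hash the extracted text (assumed duplicate test, not the appliance's)."""
    return hashlib.sha256(extract_text(html).encode("utf-8")).hexdigest()


# Hypothetical pages: different markup and links, but no text at all.
PAGE_A = """<html><body>
<a href="/abjournal/nov2004/index.html"><img src="journal.gif"></a>
</body></html>"""

PAGE_B = """<html><body>
<a href="/networks/ethernet/index.html"><img src="ethernet.gif"></a>
</body></html>"""

# A page with real text, for contrast.
PAGE_C = "<html><body><p>EtherNet/IP overview</p></body></html>"

same_as_each_other = text_fingerprint(PAGE_A) == text_fingerprint(PAGE_B)
same_as_text_page = text_fingerprint(PAGE_A) == text_fingerprint(PAGE_C)
```

Under that assumption, both empty pages extract to an empty string and get the same fingerprint, so whichever is crawled second looks like a duplicate of the first even though the HTML source has nothing in common.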