Remove duplicates from existing database?

jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

I have several profiles, walked separately, each created with the -unique feature. But, then we combine profiles by copying from one profile to another. This allows us, for example, to put the online catalog contents into 8 different profiles, without walking the catalog 8 times.

The problem is that we end up with some duplicate page entries in the resulting, combined database. The pages are the same page, referenced by different URLs, e.g. http://www.company.com/Consumer/blahblah.html vs. http://www.company.com/Wholesaler/blahblah.html, where both URLs actually point to the same single page.

Is there any way to eliminate them, other than individually preventing the pages from being walked in one profile or the other?

Thanks!
User avatar
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Remove duplicates from existing database?

Post by mark »

Use the Hash field of the html table to identify the dups. Same Hash means same content. Lookup Hash in html before inserting the new record.
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

Hmmm. Syntax currently being used:

<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
</sql>


Logically, the new syntax would be something like "insert into html values (...) where inner_sql.Hash != outer_sql.Hash. How does that work in the real world of Vortex SQL.
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

And, then I'll have to do the Refs for just the rows that were inserted. So, the Refs insert needs to be changed, too. Right?

<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
</sql>

<sql db=$sourcedb row "select * from refs">
<sql db=$destdb novars "insert into refs values($Url,$Ref)"></sql>
</sql>
User avatar
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Remove duplicates from existing database?

Post by mark »

outline:
<sql db=source row "select * from html">
<sql db=dest novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0><!-- not found -->
<sql db=dest "insert into html..."></sql>
<sql db=source row "select * from refs where Url=$Url">
<sql db=dest novars "insert into refs..."></sql>
</sql>
</if>
</sql>
User avatar
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Remove duplicates from existing database?

Post by mark »

The uniq index on html(Hash) can't be done if you're respecting meta robots and haven't disabled SSc_metarobotsplaceholder in the dowalk script. Anything with meta robots noindex,follow will have a Hash of 0, so there could be multiples of 0.

Along those lines, you may also want to insert all Hashes of 0 so that parent surfing works. If you don't care about "show parents" you can skip them.
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

This works:

<A NAME=Wh_To_Cons>
<$sourcedb=$whcatdb>
<$destdb=$consdb>
<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0 or $Hash eq 0><!-- not found -->
<sql db=$destdb "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
<sql db=$sourcedb row "select * from refs where Url=$Url">
<sql db=$destdb novars "insert into refs values($Url,$Ref)"></sql>
</sql>
</if>
</sql>
</A> <!-- end Wh_To_Cons -->

This, however (another section of the same script), does not work:

<A NAME=Wh_To_Ret>
<$sourcedb=$whcatdb>
<$destdb=$retdb>
<sql db=$sourcedb row "select * from html">
<sandr "/Consumer/" "/Retailer/" $Url>
<$TmpUrl = $ret>
<sql db=$destdb novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0 or $Hash eq 0><!-- not found -->
<sql db=$destdb "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$TmpUrl,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
<sql db=$sourcedb row "select * from refs where Url=$Url">
<sandr "/Consumer/" "/Retailer/" $Ref>
<$TmpRef = $ret>
<sql db=$destdb novars "insert into refs values($TmpUrl,$TmpRef)"></sql>
</sql>
</if>
</sql>
</A> <!-- end Wh_To_Ret -->

I know that as soon as I post it, I'll probably see the error, but right now, I don't see it.

In the second example, the pages are not being copied from one DB to the other.
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

I bet I need "where Url=$TmpUrl" (instead of Url=$Url) just before the sandr for the TmpRef.

Trying that...
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Remove duplicates from existing database?

Post by jamon »

No, that doesn't make sense. I need the Refs from the source database, where the Url is still $Url, not $TmpUrl.

Has the value of $Url been lost with the intermediate queries?
User avatar
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Remove duplicates from existing database?

Post by mark »

Use novars on your insert statements so your original fields don't get overwritten by your inserted fields.
Post Reply