Page 1 of 2
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 9:27 am
by jamon
I have several profiles, walked separately, each created with the -unique feature. But, then we combine profiles by copying from one profile to another. This allows us, for example, to put the online catalog contents into 8 different profiles, without walking the catalog 8 times.
The problem is that we end up with some duplicate page entries in the resulting, combined database. The pages are the same page, referenced by different URLs, e.g.
http://www.company.com/Consumer/blahblah.html vs.
http://www.company.com/Wholesaler/blahblah.html, where both URLs actually point to the same single page.
Is there any way to eliminate them, other than individually preventing the pages from being walked in one profile or the other?
Thanks!
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 9:53 am
by mark
Use the Hash field of the html table to identify the dups. Same Hash means same content. Lookup Hash in html before inserting the new record.
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 1:07 pm
by jamon
Hmmm. Syntax currently being used:
<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
</sql>
Logically, the new syntax would be something like "insert into html values (...) where inner_sql.Hash != outer_sql.Hash. How does that work in the real world of Vortex SQL.
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 1:12 pm
by jamon
And, then I'll have to do the Refs for just the rows that were inserted. So, the Refs insert needs to be changed, too. Right?
<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
</sql>
<sql db=$sourcedb row "select * from refs">
<sql db=$destdb novars "insert into refs values($Url,$Ref)"></sql>
</sql>
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 1:59 pm
by mark
outline:
<sql db=source row "select * from html">
<sql db=dest novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0><!-- not found -->
<sql db=dest "insert into html..."></sql>
<sql db=source row "select * from refs where Url=$Url">
<sql db=dest novars "insert into refs..."></sql>
</sql>
</if>
</sql>
Remove duplicates from existing database?
Posted: Wed Apr 21, 2004 3:43 pm
by mark
The uniq index on html(Hash) can't be done if you're respecting meta robots and haven't disabled SSc_metarobotsplaceholder in the dowalk script. Anything with meta robots noindex,follow will have a Hash of 0, so there could be multiples of 0.
Along those lines, you may also want to insert all Hashes of 0 so that parent surfing works. If you don't care about "show parents" you can skip them.
Remove duplicates from existing database?
Posted: Wed Jun 23, 2004 12:28 pm
by jamon
This works:
<A NAME=Wh_To_Cons>
<$sourcedb=$whcatdb>
<$destdb=$consdb>
<sql db=$sourcedb row "select * from html">
<sql db=$destdb novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0 or $Hash eq 0><!-- not found -->
<sql db=$destdb "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$Url,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
<sql db=$sourcedb row "select * from refs where Url=$Url">
<sql db=$destdb novars "insert into refs values($Url,$Ref)"></sql>
</sql>
</if>
</sql>
</A> <!-- end Wh_To_Cons -->
This, however (another section of the same script), does not work:
<A NAME=Wh_To_Ret>
<$sourcedb=$whcatdb>
<$destdb=$retdb>
<sql db=$sourcedb row "select * from html">
<sandr "/Consumer/" "/Retailer/" $Url>
<$TmpUrl = $ret>
<sql db=$destdb novars max=1 "select * from html where Hash=$Hash"></sql>
<if $loop eq 0 or $Hash eq 0><!-- not found -->
<sql db=$destdb "insert into html values(counter,$Hash,$Size,$Visited,$Dlsecs,$Depth,$TmpUrl,$Title,$Body,$Keywords,$Description,$Meta,$Catno)"></sql>
<sql db=$sourcedb row "select * from refs where Url=$Url">
<sandr "/Consumer/" "/Retailer/" $Ref>
<$TmpRef = $ret>
<sql db=$destdb novars "insert into refs values($TmpUrl,$TmpRef)"></sql>
</sql>
</if>
</sql>
</A> <!-- end Wh_To_Ret -->
I know that as soon as I post it, I'll probably see the error, but right now, I don't see it.
In the second example, the pages are not being copied from one DB to the other.
Remove duplicates from existing database?
Posted: Wed Jun 23, 2004 12:30 pm
by jamon
I bet I need "where Url=$TmpUrl" (instead of Url=$Url) just before the sandr for the TmpRef.
Trying that...
Remove duplicates from existing database?
Posted: Wed Jun 23, 2004 12:33 pm
by jamon
No, that doesn't make sense. I need the Refs from the source database, where the Url is still $Url, not $TmpUrl.
Has the value of $Url been lost with the intermediate queries?
Remove duplicates from existing database?
Posted: Wed Jun 23, 2004 1:17 pm
by mark
Use novars on your insert statements so your original fields don't get overwritten by your inserted fields.