Thunderstone Support Forums

Posted: **Wed Aug 27, 2003 2:34 pm**

Using commercial (enterprise) Webinator on RedHat 9, we have a program that drops and recreates all the indexes on the html table. Somewhere along the way duplicate pages appear to have gotten into the table, because we get thousands of the following error (in the vortex log) when we try to rebuild the unique index on Hash:

178 Aug 27 06:47:11 /webinator/dev/idx_rebuild:67: Trying to insert duplicate value (000000000) in index (temp RAM DBF)

and before that we get one error:

100 Aug 27 06:47:11 /webinator/dev/idx_rebuild:67: Creating Unique index on Non-unique data

If I am correct that this is because some duplicate pages got in somehow (maybe the unique index was not on for one of the walks), then how can I fix this? I thought of just "create new_table as select distinct * from html" and then copy the new table over html -- but the duplicate pages are likely to have some other field that is different (Url, etc.) so that won't work. Any suggestions, or could this be some other problem?

Thanks,

Marcos

Posted: **Wed Aug 27, 2003 3:09 pm**

Hash is not supposed to have a unique index. See the dowalk script for what indices are appropriate.

The 000000000 hash is used for pages that have meta robots noindex,follow so that parent/child linkage works. The content of those pages is not stored so they won't show up in a search.

Posted: **Wed Aug 27, 2003 3:51 pm**

OK -- that makes sense. The reason (I think) that we have a unique constraint is that we consolidate multiple walks into this one html table (nothing is ever walked directly into it) and we wanted to prevent duplicates across walks. Maybe we need to rethink that. (I still do not understand why this throws thousands of errors -- I only get 1 row when I select count(*) from html where domain_id = '000000000' -- but maybe that is because of the constraint.)

So let me put my question differently -- if we want to remove duplicates from an existing html table is there an easy way to do that? I suppose I could write a vortex script that takes each row and checks for duplicates before inserting it...

Posted: **Wed Aug 27, 2003 4:03 pm**

An individual html table will not have dups unless duplicate detection has been turned off in the profile.

If you want to merge two tables that might have duplicates you'll need to do it one row at a time in vortex and check for existence before inserting.
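The row-at-a-time merge described above could be sketched roughly as follows. This is not Vortex; it is a minimal Python illustration using an in-memory-style SQLite connection in place of Texis, just to show the check-then-insert logic. The source table name (walk1), the reduced two-column layout (Url, Hash), and the choice to compare placeholder rows on Url rather than Hash are all assumptions made for the example, not part of Webinator.

```python
import sqlite3

# Hash value stored for meta robots noindex,follow pages (from the thread above).
PLACEHOLDER_HASH = "000000000"


def merge_rows(conn, source_table, target_table="html"):
    """Copy rows from source_table into target_table one at a time,
    skipping any row that already exists in the target.

    Assumption: both tables have Url and Hash columns. Table names are
    interpolated directly, so they must come from trusted code, not user input.
    """
    cur = conn.cursor()
    inserted = 0
    rows = conn.execute(f"SELECT Url, Hash FROM {source_table}").fetchall()
    for url, page_hash in rows:
        if page_hash == PLACEHOLDER_HASH:
            # Placeholder rows all share one hash, so only the Url can
            # tell them apart (an assumption of this sketch).
            found = cur.execute(
                f"SELECT 1 FROM {target_table} WHERE Url = ?", (url,)
            ).fetchone()
        else:
            # Same content hash or same Url counts as a duplicate.
            found = cur.execute(
                f"SELECT 1 FROM {target_table} WHERE Hash = ? OR Url = ?",
                (page_hash, url),
            ).fetchone()
        if found is None:
            cur.execute(
                f"INSERT INTO {target_table} (Url, Hash) VALUES (?, ?)",
                (url, page_hash),
            )
            inserted += 1
    conn.commit()
    return inserted
```

Because the existence check runs against the target before every insert, duplicates within the source table are also caught: the first copy is inserted and later copies are skipped.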

Another option, if you don't care about parent/child navigation, is to edit dowalk and change <$SSc_metarobotsplaceholder=Y> to <$SSc_metarobotsplaceholder=N>. Then it won't store the 0 hash records and your unique index method will work.