dowalk hangs

pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

Hi, I'm using a slightly modified version of dowalk. The main modification is that before it starts spidering, it clears out the todo, html, and refs tables (in that order), so that any changes made to the site are captured. We set up a cron job to run it nightly. It now seems that in 3 of our environments, manually running the script just makes it hang. I deleted and recreated the db in one of the environments, then ran the spider manually, and it worked okay. Any ideas?
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

How exactly are you clearing out the tables? If you do a "delete from" with no "where" clause it may take a while on large tables, as every row is removed from the table and indexes one at a time. It would be faster to drop and re-make the tables and indexes.
bart
Posts: 251
Joined: Wed Apr 26, 2000 12:42 am

Post by bart »

It would be a whole lot faster and easier to drop and re-create the tables than it is to delete their records.
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

I'm doing

<sql "delete from todo"></sql>
<sql "delete from html where Url not matches 'http%'"></sql>
<sql "delete from refs"></sql>

because we're only spidering part of the site. We have another system that creates new pages and updates the Thunderstone db as needed. Those pages have a Url beginning with http(s), so we don't want to drop the entire html table.
Are there any referential integrity constraints between html, refs, and todo?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

todo is small and pretty much unused by dowalk.
refs is closely related to html. It is required for the "parents" link on the search interface and for error reporting in the walk to work. If you don't care about those, you can save time and space by not even inserting into refs.

You might speed up the delete from html if you form your pattern so you can use "matches" instead of "not matches".
Url matches 'thesitetodelete.com%'
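
Applied to the three statements posted earlier in the thread, that suggestion might look like the sketch below. It assumes the spidered pages are stored with a Url that starts with the site's host name. thesitetodelete.com is only the placeholder from the line above and should be replaced with whatever prefix the walk actually stores.

```
<!-- Sketch only: 'thesitetodelete.com%' is a placeholder prefix, not a real value -->
<sql "delete from todo"></sql>
<sql "delete from html where Url matches 'thesitetodelete.com%'"></sql>
<sql "delete from refs"></sql>
```

If more than one site is spidered, the positive pattern has to cover each of them, for example with one delete statement per prefix. Rows written by the other system (Url beginning with http) are left alone as long as none of the patterns match them.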