Hi, I'm using a slightly modified version of dowalk. The main mod is that before it starts spidering, I have it clear out the todo, html and refs tables (in that order), so that any changes made to the site are captured. We set up a cron job to run it nightly. It now seems that in 3 of our environments, manually running the script just makes it hang. I deleted and recreated the db in one of the environments, then ran the spider manually, and it worked okay. Any ideas?
dowalk hangs
How exactly are you clearing out the tables? If you do a "delete from" with no "where" clause it may take a while on large tables, as every row is removed from the table and indexes one at a time. It would be faster to drop and re-make the tables and indexes.
It would be a whole lot faster and easier to drop and re-create the tables than it is to delete their records.
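As a rough illustration of the drop-and-re-create approach suggested above, here is a minimal sketch. It uses Python's sqlite3 module purely as a stand-in database, not Texis, and the refs schema and index name shown are hypothetical; in a real dowalk setup you would reuse the table and index definitions from your own script.

```python
import sqlite3

def reset_table(conn, table, create_sql, index_sqls=()):
    """Drop `table` if it exists, then re-create it (and its indexes) empty."""
    # Dropping the table also drops the indexes built on it (in SQLite).
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(create_sql)
    for stmt in index_sqls:
        conn.execute(stmt)
    conn.commit()

# Hypothetical schema, for illustration only; not the real dowalk definition.
REFS_CREATE = "CREATE TABLE refs (Url TEXT, Ref TEXT)"
REFS_INDEXES = ["CREATE INDEX refs_url ON refs (Url)"]

conn = sqlite3.connect(":memory:")
reset_table(conn, "refs", REFS_CREATE, REFS_INDEXES)
conn.executemany("INSERT INTO refs VALUES (?, ?)", [("/a", "/b"), ("/a", "/c")])

# Empty the table in one step instead of deleting row by row.
reset_table(conn, "refs", REFS_CREATE, REFS_INDEXES)
remaining = conn.execute("SELECT COUNT(*) FROM refs").fetchone()[0]
print(remaining)
```

This only fits tables that are emptied completely, such as todo and refs in the script described here; a table that must keep some rows still needs a delete with a "where" clause.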
I'm doing
<sql "delete from todo"></sql>
<sql "delete from html where Url not matches 'http%'"></sql>
<sql "delete from refs"></sql>
because we're only spidering part of the site. We have another system that creates new pages and updates the Thunderstone db as needed. Those pages have a Url beginning with http(s), so we don't want to drop the entire html table.
Are there any referential integrity constraints between html, refs, and todo?
todo is small and pretty much unused by dowalk.
refs is closely related to html. It is required for the "parents" link on the search interface and for error reporting in the walk to work. If you don't care about those, you can save time and space by not even inserting into refs.
You might speed up the delete from html if you form your pattern so you can use "matches" instead of "not matches".
Url matches 'thesitetodelete.com%'
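To show the effect of a positive prefix pattern, here is a small sketch, again using Python's sqlite3 as a stand-in rather than Texis. SQLite's LIKE plays the role of "matches" here, which is an assumption made for illustration; the table contents, the example.com URL and the delete_site helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE html (Url TEXT)")
conn.executemany("INSERT INTO html VALUES (?)", [
    ("thesitetodelete.com/index.html",),
    ("thesitetodelete.com/about.html",),
    ("http://example.com/kept.html",),  # maintained by the other system
])

def delete_site(conn, prefix):
    """Delete only rows whose Url starts with `prefix`; return the row count."""
    cur = conn.execute("DELETE FROM html WHERE Url LIKE ?", (prefix + "%",))
    conn.commit()
    return cur.rowcount

deleted = delete_site(conn, "thesitetodelete.com")
kept = [row[0] for row in conn.execute("SELECT Url FROM html ORDER BY Url")]
print(deleted, kept)
```

Naming the rows to remove, rather than the rows to keep, leaves the externally maintained pages untouched. Whether it is actually faster in Texis depends on the indexes available on Url.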