Webinator just...stops

b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

I'm running Webinator on one machine against another in the same domain. For some reason, it just stops: each refresh shows no new pages indexed. There are no indications of abandons or timeouts; it just serves the same results page and doesn't move.

Task manager shows two texis processes taking large amounts of CPU time, but I can't tell what they are doing. Any idea what could be causing this?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What does the last part of the report before the list of recently fetched pages say? It may be building search indices or such.
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

PID 1672 is the crawler process. If it's one of the ones eating CPU, kill it. That will cut off the walk at that point and make it live.

What's your page timeout set to?
What non-default settings are you using?
What's your Texis version (texis -version) and Webinator scripts version?
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

Killing 1672 gives me about 30 new pages, and then it stops updating again. So I kill the other process and, as you say, it creates the index and goes live.

However, I know the walk is not finished; we should have several thousand more pages.

Page timeout is set to 60.
Non-default settings:

Different exclusions; rewalk schedule set for once a week; strip queries off; ignore case on; stay under on; DNS mode internal (this is an attempt to solve a different problem encountered when walking a site from inside the DMZ).
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

I've tried running the same walk again, and it completes without intervention but with only 746 pages. The stall therefore seems to be due to an occasional difficulty switching over to indexing rather than a serious bug.

Of course, that means I'm back to looking at the 'Why so few pages' question.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Unfortunately, with that many pages it's not always simple to see why you're missing something. Start with the error and dup reports to see if something that's missing is your gateway to the rest of the pages.

Going beyond that will require some knowledge of the site being walked. Review the walked URLs with List/Edit Urls with an eye toward spotting the lack of some extension or area of the site. Or find a page from the site you think should be indexed but isn't. Find the parent of that page and see if that's in the database. Look at the parent's children list to see if the desired child is listed and if there's any error associated with it.

Turning verbosity up to 4 will cause Webinator to log every rejected url as an error so you can track why it's rejecting something you think it shouldn't.
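
If eyeballing the URL list gets tedious, the "missing extension or area" check above can be roughed out offline. This is only a sketch and not anything Webinator provides: it assumes you have somehow obtained two plain lists of URLs, one of pages you expect to be indexed and one of pages the walk actually stored. The function name `summarize_missing` and both inputs are hypothetical.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit


def summarize_missing(expected_urls, walked_urls):
    """Count expected-but-not-walked URLs by file extension and by top-level area.

    Both arguments are plain iterables of URL strings (hypothetical inputs,
    e.g. copied from a site map and from the walked URL list).
    """
    walked = set(walked_urls)
    by_ext = Counter()
    by_area = Counter()
    for url in expected_urls:
        if url in walked:
            continue  # this page made it into the walk
        path = urlsplit(url).path
        # Extension of the last path component, lower-cased; "(none)" if absent.
        ext = splitext(path)[1].lower() or "(none)"
        # First directory under the site root; files directly at the root
        # are grouped as "(root)".
        parts = [p for p in path.split("/") if p]
        area = parts[0] if len(parts) > 1 else "(root)"
        by_ext[ext] += 1
        by_area[area] += 1
    return by_ext, by_area
```

A count that piles up under one extension or one directory is a hint about which exclusion or parent page to look at first.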
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

The problem seems to be depth related. I have looked through the sites, and it seems that nothing beyond a depth of 4 is indexed. The working (external) walk goes to a depth of 12.

In both cases, the Webinator setting for depth is -1. I have just run the walk again with a depth of 20 just in case - same result.

The page I selected as a test (it has a depth of 4) appears in the not-working run as expected; its depth 5 children are shown as children but are not selectable (i.e. not stored in the database).

Manually selecting SSc_maxdepth from the database confirms that it is 20, so the setting is being stored.
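
For comparing the two walks, a quick tally of stored pages per depth makes the cutoff easy to see. This is a rough sketch only: it assumes the URL and depth of each stored page have been copied out by hand into (url, depth) pairs, and the names `depth_histogram` and `deepest_stored` are made up for illustration.

```python
from collections import Counter


def depth_histogram(records):
    """Return {depth: page_count} for (url, depth) pairs, ordered by depth."""
    counts = Counter(depth for _url, depth in records)
    return dict(sorted(counts.items()))


def deepest_stored(records):
    """Return the greatest depth among the stored pages, or None if empty."""
    return max((depth for _url, depth in records), default=None)
```

Running both functions on the internal and external walks should show the internal one ending at depth 4 while the external one continues to 12, if the depth theory is right.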
b.sims
Posts: 99
Joined: Fri Oct 26, 2001 10:40 am

Post by b.sims »

Sorry if I'm looking in the wrong place here?

A dump of all the errors for this database contains no reference to either 'page1' (the page at depth 4 which is indexed) or 'page2' (the page at depth 5 which is a child of the above and is ignored). Note that the two pages are one character different in their filenames, extension included.

If I search for the URL of 'page1' in the list URL page and call up its record, there is nothing in the 'Errors' field.

The Vortex and Monitor logs don't seem to give any clues either.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

From the page1 details, click "children". page2 should be listed. If it's a hyperlink, it's in the database. If it's not a hyperlink it's not in the database and, with verbosity 4 or higher, should have a reason in parens after it unless the whole walk was stopped due to max depth, max bytes, or user. If there's no error listed you might also check INSTALLDIR/texis/vortex.log to see if anything unusual happened during the walk.