Webinator just...stops

Post by **b.sims** » Mon Nov 24, 2003 2:48 pm

I'm running Webinator on one machine against another in the same domain. For some reason, it just stops: each refresh no longer shows any new pages indexed. There are no indications of abandons or timeouts; it just serves the same results page and doesn't move.

Task manager shows two texis processes taking large amounts of CPU time, but I can't tell what they are doing. Any idea what could be causing this?

Post by **mark** » Mon Nov 24, 2003 4:00 pm

What does the last part of the report before the list of recently fetched pages say? It may be building search indices or such.

Post by **b.sims** » Tue Nov 25, 2003 7:09 am

No, it looks like a normal run in progress:

started 1 (1672) on http://ioc.unesco.org/iocweb/index.php
711 pages (104,284,073 bytes) so far.
194 errors so far.
81 duplicate pages so far.

711 http://ioc.unesco.org/igospartners/sgg3os1.htm (109,213 bytes)
710 http://ioc.unesco.org/igospartners/igusthom.htm (16,296 bytes)
709 http://ioc.unesco.org/igospartners/igusspen.htm (20,563 bytes)
708 http://ioc.unesco.org/igospartners/igusland.htm (38,290 bytes)

........

Post by **mark** » Tue Nov 25, 2003 10:53 am

PID 1672 is the crawler process. If it's one of the ones eating CPU, kill it. That will cut off the walk at that point and make it live.

What's your page timeout set to?
What non-default settings are you using?
What's your Texis version (texis -version) and Webinator scripts version?

Post by **b.sims** » Tue Nov 25, 2003 1:22 pm

Killing 1672 gives me about 30 new pages, and then it stops updating again. So I kill the other process and, as you say, it creates the index and goes live.

However, I know the walk is not finished; we should have several thousand more pages.

Page timeout is set to 60.
Non-default settings:

Different exclusions; rewalk schedule set for once a week; strip queries off; ignore case on; stay under on; DNS mode internal (this is in an attempt to solve a different problem encountered when walking a site from inside the DMZ).

Post by **b.sims** » Tue Nov 25, 2003 1:40 pm

I've tried running the same run again, and it completes without intervention but with only 746 pages; this problem therefore seems to be due to an occasional difficulty switching over to indexing rather than a serious bug.

Of course, that means I'm back to looking at the 'Why so few pages' question.

Post by **mark** » Tue Nov 25, 2003 2:25 pm

Unfortunately, with that many pages it's not always simple to see why you're missing something. Start with the error and dup reports to see if something that's missing is your gateway to the rest of the pages.

Going beyond that will require some knowledge of the site being walked. Review the walked URLs with List/Edit Urls with an eye toward spotting the lack of some extension or area of the site. Or find a page from the site you think should be indexed but isn't. Find the parent of that page and see if that's in the database. Look at the parent's children list to see if the desired child is listed and if there's any error associated with it.

Turning verbosity up to 4 will cause Webinator to log every rejected url as an error so you can track why it's rejecting something you think it shouldn't.
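As a rough illustration of the "missing extension or area" review described above, here is a minimal Python sketch. It assumes the walked URLs have been copied out of the URL list into a plain Python list; the function name `tally_urls` and that export step are assumptions for illustration, not part of Webinator. It simply counts URLs by file extension and by top-level directory so that an absent extension or site area stands out.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlparse


def tally_urls(urls):
    """Count walked URLs by file extension and by top-level directory."""
    by_ext = Counter()
    by_area = Counter()
    for url in urls:
        path = urlparse(url).path
        # Extension of the last path component, or a placeholder if there is none.
        ext = splitext(path)[1].lower() or "(none)"
        segments = [s for s in path.split("/") if s]
        # A URL with at most one path segment is counted under the site root.
        area = segments[0] if len(segments) > 1 else "(root)"
        by_ext[ext] += 1
        by_area[area] += 1
    return by_ext, by_area


# Sample input: three URLs taken from the report earlier in this thread.
walked = [
    "http://ioc.unesco.org/iocweb/index.php",
    "http://ioc.unesco.org/igospartners/sgg3os1.htm",
    "http://ioc.unesco.org/igospartners/igusthom.htm",
]
by_ext, by_area = tally_urls(walked)
print(by_ext.most_common())
print(by_area.most_common())
```

Comparing these tallies between a walk that works and one that does not is one way to see which part of the site drops out.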

Post by **b.sims** » Thu Dec 04, 2003 7:01 am

The problem seems to be depth related. I have looked through the sites, and it seems that nothing beyond a depth of 4 is indexed. The working (external) walk goes to a depth of 12.

In both cases, the Webinator setting for depth is -1. I have just run the walk again with a depth of 20 just in case - same result.

The page I selected as a test (it has a depth of 4) appears in the not-working run as expected; its depth 5 children are shown as children but are not selectable (i.e. not stored in the database).

Manually selecting SSc_maxdepth from the database confirms that it is 20, so the setting is being stored.
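For anyone wanting to check a depth cutoff like this outside the admin interface, the following is a minimal Python sketch. It assumes a parent-to-children link map and a set of stored URLs have been extracted by hand; both data structures, the page names, and the functions `link_depths` and `missing_by_depth` are hypothetical. Depth is counted here as the fewest links from the start page, with the start page at 0, which may not match how Webinator numbers depth.

```python
from collections import deque


def link_depths(start, children):
    """Breadth-first walk: fewest links from start to each reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for child in children.get(page, []):
            if child not in depths:
                depths[child] = depths[page] + 1
                queue.append(child)
    return depths


def missing_by_depth(depths, stored):
    """Group pages that are reachable but not stored, keyed by depth."""
    missing = {}
    for page, depth in depths.items():
        if page not in stored:
            missing.setdefault(depth, []).append(page)
    return missing


# Hypothetical link map and stored set, for illustration only.
children = {"a": ["b"], "b": ["c", "d"], "c": ["d"]}
stored = {"a", "b", "c"}

depths = link_depths("a", children)
missing = missing_by_depth(depths, stored)
print(missing)
```

If every missing page sits at or below one depth, that points at a depth-related cutoff rather than an exclusion rule.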

Post by **b.sims** » Thu Dec 04, 2003 10:01 am

Sorry if I'm looking in the wrong place here?

A dump of all the errors for this database contains no reference to either 'page1' (the page at depth 4 which is indexed) or 'page2' (the page at depth 5 which is a child of the above and is ignored). Note that the two pages are one character different in their filenames, extension included.

If I search for the URL of 'page1' in the list URL page and call up its record, there is nothing in the 'Errors' field.

The Vortex and Monitor logs don't seem to give any clues either.

Post by **mark** » Thu Dec 04, 2003 11:17 am

From the page1 details, click "children". page2 should be listed. If it's a hyperlink, it's in the database. If it's not a hyperlink, it's not in the database and, with verbosity 4 or higher, should have a reason in parens after it, unless the whole walk was stopped due to max depth, max bytes, or user. If there's no error listed, you might also check INSTALLDIR/texis/vortex.log to see if anything unusual happened during the walk.