New Walk stuck looping?

jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Post by jamon »

I run a walk; it gets to some number of pages and then seems to get stuck, possibly looping. The number of pages processed stays the same, but the duplicate and error counts keep increasing, seemingly forever.

We are using Commercial Webinator 4.3.7 for Windows, with the PDF plugin.

For whatever reason, the texis version appears to be different: Commercial Webinator Version 4.02.1031937844 of Sep 13, 2002

We are walking http://www.moen.com, which is publicly accessible. I'll give you the walk settings if you want to try it yourself.

Here's an excerpt of what we are seeing in the walk status:

Webinator Walk Report for Builder

Creating database d:\thunderstonesoftware\webinator/texis/Builder/db1...Done.
Walk started at 2004-04-02 11:25:02 (by user)
Verbosity set to 3
JavaScript walking not enabled by current license
HTTPS walking disabled
Start fetching at http://www.moen.com/Builder/BuilderHome.cfm
http://www.moen.com/Builder/BuilderHome.cfm
Ignore urls containing any of the following:
/cgi-bin/
~
ptype=w
ptype=r
ptype=b
ptype=c
/productcatalog/
contest=
&page=
DealerInfoAction
started 1 (5700) on http://www.moen.com/Builder/BuilderHome.cfm
798 pages fetched (17,555,478 bytes) from http://www.moen.com/Builder/BuilderHome.cfm
started 1 (860) on http://showhouse.moen.com/
421 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (4572) on http://www.moen.com/Consumer/legal.cfm
11 pages fetched (11,964,319 bytes) from http://www.moen.com/Consumer/legal.cfm
started 1 (5464) on http://showhouse.moen.com/
0 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (4572) on http://www.moen.com/Consumer/legal.cfm
0 pages fetched (11,964,319 bytes) from http://www.moen.com/Consumer/legal.cfm
started 1 (5464) on http://showroomofdistinction.moen.com/
0 pages fetched (0 bytes) from http://showroomofdistinction.moen.com/
started 1 (5464) on http://showhouse.moen.com/
0 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (4852) on http://www.moen.com/
0 pages fetched (11,953,080 bytes) from http://www.moen.com/
started 1 (4572) on http://showhouse.moen.com/
0 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (5464) on http://www.moen.com/
0 pages fetched (11,953,080 bytes) from http://www.moen.com/
started 1 (4572) on http://showhouse.moen.com/
0 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (5464) on http://www.moen.com/
0 pages fetched (11,953,080 bytes) from http://www.moen.com/
started 1 (4572) on http://showhouse.moen.com/
0 pages fetched (89,066,182 bytes) from http://showhouse.moen.com/
started 1 (5464) on http://www.moen.com/
0 pages fetched (11,953,080 bytes) from http://www.moen.com/
started 1 (5464) on http://showhouse.moen.com/
1230 pages (630,893,686 bytes) so far.
70 errors so far.
2448 duplicate pages so far.

1230 http://www.moen.com/Consumer/Products/K ... FPSink.cfm (14,567 bytes)
1229 http://www.moen.com/Consumer/BuyMoen/bu ... Number.cfm (747 bytes)
1228 http://www.moen.com/Consumer/products/s ... wering.cfm (13,488 bytes)
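
The symptom in the report above (page count flat while duplicates and errors climb) can be checked mechanically by comparing two snapshots of the status text. The following is only a rough sketch in Python, not part of Webinator: the function names are made up for illustration, and it assumes the three "so far" lines keep the exact wording shown in this excerpt.

```python
import re
from typing import Dict

# Patterns for the three summary lines, as worded in the excerpt above.
# Assumption: other Webinator versions may word these lines differently.
_PATTERNS = {
    "pages": re.compile(r"^([\d,]+) pages \([\d,]+ bytes\) so far\.", re.M),
    "errors": re.compile(r"^([\d,]+) errors so far\.", re.M),
    "duplicates": re.compile(r"^([\d,]+) duplicate pages so far\.", re.M),
}


def parse_totals(report: str) -> Dict[str, int]:
    """Extract the running totals found in a walk status report."""
    totals = {}
    for name, pattern in _PATTERNS.items():
        match = pattern.search(report)
        if match:
            # Strip thousands separators before converting.
            totals[name] = int(match.group(1).replace(",", ""))
    return totals


def looks_stalled(earlier: Dict[str, int], later: Dict[str, int]) -> bool:
    """True if pages did not advance while duplicates or errors grew.

    Both arguments must contain all three keys (pages, errors, duplicates).
    """
    return later["pages"] == earlier["pages"] and (
        later["duplicates"] > earlier["duplicates"]
        or later["errors"] > earlier["errors"]
    )
```

Save the status page text twice, a few minutes apart, pass each copy through `parse_totals`, and hand the two results to `looks_stalled`.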

There is nothing in the vortex.log file since the start of the latest rewalk, but there is some older info in there, probably from my killing previous runs:

000 Apr 2 11:01:57 [webinatoradmin=webinatoradmin]:1112: Index d:\thunderstonesoftware\webinator\texis\Builder\db1\xhtmlid reported to exist, but does not. in the function opendbidx
006 Apr 2 11:02:04 [webinatoradmin=webinatoradmin]:1102: (5328) Can't write stdout via web server to 10.4.3.76 (Broken pipe); exiting
000 Apr 2 11:02:08 [webinatoradmin=webinatoradmin]:1112: Index d:\thunderstonesoftware\webinator\texis\Builder\db1\xhtmlid reported to exist, but does not. in the function opendbidx
006 Apr 2 11:02:12 (812) Can't write stdout (Bad file descriptor); exiting

Suggestions?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Try a small mod to the dowalk script. Change
<$checkinhtml=0>
to
<$checkinhtml=1>
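
If hand-editing the script is inconvenient, the same one-line change could be applied with a small helper. This is a minimal sketch in Python, not a Thunderstone tool: the function names are hypothetical, the location of your dowalk script is something you have to supply, and it assumes the assignment appears in the script exactly as `<$checkinhtml=0>`.

```python
from pathlib import Path
from typing import Tuple

# The exact strings from the suggestion above.
OLD = b"<$checkinhtml=0>"
NEW = b"<$checkinhtml=1>"


def enable_checkinhtml(data: bytes) -> Tuple[bytes, int]:
    """Return the patched script contents and the number of replacements."""
    count = data.count(OLD)
    return data.replace(OLD, NEW), count


def patch_file(path: Path) -> int:
    """Patch a script in place, keeping a .bak copy of the original.

    Works on raw bytes so line endings and encoding are left untouched.
    Returns the number of replacements; 0 means the file was not modified.
    """
    original = path.read_bytes()
    patched, count = enable_checkinhtml(original)
    if count:
        backup = path.with_name(path.name + ".bak")
        backup.write_bytes(original)
        path.write_bytes(patched)
    return count
```

Call `patch_file(Path(...))` with the path to your own copy of the dowalk script. A return value of 0 means the setting was not found in the expected form and nothing was written.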
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Post by jamon »

That worked. Thanks!
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Post by jamon »

Is there someplace that I should have been able to find that information? I don't see anything about that flag in any of the other messages on the support board.
The change that we recently made to our walk settings was to allow the walk to follow links to other servers in the domain. Does that flag cause dowalk to check for a duplicate URL before checking whether the resulting page is a duplicate page? Any idea how much time, percentage-wise, I'll be adding to my walks by using that flag? Alternatively, I could probably specify the URLs explicitly and then say 'Y' to "Stay Under?"

What's the best way to go, from a performance point of view?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

No, it's not documented anywhere. The need can arise in some cases when multiple sites are being walked. It will probably become automatic for multiple sites in a future script release.

The performance/time impact of turning on checkinhtml should be minimal.
jamon
Posts: 163
Joined: Wed Jun 26, 2002 9:35 am

Post by jamon »

Thank you.