Replication Queue stuck

Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

We have two servers, set up as follows:

Settings:
-----------------------------------------------------
Server #1 - Profiles: A, B, C
A: Search Profile - Receiver Profile updated based on B&C
B: Small subset of our website - daily indexed, (~7,000 pages)
C: All site - weekly indexed (~270,000 pages)

Server #2 - A
A: Receiver Profile, updated based on #1: B, C

There is a crossover cable between #1 and #2 for replication (192.x.x.x, cluster members);
otherwise the servers are accessed normally.
Both servers are up to date as of today.
-----------------------------------------------------

With the above settings, using just profiles A and B works quite well, and replication is not a problem. The issues arise with Profile C and the huge number of pages it indexes.

Baseline:
Running Profile C takes about 16 hours without replication on server #1

Problem with replication to #1A, #2A:
It took over a week to index the site via Profile C. I put it out of its misery on the 11th day. For the last couple of days it was pretty much stuck at 170,000 pages in the index, indexing perhaps 1 or 2 pages per hour. The replication queue was also not changing.

To solve the indexing issue, I modified the settings, deselecting the "remove common" feature (following the topic "walk doesn't stop", http://thunderstone.master.com/texis/ma ... 45422b5510).

With this feature disabled, the walk finishes indexing all 270k pages in about 70 hours. (I figure the increase over the baseline is because part of the replication occurs while indexing is going on.) However, the replication queue is stuck with a bit over 300k items remaining (so about 150k per server). I left it at that as I was otherwise occupied for several weeks, and when I came back to it, the replication queue had not changed.
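
To put those timings side by side, here is a rough back-of-the-envelope calculation in Python. It uses only the figures quoted in this post (270k pages, 16 hours for the baseline, about 70 hours with replication); the function name is made up for illustration, and the rates are whole-run averages, not measurements.

```python
def pages_per_hour(pages: int, hours: float) -> float:
    """Average indexing rate over a whole run."""
    return pages / hours

baseline = pages_per_hour(270_000, 16)   # Profile C without replication
with_repl = pages_per_hour(270_000, 70)  # Profile C with replication enabled

print(f"baseline:         {baseline:,.0f} pages/hour")
print(f"with replication: {with_repl:,.0f} pages/hour")
print(f"slowdown factor:  {baseline / with_repl:.1f}x")
```

Against either average, the 1 or 2 pages per hour seen during the stalled run is lower by several orders of magnitude, so that run was effectively stopped rather than just slow.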

I have tried this twice and got stuck at pretty much the same point both times. I have concluded that there is an issue with replication. Any suggestions?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Does the replication process show up on Maintenance->tech support info? (look for Command containing "/replsend.txt" and "profile=your_profile_name")
Check the vortex.log for any replication related errors.
If you click on the "replication queue" link does it say that the replication process is running?
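
The first check boils down to looking for a process whose command line mentions both /replsend.txt and the profile name. A minimal Python sketch of that filter, assuming you have the process list as plain text lines. The function name and the sample lines (including the /path/to/... paths and the profile names) are made up for illustration and are not appliance output.

```python
import re

def find_replsend(ps_lines, profile):
    """Return lines whose command mentions /replsend.txt and profile=<profile>.

    The profile name may or may not be quoted; the lookahead keeps
    profile=news from also matching profile=newsletter.
    """
    pattern = re.compile(r'profile="?' + re.escape(profile) + r'"?(?!\w)')
    return [
        line for line in ps_lines
        if "/replsend.txt" in line and pattern.search(line)
    ]

# Hypothetical process lines, for illustration only.
sample = [
    "101 1 texis /path/to/texis profile=news /path/to/dowalk/replsend.txt",
    "102 1 texis /path/to/texis profile=newsletter /path/to/dowalk/replsend.txt",
    "103 1 texis /path/to/texis profile=news /path/to/dowalk/otherscript.txt",
]
for line in find_replsend(sample, "news"):
    print(line)
```

Only the first sample line is printed: the second belongs to a different profile and the third is not a replsend process.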
Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

Yup, here's what I'm getting:
PID PPID USER %CPU %MEM VSZ-MB RSS-MB STAT STIME TIME COMMAND
1369 1 texis 62.6 0.6 18 6 S Jul06 11-08:23:35 /usr/local/morph3/bin/texis profile="C"/usr/local/morph3/texis/scripts/dowalk/replsend.txt

I've replaced the profile name with "C".

It takes an incredibly long time to load the walk status page to get to the replication queue, but the last time I checked, this morning, it indicated that it was still running. As you can see from the TIME column above, I've left it going for quite a while.
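
For readers unfamiliar with that TIME value: assuming the column follows the usual ps convention of cumulative CPU time written as [DD-]hh:mm:ss, it can be converted to a duration with a few lines of Python. The function name is made up, and the sketch handles only that one format.

```python
from datetime import timedelta

def parse_ps_time(value: str) -> timedelta:
    """Parse a ps-style cumulative time in [DD-]hh:mm:ss form."""
    days = 0
    if "-" in value:
        # An optional day count precedes the clock part, separated by "-".
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

cpu = parse_ps_time("11-08:23:35")
print(cpu)                                    # 11 days, 8:23:35
print(f"{cpu.total_seconds() / 3600:.1f} h")  # 272.4 h
```

Under that assumption, the replsend process has accumulated more than eleven days of CPU time.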
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Does the vortex.log indicate any problems? See Maintenance->Manage logs
Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

Nothing that stands out, i.e., nothing mentioning /replsend or PID 1369. There is one recurring message:

006 2007-07-22 05:38:09 /dowalk:15135: (14063) Stdout error to 127.0.0.1: Connection reset/closed?; exiting

with different IP addresses (i.e., to server #2 as well). And of course there is the occasional walk-related "document not found" error.
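
Scanning a large vortex.log by eye for /replsend or a PID is error-prone, so here is a minimal Python sketch of the same search. The line layout (three-digit code, date, time, script:number:, message) is inferred only from the entry quoted above and is an assumption, as are the function and field names.

```python
import re

# Layout assumed from the quoted entry: "006 2007-07-22 05:38:09 /dowalk:15135: ..."
LOG_LINE = re.compile(
    r"^(?P<code>\d{3}) (?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<script>\S+?):(?P<num>\d+): (?P<message>.*)$"
)

def matching_entries(lines, needles):
    """Yield parsed entries whose script, number or message contains a needle.

    Plain substring matching: a short number such as a PID can also
    match inside a longer number, so review the hits by hand.
    """
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        entry = m.groupdict()
        haystack = f"{entry['script']}:{entry['num']}: {entry['message']}"
        if any(needle in haystack for needle in needles):
            yield entry

sample = [
    "006 2007-07-22 05:38:09 /dowalk:15135: (14063) Stdout error to 127.0.0.1: "
    "Connection reset/closed?; exiting",
    "not a log line",
]
print(list(matching_entries(sample, ["replsend", "1369"])))  # []
print(len(list(matching_entries(sample, ["Stdout error"]))))  # 1
```

To search a real log, pass an open file object as `lines`; the date and time are deliberately left out of the searched text so that a numeric needle cannot match a timestamp.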
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

Which version of the Search Appliance software are you running? On the replication status page which URL is at the top of the list? Is there anything unique about that page, for example lots of links?
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Any vortex.log errors on the receiving machine?
Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

Version: Search Appliance Server Version 5.01.1170347731 20070201 (i686-unknown-linux2.4.9-64-32)
Scripts Version: 6.2.11
Details: dowalk: 6.2.11/2.499 dowalk: 6.2.11/2.414 dowalk: /1.6 appliance: 6.2.11/1.210 search: 6.2.11/2.362 DB: /1.6

The URLs keep changing, but an example is:
17 d, 16 hr+ ago I 1369 localhost "C" URL1
17 d, 16 hr+ ago E 0 localhost "C" URL2
Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

Good point on the receiving machine, but nothing that I noticed. The only recurring entries (approximately every day) are the following two:

000 2007-07-24 10:15:56 /usr/local/morph3/texis/scripts/dowalk:2389: SQLPrepare() failed with -1 in the function prepntexis
100 2007-07-24 10:15:56 /usr/local/morph3/texis/scripts/dowalk:2389: Index /usr/local/morph3/texis/"C".466460684/db1/xh_TiDsKyMtBy_ViMoDpPp appears to be being updated
Jit
Posts: 15
Joined: Mon Jun 11, 2007 8:22 am

Post by Jit »

Perhaps the title shouldn't be "Replication queue stuck" but "Slow replication", since it is replicating, only slowly (i.e., I don't expect something to be in the replication queue for 17 days!).