Replication Queue stuck

Post by **Jit** » Tue Jul 24, 2007 10:56 am

We have two servers and have set them up as follows:

Settings:
-----------------------------------------------------
Server #1 - Profiles: A, B, C
A: Search Profile - Receiver Profile updated based on B&C
B: Small subset of our website - daily indexed, (~7,000 pages)
C: All site - weekly indexed (~270,000 pages)

Server #2 - A
A: Receiver Profile, updated based on #1: B, C

There is a cross-over cable between #1 and #2 for replication (192.x.x.x, cluster members),
but the servers are otherwise accessed normally.
Both servers are up to date as of today.
-----------------------------------------------------

With the above setting, using just profiles A,B works quite well and replication is not a problem. The issues arise with Profile C and the huge number of pages that it indexes.

Baseline:
Running Profile C takes about 16 hours without replication on server #1

Problem with replication to #1A, #2A:
It took over a week to index the site via Profile C. I put it out of its misery on the 11th day. In the last couple of days it was pretty much stuck on 170,000 pages in the index and was perhaps indexing 1 or 2 pages per hour. The replication queue was also not changing.

To solve the indexing issue, I modified the settings, deselecting the "remove common feature" (via topic - "walk doesn't stop" http://thunderstone.master.com/texis/ma ... 45422b5510).

With this feature disabled, the index (of 270k pages) finishes in about 70 hours. I figure the increase over the baseline is due to part of the replication occurring while indexing was going on. However, the replication queue is stuck with a bit over 300k items remaining (so about 150k per server). I left it at that as I was otherwise occupied for several weeks, and when I came back to it, the replication queue had not changed.

I have tried this twice and am stuck on pretty much the same area. I have concluded that there is an issue with replication. Any suggestions?
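
For a sense of scale, the figures in this post can be turned into rough rates. This is a minimal Python sketch that uses only the numbers quoted above (270k pages, 16 hours baseline, about 70 hours with replication); the variable and function names are made up for illustration.

```python
# Figures quoted in the post above (all approximate).
pages = 270_000                 # Profile C, whole site
baseline_hours = 16             # walk without replication
with_replication_hours = 70     # walk with replication enabled


def pages_per_hour(page_count: int, hours: float) -> float:
    """Average indexing rate over the whole walk."""
    return page_count / hours


baseline_rate = pages_per_hour(pages, baseline_hours)
replicated_rate = pages_per_hour(pages, with_replication_hours)
slowdown = with_replication_hours / baseline_hours

print(f"baseline:         {baseline_rate:,.0f} pages/hour")
print(f"with replication: {replicated_rate:,.0f} pages/hour")
print(f"slowdown factor:  {slowdown:.2f}x")
```

So the walk itself slows by a factor of a bit over four, which is separate from the queue that never drains.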

Post by **mark** » Tue Jul 24, 2007 12:24 pm

Does the replication process show up on Maintenance->tech support info? (look for Command containing "/replsend.txt" and "profile=your_profile_name")
Check the vortex.log for any replication related errors.
If you click on the "replication queue" link does it say that the replication process is running?
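
If it is easier to check from a shell than from the admin pages, the same test can be applied to a saved process listing. This is only a sketch: it assumes you already have `ps` output as text, the function name `replication_processes` is hypothetical, and the sample lines are made up for illustration. It simply keeps lines whose command contains both "/replsend.txt" and "profile=" followed by the profile name.

```python
import re


def replication_processes(ps_output: str, profile: str) -> list[str]:
    """Return ps lines that look like a replication sender for `profile`."""
    # Accept profile=NAME or profile="NAME"; reject longer names such as NAME2.
    wanted = re.compile(r'profile="?' + re.escape(profile) + r'"?(?!\w)')
    return [
        line
        for line in ps_output.splitlines()
        if "/replsend.txt" in line and wanted.search(line)
    ]


# Made-up sample listing, for illustration only.
sample = """\
 1369 texis /usr/local/morph3/bin/texis profile=C /usr/local/morph3/texis/scripts/dowalk/replsend.txt
 2211 texis /usr/local/morph3/bin/texis profile=B /usr/local/morph3/texis/scripts/dowalk
"""

for line in replication_processes(sample, "C"):
    print(line.strip())
```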

Post by **Jit** » Tue Jul 24, 2007 1:45 pm

Yup,
Here's what I'm getting:
PID PPID USER %CPU %MEM VSZ-MB RSS-MB STAT STIME TIME COMMAND
1369 1 texis 62.6 0.6 18 6 S Jul06 11-08:23:35 /usr/local/morph3/bin/texis profile="C"/usr/local/morph3/texis/scripts/dowalk/replsend.txt

I've replaced the profile name with "C"

It takes an incredibly long time to load the walk status page to get to the replication queue, but the last time I checked this morning it indicated that it was still running. As you can see from the "TIME" above, I've left it going for quite a while...
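
The TIME column in the listing above is accumulated CPU time in the usual `ps` layout `[[DD-]hh:]mm:ss`. A short Python sketch for reading it, assuming that layout; the 18-day elapsed figure is an assumption taken from the STIME of Jul06 and the date of this post, since the exact start time of day is not shown.

```python
import re
from datetime import timedelta

# ps cumulative CPU time: [[DD-]hh:]mm:ss
_PS_TIME = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def parse_ps_time(value: str) -> timedelta:
    """Convert a ps TIME field such as '11-08:23:35' to a timedelta."""
    match = _PS_TIME.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized ps time: {value!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


cpu = parse_ps_time("11-08:23:35")     # value from the listing above
elapsed = timedelta(days=18)           # assumption: Jul 06 to Jul 24
cpu_fraction = cpu / elapsed           # share of wall-clock time spent on CPU

print(f"CPU time: {cpu}, fraction of elapsed: {cpu_fraction:.2f}")
```

The fraction comes out at roughly 0.63, in line with the 62.6 in the %CPU column, so the process appears to be busy rather than sleeping.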

Post by **mark** » Tue Jul 24, 2007 2:01 pm

Does the vortex.log indicate any problems? See Maintenance->Manage logs

Post by **Jit** » Tue Jul 24, 2007 2:07 pm

Nothing that stands out, i.e., nothing in terms of /replsend or the PID of 1369. There is a recurring:

006 2007-07-22 05:38:09 /dowalk

(14063) Stdout error to 127.0.0.1: Connection reset/closed?; exiting

with different IP addresses (i.e., to server #2 as well). And of course the occasional "document not found" walk-related error.
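
To see how often each message recurs, the log can be tallied with a few lines of Python. This sketch assumes the layout as pasted above (a header line with a three-digit code, a timestamp and the script name, followed by the message on a later line), which may not match the real file exactly. The function name `count_messages` is hypothetical, and the second sample entry is made up for illustration.

```python
import re
from collections import Counter

# Header as pasted in this thread: "006 2007-07-22 05:38:09 /dowalk"
HEADER = re.compile(r"^(\d{3}) (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+)\s*$")


def count_messages(text: str) -> Counter:
    """Count (script, message) pairs, masking PIDs and IP addresses."""
    counts: Counter = Counter()
    script = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER.match(line)
        if header:
            script = header.group(3)
            continue
        if script is not None:
            # Mask the varying parts so repeats of one error group together.
            message = re.sub(r"\d+(?:\.\d+){3}", "<ip>", line)
            message = re.sub(r"\(\d+\)", "(<pid>)", message)
            counts[(script, message)] += 1
            script = None  # only the first message line per entry is counted
    return counts


# First entry is from the post above; the second is a made-up repeat.
sample = """\
006 2007-07-22 05:38:09 /dowalk

(14063) Stdout error to 127.0.0.1: Connection reset/closed?; exiting
006 2007-07-22 06:02:41 /dowalk

(14101) Stdout error to 192.0.2.2: Connection reset/closed?; exiting
"""

for (script, message), n in count_messages(sample).most_common():
    print(n, script, message)
```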

Post by **John** » Tue Jul 24, 2007 2:11 pm

Which version of the Search Appliance software are you running? On the replication status page which URL is at the top of the list? Is there anything unique about that page, for example lots of links?

Post by **mark** » Tue Jul 24, 2007 2:15 pm

Any vortex.log errors on the receiving machine?

Post by **Jit** » Tue Jul 24, 2007 2:19 pm

Version: Search Appliance Server Version 5.01.1170347731 20070201 (i686-unknown-linux2.4.9-64-32)
Scripts Version: 6.2.11
Details: dowalk: 6.2.11/2.499 dowalk: 6.2.11/2.414 dowalk: /1.6 appliance: 6.2.11/1.210 search: 6.2.11/2.362 DB: /1.6

The URLs are changing, but an example is:
17 d, 16 hr+ ago I 1369 localhost "C" URL1
17 d, 16 hr+ ago E 0 localhost "C" URL2

Post by **Jit** » Tue Jul 24, 2007 2:21 pm

Good point on the receiving machine, but nothing that I noticed. The only thing recurring (approximately every day) is the following two:

000 2007-07-24 10:15:56 /usr/local/morph3/texis/scripts/dowalk

SQLPrepare() failed with -1 in the function prepntexis
100 2007-07-24 10:15:56 /usr/local/morph3/texis/scripts/dowalk

Index /usr/local/morph3/texis/"C".466460684/db1/xh_TiDsKyMtBy_ViMoDpPp appears to be being updated

Post by **Jit** » Tue Jul 24, 2007 2:54 pm

Perhaps the title shouldn't be "Replication queue stuck" but "Slow replication", as it is replicating, just slowly (i.e., I don't expect something to be in the replication queue for 17 days!).