replication - not all pages are copied across

rodger.spring
Posts: 11
Joined: Wed Mar 06, 2002 9:16 am

Post by rodger.spring »

I have a profile that has 4254 pages.
The replication process does not get all the pages.

On the first attempt the second Search Appliance received 4253 pages (page id 150013 was missing).

I cleared the entries (Profile Tools -> Edit urls -> delete on http://*.*), then reran the process. The main Search Appliance found all 4254 pages, but the second Search Appliance received only 4251 (missing pages 158084, 198829, 198933). This time both Search Appliances had 150013 (the page that was missing on the first run).

All Walk Settings
Page _URL -> two text files that list the urls
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Are you doing new or refresh crawls? Refresh may not touch every page.

Check the replication queue to ensure that it's empty before comparing.

Look at the missing pages with list/edit urls to see if there's anything out of the ordinary about them.

Look at the texis vortex.log and error.log (under maintenance -> manage logs) on both systems to see if there are any replication-related messages.
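
For the comparison step, here is a minimal sketch in Python. It assumes the URL list from each appliance has been saved to a plain-text file with one URL per line; the function names and the file names in the usage note are hypothetical and not part of the Search Appliance itself.

```python
def read_url_list(path):
    """Read a text file that lists one URL per line."""
    with open(path, encoding="utf-8") as fh:
        return fh.readlines()


def missing_urls(source_urls, replica_urls):
    """Return URLs present on the source but absent from the replica, sorted."""
    # Strip whitespace and ignore blank lines so trailing newlines do not
    # produce false differences.
    source = {u.strip() for u in source_urls if u.strip()}
    replica = {u.strip() for u in replica_urls if u.strip()}
    return sorted(source - replica)
```

Calling missing_urls(read_url_list("sa1_urls.txt"), read_url_list("sa2_urls.txt")) would then list whatever the second appliance did not receive. Run it only once the replication queue is empty, otherwise pages still in transit will show up as missing.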
rodger.spring
Posts: 11
Joined: Wed Mar 06, 2002 9:16 am

Post by rodger.spring »

I am using a refresh.
I delete all pages before starting the spider.

Replication queue is empty.

On the third try both Search Appliances had 4254 pages.

On the fourth try the second Search Appliance had 4253 pages (missing 180509). Looking at the pages themselves, I do not see anything unusual.

I do see anything suspicious in the texis -> error.log, or vortex.log on either of the search appliances.

When I deleted the page for 180509 from the first Search Appliance and ran the Basic Walk, both Search Appliances were updated (both then had a count of 4254).
rodger.spring
Posts: 11
Joined: Wed Mar 06, 2002 9:16 am

Post by rodger.spring »

Correction: the sentence about the logs in my previous post should read "I do not see anything suspicious in the texis logs".
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

If a page is not due for refresh it will be neither refetched nor replicated.

You should start with an empty replication destination, do a complete crawl in new mode on the source, then set the mode for subsequent crawls to refresh. That will keep the databases in sync.
rodger.spring
Posts: 11
Joined: Wed Mar 06, 2002 9:16 am

Post by rodger.spring »

Am I missing something?

The databases on SA #1 and SA #2 are empty.
The spider on SA #1 uses the "Page URL" files to guide its crawl. SA #1 retrieves each page and adds it to its database, but in rare cases the retrieved page is not replicated to SA #2.

Since the databases on SA #1 and SA #2 are initially empty there should be no issue of "due for refresh".
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I don't know. I never thought of trying refresh on an empty database. Why do that instead of new if you want to start clean?
rodger.spring
Posts: 11
Joined: Wed Mar 06, 2002 9:16 am

Post by rodger.spring »

Background story:
a) client requirement - once a month the content of the profile is to be completely refreshed
b) a profile can have a "Rewalk Type" of "New" or "Refresh" and the best working choice is "Refresh"
c) so once a month to emulate "New" the content for the profile is deleted
d) I cannot figure out a way to programmatically alter the rewalk type, but I believe I can programmatically purge the content of the profile, so I can put the maintenance process into a program
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Set "Maximum Refresh Time" to 30 days.
Do one complete walk in new mode.
Set mode to refresh and "Rewalk Schedule" to daily.
If you don't want daily refreshes set "Minimum Refresh Time" to 30 days as well.

No programming required.
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

In general, though, you should be able to update an individual setting by submitting the form with cmd=Update and just the new and old values for that setting. The old value isn't so important, so long as it's different from the new one.

SSc_rewalktype=refresh&oldSSc_rewalktype=new&cmd=Update
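
A rough sketch of building and submitting that form body from a script, in Python. Only the three fields shown above come from this thread. The helper names are made up for illustration, and form_url is a placeholder: the actual admin form address, any authentication, and any additional fields the form may require (such as which profile to change) are not given here and would have to be filled in for your appliance.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_body(setting, new_value, old_value):
    """Build the form body for changing one setting via cmd=Update."""
    # The old value only has to differ from the new one.
    if new_value == old_value:
        raise ValueError("old value must differ from the new value")
    return urlencode({
        setting: new_value,
        "old" + setting: old_value,
        "cmd": "Update",
    })


def post_update(form_url, body):
    """POST the body to the settings form (form_url is an assumption)."""
    request = Request(form_url, data=body.encode("ascii"), method="POST")
    with urlopen(request) as response:
        return response.status
```

build_update_body("SSc_rewalktype", "refresh", "new") produces the same string as the line above. Keeping the body construction separate from the POST makes it possible to check the exact parameters before anything is sent to the appliance.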