Crawling Speed

rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Hi,
I have Thunderstone running and crawling my consumer site. Initially I configured the Thunderstone appliance for a daily schedule, but it would run for only 3-4 hours and then stop. So I moved to a 1-hour schedule and finally to 15 minutes. Even so, it seems that within that period it is not crawling all the time, and that it stops for a while before starting the next walk (I'm not sure whether that is true).

Also, I am trying to speed up the indexing process. I found some parameters such as Threads and Max Requests. It would be great if someone could tell me their optimal values, or suggest other ways to make this faster.

-rm
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Is the walk not completing? It will stop when it's done, then check if it needs to do something new on the schedule period. Check the walk status page.

If you have just one site, set servers to 1. Otherwise set it to the number of sites, but no more than about 5.

What kind of rates are you seeing? Does the appliance have fast connectivity to the server being walked? Does the server being walked return pages quickly? What non-default settings are you using?
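The rule of thumb for the servers setting given in this reply (1 for a single site, otherwise the number of sites, capped at about 5) can be written down as a small sketch. The function name `suggested_servers` and the `cap` parameter are made up for illustration; this is not an appliance API, only the arithmetic of the advice.

```python
def suggested_servers(site_count: int, cap: int = 5) -> int:
    """Hypothetical helper: one server slot per site, never more than `cap`."""
    if site_count < 1:
        raise ValueError("site_count must be at least 1")
    return min(site_count, cap)


for sites in (1, 3, 8):
    print(sites, "site(s) ->", suggested_servers(sites), "server(s)")
```

The cap of 5 is the "no more than about 5" from the reply, so treat it as a starting point rather than a hard limit.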
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Also, what are the predominant file types? PDF and DOC files take longer to process than HTML and text.
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Here is the scenario:
There could be around 600K pages generated on my site. Just for testing, I initially configured the following:

Max Page Size = 50000 (to see the system behavior)
Parallel Threads=5 Server=2
Rewalk Type= refresh
Schedule = 15 Min
Also, when we got the appliance it already had around 2,500 pages indexed as a sample.

Initially I saw that after starting the process it indexed about 440 pages in 43 minutes and then stopped, and on the walk status page all the buttons (Stop, Pause, etc.) disappeared.

Then, after waiting almost 20 minutes (assuming it would restart after 15 minutes), I finally started it again by clicking Start on the Admin page.

All the pages on my site are HTML/JSP. I will try with server=2 also. But what does the Threads parameter do?

Thanks for the responses.

-rm
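For scale, the figures reported in this post (about 440 pages in 43 minutes, around 600K pages on the site) imply a rough duration for a full walk. The sketch below is plain arithmetic and assumes the observed rate stays constant and the walk runs without pauses, which may not hold in practice.

```python
pages_indexed = 440      # reported in the post
elapsed_minutes = 43     # reported in the post
site_pages = 600_000     # "around 600K pages", reported in the post

pages_per_minute = pages_indexed / elapsed_minutes
# Assumption: constant rate, walk never pauses.
days_for_full_walk = site_pages / pages_per_minute / (60 * 24)

print(f"{pages_per_minute:.1f} pages/min")            # 10.2 pages/min
print(f"{days_for_full_walk:.1f} days for full walk")  # 40.7 days for full walk
```

At roughly 10 pages per minute, schedule frequency matters much less than the per-page fetch rate.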
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Threads is how many pages to open simultaneously on the same server. More than 2 or 3 is probably not helpful. More than 1 or 2 can overload some servers.

Refresh only walks pages that are due. Review the entire walk status to see what's due and how many are done and why the walk is stopping.
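To make the Threads description concrete, here is a minimal sketch of what "pages open simultaneously on the same server" means, using a standard Python thread pool. Nothing here is the appliance's implementation: `fake_fetch` sleeps instead of making a request, and the URLs are invented. The pool size plays the role of the Threads setting.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

THREADS = 2  # pages fetched at once from one server (value suggested in the reply)

active = 0   # requests currently in flight
peak = 0     # highest number of simultaneous requests seen
lock = threading.Lock()


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; sleeps instead of touching the network."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend the server needs 10 ms per page
    with lock:
        active -= 1
    return url


urls = [f"http://example.com/page{i}.jsp" for i in range(10)]  # made-up URLs

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    fetched = list(pool.map(fake_fetch, urls))

print(f"fetched {len(fetched)} pages, peak concurrent requests: {peak}")
```

The peak never exceeds `THREADS`, so raising the setting increases the load on the walked server in direct proportion, which is why a slow server can be overloaded by more than 1 or 2.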
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Thanks, Mark. OK, so I will reduce the current threads from 5 to 2. So does that mean that when we want to index all the pages of the site for the first time (e.g. 500K here), we should go with "New" instead of "Refresh"?

-rm
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

A new crawl starts clean and walks everything. A refresh uses the existing data and refetches stale pages, gets new pages, and deletes missing pages. Either should result in a complete dataset. If you make non-trivial changes to the include/exclude rules you may want to do a new walk though.

If a new walk stops because of hitting resource limits you can switch to refresh mode to have it complete.
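The difference between the two walk types described in this reply can be illustrated with a toy model. This is only a sketch of the stated behavior (new starts clean; refresh refetches stale pages, adds new ones, and deletes missing ones), not how the appliance is implemented. The dict-based `site` and `index`, the integer clock, and `max_age` are all assumptions made for the example.

```python
Index = dict[str, dict]


def new_walk(site: dict[str, str], now: int) -> Index:
    """Start clean: fetch every page currently on the site."""
    return {url: {"body": body, "fetched_at": now} for url, body in site.items()}


def refresh_walk(index: Index, site: dict[str, str], now: int, max_age: int) -> Index:
    """Reuse existing data: refetch stale pages, add new ones, drop missing ones."""
    refreshed: Index = {}
    for url, body in site.items():
        entry = index.get(url)
        if entry is None or now - entry["fetched_at"] >= max_age:
            refreshed[url] = {"body": body, "fetched_at": now}  # new or stale
        else:
            refreshed[url] = entry  # still fresh, keep the stored copy
    return refreshed  # URLs no longer on the site are not carried over


site_then = {"/a": "A1", "/b": "B1", "/gone": "G1"}
index = new_walk(site_then, now=0)

site_now = {"/a": "A2", "/b": "B1", "/new": "N1"}
index = refresh_walk(index, site_now, now=10, max_age=5)

print(sorted(index))  # ['/a', '/b', '/new']
print(sorted(index) == sorted(new_walk(site_now, now=10)))  # True
```

In this model both paths end with the same set of pages, which matches the point that either walk type should result in a complete dataset.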
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Thanks Mark.
I am attaching the latest data that I collected over the last 3-4 hours. I hope this makes clear how it is behaving. The data is from the walk status page:
-------------------------------------------------------------------------------------
Time     Todo    Scheduled (next hour)  Visited (last hour)  Total    Comments
11:20    3,564   10,314                 1,623                42,962
11:42    3,564   10,126                 1,456                42,962
11:49    3,564   10,186                 1,265                42,962   Status page is refreshing but the total is still 42,962.
12:02    3,564   10,271                 1,228                42,962   Stopped at this time.
12:07    3,572   10,546                 1,017                43,270   Walk started automatically.
12:23    4,270   10,792                 1,020                43,475
12:30    4,270   10,619                 1,053                43,475
12:38    4,270   10,365                 1,299                43,474   Suddenly one page fewer than the previous status.
12:47    4,270   10,384                 1,672                43,475
12:49    4,270   10,358                 1,698                43,475
12:56    4,270   10,358                 1,698                43,475   Data is not changing.
1:01 PM  4,270   10,358                 1,698                43,475   Data is not changing; almost stopped.
1:15     4,269   10,355                 1,081                43,485   Started automatically.
1:44     4,274   10,047                 851                  43,825
(All counts are pages.)
-------------------------------------------------------------------------------------
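The walk status figures in this post can be summarized with a few lines of arithmetic. The sketch below copies the time, todo, and total values from the post and assumes the entries from 1:01 onward are PM, as the 1:01 row states. It only measures the change between the first and last samples.

```python
def hm(hour: int, minute: int) -> int:
    """Convert a 24-hour clock time to minutes since midnight."""
    return hour * 60 + minute


# (time, pages in todo, pages total), copied from the post
samples = [
    (hm(11, 20), 3564, 42962), (hm(11, 42), 3564, 42962),
    (hm(11, 49), 3564, 42962), (hm(12, 2), 3564, 42962),
    (hm(12, 7), 3572, 43270), (hm(12, 23), 4270, 43475),
    (hm(12, 30), 4270, 43475), (hm(12, 38), 4270, 43474),
    (hm(12, 47), 4270, 43475), (hm(12, 49), 4270, 43475),
    (hm(12, 56), 4270, 43475), (hm(13, 1), 4270, 43475),
    (hm(13, 15), 4269, 43485), (hm(13, 44), 4274, 43825),
]

first, last = samples[0], samples[-1]
elapsed = last[0] - first[0]      # minutes between first and last sample
added = last[2] - first[2]        # growth in total pages
todo_change = last[1] - first[1]  # growth in the todo queue
rate = added / elapsed

print(f"{elapsed} min elapsed, {added} pages added, {rate:.1f} pages/min")
print(f"todo queue changed by {todo_change:+d} pages")
```

Over the 144 minutes sampled, the total grew by 863 pages (about 6 per minute) while the todo queue grew by 710 rather than shrinking, which is consistent with the walk stopping before it finishes the pages that are due.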
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Ok. It's refreshing on schedule. But the refresh is stopping before everything's refreshed. Scroll down in the walk status to see why it's stopping.
rmehrotra
Posts: 17
Joined: Thu Jul 28, 2005 3:12 pm

Post by rmehrotra »

Thanks for the prompt responses.

Is there any way to log these activities, or to see a log of what happened in the last day or so?

-rm