pause walk

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

This is the last message.

First of all, reindex didn't work. You said index could have been modified or deleted. I created a brand new profile, paused it one time and continued it one time. What did I do wrong?

Why would you have an extra domains text box? So that you don't have to list individual sites in the base URL. How do you include all the sites in a domain in one of the categories? I replaced www with '*'. This has to work. How do I walk domains?

'Stay Under=N' was specifically for forrester.com/find/ in the base URL. It didn't stay under, but it didn't bring in any pages from 'find' either.

check out
http://www.google.com/search?sourceid=n ... acul%2Ecom

I don't get any pages from acul.com since none of the pages have any extensions. You say that URLs with no extensions are always walked. But I have two examples that contradict that.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Don't know what happened to the index. See if vortex.log says anything about it or processes running out of memory or otherwise getting killed.

Extra domains is mainly for when you want to walk an enterprise that has various hosts that reference each other, some of which you may not know about, or when there are too many to list. I wasn't necessarily suggesting it's the wrong thing to use.

I have over 600 URLs matching http://www.forrester.com/find* . Note the lack of a trailing slash, which they do not use in their URLs.
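
To illustrate why the trailing slash matters, here is a minimal sketch that uses Python's `fnmatch` module as a stand-in for the walker's pattern matching. That is an assumption: the walker's own matching rules may differ, and the sample URLs below are hypothetical, not taken from the actual site.

```python
from fnmatch import fnmatchcase

# Hypothetical sample URLs; the real site's paths may differ.
urls = [
    "http://www.forrester.com/findresearch",
    "http://www.forrester.com/find?query=x",
    "http://www.forrester.com/find/results",
]

def matching(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return [u for u in candidates if fnmatchcase(u, pattern)]

# Pattern without a trailing slash: anything that starts with ".../find".
without_slash = matching("http://www.forrester.com/find*", urls)

# Pattern with a trailing slash: only URLs that have "/find/" as a directory.
with_slash = matching("http://www.forrester.com/find/*", urls)

print(len(without_slash), len(with_slash))
```

If the site never puts a slash after "find", the second pattern has nothing to match, which would be consistent with getting no pages from 'find'.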

Something odd about acul.com. The walk only works on that site if netmode sys is on. We're checking into that.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

When you say "reindex didn't work" are you getting the same error while searching?
What did reindex say when you ran it? What does the walk status say about it?
Your disk isn't full is it?
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

reindex was successful. Same error ('xyz' would require linear search). Disk has 100 GB free. It only worked after I checked 'search linear'. Why is that?

And you are right. I think I had a trailing slash after find. I think I had find/*.

Here is the thing about NCUA: none of the index.htm files are in the database. I thought that when you specify index.htm, it just considers www.ncua.gov and www.ncua.gov/index.htm the same ONLY FOR DUPLICATE PAGE PURPOSES. What I discovered is that none of the links on the left navigation bar were in the database. (When you click them, it takes you to an index.htm page, and if that page is not in the database, none of the URLs on that page would be either, right?)

That's why I have only 4000 NCUA pages while NCUA.GOV has about 8000 pages of valuable (almost unique) information.
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

Yes, of course!

So, should I just take out index.htm and index.html? I really don't want to do that.

Did you notice the problem? Mark ran the walk, so you may have the URL list for *ncua.gov*. Check whether you have any index.htm for ncua.gov.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

It didn't start working automatically. I had to select 'linear search' in search settings. I didn't want to. Strange...

What could have made the index non-searchable on a fresh profile? If someone searches after I pause the walk, and I then continue the walk, would it mess up the index? Because that is the ONLY possibility in my case.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

All I can think of is what John was suggesting above. If no error was indicated in the index creation phase, all I can guess is that you've customized the dowalk and/or search script such that the fields being searched are not the same ones being indexed (or the order was changed). The "create metamorph inverted index" fields in dowalk, Title\Description\Keywords\Meta\Body by default, must agree exactly with the fields in the SQL "where" clause in search, Title\Description\Keywords\Meta\Body.
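
As a rough way to check for that mismatch, here is a minimal Python sketch that compares two backslash-joined field lists. It does not read the actual dowalk or search scripts; the helper names are hypothetical, and the field strings are assumed to have been copied out of each script by hand.

```python
def parse_fields(virtual_field):
    """Split a backslash-joined field list into an ordered list of names."""
    return [name.strip() for name in virtual_field.split("\\")]

def fields_agree(index_fields, search_fields):
    """True only if both lists have the same names in the same order."""
    return parse_fields(index_fields) == parse_fields(search_fields)

# Field list as used when the index is created (the default named above).
indexed = r"Title\Description\Keywords\Meta\Body"

# Field lists as they might appear in the search script (assumed examples).
searched_ok = r"Title\Description\Keywords\Meta\Body"
searched_bad = r"Title\Description\Meta\Body"  # Keywords removed in one script only

print(fields_agree(indexed, searched_ok))
print(fields_agree(indexed, searched_bad))
```

The comparison is order-sensitive on purpose, since a changed field order counts as a mismatch just like a missing field.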
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

You are absolutely right. I planned to take Keywords out of both dowalk and search. I ran into problems in the 5.0.13 dowalk after I downloaded the script and removed 'Keywords' from two places in dowalk (remember the domain issue?). At the same time I changed the SQL query in the search script. But after updating to 5.0.14, I forgot to put Keywords back in the search script.

Thanks for all the help. I truly appreciate it.