pause walk
Posted: Thu Sep 16, 2004 12:37 pm
by mark
"I just let it run continuously." isn't entirely accurate. I did pause it once or twice early on then let it run after that.
pause walk
Posted: Thu Sep 16, 2004 1:21 pm
by KMandalia
This is the last message.
First of all, reindex didn't work. You said the index could have been modified or deleted. I created a brand new profile, paused it one time and continued it one time. What did I do wrong?
Why would you have an extra domains text box? So that you don't have to list individual sites in base url. How do you include all the sites in a domain in one of the categories? I replaced www with '*'. This has to work. How do I walk domains?
'Stay Under=N' was specifically for forrester.com/find/ in base url. It didn't stay under, but it didn't bring in any pages from 'find' either.
check out
http://www.google.com/search?sourceid=n ... acul%2Ecom
I don't get any pages from acul.com since none of the pages have any extensions. You say that urls with no extension are always walked, but I have two examples that contradict that.
pause walk
Posted: Thu Sep 16, 2004 2:55 pm
by mark
Don't know what happened to the index. See if vortex.log says anything about it, or about processes running out of memory or otherwise getting killed.
Extra domains is mainly for when you want to walk an enterprise with various hosts that reference each other, some of which you may not know about, or when there are simply a lot of them. I wasn't suggesting it's necessarily the wrong thing to use.
I have over 600 urls matching
http://www.forrester.com/find* . Note the lack of a trailing slash, which they do not use in their urls.
Something odd about acul.com. The walk only works on that site if netmode sys is on. We're checking into that.
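To illustrate why the trailing slash matters, here is a minimal sketch using Python's shell-style globbing as a stand-in. It is not Webinator's own matching code, and the urls below are made up for illustration rather than taken from the actual walk.

```python
from fnmatch import fnmatchcase

# Hypothetical urls for illustration only; not from a real walk of the site.
urls = [
    "http://www.forrester.com/find",
    "http://www.forrester.com/find?No=25",
    "http://www.forrester.com/find/",
    "http://www.forrester.com/findings/report",
]

def matching(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return [u for u in candidates if fnmatchcase(u, pattern)]

# Pattern without a trailing slash: anything that starts with ".../find".
no_slash = matching("http://www.forrester.com/find*", urls)

# Pattern with a trailing slash: only urls that literally contain ".../find/".
with_slash = matching("http://www.forrester.com/find/*", urls)

print(len(no_slash), "match find*")
print(len(with_slash), "match find/*")
```

If the site's urls never contain "find/", a find/* pattern has almost nothing to match, which fits with no pages coming back from 'find'.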
pause walk
Posted: Thu Sep 16, 2004 2:57 pm
by mark
When you say "reindex didn't work", are you getting the same error while searching?
What did reindex say when you ran it? What does the walk status say about it?
Your disk isn't full, is it?
pause walk
Posted: Thu Sep 16, 2004 3:42 pm
by KMandalia
reindex was successful. Same error ('xyz' would require linear search). Disk has 100 GB free. It only worked after I checked 'search linear'. Why is that?
And you are right. I think I had a trailing slash after find. I think I had find/*.
Here is the thing about NCUA. None of the index.htm files are in the database. I thought that when you specify index.htm it just considers
www.ncua.gov and
www.ncua.gov/index.htm the same ONLY FOR DUPLICATE PAGE PURPOSES. What I discovered is that none of the links on the left navigation bar were in the database (when you click them it takes you to an index.htm page, and if that page is not in the database, none of the urls on that page would be either, right?)
That's why I have only 4000 NCUA pages while NCUA.GOV has about 8000 pages of valuable (almost unique) information.
pause walk
Posted: Thu Sep 16, 2004 4:38 pm
by John
Are you sure you are searching the same fields as you are indexing?
If you have index.htm in the "Index Name" setting it will fetch the URL without the index.htm to provide the shorter URL, and avoid continually fetching the page and determining it is a duplicate based on content.
pause walk
Posted: Thu Sep 16, 2004 5:43 pm
by KMandalia
Yes, of course!
So, should I just take out index.htm and index.html? I really don't want to do that.
Did you notice the problem? Mark ran the walk, so you may have the url list for *ncua.gov*. Check whether you have any index.htm entries for ncua.gov.
pause walk
Posted: Thu Sep 16, 2004 6:20 pm
by mark
I see your search is working now.
There are no index.htm's because
http://www.ncua.gov/ALManagementInvest/ is the same page as
http://www.ncua.gov/ALManagementInvest/index.htm, and Webinator is stripping the index.htm to prevent redundant entries.
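Roughly, the stripping behaves like the sketch below. This is only an illustration in Python, not Webinator's actual code; the function name strip_index_name is made up, and the list of index names is assumed to be the index.htm and index.html mentioned earlier in the thread.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed "Index Name" values, taken from the names mentioned in this thread.
INDEX_NAMES = ("index.htm", "index.html")

def strip_index_name(url, index_names=INDEX_NAMES):
    """Return url with a trailing index file name removed from its path."""
    parts = urlsplit(url)
    head, _, last = parts.path.rpartition("/")
    if last in index_names:
        # Keep the directory part and its trailing slash, drop the file name.
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)

long_form = "http://www.ncua.gov/ALManagementInvest/index.htm"
short_form = "http://www.ncua.gov/ALManagementInvest/"

# Both forms collapse to the same stored url.
print(strip_index_name(long_form))
print(strip_index_name(short_form))
```

Under that assumption the page is still fetched and its links are still followed; it is just stored once, under the shorter url.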
pause walk
Posted: Fri Sep 17, 2004 4:04 pm
by KMandalia
It didn't start working automatically. I had to select 'linear search' in search settings. I didn't want to. Strange...
What could have made the index non-searchable on a fresh profile? If someone searches after I pause the walk and I then continue the walk, would it mess up the index? Because that is the ONLY possibility in my case.
pause walk
Posted: Tue Sep 21, 2004 11:01 am
by mark
All I can think of is what John was suggesting above. If there was no error indicated in the index creation phase, all I can guess is that you've customized the dowalk and/or search script such that the fields being searched are not the same ones being indexed (or the order was changed). The "create metamorph inverted index" fields in dowalk, Title\Description\Keywords\Meta\Body by default, must agree exactly with the fields in the sql "where" clause in search (where Title\Description\Keywords\Meta\Body).
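The check I mean can be sketched like this. It is a Python illustration only, not part of dowalk or the search script; the helper names are made up, and it assumes you copy the two field lists out of your scripts by hand.

```python
def split_fields(expr):
    """Split a backslash-separated field list into an ordered list of names."""
    return [name.strip() for name in expr.split("\\")]

def fields_agree(index_expr, search_expr):
    """True only if both lists name the same fields in the same order."""
    return split_fields(index_expr) == split_fields(search_expr)

# Default field list quoted in this thread.
DEFAULT = r"Title\Description\Keywords\Meta\Body"

# Hypothetical customized search lists, for illustration.
reordered = r"Title\Keywords\Description\Meta\Body"
shortened = r"Title\Body"

print(fields_agree(DEFAULT, DEFAULT))
print(fields_agree(DEFAULT, reordered))
print(fields_agree(DEFAULT, shortened))
```

If the two lists differ in either membership or order, the index can't be used for that search, which would be consistent with the "would require linear search" message.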