search not indexing numbered directories

webmaster2697
Posts: 12
Joined: Thu Oct 05, 2006 4:15 pm

Post by webmaster2697 »

Hello,

I am using Webinator 5.0.2-Unix-w/plugin.

I have scoured the manual.

My search is not indexing some numbered directories containing various legacy mainframe logs, which are on a web server. I want all content indexed; all files are plain ASCII, nothing binary. The mainframe developers need some love.

The full path I am trying to index is:
http://servername/CENTRALLOGS/00000136/LOGS/JF9999.TXT

The indexing stops at 00000136; the search result shows me a text file listing of the directory's contents. I need clickable URLs and I need JF9999.TXT to be indexed.

My complete settings follow.

Thanks in advance for any help.

Best regards,
Bert


All Walk Settings
Current Profile: legacylogs Webinator 5.0.2-Unix-w/plugin

Database ? /usr/local/morph3/texis/legacylogs/db2
Walk Summary ? Last complete walk: 2006-10-05 16:03:47 (took 2 seconds)
Success. 85 pages (324,055 bytes)

Base URL ? http://prdcvs02/CENTRALLOGS/
Enterprise ? Yes
Domain


Robots ? robots.txt: Y
Meta: Y

Extensions ? .html .htm .txt .pdf .doc .xls .swf .TXT
Exclusions ? /cgi-bin/
~
?

Crawl Delay ? 0
Parallelism ? Threads: 5 Servers: 2

Verbosity ? 2
Rewalk Type ? New
Rewalk Schedule ? Frequency Daily 2AM

Watch URL ? (none)
Notify ? (none)
Categories ? Category (none)

URL Pattern (none)

URL File ? (none)

URL URL ? (none)

Single Page ? (none)
Page File ? (none)

Page URL ? (none)

Strip Queries ? N
Ignore Case ? Y
Extra Domains ? (none)
Extra Networks ? (none)
Extra URLs REX ? (none)
Exclusion REX ? (none)
Exclusion Prefix ? (none)

Required REX ? (none)
Required Prefix ? (none)
Max Page Size ? -1
Max Pages ? -1
Max Bytes ? -1
Max Depth ? -1
Page Timeout ? 60
Meta Tags ? (none)
Standard Meta ? Y
All Meta ? N
Keep HTML ? ALT Text Y
<STRIKE> Y
<DEL> Y
<FORM> Y

Remove Common ? N
Ignore Tags ? Begin (none)
End (none)

Keep Tags ? Begin (none)
End (none)

Plugin Split ? Depth 0
Bytes 0
AtPage (not checked)
Pages 0


Word Definition ? [\alnum\x80-\xff]{1,70}
[\alnum\x80-\xff.]{1,70}>>[.&']=[\alnum\x80-\xff.]{1,70}

Login Info ? (none)

Password (none)

Proxy ?
Proxy Login Info ? Name (none)

Password (none)

Cookie Source Path ? (none)
Temporary Dir ? (none)
Off-site Pages ? N
Stay Under ? Y
Prevent Duplicates ? Y
All Extensions ? Y
Store Refs ? Y
Inline Iframes ? Y
Max Frames ? 20
Execute JavaScript ? N Note: This feature not enabled by current license
Fetch JavaScript ? N
Debug JavaScript ? N
Protocols ? HTTP FTP
Embedded Security ? Any
Entropy Source ? Standard
Max Redirects ? 12
Index Name ? index.html index.htm
DNS Mode ? Internal
Net Mode ? Internal
User Agent ? Mozilla/4.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)
Mime Types ? */*

Default Refresh Time ? 1 hour
Minimum Refresh Time ? 1 minute
Maximum Refresh Time ? 90 days
Maximum Process Size ? Small
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If you set Verbosity to 4, it should tell you why it skips any URL it does not index. Also, if you go into List/Edit URLs, you should be able to follow the path via the Children links and see whether it is not seeing the links or not following them.
John Turnbull
Thunderstone Software
webmaster2697
Posts: 12
Joined: Thu Oct 05, 2006 4:15 pm

Post by webmaster2697 »

Hello,

Thank you for responding, much appreciated.

I raised the verbosity to 4, but all I could see that was new after doing a walk was:
4097 errors

My database is at:
/usr/local/morph3/texis/legacylogs/db1

I did a tail -n4500 /usr/local/morph3/texis/vortex.log and got a large number of identical lines:
178 2006-10-06 13:05:00 /usr/local/morph3/texis/scripts/webinator/dowalk:2153: Trying to insert duplicate value (http://servername.mydomain.com/CENTRALLOGS/) in index /usr/local/morph3/texis/legacylogs/db2/xerrorurl.btr
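
When the tail of the log is thousands of near-identical lines, it can help to collapse them into counts per distinct message. The following is only a sketch: the line layout (numeric code, date, time, script:line, message) is inferred from the sample line above and is an assumption, and the names `LINE_RE` and `summarize` are made up for this illustration.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample line in this post:
#   <code> <date> <time> <script>:<line>: <message>
LINE_RE = re.compile(
    r"^(?P<code>\d+) (?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<script>\S+):(?P<lineno>\d+): (?P<message>.*)$"
)


def summarize(lines):
    """Count how often each (code, message) pair occurs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the assumed layout are skipped
            counts[(match.group("code"), match.group("message"))] += 1
    return counts.most_common()


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py /usr/local/morph3/texis/vortex.log
    with open(sys.argv[1], errors="replace") as handle:
        for (code, message), count in summarize(handle):
            print(count, code, message)
```

Grouping on the message text alone (ignoring the timestamp) is what makes the repeated "duplicate value" lines collapse into a single row.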

By following the path via Children links, I think you mean going to the "List/Edit URLs" and searching for a wildcard "*". This gives me a low number of pages (basically the number of directories I have). All I get is "Depth: 1 click away from Home". Perhaps a specific config for the depth is the problem.

If I click on the link:
http://servername/CENTRALLOGS/00000136

there is a children link. When I click on that, I get a long list that begins with:

http://servername.mydomain.com/CENTRALLOGS/ (Offsite )
http://servername.mydomain.com/CENTRALL ... 136.DIRECT (Offsite )
http://servername.mydomain.com/CENTRALL ... 136.PARAMC (Offsite )
http://servername.mydomain.com/CENTRALL ... TADSF.EXEC (Offsite )
http://servername.mydomain.com/CENTRALL ... GSADAC.SQL (Offsite )
http://servername.mydomain.com/CENTRALL ... GSADLC.SQL (Offsite )
http://servername.mydomain.com/CENTRALL ... GSADRE.SQL (Offsite )
http://servername.mydomain.com/CENTRALL ... ADAC.XEDIT (Offsite )
http://servername.mydomain.com/CENTRALL ... ADLC.XEDIT (Offsite )
http://servername.mydomain.com/CENTRALL ... ADR1.XEDIT (Offsite )
http://servername.mydomain.com/CENTRALL ... ADR2.XEDIT (Offsite )

So I corrected the server name from "servername" to "servername.mydomain.com", updated, ran the indexing.
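
To illustrate why the short host name could make every child link look "Offsite", here is a small sketch that compares the host part of a base URL with the host part of a link. It assumes the crawler compares host names literally, which is a guess on my part and not something the Webinator documentation quoted in this thread confirms; `is_offsite` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def is_offsite(base_url, link_url):
    """Return True when the link's host differs from the base URL's host.

    Assumption: a plain, case-insensitive string comparison of host names,
    with no DNS lookup to discover that two names reach the same server.
    """
    base_host = (urlsplit(base_url).hostname or "").lower()
    link_host = (urlsplit(link_url).hostname or "").lower()
    return base_host != link_host


if __name__ == "__main__":
    link = "http://servername.mydomain.com/CENTRALLOGS/"
    # Short host name in the base URL versus fully qualified links:
    print(is_offsite("http://servername/CENTRALLOGS/", link))
    # Base URL corrected to the fully qualified name:
    print(is_offsite("http://servername.mydomain.com/CENTRALLOGS/", link))
```

Under that assumption, "servername" and "servername.mydomain.com" are different sites even though they resolve to the same machine, so the base URL has to use whichever form the server's own links use.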

This time I got only:
1 errors

which seems to be:
100 2006-10-06 13:21:01 /usr/local/morph3/texis/scripts/webinator/dowalk:1687: User PUBLIC has been added without a password.

I try my search for "http://servername/CENTRALLOGS/00000136/LOGS/JF9999.TXT" again, no go.

I redid the wildcard search, click on Children and get the same list as before, this time without offsite.

What should I try next?

Best regards,
Bert Szoghy
webmaster2697
Posts: 12
Joined: Thu Oct 05, 2006 4:15 pm

Post by webmaster2697 »

Hello,

I made progress.

I removed "?" from the exclusions and now have hundreds of files indexed. Great!

I do a test search for "JF9999.TXT", I get the index of the file's directory:

Index of /CENTRALLOGS/00000136/LOGS
Parent Directory 00000136.08301506.TXT 00000136.RAPPSOM.TXT JF9999.TXT Apache/1.3.29 Server at servername.mydomain.com Port 80...
http://servername.mydomain.com/...S/00000136/LOGS

What I wanted here was the file linked directly.

Next, when I search a string which is inside the file "JF9999.TXT", I get no results. I expected to get the file linked directly also.

Thanks in advance,
Bert
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

John Turnbull
Thunderstone Software
webmaster2697
Posts: 12
Joined: Thu Oct 05, 2006 4:15 pm

Post by webmaster2697 »

Hello,

Thank you very much for your response.

In plaintext (no URL links) it displays the following:

Pages linked by http://servername.mydomain.com/CENTRALL ... 00136/LOGS
Select a link to see information about that page.
(links that are not selectable are not in the database)

http://servername.mydomain.com/CENTRALLOGS/00000136/
http://servername.mydomain.com/CENTRALL ... 301506.TXT
http://servername.mydomain.com/CENTRALL ... APPSOM.TXT
http://servername.mydomain.com/CENTRALL ... JF9999.TXT

So the file is visible as a link, but its contents are not indexed, which is what we need.

Thank you in advance for any suggestions.

Best regards,
Bert
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

That was with Verbosity 4? It should indicate why the link was not followed in that case, unless the crawl was stopped early. Does the Walk Status page show pages still in todo? Does it indicate why the walk finished, such as reaching max pages?
John Turnbull
Thunderstone Software
webmaster2697
Posts: 12
Joined: Thu Oct 05, 2006 4:15 pm

Post by webmaster2697 »

Hello,

The problem was fixed. There were some very large text files found (35 Mb) and the database was changed from "New" to "Update".

This allowed the engine to get through after a few iterations (we raised the frequency as well).
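
For anyone hitting the same thing, a quick way to spot oversized files before a walk is to scan the document root on the web server itself. This is a sketch that assumes shell access to that server; the function name `find_large_files` and the example path in the comment are hypothetical, and the 30 MB threshold is just an arbitrary value below the 35 Mb files mentioned above.

```python
import os


def find_large_files(root, min_bytes):
    """Return (size, path) pairs for files of at least min_bytes, largest first."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if size >= min_bytes:
                hits.append((size, path))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python find_large.py /path/to/CENTRALLOGS
    for size, path in find_large_files(sys.argv[1], 30 * 1024 * 1024):
        print(size, path)
```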

Thank you for the help,
Bert