not all pages are indexed

cindy_walker
Posts: 36
Joined: Tue Jul 24, 2001 2:16 pm

Post by cindy_walker »

We've noticed that Webinator is skipping a certain directory (and its subfolders) when it indexes. It's not clear to me why this is happening. The only odd thing about the page is that the DOCTYPE statement on the main page for the site is below the meta tags, not at the top where it should be. Would that make a difference?

You mentioned in a previous thread that the ref table holds information about what pages Webinator touched as it indexed. Could you tell me what command to execute to get that information?

There's nothing in the URL of the skipped pages that would cause them to be overlooked that I can see. Any suggestions you have for what else to look for as a cause of the problem would be appreciated.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Location of the doctype won't make a difference.
Things that might exclude it:
Extensions, Exclusions, Exclusion REX, Exclusion Prefix,
Max depth, Page timeout, meta robots setting on the page itself
The server's robots.txt
Nothing links to it.
Only linked by javascript and you don't have javascript turned on.
Password protected and password not provided in walk settings.
Check the walk status for any error about that page.
Go to list/edit urls to find a page that links to the missing page. Click its children link to see if the missing page is listed and what, if anything, it says about it.

texis -d YOURDB -s "select * from refs where Ref='http://YOUR_MISSING_PAGE'"
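
For the robots.txt item in the list above, a rough way to test a rule set outside of Webinator is Python's standard urllib.robotparser. This is only a sketch: the sample rules, the example.com URLs, and the "*" user agent are made up for illustration, and it does not reproduce how Webinator itself evaluates robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if the robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules and URLs, not taken from any real server.
sample = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/private/page.html",
                "http://example.com/public/page.html"):
        print(url, "allowed" if is_allowed(sample, url) else "blocked")
```

Paste the server's real robots.txt into the sample string and use the missing page's URL to see whether a Disallow rule covers it.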
cindy_walker
Posts: 36
Joined: Tue Jul 24, 2001 2:16 pm

Post by cindy_walker »

I entered the query you listed and found two pages that link to the un-indexed page. I see the un-indexed page listed on the children pages, but it's unlinked. The page states that unlinked pages aren't in the database, but there's no indication of the reason. I can see other pages that were ignored, but it's clear why: they're on servers Webinator hasn't been instructed to index. The problem page is on a server where Webinator hasn't had trouble indexing other content.

We're using the default exclusions. The extension, .html, is on the extensions list. We haven't entered anything in Exclusion REX or Exclusion Prefix. The page is two clicks from our main Intranet page. The walk status page remains silent about the problem page. We don't have a robots.txt file.

That leaves page timeout and meta robots on the page itself. There isn't a noindex,nofollow meta tag, just the ordinary meta tags.
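
A quick way to double-check a page for robots meta directives, independent of Webinator, is to parse the HTML and collect whatever a meta tag named "robots" declares. The sketch below uses Python's standard html.parser; the sample page is invented, and it only reports what the markup says, not how Webinator would act on it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in the page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Hypothetical page with the meta tag placed above the DOCTYPE.
    page = ('<meta name="robots" content="noindex, nofollow">'
            '<!DOCTYPE html><html><head><title>t</title></head></html>')
    print(sorted(robots_directives(page)))
```

An empty result means the page carries no robots meta tag at all, which matches the "ordinary meta tags only" case described above.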

I'm going to try tripling the timeout setting and re-indexing. If that doesn't work, I'll create a separate index just for that part of our Intranet and see how Webinator handles it. I'll report my progress on Monday.

If there's a command to extract errors related to the problem page, I'm interested.

I appreciate your previous thorough answer.

Enterprise Webinator Version 4.00.1006356665 of Nov 21, 2001 (mips-sgi-irix5.3-32)
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If you turn the verbosity up to 4 or higher and crawl again, the parent's child links page will list the reason the page wasn't indexed.
John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You don't need to do anything to extract errors. They will be printed on the walk status and when listing the child links. Turning verbosity up to 4 as suggested above will treat all rejections as errors so they get logged.
cindy_walker
Posts: 36
Joined: Tue Jul 24, 2001 2:16 pm

Post by cindy_walker »

Setting verbosity to 4 was the key. Webinator flagged the problem directory (and many others, it turns out) as offsite. I had thought that if a server was on the list of base URLs to walk, Webinator would follow all links on that server, but not so. You also have to list each server in the Extra Domains section. Now everything is indexed.

Thank you.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It should consider sites listed in base urls as onsite. Can you provide an example base url and url that was missed?

Also, you might try updating your scripts to the latest 4.0 scripts from the webinator examples page, http://www.thunderstone.com/texis/site/ ... ample.html
cindy_walker
Posts: 36
Joined: Tue Jul 24, 2001 2:16 pm

Post by cindy_walker »

I downloaded the new dowalk and webinatoradmin scripts. I'll give them a try in a little while.

Among the base URLs is this:

http://admin.dot.ca.gov

The problem directory that Webinator flagged as offsite was:

http://admin.dot.ca.gov/ASC/

Webinator treated every page in a subdirectory of one of the base URLs as offsite. Pages in the web root directory were o.k.
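
To make the expected behavior concrete, here is a minimal sketch of a host-based onsite test in Python. It is not Webinator's actual logic: the function name is_onsite is hypothetical, and treating an extra domain as matching its subdomains is purely an assumption. Under this rule the problem directory counts as onsite for the base URL given above.

```python
from urllib.parse import urlsplit


def is_onsite(url, base_urls, extra_domains=()):
    """Return True if `url` is on the host of any base URL, or on an
    extra domain (assumed here to cover the domain and its subdomains)."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return False
    # A URL is onsite when its host equals the host of a base URL,
    # regardless of how deep its path is.
    for base in base_urls:
        if (urlsplit(base).hostname or "").lower() == host:
            return True
    for domain in extra_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    bases = ["http://admin.dot.ca.gov"]
    print(is_onsite("http://admin.dot.ca.gov/ASC/", bases))
```

If a crawler compares full URL prefixes or paths instead of hosts, a subdirectory can end up classified differently from the web root, which is one thing to look for when comparing the old and new scripts.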