Not indexing pages linked as xyz.com/123

Thunderstone
Site Admin
Posts: 2504
Joined: Wed Jun 07, 2000 6:20 pm


Post by Thunderstone »




I'm having a problem with links not being walked that fit the following description:

gm -ddbdir -fshtml http://beerexpedition.com/northamerica.shtml

gw -ddbdir -s "select count(id) from html where id > 'start of today'"

returns 11 pages indexed.

The file northamerica.shtml contains links to around 90 pages, which in turn contain links to a total of 3000 pages. So, I try:

gm -ddbdir -fshtml http://beerexpedition.com/northamerica.shtml

It reports that northamerica.shtml is already in the database, and nothing is added.

It appears that the only way I can get all the pages in the database is to index each individual directory under the domain. Seems strange.

Is it possible that gw is not walking links that look like the following?

northamerica.shtml contains links like:

<a href="/ca">California</a>

in /ca there is an index.shtml file.
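
For reference, here is a minimal sketch of how a root-relative link such as /ca resolves against the page it appears on under standard URL rules. It uses Python's urllib.parse.urljoin purely as an illustration; it is not gw's own code and says nothing about how gw resolves links internally.

```python
from urllib.parse import urljoin

# Page that contains the link (taken from the post above).
base = "http://beerexpedition.com/northamerica.shtml"

# A root-relative href is resolved against the scheme and host of the base.
resolved = urljoin(base, "/ca")
print(resolved)  # http://beerexpedition.com/ca

# A plain relative href is resolved against the directory of the base page.
print(urljoin(base, "ca/index.shtml"))  # http://beerexpedition.com/ca/index.shtml
```

Under these rules the link is a perfectly ordinary URL, so the form of the href alone would not normally stop a walker from following it.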

Anyone have any suggestions? I'd appreciate any pointers.

Cheers,
Jeff
--
-- Jeff Scott -- Technical Designer -- Real Beer, Inc.
-- jeff@realbeer.com -- (415) 522-1516x310 (voice) -- (415) 522-1535 (fax)
-- Internet Publishers and Consultants -- http://www.realbeer.com
-- The single largest source for beer information known to man!



Post by Thunderstone »



Normally gw will not fetch a page that it already has in the database.
Because that page is not fetched, the pages it references are not fetched either.
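
The effect can be illustrated with a toy walker. This is only a sketch of the general idea, not gw's implementation: the site map, the `walk` function, and its `force` and `max_pages` parameters are all made up for the example, and `force` is only loosely analogous to the options listed below.

```python
from collections import deque

def walk(start, site, db, force=False, max_pages=None):
    """Toy breadth-first walker (hypothetical, not gw).

    site: dict mapping a URL to the URLs it links to.
    db:   set of URLs already stored.
    Returns the number of pages newly added to db.
    """
    added = 0
    queue = deque([start])
    seen = set()
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if url in db and not force:
            # Already stored: the page is not fetched,
            # so the links on it are never discovered.
            continue
        if url not in db:
            if max_pages is not None and added >= max_pages:
                break  # simulate a walk that stopped early
            db.add(url)
            added += 1
        queue.extend(site.get(url, []))
    return added

# Made-up site structure for illustration.
site = {
    "/northamerica.shtml": ["/ca", "/or"],
    "/ca": ["/ca/a", "/ca/b"],
    "/or": ["/or/a"],
}
db = set()
first = walk("/northamerica.shtml", site, db, max_pages=2)   # partial walk
second = walk("/northamerica.shtml", site, db)               # start page skipped
third = walk("/northamerica.shtml", site, db, force=True)    # start page refetched
print(first, second, third)  # 2 0 4
```

The second call adds nothing because the start page is already stored, which matches the behaviour described in the question.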

There are several approaches to updating a database.
There's -rewalk for a complete rewalk based on previous settings.
There's -e for partial or whole rewalk based on current content and settings.
There's -Force to force insertion of already fetched pages.
Please see the manual at http://www.thunderstone.com/gw25man/gw2.html
for details about the above options.

Another approach is to walk to a new database and replace the old with
the new when it's complete.
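
A minimal sketch of that build-then-swap pattern follows, assuming the database is a directory on disk. The `rebuild_and_swap` function and the `build` callback are hypothetical names; `build` stands in for whatever command performs the walk into the new directory, and the exact way to invoke gw for that is left to the manual.

```python
import os
import shutil
import tempfile

def rebuild_and_swap(live_dir, build):
    """Build a fresh database directory next to live_dir, then swap it in.

    build: callable taking the new directory path (hypothetical stand-in
           for the actual walk); if it raises, live_dir is left untouched.
    """
    parent = os.path.dirname(os.path.abspath(live_dir))
    new_dir = tempfile.mkdtemp(prefix="walk-new-", dir=parent)
    try:
        build(new_dir)
    except Exception:
        shutil.rmtree(new_dir, ignore_errors=True)
        raise
    old_dir = live_dir + ".old"
    if os.path.exists(old_dir):
        shutil.rmtree(old_dir)
    if os.path.exists(live_dir):
        os.rename(live_dir, old_dir)   # move the old database aside
    os.rename(new_dir, live_dir)       # put the new one in its place
    shutil.rmtree(old_dir, ignore_errors=True)

def demo():
    """Small self-contained demonstration using temporary files."""
    root = tempfile.mkdtemp()
    live = os.path.join(root, "dbdir")
    os.mkdir(live)
    open(os.path.join(live, "old.txt"), "w").close()

    def build(path):
        # Placeholder for the real walk into the new directory.
        open(os.path.join(path, "new.txt"), "w").close()

    rebuild_and_swap(live, build)
    contents = sorted(os.listdir(live))
    shutil.rmtree(root)
    return contents

print(demo())  # ['new.txt']
```

The old database stays in service until the new walk has finished, so searches are never run against a half-built index.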



