Missing pages

michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Missing pages

Post by michel.weber »

Hi

While running some checks I noticed that a lot of pages on our web sites are not indexed.

It looks like any link of the form http://a.b.c/page2#topofpage is not followed.
Assuming this link is on http://a.b.c/page1, is this the expected behaviour?
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Missing pages

Post by John »

John Turnbull
Thunderstone Software
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Missing pages

Post by mark »

Not expected, unless your base URL is http://a.b.c/dirone/file.html, the other URL is http://a.b.c/dirtwo/file.html, and you have "Stay under" on, which it is by default.

Find the page that links to the missing page. Look up that page in List/Edit urls. Click "Children". See if the missing page is listed and if there's an error message next to it. If it's listed with no error, turn verbosity up to 4, do a new walk, then check again for the reason it was skipped.
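
To illustrate the "Stay under" idea described above, here is a minimal Python sketch. It is not the Appliance's actual implementation: the helper names `base_directory` and `is_under` are hypothetical, and the rule it assumes (drop the #fragment, then require the link to start with the base URL's directory) is only an approximation of what such a setting typically does.

```python
from urllib.parse import urldefrag, urlsplit


def base_directory(base_url: str) -> str:
    """Return scheme://host/dir/ for the base URL (hypothetical helper)."""
    parts = urlsplit(base_url)
    # Keep everything up to and including the last "/" of the path.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    return f"{parts.scheme}://{parts.netloc}{directory}"


def is_under(base_url: str, link: str) -> bool:
    """True if link, minus any #fragment, sits under the base URL's directory.

    Assumption: this is a plain string-prefix test, not the real walker logic.
    """
    target, _fragment = urldefrag(link)
    return target.startswith(base_directory(base_url))


# The fragment link from the first post stays under the base:
print(is_under("http://a.b.c/page1", "http://a.b.c/page2#topofpage"))            # True
# The dirone/dirtwo case is rejected:
print(is_under("http://a.b.c/dirone/file.html", "http://a.b.c/dirtwo/file.html"))  # False
```

Because the test is a plain string comparison, any difference in how the two URLs are written (case, URL-encoding) would also make a link fall outside the base in this sketch.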
michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Missing pages

Post by michel.weber »

Hi

It seems to be something else entirely.

I created a test setup to limit the results.
I started the walk from the following entry point:
http://home.coe.int/t/f/d.r.h/08_Salair ... it%C3%A9s/
None of the links on the page are followed. Only the page itself is indexed.

When I look at the children and their status codes, all the links on the page return one of two errors:
Offsite: which is normal, they really are offsite.
or
Unwanted prefix: which is not normal.

All the 'exclusion' fields in the profile are empty. I even cleared out 'cgibin'; nothing changes.

Here are example links:
http://home.coe.int/t/f/d.r.h/08_salair ... tation.asp (Unwanted prefix)
http://home.coe.int/t/f/d.r.h/08_salair ... /index.asp (Unwanted prefix)
John
Site Admin
Posts: 2623
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Missing pages

Post by John »

John Turnbull
Thunderstone Software
michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Missing pages

Post by michel.weber »

mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Missing pages

Post by mark »

I can't see that page (it is password protected), so I can't comment on how it's encoded. Maybe you have XML UTF-8 turned on?
michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Missing pages

Post by michel.weber »

Hi

The content type of the page is "text/html; charset=iso-8859-1"
The storage charset is UTF-8
The default charset is WINDOWS-1252
XML UTF-8 is off
The display charset is blank
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Missing pages

Post by Kai »

URLs should not contain non-ASCII values, according to the URI spec, to avoid this kind of issue. `%C3%A9' is the UTF-8 version of ISO-8859-1 `%E9' (both are URL-encoded forms of the character "é"). The HTML spec makes UCS (Unicode) the document character set for HTML, and recommends that non-ASCII chars in URLs be mapped from the character _encoding_ (i.e. the Content-Type "charset" parameter) to UTF-8 and then URL-encoded, which is what the Appliance does; so does Microsoft Internet Explorer 6.0. Firefox 1.5.0.8, however, leaves them in the charset encoding and URL-encodes them; I believe this is an older, deprecated behavior.

UTF-8 + URL-encoding was chosen as the standard so that URLs would be consistent regardless of document charset encoding.

However, the real solution is to avoid raw non-ASCII chars in URLs: edit the pages so that the links are already URL-encoded in your desired encoding (i.e. ISO-8859-1), so that user-agents (Appliance and browsers) do not have to do it.
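
The two encodings discussed above can be reproduced with Python's standard `urllib.parse` module. This is only a sketch: the helper `reencode_path` and the sample path `/t/caf%E9/` are made up for illustration and are not taken from the site in question.

```python
from urllib.parse import quote, unquote

char = "é"

# UTF-8 + URL-encoding (what the HTML spec recommends):
utf8_form = quote(char, encoding="utf-8")         # %C3%A9
# Charset encoding + URL-encoding (the older behavior):
latin1_form = quote(char, encoding="iso-8859-1")  # %E9


def reencode_path(path: str, from_enc: str, to_enc: str) -> str:
    """Decode a URL-encoded path from one charset and re-encode it in another.

    Hypothetical helper; "/" is left unescaped so the path structure survives.
    """
    decoded = unquote(path, encoding=from_enc)
    return quote(decoded, safe="/", encoding=to_enc)


print(utf8_form, latin1_form)
print(reencode_path("/t/caf%E9/", "iso-8859-1", "utf-8"))  # /t/caf%C3%A9/
```

Both forms decode to the same character, but a plain string comparison treats `%C3%A9` and `%E9` as different URLs, which is why it matters that links are written consistently.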
michel.weber
Posts: 256
Joined: Sat Oct 08, 2005 12:40 pm

Missing pages

Post by michel.weber »

I was afraid you were going to say something like that.

Unfortunately that will not be easily feasible. We have about 1500 people contributing to our web sites. The original documents are Word documents, which are converted to HTML.

Thanks anyway