Not able to log in to Webinator

Posted: Wed Mar 25, 2009 7:31 am
by John
If you turn the verbosity up to 4 it should list all the URLs it finds on that page and why they are not indexed.

Posted: Wed Mar 25, 2009 7:39 am
by erling.ervik
Some progress...
I think I have found the culprit. I changed XML UTF-8 to yes and ran a rewalk, and now it finds pages.

So far over 6000 pages have been found in the last 75 minutes, so let's see what we have when it is finished.

I will try turning verbosity up to 4 on the next run, if necessary.

Posted: Wed Mar 25, 2009 9:56 am
by erling.ervik
So now we have indexed 9999 pages. That is the limit of the free version, right?

I'm about to leave for today, but one thing I saw immediately is that the Norwegian characters are not shown in the results:

Mange sp?rsm?l om bil i 2007 - TAD - INTRANETT

This line should have been:
Mange spørsmål om bil i 2007 - TAD - INTRANETT

æ, ø and å are replaced with a question mark (?).

If I try a search that starts with one of these letters, the search is not performed, but I get suggestions on how to search for similar words. (A search for "årsavgift" suggested "rsavgift", "rsavgifta" and so on.)
How can I fix this? Something with UTF-8 again?
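
For readers following along, here is a minimal Python sketch of how this kind of symptom can arise in general. It is only an illustration and says nothing about how Webinator processes text internally. It shows that forcing text into a charset without æ, ø and å (plain ASCII is assumed here purely for demonstration) produces the same "?" pattern as the title above, and that dropping unencodable characters yields "rsavgift" from "årsavgift".

```python
# Illustration only: generic Python, not Webinator's actual code path.
title = "Mange spørsmål om bil i 2007 - TAD - INTRANETT"

# Encoding into a charset that lacks ø and å (ASCII is an assumption here)
# substitutes "?" for each character that cannot be represented.
ascii_view = title.encode("ascii", errors="replace").decode("ascii")
print(ascii_view)  # Mange sp?rsm?l om bil i 2007 - TAD - INTRANETT

# Dropping unencodable characters instead of replacing them removes the
# leading letter of a search term such as "årsavgift".
stripped_term = "årsavgift".encode("ascii", errors="ignore").decode("ascii")
print(stripped_term)  # rsavgift

# Both UTF-8 and ISO-8859-1 can represent the Norwegian letters, so a
# round trip through either charset leaves the title unchanged.
utf8_round_trip = title.encode("utf-8").decode("utf-8")
latin1_round_trip = title.encode("iso-8859-1").decode("iso-8859-1")
```

If the output matches what the search results show, that points at a charset conversion step somewhere between the page and the index rather than at the page content itself.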

Posted: Wed Mar 25, 2009 9:58 am
by mark
You almost certainly don't want to be indexing "localhost". If you do, the search will only be useful for someone running a browser directly on the server being indexed. In general, the server name you give in the Base URL should be the name your users will use to access the server.

Posted: Wed Mar 25, 2009 10:03 am
by mark
Not sure why xml utf-8 would change the crawl behavior.

Maybe you need to adjust Storage Charset or Source Default Charset.

If you provide example links we can investigate further.

Posted: Wed Mar 25, 2009 10:05 am
by erling.ervik
Mark, yes I know, but we cannot afford to index the public site at the moment. So this is just testing on my local machine, to get a feel for how Webinator does compared to other systems. At the moment we seem to have problems with non-ASCII characters in the result set. See previous post.

Posted: Thu Mar 26, 2009 3:11 am
by erling.ervik
I tried putting ISO-8859-1 in Storage Charset. I reindexed some pages, but still got question marks instead of the Norwegian characters.

It's not possible to search for words beginning with any of the Norwegian characters, as explained in my previous post. This means that if this cannot be solved, Webinator is not usable for us.

Posted: Thu Mar 26, 2009 10:53 am
by mark
What's the charset of the page(s) being indexed?
Can you provide an example url? Make the message private if you don't want the world seeing the url.

Posted: Thu Mar 26, 2009 11:04 am
by Kai
Storage Charset almost certainly should be left empty (i.e. UTF-8), as ISO-8859-1 might not have the range to store all needed characters. Source Default Charset may need to change (to the pages' charset) if the pages being crawled do not have a labeled charset.

At this point, we would need an example URL to diagnose further. It may be that the charset sent by the server, or the charset in the <meta> tags of the document, is unknown to Webinator, incorrect, etc.; this can only be determined by directly examining a URL.
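
When the site cannot be reached from outside, the declared charset can also be checked locally. The Python sketch below is one hedged way to do that, using only the standard library: it pulls the charset out of a Content-Type value and out of a page's <meta> tags. The names `charset_from_content_type`, `MetaCharsetParser` and `declared_charset` are made up for this example and are not part of Webinator.

```python
from html.parser import HTMLParser
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    for part in value.split(";")[1:]:
        name, _, val = part.partition("=")
        if name.strip().lower() == "charset" and val.strip():
            return val.strip().strip("\"'").lower()
    return None


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if "charset" in attr:
            # HTML5 style: <meta charset="utf-8">
            self.charset = attr["charset"].strip().lower() or None
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Older style: <meta http-equiv="Content-Type" content="...">
            self.charset = charset_from_content_type(attr.get("content", ""))


def declared_charset(html_text: str) -> Optional[str]:
    """Return the charset a page declares in its <meta> tags, or None."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset


# Sample input: a <meta> line of the kind EPiServer pages typically carry.
sample = '<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head>'
print(declared_charset(sample))  # utf-8
```

The same `charset_from_content_type` helper can be applied to the Content-Type header the web server sends. If the header and the <meta> tag disagree, or either names an unexpected charset, that mismatch is worth reporting.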

Posted: Fri Mar 27, 2009 3:41 am
by erling.ervik
At the moment this website only lives on my test machine inside our firewall, so I don't think you can reach it.

Unfortunately I cannot spend more time testing Webinator at the moment. There are several other products I need to test, but if I get any spare time, I will come back to you.

We are using UTF-8, as this little snippet taken from the source view of a random page shows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1">
<!-- <title>

</title> -->
<meta name="TITLE" content="Jakt utenfor Norge" />
<meta name="RATING" content="General" />
<meta name="GENERATOR" content="EPiServer" />
<meta name="creation_date" content="Fri, 07 Sep 2007 08:48:00 GMT" />
<meta name="LAST-MODIFIED" content="Fri, 07 Sep 2007 08:57:37 GMT" />
<meta name="REVISED" content="Fri, 07 Sep 2007 08:57:37 GMT" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Page-Exit" content="progid:DXImageTransform.Microsoft.Fade(duration=.01)" />