Not able to log in to Webinator

John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

If you turn the verbosity up to 4 it should list all the URLs it finds on that page and why they are not indexed.
John Turnbull
Thunderstone Software
erling.ervik
Posts: 11
Joined: Fri Mar 20, 2009 4:36 am

Post by erling.ervik »

Some progress...
I think I have found the culprit. I changed XML UTF-8 to yes and ran a rewalk, and now it finds pages.

So far over 6000 pages have been found in the last 75 minutes, so let's see what we have when it is finished.

I will try turning verbosity up to 4 on the next run, if necessary.
erling.ervik
Posts: 11
Joined: Fri Mar 20, 2009 4:36 am

Post by erling.ervik »

So now we have indexed 9999 pages. That is the limit of the free version, right?

I'm about to leave for today, but one thing I saw immediately is that the Norwegian characters are not shown in the results:

Mange sp?rsm?l om bil i 2007 - TAD - INTRANETT

This line should have been:
Mange spørsmål om bil i 2007 - TAD - INTRANETT

æ, ø and å are replaced with a question mark (?).

If I try a search that starts with one of these letters, the search is not performed. Instead I get suggestions on how to search for similar words. (A search for "årsavgift" suggested "rsavgift", "rsavgifta" and so on.)
How can I fix this? Something with UTF-8 again?
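
For illustration only, the symptom described above is what a lossy charset conversion produces. The following is a minimal Python sketch, not Webinator code; it assumes (this is not confirmed anywhere in the thread) that the text passes through a step whose target charset cannot represent æ, ø and å. ASCII is used here only as a stand-in for such a charset.

```python
# Illustration only: not Webinator code. Assumes a conversion step whose
# target charset (ASCII as a stand-in) cannot represent the Norwegian letters.
title = "Mange spørsmål om bil i 2007"

# With errors="replace", each unrepresentable character becomes "?".
lossy = title.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# With errors="ignore", the characters are dropped, which resembles the
# truncated suggestions returned for the search term.
dropped = "årsavgift".encode("ascii", errors="ignore").decode("ascii")
print(dropped)
```

If the real cause is similar, the fix would lie in the charset settings rather than in the pages themselves, but that is only a guess from the symptoms.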
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You almost certainly don't want to be indexing "localhost". If you do, the search will only be useful for someone running a browser directly on the server being indexed. In general, the server name in the base URL should be the name your users will use to access the server.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Not sure why XML UTF-8 would change the crawl behavior.

Maybe you need to adjust Storage Charset or Source Default Charset.

If you provide example links we can investigate further.
erling.ervik
Posts: 11
Joined: Fri Mar 20, 2009 4:36 am

Post by erling.ervik »

Mark, yes, I know, but we cannot afford to index the public site at the moment. So this is just testing on my local machine, to get a feel for how Webinator compares to other systems. At the moment we seem to have problems with non-ASCII characters in the result set. See my previous post.
erling.ervik
Posts: 11
Joined: Fri Mar 20, 2009 4:36 am

Post by erling.ervik »

I tried setting Storage Charset to ISO-8859-1 and reindexed some pages, but I still got question marks instead of the Norwegian chars.

It's not possible to search for words beginning with any of the Norwegian chars, as explained in my previous post. This means that if this cannot be solved, Webinator is not usable for us.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What's the charset of the page(s) being indexed?
Can you provide an example URL? Make the message private if you don't want the world seeing the URL.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Storage Charset should almost certainly be left empty (i.e. UTF-8), as ISO-8859-1 might not have the range to store all needed characters. Source Default Charset may need to change (to the pages' charset) if the pages being crawled do not have a labeled charset.

At this point, we would need an example URL to diagnose further. It may be that the charset sent by the server, or the charset in the <meta> tags of the document, are unknown to Webinator, incorrect, etc.; this can only be determined by directly examining a URL.
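
As a rough way to do that examination by hand, the two charset labels mentioned above can be compared directly. This is a minimal Python sketch with hypothetical helper names (`charset_from_header`, `charset_from_meta`); it is not part of Webinator, and the header value in the example is invented to show a mismatch.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration; these are not part of Webinator.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)
_META_RE = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html: str) -> Optional[str]:
    """Return the first charset named in any <meta> tag, if any."""
    for tag in _META_RE.findall(html):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1).lower()
    return None


# Invented header value, chosen to disagree with the <meta> tag.
header = "text/html; charset=ISO-8859-1"
page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(charset_from_header(header), charset_from_meta(page))
```

When the two labels disagree, or either is missing, that is the kind of detail worth including when reporting the problem.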
erling.ervik
Posts: 11
Joined: Fri Mar 20, 2009 4:36 am

Post by erling.ervik »

At the moment this website only lives on my test machine inside our firewall, so I don't think you can reach it.

Unfortunately I cannot spend more time testing Webinator at the moment. There are several others I need to test, but if I get any spare time, I will come back to you.

We are using UTF-8, as this little snippet taken from the source view of a random page shows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1">
<!-- <title>

</title> -->
<meta name="TITLE" content="Jakt utenfor Norge" />
<meta name="RATING" content="General" />
<meta name="GENERATOR" content="EPiServer" />
<meta name="creation_date" content="Fri, 07 Sep 2007 08:48:00 GMT" />
<meta name="LAST-MODIFIED" content="Fri, 07 Sep 2007 08:57:37 GMT" />
<meta name="REVISED" content="Fri, 07 Sep 2007 08:57:37 GMT" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Page-Exit" content="progid:DXImageTransform.Microsoft.Fade(duration=.01)" />
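
A `<meta>` label like the one above only declares a charset; whether the bytes actually served match it is a separate question. A minimal Python sketch of that check, assuming the raw page bytes are available (the helper name `decodes_as` is hypothetical, and the sample bytes are built from the title quoted earlier in this thread):

```python
def decodes_as(raw: bytes, charset: str) -> bool:
    """Return True if raw decodes without error under the given charset."""
    try:
        raw.decode(charset, errors="strict")
        return True
    except (UnicodeDecodeError, LookupError):
        # LookupError covers charset names Python does not recognise.
        return False


sample = "Mange spørsmål om bil i 2007"
utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("iso-8859-1")

print(decodes_as(utf8_bytes, "utf-8"))
print(decodes_as(latin1_bytes, "utf-8"))
```

A page labeled UTF-8 whose bytes fail this check is mislabeled, which would be one possible explanation for the replaced characters.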