gerry.odea
Posts: 98 Joined: Fri Sep 19, 2008 9:33 am
by gerry.odea » Fri Dec 05, 2008 8:55 am
I'm running into a problem when I fetch HTML that has foreign characters in the URL strings on the page.
For instance, if I try to bring in a page that has the URLs below, Thunderstone drops the foreign characters from the URLs but not from the descriptions.
<a href="/Top/World/Português">Português</a>
<a href="/Top/World/Français">Français</a>
<a href="/Top/World/Español">Español</a>
When I get them back from a fetch they look like this:
<a href="/Top/World/Portugus">Português</a>
<a href="/Top/World/Franais">Français</a>
<a href="/Top/World/Espaol">Español</a>
This is the fetch I am using:
<a name=DIRECTOR>
<if $urls like "/Top">
<fetch PARALLEL $urls>
<sandr '/Top' '/texis/open/geekie\?urls=/Top' $ret>
<$html=$ret>
<send $html>
<flush></fetch>
</if>
</a>
John
Site Admin
Posts: 2625 Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH
by John » Fri Dec 05, 2008 10:05 am
How is the page encoded to begin with? The URL should probably be URL encoded in any case. Is there an example URL we can fetch?
John Turnbull
Thunderstone Software
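For illustration only, here is a minimal sketch of what URL-encoding those hrefs would look like. It is written in Python rather than Vortex, and the helper name `encode_path` is made up for this example. The page's character set is not stated in the thread, so the sketch shows both a UTF-8 and a Latin-1 assumption; the encoded form differs between the two.

```python
from urllib.parse import quote


def encode_path(path: str, charset: str = "utf-8") -> str:
    """Percent-encode non-ASCII characters in a URL path, keeping '/' as-is.

    `charset` is an assumption about how the server expects the bytes;
    the thread does not say which encoding the page uses.
    """
    return quote(path, safe="/", encoding=charset)


# Sample hrefs taken from the first post in this thread.
hrefs = [
    "/Top/World/Português",
    "/Top/World/Français",
    "/Top/World/Español",
]

for href in hrefs:
    # Print the UTF-8 and Latin-1 encoded forms side by side.
    print(encode_path(href), encode_path(href, "latin-1"))
```

Once encoded this way, the href contains only ASCII, so a client that mishandles bytes above 127 has nothing left to drop.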
by gerry.odea » Fri Dec 05, 2008 10:52 am
It's not encoded. It's just like this:
<table width=98% cellpadding=2 cellspacing=0>
<tr><td nowrap>
<td width=40% valign="top"><font face=arial size="2"><a href="/Top/World/Afrikaans">Afrikaans</a><font size=1> (581) </font><BR><a href="/Top/World/Arabic">Arabic</a><font size=1> (7103) </font><BR><a href="/Top/World/Armenian">Armenian</a><font size=1> (896) </font><BR><a href="/Top/World/Asturianu">Asturianu</a><font size=1> (95) </font><BR><a href="/Top/World/Azerbaijani">Azerbaijani</a><font size=1> (594) </font><BR><a href="/Top/World/BahasaIndonesia">Bahasa Indonesia</a><font size=1> (2620) </font><BR><a href="/Top/World/BahasaMelayu">Bahasa Melayu</a><font size=1> (601) </font><BR><a href="/Top/World/Bangla">Bangla</a><font size=1> (29) </font><BR><a href="/Top/World/Belarusian">Belarusian</a><font size=1> (222) </font><BR><a href="/Top/World/Bosanski">Bosanski</a><font size=1> (1770) </font><BR><a href="/Top/World/Brezhoneg">Brezhoneg</a><font size=1> (241) </font><BR><a href="/Top/World/Bulgarian">Bulgarian</a><font size=1> (5298) </font><BR><a href="/Top/World/Català">Català</a><font size=1> (46684) </font><BR><a href="/Top/World/ChineseSimplified">Chinese Simplified</a><font size=1> (61518) </font><BR><a href="/Top/World/ChineseTraditional">Chinese Traditional</a><font size=1> (16392) </font><BR><a href="/Top/World/Cymraeg">Cymraeg</a><font size=1> (628) </font><BR><a href="/Top/World/Dansk">Dansk</a><font size=1> (68284) </font><BR><a href="/Top/World/Deutsch">Deutsch</a><font size=1> (952435) </font><BR><a href="/Top/World/Eesti">Eesti</a><font size=1> (1307) </font><BR><a href="/Top/World/Español">Español</a><font size=1> (203497) </font><BR><a href="/Top/World/Esperanto">Esperanto</a><font size=1> (3841) </font><BR><a href="/Top/World/Euskara">Euskara</a><font size=1> (2237) </font><BR><a href="/Top/World/Français">Français</a><font size=1> (320926) </font><BR><a href="/Top/World/Frysk">Frysk</a><font size=1> (37) </font><BR><a href="/Top/World/Furlan">Furlan</a><font size=1> (170) </font><BR><a href="/Top/World/Føroyskt">Føroyskt</a><font size=1> 
(74) </font><BR><a href="/Top/World/Gaeilge">Gaeilge</a><font size=1> (112) </font><BR><a href="/Top/World/Galego">Galego</a><font size=1> (1815) </font><BR><a href="/Top/World/Greek">Greek</a><font size=1> (4071) </font><BR><a href="/Top/World/Gujarati">Gujarati</a><font size=1> (66) </font><BR><a href="/Top/World/Gàidhlig">Gàidhlig</a><font size=1> (199) </font><BR><a href="/Top/World/Hebrew">Hebrew</a><font size=1> (6836) </font><BR><a href="/Top/World/Hindi">Hindi</a><font size=1> (1005) </font><BR><a href="/Top/World/Hrvatski">Hrvatski</a><font size=1> (5695) </font><BR><a href="/Top/World/Interlingua">Interlingua</a><font size=1> (91) </font><BR><a href="/Top/World/Italiano">Italiano</a><font size=1> (224607) </font><BR><a href="/Top/World/Japanese">Japanese</a><font size=1> (192104) </font><BR><a href="/Top/World/Kannada">Kannada</a><font size=1> (105) </font><BR><a href="/Top/World/Kaszëbsczi">Kaszëbsczi</a><font size=1> (37) </font><BR><a href="/Top/World/Kazakh">Kazakh</a><font size=1> (147) </font><BR></font></td>
<td width=60% valign="top"><font face=arial size="2"><a href="/Top/World/Kiswahili">Kiswahili</a><font size=1> (47) </font><BR><a href="/Top/World/Korean">Korean</a><font size=1> (7226) </font><BR><a href="/Top/World/Kurdî">Kurdî</a><font size=1> (584) </font><BR><a href="/Top/World/Latviski">Latviski</a><font size=1> (4142) </font><BR><a href="/Top/World/Lietuvių">Lietuvių</a><font size=1> (6684) </font><BR><a href="/Top/World/LinguaLatina">Lingua Latina</a><font size=1> (89) </font><BR><a href="/Top/World/Lëtzebuergesch">Lëtzebuergesch</a><font size=1> (30) </font><BR><a href="/Top/World/Magyar">Magyar</a><font size=1> (9378) </font><BR><a href="/Top/World/Makedonski">Makedonski</a><font size=1> (237) </font><BR><a href="/Top/World/Marathi">Marathi</a><font size=1> (45) </font><BR><a href="/Top/World/Nederlands">Nederlands</a><font size=1> (115946) </font><BR><a href="/Top/World/Norsk">Norsk</a><font size=1> (17028) </font><BR><a href="/Top/World/Occitan">Occitan</a><font size=1> (106) </font><BR><a href="/Top/World/Ossetian">Ossetian</a><font size=1> (24) </font><BR><a href="/Top/World/Persian">Persian</a><font size=1> (1546) </font><BR><a href="/Top/World/Polska">Polska</a><font size=1> (87472) </font><BR><a href="/Top/World/Português">Português</a><font size=1> (28375) </font><BR><a href="/Top/World/Punjabi">Punjabi</a><font size=1> (66) </font><BR><a href="/Top/World/Română">Română</a><font size=1> (18450) </font><BR><a href="/Top/World/Rumantsch">Rumantsch</a><font size=1> (86) </font><BR><a href="/Top/World/Russian">Russian</a><font size=1> (70223) </font><BR><a href="/Top/World/Sardu">Sardu</a><font size=1> (308) </font><BR><a href="/Top/World/Shqip">Shqip</a><font size=1> (432) </font><BR><a href="/Top/World/Sicilianu">Sicilianu</a><font size=1> (38) </font><BR><a href="/Top/World/Slovensko">Slovensko</a><font size=1> (953) </font><BR><a href="/Top/World/Slovensky">Slovensky</a><font size=1> (3386) </font><BR><a href="/Top/World/Srpski">Srpski</a><font 
size=1> (2992) </font><BR><a href="/Top/World/Suomi">Suomi</a><font size=1> (11204) </font><BR><a href="/Top/World/Svenska">Svenska</a><font size=1> (46678) </font><BR><a href="/Top/World/Tagalog">Tagalog</a><font size=1> (532) </font><BR><a href="/Top/World/Taiwanese">Taiwanese</a><font size=1> (138) </font><BR><a href="/Top/World/Tamil">Tamil</a><font size=1> (285) </font><BR><a href="/Top/World/Tatarça">Tatarça</a><font size=1> (91) </font><BR><a href="/Top/World/Telugu">Telugu</a><font size=1> (166) </font><BR><a href="/Top/World/Thai">Thai</a><font size=1> (2098) </font><BR><a href="/Top/World/Türkçe">Türkçe</a><font size=1> (1113711) </font><BR><a href="/Top/World/Ukrainian">Ukrainian</a><font size=1> (5531) </font><BR><a href="/Top/World/Vietnamese">Vietnamese</a><font size=1> (892) </font><BR><a href="/Top/World/slenska">Íslenska</a><font size=1> (512) </font><BR><a href="/Top/World/Česky">Česky</a><font size=1> (25621) </font><BR></font></td>
</tr><tr><td height="10"></td></tr></td></table>
mark
Site Admin
Posts: 5519 Joined: Tue Apr 25, 2000 6:56 pm
by mark » Fri Dec 05, 2008 12:09 pm
I'm not able to replicate what you describe using what you've posted. Perhaps there's some webserver or browser interaction going on? Exactly where are you seeing the missing characters? Try writing the result of the fetch to a file and the result of the sandr to another file and compare those to the original.
What version of Texis are you using?
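One way to do the comparison suggested above, sketched in Python rather than Vortex: pull the href values out of the original page and out of the saved fetch result, then list the pairs that differ. The function names are made up for this example, and it assumes both copies contain the same links in the same order. It works on raw bytes so that no guess about the page's character set is needed.

```python
import re
from typing import List, Tuple

# Matches double-quoted href values in raw bytes, without decoding them.
HREF_RE = re.compile(rb'href="([^"]*)"')


def extract_hrefs(data: bytes) -> List[bytes]:
    """Return every double-quoted href value found in the page bytes."""
    return HREF_RE.findall(data)


def changed_hrefs(original: bytes, fetched: bytes) -> List[Tuple[bytes, bytes]]:
    """Pair hrefs by position and return the pairs that are not identical."""
    pairs = zip(extract_hrefs(original), extract_hrefs(fetched))
    return [(before, after) for before, after in pairs if before != after]


# In-memory stand-ins for the two saved files, using a link from this thread.
original = '<a href="/Top/World/Español">Español</a>'.encode("utf-8")
fetched = b'<a href="/Top/World/Espaol">' + "Español</a>".encode("utf-8")

for before, after in changed_hrefs(original, fetched):
    print(before, b"->", after)
```

For real files, read each one with `open(path, "rb").read()` and pass the two byte strings to `changed_hrefs`. If the list comes back empty for the fetch output, the characters are being lost at a later step.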
by gerry.odea » Fri Dec 05, 2008 12:20 pm
There is no difference on the fetch. Maybe it has to do with the old version of Texis that I have?
I have:
Texis Web Script (Vortex) Copyright © 1996-1999 Thunderstone - EPI, Inc.
Commercial Version 2.6.929642470 of Jun 17, 1999 (i686-unknown-linux2.2.5)
Is there a workaround for older versions of Texis?
by mark » Fri Dec 05, 2008 1:43 pm
I doubt I can resurrect anything that old to develop workarounds. Are you saying that sandr is causing the difference? We need to find exactly where the problem is and figure out what's happening. At what point does the change first appear?
by gerry.odea » Fri Dec 05, 2008 2:14 pm
The change appears with the fetch, not the sandr; if it were just the sandr, that would be easy. I removed the sandr and it was still occurring. The fetch seems not to recognize characters such as ê, ñ, and ç and just drops them. Is there any command I can put in to stop that?
by mark » Fri Dec 05, 2008 4:32 pm
Are you seeing the difference in the browser or in a file saved from the script?
<fetch PARALLEL $urls>
<$html=$ret>
<write /tmp/sample.html><fmt "%s" $html></write>
<fmt "%s" $html>
</fetch>
by mark » Fri Dec 05, 2008 4:35 pm
Do you have any urlcp calls? Especially any reparent options?