Fetching HTML with foreign characters, like ê,ñ,ç

gerry.odea
Posts: 98
Joined: Fri Sep 19, 2008 9:33 am

Post by gerry.odea »

I'm running into a problem when I fetch html that has foreign characters in the url strings on the html page.

For instance, if I fetch a page that contains the URLs below, Thunderstone drops the foreign characters from the href values but not from the link text.

<a href="/Top/World/Português">Português</a>
<a href="/Top/World/Français">Français</a>
<a href="/Top/World/Español">Español</a>

When I get them back from a fetch they look like this:


<a href="/Top/World/Portugus">Português</a>
<a href="/Top/World/Franais">Français</a>
<a href="/Top/World/Espaol">Español</a>


This is the fetch I am using:

<a name=DIRECTOR>
<if $urls like "/Top">
<fetch PARALLEL $urls>
<sandr '/Top' '/texis/open/geekie\?urls=/Top' $ret>
<$html=$ret>
<send $html>
<flush></fetch>
</if>
</a>
John
Site Admin
Posts: 2625
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

Post by John »

How is the page encoded to begin with? The URL should probably be URL encoded in any case. Is there an example URL we can fetch?
John Turnbull
Thunderstone Software
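
As an illustration of the URL encoding mentioned above, here is a minimal Python sketch (Python rather than Vortex; `encode_path` is a hypothetical helper name, not a Texis function). The correct byte encoding depends on how the page itself is encoded, which the thread has not established, so the sketch shows both the UTF-8 and the ISO-8859-1 forms of the example hrefs.

```python
from urllib.parse import quote

# Example hrefs taken from the first post in this thread.
PATHS = [
    "/Top/World/Português",
    "/Top/World/Français",
    "/Top/World/Español",
]


def encode_path(path, charset="utf-8"):
    """Percent-encode a URL path, leaving '/' separators intact.

    charset is an assumption: it should match the encoding the
    target server expects, which is not stated in the thread.
    """
    return quote(path, safe="/", encoding=charset)


for p in PATHS:
    # Print the UTF-8 form and the ISO-8859-1 form side by side.
    print(encode_path(p), encode_path(p, "iso-8859-1"))
```

An href written in one of these percent-encoded forms contains only ASCII bytes, so a fetcher that mishandles bytes above 0x7F has nothing to drop.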
gerry.odea
Posts: 98
Joined: Fri Sep 19, 2008 9:33 am

Post by gerry.odea »

It's not encoded. It's just like this:

<table width=98% cellpadding=2 cellspacing=0>
<tr><td nowrap>
<td width=40% valign="top"><font face=arial size="2"><a href="/Top/World/Afrikaans">Afrikaans</a><font size=1> (581) </font><BR><a href="/Top/World/Arabic">Arabic</a><font size=1> (7103) </font><BR><a href="/Top/World/Armenian">Armenian</a><font size=1> (896) </font><BR><a href="/Top/World/Asturianu">Asturianu</a><font size=1> (95) </font><BR><a href="/Top/World/Azerbaijani">Azerbaijani</a><font size=1> (594) </font><BR><a href="/Top/World/BahasaIndonesia">Bahasa Indonesia</a><font size=1> (2620) </font><BR><a href="/Top/World/BahasaMelayu">Bahasa Melayu</a><font size=1> (601) </font><BR><a href="/Top/World/Bangla">Bangla</a><font size=1> (29) </font><BR><a href="/Top/World/Belarusian">Belarusian</a><font size=1> (222) </font><BR><a href="/Top/World/Bosanski">Bosanski</a><font size=1> (1770) </font><BR><a href="/Top/World/Brezhoneg">Brezhoneg</a><font size=1> (241) </font><BR><a href="/Top/World/Bulgarian">Bulgarian</a><font size=1> (5298) </font><BR><a href="/Top/World/Català">Català</a><font size=1> (46684) </font><BR><a href="/Top/World/ChineseSimplified">Chinese Simplified</a><font size=1> (61518) </font><BR><a href="/Top/World/ChineseTraditional">Chinese Traditional</a><font size=1> (16392) </font><BR><a href="/Top/World/Cymraeg">Cymraeg</a><font size=1> (628) </font><BR><a href="/Top/World/Dansk">Dansk</a><font size=1> (68284) </font><BR><a href="/Top/World/Deutsch">Deutsch</a><font size=1> (952435) </font><BR><a href="/Top/World/Eesti">Eesti</a><font size=1> (1307) </font><BR><a href="/Top/World/Español">Español</a><font size=1> (203497) </font><BR><a href="/Top/World/Esperanto">Esperanto</a><font size=1> (3841) </font><BR><a href="/Top/World/Euskara">Euskara</a><font size=1> (2237) </font><BR><a href="/Top/World/Français">Français</a><font size=1> (320926) </font><BR><a href="/Top/World/Frysk">Frysk</a><font size=1> (37) </font><BR><a href="/Top/World/Furlan">Furlan</a><font size=1> (170) </font><BR><a href="/Top/World/Føroyskt">Føroyskt</a><font size=1> 
(74) </font><BR><a href="/Top/World/Gaeilge">Gaeilge</a><font size=1> (112) </font><BR><a href="/Top/World/Galego">Galego</a><font size=1> (1815) </font><BR><a href="/Top/World/Greek">Greek</a><font size=1> (4071) </font><BR><a href="/Top/World/Gujarati">Gujarati</a><font size=1> (66) </font><BR><a href="/Top/World/Gàidhlig">Gàidhlig</a><font size=1> (199) </font><BR><a href="/Top/World/Hebrew">Hebrew</a><font size=1> (6836) </font><BR><a href="/Top/World/Hindi">Hindi</a><font size=1> (1005) </font><BR><a href="/Top/World/Hrvatski">Hrvatski</a><font size=1> (5695) </font><BR><a href="/Top/World/Interlingua">Interlingua</a><font size=1> (91) </font><BR><a href="/Top/World/Italiano">Italiano</a><font size=1> (224607) </font><BR><a href="/Top/World/Japanese">Japanese</a><font size=1> (192104) </font><BR><a href="/Top/World/Kannada">Kannada</a><font size=1> (105) </font><BR><a href="/Top/World/Kaszëbsczi">Kaszëbsczi</a><font size=1> (37) </font><BR><a href="/Top/World/Kazakh">Kazakh</a><font size=1> (147) </font><BR></font></td>
<td width=60% valign="top"><font face=arial size="2"><a href="/Top/World/Kiswahili">Kiswahili</a><font size=1> (47) </font><BR><a href="/Top/World/Korean">Korean</a><font size=1> (7226) </font><BR><a href="/Top/World/Kurdî">Kurdî</a><font size=1> (584) </font><BR><a href="/Top/World/Latviski">Latviski</a><font size=1> (4142) </font><BR><a href="/Top/World/Lietuvi&#371;">Lietuvi&#371;</a><font size=1> (6684) </font><BR><a href="/Top/World/LinguaLatina">Lingua Latina</a><font size=1> (89) </font><BR><a href="/Top/World/Lëtzebuergesch">Lëtzebuergesch</a><font size=1> (30) </font><BR><a href="/Top/World/Magyar">Magyar</a><font size=1> (9378) </font><BR><a href="/Top/World/Makedonski">Makedonski</a><font size=1> (237) </font><BR><a href="/Top/World/Marathi">Marathi</a><font size=1> (45) </font><BR><a href="/Top/World/Nederlands">Nederlands</a><font size=1> (115946) </font><BR><a href="/Top/World/Norsk">Norsk</a><font size=1> (17028) </font><BR><a href="/Top/World/Occitan">Occitan</a><font size=1> (106) </font><BR><a href="/Top/World/Ossetian">Ossetian</a><font size=1> (24) </font><BR><a href="/Top/World/Persian">Persian</a><font size=1> (1546) </font><BR><a href="/Top/World/Polska">Polska</a><font size=1> (87472) </font><BR><a href="/Top/World/Português">Português</a><font size=1> (28375) </font><BR><a href="/Top/World/Punjabi">Punjabi</a><font size=1> (66) </font><BR><a href="/Top/World/Român&#259;">Român&#259;</a><font size=1> (18450) </font><BR><a href="/Top/World/Rumantsch">Rumantsch</a><font size=1> (86) </font><BR><a href="/Top/World/Russian">Russian</a><font size=1> (70223) </font><BR><a href="/Top/World/Sardu">Sardu</a><font size=1> (308) </font><BR><a href="/Top/World/Shqip">Shqip</a><font size=1> (432) </font><BR><a href="/Top/World/Sicilianu">Sicilianu</a><font size=1> (38) </font><BR><a href="/Top/World/Slovensko">Slovensko</a><font size=1> (953) </font><BR><a href="/Top/World/Slovensky">Slovensky</a><font size=1> (3386) </font><BR><a 
href="/Top/World/Srpski">Srpski</a><font size=1> (2992) </font><BR><a href="/Top/World/Suomi">Suomi</a><font size=1> (11204) </font><BR><a href="/Top/World/Svenska">Svenska</a><font size=1> (46678) </font><BR><a href="/Top/World/Tagalog">Tagalog</a><font size=1> (532) </font><BR><a href="/Top/World/Taiwanese">Taiwanese</a><font size=1> (138) </font><BR><a href="/Top/World/Tamil">Tamil</a><font size=1> (285) </font><BR><a href="/Top/World/Tatarça">Tatarça</a><font size=1> (91) </font><BR><a href="/Top/World/Telugu">Telugu</a><font size=1> (166) </font><BR><a href="/Top/World/Thai">Thai</a><font size=1> (2098) </font><BR><a href="/Top/World/Türkçe">Türkçe</a><font size=1> (1113711) </font><BR><a href="/Top/World/Ukrainian">Ukrainian</a><font size=1> (5531) </font><BR><a href="/Top/World/Vietnamese">Vietnamese</a><font size=1> (892) </font><BR><a href="/Top/World/slenska">Íslenska</a><font size=1> (512) </font><BR><a href="/Top/World/&#268;esky">&#268;esky</a><font size=1> (25621) </font><BR></font></td>
</tr><tr><td height="10"></td></tr></td></table>
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I'm not able to replicate what you describe using what you've posted. Perhaps there's some webserver or browser interaction going on? Exactly where are you seeing the missing characters? Try writing the result of the fetch to a file and the result of the sandr to another file and compare those to the original.

What version of Texis are you using?
gerry.odea
Posts: 98
Joined: Fri Sep 19, 2008 9:33 am

Post by gerry.odea »

There is no difference on the fetch. Maybe it has to do with the old version of Texis that I have?

I have:
Texis Web Script (Vortex) Copyright © 1996-1999 Thunderstone - EPI, Inc.
Commercial Version 2.6.929642470 of Jun 17, 1999 (i686-unknown-linux2.2.5)


Is there a workaround for older versions of Texis?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I doubt I can resurrect anything that old to develop workarounds. Are you saying that sandr is causing the difference? We need to find exactly where the problem is and figure out what's happening. At what point does the change first appear?
gerry.odea
Posts: 98
Joined: Fri Sep 19, 2008 9:33 am

Post by gerry.odea »

The change appears with the fetch, not the sandr. If it were just the sandr, that would be easy. I removed the sandr and it was still occurring. The fetch seems not to recognize characters such as ê, ñ, ç and just drops them. Is there any command I can add to stop that?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Are you seeing the difference in the browser or in a file saved from the script?

<fetch PARALLEL $urls>
<$html=$ret>
<write /tmp/sample.html><fmt "%s" $html></write>
<fmt "%s" $html>
</fetch>
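
Once the script above has written the saved copy, one way to check it independently of the browser is to look at the raw bytes inside each href. The following is a rough Python sketch under that assumption; `non_ascii_hrefs` is a hypothetical helper, and it only handles double-quoted href attributes like the ones posted in this thread.

```python
import re

# Matches double-quoted href values in raw bytes, as in the posted sample.
HREF_RE = re.compile(rb'href="([^"]*)"')


def non_ascii_hrefs(data):
    """Return (href, hex_of_high_bytes) for every href holding bytes >= 0x80."""
    found = []
    for match in HREF_RE.finditer(data):
        href = match.group(1)
        high = bytes(b for b in href if b >= 0x80)
        if high:
            found.append((href, high.hex()))
    return found


# Inline sample so the sketch runs on its own; for the real check, read the
# saved file in binary mode, e.g. open("/tmp/sample.html", "rb").read().
sample = '<a href="/Top/World/Español">Español</a>'.encode("iso-8859-1")
print(non_ascii_hrefs(sample))
```

If the saved file yields an empty list for links that have accented characters on the original page, the bytes were already gone when the fetch result was written, which rules out the browser as the cause.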
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Do you have any urlcp calls? Especially any reparent options?