utf errors thrown while downloading content.

Post by **Texis User** » Sat Nov 25, 2006 1:49 am

Hi,

We have a script that downloads file contents.
Lately we have been seeing errors like these:

000 2006-11-24 22:34:18 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 155690 in the function htutf16_to_utf8
000 2006-11-24 22:35:31 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 1826 in the function htutf16_to_utf8
000 2006-11-24 22:37:03 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 774 in the function htutf16_to_utf8
000 2006-11-24 22:41:35 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 4706 in the function htutf16_to_utf8
000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
10

This seems to cause CPU spikes and locks on the table.

Any suggestions?

Post by **Texis User** » Mon Nov 27, 2006 1:20 pm

Appreciate all the help I can get on this.

Thanks

Post by **mark** » Mon Nov 27, 2006 1:56 pm

Sounds like those are improperly encoded pages. They need to be fixed or the errors will continue.

No locks are used during conversion, though, so that's unrelated. If you're using Webinator, set Parallelism:Threads to 1 to prevent processing of multiple pages from affecting each other.

Post by **Kai** » Wed Nov 29, 2006 12:02 pm

Can you post a public URL to one of the pages that gives a UTF-16 conversion error, as well as the specific message for that URL? It is most likely due to an erroneously encoded UTF-16 sequence, as mentioned.
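
For an intranet page that cannot be shared, one way to inspect the bad bytes locally is to save the raw download and decode it yourself. The following is a minimal Python sketch, not part of Texis; the function name `find_utf16_error` is hypothetical, and it assumes the offset in the Texis message is a byte offset into the raw page, which the thread does not confirm.

```python
def find_utf16_error(data: bytes, encoding: str = "utf-16-le"):
    """Return (byte_offset, bad_bytes) for the first invalid UTF-16
    sequence in data, or None if the whole buffer decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the offending bytes in the input.
        return exc.start, data[exc.start:exc.end]
    return None


if __name__ == "__main__":
    # Hypothetical sample: "ab", then a lone high surrogate (D800)
    # that is not followed by a low surrogate, then "c".
    sample = "ab".encode("utf-16-le") + b"\x00\xd8" + "c".encode("utf-16-le")
    print(find_utf16_error(sample))
```

Running this against a saved copy of a failing page shows which bytes the decoder rejects, which is usually enough to tell a truncated download from a page that declares UTF-16 but is really in another charset.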

Post by **Texis User** » Thu Nov 30, 2006 3:36 pm

No public URLs; it's an intranet site.

Another thing we observe, as I pasted in my first message, is a lot of errors like these:

000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
10

At these times I have lots of anytotx processes running and an unreasonable spike not only in CPU but in /tmp usage too.

Line 516 corresponds to the block where I call anytotx.

Also, is there any way I could exit gracefully when a Timeout occurs in my script?

I have set 300 seconds for fetch timeouts but still see errors when my script tries to read a URL.

Post by **mark** » Thu Nov 30, 2006 5:12 pm

Large complicated files being processed? Some formats are particularly memory and processor intensive to convert to text.

Depends what you mean by gracefully. There's the <timeout></timeout> directive which lets you output a message when the timeout occurs. Personally I'd rather give a --timeout option to anytotx that was shorter than my script timeout so the script regains control when the conversion takes too long.
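
To illustrate the second approach (the converter gets a shorter limit than the script, so the caller regains control), here is a rough Python analogue. It is not Vortex and does not call anytotx; `run_with_timeout` is a hypothetical helper, and the child command is a stand-in that simply sleeps.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, stdout).

    ok is False when the child exceeds timeout_s or exits non-zero.
    subprocess.run kills the child on timeout before raising."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return proc.returncode == 0, proc.stdout


if __name__ == "__main__":
    # Stand-in for a slow conversion: a child that sleeps 5 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    ok, text = run_with_timeout(slow, timeout_s=0.5)
    if not ok:
        print("conversion took too long; skipping this document")
```

The point of the design is that a slow document costs one bounded wait and the loop moves on, instead of the whole script hitting its own timeout.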

Post by **Texis User** » Fri Dec 01, 2006 7:34 am

How do you set a timeout for anytotx?

Post by **Texis User** » Fri Dec 01, 2006 8:16 am

Actually, I have the timeout set as:
<$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=-1">

The script timeout is 1000 and the fetch timeout is 300.

I am confused now by these timeout errors.

I am also seeing these errors:

1. Cannot pdf open /tmp/cvti00323a in the function do_epipdf_file

2. anytotx (21976) ABEND: signal 11 (SIGSEGV); exiting

Post by **mark** » Fri Dec 01, 2006 10:57 am

You're setting no timeout for anytotx so it can run longer than your script timeout. You pretend to use $parse_timeout but then proceed to ignore that and use -1.
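
As a language-neutral illustration of the intended logic (shown in Python, not Vortex), the option should be built from the configured value and checked against the script timeout rather than overwritten with a literal. The helper name `anytotx_timeout_option` is hypothetical; only the `--timeout=N` form comes from this thread.

```python
def anytotx_timeout_option(parse_timeout: int, script_timeout: int) -> str:
    """Build the --timeout option from the configured parse timeout.

    Rejects values that would defeat the purpose: non-positive values
    (the thread treats -1 as "no timeout") and values that are not
    shorter than the script's own timeout."""
    if parse_timeout <= 0:
        raise ValueError("parse timeout must be positive")
    if parse_timeout >= script_timeout:
        raise ValueError("parse timeout must be shorter than script timeout")
    return "--timeout=%d" % parse_timeout


if __name__ == "__main__":
    print(anytotx_timeout_option(300, 1000))
```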

What does wrapanytotx do?

That error indicates that the PDF is not convertible for some reason. If the PDF download is truncated, it can't be processed. If it's not truncated, it may contain structures that anytotx doesn't understand. What's your version of anytotx (anytotx --identify)?
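
A quick way to screen for truncated downloads before conversion is to check the file's first and last bytes. This Python sketch is only a heuristic and is not something anytotx provides; `looks_like_complete_pdf` is a hypothetical name. It relies on the PDF convention that a file begins with `%PDF-` and carries a `%%EOF` marker near its end.

```python
def looks_like_complete_pdf(data: bytes) -> bool:
    """Heuristic truncation check for a downloaded PDF.

    True only if the buffer starts with the %PDF- header and has an
    %%EOF marker within its last 1024 bytes. A True result does not
    prove the file is valid, only that it is not obviously cut off."""
    has_header = data.startswith(b"%PDF-")
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


if __name__ == "__main__":
    # Hypothetical buffers standing in for downloaded files.
    whole = b"%PDF-1.5\n...objects...\nstartxref\n0\n%%EOF\n"
    cut = whole[:20]
    print(looks_like_complete_pdf(whole), looks_like_complete_pdf(cut))
```

Files that fail this check can be re-fetched or skipped instead of being handed to the converter.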

Post by **Texis User** » Fri Dec 01, 2006 11:03 am

We corrected that. This is what we now have in our scripts:

1. <TIMEOUT = 1000>
<$err=0>
<$exret = $ret>
<$exerr = $ret.err>
<$exstderr = $ret.stderr>
<$message = " Timout in script || " $exret " || " $exerr " || " $exstderr><sum %s $message><$message=$ret>
</TIMEOUT>

2. <$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=300">

3. <urlcp timeout 300> 

Version of anytotx: release 20051016 1129521260

bash-3.00$ anytotx --identify
release: 20051016 1129521260
thunderstone: 1
formats: pdf html msw xls mso swf auto other
pdf: 3.00.04 (supporting PDF 1.5)
metaok: 1
features: meta links images rules timeout