utf errors thrown while downloading content.

Posted: Sat Nov 25, 2006 1:49 am
by Texis User
Hi,

We have a script that downloads file contents.
Lately we have been seeing errors like these:

000 2006-11-24 22:34:18 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 155690 in the function htutf16_to_utf8
000 2006-11-24 22:35:31 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 1826 in the function htutf16_to_utf8
000 2006-11-24 22:37:03 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 774 in the function htutf16_to_utf8
000 2006-11-24 22:41:35 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 4706 in the function htutf16_to_utf8
000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout

This seems to cause CPU spikes and locks on the table.

Any suggestions?

Posted: Mon Nov 27, 2006 1:20 pm
by Texis User
Appreciate all the help I can get on this.

Thanks

Posted: Mon Nov 27, 2006 1:56 pm
by mark
Sounds like those are improperly encoded pages. They need to be fixed or the errors will continue.

No locks are used during conversion though, so that's unrelated. If you're using webinator set Parallelism:Threads to 1 to prevent processing of multiple pages from affecting each other.

Posted: Wed Nov 29, 2006 12:02 pm
by Kai
Can you post a public URL to one of the pages that gives a UTF-16 conversion error, as well as the specific message for that URL? It is most likely due to an erroneously encoded UTF-16 sequence, as mentioned.
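
For an intranet page that cannot be shared, one way to locate the bad sequence locally is to decode a saved copy strictly and report where decoding first fails. The following is a minimal Python sketch, not the Texis converter itself: the function name is hypothetical, the byte order (`utf-16-le`) is an assumption, and the offset it reports may not be counted the same way as the "source offset" in the log message.

```python
from typing import Optional


def first_invalid_utf16_offset(data: bytes, encoding: str = "utf-16-le") -> Optional[int]:
    """Return the byte offset of the first undecodable sequence, or None if clean."""
    try:
        data.decode(encoding)  # strict mode raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


good = "abc".encode("utf-16-le")
# A high surrogate (U+D800) followed by an ordinary character is an
# unpaired surrogate, which is one common form of invalid UTF-16.
bad = "ab".encode("utf-16-le") + b"\x00\xd8" + "c".encode("utf-16-le")

print(first_invalid_utf16_offset(good))
print(first_invalid_utf16_offset(bad))
```

Running this over the saved bytes of a failing page would show whether the page really contains malformed UTF-16 or is simply mislabeled (for example, UTF-8 or Latin-1 content served with a UTF-16 charset).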

Posted: Thu Nov 30, 2006 3:36 pm
by Texis User
No public URLs; it's an intranet site.

Another thing we observe, as I pasted in the last message, is a lot of errors like these:

000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout

At this time I have lots of anytotx scripts running and an unreasonable spike not only in CPU but in /tmp usage too.

Line 516 corresponds to the block where I call anytotx.

Also, is there any way I could exit gracefully when a timeout occurs in my script?

I have set 300 seconds for fetch timeouts but still see errors when my script tries to read a URL.

Posted: Thu Nov 30, 2006 5:12 pm
by mark
Large complicated files being processed? Some formats are particularly memory and processor intensive to convert to text.

Depends what you mean by gracefully. There's the <timeout></timeout> directive which lets you output a message when the timeout occurs. Personally I'd rather give a --timeout option to anytotx that was shorter than my script timeout so the script regains control when the conversion takes too long.
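
The idea of giving the converter a shorter deadline than the calling script can be sketched in Python as follows. This is an illustration of the pattern only, not Vortex and not anytotx's own `--timeout` handling: `run_converter` is a hypothetical helper, and the two commands are stand-ins for a slow and a fast conversion.

```python
import subprocess
import sys
from typing import List, Optional


def run_converter(cmd: List[str], child_timeout: float) -> Optional[str]:
    """Run cmd; return its stdout, or None if it exceeded child_timeout seconds."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=child_timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so no
        # runaway converter process is left behind.
        return None
    return done.stdout


# Stand-in commands (hypothetical): a slow "conversion" and a fast one.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "print('converted text')"]

slow_result = run_converter(slow, child_timeout=0.5)
fast_result = run_converter(fast, child_timeout=10)

print(slow_result)
print(fast_result)
```

Because the child's limit expires first, the caller gets control back with a clear "no result" value and can log the document and move on, instead of being cut off by its own overall timeout.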

Posted: Fri Dec 01, 2006 7:34 am
by Texis User
How do you set the timeout for anytotx?

Posted: Fri Dec 01, 2006 8:16 am
by Texis User
Actually, I have the timeout set as:
<$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=-1"><!-- $ret> -->

and the script timeout as 1000
and fetch timeout is 300...

I am confused now with these timeout errors...


Another thing: I am seeing these errors:

1. Cannot pdf open /tmp/cvti00323a in the function do_epipdf_file

2. anytotx (21976) ABEND: signal 11 (SIGSEGV); exiting

Posted: Fri Dec 01, 2006 10:57 am
by mark
You're setting no timeout for anytotx so it can run longer than your script timeout. You pretend to use $parse_timeout but then proceed to ignore that and use -1.

What does wrapanytotx do?

That error indicates that the PDF is not convertible for some reason. If the PDF download is truncated it can't be processed. If it's not truncated it may contain a structure that anytotx doesn't understand. What's your version of anytotx (anytotx --identify)?
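
On the first point above (a timeout value that is computed and then overridden by a hard-coded one), a rough Python sketch of deriving the option from a single variable might look like this. The helper name and its checks are assumptions made for illustration; only the `--timeout=` option itself comes from this thread.

```python
from typing import List


def build_anytotx_options(parse_timeout: int, script_timeout: int) -> List[str]:
    """Build the converter options from one timeout value, with sanity checks."""
    if parse_timeout <= 0:
        # Per the discussion above, -1 leaves the converter without a limit.
        raise ValueError("parse_timeout must be positive so the converter is bounded")
    if parse_timeout >= script_timeout:
        raise ValueError("parse_timeout must be shorter than script_timeout")
    return [f"--timeout={parse_timeout}"]


# Values taken from the thread: converter limit 300, script limit 1000.
options = build_anytotx_options(parse_timeout=300, script_timeout=1000)
print(options)

# The original setting (-1) is rejected instead of being passed through silently.
try:
    build_anytotx_options(parse_timeout=-1, script_timeout=1000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Keeping one source of truth for the value avoids the situation where a variable such as `$parse_timeout` is formatted but a different literal is what actually reaches the command line.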

Posted: Fri Dec 01, 2006 11:03 am
by Texis User
We corrected that; now this is what we have in our scripts:

1. <TIMEOUT = 1000>
<$err=0>
<$exret = $ret>
<$exerr = $ret.err>
<$exstderr = $ret.stderr>
<$message = " Timout in script || " $exret " || " $exerr " || " $exstderr><sum %s $message><$message=$ret>
</TIMEOUT>

2. <$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=300"><!-- $ret> -->

3. <urlcp timeout 300> <!-- page timeout -->


version of anytotx - release: 20051016 1129521260

bash-3.00$ anytotx --identify
release: 20051016 1129521260
thunderstone: 1
formats: pdf html msw xls mso swf auto other
pdf: 3.00.04 (supporting PDF 1.5)
metaok: 1
features: meta links images rules timeout
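
As a small aid for checking an installation, the `--identify` output above can be parsed to confirm that the build lists `timeout` among its features. This is a minimal Python sketch under the assumption that the output keeps the `key: value` line layout shown here; `parse_identify` is a hypothetical helper, not part of anytotx.

```python
from typing import Dict, List


def parse_identify(output: str) -> Dict[str, List[str]]:
    """Parse 'key: value ...' lines into a dict of whitespace-split values."""
    info: Dict[str, List[str]] = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key:' prefix
            info[key.strip()] = value.split()
    return info


# Sample text copied from the output posted above.
sample = """release: 20051016 1129521260
thunderstone: 1
formats: pdf html msw xls mso swf auto other
pdf: 3.00.04 (supporting PDF 1.5)
metaok: 1
features: meta links images rules timeout
"""

info = parse_identify(sample)
supports_timeout = "timeout" in info.get("features", [])
print(supports_timeout)
```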