utf errors thrown while downloading content.

Texis User
Posts: 74
Joined: Thu Jul 13, 2006 8:47 am

Post by Texis User »

Hi,

We have a script that downloads file contents.
Lately we have been seeing errors like these:

000 2006-11-24 22:34:18 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 155690 in the function htutf16_to_utf8
000 2006-11-24 22:35:31 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 1826 in the function htutf16_to_utf8
000 2006-11-24 22:37:03 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 774 in the function htutf16_to_utf8
000 2006-11-24 22:41:35 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 4706 in the function htutf16_to_utf8
000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
10



This seems to cause CPU spikes and locks on the table.

Any suggestions?

Post by Texis User »

Appreciate all the help I can get on this.

Thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Sounds like those are improperly encoded pages. They need to be fixed or the errors will continue.

No locks are used during conversion though, so that's unrelated. If you're using Webinator, set Parallelism:Threads to 1 to prevent the processing of multiple pages from affecting each other.
Kai
Site Admin
Posts: 1272
Joined: Tue Apr 25, 2000 1:27 pm

Post by Kai »

Can you post a public URL to one of the pages that gives a UTF-16 conversion error, as well as the specific message for that URL? It is most likely due to an erroneously encoded UTF-16 sequence, as mentioned.
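
For an intranet page that cannot be shared, one way to confirm a bad UTF-16 sequence locally is to decode a saved copy of the raw bytes and see where decoding fails. The following is a minimal Python sketch, not part of Texis; the function name is hypothetical, and the offset Python reports may not match the "source offset" in the Texis message exactly (for example if a byte-order mark is counted differently).

```python
def first_utf16_error(data: bytes, encoding: str = "utf-16"):
    """Return the byte offset of the first sequence that fails to
    decode as UTF-16, or None if the whole buffer decodes cleanly.

    Assumption: `data` is the raw page body as fetched, before any
    charset conversion. Pass "utf-16-le" or "utf-16-be" when the byte
    order is known and there is no byte-order mark.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        return exc.start
    return None
```

Typical failures are an unpaired surrogate or an odd number of bytes, which often means the page is really in another charset but labelled UTF-16.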

Post by Texis User »

No public URLs; it's an intranet site.

Another thing we observe, as I pasted in my earlier message, is a lot of errors like these:

000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
10


At these times I have lots of anytotx processes running and an unreasonable spike not only in CPU but in /tmp usage too.

Line 516 corresponds to the block where I call anytotx.

Also, is there any way I could exit gracefully when a timeout occurs in my script?

I have set 300 seconds for fetch timeouts but still see errors when my script tries to read a URL.

Post by mark »

Large complicated files being processed? Some formats are particularly memory and processor intensive to convert to text.

Depends what you mean by gracefully. There's the <timeout></timeout> directive which lets you output a message when the timeout occurs. Personally I'd rather give a --timeout option to anytotx that was shorter than my script timeout so the script regains control when the conversion takes too long.
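
The general pattern of giving the child process a shorter deadline than the caller can be sketched outside Vortex as follows. This is an illustration in Python under stated assumptions, not how the poster's script works: `run_with_timeout` is a hypothetical helper, and the demo command is a stand-in for the real converter invocation.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` and return (finished, stdout_bytes).

    If the child exceeds `timeout_s` seconds, subprocess.run kills it
    and raises TimeoutExpired; we report that as finished=False so the
    caller keeps control and can log the failure and move on.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, b""
    return True, done.stdout


if __name__ == "__main__":
    # Stand-in for a converter call; the child deadline (30 s) would be
    # chosen well below the overall script timeout.
    ok, out = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)
    print(ok, out)
```

The point of the design is that the inner deadline fires first, so the outer script timeout is only a last resort.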

Post by Texis User »

How do you set the timeout for anytotx?

Post by Texis User »

Actually, I have the timeout set as follows:
<$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=-1"><!-- $ret> -->

The script timeout is 1000 and the fetch timeout is 300.

I am confused now with these timeout errors...


Another thing: I am seeing these errors:

1. Cannot pdf open /tmp/cvti00323a in the function do_epipdf_file

2. anytotx (21976) ABEND: signal 11 (SIGSEGV); exiting

Post by mark »

You're setting no timeout for anytotx so it can run longer than your script timeout. You pretend to use $parse_timeout but then proceed to ignore that and use -1.

What does wrapanytotx do?

That error indicates that the PDF is not convertible for some reason. If the PDF download is truncated, it can't be processed. If it's not truncated, it may contain structure that anytotx doesn't understand. What's your version of anytotx (anytotx --identify)?
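
A quick way to tell the truncated case from the unsupported-structure case is to look at the saved file itself: a complete PDF normally begins with a `%PDF-` header and has an `%%EOF` marker near the end. The check below is a rough heuristic in Python, with a hypothetical function name and an assumed 1024-byte tail window; it cannot prove a file is valid, only flag downloads that were obviously cut short.

```python
def looks_truncated_pdf(data: bytes, tail_window: int = 1024) -> bool:
    """Heuristically report whether `data` looks like a cut-off PDF.

    Returns True when the %PDF- header is present but no %%EOF marker
    appears within the last `tail_window` bytes. Returns False for
    files that have both markers, and also for data that does not look
    like a PDF at all (that is a different problem, not truncation).
    """
    if not data.startswith(b"%PDF-"):
        return False
    return b"%%EOF" not in data[-tail_window:]
```

If the file passes this check and still fails to convert, the structure-not-understood explanation becomes more likely.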

Post by Texis User »

We corrected that. This is what we now have in our scripts:

1. <TIMEOUT = 1000>
<$err=0>
<$exret = $ret>
<$exerr = $ret.err>
<$exstderr = $ret.stderr>
<$message = " Timout in script || " $exret " || " $exerr " || " $exstderr><sum %s $message><$message=$ret>
</TIMEOUT>

2. <$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=300"><!-- $ret> -->

3. <urlcp timeout 300> <!-- page timeout -->


version of anytotx - release: 20051016 1129521260

bash-3.00$ anytotx --identify
release: 20051016 1129521260
thunderstone: 1
formats: pdf html msw xls mso swf auto other
pdf: 3.00.04 (supporting PDF 1.5)
metaok: 1
features: meta links images rules timeout