We have a script that downloads file contents.
Lately we have been seeing errors like these:
000 2006-11-24 22:34:18 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 155690 in the function htutf16_to_utf8
000 2006-11-24 22:35:31 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 1826 in the function htutf16_to_utf8
000 2006-11-24 22:37:03 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 774 in the function htutf16_to_utf8
000 2006-11-24 22:41:35 [ssologin]:15: Cannot completely convert charset UTF-16 to UTF-8: Invalid character sequence at source offset 4706 in the function htutf16_to_utf8
000 2006-11-24 22:55:06 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:55:09 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
000 2006-11-24 22:56:24 /opt/texis/searchscripts/folders/tools/downContentProcess:516: Timeout
Sounds like those are improperly encoded pages. They need to be fixed or the errors will continue.
No locks are used during conversion though, so that's unrelated. If you're using Webinator, set Parallelism:Threads to 1 to prevent the processing of multiple pages from affecting each other.
Can you post a public URL to one of the pages that gives a UTF-16 conversion error, as well as the specific message for that URL? It is most likely due to an erroneously encoded UTF-16 sequence, as mentioned.
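To illustrate what an erroneously encoded UTF-16 sequence looks like, here is a minimal Python sketch (not Texis code). It assumes little-endian UTF-16 without a BOM, and the helper name first_bad_utf16_offset is made up for this example. The byte offset it reports is Python's own and may not line up exactly with the "source offset" in the Texis message.

```python
def first_bad_utf16_offset(data: bytes, encoding: str = "utf-16-le"):
    """Return the byte offset of the first undecodable sequence, or None if clean."""
    try:
        data.decode(encoding)  # strict decoding raises on malformed input
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Valid text, then a lone low surrogate (0xDC00) with no high surrogate before it.
good = "ab".encode("utf-16-le")
bad = good + b"\x00\xdc" + "c".encode("utf-16-le")

print(first_bad_utf16_offset(good))
print(first_bad_utf16_offset(bad))
```

Running something like this over a saved copy of the failing page is one way to confirm that the page itself, rather than the converter, is at fault.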
Large complicated files being processed? Some formats are particularly memory and processor intensive to convert to text.
Depends what you mean by gracefully. There's the <timeout></timeout> directive which lets you output a message when the timeout occurs. Personally I'd rather give a --timeout option to anytotx that was shorter than my script timeout so the script regains control when the conversion takes too long.
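The idea of giving the converter a shorter timeout than the caller can be sketched in Python as below. This is only an illustration of the pattern, not Vortex and not how anytotx implements its own --timeout option; the helper name run_with_timeout and the stand-in commands are assumptions for the example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its stdout as bytes, or None if it exceeds the timeout."""
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so control returns here.
        return None
    return done.stdout


# Stand-ins for a slow and a fast converter: child Python processes.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "print('text')"]

print(run_with_timeout(slow, 0.5))
print(run_with_timeout(fast, 10))
```

Because the inner limit fires first, the caller gets a chance to log the failure and move on to the next document instead of being killed by its own timeout.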
Actually I have timeout set as
<$wrapanytotx = ($loc_program + "/wrap/wrapanytotx")>
<strfmt "--timeout=%d" $parse_timeout>
<$opt = $opt "--timeout=-1"><!-- $ret> -->
and the script timeout as 1000
and fetch timeout is 300...
I am confused now with these timeout errors...
Another thing: I am also seeing these errors:
1. Cannot pdf open /tmp/cvti00323a in the function do_epipdf_file
2. anytotx (21976) ABEND: signal 11 (SIGSEGV); exiting
You're setting no timeout for anytotx so it can run longer than your script timeout. You pretend to use $parse_timeout but then proceed to ignore that and use -1.
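A small guard of the following kind would have caught the problem. It is a hedged Python sketch rather than Vortex: the function name anytotx_timeout_option is hypothetical, and the only things taken from the thread are the --timeout=%d format and the numbers 300 and 1000.

```python
def anytotx_timeout_option(parse_timeout: int, script_timeout: int) -> str:
    """Build the --timeout option, insisting it is positive and below the script timeout."""
    if parse_timeout <= 0:
        # Per the thread, -1 means no timeout at all, which defeats the purpose.
        raise ValueError("conversion timeout must be positive")
    if parse_timeout >= script_timeout:
        raise ValueError("conversion timeout must be shorter than the script timeout")
    return "--timeout=%d" % parse_timeout


print(anytotx_timeout_option(300, 1000))
```

The point is simply that the value computed from the configured timeout should be the one that reaches the command line, with nothing appended afterwards to override it.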
What does wrapanytotx do?
That error indicates that the PDF is not convertible for some reason. If the PDF download is truncated, it can't be processed. If it's not truncated, it may contain structure that anytotx doesn't understand. What's your version of anytotx (anytotx --identify)?
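A quick way to rule out truncation before blaming the converter is to check for the usual PDF markers: a "%PDF-" header at the start and a "%%EOF" marker near the end. The Python sketch below is a rough heuristic under those assumptions (the function name and the 1024-byte tail window are choices made for this example); passing it does not prove the file is valid, only that it is probably not cut short.

```python
import os
import tempfile


def looks_like_complete_pdf(path: str, tail_bytes: int = 1024) -> bool:
    """Heuristic: file starts with %PDF- and has %%EOF within its last tail_bytes."""
    with open(path, "rb") as fh:
        head = fh.read(5)
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - tail_bytes))
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail


# Demo with dummy files: these carry the markers only and are not real PDFs.
with tempfile.TemporaryDirectory() as tmp:
    whole = os.path.join(tmp, "whole.pdf")
    cut = os.path.join(tmp, "cut.pdf")
    body = b"%PDF-1.5\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
    with open(whole, "wb") as fh:
        fh.write(body)
    with open(cut, "wb") as fh:
        fh.write(body[:20])  # simulate a download that stopped early
    whole_ok = looks_like_complete_pdf(whole)
    cut_ok = looks_like_complete_pdf(cut)

print(whole_ok, cut_ok)
```

If the downloaded file fails a check like this, the fetch step is the place to look; if it passes, the file more likely contains structure the converter does not handle.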
bash-3.00$ anytotx --identify
release: 20051016 1129521260
thunderstone: 1
formats: pdf html msw xls mso swf auto other
pdf: 3.00.04 (supporting PDF 1.5)
metaok: 1
features: meta links images rules timeout