anytotx difficulties

Post by **Faiz** » Tue Dec 04, 2001 12:02 pm

Hi,
anytotx extracts the text from most PDF docs, but for some it doesn't. When I ran
anytotx -fpdf <pdfdoc.pdf >pdfdoc.txt
it gave me an error: 000 Can't get text from PDF document.
What could be wrong? Does it have something to do with the way the PDF docs are created? anytotx --identify gives:
release: 20010418
thunderstone: 1
formats: pdf html msw swf auto
acrobat: 30
metaok: 1
features: meta links images

thanx,

Post by **mark** » Tue Dec 04, 2001 1:17 pm

The file may be truncated or otherwise corrupted. Or it may use some new features that the Adobe Acrobat text libraries can't handle (Adobe dropped support for text extraction with Acrobat 4 so Acrobat 4 or 5 files may have problems). The latest version of Texis comes with a plugin that does not rely on Adobe libraries and therefore can better handle Acrobat 4 and 5 documents.
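
To tell the two cases above apart (truncated file vs. newer PDF features), one option is to look at the first and last bytes of the file. This is a minimal sketch in Python, assuming the standard PDF layout: a `%PDF-x.y` header near the start and an `%%EOF` marker near the end. The function name `inspect_pdf` is hypothetical and not part of anytotx. PDF 1.3 corresponds to Acrobat 4 and PDF 1.4 to Acrobat 5, so a header version above 1.2 hints at the "new features" case.

```python
import re


def inspect_pdf(data: bytes) -> dict:
    """Report the PDF header version and whether the file looks truncated.

    Heuristic only: it checks the header and the end-of-file marker,
    not the internal structure of the document.
    """
    # The header normally sits at the very start, e.g. b"%PDF-1.4";
    # search the first 1024 bytes to tolerate leading junk.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    version = (int(match.group(1)), int(match.group(2))) if match else None

    # A complete PDF ends with %%EOF, possibly followed by whitespace,
    # so look for the marker in the last 1024 bytes.
    has_eof = b"%%EOF" in data[-1024:]

    return {
        "version": version,
        "truncated": not has_eof,
        # Assumption: versions above 1.2 were written by Acrobat 4 or later.
        "newer_than_acrobat3": version is not None and version > (1, 2),
    }
```

Reading `pdfdoc.pdf` in binary mode and passing its bytes to `inspect_pdf` gives a quick hint; a file that passes both checks can still be corrupt internally.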

Post by **Faiz** » Tue Dec 04, 2001 4:31 pm

Another question. Does the plugin extract contents from Microsoft Word 9.0? Other versions of Word are fine, but the plugin could not extract contents from this version. It gives the error: Error translating. Perhaps a truncated download or corrupt file.
But the file is not corrupt. Any clues?

Post by **Faiz** » Tue Dec 04, 2001 4:40 pm

Sorry, the document was MS Word 2000. The crawler gives a truncated download error, but when I view the document in the browser or in MS Word, I don't see any error. This happens with only a small number of documents, though.

Post by **mark** » Tue Dec 04, 2001 4:47 pm

I assume you're using -fmsw to process those Word docs?
Is the file smaller than your download limit (-z)? The default is 100k.

Try processing the file by hand to see if you get any more helpful messages:
anytotx -fmsw <yourfile.doc
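
Before rerunning by hand, it can help to rule out the size limit asked about above. This is a minimal sketch in Python, assuming "100k" means 100 × 1024 bytes (the post does not say whether the figure is 1000- or 1024-based). `exceeds_limit` and `DEFAULT_LIMIT` are hypothetical names, not part of anytotx or the crawler.

```python
import os

# Assumption: the "100k" default is interpreted here as 100 KiB.
DEFAULT_LIMIT = 100 * 1024


def exceeds_limit(path: str, limit: int = DEFAULT_LIMIT) -> bool:
    """Return True if the file at `path` is larger than `limit` bytes."""
    return os.path.getsize(path) > limit
```

If `exceeds_limit("yourfile.doc")` returns True for a locally saved copy, a download cut off at the limit would explain a "truncated download" error even though the saved file opens fine.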

Post by **Faiz** » Tue Dec 04, 2001 5:12 pm

Yes, I am using -fmsw and the dowalk script to index documents. The Word document was actually fetched from a URL. Strangely, when I saved the document from that URL to the hard disk and ran anytotx -fmsw <worddoc.doc, it gave me the contents, but not when I used the dowalk script. I had also set <urlcp maxpgsize> to a fairly large value.
Another thing I noticed: if the Word document has a tabular structure with data in it, anytotx extracts only the first row of data and ignores the others. It does not even give a truncated file error.

Post by **mark** » Tue Dec 04, 2001 5:31 pm

But was the size of the "saved" document larger than maxpgsize?
Or were there any other errors associated with that url?
We'll have to look into the tabular issue you describe.

Post by **mark** » Tue Dec 04, 2001 5:32 pm

Could you upload a copy of the "tabular" document you mentioned to the same directory where you're downloading Texis updates?

Post by **Faiz** » Tue Dec 04, 2001 6:18 pm

Yes, I will, and it will be under the non-disclosure agreement between GE and Thunderstone. Also, since it is a GE internal document, it is sent only for the purpose mentioned in previous postings; this document should not be forwarded and all copies of it should be destroyed.
I have ftp-ed the file but it gave me some warnings. The file name is doc_92596.doc. I can send you an email with an attachment if the uploaded file is not usable.
Oh, by the way, <urlcp maxpgsize 50000000> is set in the script.

Thanx,

Post by **mark** » Wed Dec 05, 2001 10:44 am

It looks like the sample file you provided is more like a spreadsheet than a document. We'll have to look into improving our handling of xls format, but for now you can process it with -fother instead of -fmsw.
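
A file with a .doc extension can be checked roughly for spreadsheet content before choosing between -fmsw and -fother. This is a heuristic sketch in Python, not how anytotx detects formats. It assumes the file is an OLE2 compound document (the container used by both Word and Excel 97+ binary files) and searches for the UTF-16LE stream names `WordDocument` and `Workbook`. The function name `guess_office_kind` is hypothetical. A Word file with an embedded spreadsheet contains both names and is reported as "word".

```python
# Signature shared by all OLE2 compound files (.doc, .xls, .ppt, ...).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Stream names are stored as UTF-16LE in the compound file directory.
WORD_STREAM = "WordDocument".encode("utf-16-le")
SHEET_STREAM = "Workbook".encode("utf-16-le")


def guess_office_kind(data: bytes) -> str:
    """Guess whether OLE2 file contents are a Word document or a spreadsheet.

    Returns "word", "spreadsheet", "unknown", or "not-ole2".
    This is a byte search, not a real parse of the directory structure.
    """
    if not data.startswith(OLE2_MAGIC):
        return "not-ole2"
    if WORD_STREAM in data:
        return "word"
    if SHEET_STREAM in data:
        return "spreadsheet"
    return "unknown"


# Flags taken from the thread: -fmsw for Word files, -fother as the
# suggested workaround for spreadsheet-like files.
FLAG_FOR_KIND = {"word": "-fmsw", "spreadsheet": "-fother"}
```

For a result of "unknown" or "not-ole2", `FLAG_FOR_KIND.get(kind)` returns None, leaving the choice of format flag to the operator.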