anytotx difficulties

Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Hi,
anytotx extracts the text from most PDF documents, but for some it doesn't. When I ran
anytotx -fpdf <pdfdoc.pdf >pdfdoc.txt
it gave me the error: 000 Can't get text from PDF document.
What could be wrong? Does it have something to do with the way the PDF docs were created? anytotx --identify gives:
release: 20010418
thunderstone: 1
formats: pdf html msw swf auto
acrobat: 30
metaok: 1
features: meta links images

thanx,
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

The file may be truncated or otherwise corrupted. Or it may use some new features that the Adobe Acrobat text libraries can't handle (Adobe dropped support for text extraction with Acrobat 4 so Acrobat 4 or 5 files may have problems). The latest version of Texis comes with a plugin that does not rely on Adobe libraries and therefore can better handle Acrobat 4 and 5 documents.
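A quick way to rule out the "truncated" case before blaming the extractor: a complete PDF begins with a `%PDF-` header and carries a `%%EOF` marker near its end. The sketch below checks only those two things; it is not a validator, and `looks_complete_pdf` is a hypothetical helper name, not part of Texis or anytotx.

```shell
#!/bin/sh
# Rough truncation check for a PDF file (a sketch, not a full validator).
# looks_complete_pdf is a hypothetical helper, not an anytotx feature.
looks_complete_pdf() {
    file=$1
    # A PDF starts with the "%PDF-" header.
    head -c 5 "$file" | LC_ALL=C grep -q '^%PDF-' || return 1
    # The "%%EOF" marker should appear within the last 1024 bytes.
    tail -c 1024 "$file" | LC_ALL=C grep -q '%%EOF' || return 1
    return 0
}
```

For example, `looks_complete_pdf pdfdoc.pdf && anytotx -fpdf <pdfdoc.pdf >pdfdoc.txt` would skip files that fail the check. A file that passes can still be corrupt internally or use features the extraction library cannot handle.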
Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Another question. Does the plugin extract contents from Microsoft Word 9.0? Other versions of Word are fine, but the plugin could not extract contents from this version. It gives the error: Error translating. Perhaps a truncated download or corrupt file.
But the file is not corrupt. Any clues?
Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Sorry, the document was MS Word 2000. The crawler gives a truncated download error, but when I view the document in the browser or in MS Word I don't see any error. This happens only with a small number of documents, though.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I assume you're using -fmsw to process those word docs?
Is the file smaller than your download limit (-z)? The default is 100k.

Try processing the file by hand to see if you get any more helpful messages:
anytotx -fmsw <yourfile.doc
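To answer the size question directly, the file's byte count can be compared against the limit. This is a minimal sketch: `exceeds_limit` is a hypothetical helper, and reading the "100k" default as 102400 bytes (100 × 1024) is an assumption; use whatever value your walk is actually configured with.

```shell
#!/bin/sh
# Returns success (0) when FILE is larger than LIMIT bytes.
# exceeds_limit is a hypothetical helper, not an anytotx or Texis command.
exceeds_limit() {
    file=$1
    limit=$2
    # Arithmetic expansion strips any padding some wc versions emit.
    size=$(($(wc -c <"$file")))
    [ "$size" -gt "$limit" ]
}
```

For example, `exceeds_limit yourfile.doc 102400 && echo "larger than the assumed default limit"` flags a document that would be cut off during download, which would explain a truncation error that does not appear when the complete file is processed by hand.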
Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Yes, I am using -fmsw and the dowalk script to index documents. The Word document was actually fetched from a URL. Strangely, when I saved the document from that URL to the hard disk and ran anytotx -fmsw <worddoc.doc, it gave me the contents, but not when I used the dowalk script. I had also set <urlcp maxpgsize> to a fairly large amount.
Another thing I noticed is that if the Word document has a tabular structure with data in it, anytotx extracts only the first row of data and ignores the others. It does not even give a truncated file error.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

But was the size of the "saved" document larger than maxpgsize?
Or were there any other errors associated with that url?
We'll have to look into the tabular issue you describe.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Could you upload a copy of the "tabular" document you mentioned to the same directory where you're downloading Texis updates?
Faiz
Posts: 109
Joined: Wed Jan 10, 2001 1:29 pm

Post by Faiz »

Yes, I will, and it will be covered by the non-disclosure agreement between GE and Thunderstone. Also, since it is a GE internal document, it is sent only for the purpose mentioned in the previous postings; it should not be forwarded, and all copies should be destroyed.
I have FTPed the file, but the transfer gave me some warnings. The file name is doc_92596.doc. I can send you an email with the file attached if the uploaded copy is not usable.
Oh, by the way <urlcp maxpgsize 50000000> is set in the script.

Thanx,
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

It looks like the sample file you provided is more like a spreadsheet than a document. We'll have to look into improving our handling of xls format, but for now you can process it with -fother instead of -fmsw.