Search Word or PDF files

Post by **vaibhav.choksey** » Wed Apr 11, 2001 11:37 am

I am working on integrating Vignette with Texis search.
I am supposed to search the text inside MS Word and PDF files, and I have no idea how it works. If somebody could e-mail the manuals or some sample code for this, that would be great.
Thanks

-Vaibhav

Post by **mark** » Wed Apr 11, 2001 12:18 pm

See http://www.thunderstone.com/site/texisman/node64.html
Or you can run anytotx on the Word and PDF files manually with exec.
Or, if you're using gw, you need to use the -n option.
There's also a file called anytotx.txt in the same directory as anytotx that you can read.

Post by **vaibhav.choksey** » Fri May 11, 2001 10:36 am

Hi!
I have some basic questions:
1) How do I index PDF or Word files?
2) How do I create a Metamorph index for a PDF or Word file?
3) Is there any way I can upload a file to the Texis server using a script?
4) Should I execute anytotx from a script?
Thanks
-vaibhav

Post by **mark** » Fri May 11, 2001 10:46 am

1,4) You index the text of those types of files. anytotx is used to extract the text from them, which can then be inserted into a Texis table.
<exec anytotx $pdffile></exec><$text=$ret>
<exec anytotx -fmsw $wordfile></exec><$text=$ret>
2) Same as on any other varchar field.
3) Yes, see http://www.thunderstone.com/site/vortexman/node16.html
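
For anyone driving anytotx from outside Vortex, the two calls above can be sketched in Python. This is only an illustration under assumptions: it assumes anytotx is on the PATH and writes the extracted text to standard output (which is what capturing it in $ret suggests), and the helper names `anytotx_command` and `extract_text` are made up for this example. The only flag used is the -fmsw one shown above.

```python
import subprocess


def anytotx_command(path, msword=False, exe="anytotx"):
    """Build the argument list for one anytotx call (hypothetical helper)."""
    cmd = [exe]
    if msword:
        # Flag copied from the Word example above; no other flags are assumed.
        cmd.append("-fmsw")
    cmd.append(path)
    return cmd


def extract_text(path, msword=False, exe="anytotx"):
    """Run anytotx and return whatever it writes to stdout.

    Assumption: the extracted text arrives on stdout. Undecodable bytes
    are replaced rather than raising, since the output encoding is unknown.
    """
    result = subprocess.run(
        anytotx_command(path, msword, exe),
        capture_output=True,
        check=True,  # raise CalledProcessError if anytotx exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `extract_text("report.pdf")` (a placeholder file name) would then return the text to insert into the table.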

Post by **vaibhav.choksey** » Tue May 15, 2001 4:57 pm

Hi there!
I got searching through Word, XLS and PPT files working, but it doesn't search through Vignette files.
I have a field of type INDIRECT in which I have stored the file path, and when I search for a word that exists in the PDF files, it doesn't return any results.
Is there anything else I need to do? Or do I need to upgrade the Texis software?

Post by **mark** » Tue May 15, 2001 5:17 pm

Does the indirect point to the original PDF or the text extracted using anytotx? You should be doing queries against the extracted text. If you still have problems you need to provide a small outline of your table and load and search procedures for us to be able to help further.

Post by **vaibhav.choksey** » Wed May 16, 2001 10:33 am

Hi Mark!
Yes, I have a Metamorph index on the INDIRECT field "FILE_PATH", and it points directly to the PDF file. The same field points directly to Word, PowerPoint and Excel files and searches through them, but it doesn't search through PDF files. I am not using "anytotx", as it's not necessary because the field type is INDIRECT. Can you give me some feedback on this? The table structure is something like this:
<SQL "create table cmp_Search (ID INTEGER NOT NULL, PARENT_ID INTEGER NOT NULL, PARENT_TYPE_ID INTEGER NOT NULL, TITLE CHAR(50), DESCRIPTION_TEXT VARCHAR(3000), NAME_ADDRESS CHAR(1000), FILE_PATH INDIRECT, CREATED_DATE DATE, PRIMARY KEY (ID))">
</SQL>

Post by **mark** » Wed May 16, 2001 10:46 am

There is no relationship between using indirect and the need to use anytotx. An indirect simply means that the data is in an external file instead of directly in the table. The data is treated the same for searching either way.

Word files generally have the text visible within the file. When you search one, you are searching the text and all of the encoding around it. PDF files do not have the text visible. The only way to get the text is with anytotx. Anytotx will also get rid of the encoding around the text in Word files.
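
A small Python sketch of why a search against the raw bytes can miss. PDF content streams are commonly Flate (zlib) compressed, so here `zlib.compress` stands in for what a PDF writer might do to a page's text. This is an illustration only, not a PDF parser, and real PDFs add font encodings on top of the compression.

```python
import zlib

# Sample "page text"; repeated so that compression clearly pays off.
page_text = b"Metamorph index searches the extracted text. " * 20
word = b"Metamorph"

# Stand-in for a compressed content stream inside a PDF file.
stored = zlib.compress(page_text)

# Searching the stored bytes versus searching the recovered text.
found_in_raw = word in stored
found_in_extracted = word in zlib.decompress(stored)

print("found in raw bytes:     ", found_in_raw)
print("found in extracted text:", found_in_extracted)
```

The raw-bytes check should come back negative while the extracted-text check succeeds, which mirrors the symptom described in this thread.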

You need to create another field for the "text" of the document and populate that with the output of anytotx. That's the field that should be searched, not the raw file.
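
The layout can be sketched roughly as follows, in Python. Everything here is a stand-in and an assumption: SQLite replaces Texis, a plain `LIKE` substring match replaces a Metamorph query, `fake_extract` replaces anytotx, and the table and column names (`docs`, `file_path`, `body`) are invented for the example. The only point is that the path and the extracted text live in separate fields, and the search runs against the text.

```python
import sqlite3

# Hypothetical stand-in for anytotx: maps a file path to its extracted text.
FAKE_EXTRACTED = {
    "/docs/a.pdf": "Invoice totals for the second quarter",
    "/docs/b.doc": "Meeting notes from the search team",
}


def fake_extract(path):
    return FAKE_EXTRACTED[path]


def load(conn, doc_id, path, extract):
    """Store the file path and, in a separate column, the extracted text."""
    conn.execute(
        "INSERT INTO docs (id, file_path, body) VALUES (?, ?, ?)",
        (doc_id, path, extract(path)),
    )


def search(conn, word):
    """Query the extracted text column, never the raw file."""
    rows = conn.execute(
        "SELECT file_path FROM docs WHERE body LIKE ? ORDER BY id",
        ("%" + word + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, file_path TEXT, body TEXT)"
)
for doc_id, path in enumerate(FAKE_EXTRACTED, start=1):
    load(conn, doc_id, path, fake_extract)

print(search(conn, "invoice"))
```

In the real table the file path field can stay as it is for retrieving the original document; only the search target changes.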

Post by **vaibhav.choksey** » Sun Jun 24, 2001 3:24 pm

I got it working.
Thanks
-Vaibhav