Thunderstone Support Forums

Posted: **Fri May 18, 2001 9:54 am**

Hi,

We have set up Webinator to search and index PDF files, and it returns the Title field (which is exactly what we need).

My question is: is there any way to search the other fields in the document info section of PDF files, such as Subject, Keywords, etc.?

Thanking you in advance,
Cheers,
Erik

Posted: **Fri May 18, 2001 10:20 am**

gw's -meta option will also extract from PDF files. The metas available from PDFs are:
Author CreationDate ModDate Creator Producer Title Subject Keywords

Posted: **Mon Mar 11, 2002 10:36 pm**

Can you restrict PDF and other documents so that they are searched and indexed by Filename and Title only? We will add keywords etc. later, but indexing the whole document is not needed.

TIA
Sal

Posted: **Mon Mar 11, 2002 11:34 pm**

With Webinator 2 you'd have to go back after the walk and clear the Body field with a SQL update statement. Another possibility would be to put a wrapper around anytotx that removes the body content from its output, so gw never stores it.

With Webinator 4 it would be fairly simple to modify the dowalk script to not keep the Body text from the plugin.

Posted: **Tue Mar 12, 2002 9:28 am**

A wrapper? Could you direct me to some examples of how to do either the update or "the wrapper"?

TIA
Sal

Posted: **Tue Mar 12, 2002 10:17 am**

texis -s -d /path/to/your/database "update html set Body=''"
gw -d/path/to/your/database -index

The wrapper would involve writing a program or shell script to use as the plugin, which would then call the real plugin and strip the body text from its output before returning it to gw. Doing this is beyond the scope of free technical support.
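
For readers wondering what such a wrapper might look like, here is a minimal sketch in Python. It is not Thunderstone's code, and it rests on assumptions that are not confirmed anywhere in this thread: that the real plugin is called with the file to convert as a command-line argument, and that it prints the meta fields first, then a blank line, then the body. The path in `REAL_PLUGIN` and the names `strip_body` and `main` are hypothetical. Check the plugin's actual output format before relying on anything like this.

```python
#!/usr/bin/env python3
"""Hypothetical plugin wrapper: run the real converter, drop the body text."""
import subprocess
import sys

# ASSUMPTION: location of the real plugin; adjust for your installation.
REAL_PLUGIN = "/path/to/anytotx"


def strip_body(output: str) -> str:
    """Keep everything up to and including the first blank line.

    ASSUMPTION: the plugin prints meta fields, a blank line, then the body.
    If no blank line is present, the output is returned unchanged.
    """
    head, sep, _body = output.partition("\n\n")
    return head + sep


def main(argv):
    # Forward all arguments to the real plugin; stdin is inherited as-is.
    result = subprocess.run(
        [REAL_PLUGIN, *argv[1:]],
        stdout=subprocess.PIPE,
        check=False,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    sys.stdout.write(strip_body(text))
    # Report the real plugin's exit status to the caller.
    return result.returncode


# ASSUMPTION: the caller always passes at least one argument (the file).
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

The filtering is kept in its own function so it can be tried on a saved sample of the plugin's output before the wrapper is configured as the plugin.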

PDF doc info
