PDF doc info

Posted: Fri May 18, 2001 9:54 am
by Erik
Hi,

We have set up Webinator to search and index PDF files, and it returns the title field (which is exactly what we need).

My question is: is there any way to search the other fields in the doc info section of PDF files, such as subject, keywords, etc.?

Thanking you in advance,
Cheers,
Erik

Posted: Fri May 18, 2001 10:20 am
by mark
gw's -meta option will also extract from PDF files. The metas available from PDFs are:
Author, CreationDate, ModDate, Creator, Producer, Title, Subject, Keywords

Posted: Mon Mar 11, 2002 10:36 pm
by sinfanti
Can you restrict PDF and other documents to be searched and indexed by Filename and Title only? We will add keywords etc. later, but indexing the whole document is not needed.

TIA
Sal

Posted: Mon Mar 11, 2002 11:34 pm
by mark
With Webinator 2 you'd have to go back after the walk and clear the Body field with a SQL update statement. Another possibility would be to put a wrapper around anytotx to remove the body content from its output so gw never stores it.

With Webinator 4 it would be fairly simple to modify the dowalk script to not keep the Body text from the plugin.

Posted: Tue Mar 12, 2002 9:28 am
by sinfanti
A wrapper? Could you direct me to some examples of how to do either the update or "the wrapper"?

TIA
Sal

Posted: Tue Mar 12, 2002 10:17 am
by mark
texis -s -d /path/to/your/database "update html set Body=''"
gw -d/path/to/your/database -index

The wrapper would involve writing a program or shell script to use as the plugin, which would then call the real plugin and strip the body text from its answer before returning it to gw. Doing this is beyond the scope of free technical support.
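
For readers wondering what such a wrapper might look like, here is a rough sketch only, not a supported or tested configuration. It assumes, without confirmation from the Webinator documentation, that the real plugin is invoked with command-line arguments, writes its result to standard output, and separates the meta fields from the body text with a blank line. The REAL_PLUGIN path and the strip_body helper are hypothetical names; check the actual output of anytotx before relying on any of this.

```shell
#!/bin/sh
# Hypothetical wrapper: runs the real plugin and drops the body text.
# ASSUMPTION: the plugin's output is meta lines, then a blank line, then the body.
# ASSUMPTION: REAL_PLUGIN is a placeholder path; point it at the real plugin.
REAL_PLUGIN="${REAL_PLUGIN:-/path/to/real/anytotx}"

# Copy input up to (not including) the first blank line; discard the rest.
strip_body() {
    awk 'NF == 0 { exit } { print }'
}

# Only call the real plugin when arguments were passed to the wrapper.
if [ "$#" -gt 0 ]; then
    "$REAL_PLUGIN" "$@" | strip_body
fi
```

If the plugin's real output format differs (for example, if the body is not preceded by a blank line), only the awk condition in strip_body would need to change.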