PDF doc info

Erik
Posts: 21
Joined: Mon Feb 05, 2001 7:37 pm

Post by Erik »

Hi,

We have set up Webinator to search and index PDF files, and it returns the Title field (which is exactly what we need).

My question is: is there any way to search the other fields in the doc info section of PDF files, such as Subject, Keywords, etc.?

Thanking you in advance,
Cheers,
Erik
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

gw's -meta option will also extract from PDF files. The meta fields available from PDFs are:
Author, CreationDate, ModDate, Creator, Producer, Title, Subject, Keywords
sinfanti
Posts: 2
Joined: Mon Mar 11, 2002 10:28 pm

Post by sinfanti »

Can you restrict PDF and other documents to be searched and indexed by Filename and Title only? We will add keywords etc. later, but indexing the whole document is not needed.

TIA
Sal
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

With Webinator 2 you'd have to go back after the walk and clear the Body field with a SQL update statement. Another possibility would be to put a wrapper around anytotx that removes the body content from its output, so gw never stores it.

With Webinator 4 it would be fairly simple to modify the dowalk script to not keep the Body text from the plugin.
sinfanti
Posts: 2
Joined: Mon Mar 11, 2002 10:28 pm

Post by sinfanti »

A wrapper? Could you direct me to some examples of how to do either the update or the wrapper?

TIA
Sal
mark
Site Admin
Posts: 5513
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

For the update, run these after the walk to clear the Body field and then rebuild the index:

texis -s -d /path/to/your/database "update html set Body=''"
gw -d/path/to/your/database -index

The wrapper would involve writing a program or shell script to use as the plugin, which would then call the real plugin and strip the body text from its answer before returning it to gw. Doing this is beyond the scope of free technical support.
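
Purely as an illustration of the wrapper idea, here is a minimal shell sketch. It rests on an unverified assumption: that the real plugin writes its metadata lines first, then a blank line, then the body text. Check what your anytotx actually emits before relying on this. The names REAL_PLUGIN and strip_body, and the path shown, are hypothetical placeholders and not part of Webinator.

```shell
#!/bin/sh
# Hypothetical wrapper sketch, not a supported Webinator component.
# ASSUMPTION: the real plugin prints metadata lines, then one blank line,
# then the body text. Verify this against your plugin's real output.

# Placeholder path to the real plugin; override it via the environment.
REAL_PLUGIN="${REAL_PLUGIN:-/path/to/real/anytotx}"

# Pass through everything up to and including the first blank line,
# then stop, so the body text is never written out.
strip_body() {
    sed '/^$/q'
}

# Call the real plugin with the same arguments gw supplied and filter
# its output. Nothing happens if the plugin is not found at that path.
if [ -x "$REAL_PLUGIN" ]; then
    "$REAL_PLUGIN" "$@" | strip_body
fi
```

If the plugin's output is laid out differently, only strip_body needs to change. The rest just forwards the arguments and standard input unchanged.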