I have used the PDF filter to index websites as well as about 110,000
pages of a document archive (in conjunction with commercial Texis), and I
can say that it is very fast and does an excellent job of extracting all of
the text from the documents.
My only complaint about using the PDF filter with Webinator is this:
When using gw to index a site, you can use the -z flag to set the
truncation limit based on document size. If you have PDF documents that
are very large because of graphic material, you have to set the -z limit
*very* high for these documents to be indexed at all.
When the filter runs, it stores the parsed text in a temp file before it is
loaded into the database. If gw hits the -z limit on a PDF document, the
temp file is lost and no text is inserted into the database for that
particular document. I think it would be better if the -z flag applied to
the extracted text rather than the document size, or at least if some other
flag could enable that behavior.
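
To make the difference concrete, here is a rough Python sketch of the two
policies. It is only an illustration of the idea, not how gw or the PDF
filter is actually implemented; the function names, the byte-based limit,
and the truncate-instead-of-drop behavior are all my own assumptions.

```python
from typing import Optional


def index_by_document_size(raw_size_bytes: int, extracted_text: str,
                           limit_bytes: int) -> Optional[str]:
    """Hypothetical model of the behavior described above: the limit is
    checked against the size of the original file, and an oversized
    file contributes no text at all."""
    if raw_size_bytes > limit_bytes:
        return None  # extracted text is discarded
    return extracted_text


def index_by_extracted_text(raw_size_bytes: int, extracted_text: str,
                            limit_bytes: int) -> str:
    """Hypothetical model of the suggested behavior: the limit is checked
    against the extracted text, which is truncated rather than dropped.
    raw_size_bytes is accepted only to keep the signatures parallel."""
    encoded = extracted_text.encode("utf-8")
    if len(encoded) <= limit_bytes:
        return extracted_text
    # Cut at the byte limit; drop any partial multi-byte character.
    return encoded[:limit_bytes].decode("utf-8", errors="ignore")


# A 5 MB PDF that is mostly graphics and yields only a little text.
text = "short caption text"
print(index_by_document_size(5_000_000, text, 100_000))
print(index_by_extracted_text(5_000_000, text, 100_000))
```

Under the first policy the graphics-heavy document is skipped even though
its text is tiny; under the second it is indexed in full, and only a
document with a genuinely large amount of text would be truncated.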
Kevin
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kevin Ward, SSAI
MODIS Digital Library Manager
Earth Observatory Technical Manager
NASA Goddard Space Flight Center
Code 922, Greenbelt, MD 20771
(301) 286-9179
kevin.ward@gsfc.nasa.gov
Work to become, not to acquire
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^