Indexing PDF with pdftotx

Post by **rjacoby** » Thu Apr 18, 2002 12:25 pm

First some background:

I just inherited my company's search engine, which has been on the back burner for quite a few years. I have no information about its history. We purchased Webinator about 4 years ago, including the pdftotx plugin.

gw -version:

Webinator WWW Site Indexer Version 2.51 (Commercial)
Copyright(c) 1995,1996,1997,1998 Thunderstone EPI Inc.
Release: 1999011

The Unix date stamp on our pdftotx file is Oct. 22, 98. I don't know how to get the actual version. We're running HP-UX 11.0.

------
The problem I'm having is indexing PDF files. I created a PDF doc in Acrobat 5.0 format and tried indexing it. It indexed the text fine but didn't do anything with the meta tags (title, keywords, etc). I ran a gw -st against the html table in the db and the metatag field came up empty. The title field came back with something like "PDF Document (90k)". I've verified that there is metadata (title, keywords, etc) in the PDF file. I've also verified that my gw command is indexing meta information correctly by indexing an HTML file with meta tags (it worked).

I'm guessing that the problem is that Adobe changed their metadata API/format and the ancient pdftotx can't read that information anymore. Is this accurate? Is there an updated version of pdftotx that can read Acrobat 5 and earlier PDFs?

Thanks,
Bob

Post by **mark** » Thu Apr 18, 2002 12:55 pm

That version is not capable of extracting any meta info from PDF files and may have general problems with newer PDF files. Webinator version 4 and its plugin will do those things. Order an update from the Webinator order page, or contact sales from the "Contact Us" page.

Post by **rjacoby** » Thu Apr 18, 2002 2:49 pm

Mark,

Thanks for your quick response.

But I'm confused. On this board, "Webinator old 2.5", other people are successfully extracting the data, or at least the title information, which is our biggest issue right now.

Are you saying that my company's plugin is so old that it doesn't allow this; that all these other people have a newer version of the plugin; and that you don't provide an update (either for free or for an update fee) to the PDF plugin for Webinator v2.5 that would allow us to do this?

I understand we can upgrade to v4 and it'll work, but we don't need to, other than for the PDF thing, which other people are successfully using under v2.5.

Thanks again,
Bob

Post by **mark** » Thu Apr 18, 2002 3:05 pm

You're running version 2.51. Version 2.54 was the first version to get titles from pdf. Most of the people still using this "2.5" tech support board are using 2.56.
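As a rough illustration of that version cutoff, here is a minimal Python sketch that reads the version out of a `gw -version` banner like the one posted above and compares it against 2.54. The function names and the `MIN_PDF_TITLE_VERSION` constant are hypothetical, and the sketch assumes the minor number can be compared as a plain integer (51 vs. 54); it is not a Thunderstone tool.

```python
import re

# Assumption: 2.54 is the first release that extracts PDF titles,
# taken from the reply above. The constant name is made up for this sketch.
MIN_PDF_TITLE_VERSION = (2, 54)


def parse_version(banner):
    """Return (major, minor) found in a `gw -version` banner, or None."""
    match = re.search(r"Version\s+(\d+)\.(\d+)", banner)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def supports_pdf_titles(banner):
    """True if the banner's version is at or above the assumed cutoff."""
    version = parse_version(banner)
    return version is not None and version >= MIN_PDF_TITLE_VERSION


banner = "Webinator WWW Site Indexer Version 2.51 (Commercial)"
print(parse_version(banner), supports_pdf_titles(banner))
```

For the banner quoted earlier in the thread this reports 2.51 as below the cutoff, while a 2.56 or 4.0 banner would pass.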

Thunderstone does not generally provide partial upgrades to pieces of a product. Please contact sales to further discuss updates and upgrades and the associated fees.

Post by **rjacoby** » Thu Apr 18, 2002 3:05 pm

Oh, and another question.

I looked at your licenses for v4. Since I just inherited this stuff (and I've only been here a month), I have no idea what licensing you had 4 years ago.

Has your licensing changed since we originally purchased your software? In other words, is our "Commercial" 2.5 license comparable to the "Commercial" 4.0 license? Is the upgrade to the Professional 4.0 license the same regardless of whether we're upgrading from Commercial 2.5 or Professional 2.5? I'm not aware of any page limits that we currently have, so this may be an issue.

Also, in the license I'm confused between section 2.2 and Appendum A, part 8. Section 2.2 says we can install the software on multiple machines within the same site. Appendum A, part 8 seems to rescind this, and limit us to 1 server per license. Is this correct? If so why have section 2.2? It just serves to make me have to go to lawyers to understand everything.

If not, what does that part of the appendum address?

Finally, I'd like to get some clarification on the CMS issue (Appendum A, part 4 and 5). We're looking into getting one. If the majority of our HTML files are stored within the CMS system (>30%), are we not allowed to use Webinator? I can see Webinator having better integration with your CMS system (like M$ products), but not even letting Webinator be used with another CMS system seems like bad strategy. You're severely limiting my company's options. If we can only use Webinator with your CMS system then we'll stop using Webinator altogether.

Thanks again,
Bob

Post by **rjacoby** » Thu Apr 18, 2002 3:12 pm

Mark,

Thanks again for your quick reply. I posted my last questions before getting your 2nd reply. I will copy/paste my last message to sales, since this board is more for technical issues and my questions would be more appropriate for sales.

I must say I'm impressed with your response time! That goes a looong way in my book.

Bob

Post by **mark** » Thu Apr 18, 2002 3:35 pm

Yes, please do contact sales, but just to quickly clarify a couple of points.

The license covers both the Texis and Webinator products. Yes, A.8 rescinds 2.2 for Webinator only. So 2.2 applies to full Texis only.

Webinator CMS is not a CMS system. It's a version of Webinator that is intended for use with whatever CMS system you may already have.