Titles in PDF files

Post by watterson »

How does Webinator determine the title for PDF files it crawls? The titles in the search results appear (to me) to be inconsistent. Some show the file name while others show a meaningful title.

Post by John »

It uses the title property stored in the PDF document. Depending on how the PDF was created, that title may be more or less meaningful.
John Turnbull
Thunderstone Software

Post by mark »

Webinator uses the "Title" specified in the PDF document. If there is none, it falls back to "PDF Document". PDF generators often put the source filename in as the title of the document, so you end up with PDFs that have useless titles like "myfile.doc".
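To make that behavior concrete, here is a minimal Python sketch of the rule described above, plus a check for filename-like titles. It is not Webinator code: the function names, the regular expression, and the list of extensions are assumptions chosen for illustration.

```python
import re

# Assumption: extensions that generators commonly leave in the Title
# property; adjust the list for your own documents.
FILENAME_PATTERN = re.compile(r".+\.(docx?|xlsx?|pptx?|pdf|rtf|txt)", re.IGNORECASE)

DEFAULT_TITLE = "PDF Document"  # fallback named in the post above


def looks_like_filename(title):
    """Return True when a title ends in a known file extension, e.g. 'myfile.doc'."""
    return FILENAME_PATTERN.fullmatch(title.strip()) is not None


def displayed_title(pdf_title):
    """Use the PDF's Title property if present, otherwise the default."""
    if pdf_title is None or not pdf_title.strip():
        return DEFAULT_TITLE
    return pdf_title.strip()
```

Running `looks_like_filename` over the titles already in the index would be one way to list the documents whose Title property should be fixed at the source.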

Post by watterson »

I am sending this privately so it does not show up on the website.

That makes me wonder how Google does it. When there is a useless title in one of our PDF, DOC, or XLS files, Google somehow finds a meaningful title.

Post by John »

My guess is that they look for some large or bold text at the top of the document as an alternative, although I'm not sure when they'd choose to use that instead of the actual title property in the PDF.
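As a purely illustrative sketch of that guess (it is not how Google or Webinator is known to work), the heuristic could look like this in Python. The `TextSpan` type, the `guess_title` function, and the `top_n` cutoff are hypothetical, and the sketch assumes some other tool has already extracted text spans with font information in reading order.

```python
from dataclasses import dataclass


@dataclass
class TextSpan:
    text: str
    font_size: float
    bold: bool = False


def guess_title(spans, top_n=10):
    """Pick the largest span among the first top_n; bold breaks ties in size."""
    candidates = [s for s in spans[:top_n] if s.text.strip()]
    if not candidates:
        return None
    best = max(candidates, key=lambda s: (s.font_size, s.bold))
    return best.text.strip()
```

Limiting the search to the first few spans keeps a large heading deep in the document from being mistaken for the title.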

Post by watterson »

Ok, one last question, I think.

Is there a way to change the title using regular expressions? I am not yet comfortable with how these are used with Webinator, so it may not be possible.

Mike

Post by John »

You can use "Data from Field" to override the title and pull it from somewhere else. However, that is not currently conditional, so it is all titles or no titles from the field. For example, if you wanted the title of every result to be the first 50 characters of the Body, you could use this as the search expression:

>>=.{,50}

in the Text field.
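For readers more used to Python regular expressions, a rough analogue of "up to the first 50 characters of the body" is sketched below. This is Python `re` syntax, not Thunderstone REX, and `title_from_body` is a hypothetical helper name. Letting `.` match newlines is an assumption of this sketch and may differ from what REX does.

```python
import re

# Up to 50 characters from the start of the text (Python syntax, not REX).
# re.DOTALL lets "." match newlines too; that is a choice made for this sketch.
FIRST_50 = re.compile(r".{0,50}", re.DOTALL)


def title_from_body(body):
    """Return at most the first 50 characters of the body text."""
    return FIRST_50.match(body).group(0)


# All-or-nothing, as described above: every document gets a body-derived
# title, including documents whose original title was already good.
docs = [{"title": "myfile.doc", "body": "Quarterly sales figures for the region"}]
new_titles = [title_from_body(d["body"]) for d in docs]
```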

Post by watterson »

Ok, thanks, but as you probably guessed, changing all the titles is not optimal for us.

Does Thunderstone have a "requested feature" list (or similar)? I would like to request something like this.

Post by John »

Yes, we do have a requested feature list, and more conditional Data from Field is on the list, but I'll make sure to add this case.