search within a book feature

KMandalia · Post by **KMandalia** » Mon Nov 08, 2004 5:57 pm

Does webinator have any support or feature that searches images or pdf files accurately?

A couple of weeks ago I learned that Webinator may be searching PDF files correctly (I am not sure whether searching a .pdf or .doc has the same level of accuracy as searching a .htm file), but when it comes to highlighting it's not that reliable.

We have lots of PDF files that we would like to create a separate profile for and implement a document search. (If Adobe has a navigation-blocking capability that would stop users from moving back and forth between pages, that would be great. Do you know anything about it?)

If searching within the .pdf file with navigation blocking is not 100% feasible, we would like to consider having a database of images of the .pdf pages (taking care that each image has the full ALT text). Can Webinator search images and display the results the same way as any other web page?

In short, does webinator have any capability of offering search solutions that Amazon (and now Google) have as far as 'search within a book' goes?

KMandalia · Post by **KMandalia** » Tue Nov 16, 2004 11:09 am

How does webinator index the PDFs?

The reason I am asking is that I have two PDFs that Webinator has indexed. For one of them the title comes out as 'PDF Document', and for the other it is 'XYZ SOMETITLE', even though the string 'XYZ SOMETITLE' is not to be found directly in the original PDF document.

What I need webinator to do is to take whatever name I am giving to my PDF files and display that name as result title. How can I do that?

Post by **mark** » Tue Nov 16, 2004 11:50 am

PDF files have an internal meta field called Title (as well as Author and others) that is set by the author of the document. If the Title field is set that is used as the document title. If the Title field is not set you get PDF Document. Modify dowalk to change the behavior. Look for
<if $title eq "">

KMandalia · Post by **KMandalia** » Tue Nov 16, 2004 12:53 pm

In that case, I think it may not be possible for me to split the files correctly. I tried several tools and some of them will import the metadata, but what I need is to walk the PDF such that if the title is not present, it takes the filename instead.

How should I modify the dowalk so that it will take the file name of the PDF file if it can't find the title?

Post by **mark** » Tue Nov 16, 2004 2:46 pm

Look for the line mentioned above. Set $title to whatever you want there.

KMandalia · Post by **KMandalia** » Tue Nov 16, 2004 3:47 pm

I know what I want to do, just not sure how to do it.

I guess, in the function 'dofilt', right after the line

<if $title eq "">

I need to split the current Url of the document (whatever the file type is) so as to just retain everything after the last '/' and then again take out the extension from the returned string so that what I will have is just the name of the file.

But I want to do this only for results from my site, not for all sites (the rest of them can keep 'PDF Document'). The reason is that I am giving meaningful names to my files, and now that we have URL searching capability, I can take full advantage of it.

Can you help me with my requirements?
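The requirement described above can be sketched outside of Vortex. The following is a minimal Python illustration of the logic only, not dowalk code: the function name `fallback_title` and the host `www.example.com` are hypothetical, and it assumes "my site" can be recognized by host name.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical host name standing in for "my site".
OWN_HOST = "www.example.com"


def fallback_title(url, title, own_host=OWN_HOST):
    """Return the title unchanged unless it is empty and the URL is on own_host.

    In that case, return the file name from the URL without its extension.
    """
    if title:
        # The document supplied its own title; keep it.
        return title
    parts = urlparse(url)
    if (parts.hostname or "").lower() != own_host.lower():
        # Other sites keep whatever default the walker assigns.
        return title
    # Everything after the last '/' in the path, then drop the extension.
    filename = posixpath.basename(parts.path)
    stem, _ext = posixpath.splitext(filename)
    return stem or title


print(fallback_title("http://www.example.com/docs/Annual-Report-2004.pdf", ""))
```

The same two steps (keep what follows the last '/', then drop the extension) would have to be expressed with Vortex functions inside the `<if $title eq "">` branch.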

KMandalia · Post by **KMandalia** » Tue Nov 16, 2004 4:15 pm

BTW, I didn't see where in the dowalk the title is changed to 'PDF Document' if the document doesn't have any title.

AND

Whatever I want to do can (or should) be done in the 'setupresults' function. (How would I change the code at the start of the function so that it checks the URL of the result it's processing and, if it is our URL and the document type is PDF, assigns the filename as the title instead of 'PDF Document'?)

Can this be done?

Post by **mark** » Tue Nov 16, 2004 4:53 pm

dowalk creates "PDF Document (48k)" with
<strfmt "%s Document (%dk)" $dt $x>

Extract the filename from the url in dowalk with something like:
<rex "[^/]+\F\.[^.]+>>=" $u>

If you want to do it in search instead you can see where it's already looking for "PDF Document (" and add some code there to change the title.

Post by **mark** » Tue Nov 16, 2004 4:57 pm

p.s.
In search also look where it checks for .pdf extension.
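The search-side approach suggested in the two posts above could look roughly like this. It is a Python sketch of the logic only, not code from the search script: `display_title`, the `own_prefix` argument and the example URLs are hypothetical, and the regular expression is a Python pattern with the same goal as the rex expression above, not a translation of its syntax.

```python
import re

# Last path segment of the URL, without its final extension.
FILENAME_RE = re.compile(r"([^/]+)\.[^./]+$")

# The default title that dowalk generates, e.g. "PDF Document (48k)".
DEFAULT_TITLE_RE = re.compile(r"^PDF Document \(\d+k\)$")


def display_title(title, url, own_prefix):
    """Replace a default PDF title with the file name, for our own URLs only."""
    if not DEFAULT_TITLE_RE.match(title):
        return title  # a real title was stored; leave it alone
    if not url.startswith(own_prefix):
        return title  # other sites keep "PDF Document (NNk)"
    if not url.lower().endswith(".pdf"):
        return title  # only rewrite results that are PDF files
    match = FILENAME_RE.search(url)
    return match.group(1) if match else title


print(display_title("PDF Document (48k)",
                    "http://www.example.com/docs/Annual-Report-2004.pdf",
                    "http://www.example.com/"))
```

Doing this at search time leaves the stored titles untouched, so the rule can be changed later without re-walking the site.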

Post by **mark** » Tue Nov 16, 2004 11:08 pm

Something that hasn't been mentioned before: the "Plugin Split" option under All Walk Settings may be useful for splitting PDF text into individual pages that then refer back to the full document.