search within a book feature

KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

Does Webinator have any feature for searching images or PDF files accurately?

A couple of weeks ago I learned that Webinator may be searching PDF files correctly (I am not sure whether searching a .pdf or .doc file is as accurate as searching a .htm file), but when it comes to highlighting it is not that reliable.

We have lots of PDF files that we would like to put in a separate profile and offer as a document search. (If Adobe has a navigation-blocking capability that would stop users from paging back and forth, that would be great. Do you know anything about it?)

If searching within the PDF files with navigation blocking is not 100% feasible, we would like to consider having a database of images of the PDF files (taking care that each image has full ALT text). Can Webinator search images and display the results the same way as any other web page?

In short, can Webinator offer the kind of 'search within a book' solution that Amazon (and now Google) have?
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

How does webinator index the PDFs?

The reason I am asking is that I have two PDFs that Webinator has indexed. For one of them the title comes out as 'PDF Document', and for the other it is 'XYZ SOMETITLE', even though the string 'XYZ SOMETITLE' is not to be found directly in the original PDF document.

What I need Webinator to do is take whatever name I give my PDF files and display that name as the result title. How can I do that?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

PDF files have an internal meta field called Title (as well as Author and others) that is set by the author of the document. If the Title field is set, it is used as the document title. If the Title field is not set, you get "PDF Document". To change that behavior, modify dowalk. Look for
<if $title eq ""><!-- no title, make one up based on file type and size -->
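The fallback rule described above can be sketched outside Vortex. This is only a Python illustration of the logic, not the dowalk code itself; the function name `choose_title` and its parameters are hypothetical, and the placeholder format is the "%s Document (%dk)" pattern quoted later in this thread.

```python
def choose_title(meta_title, doc_type="PDF", size_kb=0):
    """Return the document's own Title field if it is set.

    Otherwise make one up from the file type and size, mirroring the
    "no title" branch in dowalk. All names here are hypothetical.
    """
    if meta_title:
        return meta_title
    # No Title in the metadata: fall back to a generic placeholder.
    return "%s Document (%dk)" % (doc_type, size_kb)
```

Changing what the result title looks like means changing what that final branch returns.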
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

In that case, I think it may not be possible for me to split the files correctly. I tried several tools, and some of them will import the metadata, but what I need is for the walk to take the filename when the PDF has no title.

How should I modify dowalk so that it uses the file name of the PDF when it can't find a title?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Look for the line mentioned above. Set $title to whatever you want there.
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

I know what I want to do, just not sure how to do it.

I guess that in the function 'dofilt', right after the line

<if $title eq ""><!-- no title, make one up based on file type and size -->

I need to split the current URL of the document (whatever the file type is) so as to keep everything after the last '/', and then strip the extension from that string, so that what I have left is just the name of the file.

But I want to do this only for results from my site, not for all the sites (the rest can keep 'PDF Document'). The reason is that I am giving my files meaningful names, and now that we have URL searching capability I can take full advantage of it.

Can you help me with my requirements?
KMandalia
Posts: 301
Joined: Fri Jul 09, 2004 3:50 pm

Post by KMandalia »

BTW, I didn't see where in dowalk the title is set to 'PDF Document' when the document doesn't have a title.

AND

What I want to do can (or should) be done in the 'setupresults' function. How would I change the code at the start of the function so that it checks the URL of the result it is processing and, if it is one of our URLs and the document type is PDF, assigns the filename as the title instead of 'PDF Document'?

Can this be done?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

dowalk creates "PDF Document (48k)" with
<strfmt "%s Document (%dk)" $dt $x>

Extract the filename from the url in dowalk with something like:
<rex "[^/]+\F\.[^.]+>>=" $u>

If you want to do it in search instead, find where it is already looking for "PDF Document (" and add some code there to change the title.
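For readers who want to check the intended result before editing dowalk or search, here is a minimal Python sketch of the logic asked for earlier in the thread: keep what follows the last '/', drop the extension, and do so only for PDFs on your own site. It is not Vortex and is not how Webinator implements anything; `OWN_HOST` and `filename_title` are hypothetical names, and `www.example.com` is a placeholder host.

```python
import posixpath
from urllib.parse import urlsplit

OWN_HOST = "www.example.com"  # hypothetical: replace with your own site's host


def filename_title(url, own_host=OWN_HOST):
    """Return the PDF's file name without extension, or None.

    None means "leave the title alone": the URL is on another site,
    or the document is not a .pdf.
    """
    parts = urlsplit(url)
    if (parts.hostname or "").lower() != own_host:
        return None  # other sites keep their "PDF Document" title
    name = posixpath.basename(parts.path)  # everything after the last '/'
    stem, ext = posixpath.splitext(name)   # split off the final extension
    if ext.lower() != ".pdf" or not stem:
        return None
    return stem
```

A caller would use the returned name only when the document has no Title of its own, for example `title = meta_title or filename_title(url) or "PDF Document"`.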
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

p.s.
In search, also look at where it checks for the .pdf extension.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Something that hasn't been mentioned before: the "Plugin Split" option under all walk settings may be useful for splitting PDF text into individual pages that then refer back to the full document.
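To make the idea of per-page entries concrete, here is a rough Python sketch. It does not show how "Plugin Split" works internally. It assumes the extracted text separates pages with a form feed character, as some PDF-to-text tools do, and the function name and record fields are hypothetical.

```python
def split_pages(full_text, doc_url):
    """Split extracted text into one record per page.

    Assumes pages are separated by form feeds ("\f"). Each record
    keeps the URL of the full document so a hit on a single page
    can point back to it.
    """
    pages = full_text.split("\f")
    return [
        {"url": doc_url, "page": number, "text": text}
        for number, text in enumerate(pages, start=1)
        if text.strip()  # skip blank pages but keep the numbering
    ]
```

Each record can then be indexed as its own search result, which is roughly what a 'search within a book' listing needs.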