HTML conversion

Posted: Mon Mar 08, 2004 12:15 pm
by mmcfadden
I have 2 questions.

1. Many of the link titles on the results page just indicate the type of document, e.g., "PDF Document", "Word Document", or, even worse, "Untitled Document", which makes it difficult to get a sense of what these documents' content might be. More than anything this demonstrates a bad choice of link text or document titles on the part of the authors, but I wonder if it's possible on our end to somehow override the titles of these documents with something more meaningful? Google seems to do this somehow - not sure if it does so by parsing the PDF content or by accessing the PDF's metadata.

2. Another cool thing Google does is convert non-HTML documents to HTML. I am wondering if this is possible with Webinator.

Posted: Mon Mar 08, 2004 1:05 pm
by mark
The plugin will get title info from PDF metadata if it's available. Can you give an example where Google gets a useful title for a document but Webinator doesn't for the same document?

Please also provide your anytotx version (anytotx --identify).
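
To illustrate the idea of taking a title from PDF metadata and overriding generic ones, here is a minimal Python sketch. It is not how anytotx or Webinator works internally; `pdf_info_title`, `choose_title`, and the `GENERIC_TITLES` list are hypothetical names made up for this example. It assumes the PDF's Info dictionary is stored uncompressed with a simple `/Title (...)` literal string, and it falls back to the first non-empty line of the extracted text when the metadata title is missing or generic.

```python
import re

# Hypothetical list of placeholder titles that say nothing about the content.
GENERIC_TITLES = {"", "untitled", "untitled document", "pdf document", "word document"}


def pdf_info_title(raw: bytes):
    """Return the /Title literal string from raw PDF bytes, or None.

    Assumption: the Info dictionary is uncompressed and the title is a
    literal string without nested unescaped parentheses.
    """
    m = re.search(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", raw)
    if not m:
        return None
    # Undo the common backslash escapes: \( \) and \\
    text = re.sub(rb"\\([()\\])", rb"\1", m.group(1))
    return text.decode("latin-1").strip()


def choose_title(meta_title, body_text, max_len=80):
    """Prefer a meaningful metadata title, else the first non-empty text line."""
    if meta_title and meta_title.strip().lower() not in GENERIC_TITLES:
        return meta_title.strip()
    for line in body_text.splitlines():
        line = line.strip()
        if line:
            return line[:max_len]
    return "Untitled document"
```

A real converter would need a proper PDF parser, since many files compress or encrypt their metadata; the regex above only covers the simplest case.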

Posted: Mon Mar 08, 2004 1:49 pm
by mmcfadden
The ability Google offers is to convert documents into HTML on the fly. If you do a search at Google.com and a Word or PDF document comes up as a result, you also get a link labeled "View as HTML". I used the filetype:pdf operator in my search to restrict the results to PDF or Word documents. I am still checking into the document title.

Posted: Mon Mar 08, 2004 2:48 pm
by mark
I'm aware of the HTML display for PDF documents. The plugin doesn't currently offer such a feature. People generally prefer the native format. If they can't read that, the plain text is usually sufficient.

Posted: Tue Mar 09, 2004 7:58 am
by doran
Webinator's "Match Info" link DOES provide a view-as-HTML version, for PDF as well as other document types. It isn't fancy, but all the text is there (cached), and it highlights your search terms too.
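
For anyone curious what a plain "cached text with highlighted terms" view involves, here is a rough Python sketch. It is only an illustration under my own assumptions, not Webinator's actual Match Info code; `highlight_as_html` is a hypothetical helper name. It escapes the extracted plain text for HTML and wraps each case-insensitive match of a search term in `<b>` tags.

```python
import html
import re


def highlight_as_html(text, terms):
    """Escape plain text for HTML and wrap each search term in <b> tags."""
    terms = [t for t in terms if t]
    if not terms:
        return "<pre>" + html.escape(text) + "</pre>"
    # Longest terms first so a longer term wins over its own prefix.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    parts = []
    pos = 0
    for m in pattern.finditer(text):
        # Escape the text between matches, then the match itself.
        parts.append(html.escape(text[pos:m.start()]))
        parts.append("<b>" + html.escape(m.group(0)) + "</b>")
        pos = m.end()
    parts.append(html.escape(text[pos:]))
    return "<pre>" + "".join(parts) + "</pre>"
```

Escaping each piece separately keeps any `<` or `&` in the document text from being interpreted as markup, while the highlight tags themselves stay intact.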