New User - external metadata and data load api questions

jstoll
Posts: 7
Joined: Fri Sep 05, 2008 1:28 pm

New User - external metadata and data load api questions

Post by jstoll »

I apologize if this has been covered previously - I haven't been able to find anything definitive in the docs or forums on this.

I am using a document management system (DMS) called KnowledgeTree, as well as a content management system (CMS) called Drupal. Both allow for the definition of a taxonomy and the application of taxonomy terms (basically, keyword metadata) to items managed by the systems - documents in the case of the DMS; pages, articles, etc. in the case of the CMS. So, in the case of the DMS, I can upload a document (pdf, doc, xls, graphic, audio, video, etc.) and then tag it with terms from the taxonomy. The terms are not internal to the document (so they wouldn't be picked up on a crawl of the document) but external to it - they display on a 'summary' page and are of course accessible via a database query. So, my task is to figure out how to associate this external metadata with the document data. Say a document talked about apples and pears, but I tagged it with 'hybrids', even though 'hybrids' was never mentioned in the document: when I searched for 'hybrid', I'd want to get that document as a result, and even if I searched for 'apple' or 'pear', I'd want to be able to categorize the resulting document via its applied taxonomy terms, such as 'hybrid'.

Any thoughts, ideas, suggestions, etc.? It sounds like a 'connector' is ideally what I want, but there don't seem to be connectors for Drupal or KnowledgeTree, and this is a process that will need to be duplicated for other applications, some of which will be custom. It also seems that the data load api may be an option (I could conceivably query the DMS/CMS databases and then feed Keywords, Category, etc. to it?).

Also, on the data load api - is it required to pass the actual body of the item in the Body tag, or can a URL to it simply be passed, such that the appliance will index the document via the URL? (It just seems excessive to have to pass the content of every item that we want to index when it is accessible via URL).

Thanks for the help, and please forgive my newbie-ness!
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

New User - external metadata and data load api questions

Post by jason112 »

If I understand things correctly, you need to associate arbitrary keywords with documents, where that association isn't contained within the document itself.

A "connector" does describe the behavior that should work for you. You're correct in that the dataload API (which all connectors use) will allow you to do what you want - tell the appliance "Here's a URL, it has this body, and THESE keywords go with it. really."

--------------------
It is technically possible to give the dataload API the binary .doc object and have the appliance parse the object itself, or, through other appliance means, to give it a list of URLs to fetch & process.

However, when you go those routes you lose the ability to specify arbitrary keywords for every URL, which is the crux of your situation. It's "all or nothing" where either the appliance figures out what the content for the URL is, or you tell it exactly what the content for the URL is.
jstoll
Posts: 7
Joined: Fri Sep 05, 2008 1:28 pm

New User - external metadata and data load api questions

Post by jstoll »

jason112 - thanks a bunch for the reply! Your understanding of my need is correct - I need to associate arbitrary keywords with documents, where the keywords are not contained in the document (nor in the document's own metadata, such as a PDF might have, nor in an HTML META tag, etc.).

Are there samples/source for connectors? Something to get started with would be immensely helpful. Similarly for the dataload api - I suspect that I would want to first hack together some proof-of-concept code via the dataload api, then perhaps look into building an honest-to-goodness connector. If there are any code samples, sample XML docs, and/or SOAP calls for using the dataload api, those would also be very helpful.

You lost me a bit on the 2nd part of your explanation. If I'm understanding that section correctly, you're saying that you can push one or more URLs to the appliance via the dataload api or other means (vs a crawl), but that in doing so, you forfeit the ability to associate arbitrary keywords with them. Is that correct?

Also, in that section, it sounds like you're saying that if I use the dataload api, I need to actually pass the textual content of the document in the Body tag, not just a stream dump of the file itself. (i.e., my code that creates the XML would need to know how to open, read, and extract the contents of every document type that I wanted to be able to push to the appliance, whether it be .doc, .xls, .pdf, etc.) That's a scary thought, I have to admit. I hope I'm misunderstanding that! (If this is the case, is there no means of taking advantage of the appliance's ability to decipher the content of various document types via exposed api calls or such?)

Thanks Very Much for helping me out on this - I just don't quite know how/where to get started, but this is definitely getting me pointed in a direction now!
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

New User - external metadata and data load api questions

Post by jason112 »

> Are there samples/source for connectors? Something to
> get started with would be immensely helpful. Similarly
> for the dataload api

First off, "connector" is just a name for "a program that reads data from a given source and pushes it into the data load API", so the closest we can come to giving an example of a connector is an example of the dataload API, as technically anything that uses it could be considered a "connector".

For a dataload API example, see the documentation on your appliance under "DataLoad API". It describes the format, and gives an example XML submission and response.

> If I'm understanding that section correctly, you're
> saying that you can push one or more URLs to the
> appliance via the dataload api or other means (vs a
> crawl), but that in doing so, you forfeit the ability
> to associate arbitrary keywords with it. Is that
> correct?

Correct.

> it sounds like you're saying that if I use the
> dataload api, I need to actually pass the textual
> content of the document in the Body tag, not just a
> stream dump of the file itself.

Unfortunately that's correct. With the dataload API, there are two ways of specifying the data:

* You specify all fields manually, such as title, keywords, and body.
* You give a binary hunk of data and let the appliance figure it all out for itself.

If you give the binary hunk of data, then it extracts keywords from the doc. There's currently no way to say "extract the body from this binary hunk, but use these keywords too".

I understand how being able to do that would help your situation, and it sounds feasible. I can put in a feature request for it, but can't make any promises on a timeline at the moment.

> Is there no means of taking advantage of the
> appliance's ability to decipher the content of
> various document types via exposed api calls or such?

There's currently no way to ask the appliance what it thinks the text of a document is, although for your situation that'd be more of a workaround than a solution.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

New User - external metadata and data load api questions

Post by jason112 »

Nevermind, you CAN already specify custom keywords with a binary body. I was reading the code wrong, and I just tested it out.

To clarify, when specifying a binary blob of data, page title and body are set by the content extractor, so there's no way to say "use this binary data, but also add this to the _body_". Keywords, however, are left alone by the extractor and any keywords you set in the dataload XML will be used.
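
So a submission shaped like this sketch (again, rough element names and a made-up URL - the appliance docs have the literal format) gets you an extracted title and body *plus* your own keywords:

<Dataload>
  <Record>
    <Type>I</Type>
    <Url>http://dms.example.com/doc/123</Url>
    <MimeType>application/pdf</MimeType>
    <Keywords>hybrid</Keywords>               <!-- left alone by the extractor, indexed as given -->
    <RawData>...binary content of the PDF, encoded per the docs...</RawData>
  </Record>
</Dataload>

Title and body come from the content extractor's pass over RawData; Keywords comes from you.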
jstoll
Posts: 7
Joined: Fri Sep 05, 2008 1:28 pm

New User - external metadata and data load api questions

Post by jstoll »

Wow, this is huge!! That is a workable approach. Not ideal, but workable!

So, I can crawl (via an external process) my particular site, collect the binary files and associated metadata, create the XML document, and then submit via POST or SOAP, and it will all magically be searchable and categorizable, correct? (I'm going to try this momentarily.)

Once I have done this for all documents in a system (i.e., the first time), does it have to be done regularly to update the existing document contents (i.e., do I have to implement my own crawler/re-crawler process externally), or will the appliance automatically re-crawl the document during its regular crawling processes, since the document and its URL will be in the appliance's database at that point? I presume that I would have to re-submit when metadata changes occur - in that case, would I need to once again submit the entire document as a binary hunk with the metadata, or would the presence of the document in the system allow me to just specify the metadata changes? Clearly, I'd have to submit any new documents, document deletions, etc.

It is definitely good to be able to associate binary data and metadata like this, but it seems to force me to recreate my own crawling system and just use the appliance for doing searches (vs crawls). One half (or more?) of the job of the appliance is gathering data though, so I'd really like to be able to rely on the appliance to take care of this business. It seems to be a significant hole in the system to not be able to have it crawl and associate external metadata with crawled document content. (The Google Appliance seems to allow querying against external databases at crawl time, for instance - this seems like a really crucial capability.)

My main moving-forward question at the moment is, what 'mechanical' information in the XML submission is required? I see fields for:

- Type - What is the difference between 'I' and 'UI'?

- Size - I presume that is file size in bytes? Is this required if I'm passing a binary hunk in the Body or RawData element?

- Visited - is this used for anything internally (such as a 'staleness' evaluation), and do I need to track last submission dates externally (i.e., in my crawl system that is doing the push to the appliance)?

- Modified - similar to Visited - is it required, and what effect, if any, does it have on the search results?

- NextCheck - again, what is it, what effect does it have, is it required? On a related note, once I push something to the box, will it then re-crawl that document on its own as part of its regular crawl process (to catch updates, changes, etc), or will it wait for me to push a newer version to it?

- MimeType - will the box figure this out, or do I need to set this for each document?

- Charset - is this required for binary data?

- RawData - I presume that actual binary content should be passed here rather than in the Body tag - is that correct?

Lastly, would it be possible to have you send me the sample submission w/ the binary body that you just tested out? That could be a lot of help in getting this going on my end.

Thanks for the continuing help!

Jim
John
Site Admin
Posts: 2595
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

New User - external metadata and data load api questions

Post by John »

For Type, 'I' is insert and 'UI' forces an index update and refreshes the spell dictionaries etc.

The Size should be sent, and is the file size in bytes.

Visited is used if the appliance is refreshing data, and you can use the string 'now' to use the current time.

Modified is also used when refreshing, and is the date that is displayed as part of the search results and used for order by date.

NextCheck is when the appliance would next visit the page.

MimeType should be set if known. The appliance may extract data without it set, but will not fill in the MimeType data.

Charset should still be sent.

You can either send RawData with the binary content, or Body with the extracted text.
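
Putting those together, one record might look roughly like this - a sketch of the shape only, so take the wrapper elements, date formats, and binary encoding from the DataLoad docs rather than from here:

<Dataload>
  <Record>
    <Type>I</Type>                                <!-- 'UI' to also force an index update -->
    <Url>http://dms.example.com/doc/123</Url>     <!-- made-up URL -->
    <Size>48211</Size>                            <!-- file size in bytes -->
    <Visited>now</Visited>                        <!-- 'now' = current time -->
    <Modified>2008-09-05 13:28:00</Modified>      <!-- shown in results, used for order by date -->
    <NextCheck>2008-10-05 00:00:00</NextCheck>    <!-- when the appliance would next visit -->
    <MimeType>application/pdf</MimeType>
    <Charset>ISO-8859-1</Charset>
    <Keywords>hybrid</Keywords>
    <RawData>...binary content...</RawData>       <!-- or Body with the extracted text -->
  </Record>
</Dataload>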

The recrawl could be done by the appliance, but it will try to get the metadata from the fetch rather than keep the metadata you pushed in.
John Turnbull
Thunderstone Software