Hi,
I cannot crawl some Word documents. I am using the dowalk script. When I added some print statements to the dowalk script to see what was happening, I found that
<rex ">>=\n" $page>
inside <dofilt> returns nothing. However, $page displayed the full text before that statement. The next statement, <rex ">>=!\n\n+\n\n\P=.+" $page><$page=$ret>, returns nothing again. So, with no value in $page, the Body field is not populated.
<anytotx> from the command prompt works fine. I can't figure out what's going wrong. Please help.
What is your $opt setting when you run anytotx on the Word documents? You may need to uncomment the "<case "doc">" line in doplugin if you still have it commented out.
The $opt value is "-fmsw" and the line was already uncommented. I had already set LD_LIBRARY_PATH in .profile. I set it once again from the command prompt, exported it, and got the following result:
-- Full text after this line <exec nobr /usr/local/morph3/bin/anytotx -mTitle -mSubject -mKeywords $opt><fmt "%s" $htmlpage></exec>
<$page=$ret><!-- get text version of page -->
-- Nothing, after <rex ">>=\n" $page>
-- Full text again, after <rex ">>=!\n\n+\n\n" $page>
-- And nothing after <rex ">>=!\n\n+\n\n\P=.+" $page><$page=$ret>
So, in the end, $page is still empty. This happens only when I run the command from the command prompt:
texis top=<url> dowalk
But when I index documents from the web, i.e., indexing as soon as a document is added via the web, it works. It might be a very small mistake, but I can't figure it out.
Thanks. It works now. The web addition process is something like this:
When a user uploads a document to the web server, I insert the URL (of the uploaded document) into a Texis table, which triggers the crawling command, "texis top=<url> dowalk", and indexes the document before returning to the page from which it was uploaded.
Does this statement, <rex ">>=\n" $page>, search for a newline in $page?
It looks for a newline as the first character. It's related to processing info headers from anytotx. But anytotx does not currently generate any info headers for msw documents.
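For anyone who finds the rex expressions hard to read, here is a rough Python sketch of the two checks discussed in this thread. It is not Vortex code and not what dowalk actually runs. The function names are made up for illustration, and the second function is only my assumed reading of the longer expression (skip everything up to the first blank line, return what follows).

```python
def starts_with_newline(page: str) -> bool:
    # Rough analogue of the first rex: true only when the very first
    # character of the text is a newline.
    return page.startswith("\n")


def body_after_headers(page: str) -> str:
    # ASSUMED analogue of the second rex: treat everything before the
    # first blank line as info headers and return the text after it.
    # Returns "" when there is no blank line or nothing precedes it.
    head, sep, body = page.partition("\n\n")
    return body if sep and head else ""


if __name__ == "__main__":
    sample = "Title: Example\nSubject: Demo\n\nThis is the body."
    print(starts_with_newline(sample))
    print(body_after_headers(sample))
```

The point is only that the first check is anchored to the start of the text, so it can come back empty even when $page holds the full document.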