Cannot crawl Word documents

Post by **Faiz** » Tue Mar 20, 2001 5:29 pm

Hi,
I cannot crawl some Word documents. I am using the dowalk script. When I added some print statements to the dowalk script to see what was happening, I found that
<rex ">>=\n" $page>
inside <dofilt> returns nothing, even though $page displayed the full text just before that statement. The next statement, <rex ">>=!\n\n+\n\n\P=.+" $page><$page=$ret>, returns nothing as well. So $page ends up with no value and the Body field is not populated.
Running <anytotx> from the command prompt works fine. I can't figure out what is going wrong. Please help.

Thanx,

Post by **mark** » Tue Mar 20, 2001 5:55 pm

What is your $opt setting when you run anytotx on the Word documents? You may need to uncomment the "<case "doc">" line in doplugin if it is still commented out.

Post by **Faiz** » Tue Mar 20, 2001 6:41 pm

The $opt value is "-fmsw" and the line was already uncommented. I had already set LD_LIBRARY_PATH in .profile. I set it again from the command prompt, exported it, and got the following results:
-- Full text after this line <exec nobr /usr/local/morph3/bin/anytotx -mTitle -mSubject -mKeywords $opt><fmt "%s" $htmlpage></exec>
<$page=$ret>

-- Nothing, after <rex ">>=\n" $page>
-- Full text again, after <rex ">>=!\n\n+\n\n" $page>
-- And nothing after <rex ">>=!\n\n+\n\n\P=.+" $page><$page=$ret>

So, in the end, $page is still empty. This happens only when I run the command from the command prompt:
texis top=<url> dowalk

But when I index documents from the web, i.e. as soon as a document is added via the web, indexing succeeds. It might be a very small mistake, but I can't figure it out.

Thanx,

Post by **mark** » Wed Mar 21, 2001 11:01 am

Change this:

<rex ">>=\n" $page>
<if $ret eq "">
...
<else>
...
</if>

To this:
<if $opt ne "-fmsw">
<rex ">>=\n" $page>
<if $ret eq "">
...
<else>
...
</if>
</if>

I don't know what your web addition process is so I can't really comment on that.

Post by **Faiz** » Wed Mar 21, 2001 11:29 am

Thanx. It works now. The web addition process is something like this:
When a user uploads a document to the web server, I insert the URL of the uploaded document into a Texis table. This triggers the crawling command, "texis top=<url> dowalk", which indexes the document before returning to the page it was uploaded from.
Does the statement <rex ">>=\n" $page> search for a newline in $page?

Post by **mark** » Wed Mar 21, 2001 11:56 am

It looks for a newline as the first character of $page. It is part of processing the info headers from anytotx, but anytotx does not currently generate any info headers for msw documents.
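
As a rough illustration of the checks discussed in this thread, here is a minimal Python sketch (not Vortex) of what the two rex expressions and the "-fmsw" guard appear to do. The function names are hypothetical, the regular expressions only approximate REX semantics, and placing the body extraction inside the guard is an assumption, since the elided branches of the dowalk script are not shown here.

```python
import re

# Approximation of <rex ">>=\n" $page>: a newline as the very first character.
LEADING_NEWLINE = re.compile(r"\A\n")

# Approximation of <rex ">>=!\n\n+\n\n\P=.+" $page>: skip everything up to and
# including the first blank line, and keep only what follows it.
AFTER_FIRST_BLANK_LINE = re.compile(r"\A.+?\n\n(.+)", re.DOTALL)


def starts_with_newline(page: str) -> bool:
    """Hypothetical helper: True when the text begins with a newline."""
    return LEADING_NEWLINE.match(page) is not None


def body_after_headers(page: str) -> str:
    """Hypothetical helper: text after the first blank line, or "" if no match."""
    match = AFTER_FIRST_BLANK_LINE.match(page)
    return match.group(1) if match else ""


def extract_body(page: str, opt: str) -> str:
    """Hypothetical model of the guarded processing suggested in the thread."""
    if opt == "-fmsw":
        # Per the thread, no info headers are generated for msw documents,
        # so the header handling is skipped and the text is kept as is.
        return page
    # Assumption: for other formats the body is whatever follows the headers.
    return body_after_headers(page)
```

In this model, text without a blank line comes back empty from body_after_headers, which would be consistent with the empty $page reported above, while the "-fmsw" guard leaves the filtered text untouched.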