MS Word doc search problem

ryan.dorosh
Posts: 8
Joined: Tue Mar 13, 2001 6:10 pm

Post by ryan.dorosh »

I recently purchased the commercial version with the PDF plugin. I'm trying to get it to search MS Word docs as well and have placed the appropriate lines into the gw command, but still no luck. It is searching PDFs just fine, though. Here is my .set file.

# handle PDF files with purchased PDF plugin
napplication/pdf,pdf,pdftotx
napplication/msword,doc,pdftotx
# allow big files so PDF doesn't get truncated
z1500000
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

For MS Word, use:

napplication/msword,doc,pdftotx -fmsw
ryan.dorosh
Posts: 8
Joined: Tue Mar 13, 2001 6:10 pm

Post by ryan.dorosh »

I've added the -fmsw flag, but it still neither finds nor adds the .doc files, which reside in the same folder as the PDF files (which do work).

Here is the new .set file.

# handle PDF files with purchased PDF plugin
napplication/pdf,pdf,pdftotx
napplication/msword,doc,pdftotx -fmsw
# allow big files so PDF doesn't get truncated
z1500000
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Webinator only "finds" files that are linked into the site. They must be reachable with a web browser by starting at the URL you give to gw and clicking on hyperlinks. They must also be reachable without Java or JavaScript.

Of course the other possibility is that you're not using the standard ".doc" extension for your word files. If not, adjust your n option accordingly.
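
One way to test the "linked into the site" condition by hand is to save the HTML of a page that should link to the Word files and list the plain hyperlinks in it. The following is a minimal Python sketch, not part of Webinator: the names `LinkCollector` and `doc_links` and the sample HTML are made up for illustration, and the check is case-insensitive, which may or may not match how gw compares extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href targets of plain <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def doc_links(html, base_url, extension=".doc"):
    """Return linked URLs whose path ends with the extension (case-insensitive)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [
        url for url in collector.links
        if urlparse(url).path.lower().endswith(extension)
    ]


# Hypothetical page content: one real link to a .doc, one to a PDF,
# and one "link" that only works through JavaScript.
sample = (
    '<a href="dir/x.doc">Word</a> '
    '<a href="dir/y.pdf">PDF</a> '
    '<a href="#" onclick="open(\'z.doc\')">JS only</a>'
)
found = doc_links(sample, "http://localhost:81/")
print(found)
```

In this sample only `dir/x.doc` is reported; `z.doc` is referenced from script alone, so a walker that does not run JavaScript would have no hyperlink to follow to it.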
ryan.dorosh
Posts: 8
Joined: Tue Mar 13, 2001 6:10 pm

Post by ryan.dorosh »

Hi Mark,

The files are in a folder within the website tree. They are accessible from the address given to gw. The PDFs work fine, but the .doc files aren't listed as gw runs through and indexes all the files (even though they are in the folder and the extension is listed in the .set file). The extension is .doc and is specified as such in the .set file. Any other ideas? Everything else works perfectly.

Thanks,
Ryan.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

So, with a web browser, you can start at the URL given to gw and click on one or more hyperlinks to bring up a .doc file that gw misses?

What's your full gw command line? I assume you've supplied the complete .set file. Also, are the extensions ".doc" or ".DOC"?

Are the doc files contained in a directory that is disallowed by robots.txt?
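
The robots.txt question can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the `is_allowed` helper and the sample rules are hypothetical, and the `*` user agent is an assumption, since the agent name gw matches against is not stated here. Note that robots.txt is conventionally served from the site root, not from individual folders.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, for illustration only.
sample_robots = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_robots, "http://localhost:81/dir/x.doc"))
print(is_allowed(sample_robots, "http://localhost:81/private/x.doc"))
```

Pasting in the text actually served at /robots.txt on the site being walked, together with the URL of a missed file, shows whether a disallow rule covers it.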
ryan.dorosh
Posts: 8
Joined: Tue Mar 13, 2001 6:10 pm

Post by ryan.dorosh »

The file extensions are .doc, and yes, entering http://server/dir/x.doc in the browser loads the Word document. There is no robots.txt in the folder, and all other files in the folder are working fine.

Here is the complete .set file.

#
# common options for walking reboot:81
#
# exclude duplicate text tree
# xhttp://www.somesite.com/text/

# exclude some junk
# xhttp://www.somesite.com/test/
# xhttp://www.somesite.com/personals/

# allow other file extensions (asp etc)
# fasp

# handle PDF files with purchased PDF plugin
napplication/pdf,pdf,pdftotx
napplication/msword,doc,pdftotx -fmsw
# allow big files so PDF doesn't get truncated
z1500000
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

I'm not asking whether you can access the file directly. I'm asking whether you can get to it by following hyperlinks from the URL you are giving to gw. You also didn't provide the gw command line.
ryan.dorosh
Posts: 8
Joined: Tue Mar 13, 2001 6:10 pm

Post by ryan.dorosh »

This is my command line:
gw -msetoptions.set http://localhost:81

And if I understand you correctly, yes, I can get to the .doc file from the URL I'm giving gw.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

All I can suggest now is to start clean and run with high verbosity to see why it might be rejecting those URLs. Wipe the database (or use a different one) and add a -v9 option. Capture the screen output to a file so you can review it and see what it says about the .doc URLs.
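
Once the output is captured, a small script can pull out just the lines that mention the Word files. The sketch below makes no assumption about what gw actually prints at -v9; it only does a case-insensitive substring search. The function names are hypothetical, and the sample lines are placeholder text, not real gw output.

```python
def lines_mentioning(lines, needle=".doc"):
    """Return (line_number, text) pairs for lines containing needle, ignoring case."""
    needle = needle.lower()
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if needle in text.lower()
    ]


def scan_capture(path, needle=".doc"):
    """Scan a captured output file; undecodable bytes are replaced, not fatal."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return lines_mentioning(handle, needle)


# Placeholder text only; this is not real gw output.
placeholder = [
    "first line about http://localhost:81/dir/y.pdf\n",
    "second line about http://localhost:81/dir/x.doc\n",
]
hits = lines_mentioning(placeholder)
print(hits)
```

Calling `scan_capture` on the captured file returns the matching lines with their line numbers, which makes it easier to find the surrounding context in the full capture.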