pdf indexing problem

Post by **cindy_walker** » Tue Aug 14, 2001 2:08 pm

We're using Webinator Commercial, version 2.56, release 20000620 on a web server that runs Irix 6.5.

It used to index pdf documents, but for months hasn't been including them. It gives no errors, it
just ignores them. No pdfs appear in the list of files it displays as it indexes. Here's the command
I use to create the index:

gw -ddb2 -z300000 -Iindex.htm -fshtml -meta=keywords,description
-n"application/pdf,pdf,/www/pub/webinator/bin/anytotx" -fpdf http://www.dot.ca.gov

When I run anytotx on the command line it does just what it should - it displays the
text of the pdf document in stdout.

I set up a test index to help figure out what's wrong. You can see it at

http://www.dot.ca.gov/cgi-bin/texis/web ... h/?db=test

The words fastrak, collection, or toll should produce results, but every search produces a page
with "no documents matched the query" and these error messages in the code:





The SYSOBJECTS.tbl file is there with 444 permissions, owned by webinator. I suppose this is a separate problem. I don't see this with our main index.

Here's the command I used to create the test index:

gw -dtest -n"application/pdf,pdf,/www/pub/webinator/bin/anytotx" -fpdf
-z500000 http://www.dot.ca.gov/fastrak/pdftest.htm

There are only 3 pages in this index - 1 pdf file and 2 htm files. It appears to index all 3 without complaint.

If I try to force it to add a pdf this happens:

gw -dtest -g http://www.dot.ca.gov/fastrak/FasTrakApp.pdf
http://www.dot.ca.gov/fastrak/FasTrakApp.pdf: Disallowed extension
Visited 0 pages total

I see this same message if I try to add a pdf to our main index.

What can I do to get webinator to cheerfully include pdfs? Is there something special about the default index?
I believe our troubles started when I began creating a second index, db2, renaming it "db" and replacing the
db directory with it.

Post by **mark** » Tue Aug 14, 2001 3:07 pm

Does db2 exist before you start the walk?

See what URLs are in the test database with:
gw -dtest -st "select Url from html"
See what's stored for the URLs with:
gw -dtest -st "select * from html"

For SYSOBJECTS, maybe the test directory is not readable.

With -g you would still need -n and -z. -fpdf is not needed with -n.

Post by **cindy_walker** » Tue Aug 14, 2001 4:11 pm

db2 doesn't exist before I start indexing. I rename it "db" and copy it over after indexing is finished. After our default index, db, was in production we wanted to add meta tags to the index, so I started creating the second index, then copying over. Sometimes I use a -rewalk command on the db index. Either way I'm not getting pdfs.

Using the "select Url from html" statement returns the Url of all three pages that are supposed to be part of the test index: pdftest.htm, App.pdf and test2.htm

Using the "select * from html" command lists content for all pages.

The test directory has 777 permissions, just like the db directory.

I was able to add a new pdf document with the command:

gw -dtest -z500000 -g -fpdf -n"application/pdf,pdf,/www/pub/webinator/bin/anytotx" http://www.dot.ca.gov/fastrak/FasTrakApp.pdf

"-fpdf" is needed apparently; otherwise I get the "can't run plugin" message.

Post by **mark** » Tue Aug 14, 2001 4:47 pm

So for the test db the problem is the search. Perhaps SYSOBJECTS got corrupted somehow. Is/was the disk full? You should find a good copy in the .master directory. Or to test without it put
<apicp eqprefix "">
at the top of the <a name=qpar> function in the search script.

You can look for the pdf urls in your big database with:
gw -st "select Url from html where Url like '.pdf'"
See their content with:
gw -st "select * from html where Url like '.pdf'"

Post by **cindy_walker** » Tue Aug 14, 2001 5:42 pm

You're right that we've been having trouble with some volumes on our web server becoming 100% full. It was definitely happening on the day I created the test index. Now space isn't a problem. I copied SYSOBJECTS.tbl from .master to the test directory. This didn't solve the problem. I added
<apicp eqprefix ""> right under <a name=qpar> in the search script and still see "no documents matched the query" except that now the error message in the html is:



I removed the test directory and created a new test index exactly the way I did before. It appears to have indexed the pdfs but still returns no search results and still gives the error message above.

Using the select statement to list the pdf files in our main index returned nothing. I expected it would. I already fgreped gw.log and found no lines containing .pdf.
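The gw.log check described above can be repeated with a small helper. This is only a sketch: `count_pdf_lines` is a hypothetical name, the log format is not assumed beyond "one entry per line", and the path in the usage line is a guess based on the directories quoted in this thread.

```shell
#!/bin/sh
# Count the lines in a log file that mention ".pdf".
# -F treats the pattern as a fixed string (so "." is literal),
# -i also catches ".PDF", and -c prints the number of matching lines.
count_pdf_lines() {
    log_file=$1
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    # grep exits with status 1 when nothing matches; the count is still
    # printed, so the exit status is ignored here.
    grep -c -i -F '.pdf' "$log_file" || true
}
```

For example, `count_pdf_lines /www/pub/webinator/db/gw.log` (path assumed). A result of 0 means no logged line mentions a pdf at all, which points at the walk rather than the search.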

Post by **mark** » Tue Aug 14, 2001 6:10 pm

Either you interrupted gw before it finished or you used -noindex.
Run
gw -dtest -index
to create/update the search index on the fetched data.

Post by **cindy_walker** » Tue Aug 14, 2001 6:54 pm

There are only three documents in the test index. I didn't use -noindex. I ran the command you listed above but it still gives the same error message. Entering

gw -dtest -st "select Url from html"

shows all three documents that should be there. I just created another test index, restricting it to a section of our site with 15 files, 2 of which are pdfs. I was hoping that there was something fluky about the test index and that this one would behave o.k. All 15 files are included in the new index, but it gives the same error message when I try to search it.

Post by **mark** » Tue Aug 14, 2001 9:37 pm

This is getting strange. What changed on the system between when it worked and when it didn't?

If the data is indexed there should be several files starting with "xhtml" in the database directory. What does this give:
gw -dtest -st "select TBNAME,NAME,TYPE,FIELDS from SYSINDEX"

Are the database directory and files owned by the user gw is running as? And is the texis CGI running as that user? Check if gw and texis are setuid.
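The ownership and setuid questions above can be checked from a shell. The following is a rough sketch using only standard `test` operators; the function names are hypothetical, and the paths in the usage note are assumptions based on the directories mentioned in this thread.

```shell
#!/bin/sh
# Succeed if the file has its setuid bit set (test operator -u).
is_setuid() {
    [ -u "$1" ]
}

# Succeed if the current user can both list and enter the directory.
# Run this as the user that gw or the texis CGI runs as.
dir_is_usable() {
    [ -d "$1" ] && [ -r "$1" ] && [ -x "$1" ]
}
```

Usage might look like `is_setuid /www/pub/webinator/bin/gw && echo "gw is setuid"` and `dir_is_usable /www/pub/webinator/test || echo "test directory not usable"`. Owner and group can be compared by eye with `ls -ld` on the directory and `ls -l` on its files.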

Post by **cindy_walker** » Wed Aug 15, 2001 12:12 pm

We did upgrade the OS from Irix 6.4 to 6.5 recently. But here's what I think the problem was.

When I create a new index, its directory is placed in /www/pub/webinator/bin - the same place gw lives. If I leave it there and try to search it, the web server (or texis) puts a new directory of the same name with only some of the necessary files (and no SYSOBJECTS.tbl) in /www/pub/webinator. The original index is owned by user webinatr, group webinatr. The copied index is owned by user webinatr and group nobody. It looks like unless I'm recreating the default index I'll need to copy the new index from the bin directory into the webinator directory for it to work correctly.
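A quick way to tell the complete index directory from the partial copy described above is to check each candidate location for SYSOBJECTS.tbl. This is a minimal sketch; `check_db_dir` is a hypothetical helper, and it checks only that one file, not everything a database needs.

```shell
#!/bin/sh
# Print a one-word status for a database directory:
#   missing        the directory does not exist
#   no-sysobjects  the directory exists but has no readable SYSOBJECTS.tbl
#   ok             SYSOBJECTS.tbl is present and readable
check_db_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing"
        return 1
    fi
    if [ ! -r "$dir/SYSOBJECTS.tbl" ]; then
        echo "no-sysobjects"
        return 1
    fi
    echo "ok"
}
```

With the layout from this post, `check_db_dir /www/pub/webinator/bin/test` and `check_db_dir /www/pub/webinator/test` would show which of the two directories the search is actually able to use.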

Now that I moved the test directory into the webinator directory, searching it works just fine, including pdfs. I'm wondering whether I had earlier - for our main index - made the mistake of entering the URL to search before the -n"application/pdf..." command. I didn't realize that the URL had to be the last item.

Post by **mark** » Wed Aug 15, 2001 12:48 pm

You don't need to copy the database. You need to specify the path to the database with gw if you're not in the webinator directory. e.g. use:
gw -d/www/pub/webinator/test
not
gw -dtest

And, yes, placing the url before -n could cause the problem. The gw.log file in the database directory should indicate the command line you used.
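Both points (an absolute database path, and the URL as the last argument) can be enforced by a small wrapper that assembles the command line. This is a sketch only: the function names are hypothetical, the install directory is taken from the paths quoted in this thread, and the wrapper prints the command instead of running it so the result can be inspected first.

```shell
#!/bin/sh
# Assumed install directory; override by setting WEBINATOR_DIR beforehand.
WEBINATOR_DIR=${WEBINATOR_DIR:-/www/pub/webinator}

# Turn a bare database name into an absolute path; leave absolute paths alone.
db_path() {
    case $1 in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s/%s\n' "$WEBINATOR_DIR" "$1" ;;
    esac
}

# Print (do not run) a gw command with every option first and the URL last.
# Usage: gw_command DB URL [OPTION...]
gw_command() {
    db=$(db_path "$1")
    url=$2
    shift 2
    printf 'gw -d%s' "$db"
    for opt in "$@"; do
        printf ' %s' "$opt"
    done
    printf ' %s\n' "$url"
}
```

For example, `gw_command test http://www.dot.ca.gov/fastrak/pdftest.htm -z500000 -fpdf` prints `gw -d/www/pub/webinator/test -z500000 -fpdf http://www.dot.ca.gov/fastrak/pdftest.htm`. Options are printed without re-quoting, which is fine for the options used in this thread because none of them contain spaces.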