pdf & doc

resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

I have succeeded in getting pdf and doc files spidered into some test databases, but the standard search script does not find them, and neither does a /gw -st select statement. A select count(Url) statement finds the pages, but a select statement does not find any text strings within the pdf or doc pages; it finds text strings only within the html pages in the database.
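
For reference, here is the sort of statement I have been running (a sketch; the search term is just a stand-in, and the Body field name assumes the stock Webinator html table):

gw -st "select count(Url) from html"
gw -st "select Url from html where Body like 'physics'"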

How can this be done? Is there Vortex code for it?

Thanks

I reviewed the pdf plugin, which allows results to be displayed in a browser's pdf plugin, but that's not what I'm asking about.

I reviewed the url syntax and url settings pages in the vortex manual, but that doesn't answer the question either.

I looked for index commands in the webinator manual but no luck there.

Sun machine

/texis -version
Texis Web Script (Vortex) Copyright (c) 1996-2000 Thunderstone - EPI, Inc.
Commercial Version 3.01.962147411 of Jun 27, 2000 (sparc-sun-solaris2.6)

/gw -version
Webinator WWW Site Indexer Version 2.56 (Commercial)
Copyright(c) 1995,1996,1997,1998,1999,2000 Thunderstone EPI Inc.
Release: 20000627


Linux machine


/gw -version
Webinator WWW Site Indexer Version 2.52 (Commercial)
Copyright(c) 1995,1996,1997,1998 Thunderstone EPI Inc.
Release: 19990218
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

pdf & doc

Post by John »

The first step is to review what is in the database for the files in question. E.g.

gw -st "select * from html where Url = 'www.server/file.pdf'"

and see what information was extracted from the file. You might also try running the plugin, anytotx, manually on the file from the command line:

anytotx < file.pdf

where you have a copy of file.pdf on the machine with anytotx.
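
If you only have the files remotely, one way to get a local copy to test with (assuming a fetch tool such as wget is available) is:

wget http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
anytotx < cv.pdf

If anytotx prints readable text, extraction itself works and the problem is in what reached the database.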
John Turnbull
Thunderstone Software
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks for your response, John.

gw.log shows that the files were retrieved:

2001/12/16 13:30:38 Retrieving http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
2001/12/16 13:30:39 Max page size exceeded (truncated) for http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
2001/12/16 13:30:39 Retrieving http://depts.washington.edu/biostat/fac ... llstro.pdf
2001/12/16 13:30:39 Max page size exceeded (truncated) for http://depts.washington.edu/biostat/fac ... llstro.pdf
2001/12/16 13:30:39 Retrieving http://depts.washington.edu/psych/Facul ... ero_cv.pdf
2001/12/16 13:30:40 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... ero_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... lyn_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... ith_cv.pdf
2001/12/16 13:30:40 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... ith_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... oll_cv.pdf
2001/12/16 13:30:41 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... oll_cv.pdf


but a select statement does not find anything there.

Here is the command line that retrieved the files:

/gw -d/export/usr/data/health.pdf.test -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -z10000 -v9 "&health.pdf.cv.11216"


We don't have copies of any of the files; all are remote.

Is it possible the 10,000-character limit (-z10000) prevents any meaningful text in pdf files from being spidered?

I was incorrect earlier: select count(Url) does not find them either.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

pdf & doc

Post by John »

Yes, that's exactly what is happening. Only 10,000 characters are downloaded from the webserver, which isn't enough to extract any text from. In fact, with a PDF file you need the entire file to extract the text.
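
As a quick test, the earlier command could be rerun with the -z limit dropped so whole files are downloaded (a sketch of that command, unchanged apart from removing -z10000):

gw -d/export/usr/data/health.pdf.test -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -v9 "&health.pdf.cv.11216"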
John Turnbull
Thunderstone Software
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

I tried it without the -z flag.

Here is the gw command line:

gw -d/db -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -v9 "&list" > outputfile

Here is some output:

http://www.uoregon.edu/~jonesey/cfjresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.upenn.edu/careerservices/gra ... resume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.waddellsoftware.com/pdf%20do ... resume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.werc.net/landfill/files/carlsonresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.whpierceexploration.com/whpresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving

Here is some of the gw.log file:

2001/12/16 22:43:22 Retrieving http://www.upenn.edu/careerservices/gra ... resume.pdf
2001/12/16 22:43:22 Retrieving http://www.waddellsoftware.com/pdf%20do ... resume.pdf
2001/12/16 22:43:23 Retrieving http://www.werc.net/landfill/files/carlsonresume.pdf
2001/12/16 22:43:23 Retrieving http://www.whpierceexploration.com/whpresume.pdf
2001/12/16 22:43:24 Max page size exceeded (truncated) for http://www.whpierceexploration.com/whpresume.pdf

The urlcount for todo showed over 900 urls; now, after retrieving, the urlcount shows 0. The urlcount for html shows 5, and the html.tbl file is only 48022 bytes.

Where did they go?

Thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

The pdf urls are being rejected as a disallowed MIME type because nothing is configured to handle application/pdf. Use the -n option to run them through the pdftotx plugin so the text gets extracted before indexing; the gw manual page shows the -n syntax.
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks; I will try that for pdf.

I did have some success with doc files without the -n option. Search results & links appeared normally using the standard search script.

Should I try another experiment with doc files with the -n option and compare results?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

Without the plugin, the doc files and such will be placed into the database as-is. Some of the text in doc files happens to be visible that way, so searches will find things, but you will get better results by using the plugin. PDF files have no "visible" text; it's all compressed and otherwise mangled.
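
To see the difference, run a doc file through the converter by hand, as suggested earlier in the thread (assuming a local copy of the file):

anytotx < file.doc

The plugin gives you clean extracted text; the raw file gives only whatever fragments happen to be stored readably.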
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks

So for a pdf file I use the argument straight from the man page:

gw -n"application/pdf,pdf,pdftotx"

What is the argument for a doc file?

gw -n"application/msword,doc,?????"
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

-n"application/msword,doc,pdftotx -fmsw"

Please also see the readme that came with the plugin (pdftotx.txt or anytotx.txt).
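
Putting both together, a run built from the flags used earlier in this thread might look like this (a sketch; it assumes -n can be given once per type and that pdftotx is installed where gw can find it):

gw -d/db -noindex -dns=sys -a -R -r -O -n"application/pdf,pdf,pdftotx" -n"application/msword,doc,pdftotx -fmsw" -t7 -v9 "&list"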