pdf & doc

resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

I have succeeded in getting pdf and doc files spidered into some test databases, but the standard search script does not find them, and neither does a /gw -st select statement. A select count(Url) statement finds the pages, but a select statement does not find any text strings within the pdf or doc pages; it finds text strings only within the html pages in the database.
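
For reference, here is the sort of statement I have been running (a sketch; the search term is just a stand-in, and the Body field name assumes the stock Webinator html table):

gw -st "select count(Url) from html"
gw -st "select Url from html where Body like 'physics'"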

How can this be done? Is there Vortex code for it?

Thanks

I reviewed the pdf plugin, which allows results to be displayed in a browser's pdf plugin, but that's not what I'm asking about.

I reviewed the url syntax and url settings pages in the vortex manual, but that doesn't answer the question either.

I looked for index commands in the webinator manual but no luck there.

Sun machine

/texis -version
Texis Web Script (Vortex) Copyright (c) 1996-2000 Thunderstone - EPI, Inc.
Commercial Version 3.01.962147411 of Jun 27, 2000 (sparc-sun-solaris2.6)

/gw -version
Webinator WWW Site Indexer Version 2.56 (Commercial)
Copyright(c) 1995,1996,1997,1998,1999,2000 Thunderstone EPI Inc.
Release: 20000627


Linux machine


/gw -version
Webinator WWW Site Indexer Version 2.52 (Commercial)
Copyright(c) 1995,1996,1997,1998 Thunderstone EPI Inc.
Release: 19990218
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

pdf & doc

Post by John »

The first step is to review what is in the database for the files in question. E.g.

gw -st "select * from html where Url = 'www.server/file.pdf'"

and see what information was extracted from the file. You might also try running the plugin, anytotx, manually on the file from the command line:

anytotx < file.pdf

where you have a copy of file.pdf on the machine with anytotx.
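
If you only have the files remotely, one way to get a local copy to test with (assuming a fetch tool such as wget is available) is:

wget http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
anytotx < cv.pdf

If anytotx prints readable text, extraction itself works and the problem is in what reached the database.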
John Turnbull
Thunderstone Software
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks for your response, John.

gw.log shows that the files were retrieved:

2001/12/16 13:30:38 Retrieving http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
2001/12/16 13:30:39 Max page size exceeded (truncated) for http://dept.physics.upenn.edu/~pcn/Mss/cv.pdf
2001/12/16 13:30:39 Retrieving http://depts.washington.edu/biostat/fac ... llstro.pdf
2001/12/16 13:30:39 Max page size exceeded (truncated) for http://depts.washington.edu/biostat/fac ... llstro.pdf
2001/12/16 13:30:39 Retrieving http://depts.washington.edu/psych/Facul ... ero_cv.pdf
2001/12/16 13:30:40 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... ero_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... lyn_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... ith_cv.pdf
2001/12/16 13:30:40 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... ith_cv.pdf
2001/12/16 13:30:40 Retrieving http://depts.washington.edu/psych/Facul ... oll_cv.pdf
2001/12/16 13:30:41 Max page size exceeded (truncated) for http://depts.washington.edu/psych/Facul ... oll_cv.pdf


but a select statement does not find anything there.

Here is the command line that retrieved the files:

/gw -d/export/usr/data/health.pdf.test -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -z10000 -v9 "&health.pdf.cv.11216"


We don't have copies of any of the files; all are remote.

Is it possible the 10,000-character limit (-z10000) prevents any meaningful text in pdf files from being spidered?

I was incorrect earlier: select count(Url) does not find them either.
John
Site Admin
Posts: 2622
Joined: Mon Apr 24, 2000 3:18 pm
Location: Cleveland, OH

pdf & doc

Post by John »

Yes, that's exactly what is happening. Only 10,000 characters are downloaded from the webserver, which isn't enough to extract any text from. In fact, with a PDF file you need the entire file to extract the text.
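
As a quick test, the earlier command could be rerun with the -z limit dropped so whole files are downloaded (a sketch of that command, unchanged apart from removing -z10000):

gw -d/export/usr/data/health.pdf.test -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -v9 "&health.pdf.cv.11216"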
John Turnbull
Thunderstone Software
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

I tried it without the -z flag.

Here is the gw command line:

gw -d/db -noindex -dns=sys -a -R -r -O -fpdf -fshtml -fasp -fcfm -fjsp -t7 -v9 "&list" > outputfile

Here is some output:

http://www.uoregon.edu/~jonesey/cfjresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.upenn.edu/careerservices/gra ... resume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.waddellsoftware.com/pdf%20do ... resume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.werc.net/landfill/files/carlsonresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Disallowed MIME type
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0
http://www.whpierceexploration.com/whpresume.pdf
4: TotLinks: 68, Links: 65/ 0, Good: 47, New: 0 Retrieving

Here is some of the gw.log file:

2001/12/16 22:43:22 Retrieving http://www.upenn.edu/careerservices/gra ... resume.pdf
2001/12/16 22:43:22 Retrieving http://www.waddellsoftware.com/pdf%20do ... resume.pdf
2001/12/16 22:43:23 Retrieving http://www.werc.net/landfill/files/carlsonresume.pdf
2001/12/16 22:43:23 Retrieving http://www.whpierceexploration.com/whpresume.pdf
2001/12/16 22:43:24 Max page size exceeded (truncated) for http://www.whpierceexploration.com/whpresume.pdf

The urlcount for todo showed over 900 urls; now, after retrieving, the urlcount shows 0. The urlcount for html shows 5, and the html.tbl file is only 48022 bytes.

Where did they go?

Thanks
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

The pdf urls are being rejected as a disallowed MIME type because nothing is configured to handle application/pdf. Use the -n option to run them through the pdftotx plugin so the text gets extracted before indexing; the gw manual page shows the -n syntax.
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks; I will try that for pdf.

I did have some success with doc files without the -n option. Search results & links appeared normally using the standard search script.

Should I try another experiment with doc files with the -n option and compare results?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

Without the plugin, the doc files and such will be placed into the database as-is. Some of the text in doc files happens to be visible that way, so searches will find things, but you will get better results by using the plugin. PDF files have no "visible" text; it's all compressed and otherwise mangled.
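
To see the difference, run a doc file through the converter by hand, as suggested earlier in the thread (assuming a local copy of the file):

anytotx < file.doc

The plugin gives you clean extracted text; the raw file gives only whatever fragments happen to be stored readably.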
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

pdf & doc

Post by resume.robot »

Thanks

So for a pdf file I use the argument straight from the man page:

gw -n"application/pdf,pdf,pdftotx"

What is the argument for a doc file?

gw -n"application/msword,doc,?????"
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

pdf & doc

Post by mark »

-n"application/msword,doc,pdftotx -fmsw"

Please also see the readme that came with the plugin (pdftotx.txt or anytotx.txt).
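
Putting both together, a run built from the flags used earlier in this thread might look like this (a sketch; it assumes -n can be given once per type and that pdftotx is installed where gw can find it):

gw -d/db -noindex -dns=sys -a -R -r -O -n"application/pdf,pdf,pdftotx" -n"application/msword,doc,pdftotx -fmsw" -t7 -v9 "&list"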