-z argument with <!-- hidden and <style> tags in html

resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

Post by resume.robot »

I seem to be noticing entries with no body tag. When I visit the individual web pages, they seem to have a long <!-- hidden --> tag and a long <style> tag.

The gw spider is set to -z5000. Is this creating the issue?

Thanks
Mike Clark
robot@resumerobot.com
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Do you mean no body *content*?
Is the HTML source of the page larger than your -z setting of 5000 bytes? If so, you should receive a "truncated" message during the walk, and the body content will contain only as much as was in the first 5000 bytes.
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

Post by resume.robot »

Yes, you are correct: the database entries for a few pages display no body content.

I don't know what the log entries were for those pages. I do see truncated pages and log entries all the time, and for those entries the body content is the first 5000 characters, just like it's supposed to be.

Here are 3 of the URLs in question:
http://66.92.69.146/tomresume.htm
http://www.catadjuster.org/adjusters/joyce.html
http://www.hendrik-weber.de/vitae-e7_Delaware2.htm

Reviewing them with view/source, I see that they all have XML tags at the top, even though the pages have an .htm or .html suffix. The pages seem to have been created by Microsoft Word 9.

Here is the gw command line being used to populate the database:

gw -d/db -noindex -a -r -O -fshtml -fasp -fcfm -t7 -z5000
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

As I suspected. There's no text in the first 5000 bytes of the HTML.
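
To check this for a given page, something like the following rough Python sketch could report the byte offset at which the first visible text starts. It is not part of gw; the name first_text_offset and the regex-based skipping of comments, <style>, <script>, and <title> blocks are assumptions made for illustration, and a real HTML parser would be more robust.

```python
import re

# Constructs assumed to carry no body text: comments, style/script/title
# blocks, any other tag, and whitespace. A rough heuristic, not a parser.
INVISIBLE = re.compile(
    rb"<!--.*?-->"
    rb"|<style\b.*?</style\s*>"
    rb"|<script\b.*?</script\s*>"
    rb"|<title\b.*?</title\s*>"
    rb"|<[^>]*>"
    rb"|\s+",
    re.IGNORECASE | re.DOTALL,
)


def first_text_offset(html_bytes):
    """Return the byte offset of the first visible character, or None."""
    pos = 0
    while pos < len(html_bytes):
        match = INVISIBLE.match(html_bytes, pos)
        if match is None:
            return pos  # nothing skippable here, so visible text starts
        pos = match.end()
    return None  # the page has no visible text at all


# Hypothetical page with a long <style> block ahead of the body text.
sample = (
    b"<html><head><!-- hidden metadata --><style>"
    + b"p.MsoNormal { margin: 0in; } " * 300
    + b"</style></head><body><p>Experienced claims adjuster</p></body></html>"
)
print(first_text_offset(sample))
```

If the reported offset is larger than the -z value, the walked copy of the page would contain no body text.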
resume.robot
Posts: 68
Joined: Sat Jan 13, 2001 1:23 am

Post by resume.robot »

Is there any way to exclude hidden tags from the text to be spidered, so that the limit applies only to spider-able (visible) text characters?
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

No. The limit is on the downloaded size. It downloads the allowed amount, disconnects, then extracts the text from the page. The idea is to reduce bandwidth usage. You just need to download more across the board to accommodate such things. (Or tell your submitters not to put so much junk at the head of their files.)
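
The effect of that order (cut first, extract afterwards) can be illustrated with a small Python sketch. This is not gw's code: the class VisibleText and the function text_after_limit are hypothetical names, and Python's standard html.parser stands in for whatever extraction gw actually performs.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <style>/<script>; comments are ignored."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_after_limit(html_bytes, limit):
    """Keep only the first `limit` bytes, then extract the visible text."""
    truncated = html_bytes[:limit]  # stands in for the size-limited download
    parser = VisibleText()
    parser.feed(truncated.decode("latin-1"))  # latin-1 never fails to decode
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page whose <style> block alone exceeds 5000 bytes.
sample = (
    b"<html><head><!-- hidden metadata --><style>"
    + b"p.MsoNormal { margin: 0in; } " * 300
    + b"</style></head><body><p>Experienced claims adjuster</p></body></html>"
)
print(repr(text_after_limit(sample, 5000)))
print(repr(text_after_limit(sample, len(sample))))
```

With the 5000-byte cut, the sample ends inside the <style> block, so the extracted text comes back empty. With the whole page available, the body text is found.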