dowalk, cgi redirect, and frames

pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

Hi,

I'm using the dowalk script to spider the site and populate the html table. I've altered dowalk to ignore the path /cgi-bin/ unless the path is /cgi-bin/navigator/. This allows us to spider pages that are linked via a CGI redirect. My problem is that not all the links are being recorded. To complicate matters, the site uses frames, and some of the links in the left nav bar open a page in the right-hand main content frame.
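
For illustration only, the path rule described above can be sketched in Python. This is not the dowalk code itself; `should_walk` and the two constants are hypothetical names, and the only behavior assumed is "skip /cgi-bin/ except /cgi-bin/navigator/".

```python
from urllib.parse import urlsplit

REJECT_PREFIX = "/cgi-bin/"            # paths the walker normally skips
ALLOW_PREFIX = "/cgi-bin/navigator/"   # the one exception to that rule


def should_walk(url: str) -> bool:
    """Return True if the URL's path passes the cgi-bin rule."""
    path = urlsplit(url).path
    if path.startswith(ALLOW_PREFIX):
        return True  # redirect script: let the walker follow it
    return not path.startswith(REJECT_PREFIX)
```

Checking the exception before the rejection matters here, because the allowed prefix is itself inside the rejected one.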

Has anyone out there run into a similar situation?

Thanks for your help.

Peter
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What kind of links are not being recorded? And when you say recorded, do you mean in refs, or followed and fetched into html? It will only record, follow, and store http URLs.

Pages linked from framed pages will be treated as standalone pages, regardless of "target", and not as part of their parent frameset. You usually only want the content from the "main" section of the frame anyhow, not the nav menu repeated over and over.
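
To make the "standalone pages" point concrete, here is a minimal Python sketch of a link collector that treats frame sources and ordinary links the same way and ignores "target". It is an illustration under that assumption, not how the walker is implemented; `LinkCollector` is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <frame>, <iframe>, and <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("frame", "iframe"):
            ref = attrs.get("src")
        elif tag == "a":
            ref = attrs.get("href")  # any "target" attribute is ignored
        else:
            ref = None
        if ref:
            # Each URL is queued on its own, with no record of its frameset.
            self.urls.append(urljoin(self.base_url, ref))
```

Feeding it a frameset page yields the nav and main frame URLs as two independent pages to fetch.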
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

I figured out what was going on. The redirects were going to a different machine, but the address bar still showed the original machine. For example, www.myfakesite.com would have a link that redirects to someothercomputer.myfakesite.com, but the browser would still display something like www.myfakesite.com/redirected/page.html.

I tried playing with the acceptdomain setting, but it didn't help.

Any ideas?
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

Sorry, when I said recorded, I meant that most of the links were not recorded in the html table.

I just found out that the site is actually load balancing between two different machines for performance. Has anybody been in a situation like this before?

It seems that if we let dowalk follow the redirect, then every time someone tried to get to a page from the search results screen, they would always go to the specific machine that dowalk was redirected to for that link.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

That's a fairly unusual way to do load balancing. It's usually done in some transparent fashion. acceptdomain will work for normal hrefs but not for the targets of redirects. If you have any influence over the server, you might configure it to always send your walker to the same machine. Identify the walker either by IP or by User-Agent.

You could set <urlcp offsiteok y>
That will allow it to follow all offsite redirects and frames.

The URLs returned by searches will be static until the next walk. Search users will be directed to the same URL that the walker was given.
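
As a rough model of the behavior described in this post (a redirect to another host is only followed when offsite fetches are allowed), here is a small Python sketch. It is a toy under that assumption, not the walker's actual logic; `resolve_redirect` and its `offsite_ok` parameter are hypothetical names.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve_redirect(source_url: str, location: str,
                     offsite_ok: bool = False) -> Optional[str]:
    """Return the absolute redirect target if it may be followed, else None."""
    target = urljoin(source_url, location)  # Location may be relative
    src_host = urlsplit(source_url).hostname
    dst_host = urlsplit(target).hostname
    if dst_host == src_host or offsite_ok:
        return target
    return None  # redirect leaves the starting host and is dropped
```

Note that whatever URL comes back is the one that would be stored, which is why search users end up on the same machine the walker was sent to.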
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

It's strange. I set offsiteok to yes and it does grab a couple more pages, but there are still many pages that aren't being walked. I even removed "cgi-bin" from the rejects variable.
It seems to get some of the links on the main page, but the ones below it (links on the linked pages) aren't getting walked.
When I run gw, it complains about a refs table not in the data dictionary, but it still did over 800 inserts into the html table before I stopped it.

We decided not to modify the html table. Could we still use gw but exclude a list of about 20 directories, and also exclude links to a certain machine that is different from the start URL but within the same domain? I tried this, but couldn't quite get the -x/EXPR option working.

I found some documentation on gw and robots. I'll try it out and update this thread.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

You need to look at exactly which pages are being walked and which are not to determine the exact cause. That will involve the html, refs, and error tables.

If there's no refs table, you must have modified dowalk to not create it. Or you have verbosity turned up high and are seeing an informational message.

You should be able to exclude specific areas from specific sites within the allowed domain. Note that -x (not -x/) is the same as making an entry in robots.txt.
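
For readers unfamiliar with robots.txt-style exclusion, here is a minimal Python sketch of per-host prefix rules using the standard library's `urllib.robotparser`. It only illustrates the matching idea and says nothing about how gw parses -x; the hostnames come from the example earlier in the thread, and the directory names, `make_rules`, `RULES`, and `allowed` are hypothetical.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def make_rules(disallowed_prefixes):
    """Build a parser from robots.txt-style Disallow prefixes."""
    rp = RobotFileParser()
    lines = ["User-agent: *"] + [f"Disallow: {p}" for p in disallowed_prefixes]
    rp.parse(lines)
    return rp


# Hypothetical rules: skip two directories on one host, everything on another.
RULES = {
    "www.myfakesite.com": make_rules(["/private/", "/archive/"]),
    "someothercomputer.myfakesite.com": make_rules(["/"]),
}


def allowed(url: str, agent: str = "examplewalker") -> bool:
    """Return True if no rule for the URL's host excludes its path."""
    rp = RULES.get(urlsplit(url).hostname)
    return True if rp is None else rp.can_fetch(agent, url)
```

Keeping one rule set per host is what lets a whole machine be excluded while the rest of the domain stays walkable.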
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

I've started with a new database, and this time I used gw -create to make the database instead of using creatdb and creating the tables myself. I also started with a fresh version of the dowalk script. It seems to be working okay so far. I have two concerns, though.
1. When the site gets modified, would it be hard to alter the script to allow it to rewalk the site (without having to wipe the db and repopulate it)?
2. The other concern is about which SQL commands are available, and I have started a different thread for that one.

Thanks for your help.
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

1) It would be a whole different outer loop. Instead of creating the database and walking through todo lists, you would select old URLs from the existing database, set <urlcp ifmodsince $Visited> before each fetch, and then check for httpcode 304, which indicates "not modified". Only records for modified pages need to be updated.
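
The HTTP side of that loop can be sketched in Python, for illustration only. This is not Vortex and not the dowalk script: `conditional_headers` and `action_for` are hypothetical helpers, and the sketch assumes the stored visit time is a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(visited: datetime) -> dict:
    """Build the If-Modified-Since header from the last visit time."""
    stamp = format_datetime(visited.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": stamp}


def action_for(status: int) -> str:
    """Decide what to do with a stored record given the response code."""
    if status == 304:
        return "skip"    # not modified: keep the existing record
    if status == 200:
        return "update"  # page changed: refresh the stored record
    return "review"      # anything else needs a closer look
```

For each stored URL, the loop would send the headers from `conditional_headers` and act on the status code, so unchanged pages cost only a header exchange.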
pnam
Posts: 18
Joined: Fri May 18, 2001 1:20 pm

Post by pnam »

Okay, I've only been Vortex scripting for a couple of days, so please be patient with me. It seems like what you suggest will check whether a page already exists in the tstone db and update it based on its modification date in the HTTP header info, and new html pages will be inserted as normal. What about deleted pages? We don't want to return search results for pages that don't exist!
I apologize if I've misunderstood your response.