Can't crawl site

webinatoruser
Posts: 39
Joined: Fri Apr 08, 2005 8:54 am

Post by webinatoruser »

I'm trying to crawl the documents linked from this page, but to no avail. There is at least one redirect, and setting up the cookie directory doesn't help. The links only work through a browser, which makes me think a referrer check happens before the documents are served from the second server. Any ideas for a workaround in Webinator?

http://www.un.org/Docs/sc/sgrep11.htm
mark
Site Admin
Posts: 5515
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

What's an example of a page you want to get but aren't getting? I notice that most of the links on that page are on a different site, daccess-ods.un.org. Maybe you need to add that to Extra Domains. Also make sure you're not stripping query strings and that ? is not in exclusions. You also need to turn off Stay Under if you want it to follow links above .../Docs/sc/
webinatoruser
Posts: 39
Joined: Fri Apr 08, 2005 8:54 am

Post by webinatoruser »

I can't get any of the documents on that page. I have both the other site and the second redirect host, daccess-dds-ny.un.org, in my Base URLs. Query strings aren't stripped and ? is not in exclusions. Stay Under is on, but I assume that wouldn't matter once the hosts above are in Base URLs. I also notice that if I try to grab one of the final document URLs on daccess-dds-ny.un.org directly, I get an error message from the server, which suggests it is not happy that I am linking from a non-un.org domain.
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

Yes, it's definitely checking the Referer, both for the /access.nsf/ page that eventually redirects to the PDF and for the login page itself, which provides the cookie.

Webinator normally doesn't send a Referer header on the main fetches. You can edit the script to force this. In the <fetchset> function, you should see a line like this:

<fetch parallel=$SSc_maxthreads $u><!-- get everything at this depth -->

If you add this right before it:
<urlcp header "Referer" "http://daccess-ods.un.org/">

Then, assuming the cookie has been set up properly, you should be able to get the PDFs.
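
If you want to confirm outside Webinator that the Referer is really what the server is gating on, here is a minimal Python sketch using only the standard library. It is not part of Webinator. The helper names are made up for illustration, and the Referer value is just the one suggested above, which may not be what the server actually requires.

```python
import urllib.request
from http.cookiejar import CookieJar

# Assumption: the Referer value suggested above is what the server expects.
REFERER = "http://daccess-ods.un.org/"


def build_opener_with_cookies():
    """Return (opener, jar); the jar keeps cookies between requests, as a browser would."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def make_request(url, referer=REFERER):
    """Build a GET request that carries an explicit Referer header."""
    return urllib.request.Request(url, headers={"Referer": referer})


def fetch(opener, url, referer=REFERER, timeout=30):
    """Fetch url with the Referer set; return (status, final URL, Content-Type)."""
    with opener.open(make_request(url, referer), timeout=timeout) as resp:
        return resp.status, resp.geturl(), resp.headers.get("Content-Type")
```

Call build_opener_with_cookies() once, then fetch(opener, url) with one of the document links from the list page. If the final Content-Type comes back as a PDF with the Referer set, and you get the error page when you pass a different referer value, that points to the Referer check rather than the cookie setup.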
webinatoruser
Posts: 39
Joined: Fri Apr 08, 2005 8:54 am

Post by webinatoruser »

I tried everything and couldn't get it to work. Instead, after finding the login form on the error page, I wrote the script below, and it worked. But what I really wanted was to crawl from the page with the list of links. Can the essence of the script below be emulated in the Webinator interface?

<urlcp header "Referer" "http://www.un.org/">
<urlcp user "freeods2">
<urlcp pass "1234">
<urlcp offsiteok "on">
<urlcp maxredirs 5>
<submit method="post" URL="http://daccess-dds-ny.un.org/names.nsf?Login"
name="_DominoForm"
Username="freeods2"
Password="1234">
<fetch "http://daccess-dds-ny.un.org/doc/UNDOC/ ... penElement">
jason112
Site Admin
Posts: 347
Joined: Tue Oct 26, 2004 5:35 pm

Post by jason112 »

This can mostly be accomplished with the Primer URL system; again, the only limiting factor is that they're checking Referers, and Webinator normally doesn't send one.

If you enter the primer info:
Primer Type: Custom
Custom Primer URL: http://daccess-dds-ny.un.org/prod/ods_m ... sword=1234
Base URL MM Query: http://www.un.org/*

Then, in the script, add that <urlcp header> line at the beginning of the "doprimer" function. It should then be able to get the cookie properly and crawl the content.