Can't crawl site

Post by **webinatoruser** » Fri May 13, 2011 11:28 am

I'm trying to crawl the documents linked from this page, but to no avail. There is at least one redirect, and setting up the cookie directory doesn't help. It only works through a browser, which makes me think a referrer check happens before the documents are served from the second server. Any ideas for a workaround in Webinator?

http://www.un.org/Docs/sc/sgrep11.htm

Post by **mark** » Fri May 13, 2011 12:39 pm

What's an example of a page you want to get but aren't getting? I notice that most of the links on that page are on a different site, daccess-ods.un.org. Maybe you need to add that to Extra Domains. Also make sure you're not stripping query strings and that ? is not in exclusions. You also need to turn off stay under if you want it to follow links above .../Docs/sc/.

Post by **webinatoruser** » Fri May 13, 2011 2:54 pm

I can't get any of the documents on that page. I have the other site and the second redirect target, daccess-dds-ny.un.org, in my base URLs. Query strings aren't stripped and ? is not in exclusions. Stay under is on, but I assume that wouldn't matter once the above are added to base URLs. I also notice that if I try to grab one of the final document URLs on daccess-dds-ny.un.org directly, I get an error message from the server, suggesting it is not happy that I am linking from a non-un.org domain.

Post by **jason112** » Fri May 13, 2011 3:49 pm

Yes, it's definitely checking the Referer, both for the /access.nsf/ page that eventually redirects to the PDF and for the login page itself that provides the cookie.

Webinator normally doesn't provide referers to the main fetches. You can edit the script to force this - in the <fetchset> function, you should see a line like this:

<fetch parallel=$SSc_maxthreads $u>

If you add this right before it:
<urlcp header "Referer" "http://daccess-ods.un.org/">

Then, assuming the cookie has been set up properly, you should be able to get the PDFs.
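To check outside Webinator whether the Referer really is what the server looks at, a minimal Python sketch like the one below can build the same request with and without the header. This is not Vortex and not part of Webinator; the helper name `build_request` is made up for illustration, and the URL is the one from this thread.

```python
import urllib.request


def build_request(url, referer=None):
    """Return a urllib Request for url, optionally carrying a Referer header."""
    req = urllib.request.Request(url)
    if referer is not None:
        # Same idea as <urlcp header "Referer" ...>: attach the header explicitly.
        req.add_header("Referer", referer)
    return req


if __name__ == "__main__":
    # Nothing is sent here; the two requests are only constructed and printed.
    # To actually fetch, pass either one to urllib.request.urlopen().
    site = "http://daccess-ods.un.org/"
    print(build_request(site).header_items())
    print(build_request(site, referer=site).header_items())
```

If the request without the header gets the error page and the one with it does not, that would support the Referer theory.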

Post by **webinatoruser** » Sat May 14, 2011 6:37 am

I tried everything and couldn't get it to work. After finding the login form on the error page, I wrote the script below instead, and it worked. But what I really wanted to do was crawl from the page with the list of links. Can the essence of the script below be emulated in the Webinator interface?

<urlcp header "Referer" "http://www.un.org/">
<urlcp user "freeods2">
<urlcp pass "1234">
<urlcp offsiteok "on">
<urlcp maxredirs 5>
<submit method="post" URL="http://daccess-dds-ny.un.org/names.nsf?Login"
name="_DominoForm"
Username="freeods2"
Password="1234">
<fetch "http://daccess-dds-ny.un.org/doc/UNDOC/ ... penElement">
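For comparison, the same sequence (set a Referer, post the login form, keep the cookie, then fetch the document) can be sketched in Python with only the standard library. This is an illustration of the flow, not a Webinator feature: the function names are hypothetical, only the Username and Password fields from the script above are sent (whether the server needs anything else is not known), and the document URL is left as a parameter because it is truncated in the post.

```python
import http.cookiejar
import urllib.parse
import urllib.request

LOGIN_URL = "http://daccess-dds-ny.un.org/names.nsf?Login"
REFERER = "http://www.un.org/"


def make_opener():
    """Build an opener that stores cookies and adds a Referer to every request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # addheaders are applied to each request the opener sends, redirects included.
    opener.addheaders.append(("Referer", REFERER))
    return opener, jar


def build_login_request(username, password):
    """POST request carrying the login form fields used in the script above."""
    form = {"Username": username, "Password": password}
    data = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(LOGIN_URL, data=data)


def fetch_document(opener, doc_url, username, password):
    """Log in first so the session cookie is set, then fetch the document."""
    opener.open(build_login_request(username, password)).close()
    with opener.open(doc_url) as resp:
        return resp.read()
```

Calling `fetch_document(opener, doc_url, "freeods2", "1234")` with an opener from `make_opener()` would mirror the `<submit>` followed by `<fetch>` above.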

Post by **jason112** » Tue May 17, 2011 10:32 am

This can mostly be accomplished with the Primer URL system; again, the only limiting factor is that they're checking referers, and Webinator normally doesn't include one.

If you enter the primer info:
Primer Type: Custom
Custom Primer URL: http://daccess-dds-ny.un.org/prod/ods_m ... sword=1234
Base URL MM Query: http://www.un.org/*

And in the script, add that <urlcp header> line at the beginning of the "doprimer" function; then it should be able to get the cookie properly and crawl the content.