user-agent or something else?

nduvnjak
Posts: 40
Joined: Wed Feb 06, 2008 3:45 pm

Post by nduvnjak »

Hi,
when I look at the source html of this website:
http://phpbbchina.com/forum/index.php
via IE, I get the "English" version. The content of the site is still in Chinese, but some keywords that my crawling script depends on are in English. The source HTML starts like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-gb" xml:lang="en-gb">

But when I <fetch> it from a Vortex script, I get the "Chinese" version instead. I can confirm that by looking at the returned HTML, which starts like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="zh-cn" xml:lang="zh-cn">

And of course the keywords my script was relying on are lost; unreadable characters appear in their place.

How can I make the Vortex script retrieve the same HTML as IE?

I tried setting the user-agent to the same value that IE sends, namely:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
but that didn't help.
Is there any other setting?

Thank you!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Try setting the "Accept-Language" header to the desired value.
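
For anyone who wants to check this outside Vortex, here is a minimal Python sketch of the same idea, using only the standard library. The helper names build_request and html_lang are made up for illustration, and the sketch assumes the server chooses the page language from the Accept-Language request header:

```python
import re
import urllib.request


def build_request(url, language="en", user_agent=None):
    """Build a request that asks the server for a specific language.

    Hypothetical helper: it only sets headers; it does not send anything.
    """
    headers = {"Accept-Language": language}
    if user_agent:
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


def html_lang(html):
    """Return the lang attribute of the <html> tag, or None if absent."""
    match = re.search(r'<html[^>]*\slang="([^"]+)"', html)
    return match.group(1) if match else None
```

To use it, pass the request to urllib.request.urlopen, decode the body, and call html_lang on the result to see whether the server answered with "en-gb" or "zh-cn".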
nduvnjak
Posts: 40
Joined: Wed Feb 06, 2008 3:45 pm

Post by nduvnjak »

Great, thanks! This actually worked:

<urlcp header "accept-language" "en">

Can I bug you some more?
There's another difference between what IE and the Vortex fetch return, and I don't know which setting I need to adjust.

The HTML source from IE shows the href of the <a> links like this:

href="./viewtopic.php?f=2&t=830&st=0&sk=t&sd=a&start=15"

while the fetch from Vortex gives me this:

href="./viewtopic.php?f=2&t=830&st=0&sk=t&sd=a&sid=a5eea6f89731e309064ac7e7c12be869&start=15"

i.e. it inserts some kind of session ID, I guess.

How can I avoid that?

Thanks!
mark
Site Admin
Posts: 5519
Joined: Tue Apr 25, 2000 6:56 pm

Post by mark »

Maybe that's a session ID of some kind. You may need to enable cookies so that the ID is not put in the URL. Beyond that, it's really hard to guess the details of their internal app design.
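
To test the cookie guess outside Vortex, something like the following Python sketch could help. It uses only the standard library; make_cookie_opener and has_sid are hypothetical names, and whether the site really drops the sid once cookies are accepted is an assumption that only a real fetch can confirm:

```python
import http.cookiejar
import urllib.request
from urllib.parse import parse_qs, urlsplit


def make_cookie_opener():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def has_sid(href):
    """Report whether a link carries a 'sid' query parameter."""
    return "sid" in parse_qs(urlsplit(href).query)
```

Fetching the page twice through the same opener and running has_sid over the links in the second response would show whether the session ID moves from the URL into a cookie.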