Hi,
When I look at the HTML source of this website:
http://phpbbchina.com/forum/index.php
via IE, I get the "English" version. The content of the site is still in Chinese, but some keywords that are important for my crawling script are in English. The header of the HTML source looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-gb" xml:lang="en-gb">
But when I <fetch> it via a Vortex script, I get something else: the "Chinese" version. I can confirm that by looking at the returned HTML, whose header is:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="zh-cn" xml:lang="zh-cn">
And of course the "keywords" my script was relying on are lost. Some unreadable characters appear in their place.
How can I make the Vortex script retrieve the same HTML as IE?
I tried setting the user-agent to the same value that IE sends, namely:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)
but that didn't help.
Is there any other setting?
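One candidate I have not ruled out is the Accept-Language request header, which IE sends along with the user-agent and which the forum may be using to pick the language. To make clear what I mean, here is a minimal sketch in Python (not Vortex) of a request that mirrors both headers. The `en-gb` value is an assumption taken from the `lang` attribute IE receives, and `build_request` and `page_language` are just illustrative helper names.

```python
import re
import urllib.request

URL = "http://phpbbchina.com/forum/index.php"

# The exact user-agent string that IE sends (copied from above).
IE_USER_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; "
    ".NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 1.1.4322; "
    ".NET CLR 3.0.04506.648; .NET CLR 3.5.21022)"
)


def build_request(url, accept_language="en-gb"):
    """Build a request carrying both the IE user-agent and a language preference.

    Assumption: the server chooses the page language from Accept-Language.
    """
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": IE_USER_AGENT,
            "Accept-Language": accept_language,
        },
    )


def page_language(html):
    """Return the value of the lang attribute on the <html> tag, or None."""
    match = re.search(r'<html[^>]*\slang="([^"]+)"', html)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Performs a real network fetch; the result depends on the server.
    with urllib.request.urlopen(build_request(URL)) as response:
        html = response.read().decode("utf-8", errors="replace")
    print(page_language(html))
```

If the language reported by `page_language` changes when the Accept-Language value changes, then what I need is the equivalent setting in Vortex.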
Thank you!