Python Scrape with requests

When I run this script, IDLE does not continue; usually it would at least give an error. Other scripts run fine, so I know it's not IDLE. I thought my code was correct, but maybe I've missed something. This is not everything I will be scraping from the site; I just wanted to see this part work first, then I can run through everything later.
import csv
import requests
import os

##HOME TEAM
req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
data = req.json()
my_data = []
pk = data['resultSets']
for item in data:
    team = item.get['rowSet']
    for item in team:
        Team_Id = item[0]
        Team_Name = item[1]
    my_data.append([Team_Id, Team_Name])
headers = ["Team_Id", "Team_Name"]
with open("NBA_Home_Team.csv", "a", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(my_data)
f.close()
##os.system("taskkill /f /im pythonw.exe")

It seems to hang because the server doesn't respond. This can be verified by killing the process and checking the stack trace:
Traceback (most recent call last):
req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 502, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 612, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 438, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/lib/python2.7/socket.py", line 480, in readline
data = self._sock.recv(self._rbufsize) <-- we're stuck here
KeyboardInterrupt
I tried to open the URL in my browser and it worked fine; I received a response within a second. Then I started to tweak the request in the code to mimic a valid browser. My first idea was using a valid User-Agent, and I immediately received a response with the following code:
data = requests.get(
    'http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=',
    headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) AppleWebKit/602.1.38 (KHTML, like Gecko) Version/10.0 Mobile/14A300 Safari/602.1'},
).json()
Perhaps some kind of defense mechanism against bots causes the unresponsiveness when no valid User-Agent is sent.
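A related note: passing an explicit timeout makes requests raise an exception instead of hanging forever when the server stays silent. A minimal sketch (the 10-second value is arbitrary, and url stands for the long stats.nba.com URL above):
try:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10)  # give up after 10 seconds instead of hanging
    data = req.json()
except requests.exceptions.Timeout:
    print('stats.nba.com did not respond in time')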
Other notes on the code snippet:
for item in data:
Use pk instead of data.
team = item.get['rowSet']
Use item['rowSet'] or item.get('rowSet'), but do not mix them: item.get is a method, so [] cannot be applied to it.
my_data.append([Team_Id, Team_Name])
The indentation should be the same as that of the line above.
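Putting those three notes together, a minimal corrected version of the loop might look like this (a sketch that assumes the resultSets/rowSet layout returned by the URL above):
my_data = []
for result_set in pk:                  # iterate pk (data['resultSets']), not data
    for row in result_set['rowSet']:   # index with [], don't apply [] to the .get method
        Team_Id = row[0]
        Team_Name = row[1]
        my_data.append([Team_Id, Team_Name])  # same indentation as the two lines above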


PyGithub and Python 3.6

I have the following script that grabs a repository from GitHub using PyGithub:
import logging
import getpass
import os
from github import Github, Repository as Repository, UnknownObjectException

GITHUB_URL = 'https://github.firstrepublic.com/api/v3'

if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    logging.debug('validating GH token')
    simpleuser = getpass.getuser().replace('adm_', '')
    os.path.exists(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'))
    with open(os.path.join(os.path.expanduser('~' + getpass.getuser()) + '/.ssh/github-' + simpleuser + '.token'), 'r') as token_file:
        github_token = token_file.read()
    logging.debug(f'Token after file processing: {github_token}')
    logging.debug('initializing github')
    g = Github(base_url=GITHUB_URL, login_or_token=github_token)
    logging.debug("attempting to get repository")
    source_repo = g.get_repo('CLOUD/iam')
Works just fine in Python 3.9.1 on my Mac.
In production, we have RHEL7, Python 3.6.8 (can't upgrade it, don't suggest it). This is where it blows up:
(virt) user#lmachine: directory$ python3 test3.py -r ORG/repo_name -d
DEBUG:root:validating GH token
DEBUG:root:Token after file processing: <properly_formed_token>
DEBUG:root:initializing github
DEBUG:root:attempting to get repository
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): <domain>:443
Traceback (most recent call last):
File "test3.py", line 68, in <module>
source_repo = g.get_repo(args.repo)
File "/home/adm_gciesla/virt/lib/python3.6/site-packages/github/MainClass.py", line 348, in get_repo
"GET", "%s%s" % (url_base, full_name_or_id)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 319, in requestJsonAndCheck
verb, url, parameters, headers, input, self.__customConnection(url)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 410, in requestJson
return self.__requestEncode(cnx, verb, url, parameters, headers, input, encode)
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 487, in __requestEncode
cnx, verb, url, requestHeaders, encoded_input
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 513, in __requestRaw
response = cnx.getresponse()
File "/home/user/virt/lib/python3.6/site-packages/github/Requester.py", line 116, in getresponse
allow_redirects=False,
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/user/virt/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/home/user/virt/lib/python3.6/site-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib64/python3.6/http/client.py", line 1254, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib64/python3.6/http/client.py", line 1295, in _send_request
self.putheader(hdr, value)
File "/usr/lib64/python3.6/http/client.py", line 1232, in putheader
raise ValueError('Invalid header value %r' % (values[i],))
ValueError: Invalid header value b'token <properly_formed_token>\n'
The script is a stripped-down version of a larger application. I've tried rolling back to earlier versions of PyGithub (that's really all I have control over in prod); same error regardless. PyGithub's latest release claims Python >=3.6 should work.
I've really run the gamut of debugging. Seems like reading from environment variables can work sometimes, but the script needs to be able to use whatever credentials are available. Passing in the token as an argument is only for running locally.
Hopefully someone out there has seen something similar.
We just figured it out. Apparently, even though there's no newline in the .token file, there is one in the string returned by token_file.read().
Changing github_token = token_file.read() to github_token = token_file.read().strip() fixes the problem.
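In other words, the stray newline has to be stripped before the token is handed to PyGithub. A minimal sketch of the token-reading step (token_path stands for the path assembled above):
with open(token_path, 'r') as token_file:
    github_token = token_file.read().strip()  # drop the trailing newline picked up by read()
g = Github(base_url=GITHUB_URL, login_or_token=github_token)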

404 HTTP error, despite being able to see the page in the browser

I'm trying to map this website, but I ran into a problem while trying to fully crawl it: I'm getting a 404 error even though the URL exists.
Here is my code:
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

csvFile = open("C:/Users/Pichau/codigo/govbr/brasil/govfederal/govbr/arquivos/teste.txt", 'wt')
paginas = set()

def getLinks(pageUrl):
    global paginas
    html = urlopen("https://www.gov.br/pt-br/" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    writer = csv.writer(csvFile)
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            if link.attrs['href'] not in paginas:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                paginas.add(newPage)
                getLinks(newPage)
                csvRow = []
                csvRow.append(newPage)
                writer.writerow(csvRow)

getLinks("")
csvFile.close()
And this is the error message I got, after I tried to run that code:
#wrapper
/
#main-navigation
#nolivesearchGadget
#tile-busca-input
#portal-footer
http://brasil.gov.br
Traceback (most recent call last):
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 26, in <module>
getLinks("")
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
getLinks(newPage)
[Previous line repeated 4 more times]
File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 10, in getLinks
html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
response = meth(req, response)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
response = self.parent.error(
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
return self._call_chain(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
result = func(*args)
File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
PS C:\Users\Pichau\codigo\govbr>
I've tried it with only the main link, and it works fine, but as soon as I add the pageUrl variable to the URL, it gives me this error. How can I fix it?
From what I can see, you're right: the page is there... for people using browsers. What I assume is happening is some basic anti-bot mechanism that bans uncommon User-Agents, or in other words, only lets browsers view the page. However, as the User-Agent is a header we can control, we can change it so the site won't throw the 404 error.
I can't type out the code for it at the moment, but you will need to pair this StackOverflow answer describing how to change a header in urllib with some code that sets the "User-Agent" header to a value like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36, which I've taken from here.
After you've changed the User-Agent header, you should be able to download the page successfully.
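For reference, a minimal sketch of that idea with urllib.request (getHtml is a hypothetical helper name; the User-Agent string is the example value mentioned above):
from urllib.request import Request, urlopen

def getHtml(pageUrl):
    req = Request(
        "https://www.gov.br/pt-br/" + pageUrl,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"},
    )
    return urlopen(req)  # the returned response can be passed straight to BeautifulSoup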

2captcha selenium out of range

I'm trying to implement 2captcha using selenium with Python.
I just copied the example from their documentation:
https://github.com/2captcha/2captcha-api-examples/blob/master/ReCaptcha%20v2%20API%20Examples/Python%20Example/2captcha_python_api_example.py
This is my code:
from selenium import webdriver
from time import sleep
from selenium.webdriver.support.select import Select
import requests
driver = webdriver.Chrome('chromedriver.exe')
driver.get('the_url')
current_url = driver.current_url
captcha = driver.find_element_by_id("captcha-box")
captcha2 = captcha.find_element_by_xpath("//div/div/iframe").get_attribute("src")
captcha3 = captcha2.split('=')
#print(captcha3[2])
# Add these values
API_KEY = 'my_api_key' # Your 2captcha API KEY
site_key = captcha3[2] # site-key, read the 2captcha docs on how to get this
url = current_url # example url
proxy = 'Myproxy' # example proxy
proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
s = requests.Session()
# here we post site key to 2captcha to get captcha ID (and we parse it here too)
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url), proxies=proxy).text.split('|')[1]
# then we parse gresponse from 2captcha response
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
print("solving ref captcha...")
while 'CAPCHA_NOT_READY' in recaptcha_answer:
    sleep(5)
    recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id), proxies=proxy).text
recaptcha_answer = recaptcha_answer.split('|')[1]
# we make the payload for the post data here, use something like mitmproxy or fiddler to see what is needed
payload = {
    'key': 'value',
    'gresponse': recaptcha_answer  # This is the response from 2captcha, which is needed for the post request to go through.
}
# then send the post request to the url
response = s.post(url, payload, proxies=proxy)
# And that's all there is to it other than scraping data from the website, which is dynamic for every website.
This is my error:
solving ref captcha...
Traceback (most recent call last):
File "main.py", line 38, in
recaptcha_answer = recaptcha_answer.split('|')[1]
IndexError: list index out of range
The captcha is getting solved, because I can see it on the 2captcha dashboard, so what is the error if this is the official documentation?
EDIT:
For some reason, without any modification, I'm now getting the captcha solved by 2captcha, but then I get this error:
solving ref captcha...
OK|this_is_the_2captch_answer
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
http.client.BadStatusLine: <html>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\util\retry.py", line 368, in increment
raise six.reraise(type(error), error, _stacktrace)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\packages\six.py", line 685, in reraise
raise value.with_traceback(tb)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 594, in urlopen
self._prepare_proxy(conn)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 805, in _prepare_proxy
conn.connect()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connection.py", line 308, in connect
self._tunnel()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 906, in _tunnel
(version, code, message) = response._read_status()
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 278, in _read_status
raise BadStatusLine(line)
urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 49, in <module>
response = s.post(url, payload, proxies=proxy)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 581, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "C:\Users\Usuari\AppData\Local\Programs\Python\Python37-32\lib\site-packages\requests\adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('<html>\r\n'))
Why am I getting this error?
I'm setting site_key = current_url_where_captcha_is_located.
Is this correct?
Use your debugger or put a print(recaptcha_answer) before the failing line to see what the value of recaptcha_answer is before you try to call .split('|') on it. There is no | in the string, so when you try to get the second element of the resulting list with [1], it fails.
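A defensive sketch of that check, assuming the plain-text responses described in the question (s is the requests.Session from the code above):
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
print(recaptcha_answer)  # e.g. 'CAPCHA_NOT_READY', 'OK|<token>' or an error code
if recaptcha_answer.startswith('OK|'):
    token = recaptcha_answer.split('|')[1]  # only split once the token is actually there
else:
    print('2captcha did not return a token:', recaptcha_answer)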
It looks like you don't provide any valid proxy connection parameters, yet you pass this proxy to requests when connecting to the API.
Just comment these two lines:
#proxy = 'Myproxy' # example proxy
#proxy = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
And then remove proxies=proxy from four lines:
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url)).text.split('|')[1]
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
response = s.post(url, payload)

Python: requests module throws exception with Gevent

The following code:
import gevent
import gevent.monkey
gevent.monkey.patch_socket()
import requests
import json
base_url = 'https://api.getclever.com'
section_url = base_url + '/v1.1/sections'
#get all sections
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
urls = [base_url+data['uri']+'/students' for data in sections['data']]
#get students for each section
threads = [gevent.spawn(requests.get, url, auth=('DEMO_KEY', '')) for url in urls]
gevent.joinall(threads)
students = [thread.value for thread in threads]
#get number of students in each section
num_students = [len(student.json()['data']) for student in students]
print (sum(num_students)/len(num_students))
results in this error:
Traceback (most recent call last):
File "clever.py", line 12, in <module>
sections = requests.get(section_url, auth=('DEMO_KEY', '')).json()
File "/Library/Python/2.7/site-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 382, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Python/2.7/site-packages/requests/sessions.py", line 485, in send
r = adapter.send(request, **kwargs)
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 379, in send
raise SSLError(e)
requests.exceptions.SSLError: [Errno 2] _ssl.c:503: The operation did not complete (read)
What am I doing wrong here?
Here's a similar question: [Errno 2] _ssl.c:504: The operation did not complete (read).
The problem disappears when you comment out
gevent.monkey.patch_socket()
or use
gevent.monkey.patch_all()
or use
gevent.monkey.patch_socket()
gevent.monkey.patch_ssl()
Patching only the socket module makes the sockets cooperative, but the unpatched ssl module still expects the standard blocking sockets, and that mismatch is what makes requests fail on the HTTPS call with "The operation did not complete (read)".
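A minimal sketch of the safer ordering, patching everything before requests is imported:
from gevent import monkey
monkey.patch_all()  # patches socket, ssl and friends before anything imports them

import gevent
import requests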

Problem with urllib2 loading mobile site

I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with Python.
At the moment I'm using following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for timeout error:
Traceback (most recent call last):
File "user/src/finnkinoParser.py", line 26, in <module>
main()
File "user/src/finnkinoParser.py", line 13, in main
getNowPlayingMovies()
File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
req = urllib2.urlopen(baseURL,None,2.5)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser it works fine. So could someone tell me what makes that site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there any other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine.
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server checks several properties of the request. After some testing, here are the conclusions:
The protocol must be HTTP/1.1.
If the request headers include a Connection property, its value should be keep-alive.
The request headers must include a User-Agent property, whatever its value.
In urllib2, however, the Connection property in HTTPHandler is set to Close by default (L1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
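If adding a dependency is an option, a client that speaks HTTP/1.1 with keep-alive by default should also satisfy those checks. A sketch with requests (my assumption, not something tested against that server):
import requests

resp = requests.get(
    "http://m.finnkino.fi/events/now_showing",
    headers={"User-Agent": "Mozilla/5.0"},  # the server rejects requests without a User-Agent
    timeout=2.5,
)
print(resp.status_code, resp.reason)
print(resp.text)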
