Can calling Python's urllib2.info() cause an exception?

I'm getting a couple of exceptions popping up from time to time but can't think of the cause.
Here's a snippet:
try:
    r = urllib2.urlopen(url)
except urllib2.URLError, e:
    if hasattr(e, 'code'):
        # unauthorized
        print('UA: %s' % url)
    elif hasattr(e, 'reason'):
        print('TO: %s' % url)
        # timeout
else:
    i = r.info()
    try:
        server = i['server']
    except:
        pass
    else:
        if not 'authenticate' in server:
            print('NA: %s' % url)
I'm thinking that perhaps r.info() is causing an exception, but I'm not sure why it would, as the r = urllib2.urlopen(url) call is covered by the try.
The errors are:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 551, in __bootstrap_inner
self.run()
File "C:\Users\anthony\Scripts\checker.py", line 35, in run
r = urllib2.urlopen(url)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "C:\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 371, in _read_status
raise BadStatusLine(line)
BadStatusLine: ''
and
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1180, in do_open
r = h.getresponse(buffering=True)
File "C:\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "C:\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "C:\Python27\lib\httplib.py", line 365, in _read_status
line = self.fp.readline()
File "C:\Python27\lib\socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
error: [Errno 10054] An existing connection was forcibly closed by the remote host
I've read a bit about [Errno 10054] but have no idea how to prevent it.
Any help would be appreciated.

"I'm thinking perhaps that r.info() is causing an exception but not sure why it would as the r = urllib2.urlopen(url) is covered with the try."
Nope. The first exception has nothing to do with r.info(): it is raised by urllib2.urlopen(url), as you can see in the traceback.
BadStatusLine is defined in httplib and derives from httplib.HTTPException, not from urllib2.URLError, so your except urllib2.URLError clause simply doesn't catch it. You should probably broaden your exception handling, for example:
except (httplib.HTTPException, urllib2.URLError) as err:
    ...
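Here is a minimal sketch of that broader handling. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error and httplib was renamed http.client. The second traceback in the question ends in a socket error rather than an HTTPException, so the sketch catches that as well. The function name classify_fetch, the opener parameter and the returned strings are hypothetical and only there for illustration.

```python
import http.client
import socket
import urllib.error
import urllib.request


def classify_fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch url and return a short status string instead of raising.

    classify_fetch and opener are hypothetical names; opener exists only
    so the network call can be swapped out when testing.
    """
    try:
        response = opener(url, timeout=timeout)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 401 or 404.
        return 'http-error: %s' % err.code
    except urllib.error.URLError as err:
        # The request got no response (DNS failure, refused connection, ...).
        return 'url-error: %s' % err.reason
    except http.client.HTTPException as err:
        # The response was unusable, e.g. BadStatusLine for an empty status line.
        return 'bad-response: %s' % type(err).__name__
    except socket.error as err:
        # Low-level failures such as "connection reset by peer".
        # In Python 3, socket.error is an alias of OSError.
        return 'socket-error: %s' % err
    else:
        try:
            headers = response.info()
            return 'ok: %s' % headers.get('Server', 'unknown')
        finally:
            response.close()
```

The order of the except clauses matters: HTTPError is a subclass of URLError, so it has to come first, and the generic socket clause goes last.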

Related

urllib2 httplib.BadStatusLine

I am using Python's urllib2 to connect to an HTTP server. Sometimes I get the error httplib.BadStatusLine: ''.
My code:
response = None
try:
    request = urllib2.Request(http_url,params)
    response = urllib2.urlopen(request,timeout=5000)
    return str(response.read())
except urllib2.HTTPError :
    return ""
except urllib2.URLError:
    return ""
The error:
File "/usr/lib64/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib64/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/lib64/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/lib64/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib64/python2.7/urllib2.py", line 1244, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib64/python2.7/urllib2.py", line 1217, in do_open
r = h.getresponse(buffering=True)
File "/usr/lib64/python2.7/httplib.py", line 1051, in getresponse
response.begin()
File "/usr/lib64/python2.7/httplib.py", line 415, in begin
version, status, reason = self._read_status()
File "/usr/lib64/python2.7/httplib.py", line 379, in _read_status
raise BadStatusLine(line)
httplib.BadStatusLine: ''
I don't know how to fix this bug, and I don't know why I get this error response.

Catching errors when scraping with Selenium

As part of a scraping job, I am trying to catch errors and bypass them. I want to keep a while: loop going in spite of these errors being raised. I have this code:
logger = logging.getLogger(__name__)
# ...
except (httplib.HTTPException, IOError) as e:
    logger.exception('Ignoring exception, sleeping for 20 seconds')
    time.sleep(20)
But, this still throws the same socket error as before:
Traceback (most recent call last):
File "/Users/aa/Box Sync/Work/PythonCode/TWPh/dev.py", line 54, in <module>
old_length = len(driver.page_source)
File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 438, in page_source
return self.execute(Command.GET_PAGE_SOURCE)['value']
File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 173, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 417, in _request
resp = opener.open(request)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1200, in do_open
r = h.getresponse(buffering=True)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1073, in getresponse
response.begin()
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 415, in begin
version, status, reason = self._read_status()
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 371, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 54] Connection reset by peer
[Finished in 24639.7s with exit code 1]

Python error escape "socket.error: [Errno 54] Connection reset by peer"

I'm running a scraper that's going through a few domains and it's getting hung up on http://www.1000markets.com/
Checking from different sources it seems to be down. That's totally fine, but I'm getting the error mentioned in the title.
How can I escape this? I'm catching HTTPError and URLError but it's still getting hung up.
Any help on this would be great
def get_html(link):
    import urllib2
    from urllib2 import Request, urlopen, URLError, HTTPError
    try:
        res = urllib2.urlopen(link)
        html = res.read()
    except URLError as e:
        return link
    except HTTPError as e:
        return link
edit: Attached is the error
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
r = h.getresponse(buffering=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
response.begin()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
socket.error: [Errno 54] Connection reset by peer

Go to next item in list on error

I am pulling websites from a list and want to test whether they are up or down. The code below works fine as long as they are up, but as soon as something is wrong with one of these URLs, I get an error message and the whole script stops.
What I want to achieve: an error message means the website is not working, so print "down" and move on to the next item in the list.
import urllib2
from urllib2 import Request, urlopen, HTTPError, URLError
def checkurl(z):
    user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent':user_agent }
    link = "http://"+z
    req = Request(link, headers = headers)
    try:
        page_open = urlopen(req)
    except HTTPError, e:
        print "down"
    else:
        print 'up'
        #print urllib2.urlopen('http://'+z).read()
Traceback (most recent call last):
File "/home/user/Videos/python/onion/qweqweqweq.py", line 48, in <module>
checkurl(x)
File "/home/user/Videos/python/onion/qweqweqweq.py", line 23, in checkurl
page_open = urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 401, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 419, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 379, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1211, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1178, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/usr/lib/python2.7/httplib.py", line 962, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 996, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 958, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 818, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 780, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 761, in connect
self.timeout, self.source_address)
File "/home/user/Videos/python/onion/qweqweqweq.py", line 5, in create_connection
sock.connect(address)
File "/usr/lib/python2.7/dist-packages/socks.py", line 369, in connect
self.__negotiatesocks5(destpair[0],destpair[1])
File "/usr/lib/python2.7/dist-packages/socks.py", line 236, in __negotiatesocks5
raise Socks5Error(ord(resp[1]),_generalerrors[ord(resp[1])])
TypeError: __init__() takes exactly 2 arguments (3 given)

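You are catching only HTTPError, but the failure comes from the SOCKS layer: socks.py tries to raise Socks5Error, and, as the last line of the traceback shows, constructing that exception itself fails with a TypeError, which is what actually propagates.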

You're missing Socks5Error in your except clause. Look at the traceback:
raise Socks5Error(ord(resp[1]),_generalerrors[ord(resp[1])])
Note that this wouldn't have happened if you used requests instead of urllib2. The interface is a lot clearer, the documentation better.

In answer to "would it be possible to assume that the website is down regardless of the error", then this will do it:
req = Request(link, headers = headers)
try:
    page_open = urlopen(req)
except:
    print "down"
else:
    print 'up'
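A slightly more cautious variant is sketched below. It catches Exception rather than using a bare except, so Ctrl-C (KeyboardInterrupt) and SystemExit still stop the script, and it keeps going through the whole list. It is written for Python 3 (urllib.request instead of urllib2); check_urls and the opener parameter are hypothetical names used only for this illustration.

```python
import urllib.request


def check_urls(hosts, opener=urllib.request.urlopen, timeout=10):
    """Return a dict mapping each host to 'up' or 'down'.

    check_urls and opener are hypothetical names; opener exists only so
    the network call can be replaced when testing.
    """
    results = {}
    for host in hosts:
        link = 'http://' + host
        try:
            response = opener(link, timeout=timeout)
        except Exception:
            # Any failure (HTTP error, socket error, proxy error) counts as
            # down. Exception does not include KeyboardInterrupt or SystemExit.
            results[host] = 'down'
        else:
            response.close()
            results[host] = 'up'
    return results
```

Calling check_urls(list_of_hosts) then gives one entry per host, and a failing host no longer aborts the loop.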

Python 3 get HTML content

I'm using this code to get a web site's HTML content:
import urllib.request
import lxml.html as lh
req = urllib.request.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11",
                             headers={'User-Agent': "Magic Browser"})
html = urllib.request.urlopen(req).read()
doc = lh.fromstring(html)
print(''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split()))
I want to get the organization (Zenith Data Systems), but it shows some errors:
Traceback (most recent call last):
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 1135, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 967, in request
self._send_request(method, url, body, headers)
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 1005, in _send_request
self.endheaders(body)
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 963, in endheaders
self._send_output(message_body)
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 808, in _send_output
self.send(msg)
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 746, in send
self.connect()
File "/usr/local/python3.2.3/lib/python3.2/http/client.py", line 724, in connect
self.timeout, self.source_address)
File "/usr/local/python3.2.3/lib/python3.2/socket.py", line 404, in create_connection
raise err
File "/usr/local/python3.2.3/lib/python3.2/socket.py", line 395, in create_connection
sock.connect(sa)
socket.error: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "ext.py", line 4, in <module>
html = urllib.request.urlopen(req).read()
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 369, in open
response = self._open(req, data)
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 387, in _open
'_open', req)
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 347, in _call_chain
result = func(*args)
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 1155, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/python3.2.3/lib/python3.2/urllib/request.py", line 1138, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>
How can I solve it? Thanks.

Basically, "Connection refused" means that nothing accepted the TCP connection on that host and port: the server may be down, not listening on that port, or behind a firewall that rejects the connection. It has nothing to do with whether you are a registered user.
If you want to handle the error in your code, wrap the request in try and except, like this:
try:
    req = urllib.request.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11",
                                 headers={'User-Agent': "Magic Browser"})
    html = urllib.request.urlopen(req).read()
    doc = lh.fromstring(html)
    print(''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split()))
except urllib.error.URLError as e:
    print(e.reason)
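If you want to tell a refused connection apart from other failures, you can look at e.reason, which for socket-level problems is usually the underlying OSError. A minimal sketch, assuming that is the case; describe_url_error is a hypothetical helper name and the messages are only examples:

```python
import errno
import urllib.error


def describe_url_error(err):
    """Return a readable cause for a urllib.error.URLError.

    describe_url_error is a hypothetical helper, not part of urllib.
    """
    reason = err.reason
    if isinstance(reason, OSError) and reason.errno == errno.ECONNREFUSED:
        # Errno 111 on Linux: the host was reached but nothing accepted
        # the connection on that port.
        return 'connection refused: nothing is listening on that host and port'
    # Fall back to whatever urllib reported (may be a plain string).
    return 'could not reach server: %s' % reason
```

You would call it inside the except urllib.error.URLError as e: block, for example print(describe_url_error(e)).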
