Retry loading page on timeout with urllib2? - python

I am trying to force Python to retry loading the page when I get a timeout error. Is there a way that I can make it retry a specific number of times, possibly after a specific time delay?
Any help would be appreciated.
Thank you.

urllib2 doesn't have anything built-in for that, but you can write it yourself.
The tricky part is that, as the urlopen docs say, no matter what goes wrong, you just get a URLError. So, how do you know whether it was a timeout, or something else?
Well, if you look up URLError, it says it will have a reason which will be a socket.error for remote URLs. And if you look up socket.error it tells you that it's a subclass of either IOError or OSError (depending on your Python version). And if you look up OSError, it tells you that it has an errno that represents the underlying error.
So, which errno value do you get for timeout? I'm willing to bet it's EINPROGRESS, but let's find out for sure:
>>> urllib2.urlopen('http://127.0.0.1', timeout=0)
urllib2.URLError: <urlopen error [Errno 36] Operation now in progress>
>>> errno.errorcode[36]
'EINPROGRESS'
(You could just use the number 36, but that's not guaranteed to be the same across platforms; errno.EINPROGRESS should be more portable.)
So:
import errno
import urllib2

def retrying_urlopen(retries, *args, **kwargs):
    for i in range(retries):
        try:
            return urllib2.urlopen(*args, **kwargs)
        except urllib2.URLError as e:
            if e.reason.errno == errno.EINPROGRESS:
                continue
            raise
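Since the question also asks for a delay between attempts, here is a small variation on the same idea that sleeps before retrying. The delay parameter and the getattr guard are my additions, not part of the snippet above; treat it as a sketch.

import errno
import time
import urllib2

def retrying_urlopen(retries, delay, *args, **kwargs):
    for i in range(retries):
        try:
            return urllib2.urlopen(*args, **kwargs)
        except urllib2.URLError as e:
            # Only retry on the timeout errno, and only if attempts remain.
            if getattr(e.reason, 'errno', None) == errno.EINPROGRESS and i < retries - 1:
                time.sleep(delay)
                continue
            raise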
If you think this sucks and should be a lot less clunky… well, I think everyone agrees. Exceptions have been radically improved twice, with another big one coming up, plus various small changes along the way. But if you stick with 2.7, you don't get the benefits of those improvements.
If moving to Python 3.4 isn't possible, maybe moving to a third-party module like requests or urllib3 is. Both of those libraries have a separate exception type for Timeout, instead of making you grub through the details of a generic URLError.
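For example, urllib3 can do the retrying for you. This is just a sketch with illustrative values for total, backoff_factor, and the timeouts, not a recommendation:

import urllib3
from urllib3.util.retry import Retry

http = urllib3.PoolManager()
try:
    resp = http.request(
        'GET',
        'http://example.com/',
        timeout=urllib3.Timeout(connect=2.0, read=5.0),
        retries=Retry(total=3, backoff_factor=0.5),
    )
except urllib3.exceptions.MaxRetryError as e:
    print('giving up after repeated failures: %s' % e)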

Check out the requests library. If you'd like to wait only for a specified amount of time (not for the entire download, just until you get a response from the server), just add the timeout argument to the standard URL request, in seconds:
r = requests.get(url, timeout=10)
If the timeout time is exceeded, it raises a requests.exceptions.Timeout exception, which can be handled however you wish. As an example, you could put the request in a try/except block, catch the exception if it's raised, and retry the connection again for a specified number of times before failing completely.
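For instance, a minimal sketch of that retry loop might look like the following; the URL, attempt count, and sleep are placeholder values:

import time
import requests

url = 'http://example.com/'
for attempt in range(3):
    try:
        r = requests.get(url, timeout=10)
        break  # got a response, stop retrying
    except requests.exceptions.Timeout:
        print('timed out, retrying (attempt %d)' % (attempt + 1))
        time.sleep(2)
else:
    raise RuntimeError('no response after 3 attempts')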
You might also want to check out requests.adapters.HTTPAdapter, which has a max_retries argument. It's typically used within a Requests Session, and according to the docs, it provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface.
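A minimal sketch of that, assuming an arbitrary retry count of 3 and an example URL: mount the adapter on a Session and make requests through it. According to the docs, max_retries applies only to failed DNS lookups, socket connections and connection timeouts, not to requests where data has already reached the server.

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
adapter = HTTPAdapter(max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)

r = session.get('http://example.com/', timeout=10)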

I am new to Python too, but I think even a simple solution like this could do the trick.
Begin by setting stuff to None, where stuff is the page source. Also remember that I have only handled the URLError exception; you might want to add more as desired.
import urllib2
import time

stuff = None
max_attempts = 4
r = 0

while stuff is None and r < max_attempts:
    try:
        response = urllib2.urlopen('http://www.google.com/ncr', timeout=10)
        stuff = response.read()
    except urllib2.URLError:
        r = r + 1
        print "Re-trying, attempt -- ", r
        time.sleep(5)

print stuff
Hope that helps.

Related

Python: What happens if script stops while requests.get() is executing?

I know that requests.get() provides an HTTP interface so that the programmer can make various requests to an HTTP server.
That tells me that somewhere a port must be opened so that the request can happen.
Taking that into account, what would happen if the script is stopped (say, by a KeyboardInterrupt, so the machine executing the script remains connected to the internet) before the request is answered/complete?
Would the port/connection remain opened?
Does the port/connection close automatically?
The short answer to the question is: requests will close a connection in the case of any exception, including KeyboardInterrupt and SystemExit.
A little digging into the requests source code reveals that requests.get ultimately calls the HTTPAdapter.send method (which is where all the magic happens).
There are two ways in which a request might be made within the send method: chunked or not chunked. Which send we perform depends on the value of the request.body and the Content-Length header:
chunked = not (request.body is None or 'Content-Length' in request.headers)
In the case where the request body is None or the Content-Length is set, requests will make use of the high-level urlopen method of urllib3:
if not chunked:
    resp = conn.urlopen(
        method=request.method,
        url=url,
        body=request.body,
        # ...
    )
The finally block of the urllib3.PoolManager.urlopen method has code that handles closing the connection in the case where the try block didn't execute successfully:
clean_exit = False
# ...
try:
    # ...
    # Everything went great!
    clean_exit = True
finally:
    if not clean_exit:
        # We hit some kind of exception, handled or otherwise. We need
        # to throw the connection away unless explicitly told not to.
        # Close the connection, set the variable to None, and make sure
        # we put the None back in the pool to avoid leaking it.
        conn = conn and conn.close()
        release_this_conn = True
In the case where the response can be chunked, requests goes a bit lower level and uses the underlying low-level connection provided by urllib3. In this case, requests still handles the exception; it does so with a try/except block that starts immediately after grabbing a connection and finishes with:
low_conn = conn._get_conn(timeout=DEFAULT_POOL_TIMEOUT)
try:
    # ...
except:
    # If we hit any problems here, clean up the connection.
    # Then, reraise so that we can handle the actual exception.
    low_conn.close()
    raise
Interestingly the connection may not be closed if there are no errors, depending on how you have configured connection pooling for urllib3. In the case of a successful execution, the connection is put back into the connection pool (though I cannot find a _put_conn call in the requests source for the chunked send, which might be a bug in the chunked work-flow).
On a much lower level, when a program exits, the OS kernel closes all file descriptors opened by that program. These include network sockets.
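If you would rather not rely on that implicit cleanup, you can also release the connection explicitly by using the response as a context manager (a general sketch, assuming a reasonably recent requests version; the URL and chunk size are placeholders):

import requests

# The with-block guarantees the underlying connection is released back to
# the pool (or closed) even if the code handling the response is interrupted.
with requests.get('http://example.com/', stream=True, timeout=10) as r:
    for chunk in r.iter_content(chunk_size=8192):
        pass  # process each chunk here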

Why does requests.get() not raise when the server can't be found?

In the following code snippet, I know for a fact that https://asdasdasdasd.vm:8080/v2/api-docs does not exist. It fails a DNS lookup. Unfortunately, the get() never seems to return, raise, or timeout. My logs have only "A" in them. I would expect A C D or A B D. But I only ever see A in the logs.
try:
    sys.stderr.write("A")
    resp = requests.get("https://asdasdasdasd.vm:8080/v2/api-docs", timeout=1.0)
    sys.stderr.write("B")
except:
    sys.stderr.write("C")
sys.stderr.write("D")
sys.stderr.flush()
return swag
(That URL is not sanitized for this post. That's actually the URL I'm trying to use while working on this question.)
What am I missing here?
EDIT - I have also tried specifying the timeout as (1.0,1.0) but the behavior did not change.
EDIT2 - Per suggestions below, I ran my code from the python and ipython consoles. The code behaves as I expect (ACD). Of course, in my real application, I am not running this code from the command line. I don't know how this matters, but the method containing the code is being invoked by a web service. Specifically, a Swagger endpoint. With my browser, I hit an endpoint that's supposed to return our Swagger documentation. The endpoint (which uses flask_swagger) invokes init_swagger(...). init_swagger() calls my method with a Swagger object. That's it. How this matters, I cannot say. It doesn't make any sense to me that something outside of my method should somehow be able to mess with my exception handling.
The only thing I can think of is that Swagger has jacked with the requests class. But now it is dinner time and I am going home.
The following code for me returns A, C, D
import requests
from requests.exceptions import ConnectionError
try:
    print("A")
    resp = requests.get("https://asdasdasdasd.vm:8080/v2/api-docs", timeout=1.0)
    print("B")
except ConnectionError:
    print("C")
print("D")
This is because the host cannot be resolved for me. If I swap it out for localhost...
resp = requests.get("http://localhost/v2/api-docs", timeout=1.0)
...then I see an A, followed by a period of time before C and D show.
From reading the comments, I know what is up...
builtins has a ConnectionError that can be used without importing anything. Requests doesn't use this exception; instead it uses the one found in requests.exceptions. If you wish to catch the ConnectionError, you must catch the correct exception, or it will propagate past the except clause without executing it.
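In other words, the distinction looks like this (a sketch; the URL is the question's deliberately unresolvable host):

import requests

try:
    requests.get("https://asdasdasdasd.vm:8080/v2/api-docs", timeout=1.0)
except ConnectionError:
    # The builtin ConnectionError: requests' exception is not a subclass of
    # it, so this clause never fires for the failed DNS lookup.
    print("never reached")
except requests.exceptions.ConnectionError:
    # This is the exception requests actually raises.
    print("C")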

Python Requests Module - API Calls

I've written a Django web project and am using some API calls. I'd like to build in some mechanisms to handle slow and failed API calls. Specifically, I'd like to try the API call three times with increasing timeouts, breaking the loop when the request is successful. What is a good way to handle this, or is what I've put together acceptable? Below is the code I have in place now.
for x in [0.5, 1, 5]:
    try:
        r = requests.get(api_url, headers=headers, timeout=x)
        break
    except:
        pass
You can use the exceptions provided by requests itself to handle failed API calls. For example, the ConnectionError exception is raised if a network problem occurs. Refer to this SO post for more details; I am not pasting a link to the requests docs and explaining every exception in detail, since the SO post given before has the answer to your question. An example code segment is given below:
try:
    r = requests.get(url, params={'key': 'value'})
except requests.exceptions.ConnectionError as e:
    print e
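Applied to the loop from the question, a sketch that also catches timeouts and keeps the loop structure might look like this; which exceptions you catch and what you do when every attempt fails is up to you:

import requests

r = None
for x in [0.5, 1, 5]:
    try:
        r = requests.get(api_url, headers=headers, timeout=x)
        break  # success, stop retrying
    except (requests.exceptions.Timeout,
            requests.exceptions.ConnectionError) as e:
        print e  # this attempt failed; loop on to the next, longer timeout
if r is None:
    pass  # all attempts failed -- handle however suits your view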
This outlines the procedure I'm talking about. A single API request could end up being a little flaky.
migrateup.com/making-unreliable-apis-reliable-with-python/#

In Flask, should I manually catch all possible errors in views?

I'm new to Flask. When writing views, I wonder whether all errors should be caught. If I do so, most of the view code has to be wrapped with try...except, which I don't think is graceful.
For example:
@app.route('/')
def index():
    try:
        API.do()
    except:
        abort(503)
Should I code like this? If not, will the service crash (uwsgi + lnmp)?
You only catch what you can handle. The word "handle" means "do something useful with" not merely "print a message and die". The print-and-die is already handled by the exception mechanism and probably does it better than you will.
For example, this is not handling an exception usefully:
denominator = 0
try:
    y = x / denominator
except ZeroDivisionError:
    abort(503)
There is nothing useful you can do, and the abort is redundant as that's what uncaught exceptions will cause to happen anyway. Here is an example of a useful handling:
try:
    config_file = open('private_config')
except IOError:
    config_file = open('default_config_that_should_always_be_there')
but note that if the second open fails, there is nothing useful to do, so the exception will travel up the call stack and possibly halt the program. What you should never do is use a bare except:, because it hides information about what faulted where. That will result in much head scratching when you get a defect report of "all it said was 503" and you have no idea what went wrong in API.do().
Try / except blocks that can't do any useful handling clutter up the code and visually bury the main flow of execution. Languages without exceptions force you to check every call for an error return if only to generate an error return yourself. Exceptions exist in part to get rid of that code noise.
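If what you want is a single place that turns unexpected failures into a 503 instead of wrapping every view, Flask's error handlers can do that centrally. A minimal sketch, assuming a module-level app and a placeholder do_work() standing in for API.do() from the question:

from flask import Flask

app = Flask(__name__)

def do_work():
    # Placeholder for API.do() from the question.
    raise RuntimeError("backend unavailable")

@app.errorhandler(Exception)
def handle_unexpected_error(e):
    # Log the real exception so "all it said was 503" never happens,
    # then return a generic response to the client.
    app.logger.exception("unhandled error in view")
    return "Service temporarily unavailable", 503

@app.route('/')
def index():
    do_work()  # no try/except needed in the view itself
    return "ok"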

Recovering from ECONNRESET in Python/Mechanize

I've got a large bulk downloading application written in Python/Mechanize, aiming to download something like 20,000 files. Clearly, any downloader that big is occasionally going to run into some ECONNRESET errors. Now, I know how to handle each of these individually, but there are two problems with that:
I'd really rather not wrap every single outbound web call in a try/catch block.
Even if I were to do so, there's trouble with knowing how to handle the errors once the exception has been thrown. If the code is just
data = browser.response().read()
then I know precisely how to deal with it, namely:
data = None
while data is None:
    try:
        data = browser.response().read()
    except IOError as e:
        if e.args[1].args[0].errno != errno.ECONNRESET:
            raise
        data = None
but if it's just a random instance of
browser.follow_link(link)
then how do I know what Mechanize's internal state looks like if an ECONNRESET is thrown somewhere in here? For example, do I need to call browser.back() before I try the code again? What's the proper way to recover from that kind of error?
EDIT: The solution in the accepted answer certainly works, and in my case it turned out to be not so hard to implement. I'm still academically interested, however, in whether there's an error handling mechanism that could result in quicker error catching.
Perhaps place the try..except block higher up in the chain of command:
import collections

def download_file(url):
    # Bundle together the bunch of browser calls necessary to download one file.
    browser.follow_link(...)
    ...
    response = browser.response()
    data = response.read()

urls = collections.deque(urls)
while urls:
    url = urls.popleft()
    try:
        download_file(url)
    except IOError as err:
        if err.args[1].args[0].errno != errno.ECONNRESET:
            raise
        else:
            # if ECONNRESET error, add the url back to urls to try again later
            urls.append(url)
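Regarding the follow-up question about a quicker, more reusable mechanism: one option (my sketch, not part of the answer above) is a small retry decorator you can put on download_file or any other bundle of browser calls. Exactly where the ECONNRESET errno ends up nested depends on how mechanize/urllib2 wrapped the socket error, so adjust _is_connreset to match what you actually see:

import errno
import functools
import time

def _is_connreset(exc):
    # The question digs the errno out of e.args[1].args[0]; fall back to a
    # plain errno attribute in case the error is wrapped differently.
    try:
        return exc.args[1].args[0].errno == errno.ECONNRESET
    except (IndexError, AttributeError):
        return getattr(exc, 'errno', None) == errno.ECONNRESET

def retry_on_connreset(max_attempts=3, delay=1):
    """Retry the wrapped call when it fails with ECONNRESET."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except IOError as e:
                    if not _is_connreset(e) or attempt == max_attempts - 1:
                        raise
                    time.sleep(delay)  # brief pause before retrying
        return wrapper
    return decorator

You would then write download_file as before and decorate it with @retry_on_connreset(max_attempts=3, delay=2).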
