Python HTTP Error 429 with urllib2

I am using the following code to resolve redirects and return a link's final URL:
def resolve_redirects(url):
    return urllib2.urlopen(url).geturl()
Unfortunately I sometimes get HTTPError: HTTP Error 429: Too Many Requests. What is a good way to combat this? Is the following good, or is there a better way?
def resolve_redirects(url):
    try:
        return urllib2.urlopen(url).geturl()
    except HTTPError:
        time.sleep(5)
        return urllib2.urlopen(url).geturl()
Also, what would happen if there is an exception in the except block?

It would be better to make sure the HTTP code is actually 429 before re-trying.
That can be done like this:
def resolve_redirects(url):
    try:
        return urllib2.urlopen(url).geturl()
    except HTTPError, e:
        if e.code == 429:
            time.sleep(5)
            return resolve_redirects(url)
        raise
This will also allow arbitrary numbers of retries (which may or may not be desired).
https://docs.python.org/2/howto/urllib2.html#httperror
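If unbounded retries are not desired, the recursion can carry a counter. A minimal sketch, assuming a cap of 3 attempts and a fixed 5-second wait (both values are arbitrary, not from the original answer):
import time
import urllib2
from urllib2 import HTTPError

def resolve_redirects(url, retries_left=3):
    # retries_left and the 5-second pause are illustrative choices
    try:
        return urllib2.urlopen(url).geturl()
    except HTTPError as e:
        if e.code == 429 and retries_left > 0:
            time.sleep(5)
            return resolve_redirects(url, retries_left - 1)
        raise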

This is a fine way to handle the exception, though you should check that you are always sleeping for the appropriate amount of time between requests for the given website (for example, Twitter limits the number of requests per minute and states this limit clearly in its API documentation). So just make sure you're always sleeping long enough.
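Many services that return 429 also send a Retry-After header saying how long to wait. A hedged sketch that honors it when present (this assumes the header, if sent, is a plain number of seconds, and falls back to 5 seconds otherwise):
import time
import urllib2
from urllib2 import HTTPError

def resolve_redirects(url):
    try:
        return urllib2.urlopen(url).geturl()
    except HTTPError as e:
        if e.code == 429:
            # e.info() holds the response headers; default to 5 seconds
            wait = int(e.info().get('Retry-After', 5))
            time.sleep(wait)
            return urllib2.urlopen(url).geturl()
        raise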
To recover from an exception within an exception, you can simply embed another try/except block:
def resolve_redirects(url):
    try:
        return urllib2.urlopen(url).geturl()
    except HTTPError:
        time.sleep(5)
        try:
            return urllib2.urlopen(url).geturl()
        except HTTPError:
            return "Failed twice :S"
Edit: as @jesse-w-at-z points out, you should be returning a URL in the second error case; the code I posted is just a reference example of how to write a nested try/except.

Adding a User-Agent to the request header solved my issue:
from urllib import request

url = 'https://www.example.com/abc.json'
req = request.Request(url)
req.add_header('User-Agent', 'abc-bot')
response = request.urlopen(req)
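For completeness, a sketch of consuming that response, assuming the endpoint returns JSON (the URL is the placeholder from above, and passing headers directly to Request is an equivalent alternative to add_header):
import json
from urllib import request

url = 'https://www.example.com/abc.json'  # placeholder URL from above
req = request.Request(url, headers={'User-Agent': 'abc-bot'})
with request.urlopen(req) as response:
    data = json.loads(response.read().decode('utf-8'))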


How to try accessing a webpage again after HTTP errors in a Python for loop

Examples of 522, 525, and 504 errors appear when I go to the webpages manually.
I am running the following for loop, which goes through a dictionary of subreddits (keys) and URLs (values). The URLs produce a dictionary with all posts from 2022 for a given subreddit. Sometimes the for loop stops and produces an 'http error 525' or other errors.
I'm wondering how I can check for these errors when reading the URL and then retry until the error no longer occurs before moving on to the next subreddit.
for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    page = urllib.request.urlopen(url).read()
    dict_last_posts[subredd] = page
I haven't been able to figure it out.
You can put this code in a try/except block like this:
for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    while True:
        try:
            page = urllib.request.urlopen(url).read()
            dict_last_posts[subredd] = page
            break  # exit the while loop if the request succeeded
        except urllib.error.HTTPError as e:
            if e.code == 525 or e.code == 522 or e.code == 504:
                print("Encountered HTTP error while reading URL. Retrying...")
            else:
                raise  # re-raise the exception if it's a different error
This code will catch any HTTPError that occurs while reading the URL and check if the error code is 525, 522, or 504. If it is, it will print a message and try reading the URL again. If it's a different error, it will re-raise the exception so that you can handle it appropriately.
NOTE: This code will retry reading the URL indefinitely until it succeeds or a different error occurs. You may want to add a counter or a timeout to prevent the loop from going on forever in case the error persists.
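A minimal sketch of the counter idea from that note, assuming an arbitrary cap of 5 attempts and a short pause between them:
import time
import urllib.request
import urllib.error

max_attempts = 5  # illustrative cap

for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    for attempt in range(max_attempts):
        try:
            page = urllib.request.urlopen(url).read()
            dict_last_posts[subredd] = page
            break  # success, stop retrying
        except urllib.error.HTTPError as e:
            if e.code in (504, 522, 525) and attempt < max_attempts - 1:
                print("HTTP error %d, retrying..." % e.code)
                time.sleep(2)  # brief pause before the next attempt
            else:
                raise  # different error, or attempts exhausted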
It's unwise to indefinitely retry a request. Set a limit even if it's very high, but don't set it so high that it causes you to be rate limited (HTTP status 429). The backoff_factor will also have an impact on rate limiting.
Use the requests package for this. This makes it very easy to set a custom adapter for all of your requests via Session, and it includes Retry from urllib3 which takes care of retry behavior in an object you can pass to your adapter.
import requests
from requests.adapters import HTTPAdapter, Retry
s = requests.Session()
retries = Retry(
    total=5,
    backoff_factor=0.1,
    status_forcelist=[504, 522, 525]
)
s.mount('https://', HTTPAdapter(max_retries=retries))

for subredd, url in dict_last_subreddit_posts.items():
    response = s.get(url)
    dict_last_posts[subredd] = response.content
You can play around with total (maximum number of retries) and backoff_factor (adjusts wait time between retries) to get the behavior you want.
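One detail worth knowing: if every retry is used up while the server keeps answering with a status in status_forcelist, the get call should raise requests.exceptions.RetryError instead of returning a response, so you may want to catch it. A hedged sketch, reusing the s session configured above (the timeout value is an arbitrary addition):
import requests

for subredd, url in dict_last_subreddit_posts.items():
    try:
        response = s.get(url, timeout=10)  # s is the Session with Retry mounted above
        dict_last_posts[subredd] = response.content
    except requests.exceptions.RetryError:
        print("Gave up on %s after exhausting retries" % subredd)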
Try something like this:
for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    http_response = urllib.request.urlopen(url)
    while http_response.status != 200:
        if http_response.status == 503:
            http_response = urllib.request.urlopen(url)
        elif http_response.status == 523:
            pass  # enter code here
        else:
            pass  # enter code here
    dict_last_posts[subredd] = http_response.read()
But Michael Ruth's answer is better.

Capturing Response Body for an HTTP Error in Python

I need to capture the response body for an HTTP error in Python. I am currently using the Python requests module's raise_for_status(). This method only returns the status code and description. I need a way to capture the response body for a detailed error log.
Please suggest alternatives to the Python requests module if a similar feature is present in some other module. If not, please suggest what changes can be made to the existing code to capture the said response body.
The current implementation contains just the following:
resp.raise_for_status()
I guess I'll write this up quickly. This is working fine for me:
import requests

try:
    r = requests.get('https://www.google.com/404')
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(err.request.url)
    print(err)
    print(err.response.text)
You can do something like the following, which returns the content of the response as Unicode:
response.text
or
import sys
import requests

try:
    r = requests.get('http://www.google.com/nothere')
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(err)
    sys.exit(1)
# 404 Client Error: Not Found for url: http://www.google.com/nothere
Here you'll get a full explanation of how to handle the exception; please check out Correct way to try/except using Python requests module?
You can log resp.text if resp.status_code >= 400.
There are also some tools you may pick up, such as Fiddler, Charles, or Wireshark.
However, those tools only display the body of the response; they do not include the reason or the error stack explaining why the error was raised.
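A small sketch of the logging idea above, using the standard logging module (the logger name, URL, and message format are arbitrary):
import logging
import requests

logger = logging.getLogger("http_errors")

resp = requests.get('https://www.example.com/abc.json')  # placeholder URL
if resp.status_code >= 400:
    # keep the body alongside the status line for a detailed error log
    logger.error("HTTP %d for %s: %s", resp.status_code, resp.url, resp.text)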

Requests — how to tell if you're getting a success message?

My question is closely related to this one.
I'm using the Requests library to hit an HTTP endpoint.
I want to check if the response is a success.
I am currently doing this:
r = requests.get(url)
if 200 <= r.status_code <= 299:
    # Do something here!
Instead of doing that ugly check for values between 200 and 299, is there a shorthand I can use?
The response has an ok property. Use that:
if response.ok:
    ...
The implementation is just a try/except around Response.raise_for_status, which itself checks the status code.
@property
def ok(self):
    """Returns True if :attr:`status_code` is less than 400, False if not.

    This attribute checks if the status code of the response is between
    400 and 600 to see if there was a client error or a server error. If
    the status code is between 200 and 400, this will return True. This
    is **not** a check to see if the response code is ``200 OK``.
    """
    try:
        self.raise_for_status()
    except HTTPError:
        return False
    return True
I am a Python newbie but I think the easiest way is:
if response.ok:
    # whatever
The Pythonic way to check for requests success would be to optionally raise an exception with:
try:
    resp = requests.get(url)
    resp.raise_for_status()
except requests.exceptions.HTTPError as err:
    print(err)
EAFP: It’s Easier to Ask for Forgiveness than Permission. You should just do what you expect to work, and if an exception might be thrown from the operation, then catch it and deal with that fact.
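For contrast, a quick sketch of the two styles side by side (the URL is a placeholder):
import requests

url = 'https://www.example.com/abc.json'  # placeholder URL

# LBYL: look before you leap, inspect the status explicitly
resp = requests.get(url)
if resp.ok:
    data = resp.content
else:
    print("Request failed with status %d" % resp.status_code)

# EAFP: assume success, handle the failure if it happens
try:
    resp = requests.get(url)
    resp.raise_for_status()
    data = resp.content
except requests.exceptions.HTTPError as err:
    print(err)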

Why do I receive a timeout error from Python's requests module?

I use requests.post(url, headers, timeout=10) and sometimes I receive a ReadTimeout exception: HTTPSConnectionPool(host='domain.com', port=443): Read timed out. (read timeout=10)
Since I already set timeout as 10 seconds, why am I still receiving a ReadTimeout exception?
Per https://requests.readthedocs.io/en/latest/user/quickstart/#timeouts, that is the expected behavior. As royhowie mentioned, wrap it in a try/except block, e.g.:
try:
    requests.post(url, headers=headers, timeout=10)
except requests.exceptions.Timeout:
    print("Timeout occurred")
try:
    # defined request goes here
except requests.exceptions.ReadTimeout:
    # Set up for a retry, or continue in a retry loop
You can wrap it in an exception block like the above. Since you asked about it, that one catches only ReadTimeout. To catch all of them instead:
try:
    # defined request goes here
except:
    # Set up for a retry, or continue in a retry loop
Another thing you can try: at the end of your code block, include the following:
time.sleep(2)
This worked for me. The delay is longer (in seconds) but might help overcome the issue you're having.
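A hedged sketch combining the suggestions above, retrying on ReadTimeout with a growing pause between attempts (the attempt cap, delays, URL, and headers are arbitrary placeholders):
import time
import requests

url = 'https://domain.com/endpoint'   # placeholder URL
headers = {'User-Agent': 'abc-bot'}   # placeholder headers

response = None
for attempt in range(3):  # illustrative cap of 3 attempts
    try:
        response = requests.post(url, headers=headers, timeout=10)
        break
    except requests.exceptions.ReadTimeout:
        print("Timeout on attempt %d, retrying..." % (attempt + 1))
        time.sleep(2 * (attempt + 1))  # wait a little longer each time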

Why does Python's urllib2.urlopen() raise an HTTPError for successful status codes?

According to the urllib2 documentation,
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
And yet the following code
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
raises an HTTPError with code 201 (created):
ERROR 2011-08-11 20:40:17,318 __init__.py:463] HTTP Error 201: Created
So why is urllib2 throwing HTTPErrors on this successful request?
It's not too much of a pain; I can easily extend the code to:
try:
    request = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(request)
except HTTPError, e:
    if e.code == 201:
        # success! :)
    else:
        # fail! :(
else:
    # when will this happen...?
But this doesn't seem like the intended behavior, based on the documentation and the fact that I can't find similar questions about this odd behavior.
Also, what should the else block be expecting? If successful status codes are all interpreted as HTTPErrors, then when does urllib2.urlopen() just return a normal file-like response object like all the urllib2 documentation refers to?
You can write a custom handler class for use with urllib2 to prevent specific error codes from being raised as HTTPError. Here's one I've used before:
class BetterHTTPErrorProcessor(urllib2.BaseHandler):
    # a substitute/supplement to urllib2.HTTPErrorProcessor
    # that doesn't raise exceptions on status codes 201, 204, 206
    def http_error_201(self, request, response, code, msg, hdrs):
        return response
    def http_error_204(self, request, response, code, msg, hdrs):
        return response
    def http_error_206(self, request, response, code, msg, hdrs):
        return response
Then you can use it like:
opener = urllib2.build_opener(self.BetterHTTPErrorProcessor)
urllib2.install_opener(opener)
req = urllib2.Request(url, data, headers)
urllib2.urlopen(req)
As the actual library documentation mentions:
For 200 error codes, the response object is returned immediately.
For non-200 error codes, this simply passes the job on to the protocol_error_code handler methods, via OpenerDirector.error(). Eventually, urllib2.HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.
http://docs.python.org/library/urllib2.html#httperrorprocessor-objects
I personally think it was a mistake and very nonintuitive for this to be the default behavior.
It's true that non-2XX codes imply a protocol level error, but turning that into an exception is too far (in my opinion at least).
In any case, I think the most elegant way to avoid this is:
opener = urllib.request.build_opener()
for processor in opener.process_response['https']:  # or 'http', depending on what you're using
    if isinstance(processor, urllib.request.HTTPErrorProcessor):  # HTTPErrorProcessor also handles https
        opener.process_response['https'].remove(processor)
        break  # there's only one such handler by default

response = opener.open('https://www.google.com')
Now you have the response object. You can check its status code, headers, body, etc.
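A short sketch of inspecting that object (getcode(), headers, and read() are the standard urllib response accessors):
# continuing from the opener above, with the HTTPErrorProcessor removed
response = opener.open('https://www.google.com')

print(response.getcode())                     # numeric status code, e.g. 200 or 404
print(response.headers.get('Content-Type'))   # response headers
body = response.read()                        # raw bytes of the response body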
