Python's urllib.request.urlopen() will raise an exception if the HTTP status code of the request is not OK (e.g., 404).
This is because the default opener uses the HTTPDefaultErrorHandler class:
A class which defines a default handler for HTTP error responses; all responses are turned into HTTPError exceptions.
Even if you build your own opener, it (un)helpfully includes the HTTPDefaultErrorHandler for you implicitly.
If, however, you don't want Python to raise an exception if you get a non-OK response, it's unclear how to disable this behavior.
If you build your own opener with build_opener(), the documentation notes, emphasis added,
Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ... HTTPDefaultErrorHandler ...
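The exception itself is raised by HTTPDefaultErrorHandler, but it only gets the chance because HTTPErrorProcessor (which is also in that list) hands every non-2xx response to the error handlers. Therefore, we need to make our own subclass of HTTPErrorProcessor that does not do this and simply passes the response through the pipeline unmodified. Then build_opener() will use our processor instead of the default one.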
import urllib.request
class NonRaisingHTTPErrorProcessor(urllib.request.HTTPErrorProcessor):
    # Return every response as-is instead of routing non-2xx ones to the error handlers.
    http_response = https_response = lambda self, request, response: response
opener = urllib.request.build_opener(NonRaisingHTTPErrorProcessor)
response = opener.open('http://example.com/doesnt-exist')
print(response.status)  # prints 404
This answer (including the code sample) was not written by ChatGPT, but it did point out the solution.
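To see the difference without depending on an external site, here is a minimal sketch that runs both openers against a throwaway local server. The server class AlwaysMissingHandler and the helper compare_openers are hypothetical names invented for this illustration; the empty ProxyHandler is only there so that proxy settings in the environment do not interfere with the loopback request.

```python
import http.server
import threading
import urllib.error
import urllib.request


class NonRaisingHTTPErrorProcessor(urllib.request.HTTPErrorProcessor):
    """Hand every response back to the caller, whatever its status code."""

    def http_response(self, request, response):
        return response

    https_response = http_response


class AlwaysMissingHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server: answers every GET with a 404."""

    def do_GET(self):
        body = b"nothing here"
        self.send_response(404)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def compare_openers():
    """Request the same missing URL with the default and the custom pipeline."""
    server = http.server.HTTPServer(("127.0.0.1", 0), AlwaysMissingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/doesnt-exist" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore proxy environment variables
    try:
        # Default error processing: the 404 surfaces as an exception.
        try:
            urllib.request.build_opener(no_proxy).open(url)
            default_outcome = "returned"
        except urllib.error.HTTPError as exc:
            default_outcome = "raised HTTPError %d" % exc.code
        # Custom processor: same request, but the response object comes back.
        opener = urllib.request.build_opener(no_proxy, NonRaisingHTTPErrorProcessor)
        with opener.open(url) as response:
            custom_outcome = (response.status, response.read())
    finally:
        server.shutdown()
        server.server_close()
    return default_outcome, custom_outcome
```

Calling compare_openers() should report an HTTPError for the default opener and a plain (status, body) pair for the custom one.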
There are many situations where I know an error will occur and want to pass additional data to Sentry. I still want the exception to get raised (so the rest of my code stops) and I only want one error in Sentry.
For example, let's say I'm making an HTTP call and, in the event that the HTTP call fails, I want an error including the response text sent to Sentry:
import json
import requests
def post_json(url, payload):
    resp = requests.post(url, json=payload)
    if resp.ok:
        return resp.json()
    try:
        text = resp.json()
    except json.JSONDecodeError:
        text = resp.text
    # TODO: add `text` to Sentry error
    resp.raise_for_status()
How do I do this using the Sentry Python SDK?
Rejected solutions:
Sentry's logging: this results in two errors in Sentry (one for the log statement and one for the raised exception)
capture_exception: this results in two errors in Sentry (one for the captured exception and one for the raised exception)
Adding extra details to the exception message: this breaks Sentry's error grouping because each error has a unique name
Large or Unpredictable Data: set_context
If you need to send a lot of data or you don't know the contents of your data, the function you are looking for is Sentry's set_context. You want to call this function right before your exception gets raised. Note that context objects are limited in size to 8kb.
Note: you should only call set_context in a situation where an exception will definitely get raised, or else the extra information you set may get added to other (irrelevant) errors in Sentry.
For example:
import json
import requests
import sentry_sdk
def post_json(url, payload):
    resp = requests.post(url, json=payload)
    if resp.ok:
        return resp.json()
    try:
        text = resp.json()
    except json.JSONDecodeError:
        text = resp.text
    # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    sentry_sdk.set_context("Payload", {"text": text})
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    resp.raise_for_status()
This attaches it as additional data to your Sentry error, listed just after the breadcrumbs.
Small, Predictable Data: set_tag
If your data is small and predictable (such as the HTTP status code), you can use Sentry's set_tag. It's best to do this within a push_scope block so that the tag is only set for the area of your code that may go wrong. Note that tag keys are limited to 32 characters and tag values are limited to 200 characters.
Tags show up at the top of the view of a Sentry error.
For example:
import requests
from sentry_sdk import push_scope
def post_json(url, payload):
    resp = requests.post(url, json=payload)
    if resp.ok:
        return resp.json()
    # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    with push_scope() as scope:
        scope.set_tag("status", resp.status_code)
        resp.raise_for_status()
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
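Given the tag limits quoted above (32 characters for keys, 200 for values), it may be worth clipping values yourself before tagging rather than relying on whatever happens to over-long tags. The helper below is a rough sketch in plain Python; clip_tag and the two constants are hypothetical names, not part of the Sentry SDK.

```python
MAX_TAG_KEY_CHARS = 32     # key limit quoted in the text above
MAX_TAG_VALUE_CHARS = 200  # value limit quoted in the text above


def clip_tag(key, value):
    """Return (key, value) as strings, truncated to the quoted tag limits."""
    clipped_key = str(key)[:MAX_TAG_KEY_CHARS]
    clipped_value = str(value)[:MAX_TAG_VALUE_CHARS]
    return clipped_key, clipped_value


# Possible use inside the push_scope block (assumes a `scope` and `resp` as above):
#     scope.set_tag(*clip_tag("status", resp.status_code))
```

Anything that does not fit into a tag after clipping is probably better sent with set_context instead.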
I'm using Python 3.7 with urllib.
Everything works fine, but it does not seem to redirect automatically when it gets an HTTP redirect response (307).
This is the error I get:
ERROR 2020-06-15 10:25:06,968 HTTP Error 307: Temporary Redirect
I have to handle it with a try-except and manually send another request to the new Location. It works fine, but I don't like it.
This is the piece of code I use to perform the request:
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type','application/json; charset=utf-8')
req.data=jdati
self.logger.debug(req.headers)
self.logger.info(req.data)
resp = urllib.request.urlopen(req)
url is an HTTPS resource, and I set headers with some authorization info and the content type.
req.data is JSON.
From the urllib documentation I understood that redirects are performed automatically by the library itself, but that doesn't work for me. It always raises an HTTP 307 error and doesn't follow the redirect URL.
I've also tried to use an opener specifying the default redirect handler, but with the same result:
opener = urllib.request.build_opener(urllib.request.HTTPRedirectHandler)
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type','application/json; charset=utf-8')
req.data=jdati
resp = opener.open(req)
What could be the problem?
The reason why the redirect isn't done automatically has been correctly identified by yours truly in the discussion in the comments section. Specifically, RFC 2616, Section 10.3.8 states that:
If the 307 status code is received in response to a request other
than GET or HEAD, the user agent MUST NOT automatically redirect the
request unless it can be confirmed by the user, since this might
change the conditions under which the request was issued.
Back to the question: given that data has been assigned, get_method returns POST (as per how this method is implemented), and since the request method is POST and the response code is 307, an HTTPError is raised as per the above specification. In the context of Python's urllib, it is the redirect_request method of HTTPRedirectHandler in the urllib.request module that raises the exception.
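The effect of assigning data on the request method can be checked without sending anything over the network. This is a small sketch; the URL is a placeholder that is never contacted, and the two helper function names are made up for the illustration.

```python
import urllib.request


def methods_before_and_after_data():
    """Show how assigning a body flips the default method from GET to POST."""
    req = urllib.request.Request("http://example.com/api")  # placeholder URL
    before = req.get_method()
    req.data = b'{"key": "value"}'
    after = req.get_method()
    return before, after


def method_with_explicit_override():
    """An explicit method= argument wins over the data-based default."""
    req = urllib.request.Request(
        "http://example.com/api", data=b"{}", method="PUT"
    )
    return req.get_method()
```

So the request in the question is a POST purely because req.data was set, which is what puts it on the "do not redirect automatically" path for a 307.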
For an experiment, try the following code:
import urllib.error
import urllib.parse
import urllib.request
url = 'http://httpbin.org/status/307'
req = urllib.request.Request(url)
req.data = b'hello'  # comment out to not trigger manual redirect handling
try:
    resp = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    if e.code != 307:
        raise  # not a status code that can be handled here
    redirected_url = urllib.parse.urljoin(url, e.headers['Location'])
    resp = urllib.request.urlopen(redirected_url)
    print('Redirected -> %s' % redirected_url)  # the original redirected url
print('Response URL -> %s ' % resp.url)  # the final url
Running the code as is may produce the following:
Redirected -> http://httpbin.org/redirect/1
Response URL -> http://httpbin.org/get
Note that the subsequent redirect to /get was done automatically, as the subsequent request was a GET request. Commenting out the req.data assignment line will result in the lack of the "Redirected" output line.
Other things to note in the exception handling block: e.read() may be called to retrieve the response body produced by the server as part of the HTTP 307 response (since data was posted, there might be a short entity in the response that can be processed), and urljoin is needed because the Location header may be a relative URL (or simply have the host missing).
Also, as a matter of interest (and for linkage purposes), this specific question has been asked multiple times before, and I am rather surprised that none of them ever got any answers. They are:
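If the try/except dance from the experiment is too clunky and you do want the POST replayed, one alternative is a redirect handler that overrides redirect_request. This is only a sketch under assumptions: Follow307WithBody and build_redirecting_opener are hypothetical names, the body is assumed to be bytes that can be sent again, and it deliberately ignores the "confirmed by the user" condition in the RFC text quoted above, so it should only be used with servers you trust. Note that all headers, including Authorization, are forwarded to the new location.

```python
import urllib.request


class Follow307WithBody(urllib.request.HTTPRedirectHandler):
    """Replay the original method and body when a request with data gets a 307."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if code == 307 and req.data is not None:
            # Build the follow-up request ourselves so method and body survive.
            return urllib.request.Request(
                newurl,
                data=req.data,
                headers=req.headers,  # forwards Authorization etc. as well
                origin_req_host=req.origin_req_host,
                unverifiable=True,
                method=req.get_method(),
            )
        # Everything else keeps the standard library's behaviour.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def build_redirecting_opener():
    """Opener whose default redirect handler is replaced by the subclass above."""
    return urllib.request.build_opener(Follow307WithBody)
```

With this in place, build_redirecting_opener().open(req) would take the place of urllib.request.urlopen(req) in the question's code.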
How to handle 307 redirection using urllib2 from http to https
HTTP Error 307: Temporary Redirect in Python3 - INTRANET
HTTP Error 307 - Temporary redirect in python script
Using Python 2.7, Django on Google App Engine. I'm trying to do some simple URL checking, including checking a JSON data payload, and return a meaningful error to the user. What I have coded is basically this:
from django.core.exceptions import SuspiciousOperation
...
def check(self, request):
    json_data = json.loads(request.body)
    if not json_data:
        raise SuspiciousOperation('Required JSON data not found in the POST request.')
...
But when I test this in debug mode (DEBUG = True in settings.py) by omitting the JSON data, instead of returning an HTTP 400 as I expect from SuspiciousOperation, I get an HTTP 500 that contains my error message "Required JSON data not found in the POST request." The same thing occurs if I check for a valid URL with URLValidator(): I can correctly test for a good or bad URL with URLValidator(), but if I try to raise a custom message on a bad URL with SuspiciousOperation I get HTTP 500 instead of 400.
How can I return a meaningful error to my caller without a server error obscuring everything (and crashing the process along the way) once DEBUG is turned back off? Is SuspiciousOperation not supported by GAE?
There was an issue raised about this on Django's bug tracker and it looks like it was fixed in 1.6 but not backported. Indeed, SuspiciousOperation is handled by a catch-all in 1.5.11 (django/django/core/handlers/base.py line 173):
except: # Handle everything else, including SuspiciousOperation, etc.
    # Get the exception info now, in case another exception is thrown later.
    signals.got_request_exception.send(sender=self.__class__, request=request)
    response = self.handle_uncaught_exception(request, resolver, sys.exc_info())
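Since the catch-all above turns the exception into a 500 on that Django version, one way around it is to not raise at all and build the 400 response yourself. The validation part can be sketched independently of Django; parse_json_body is a hypothetical helper, and the second error message is invented for the illustration. Note that json.loads itself raises on an empty body, which the original check never accounted for.

```python
import json

MISSING_JSON = "Required JSON data not found in the POST request."


def parse_json_body(body):
    """Return (data, error_message); exactly one of the two is None."""
    if not body:
        return None, MISSING_JSON
    try:
        data = json.loads(body)
    except ValueError:  # JSON decode errors are ValueError subclasses
        return None, "Request body is not valid JSON."
    if not data:
        return None, MISSING_JSON
    return data, None
```

In the view, a non-None error message could then be returned with django.http.HttpResponseBadRequest(error) instead of raising SuspiciousOperation.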
My problem is about web redirects. I'm using urllib's getcode() to find out which status code is returned.
Here is my code:
import urllib
a = urllib.urlopen("http://www.site.com/incorrect-tDirectory")
a.getcode()
a.getcode() returns 200, but the request is actually redirected to the main page. The references I've checked say that a redirect should return, as far as I remember, 300 or 301, not 200. Hopefully you get what I mean.
So my question is: how do I catch the redirection?
urllib2.urlopen() doc page says:
This function returns a file-like object with two additional methods:
geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
info() — return the meta-information of the page, such as headers, in the form of an mimetools.Message instance (see Quick Reference to HTTP Headers)
urllib.urlopen() actually implements geturl(), too, but it's not put as explicitly in the documentation.
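Comparing geturl() with the requested URL tells you that a redirect happened, but not which status codes were involved. If you need those, a redirect handler that records each hop is one option. The sketch below uses Python 3's urllib.request rather than the Python 2 modules from the question; RecordingRedirectHandler and open_and_report are hypothetical names.

```python
import urllib.request


class RecordingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow redirects as usual, but remember every (status code, target URL)."""

    def __init__(self):
        self.hops = []

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        self.hops.append((code, newurl))
        return super().redirect_request(req, fp, code, msg, headers, newurl)


def open_and_report(url):
    """Return (final status, final URL, recorded hops) for a URL."""
    recorder = RecordingRedirectHandler()
    opener = urllib.request.build_opener(recorder)
    with opener.open(url) as response:
        return response.status, response.geturl(), recorder.hops
```

For the case in the question, the final status would still be 200, while the recorded hops would show the 301 or 302 that led to the main page.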
According to the urllib2 documentation,
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
And yet the following code
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
raises an HTTPError with code 201 (created):
ERROR 2011-08-11 20:40:17,318 __init__.py:463] HTTP Error 201: Created
So why is urllib2 throwing HTTPErrors on this successful request?
It's not too much of a pain; I can easily extend the code to:
try:
    request = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    if e.code == 201:
        pass  # success! :)
    else:
        pass  # fail! :(
else:
    pass  # when will this happen...?
But this doesn't seem like the intended behavior, based on the documentation and the fact that I can't find similar questions about this odd behavior.
Also, what should the else block be expecting? If successful status codes are all interpreted as HTTPErrors, then when does urllib2.urlopen() just return a normal file-like response object like all the urllib2 documentation refers to?
You can write a custom Handler class for use with urllib2 to prevent specific error codes from being raised as HTTPError. Here's one I've used before:
class BetterHTTPErrorProcessor(urllib2.BaseHandler):
    # a substitute/supplement to urllib2.HTTPErrorProcessor
    # that doesn't raise exceptions on status codes 201,204,206
    def http_error_201(self, request, response, code, msg, hdrs):
        return response
    def http_error_204(self, request, response, code, msg, hdrs):
        return response
    def http_error_206(self, request, response, code, msg, hdrs):
        return response
Then you can use it like:
opener = urllib2.build_opener(BetterHTTPErrorProcessor)
urllib2.install_opener(opener)
req = urllib2.Request(url, data, headers)
urllib2.urlopen(req)
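The same per-code idea carries over to Python 3's urllib.request. There, HTTPErrorProcessor already treats the whole 200-299 range as success, so the technique is mainly useful for other codes such as 404. The factory below is a sketch; make_tolerant_handler and dispatch are hypothetical names, and dispatch only feeds a fake response through the opener's error chain so the behaviour can be checked without a network connection.

```python
import io
import urllib.error
import urllib.request


def make_tolerant_handler(codes):
    """Build a handler class with one http_error_<code> method per given code."""

    def passthrough(self, request, response, code, msg, hdrs):
        return response  # hand the response back instead of raising

    methods = {"http_error_%d" % code: passthrough for code in codes}
    return type("TolerantErrorHandler", (urllib.request.BaseHandler,), methods)


opener = urllib.request.build_opener(make_tolerant_handler([404, 410]))


def dispatch(opener, code):
    """Push a fake response with the given status through the error chain."""
    req = urllib.request.Request("http://example.test/resource")  # never contacted
    fake_response = io.BytesIO(b"")
    try:
        opener.error("http", req, fake_response, code, "status %d" % code, {})
        return "returned"
    except urllib.error.HTTPError:
        return "raised"
```

Codes that have no matching http_error_<code> method still fall through to HTTPDefaultErrorHandler and raise as before.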
As the actual library documentation mentions:
For 200 error codes, the response object is returned immediately.
For non-200 error codes, this simply passes the job on to the protocol_error_code handler methods, via OpenerDirector.error(). Eventually, urllib2.HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.
http://docs.python.org/library/urllib2.html#httperrorprocessor-objects
I personally think making this the default behavior was a mistake and very unintuitive.
It's true that non-2XX codes imply a protocol-level error, but turning that into an exception goes too far (in my opinion at least).
In any case, I think the most elegant way to avoid this is:
import urllib.request
opener = urllib.request.build_opener()
for processor in opener.process_response['https']:  # or http, depending on what you're using
    if isinstance(processor, urllib.request.HTTPErrorProcessor):  # HTTPErrorProcessor also for https
        opener.process_response['https'].remove(processor)
        break  # there's only one such handler by default
response = opener.open('https://www.google.com')
Now you have the response object. You can check its status code, headers, body, etc.