Python requests returns 200 instead of 301

url = "https://www.avito.ma/fr/2_mars/sacs_et_accessoires/Ch%C3%A2les_en_Vrai_Soie_Chanel_avec_boite_38445885.htm"
try
r = requests.get(url,headers={'User-Agent': ua.random},timeout=timeout) # execute a timed website request
if r.status_code > 299: # check for bad status
r.raise_for_status() # if confirmed raise bad status
else:
print(r.status_code, url) # otherwise print status code and url
except Exception as e:
print('\nThe following exception: {0}, \nhas been found found on the following post: "{1}".\n'.format(e,url))
Expected status = 301 Moved Permanently
You can visit the page or check the url on http://www.redirect-checker.org/index.php to see the correct redirect chain.
Returned status = 200 OK
The page has been moved, so it should return the 301 Moved Permanently above; however, it returns a 200. I read the requests docs and checked all the parameters (allow_redirects=False etc.), but I don't think it's a configuration mistake.
I am puzzled as to why requests wouldn't see the redirect.
Any ideas?
Thank you in advance.

The Python requests module has the allow_redirects parameter set to True by default. I've tested it with False and it gives the 301 code that you're looking for.
Note, after reading your comment above: r.history saves the response for each redirect hop before the final one, whose status is in r.status_code (only if you leave the parameter set to True).
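A minimal sketch of both behaviors, using http://httpbin.org/status/301 as an assumed stand-in for any redirecting URL:
import requests

url = 'http://httpbin.org/status/301'  # assumed stand-in for any redirecting URL

# With redirects disabled, the original 301 is visible directly.
r = requests.get(url, allow_redirects=False)
print(r.status_code)  # 301

# With the default allow_redirects=True, the final response is returned
# and each earlier hop is kept in r.history.
r = requests.get(url)
print(r.status_code)                       # 200, the final response
print([h.status_code for h in r.history])  # each hop before it, e.g. [301, 302]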


How to try accessing webpage again after http errors in python for loop

Example of 522 error when I go to the webpage manually
Example of 525 error when I go to the webpage manually
Example of 504 error when I go to the webpage manually
I am running the following for loop, which goes through a dictionary of subreddits (keys) and urls (values). Each url produces a dictionary with all posts from 2022 of a given subreddit. Sometimes the for loop stops and produces an 'http error 525' or other errors.
I'm wondering how I can check for these errors when reading the url and then retry until the error goes away before moving on to the next subreddit.
import urllib.request

for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    page = urllib.request.urlopen(url).read()
    dict_last_posts[subredd] = page
I haven't been able to figure it out.
You can put this code in a try/except block, like this:
import urllib.error
import urllib.request

for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    while True:
        try:
            page = urllib.request.urlopen(url).read()
            dict_last_posts[subredd] = page
            break  # exit the while loop if the request succeeded
        except urllib.error.HTTPError as e:
            if e.code in (504, 522, 525):
                print("Encountered HTTP error while reading URL. Retrying...")
            else:
                raise  # re-raise the exception if it's a different error
This code will catch any HTTPError that occurs while reading the URL and check whether the error code is 504, 522, or 525. If it is, it will print a message and try reading the URL again. If it's a different error, it will re-raise the exception so that you can handle it appropriately.
NOTE: This code will retry reading the URL indefinitely until it succeeds or a different error occurs. You may want to add a counter or a timeout to prevent the loop from going on forever in case the error persists, as sketched below.
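A minimal sketch of such a bounded loop, assuming the same dictionaries as above; max_retries is a hypothetical limit you can tune:
import urllib.error
import urllib.request

max_retries = 5  # hypothetical cap; tune to taste
for subredd, url in dict_last_subreddit_posts.items():
    for attempt in range(max_retries):
        try:
            dict_last_posts[subredd] = urllib.request.urlopen(url).read()
            break  # success: stop retrying
        except urllib.error.HTTPError as e:
            if e.code not in (504, 522, 525):
                raise  # a different error: re-raise it
            print("HTTP {}, retry {}/{}".format(e.code, attempt + 1, max_retries))
    else:
        print("Giving up on {} after {} retries".format(subredd, max_retries))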
It's unwise to retry a request indefinitely. Set a limit, even if it's very high, but don't set it so high that it causes you to be rate limited (HTTP status 429). The backoff_factor will also have an impact on rate limiting.
Use the requests package for this. It makes it very easy to set a custom adapter for all of your requests via Session, and it includes Retry from urllib3, which encapsulates retry behavior in an object you can pass to your adapter.
import requests
from requests.adapters import HTTPAdapter, Retry

s = requests.Session()
retries = Retry(
    total=5,
    backoff_factor=0.1,
    status_forcelist=[504, 522, 525]
)
s.mount('https://', HTTPAdapter(max_retries=retries))

for subredd, url in dict_last_subreddit_posts.items():
    response = s.get(url)
    dict_last_posts[subredd] = response.content
You can play around with total (maximum number of retries) and backoff_factor (adjusts wait time between retries) to get the behavior you want.
Try something like this:
for subredd, url in dict_last_subreddit_posts.items():
    print(subredd)
    # note: urlopen raises HTTPError for 4xx/5xx statuses, so this check
    # only sees responses that did not raise
    http_response = urllib.request.urlopen(url)
    while http_response.status != 200:
        if http_response.status == 503:
            http_response = urllib.request.urlopen(url)
        elif http_response.status == 523:
            pass  # enter code here
        else:
            pass  # enter code here
    dict_last_posts[subredd] = http_response.read()
But Michael Ruth's answer is better.

HTTP Error 307: Temporary Redirect in Python3 - INTRANET [duplicate]

I'm using Python 3.7 with urllib.
Everything works fine, but it seems not to automatically redirect when it gets an HTTP redirect response (307).
This is the error I get:
ERROR 2020-06-15 10:25:06,968 HTTP Error 307: Temporary Redirect
I have to handle it with a try/except and manually send another request to the new Location: it works fine, but I don't like it.
This is the piece of code I use to perform the request:
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type', 'application/json; charset=utf-8')
req.data = jdati
self.logger.debug(req.headers)
self.logger.info(req.data)
resp = urllib.request.urlopen(req)
url is an https resource, and I set a header with some Authorization info and the content type.
req.data is a JSON payload.
From the urllib documentation I understood that redirects are performed automatically by the library itself, but it doesn't work for me. It always raises an HTTP 307 error and doesn't follow the redirect URL.
I've also tried to use an opener specifying the default redirect handler, but with the same result:
opener = urllib.request.build_opener(urllib.request.HTTPRedirectHandler)
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type', 'application/json; charset=utf-8')
req.data = jdati
resp = opener.open(req)
What could be the problem?
The reason why the redirect isn't done automatically has been correctly identified by yours truly in the discussion in the comments section. Specifically, RFC 2616, Section 10.3.8 states that:
If the 307 status code is received in response to a request other
than GET or HEAD, the user agent MUST NOT automatically redirect the
request unless it can be confirmed by the user, since this might
change the conditions under which the request was issued.
Back to the question: given that data has been assigned, get_method automatically returns POST (as per how that method is implemented), and since the request method is POST and the response code is 307, an HTTPError is raised instead, as per the specification above. In the context of Python's urllib, this specific section of the urllib.request module raises the exception.
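A quick illustration of that behavior (not from the original answer): assigning data flips the default method from GET to POST:
import urllib.request

req = urllib.request.Request('http://httpbin.org/status/307')
print(req.get_method())  # GET

req.data = b'hello'
print(req.get_method())  # POST, because data is no longer None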
For an experiment, try the following code:
import urllib.request
import urllib.parse

url = 'http://httpbin.org/status/307'
req = urllib.request.Request(url)
req.data = b'hello'  # comment out to not trigger manual redirect handling

try:
    resp = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    if e.status != 307:
        raise  # not a status code that can be handled here
    redirected_url = urllib.parse.urljoin(url, e.headers['Location'])
    resp = urllib.request.urlopen(redirected_url)
    print('Redirected -> %s' % redirected_url)  # the original redirected url

print('Response URL -> %s' % resp.url)  # the final url
Running the code as-is may produce the following:
Redirected -> http://httpbin.org/redirect/1
Response URL -> http://httpbin.org/get
Note that the subsequent redirect to get was done automatically, since the subsequent request was a GET request. Commenting out the req.data assignment line will remove the "Redirected" output line.
Other notable things in the exception handling block: e.read() may be called to retrieve the response body produced by the server as part of the HTTP 307 response (since data was posted, there might be a short entity in the response worth processing), and urljoin is needed because the Location header may be a relative URL (or simply missing the host) for the subsequent resource.
Also, as a matter of interest (and for linkage purposes), this specific question has been asked multiple times before, and I am rather surprised that those questions never got any answers:
How to handle 307 redirection using urllib2 from http to https
HTTP Error 307: Temporary Redirect in Python3 - INTRANET
HTTP Error 307 - Temporary redirect in python script

python requests GET returning HTTP 204

I cannot wrap my brain around this issue:
When I run this code in my IDE (PyCharm) or via the command line, I get a 204 HTTP response and no content. When I set breakpoints in my debugger to see what is happening, the code executes fine: r.content and r.text are populated with the results from the request, and r.status_code is 200.
code:
r = requests.post(self.dispatchurl, verify=False, auth=HTTPBasicAuth(self.user, self.passwd))
print 'first request to get sid: status {}'.format(r.status_code)
json_data = json.loads(r.text)
self.sid = json_data['sid']
print 'the sid is: {}'.format(self.sid)
self.getresulturl = '{}/services/search/jobs/{}/results{}'.format(self.url, self.sid, self.outputmode)
x = requests.get(self.getresulturl, verify=False, auth=HTTPBasicAuth(self.user, self.passwd))
print 'second request to get the data: status {}'.format(x.status_code)
print 'content: {}'.format(x.text)
output when run through debugger:
first request to get sid: status 201
the sid is: sanitizedatahere
second request to get the data: status 200
content: {"preview":false...}
Process finished with exit code 0
When I execute the code normally, without the debugger, I get a 204 on the second response.
output:
first request to get sid: status 201
the sid is: sanitizedatahere
second request to get the data: status 204
content:
Process finished with exit code 0
I am guessing this has something to do with the debugger slowing down the requests and allowing the server to respond with the data? This seems like a race condition. I've never run into this with requests.
Is there something I am doing wrong? I'm at a loss. Thanks in advance for looking.
Solved by adding this loop:
import time

while r.status_code == 204:
    time.sleep(1)
    r = requests.get(self.resulturl, verify=False, auth=HTTPBasicAuth(self.user, self.passwd))
As I suspected, the REST API was taking longer to collect results, hence the 204. Running under the debugger slowed the process down long enough for the API to complete the initial request, thus giving a 200.
The HTTP 204 No Content success status response code indicates that the request has succeeded, but that the client doesn't need to go away from its current page. A 204 response is cacheable by default.
The settings below would solve the issue.
r = requests.get(splunk_end, headers=headers, verify=False)
while r.status_code == 204:
    time.sleep(1)
    r = requests.get(splunk_end, headers=headers, verify=False)
With this retry loop the 204 response eventually turns into a 200. Please check the logs below.
https://localhost:8089/services/search/jobs/4D44-A45E-7BDB8F0BE473/results?output_mode=json
/usr/lib/python2.7/site-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
/usr/lib/python2.7/site-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
<Response [204]>
/usr/lib/python2.7/site-packages/botocore/vendored/requests/packages/urllib3/connectionpool.py:768: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
<Response [200]>
Thanks
In this case, after the SID is generated, the code directly tries to fetch results while the job's status response is 200 but its dispatchState is not yet DONE, so the results request returns a 204.
We can keep checking the status of the job (waiting for <s:key name="dispatchState">DONE</s:key>) by filtering the status response. Once dispatchState shows DONE, go and check the results; the response code will then directly be 200, as sketched below.
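A hedged sketch of that polling loop, using hypothetical base_url, sid, user, and passwd variables in place of the question's self.* attributes:
import time
import requests
from requests.auth import HTTPBasicAuth

status_url = '{}/services/search/jobs/{}'.format(base_url, sid)  # hypothetical variables
while True:
    status = requests.get(status_url, verify=False, auth=HTTPBasicAuth(user, passwd))
    if '<s:key name="dispatchState">DONE</s:key>' in status.text:
        break  # the job finished; the results endpoint should now return 200
    time.sleep(1)

results = requests.get(status_url + '/results?output_mode=json', verify=False,
                       auth=HTTPBasicAuth(user, passwd))
print(results.status_code)  # expected to be 200 once dispatchState is DONE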

I want to test a single web page's response status across many requests (to find out whether there are 404 or 5XX requests for this web page)

I am new to python. Can anyone tell me which python tools I should use to get my work done? Any good ideas for building a python script to automatically find these 404 and 5XX requests? Thanks in advance!
We can check the response status code:
>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
Requests also comes with a built-in status code lookup object for easy reference:
>>> r.status_code == requests.codes.ok
True
If we made a bad request (a 4XX client error or 5XX server error response), we can raise it with Response.raise_for_status():
>>> bad_r = requests.get('http://httpbin.org/status/404')
>>> bad_r.status_code
404
But, since our status_code for r was 200, when we call raise_for_status() we get:
>>> r.raise_for_status()
None
Refer: this link
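As a starting point, here is a minimal sketch of such a checker; the httpbin URLs are placeholders for the requests actually made by your page:
import requests

urls = [
    'http://httpbin.org/status/200',  # placeholder URLs; substitute the
    'http://httpbin.org/status/404',  # requests actually made by your page
    'http://httpbin.org/status/503',
]

for u in urls:
    try:
        r = requests.get(u, timeout=10)
    except requests.RequestException as e:
        print(u, 'request failed:', e)
        continue
    if r.status_code == 404 or r.status_code >= 500:
        print(u, '->', r.status_code)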

Does requests.codes.ok include a 304?

I have a program which uses the requests module to send a get request which (correctly) responds with a 304 "Not Modified". After making the request, I check to make sure response.status_code == requests.codes.ok, but this check fails. Does requests not consider a 304 as "ok"?
There is a property called ok in the Response object that returns True if the status code is not a 4xx or a 5xx.
So you could do the following:
if response.ok:
    # 304 is included
    ...
The code of this property is pretty simple:
@property
def ok(self):
    try:
        self.raise_for_status()
    except HTTPError:
        return False
    return True
You can check the actual codes in the source; requests.codes.ok means 200 only.
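A quick check of that claim (using http://httpbin.org/status/304 as an assumed test endpoint):
import requests

print(requests.codes.ok)            # 200
print(requests.codes.not_modified)  # 304

r = requests.get('http://httpbin.org/status/304')
print(r.status_code == requests.codes.ok)  # False: codes.ok is exactly 200
print(r.ok)                                # True: 304 is below 400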
You can check the implementation of requests.status_codes in the source code. The implementation allows you to access any status code by name, as follows:
import sys
import traceback

import requests

url = "https://google.com"
req = requests.get(url)
try:
    if req.status_code == requests.codes['ok']:  # check the source code for all the codes
        print('200')
    elif req.status_code == requests.codes['not_modified']:  # 304
        print("304")
    elif req.status_code == requests.codes['not_found']:  # 404
        print("404")
    else:
        print("None of the codes")
except Exception:
    traceback.print_exc(file=sys.stdout)
In conclusion, you can check any kind of status code as demonstrated. I am sure there are better ways, but this worked for me.
.ok "..If the status code is between 200 and 400, this will return True."
mentioned in source code as:
"""Returns True if :attr:status_code is less than 400, False if not.
This attribute checks if the status code of the response is between
400 and 600 to see if there was a client error or a server error. If
the status code is between 200 and 400, this will return True. This
is not a check to see if the response code is 200 OK.
"""
