HTTPError: Moved Permanently raised by urllib2.urlopen() - python

I am trying to send a PATCH request to a database using the urllib and urllib2 libraries in Python 2.7 (I cannot use the requests library because it does not work on this server and nobody has found a solution for that, so please do not suggest requests; that path is already closed).
The code looks like this:
data={"name":"whatever name"}
data=urllib.urlencode(data,'utf-8')#Encoding the dictionary of the data to make the request
req=urllib2.Request(url=next_url,headers={"Authorization": auth_header,"Content-Type": "application/json"})#Creating a request object of urllib library
req.add_data=data
req.get_method = lambda: 'PATCH'
resp = urllib2.urlopen(req)
If I don't assign both req.get_method = lambda: 'PATCH' and req.add_data = data, the Request class automatically sends a GET request, which gets a 200 response, so I guess it does not have to do with the authorization and credentials. Using Python 3 and the urllib.request library works as well, so the server definitely accepts PATCH requests.
I hope that somebody can find the solution... I cannot picture why this is happening.
Update (SOLVED): I found the problem to be related to the URL I was making the request to.

The "Moved Permanently" error would indicate that the server responded with a HTTP 301 error, meaning that the URL you are requesting has been moved to another URL (https://en.wikipedia.org/wiki/HTTP_301).
I would suggest to take a network traffic capture with tools like tcpdump or wireshark, to check the HTTP conversation and confirm . If the server is actually replying with a 301 and this is not urllib raising a wrong error code, the server response should include a "Location" header with another URL, and you should try this one instead.
Note that urllib has problems when dealing with redirects., so you might want to reconsider trying to make the "requests" module work instead.
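For what it's worth, a minimal sketch of how you could inspect that Location header with urllib2 (it reuses the req object from the snippet above and is untested against your server):

import urllib2

try:
    resp = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # The 301 raised as an exception still carries the response headers,
    # so we can at least see where the server wants the request to go.
    if e.code == 301:
        print('Redirected to: %s' % e.hdrs.getheader('Location'))
    raise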

Related

Python requests: how to handle status code 304

I'm trying to use requests and bs4 to get info from a website, but am receiving status code 304 and no content from requests.get(). I've done some reading and understand that this code indicates the resource is already in my cache. How can I either access the resource from my cache, or preferably, clear my cache so that I can receive the resource fresh?
I've tried adding the following header: headers={'Cache-Control': 'no-cache'} to requests.get() but still have the same issue.
Additionally I've looked into the requests-cache module, but am unclear on how or if this could be used to solve the problem.
code:
import requests
r = requests.get('https://smsreceivefree.com/')
print(r.status_code)
print(r.content)
output:
304
b''
A server should send a 304 Not Modified reply if the client sent a conditional request, such as one with an If-Modified-Since header. This makes sense if the client already has a cached version of the page and wants to avoid downloading content it already has the newest version of.
In this case, the website appears to send a 304 to certain kinds of clients: ones whose User-Agent indicates automation (which is true in your case).
The server should really send a 4xx error code, probably a 403 Forbidden, but likely uses a 304 in order to throw bot writers off the right track and make them come to StackOverflow.
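Based on that guess, sending a browser-like User-Agent may change the response. A small, untested sketch (the User-Agent string below is only an example value):

import requests

# Present a browser-like User-Agent so the site does not classify the
# client as automation; the exact string is only illustrative.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0 Safari/537.36'),
}
r = requests.get('https://smsreceivefree.com/', headers=headers)
print(r.status_code)   # hopefully 200 instead of 304
print(len(r.content))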

Python 2.7 urllib2 raising urllib2.HTTPError 301 when hitting redirect with xml content

I'm using urllib2 to request a particular S3 bucket at hxxp://s3.amazonaws.com/mybucket. Amazon sends back an HTTP code of 301 along with some XML data (the redirect being to hxxp://mybucket.s3.amazonaws.com/). Instead of following the redirect, python raises urllib2.HTTPError: HTTP Error 301: Moved Permanently.
According to the official Python docs at HOWTO Fetch Internet Resources Using urllib2, "the default handlers handle redirects (codes in the 300 range)".
Is python handling this incorrectly (presumably because of the unexpected XML in the response), or am I doing something wrong? I've watched in Wireshark and the response comes back exactly the same to python's request as it does to me using a web client. In debugging, I don't see the XML being captured anywhere in the response object.
Thanks for any guidance.
Edit: Sorry for not posting the code initially. It's nothing special, literally just this -
import urllib2, httplib
request = urllib2.Request(site)
response = urllib2.urlopen(request)
You are better off using the requests library. requests handles redirects by default: http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history
import requests
response = requests.get(site)
print(response.content)
I can't explain the problem with urllib2; I tried to look into the documentation at https://docs.python.org/2/library/urllib2.html but it doesn't look intuitive.
It seems that in Python 3 they refactored it to make it less of a burden to use, but I am still convinced that requests is the way to go.
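You can also confirm that the redirect was followed via the response's history (an illustrative sketch; the bucket URL is the placeholder from the question):

import requests

response = requests.get('http://s3.amazonaws.com/mybucket')
# requests follows the 301 automatically; the redirect chain is kept in history
print(response.url)                                # final URL after redirects
print([r.status_code for r in response.history])   # e.g. [301]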
Note: The urllib2 module has been split across several modules in Python 3, named urllib.request and urllib.error. The 2to3 tool will automatically adapt imports when converting your sources to Python 3.

Website Scraping In Python Server Detection

I have come across an interesting phenomenon when scraping a particular site on the web. The issue arises when using Python's urllib2 (Python 2.7). For example:
import urllib2

LINK = "http://www.sample.com/article/1"
HEADERS = {'User-Agent': 'Mozilla5.0/...'}  # ellipsis for brevity

req = urllib2.Request(LINK, data=None, headers=HEADERS)
resp = urllib2.urlopen(req, timeout=20).read()
Here are the strange outcomes:
1) When a valid user agent is passed in the request headers, the server returns a status of 200 and a page saying there was an issue processing the request (invalid HTML). So I get a successful response from the server, but with corrupted data.
2) When an invalid user agent is passed (empty headers {}), the server will time out. However, if the timeout is set to a large value (20 seconds in this example), the server will eventually return the valid data, just slowly.
The problem only arises when there have been no prior requests, so I believe the server may be expecting a certain cookie from the request before it serves valid data. Does anyone have insight into why this is happening?
You can't know what's going on on the server side.
You can only keep experimenting and guess how it works on the other side. If this is a user-agent-only issue, just keep sending it (and maybe change it once in a while) for ALL of your requests, including the first one; see the sketch below.
Also, I would open the Chrome dev tools in a new (incognito) session and record all the actions you're doing; this way you can see the structure of the requests being made by a real browser.
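Here is a rough sketch of keeping the same User-Agent (and any cookies the server sets) across all requests by reusing one urllib2 opener; the warm-up URL is an assumption on my part, and the UA string is the placeholder from your snippet:

import cookielib
import urllib2

# One opener for the whole session: cookies set by the server are replayed,
# and the same User-Agent goes out on every request, including the first one.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
opener.addheaders = [('User-Agent', 'Mozilla5.0/...')]  # same placeholder UA as above

warmup = opener.open("http://www.sample.com/", timeout=20).read()        # first request, may set cookies
article = opener.open("http://www.sample.com/article/1", timeout=20).read()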

Is there a reason why http response is different from the site and code output?

I get a http response using Python by executing the following code:
import httplib
import json

conn = httplib.HTTPSConnection('westus.api.cognitive.microsoft.com')
conn.request("GET", "/something/something")
response = conn.getresponse()
#data = response.read()
data = json.load(response)
print(data)
The results show a list of API results.
But they are different from the results I get when I manually access westus.api.cognitive.microsoft.com/something/something in a browser.
Can somebody tell me what is wrong here?
There are a lot of things that are different between a request made from a script and one made from a browser. For one, your script will not execute any JavaScript associated with the page. Second, the headers of your HTTP request include details about the requesting client.
For example for a REST interface, a server may return the plainest JSON to an application request and return a formatted page for a browser request.
In Chrome, you can open the developer tools via "..." -> "More Tools" -> "Developer Tools", and with that open, you can inspect each of your requests and see its headers.
There is a similar function in Firefox for looking at headers: click the hamburger menu -> "Developer" -> "Web Console". Under "Net" you can filter for requests; click a request to see the details.
For POST commands, also look at the body of the request.
Finally, last week I was trying to automate a POST command in Java and had some difficulty doing so. A colleague was able to make the call with the curl command, and that gave me enough clues about the critical parameters. So I recommend trying curl, which can help distinguish critical parameters from accidental ones, or at least let you look at the problem from another angle.
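If you want to test whether the request headers are what changes the answer, one illustrative experiment in Python (the header values are assumptions you would replace with the real ones copied from the dev tools, and the path is the placeholder from the question):

import httplib

# Re-send the same GET, but with browser-like headers copied from dev tools,
# to see whether the server varies its response based on them.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (...)',               # copy the real value from dev tools
    'Accept': 'text/html,application/json;q=0.9',    # example Accept header
}
conn = httplib.HTTPSConnection('westus.api.cognitive.microsoft.com')
conn.request("GET", "/something/something", headers=browser_headers)
response = conn.getresponse()
print(response.status, response.reason)
print(response.read()[:500])   # compare with what the browser shows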

Client Digest Authentication Python with URLLIB2 will not remember Authorization Header Information

I am trying to use Python to write a client that connects to a custom HTTP server that uses digest authentication. I can connect and pull the first request without a problem. Using tcpdump (I am on Mac OS X; I am both a Mac and a Python noob) I can see that the first request is actually two HTTP requests, as you would expect if you are familiar with RFC 2617. The first results in a 401 UNAUTHORIZED. The header information sent back by the server is correctly used to generate headers for a second request with some custom Authorization header values, which yields a 200 OK response and the payload.
Everything is great. My HTTPDigestAuthHandler opener is working, thanks to urllib2.
In the same program I attempt to request a second, different page, from the same server. I expect, per the RFC, that the TCPDUMP will show only one request this time, using almost all the same Authorization Header information (nc should increment).
Instead it starts from scratch and first gets the 401 and regenerates the information needed for a 200.
Is it possible with urllib2 to have subsequent requests with digest authentication recycle the known Authorization Header values and only do one request?
[Re-read that a couple times until it makes sense, I am not sure how to make it any more plain]
Google has yielded surprisingly little, so I guess not. I looked at the code for urllib2.py and it's really messy (comments like: "This isn't a fabulous effort"), so I wouldn't be shocked if this was a bug. I noticed that my Connection header is set to close, and even if I set it to keep-alive, it gets overwritten. That led me to keepalive.py, but that didn't work for me either.
Pycurl won't work either.
I can hand-code the entire interaction, but I would like to piggyback on existing libraries where possible.
In summary, is it possible with urllib2 and digest authentication to get 2 pages from the same server with only 3 HTTP requests executed (2 for the first page, 1 for the second)?
If you happen to have tried this before and already know it's not possible, please let me know. If you have an alternative, I am all ears.
Thanks in advance.
Although it's not available out of the box, urllib2 is flexible enough to add it yourself. Subclass HTTPDigestAuthHandler, hack it (the retry_http_digest_auth method, I think) to remember the authentication information, and define an http_request(self, request) method that uses it for all subsequent requests (adding the Authorization header pre-emptively).
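A rough, untested sketch of that idea (everything other than the urllib2 names is my own invention, and it assumes the server keeps accepting the cached nonce):

import urllib2

class PersistentDigestAuthHandler(urllib2.HTTPDigestAuthHandler):
    """Cache the digest challenge from the first 401 and pre-emptively
    add an Authorization header to later requests. Sketch only."""

    def __init__(self, *args, **kwargs):
        urllib2.HTTPDigestAuthHandler.__init__(self, *args, **kwargs)
        self._cached_challenge = None

    def retry_http_digest_auth(self, req, auth):
        # Remember the parsed challenge so it can be reused later.
        token, challenge = auth.split(' ', 1)
        self._cached_challenge = urllib2.parse_keqv_list(
            urllib2.parse_http_list(challenge))
        return urllib2.HTTPDigestAuthHandler.retry_http_digest_auth(
            self, req, auth)

    def http_request(self, req):
        # With a cached challenge, send credentials up front instead of
        # waiting for another 401 round trip.
        if self._cached_challenge and not req.has_header('Authorization'):
            auth = self.get_authorization(req, self._cached_challenge)
            if auth:
                req.add_unredirected_header('Authorization', 'Digest %s' % auth)
        return req

    https_request = http_request

# Placeholder realm, URL, and credentials -- substitute your own.
handler = PersistentDigestAuthHandler()
handler.add_password('realm', 'http://example.com/', 'user', 'password')
opener = urllib2.build_opener(handler)
page1 = opener.open('http://example.com/page1').read()   # 401 + 200
page2 = opener.open('http://example.com/page2').read()   # ideally a single 200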
