Python http.client module gives inconsistent results compared to requests

I'm getting the following output:
301 Moved Permanently --- when using http.client
200 --- when using requests
The URL "http://i.imgur.com/fyxDric.jpg" is passed in as a command-line argument.
What I expect is a 200 OK response in both cases.
This is the relevant part of the code:
import http.client
from urllib.parse import urlparse

if scheme == 'http':
    print('Running over http')
    conn = http.client.HTTPConnection("i.imgur.com")
    conn.request("GET", urlparse(url).path)
    conn_resp = conn.getresponse()
    body = conn_resp.read()
    print(conn_resp.status, conn_resp.reason, body)
When using requests:
headers = {'User-Agent': 'Mozilla/5.0 Chrome/54.0.2840.71 Safari/537.36'}
response = requests.get(url, headers=headers, allow_redirects=False)
print(response.status_code)

You are trying to hit imgur over http, but imgur redirects all of its requests to https. That redirect is the 301 you are seeing.
The http.client module doesn't handle redirects for you; you have to follow them yourself, whereas the requests module follows redirects automatically.

The documentation for http.client says in its first sentence that it "is normally not used directly." Unlike requests, it doesn't act on the 301 response and follow the redirect in the Location header. It simply returns the 301, which you would have to process yourself.
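If you want to stay with http.client, a minimal sketch of handling the redirect by hand might look like this (it assumes the 301's Location header points at the same image over https, which is what imgur does):

import http.client
from urllib.parse import urlparse

url = "http://i.imgur.com/fyxDric.jpg"
conn = http.client.HTTPConnection("i.imgur.com")
conn.request("GET", urlparse(url).path)
resp = conn.getresponse()

if resp.status in (301, 302, 303, 307, 308):
    # follow one level of redirect manually via the Location header
    target = urlparse(resp.getheader("Location"))
    resp.read()  # drain the first response before opening a new connection
    conn = http.client.HTTPSConnection(target.netloc)
    conn.request("GET", target.path)
    resp = conn.getresponse()

print(resp.status, resp.reason)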

Related

python-requests gives me a different response from what I see in the browser. Why?

I want to get data from this site.
When I request the main URL, I get an HTML file that contains the page structure but not the values.
import requests
from bs4 import BeautifulSoup

url = 'http://option.ime.co.ir/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
I found out that the site gets its values from:
url1 = 'http://option.ime.co.ir/GetTime'
url2 = 'http://option.ime.co.ir/GetMarketData'
When I open those URLs in the browser, I see a JSON response and a time in a specific format, but when I use requests to fetch them, I get the same HTML as from the main URL.
Do you know what the reason is? How can I get the responses that I see in the browser?
I checked the headers for all the URLs and didn't find anything special that I should send with my request.
You have to provide the proper HTTP headers in the request. In my case, I was able to make it work using the following headers. Note that in my testing the HTTP response was a 200 OK rather than a redirect to the root website (as when no HTTP headers were provided in the request).
Raw HTTP Request:
GET http://option.ime.co.ir/GetTime HTTP/1.1
Host: option.ime.co.ir
Referer: http://option.ime.co.ir/
Accept: application/json, text/plain, */*
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0
This should give you the proper JSON response you need.
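For reference, here is a sketch of the same request made with requests, using the headers above (endpoint and header values are taken from the question; there is no guarantee the site still behaves this way):

import requests

headers = {
    'Referer': 'http://option.ime.co.ir/',
    'Accept': 'application/json, text/plain, */*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:45.0) Gecko/20100101 Firefox/45.0',
}
r = requests.get('http://option.ime.co.ir/GetTime', headers=headers)
print(r.status_code, r.text)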
Your first connection from the browser gets a 302 redirect response (to the same URL). The page then runs some JavaScript, so the second request no longer redirects and returns the expected JSON.
This is a common technique to stop other people from using an API without permission.
Tick the "Preserve log" checkbox in the browser's dev tools and you can see it for yourself.

400 Bad Request With urllib2 for POST

I have been struggling for two days with a POST request that must be made using only urllib and urllib2. I am limited to those modules because the machine where I need to deploy my code supports neither curl nor the requests library.
The POST call carries a header and a JSON body. I can make any GET call, but a POST with data and headers throws a 400 Bad Request. I have tried every option I could find on Google/Stack Overflow, but nothing solved it.
Below is the sample code:
import urllib
import urllib2

url = 'https://1.2.3.4/rest/v1/path'
headers = {'X-Auth-Token': '123456789sksksksk111',
           'Content-Type': 'application/json'}
body = {'Action': 'myaction',
        'PressType': 'Format1',
        'Target': '/abc/def'}
data = urllib.urlencode(body)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
With the debug handler enabled, this is the request that gets traced:
send: 'POST /rest/v1/path HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 52\r\nHost: 1.2.3.4\r\nUser-Agent: Python-urllib/2.7\r\nConnection: close\r\nX-Auth-Token: 123456789sksksksk111\r\nContent-Type: application/json\r\n\r\nAction=myaction&PressType=Format1&Target=%2Fabc%2Fdef'
reply: 'HTTP/1.1 400 Bad Request\r\n'
Please note that the same POST request works perfectly fine with any REST client and with the requests library. Looking at the debug output, the body sent under Content-Type: application/json is actually form-encoded (Action=myaction&PressType=Format1&Target=%2Fabc%2Fdef). Could that be the problem?
You can dump the body as JSON instead of form-encoding it. I was facing the same issue and this solved it.
Remove data = urllib.urlencode(body) and use urllib2.urlopen(request, json.dumps(body)) instead.
That should solve it.
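Putting that together, a corrected version of the snippet from the question might look like this (URL, token, and body kept from the question):

import json
import urllib2

url = 'https://1.2.3.4/rest/v1/path'
headers = {'X-Auth-Token': '123456789sksksksk111',
           'Content-Type': 'application/json'}
body = {'Action': 'myaction',
        'PressType': 'Format1',
        'Target': '/abc/def'}

# serialize the body as JSON so it matches the Content-Type header
request = urllib2.Request(url, json.dumps(body), headers)
response = urllib2.urlopen(request)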

HTTP Error 504: Gateway Time-out when trying to read a reddit comments post

I am encountering an error when trying to fetch a comments page from reddit. This has happened with various URLs (not all of them contain special characters), and this is one of them. In a one-hour time frame there may be 1000 or more requests to the reddit.com domain.
hdr = {"User-Agent": "My Agent"}
try:
    req = urllib2.Request("http://www.reddit.com/r/gaming/"
                          "comments/1bjuee/when_pokémon_was_good", headers=hdr)
    htmlSource = urllib2.urlopen(req).read()
except Exception as inst:
    print inst
Output: HTTP Error 504: Gateway Time-out
HTTP Error 504 Gateway timeout - A server (not necessarily a Web server) is acting as a gateway or proxy to fulfil the request by the client (e.g. your Web browser or our CheckUpDown robot) to access the requested URL. This server did not receive a timely response from an upstream server it accessed to deal with your HTTP request.
This usually means that the upstream server is down (no response to the gateway/proxy), rather than that the upstream server and the gateway/proxy do not agree on the protocol for exchanging data.
The problem can appear in different places on the network, and there is no single solution for it. You will have to investigate the problem on your own.
Your code works fine. A possible solution for your problem would be:
import urllib2

hdr = {"User-Agent": "My Agent"}

while True:
    try:
        req = urllib2.Request("http://www.reddit.com/", headers=hdr)
        response = urllib2.urlopen(req)
        htmlSource = response.read()
        if response.getcode() == 200:
            break
    except Exception as inst:
        print inst
This code will keep requesting the web page until it gets a 200 response (the standard response for a successful HTTP request). Once a 200 response occurs, the while loop breaks and you can make the next request (or whatever comes next in your program).
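Given that the question mentions 1000+ requests per hour, a bare while True loop can hammer the server and make the timeouts worse. A gentler variant might cap the retries and back off between attempts (the cap and delays here are assumptions, not anything reddit documents):

import time
import urllib2

hdr = {"User-Agent": "My Agent"}
htmlSource = None

for attempt in range(5):  # assumed retry cap
    try:
        req = urllib2.Request("http://www.reddit.com/", headers=hdr)
        response = urllib2.urlopen(req)
        htmlSource = response.read()
        break
    except Exception as inst:
        print inst
        time.sleep(2 ** attempt)  # wait a little longer after each failure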

About handling a redirection in python

I am new to Python and am trying to learn some new modules. Fortunately or unfortunately, I picked up the urllib2 module and started using it with one URL that's causing me problems.
To begin with, I created the Request object and then called read() on the response object. It was failing. It turns out the request is getting redirected, but the status code is still 200. Not sure what's going on. Here is the code:
def get_url_data(url):
    print "Getting URL " + url
    user_agent = "Mozilla/5.0 (Windows NT 6.0; rv:14.0) Gecko/20100101 Firefox/14.0.1"
    headers = {'User-Agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        response = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
        print e.geturl()
        print e.info()
        print e.getcode()
        return False
    else:
        print response
        print response.info()
        print response.getcode()
        print response.geturl()
        return response
I am calling the above function with "http://www.chilis.com".
I was expecting to receive a 301, 302, or 303, but instead I see a 200. Here is the response I see:
Getting URL http://www.chilis.com
<addinfourl at 4354349896 whose fp = <socket._fileobject object at 0x1037513d0>>
Cache-Control: private
Server: Microsoft-IIS/7.5
SPRequestGuid: 48bbff39-f8b1-46ee-a70c-bcad16725a4d
X-SharePointHealthScore: 0
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
MicrosoftSharePointTeamServices: 14.0.0.6120
X-MS-InvokeApp: 1; RequireReadOnly
Date: Wed, 13 Feb 2013 11:21:27 GMT
Connection: close
Content-Length: 0
Set-Cookie: BIGipServerpool_http_chilis.com=359791882.20480.0000; path=/
200
http://www.chilis.com/(X(1)S(q24tqizldxqlvy55rjk5va2j))/Pages/ChilisVariationRoot.aspx?AspxAutoDetectCookieSupport=1
Can someone explain what this URL is and how to handle it? I know I can use the "Handling Redirects" section from diveintopython.net, but with the code on that page too I see the same 200 response.
EDIT: Using the code from Dive Into Python, I see it's a temporary redirection. What I don't understand is why the status code from getcode() is 200. Isn't that supposed to be the actual return code?
EDIT 2: Now that I see it better, it's not a weird redirection at all. I am editing the title.
EDIT 3: If urllib2 follows the redirection automatically, I am not sure why the following code does not get the front page for chilis.com.
docObj = get_url_data(url)
doc = docObj.read()
soup = BeautifulSoup(doc, 'lxml')
print(soup.prettify())
If I use the URL that the browser eventually ends up redirected to, it works ("http://www.chilis.com/EN/Pages/home.aspx").
urllib2 automatically follows redirects, so the information you're seeing is from the page that was redirected to.
If you don't want it to follow redirects, you'll need to subclass urllib2.HTTPRedirectHandler. Here's a relevant SO posting on how to do that: How do I prevent Python's urllib(2) from following a redirect
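For completeness, one common recipe along those lines looks roughly like this (a sketch that hands back the 3xx response itself instead of following it):

import urllib
import urllib2

class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        # wrap the raw response so the caller gets the 3xx back unfollowed
        infourl = urllib.addinfourl(fp, headers, req.get_full_url())
        infourl.status = code
        infourl.code = code
        return infourl
    http_error_301 = http_error_303 = http_error_307 = http_error_302

opener = urllib2.build_opener(NoRedirectHandler())
response = opener.open("http://www.chilis.com")
print response.getcode()  # the redirect status, not the final page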
Regarding EDIT 3: it looks like www.chilis.com requires accepting cookies. This can be implemented using urllib2, but I would suggest installing the requests module (http://pypi.python.org/pypi/requests/).
The following seems to do exactly what you want (without any error handling):
import requests
from bs4 import BeautifulSoup

url = 'http://www.chilis.com'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())
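Note that a plain requests.get only keeps cookies for that single request. If you need the cookies to persist across several requests, a Session does that for you; a minimal sketch:

import requests

s = requests.Session()  # cookies set by the server persist on this session
r = s.get('http://www.chilis.com')
print(r.status_code)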

How do I set cookies using Python urlopen?

I am trying to fetch an html site using Python urlopen.
I am getting this error:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop
The code:
from urllib2 import Request, urlopen

request = Request(url)
response = urlopen(request)
I understand that the server redirects to another URL and that it is looking for a cookie.
How do I set the cookie it is looking for so I can read the html?
Here's an example from Python documentation, adjusted to your code:
import cookielib, urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
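The opener keeps any cookies the server sets, so later requests through the same opener send them back automatically:

html = response.read()
# a second request through the same opener reuses the stored cookies
response = opener.open(url)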
