I'm trying to write a small program that will simply display the header information of a website. Here is the code:
import urllib2
url = 'http://some.ip.add.ress/'
request = urllib2.Request(url)
try:
    html = urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.code
else:
    print html.info()
If 'some.ip.add.ress' is google.com then the header information is returned without a problem. However, if it's an IP address that requires basic authentication before access, then it returns a 401. Is there a way to get header (or any other) information without authentication?
I've worked it out.
After the try block fails with a 401, the exception object still carries the response headers, so the following modification prints them:
print e.info()
instead of:
print e.code
Thanks for looking :)
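For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea might look like the sketch below. The helper name fetch_headers and its opener parameter are assumptions made for illustration; they are not part of the library.

```python
import urllib.error
import urllib.request


def fetch_headers(url, opener=urllib.request.urlopen):
    """Return (status, headers) for url, even when the server answers 401.

    `fetch_headers` and `opener` are hypothetical names for this sketch;
    `opener` exists only so the network call can be swapped out.
    """
    try:
        response = opener(url)
    except urllib.error.HTTPError as e:
        # HTTPError doubles as a response object: the status code and the
        # headers the server sent are still available on it.
        return e.code, e.headers
    with response:
        return response.status, response.headers
```

Calling fetch_headers('http://some.ip.add.ress/') against a server that demands basic authentication should then return 401 together with the headers, including WWW-Authenticate if the server sends it.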
If you want just the headers, instead of using urllib2, you should go lower level and use httplib:
import httplib
# host and path are placeholders, e.g. "www.example.com" and "/"
conn = httplib.HTTPConnection(host)
conn.request("HEAD", path)
print conn.getresponse().getheaders()
If all you want are HTTP headers, then you should make a HEAD request rather than a GET request. You can see how to do this by reading Python - HEAD request with urllib2.
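In Python 3 this needs no subclassing, because urllib.request.Request accepts a method argument. A minimal sketch, where head_request is a made-up helper name and example.com is only a placeholder:

```python
import urllib.request


def head_request(url):
    """Build a Request that urlopen() will send as HEAD instead of GET."""
    # The `method` keyword is accepted by urllib.request.Request (Python 3.3+).
    return urllib.request.Request(url, method="HEAD")


# Usage (performs network I/O, so it is left commented out):
# with urllib.request.urlopen(head_request("http://example.com/")) as resp:
#     print(resp.status)
#     print(resp.headers)
```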
I'm using Python 3.7 with urllib.
Everything works fine, but it does not seem to redirect automatically when it gets an HTTP redirect response (307).
This is the error I get:
ERROR 2020-06-15 10:25:06,968 HTTP Error 307: Temporary Redirect
I have to handle it with a try-except and manually send another request to the new Location. That works, but I don't like it.
This is the piece of code I use to perform the request:
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type','application/json; charset=utf-8')
req.data=jdati
self.logger.debug(req.headers)
self.logger.info(req.data)
resp = urllib.request.urlopen(req)
url is an https resource, and I set headers with some Authorization info and the content type.
req.data is JSON.
From the urllib documentation I understood that redirects are performed automatically by the library itself, but that doesn't work for me. It always raises an HTTP 307 error and doesn't follow the redirect URL.
I've also tried using an opener that specifies the default redirect handler, but with the same result:
opener = urllib.request.build_opener(urllib.request.HTTPRedirectHandler)
req = urllib.request.Request(url)
req.add_header('Authorization', auth)
req.add_header('Content-Type','application/json; charset=utf-8')
req.data=jdati
resp = opener.open(req)
What could be the problem?
The reason why the redirect isn't done automatically has been correctly identified by yours truly in the discussion in the comments section. Specifically, RFC 2616, Section 10.3.8 states that:
If the 307 status code is received in response to a request other
than GET or HEAD, the user agent MUST NOT automatically redirect the
request unless it can be confirmed by the user, since this might
change the conditions under which the request was issued.
Back to the question: given that data has been assigned, get_method automatically returns POST (as per how this method was implemented). Since the request method is POST and the response code is 307, an HTTPError is raised instead, as per the above specification. In the context of Python's urllib, this specific section of the urllib.request module raises the exception.
For an experiment, try the following code:
import urllib.error
import urllib.request
import urllib.parse
url = 'http://httpbin.org/status/307'
req = urllib.request.Request(url)
req.data = b'hello'  # comment out to not trigger manual redirect handling
try:
    resp = urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
    if e.status != 307:
        raise  # not a status code that can be handled here
    redirected_url = urllib.parse.urljoin(url, e.headers['Location'])
    resp = urllib.request.urlopen(redirected_url)
    print('Redirected -> %s' % redirected_url)  # the original redirected url
print('Response URL -> %s ' % resp.url)  # the final url
Running the code as is may produce the following
Redirected -> http://httpbin.org/redirect/1
Response URL -> http://httpbin.org/get
Note that the subsequent redirect to /get was done automatically, as the subsequent request was a GET request. Commenting out the req.data assignment line will result in the lack of the "Redirected" output line.
Two other things to note in the exception handling block: e.read() may be called to retrieve the response body produced by the server as part of the HTTP 307 response (since data was posted, there might be a short entity in the response that can be processed), and urljoin is needed because the Location header may be a relative URL (or may simply lack the host).
Also, as a matter of interest (and for linkage purposes), this specific question has been asked multiple times before, and I am rather surprised that none of these ever got an answer:
How to handle 307 redirection using urllib2 from http to https
HTTP Error 307: Temporary Redirect in Python3 - INTRANET
HTTP Error 307 - Temporary redirect in python script
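If the manual try/except feels clumsy, one alternative is a redirect handler that re-issues the POST itself. This is only a sketch and not something the answer above proposes: the class name is made up, and it deliberately skips the confirmation step that urllib holds out for, so it only makes sense when the redirect target is trusted.

```python
import urllib.request


class PostPreservingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Follow 307 responses to POST requests, keeping method and body.

    Hypothetical class name. Note that the caller's headers (including any
    Authorization header) are forwarded to the redirect target.
    """

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        if code == 307 and req.get_method() == "POST":
            return urllib.request.Request(
                newurl,
                data=req.data,
                headers=req.headers,
                origin_req_host=req.origin_req_host,
                unverifiable=True,
                method="POST",
            )
        # Everything else keeps urllib's default behaviour.
        return super().redirect_request(req, fp, code, msg, headers, newurl)


# Usage (network I/O, left commented out):
# opener = urllib.request.build_opener(PostPreservingRedirectHandler())
# resp = opener.open(req)
```

Passing an instance to build_opener replaces the default HTTPRedirectHandler, so GET and HEAD redirects behave exactly as before.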
I want to check whether a URL is reachable from Python, and I got this code from googling:
import http.client
from urllib.parse import urlparse
def checkUrl(url):
    p = urlparse(url)
    conn = http.client.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400
Here is my URL: https://eurotableau.nomisonline.com.
It works fine if I just pass that to the function; resp.status is 302. However, if I add port 443 at the end, https://eurotableau.nomisonline.com:443, it returns False and resp.status is 400. I tried both URLs in Google Chrome and both of them work. So why is this happening? Is there any way I can include the port value and still get a valid resp.status value (< 400)? Thanks.
Use http.client.HTTPSConnection instead. The plain old HTTPConnection ignores the protocol that is part of the URL.
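One way to apply this is to pick the connection class from the URL's scheme, as in the sketch below. The helper names make_connection and check_url and the timeout default are assumptions for illustration. With ":443" in the netloc, a plain HTTPConnection would speak unencrypted HTTP to the TLS port, which is a plausible explanation for the 400.

```python
import http.client
from urllib.parse import urlsplit


def make_connection(url, timeout=10):
    """Return an unconnected HTTP(S)Connection matching the URL's scheme."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        cls = http.client.HTTPSConnection
    else:
        cls = http.client.HTTPConnection
    # netloc may carry an explicit port such as ":443"; both classes parse it.
    return cls(parts.netloc, timeout=timeout)


def check_url(url):
    """Send a HEAD request and report whether the status is below 400."""
    conn = make_connection(url)
    try:
        conn.request("HEAD", urlsplit(url).path or "/")
        return conn.getresponse().status < 400
    finally:
        conn.close()
```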
If you do not require the HEAD method but just wish to check if host is available then why not do:
from urllib2 import urlopen
try:
    u = urlopen("https://eurotableau.nomisonline.com")
    u.close()
    print "Everything fine!"
except Exception, e:
    if hasattr(e, "code"):
        print "Server is there but something is wrong with rest of URL"
    else:
        print "Server is on vacations or was never there!"
    print e
This will establish a connection with the server, but it won't download the body unless you read it. It will only read a few KB to get the headers (much like a HEAD request) and then wait for you to request more, and at that point you close it.
So you can catch an exception and see what the problem is or, if there is no exception, just close the connection.
urllib2 will handle HTTPS and URLs of the form protocol://user@URL:PORT for you neatly.
No worries about anything.
Scope:
I am currently trying to write a Web scraper for this specific page. I have a pretty strong "Web Crawling" background using C#, but httplib is defeating me.
Problem:
When trying to make a Http Get request for the page specified above I get a "Moved Permanently", that points to the very same URL. I can make a request using the requests lib, but I want to make it work using httplib so I can understand what I am doing wrong.
Code Sample:
I am completely new to Python, so any wrong language guideline or syntax is C#'s fault.
import httplib
# Wrapper for an "HTTP GET" request
class HttpClient(object):
    def HttpGet(self, host, url):
        connection = httplib.HTTPConnection(host)
        connection.request('GET', url)
        return connection.getresponse().read()
# Using the "HttpClient" class
httpclient = HttpClient()
# This is the full URL I need to make a get request for : https://420101.com/strain-database
httpResponseText = httpclient.HttpGet('www.420101.com', '/strain-database')
print httpResponseText
I really want to make it work using the httplib library, instead of requests or any other fancy one because I feel like I am missing something really small here.
The problem: I've had too little or too much caffeine in my system.
To make an HTTPS request, I needed the HTTPSConnection class.
Also, there is no 'www' in the address I wanted to GET, so it shouldn't be included in the host.
Both of the wrong addresses redirect me to the correct one with a 301 status code. If I were using requests or a more full-featured module, it would have followed the redirect automatically.
My Validation:
c = httplib.HTTPSConnection('420101.com')
c.request("GET", "/strain-database")
r = c.getresponse()
print r.status, r.reason
200 OK
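Since httplib does not follow redirects, the loop that requests runs for you has to be written by hand. Below is a rough sketch of that loop for Python 3, where httplib is called http.client. The function names, the max_redirects limit and the connect parameter are assumptions made for illustration; connect exists only so the network layer can be replaced.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = {301, 302, 303, 307, 308}


def open_connection(parts):
    """Pick the connection class from the URL scheme (hypothetical helper)."""
    if parts.scheme == "https":
        return http.client.HTTPSConnection(parts.netloc)
    return http.client.HTTPConnection(parts.netloc)


def get_following_redirects(url, max_redirects=5, connect=open_connection):
    """GET url, following Location headers by hand.

    Returns (final_url, status, body).
    """
    for _ in range(max_redirects + 1):
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn = connect(parts)
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            location = resp.getheader("Location")
            if resp.status in REDIRECT_CODES and location:
                # Location may be relative, so resolve it against the
                # URL that produced the redirect.
                url = urljoin(url, location)
                continue
            return url, resp.status, resp.read()
        finally:
            conn.close()
    raise RuntimeError("too many redirects")
```

Each hop opens a fresh connection because a redirect may change the scheme or the host, which is exactly what happened with the www and http variants above.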
I've been searching all around for a Python 3.x code sample to get HTTP Header information.
Something as simple as PHP's get_headers cannot easily be found in Python. Or maybe I am not sure how best to wrap my head around it.
In essence, I would like to code something that lets me see whether a URL exists or not,
something along the lines of
h = get_headers(url)
if(h[0] == 200)
{
print("Bingo!")
}
So far, I tried
h = http.client.HTTPResponse('http://docs.python.org/')
But always got an error
To get an HTTP response code in python-3.x, use the urllib.request module:
>>> import urllib.request
>>> response = urllib.request.urlopen(url)
>>> response.getcode()
200
>>> if response.getcode() == 200:
... print('Bingo')
...
Bingo
The returned HTTPResponse Object will give you access to all of the headers, as well. For example:
>>> response.getheader('Server')
'Apache/2.2.16 (Debian)'
If the server answers with an HTTP error status (such as 404), urllib.request.urlopen() raises an HTTPError exception. You can handle this to get the response code:
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen(url)
    if response.getcode() == 200:
        print('Bingo')
    else:
        print('The response code was not 200, but: {}'.format(
            response.getcode()))
except urllib.error.HTTPError as e:
    print('''An error occurred: {}
The response code was {}'''.format(e, e.getcode()))
For Python 2.x
urllib, urllib2 or httplib can be used here. Note, however, that urllib and urllib2 use httplib. Therefore, if you plan to do this check a lot (thousands of times), it would be better to use httplib directly. Additional documentation and examples are here.
Example code:
import httplib
try:
    h = httplib.HTTPConnection("www.google.com")
    h.connect()
except Exception as ex:
    print "Could not connect to page."
For Python 3.x
A similar story to urllib (or urllib2) and httplib from Python 2.x applies to the urllib.request and http.client libraries in Python 3.x. Again, http.client should be quicker. For more documentation and examples look here.
Example code:
import http.client
try:
    conn = http.client.HTTPConnection("www.google.com")
    conn.connect()
except Exception as ex:
    print("Could not connect to page.")
and if you wanted to check the status codes you would need to replace
conn.connect()
with
conn.request("GET", "/index.html")  # Could also use "HEAD" instead of "GET".
res = conn.getresponse()
if res.status == 200 or res.status == 302:  # Specify codes here.
    print("Page Found!")
Note, in both examples, if you would like to catch the specific exception relating to when the URL doesn't exist, rather than all of them, catch the socket.gaierror exception instead (see the socket documentation).
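A small Python 3 sketch of that narrower exception handling is shown here. The function name check_host, the returned strings and the connection_cls parameter are assumptions for illustration. socket.gaierror is a subclass of OSError, so it has to be caught first.

```python
import http.client
import socket


def check_host(host, timeout=5, connection_cls=http.client.HTTPConnection):
    """Classify a connection attempt as 'ok', 'unknown host' or 'unreachable'."""
    conn = connection_cls(host, timeout=timeout)
    try:
        conn.connect()
    except socket.gaierror:
        # Name resolution failed: the host name does not exist.
        return "unknown host"
    except OSError:
        # The name resolved, but the connection was refused or timed out.
        return "unreachable"
    else:
        return "ok"
    finally:
        conn.close()
```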
You can use the requests module to check it:
import requests
url = "http://www.example.com/"
res = requests.get(url)
if res.status_code == 200:
    print("bingo")
You can also check the headers before downloading the whole content of the web page by sending a HEAD request with requests.head(url).
You can use the urllib2 library:
import urllib2
if urllib2.urlopen(url).code == 200:
    print "Bingo"
I'm currently working on an automated way to interface with a database website that has RESTful web services installed. I am having trouble figuring out how to properly format and send the requests listed on the following site using Python.
https://neesws.neeshub.org:9443/nees.html
Particular example is this:
POST https://neesws.neeshub.org:9443/REST/Project/731/Experiment/1706/Organization
<Organization id="167"/>
The biggest problem is that I do not know where to put the XML formatted part of the above. I want to send the above as a python HTTPS request and so far I've been trying something of the following structure.
>>> import httplib
>>> conn = httplib.HTTPSConnection("neesws.neeshub.org:9443")
>>> conn.request("POST", "/REST/Project/731/Experiment/1706/Organization")
>>> conn.send('<Organization id="167"/>')
But this appears to be completely wrong. I've never actually done python when it comes to webservices interfaces so my primary question is how exactly am I supposed to use httplib to send the POST Request, particularly the XML formatted part of it? Any help is appreciated.
You need to set some request headers before sending data, for example Content-Type to 'text/xml'. Check out a few examples:
Post-XML-Python-1
Which has this code as example:
import sys, httplib
HOST = "www.example.com"
API_URL = "/your/api/url"
def do_request(xml_location):
    """HTTP XML POST request"""
    request = open(xml_location, "r").read()
    webservice = httplib.HTTP(HOST)
    webservice.putrequest("POST", API_URL)
    webservice.putheader("Host", HOST)
    webservice.putheader("User-Agent", "Python post")
    webservice.putheader("Content-type", "text/xml; charset=\"UTF-8\"")
    webservice.putheader("Content-length", "%d" % len(request))
    webservice.endheaders()
    webservice.send(request)
    statuscode, statusmessage, header = webservice.getreply()
    result = webservice.getfile().read()
    print statuscode, statusmessage, header
    print result
do_request("myfile.xml")
Post-XML-Python-2
You may get some idea.
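For the specific question of where the XML goes: it is the request body. In Python 3's http.client (the renamed httplib), request() accepts the body and headers directly, so the putrequest/putheader sequence is not needed. A hedged sketch follows. post_xml is a made-up helper name, and the text/xml content type is an assumption, since the service may expect something else such as application/xml.

```python
import http.client


def post_xml(conn, path, xml_text):
    """Send xml_text as the body of a POST; return (status, reason, body)."""
    body = xml_text.encode("utf-8")
    # Assumed content type; check what the web service actually expects.
    headers = {"Content-Type": "text/xml; charset=UTF-8"}
    # request() takes the body directly and fills in Content-Length itself.
    conn.request("POST", path, body=body, headers=headers)
    resp = conn.getresponse()
    return resp.status, resp.reason, resp.read()


# Usage (network I/O, left commented out):
# conn = http.client.HTTPSConnection("neesws.neeshub.org", 9443)
# print(post_xml(conn, "/REST/Project/731/Experiment/1706/Organization",
#                '<Organization id="167"/>'))
```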