Python - Get Header information from URL

I've been searching all around for a Python 3.x code sample to get HTTP Header information.
Something as simple as PHP's get_headers equivalent seems hard to find in Python, or maybe I'm not sure how best to wrap my head around it.
In essence, I would like to code something where I can see whether a URL exists or not, something along the lines of:
h = get_headers(url)
if (h[0] == 200)
{
    print("Bingo!")
}
So far, I tried:
h = http.client.HTTPResponse('http://docs.python.org/')
but I always got an error.

To get an HTTP response code in Python 3.x, use the urllib.request module:
>>> import urllib.request
>>> response = urllib.request.urlopen(url)
>>> response.getcode()
200
>>> if response.getcode() == 200:
...     print('Bingo')
...
Bingo
The returned HTTPResponse Object will give you access to all of the headers, as well. For example:
>>> response.getheader('Server')
'Apache/2.2.16 (Debian)'
If the call to urllib.request.urlopen() fails, an HTTPError Exception is raised. You can handle this to get the response code:
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen(url)
    if response.getcode() == 200:
        print('Bingo')
    else:
        print('The response code was not 200, but: {}'.format(
            response.getcode()))
except urllib.error.HTTPError as e:
    print('''An error occurred: {}
The response code was {}'''.format(e, e.getcode()))

For Python 2.x
urllib, urllib2 or httplib can be used here. Note, however, that urllib and urllib2 both use httplib. Therefore, if you plan to do this check a lot (thousands of times), it would be better to use httplib directly. Additional documentation and examples are here.
Example code:
import httplib

try:
    h = httplib.HTTPConnection("www.google.com")
    h.connect()
except Exception as ex:
    print "Could not connect to page."
For Python 3.x
A similar story to urllib (or urllib2) and httplib from Python 2.x applies to the urllib.request and http.client libraries in Python 3.x. Again, http.client should be quicker. For more documentation and examples look here.
Example code:
import http.client

try:
    conn = http.client.HTTPConnection("www.google.com")
    conn.connect()
except Exception as ex:
    print("Could not connect to page.")
and if you wanted to check the status codes you would need to replace
conn.connect()
with
conn.request("GET", "/index.html")  # Could also use "HEAD" instead of "GET".
res = conn.getresponse()
if res.status == 200 or res.status == 302:  # Specify codes here.
    print("Page Found!")
Note, in both examples, if you would like to catch the specific exception relating to when the URL doesn't exist, rather than all of them, catch the socket.gaierror exception instead (see the socket documentation).
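A minimal sketch of that narrower handling, assuming Python 3's http.client; the host name below uses the reserved .invalid TLD, so DNS resolution is guaranteed to fail and socket.gaierror is raised:

```python
import http.client
import socket

def host_resolves(host):
    """Return False only when the host name cannot be resolved."""
    conn = http.client.HTTPConnection(host, timeout=5)
    try:
        conn.connect()
        return True
    except socket.gaierror:  # DNS failure: the URL's host doesn't exist
        return False
    finally:
        conn.close()

print(host_resolves("no-such-host.invalid"))  # False
```

Other network errors (refused connections, timeouts) still propagate, which is the point of catching only socket.gaierror.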

You can use the requests module to check it:
import requests

url = "http://www.example.com/"
res = requests.get(url)
if res.status_code == 200:
    print("bingo")
You can also check the header contents before downloading the whole page by making a HEAD request instead.
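A sketch of that idea using only the standard library (requests.head does the same in a single call); the throwaway local server exists purely to keep the example self-contained, and the Content-Length value is made up:

```python
import http.server
import threading
import urllib.request

class HeadOnlyHandler(http.server.BaseHTTPRequestHandler):
    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Length", "1048576")  # pretend the body is 1 MiB
        self.end_headers()

    def log_message(self, *args):  # silence request logging
        pass

# Throwaway local server on a random free port.
server = http.server.HTTPServer(("127.0.0.1", 0), HeadOnlyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as resp:
    size = int(resp.headers["Content-Length"])
server.shutdown()

print(size)  # 1048576 -- learned without transferring any body bytes
```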

You can use the urllib2 library:
import urllib2

if urllib2.urlopen(url).code == 200:
    print "Bingo"


How can I read the contents of an URL with Transcrypt? Where is urlopen() located?

In Transcrypt I try to read JSON data from a URL, so I try:
import urllib.request
data = urllib.request.urlopen(data_url)
But I get the error "Import error, can't find [...] urllib.request". So urllib.request doesn't seem to be supported; strangely, the top-level import urllib works, but that doesn't get me to the urlopen() function...
Any idea where urlopen() is located in Transcrypt? Or is there another way to retrieve URLs?
I don't believe Transcrypt has the Python urllib library available. You will need to use a corresponding JavaScript library instead. I prefer axios, but you can also just use the built-in XMLHttpRequest() or window.fetch().
Here is a Python function you can incorporate that uses window.fetch():
def fetch(url, callback):
    def check_response(response):
        if response.status != 200:
            console.error('Fetch error - Status Code: ' + response.status)
            return None
        return response.json()
    prom = window.fetch(url)
    resp = prom.then(check_response)
    resp.then(callback)
    prom.catch(console.error)
Just call this fetch function from your Python code and pass in the URL and a callback to utilize the response after it is received.

"ConnectionResetError" What should I do?

# -*- coding: UTF-8 -*-
import urllib.request
import re
import os

os.system("cls")
url = input("Url Link : ")
if url[0:8] == "https://":
    url = url[:4] + url[5:]
if url[0:7] != "http://":
    url = "http://" + url
try:
    try:
        value = urllib.request.urlopen(url, timeout=60).read().decode('cp949')
    except UnicodeDecodeError:
        value = urllib.request.urlopen(url, timeout=60).read().decode('UTF8')
    par = '<title>(.+?)</title>'
    result = re.findall(par, value)
    print(result)
except ConnectionResetError as e:
    print(e)
The TimeoutError has disappeared, but now a ConnectionResetError appears. What is this error? Is it a server problem, so that I can't solve it on my side?
Don't give up!
Some websites require a specific HTTP header, in this case User-agent, so you need to set this header in your request.
Change your request like this (lines 17-20 of your code):
# Make a request object with the User-agent header set
request = urllib.request.Request(url, headers={"User-agent": "Python urllib test"})
# Open the url using the request object
response = urllib.request.urlopen(request, timeout=60)
# Read the response
data = response.read()
# Decode the value
try:
    value = data.decode('CP949')
except UnicodeDecodeError:
    value = data.decode('UTF-8')
You can change "Python urllib test" to anything you want; almost every server uses the User-agent header for statistical purposes.
Lastly, consider using appropriate whitespace, blank lines and comments to make your code more readable. It will be good for you.
More reading:
HTTP/1.1: Header Field Definitions - to understand what is User-agent header.
21.6. urllib.request — Extensible library for opening URLs — Python 3.4.3 documentation - Always read documentation. Link to urllib.request.Request section.
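As a quick sanity check, the attached header can be inspected on the Request object before anything is sent over the network (the URL here is just a placeholder):

```python
import urllib.request

# Build the request exactly as in the answer above; nothing is sent yet.
request = urllib.request.Request(
    "http://example.com/",
    headers={"User-agent": "Python urllib test"},
)

# urllib stores the header and will include it when urlopen() is called.
print(request.get_header("User-agent"))  # Python urllib test
```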

making a simple GET/POST with url Encoding python

I have a custom URL of the form:
http://somekey:somemorekey#host.com/getthisfile.json
I have tried every way I can think of, but keep getting errors.
Method 1:
from httplib2 import Http
ipdb> from urllib import urlencode
h=Http()
ipdb> resp, content = h.request("3b8138fedf8:1d697a75c7e50#abc.myshopify.com/admin/shop.json")
Error:
No help on =Http()
Got this method from here
Method 2:
import urllib
urllib.urlopen(url).read()
Error:
*** IOError: [Errno url error] unknown url type: '3b8108519e5378'
I guess something is wrong with the encoding.
I tried:
ipdb> url.encode('idna')
*** UnicodeError: label empty or too long
Is there any way to make this complex URL GET call easy?
You are using a PDB-based debugger instead of an interactive Python prompt. h is a command in PDB. Use ! to prevent PDB from trying to interpret the line as a command:
!h = Http()
urllib requires that you pass it a fully qualified URL; your URL is lacking a scheme:
urllib.urlopen('http://' + url).read()
Your URL does not appear to use any international characters in the domain name, so you do not need to use IDNA encoding.
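A quick way to see what the missing scheme does to parsing, shown here with Python 3's urllib.parse (Python 2's urlparse module behaves the same way):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

url = "abc.myshopify.com/admin/shop.json"

# Without a scheme, the parser finds no network location at all.
assert urlparse(url).netloc == ""

# With a scheme, the host is recognized.
assert urlparse("http://" + url).netloc == "abc.myshopify.com"
```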
You may want to look into the 3rd-party requests library; it makes interacting with HTTP servers that much easier and straightforward:
import requests
r = requests.get('http://abc.myshopify.com/admin/shop.json', auth=("3b8138fedf8", "1d697a75c7e50"))
data = r.json() # interpret the response as JSON data.
The current de facto HTTP library for Python is Requests.
import requests
response = requests.get(
    "http://abc.myshopify.com/admin/shop.json",
    auth=("3b8138fedf8", "1d697a75c7e50")
)
response.raise_for_status()  # Raise an exception if an HTTP error occurs
print response.content  # Do something with the content.

How do I get HTTP header info without authentication using python?

I'm trying to write a small program that will simply display the header information of a website. Here is the code:
import urllib2

url = 'http://some.ip.add.ress/'
request = urllib2.Request(url)
try:
    html = urllib2.urlopen(request)
except urllib2.URLError, e:
    print e.code
else:
    print html.info()
If 'some.ip.add.ress' is google.com then the header information is returned without a problem. However if it's an ip address that requires basic authentication before access then it returns a 401. Is there a way to get header (or any other) information without authentication?
I've worked it out.
After try has failed due to unauthorized access the following modification will print the header information:
print e.info()
instead of:
print e.code
Thanks for looking :)
If you want just the headers, then instead of using urllib2 you should go a level lower and use httplib:
import httplib
conn = httplib.HTTPConnection(host)
conn.request("HEAD", path)
print conn.getresponse().getheaders()
If all you want are HTTP headers, then you should make a HEAD request, not a GET request. You can see how to do this by reading Python - HEAD request with urllib2.
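In Python 3, the Request class accepts a method argument directly, so the urllib2 subclassing trick from that linked question is no longer needed (the URL here is a placeholder; nothing is sent until urlopen() is called):

```python
import urllib.request

# Ask for headers only; the method is attached to the request up front.
req = urllib.request.Request("http://example.com/", method="HEAD")
print(req.get_method())  # HEAD
```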

Python script to see if a web page exists without downloading the whole page?

I'm trying to write a script to test for the existence of a web page; it would be nice if it checked without downloading the whole page.
This is my jumping-off point. I've seen multiple examples use httplib in the same way; however, every site I check simply returns false.
import httplib
from httplib import HTTP
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    h = HTTP(p[1])
    h.putrequest('HEAD', p[2])
    h.endheaders()
    return h.getreply()[0] == httplib.OK

if __name__ == "__main__":
    print checkUrl("http://www.stackoverflow.com")  # True
    print checkUrl("http://stackoverflow.com/notarealpage.html")  # False
Any ideas?
Edit
Someone suggested this, but their post was deleted... Does urllib2 avoid downloading the whole page?
import urllib2

try:
    urllib2.urlopen(some_url)
    return True
except urllib2.URLError:
    return False
How about this:
import httplib
from urlparse import urlparse

def checkUrl(url):
    p = urlparse(url)
    conn = httplib.HTTPConnection(p.netloc)
    conn.request('HEAD', p.path)
    resp = conn.getresponse()
    return resp.status < 400

if __name__ == '__main__':
    print checkUrl('http://www.stackoverflow.com')  # True
    print checkUrl('http://stackoverflow.com/notarealpage.html')  # False
This will send an HTTP HEAD request and return True if the response status code is < 400.
Notice that StackOverflow's root path returns a redirect (301), not a 200 OK.
Using requests, this is as simple as:
import requests
ret = requests.head('http://www.example.com')
print(ret.status_code)
This just loads the website's headers. To test whether the request was successful, you can check the result's status_code, or use the raise_for_status method, which raises an exception if the connection was not successful.
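To illustrate raise_for_status without touching the network, a Response object can be built by hand; in real code it would come from requests.head() or requests.get():

```python
import requests

# A hand-built Response stands in for a real one; only status_code matters here.
resp = requests.models.Response()
resp.status_code = 404

try:
    resp.raise_for_status()
    outcome = "page exists"
except requests.exceptions.HTTPError:
    outcome = "page missing"

print(outcome)  # page missing
```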
How about this:
import requests

def url_check(url):
    """Boolean return - check to see if the site exists.

    This function takes a url as input, requests the site head (not the
    full html), and then checks the response to see if the status code
    is less than 400. If it is less than 400 it will return True,
    else it will return False.
    """
    try:
        site_ping = requests.head(url)
        if site_ping.status_code < 400:
            # To view the return status code, type: print(site_ping.status_code)
            return True
        else:
            return False
    except Exception:
        return False
You can try:
import urllib2

try:
    urllib2.urlopen(url='https://someURL')
except:
    print("page not found")
