Is there a difference between those two bs4 objects?
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup
req1 = Request("https://stackoverflow.com/") # HTTPS
html1 = urlopen(req1).read()
req2 = Request("http://stackoverflow.com/") # HTTP
html2 = urlopen(req2).read()
bsObj1 = BeautifulSoup(html1, "html.parser")
bsObj2 = BeautifulSoup(html2, "html.parser")
Do you really need to specify an HTTP protocol?
Here's my limited understanding: There isn't a practical difference in this case.
My understanding is that most websites that have https will redirect http URLs to https, as is the case here. It's possible for a site to have an http version and an https version up simultaneously, in which case they might not redirect. This would be bad practice, but nothing is stopping someone from doing it.
I would still explicitly use https whenever possible, just as a best practice.
All communication over the HTTP protocol happens using HTTP verbs GET, POST, PUT, DELETE. Specifying the protocol has two purposes:
1) It specifies the scheme for data communication.
A general URI is of the form:
scheme:[//[user[:password]#]host[:port]][/path][?query][#fragment] and common schemes are http(s), ftp, mailto, file, data, and irc.
2) It specifies if the scheme supports SSL encryption:
With http schemes, the added 's' in https ensures SSL encryption of data.
According to urllib3 Python docs:
It is highly recommended to always use SSL certificate verification.In order to enable verification you will need a set of root certificates. The easiest and most reliable method is to use the certifi package which provides Mozilla’s root certificate bundle:
pip install certifi
>>> import certifi
>>> import urllib3
>>> http = urllib3.PoolManager(
... cert_reqs='CERT_REQUIRED',
... ca_certs=certifi.where())
The PoolManager will automatically handle certificate verification and will raise SSLError if verification fails:
>>> http.request('GET', 'https://google.com')
(No exception)
>>> http.request('GET', 'https://expired.badssl.com')
urllib3.exceptions.SSLError ...
Related
I am trying to access a server over my internal network under https://prodserver.de/info.
I have the code structure as below:
import requests
from requests.auth import *
username = 'User'
password = 'Hello#123'
resp = requests.get('https://prodserver.de/info/', auth=HTTPBasicAuth(username,password))
print(resp.status_code)
While trying to access this server via browser, it works perfectly fine.
What am I doing wrong?
By default, requests library verifies the SSL certificate for HTTPS requests. If the certificate is not verified, it will raise a SSLError. You check this by disabling the certificate verification by passing verify=False as an argument to the get method, if this is the issue.
import requests
from requests.auth import *
username = 'User'
password = 'Hello#123'
resp = requests.get('https://prodserver.de/info/', auth=HTTPBasicAuth(username,password), verify=False)
print(resp.status_code)
try using requests' generic auth, like this:
resp = requests.get('https://prodserver.de/info/', auth=(username,password)
What am I doing wrong?
I can not be sure without investigating your server, but I suggest checking if assumption (you have made) that server is using Basic authorization, there exist various Authentication schemes, it is also possible that your server use cookie-based solution, rather than headers-based one.
While trying to access this server via browser, it works perfectly
fine.
You might then use developer tools to see what is actually send inside and with request which does result in success.
This is probably a dumb question, but I just want to make sure with the below.
I am currently using the requests library in python. I am using this to call an external API hosted on Azure cloud.
If I use the requests library from a virtual machine, and the requests library sends to URL: https://api-management-example/run, does that mean my communication to this API, as well as the entire payload I send through is secure? I have seen in my Python site-packages in my virtual environment, there is a cacert.pem file. Do I need to update that at all? Do I need to do anything else on my end to ensure the communication is secure, or the fact that I am calling the HTTPS URL means it is secure?
Any information/guidance would be much appreciated.
Thanks,
A HTTPS is secure with valid signed certificate. Some people use self signed certificate to maintain HTTPS. In requests library, you explicitly verify your certificate. If you have self-signed HTTPS then, you need to pass the certificate to cross verify with your local certificate.
verify = True
import requests
response = requests.get("https://api-management-example/run", verify=True)
Self Signed Certificate
import requests
response = requests.get("https://api-management-example/run", verify="/path/to/local/certificate/file/")
Post requests are more secure because they can carry data in an encrypted form as a message body. Whereas GET requests append the parameters in the URL, which is also visible in the browser history, SSL/TLS and HTTPS connections encrypt the GET parameters as well. If you are not using HTTPs or SSL/TSL connections, then POST requests are the preference for security.
A dictionary object can be used to send the data, as a key-value pair, as a second parameter to the post method.
The HTTPS protocol is safe provided you have a valid SSL certificate on your API. If you want to be extra safe, you can implement end-to-end encryption/cryptography. Basically converting your so called plaintext, and converting it to scrambled text, called ciphertext.
You can explicitly enable verification in requests library:
import requests
session = requests.Session()
session.verify = True
session.post(url='https://api-management-example/run', data={'bar':'baz'})
This is enabled by default. you can also verify the certificate per request:
requests.get('https://github.com', verify='/path/to/certfile')
Or per session:
s = requests.Session()
s.verify = '/path/to/certfile'
Read the docs.
I have a code as below :
headers = {'content-type': 'ContentType.APPLICATION_XML'}
uri = "www.client.url.com/hit-here/"
clientCert = "path/to/cert/abc.crt"
clientKey = "path/to/key/abc.key"
PROTOCOL = ssl.PROTOCOL_TLSv1
context = ssl.SSLContext(PROTOCOL)
context.load_default_certs()
context.load_cert_chain(clientCert, clientKey)
conn = httplib.HTTPSConnection(uri, some_port, context=context)
I am not really a network programmer, so i did some googling for handshake connection and found ssl.SSLContext(PROTOCOL) as the needed function, code works fine.
Then i hit the roadblock, my local has version 2.7.10 but all the production boxes have 2.7.3 with them, so SSLContext is not supported and upgrading python version is not an option / in control.
I tried reading ssl — SSL wrapper for socket objects but couldn't make sense out of it.
what i tried (in vain) :
s_ = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s = ssl.wrap_socket(s_, keyfile=clientKey, certfile=clientCert, cert_reqs=ssl.CERT_REQUIRED)
new_conn = s.connect((uri, some_port))
but returns :
SSLError(1, u'[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)')
Question - how to generate SSL Context on older version so as to have a secure https connection?
You have to specify the ca_certs file (which should point to trust store)
I've got the perfect solution using the requests library. The requests library has got to be my favorite library I've ever used, cause it takes something in Python that is inherently difficult to do -- SSL and REST requests -- and makes it unbelievably simple. I checked out their version support and Python 2.6+ is supported.
Here is an example of how to use their library.
>>> requests.get(uri)
And that is all you have to do. The requests library takes care of establishing a ssl connection.
Taking this one step farther. If you need to persist cookies between requests, you can do so like this.
>>> sess = requests.Session()
>>> credentials = {"username": "user",
"password": "pass"}
>>> sess.post("https://some-website/login", params=credentials)
<Response [200]>
>>> sess.get("https://some-website/a-backend-page").text
<html> the backend page... </html>
Edit: If you need to, you can also pass in the path to the certificate and the key like so requests.get(uri, cert=('path/to/cert/abc.crt', 'path/to/key/abc.key'))
Now hopefully you can convince them to install the requests library on the production boxes, cause it would be well worth it. Let me know if this works out for you.
I am trying to open an https URL using the urlopen method in Python 3's urllib.request module. It seems to work fine, but the documentation warns that "[i]f neither cafile nor capath is specified, an HTTPS request will not do any verification of the server’s certificate".
I am guessing I need to specify one of those parameters if I don't want my program to be vulnerable to man-in-the-middle attacks, problems with revoked certificates, and other vulnerabilities.
cafile and capath are supposed to point to a list of certificates. Where am I supposed to get this list from? Is there any simple and cross-platform way to use the same list of certificates that my OS or browser uses?
Works in python 2.7 and above
context = ssl.create_default_context(cafile=certifi.where())
req = urllib2.urlopen(urllib2.Request(url, body, headers), context=context)
I found a library that does what I'm trying to do: Certifi. It can be installed by running pip install certifi from the command line.
Making requests and verifying them is now easy:
import certifi
import urllib.request
urllib.request.urlopen("https://example.com/", cafile=certifi.where())
As I expected, this returns a HTTPResponse object for a site with a valid certificate and raises a ssl.CertificateError exception for a site with an invalid certificate.
Elias Zamarias answer still works, but gives a deprecation warning:
DeprecationWarning: cafile, cpath and cadefault are deprecated, use a custom context instead.
I was able to solve the same problem this way instead (using Python 3.7.0):
import ssl
import urllib.request
ssl_context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = urllib.request.urlopen("http://www.example.com", context=ssl_context)
You can download the certificates Mozilla in a format usable for urllib (e.g. PEM format) at http://curl.haxx.se/docs/caextract.html
Different Linux distributives have different pack names. I tested in Centos and Ubuntu. These certificate bundles are updates with system update. So you may just detect which bundle is available and use it with urlopen.
cafile = None
for i in [
'/etc/ssl/certs/ca-bundle.crt',
'/etc/ssl/certs/ca-certificates.crt',
]:
if os.path.exists(i):
cafile = i
break
if cafile is None:
raise RuntimeError('System CA-certificates bundle not found')
import certifi
import ssl
import urllib.request
try:
from urllib.request import HTTPSHandler
context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
context.options |= ssl.OP_NO_SSLv2
context.verify_mode = ssl.CERT_REQUIRED
context.load_verify_locations(certifi.where(), None)
https_handler = HTTPSHandler(context=context, check_hostname=True)
opener = urllib.request.build_opener(https_handler)
except ImportError:
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', YOUR_USER_AGENT)]
urllib.request.install_opener(opener)
Does urllib2 in Python 2.6.1 support proxy via https?
I've found the following at http://www.voidspace.org.uk/python/articles/urllib2.shtml:
NOTE
Currently urllib2 does not support
fetching of https locations through a
proxy. This can be a problem.
I'm trying automate login in to web site and downloading document, I have valid username/password.
proxy_info = {
'host':"axxx", # commented out the real data
'port':"1234" # commented out the real data
}
proxy_handler = urllib2.ProxyHandler(
{"http" : "http://%(host)s:%(port)s" % proxy_info})
opener = urllib2.build_opener(proxy_handler,
urllib2.HTTPHandler(debuglevel=1),urllib2.HTTPCookieProcessor())
urllib2.install_opener(opener)
fullurl = 'https://correct.url.to.login.page.com/user=a&pswd=b' # example
req1 = urllib2.Request(url=fullurl, headers=headers)
response = urllib2.urlopen(req1)
I've had it working for similar pages but not using HTTPS and I suspect it does not get through proxy - it just gets stuck in the same way as when I did not specify proxy. I need to go out through proxy.
I need to authenticate but not using basic authentication, will urllib2 figure out authentication when going via https site (I supply username/password to site via url)?
EDIT:
Nope, I tested with
proxies = {
"http" : "http://%(host)s:%(port)s" % proxy_info,
"https" : "https://%(host)s:%(port)s" % proxy_info
}
proxy_handler = urllib2.ProxyHandler(proxies)
And I get error:
urllib2.URLError: urlopen error
[Errno 8] _ssl.c:480: EOF occurred in
violation of protocol
Fixed in Python 2.6.3 and several other branches:
_bugs.python.org/issue1424152 (replace _ with http...)
http://www.python.org/download/releases/2.6.3/NEWS.txt
Issue #1424152: Fix for httplib, urllib2 to support SSL while working through
proxy. Original patch by Christopher Li, changes made by Senthil Kumaran.
I'm not sure Michael Foord's article, that you quote, is updated to Python 2.6.1 -- why not give it a try? Instead of telling ProxyHandler that the proxy is only good for http, as you're doing now, register it for https, too (of course you should format it into a variable just once before you call ProxyHandler and just repeatedly use that variable in the dict): that may or may not work, but, you're not even trying, and that's sure not to work!-)
Incase anyone else have this issue in the future I'd like to point out that it does support https proxying now, make sure the proxy supports it too or you risk running into a bug that puts the python library into an infinite loop (this happened to me).
See the unittest in the python source that is testing https proxying support for further information:
http://svn.python.org/view/python/branches/release26-maint/Lib/test/test_urllib2.py?r1=74203&r2=74202&pathrev=74203