Using certifi module with urllib2? - python

I'm having trouble downloading https pages with the urllib2 module, which seems to result from urllib2's inability to access the system's certificate store.
To get around this issue, one possible solution is to download HTTPS pages with pycurl, using the certifi module. The following is an example of doing so:
def download_web_page_with_curl(url_website):
    from pycurl import Curl, CAINFO, URL
    from certifi import where
    from cStringIO import StringIO

    response = StringIO()
    curl = Curl()
    curl.setopt(CAINFO, where())
    curl.setopt(URL, url_website)
    curl.setopt(curl.WRITEFUNCTION, response.write)
    curl.perform()
    curl.close()
    return response.getvalue()
Is there a way to use certifi with urllib2 (in a fashion comparable to the pycurl example above), which will permit me to download https sites? Alternatively, is there another feasible urllib2-based workaround which will remedy the permissions issue, without compromising security?

Would recommend using requests per my other answer. However, to answer the original question of how to do this with urllib2:
import urllib2
import certifi

def download_web_page_with_urllib2(url_website):
    t = urllib2.urlopen(url_website, cafile=certifi.where())
    return t.read()

text = download_web_page_with_urllib2('https://www.google.com/')
The same recommendations about error checking apply.
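For reference, a minimal sketch of what that error checking might look like with urllib2 (the URL is just a placeholder; the exception classes come from the standard urllib2 module):
import urllib2
import certifi

try:
    t = urllib2.urlopen('https://www.google.com/', cafile=certifi.where())
    text = t.read()
except urllib2.HTTPError as e:
    # The server was reached but returned an error status code
    print('HTTP error:', e.code)
except urllib2.URLError as e:
    # The server could not be reached (DNS failure, refused connection, SSL error, ...)
    print('Failed to reach the server:', e.reason)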

Expanding on the comment to use requests (which is built on urllib3):
def download_web_page_with_requests(url_website):
    import requests
    r = requests.get(url_website)
    return r.text
This is so much easier than anything else and properly handles SSL verification independent of the platform's own cert lists. If certifi is found, requests will automatically use it. If not, it silently falls back to a more limited, possibly older set of built-in root certs. If ensuring that certifi is used matters to you, you can do this:
r = requests.get(url_website, verify=certifi.where())
Note that the above code does not do the error checking that you should probably do. So I'll point out that requests.get() can throw a number of exceptions for invalid URLs, unreachable sites, communication errors, and failed certificate validation, so you should be prepared to catch and deal with those. If it does successfully talk to a server but the server returns a non-OK status code (such as for a non-existent page), an exception won't be thrown, so you'd also want to check that r.status_code == 200.
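As a rough sketch of that error checking (the 10-second timeout is my own addition, not part of the answer above):
import requests

def download_web_page_with_requests(url_website):
    try:
        r = requests.get(url_website, timeout=10)
    except requests.exceptions.RequestException as e:
        # Covers invalid URLs, unreachable hosts, timeouts and SSL failures
        print('Request failed:', e)
        return None
    if r.status_code != 200:
        # The server answered, but not with the page we asked for
        print('Unexpected status code:', r.status_code)
        return None
    return r.text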


History of retries using the requests library

I'm building a new retry feature into my Orchestrate script, and I want to know how many times my request method retried and, if possible, which errors it hit when trying to connect to a specific URL.
For now I need this for logging purposes: I'm working on a messaging system in a micro-service environment, and I may need this retry information to understand when and why I'm having problems with HTTP requests.
So far I have debugged and confirmed that retries work as expected (I have a mocked Flask server for all the micro-services we use), but I couldn't find a way to get the retry history.
In other words, I want to see whether, for example, a specific micro-service only responds after the third request, and that kind of thing.
Below is the code that I'm using now:
from requests import exceptions, Session
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

def open_request_session():
    # Default retries configs
    toggle = True  # loaded from config file
    session = Session()
    if toggle:
        # loaded from config file as well
        parameters = {'total': 5,
                      'backoff_factor': 0.0,
                      'http_error_to_retry': [400]}
        retries = Retry(total=parameters['total'],
                        backoff_factor=parameters['backoff_factor'],
                        status_forcelist=parameters['http_error_to_retry'],
                        # Do not force an exception when False
                        raise_on_status=False)
        session.mount('http://', HTTPAdapter(max_retries=retries))
        session.mount('https://', HTTPAdapter(max_retries=retries))
    return session

# Request
with open_request_session() as request:
    my_response = request.get(url, timeout=10)
I see in the urllib3 documentation that Retry has a history attribute, but when I try to inspect it, it is empty.
I don't know if I'm doing something wrong or forgetting something, since software development is not my strongest skill.
So, I have two questions:
Does anyone know a way to get this history information?
How can I write tests to verify that the retry behavior works as expected? (So far I have only tested in debug mode.)
I'm using Python 3.6.8.
I know that I could use a while loop to 'control' this, but I'm trying to avoid that complexity. That's why I'm here: I'm looking for an alternative based on Python and community best practices.
A bit late, but I just figured this out myself, so I thought I'd share what I found.
Short answer:
response.raw.retries.history
will get you what you are looking for.
Long answer:
You cannot get the history off the original Retry instance created. Under the covers, urllib3 creates a new Retry instance for every attempt.
urllib3 does store the last Retry instance on the response when one is returned. However, the response from the requests library is a wrapper around the urllib3 response. Luckily, requests stores the original urllib3 response on the raw field.
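Putting that together with the session from the question, a brief sketch (the url variable is assumed, as in the question's own snippet; the printed fields come from urllib3's RequestHistory namedtuple):
with open_request_session() as session:
    my_response = session.get(url, timeout=10)

# requests wraps the urllib3 response; the last Retry instance lives on .raw
for attempt in my_response.raw.retries.history:
    # each entry records the method, url, error, status and redirect_location
    print(attempt.method, attempt.url, attempt.status, attempt.error)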

SSLError("bad handshake: Error([('SSL routines', 'tls_process_ske_dhe', 'dh key too small' in Python

I have seen a few links about this issue, and most people want the server to be updated for security reasons. I am building an internal-only tool and need to connect to a server that cannot be modified. My code is below, and I am hoping for some clarity on how I can accept the small key and process the request.
Thank you all in advance.
import requests
from requests.auth import HTTPBasicAuth
import warnings
import urllib3
warnings.filterwarnings("ignore")
requests.packages.urllib3.disable_warnings()
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += 'HIGH:!DH:!aNULL'
#requests.packages.urllib3.contrib.pyopenssl.DEFAULT_SSL_CIPHER_LIST += 'HIGH:!DH:!aNULL'
url = "https://x.x.x.x/place/stuff"
userName = 'stuff'
passW = 'otherstuff'
dataR = requests.get(url, auth=HTTPBasicAuth(userName, passW), verify=False)
print(dataR.text)
The problem with DH keys that are too small is discussed at length at https://weakdh.org, with various remediations.
Now in your case it depends on the OpenSSL library that Python uses under the hood, which hardcodes the rejection of values that are too small.
Have a look at: How to reject weak DH parameters in an OpenSSL client?
Currently, OpenSSL in client mode aborts the handshake only if the key length of the server-selected DH parameters is less than 768 bits (hardcoded in the source).
Based on the answer there, you could use SSL_CTX_set_tmp_dh_callback and SSL_set_tmp_dh_callback to control things more to your liking... except that, at the time, this did not seem to work on the client side, only on the server side.
Based on http://openssl.6102.n7.nabble.com/How-to-enforce-DH-field-size-in-the-client-td60442.html it seems that some work was done on this problem in the 1.1.0 branch. It also hints at commit 2001129f096d10bbd815936d23af3e97daf7882d in 1.0.2, so first maybe try a newer version of OpenSSL (you did not specify which versions you are using).
However, even if you manage to get everything working with OpenSSL, you still need your Python to use it (so you would probably have to compile Python yourself) and then need a specific API inside Python to hook into it... To be honest, I think you will lose far less time fixing the service (even if you say you cannot modify it) than trying to cripple the client, since rejecting small keys is a good thing (for the reasons explained in the first link).
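If you do stay on the client-side workaround, one possible refinement of the cipher hack already in the question is to scope it to a single Session instead of patching urllib3 globally. This is only a sketch under that assumption (the adapter class name and usage are mine, not part of the answer above), and the same security caveats apply:
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class NoDHAdapter(HTTPAdapter):
    """Hypothetical transport adapter that excludes DH key exchange, mirroring
    the 'HIGH:!DH:!aNULL' cipher string from the question, but only for the
    session it is mounted on."""
    def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
        ctx = ssl.create_default_context()
        ctx.set_ciphers('HIGH:!DH:!aNULL')
        # Mirror verify=False from the question
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        kwargs['ssl_context'] = ctx
        self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                       block=block, **kwargs)

session = requests.Session()
session.mount('https://', NoDHAdapter())
# dataR = session.get(url, auth=HTTPBasicAuth(userName, passW), verify=False)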

Python SSLError: VERSION_TOO_LOW

I'm having some trouble using urllib to fetch some web content on my Debian server. I use the following code to get the contents of most websites without problems:
import urllib.request as request
url = 'https://www.metal-archives.com/'
req = request.Request(url, headers={'User-Agent': "foobar"})
response = request.urlopen(req)
response.read()
However, if the website is using an older encryption protocol, the urlopen function will throw the following error:
ssl.SSLError: [SSL: VERSION_TOO_LOW] version too low (_ssl.c:748)
I have found a way to work around this problem, which consists of creating an SSL context and passing it as an argument to the urlopen function, so the previous code has to be modified:
...
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = request.urlopen(req, context=context)
...
This will work, provided the protocol specified matches the one used by the website I'm trying to access. However, it does not seem like the best solution, since:
If the site owners ever update their cryptography methods, the code will stop working
The code above will only work for this site; I would have to create special cases for every website the program visits, since each one could be using a different protocol version. That would lead to pretty messy code
The first solution I posted (the one without the SSL context) oddly seems to work on my Arch Linux machine, even though both machines have the same versions of everything
Does anyone know about a generic solution that would work for every TLS version? Am I missing something here?
PS: For completeness, I will add that I'm using Debian 9, python v3.6.2, openssl v1.1.0f and urllib3 v1.22
In the end, I've opted to wrap the method call in a try/except so I can fall back to the older SSL version. The final code is this:
url = 'https://www.metal-archives.com'
req = request.Request(url, headers={"User-Agent": "foobar"})
try:
    response = request.urlopen(req)
except (ssl.SSLError, URLError):
    # Try to use the older TLSv1 to see if we can fix the problem
    context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    response = request.urlopen(req, context=context)
I have only tested this code on a dozen websites and it seems to work so far, but I'm not sure it will work every time. Also, this solution seems inefficient, since a failure costs two HTTP requests, which can be slow.
Improvements are still welcome :)
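As a possible improvement along the same lines, the fallback could be wrapped in a small helper so it doesn't have to be repeated per site. This is just a sketch of the same idea, not a tested alternative:
import ssl
from urllib import request
from urllib.error import URLError

def fetch(url, user_agent="foobar"):
    req = request.Request(url, headers={"User-Agent": user_agent})
    try:
        return request.urlopen(req).read()
    except (ssl.SSLError, URLError):
        # Retry with TLSv1 for servers that only speak older protocols
        context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
        return request.urlopen(req, context=context).read()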

How to disable SSL verification for urlretrieve?

A similar question has been asked a couple of times on SO, but the solutions are for urlopen. That function takes an optional context parameter which can accept a pre-configured SSL context. urlretrieve does not have this parameter. How can I bypass SSL verification errors in the following call?
urllib.request.urlretrieve(
    "http://sourceforge.net/projects/libjpeg-turbo/files/1.3.1/libjpeg-turbo-1.3.1.tar.gz/download",
    destFolder + "/libjpeg-turbo.tar.gz")
This solution worked for me as well: before making the call to the library, override the default HTTPS context:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
# urllib.request.urlretrieve(...)
Source: http://thomas-cokelaer.info/blog/2016/01/python-certificate-verified-failed/
This does not appear to be possible with urlretrieve (in Python >=2.7.9, or Python >=3.0).
The requests package is recommended as a replacement.
Edited to add: the context parameter has been added to the code, even though it isn't mentioned in the documentation! Hat-tip @Sushisource
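Since requests is suggested as the replacement, here is a rough sketch of what the download above might look like with it (the output filename and chunk size are my own choices); verify=False disables certificate checking, matching the intent of the question:
import requests

url = ("http://sourceforge.net/projects/libjpeg-turbo/files/1.3.1/"
       "libjpeg-turbo-1.3.1.tar.gz/download")
with requests.get(url, stream=True, verify=False) as r:
    r.raise_for_status()
    with open("libjpeg-turbo.tar.gz", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)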

Persistent HTTPS Connections in Python

I want to make an HTTPS request to a real-time stream and keep the connection open so that I can keep reading content from it and processing it.
I want to write the script in Python, but I am unsure how to keep the connection open in my script. I have tested the endpoint with curl, which keeps the connection open successfully. But how do I do it in Python? Currently, I have the following code:
c = httplib.HTTPSConnection('userstream.twitter.com')
c.request("GET", "/2/user.json?" + req.to_postdata())
response = c.getresponse()
Where do I go from here?
Thanks!
It looks like your real-time stream is delivered as one endless HTTP GET response, yes? If so, you could just use Python's built-in urllib2.urlopen(). It returns a file-like object from which you can read as much as you want until the server hangs up on you.
f = urllib2.urlopen('https://encrypted.google.com/')
while True:
    data = f.read(100)
    print(data)
Keep in mind that although urllib2 speaks HTTPS, it doesn't validate server certificates, so you might want to try an add-on package like pycurl or urlgrabber for better security. (I'm not sure if urlgrabber supports HTTPS.)
Connection keep-alive features are not available in any of the Python standard libraries for HTTPS. The most mature option is probably urllib3.
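For example, a brief sketch of how urllib3 might be used here (the URL is just a placeholder): a PoolManager keeps connections alive, and preload_content=False lets you read the stream incrementally.
import urllib3

http = urllib3.PoolManager()
r = http.request("GET", "https://encrypted.google.com/", preload_content=False)
for chunk in r.stream(100):
    # Process each chunk as it arrives over the open connection
    print(chunk)
r.release_conn()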
httplib2 supports this. (I'd have thought it the most mature option; I didn't know about urllib3 yet, so TokenMacGuy may still be right.)
EDIT: while httplib2 does support persistent connections, I don't think you can really consume streams with it (i.e. one long response vs. multiple requests over the same connection), which I now realize you may need.
