Python Requests throws SSL Error on certain site

EDIT - FIXED. tl;dr: a somewhat old version of Python installed a couple of years ago had an ssl package that was not updated to handle newer SSL certificates. After updating Python and making sure the ssl package was up to date, everything worked.
I'm new to web scraping and wanted to scrape a certain site, but for some reason I'm getting errors when using Python's Requests package on this particular site.
I am working on secure login to scrape data from my user profile. The login address can be found here: https://secure.funorb.com/m=weblogin/loginform.ws?mod=hiscore_fo&ssl=0&expired=0&dest=
I'm just trying to perform simple tasks at this point, like printing the text from a get request. The following is my code.
import requests
req = requests.get('https://secure.funorb.com/m=weblogin/loginform.ws?mod=hiscore_fo&ssl=0&expired=0&dest=',verify=False)
print req.text
When I run this, an error is thrown:
File "/Library/Python/2.7/site-packages/requests/adapters.py", line 512, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: EOF occurred in violation of protocol (_ssl.c:590)
I've looked in this file to see what's going on. It seems the culprit is
except (_SSLError, _HTTPError) as e:
    if isinstance(e, _SSLError):
        raise SSLError(e, request=request)
    elif isinstance(e, ReadTimeoutError):
        raise ReadTimeout(e, request=request)
    else:
        raise
I'm not really sure how to avoid this, unfortunately; I'm at my debugging limit here.
My code works just fine on other secure sites, such as https://bitbucket.org/account/signin/. I've looked at a ton of solutions on Stack Exchange and around the net, and a lot of people claimed that adding the optional argument verify=False should fix these types of SSL errors (albeit it's not the most secure way to do it). But as you can see from my code snippet, that isn't helping me.
If anyone can get this working, or can give advice on where to go from here, it would be much appreciated.

... lot of people claimed adding in the optional argument "verify=False" should fix these types of SSL errors
Adding verify=False helps against errors when validating the certificate, but not against an EOF from the server, handshake errors, or similar.
As can be seen from SSL Labs, this specific server simply closes the connection (hence "EOF occurred in violation of protocol") for clients which don't support TLS 1.2 with modern ciphers. While you don't specify which SSL version you use, I expect it to be older than OpenSSL 1.0.1, the first version of OpenSSL to support TLS 1.2.
Please check ssl.OPENSSL_VERSION for the version used in your code. If I'm correct, your only fix is to upgrade the version of OpenSSL used by Python. How this is done depends on your platform, but there are existing posts about it, like "Updating openssl in python 2.7".
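For example, a quick check run with the same interpreter your script uses:
import ssl
print(ssl.OPENSSL_VERSION)  # anything older than OpenSSL 1.0.1 cannot speak TLS 1.2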

I've seen this somewhere else. What if you try using a session, like this:
import requests
sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries=20)
sess.mount('http://', adapter)
sess.mount('https://', adapter)  # the target URL is https, so mount the adapter there too
Then, replace requests.get() with sess.get().
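For completeness, with the session in place the call would look something like this (same URL as in the question, with verify=False carried over from it):
resp = sess.get('https://secure.funorb.com/m=weblogin/loginform.ws?mod=hiscore_fo&ssl=0&expired=0&dest=', verify=False)
print(resp.status_code)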
If you want to keep working with requests, maybe you need to install the ndg-httpsclient package.

Related

"Can't connect to HTTPS URL because the SSL module is not available."

Hi guys, I'm having trouble with SSL and Python.
I had a script that talks to the F5 API with requests, and it worked fine.
I then wrote another API script and tried it against another machine (some VPN system whose name I can't provide), also with the requests package, since my Python API script using requests doesn't work.
I know the problem started when I tried to reach the other machine's API, because I now have the problem on 2 machines. On the other machine I triggered it on purpose, to see if that was my problem (I was right, sadly).
Example of a script that worked:
import requests

def f5_ltm_01_active_status():
    response = requests.get("https://<ip-address>/mgmt/tm/cm/device/ver=12.1.3.4",
                            auth=("user", "password"), verify=False)
    try:
        json_response = response.json()
        if json_response["items"][0]['hostname']:
            return_str = "%s is %s" % (json_response["items"][0]['hostname'],
                                       json_response["items"][0]['failoverState'])
        else:
            return_str = "Wrong value in JSON"
    except:
        return_str = "Something went wrong, please check the code"
    finally:
        return return_str
The error that Python returned is:
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='ip-address-of-f5', port=443): Max retries exceeded with url: /mgmt/tm/cm/device?ver=12.1.3.4 (Caused by SSLError("Can't connect to HTTPS URL because the SSL module is not available."))
If I run the script in debug mode, it works.
I am using Windows 10.
I am using Python v3.7.2.
Other things that I already tried:
looking on Stack Overflow for answers
searching Google for answers
uninstalling PyCharm and reinstalling it
uninstalling Python and reinstalling it
uninstalling the requests and urllib3 packages and reinstalling them
installing OpenSSL support - the pyopenssl package
I really want an answer and an idea of how and why this happened, but mainly how to fix it.
I was mistaken.
I had made a Python file named ssl.py in the same project.
Python must have been looking for ssl attributes in my ssl.py file instead of the standard library module.
When I deleted that file, everything worked.
I hope this post helps anyone with the same problem.
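For anyone checking for the same thing, a quick way to confirm that a local file is shadowing the standard library module (in my case it was my own ssl.py):
import ssl
print(ssl.__file__)  # if this points into your project instead of the Python install, a local ssl.py is shadowing the real module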

SSLError("bad handshake: Error([('SSL routines', 'tls_process_ske_dhe', 'dh key too small' in Python

I have seen a few links for this issue, and most people want the server to be updated for security reasons. I am building an internal-only tool and need to connect to a server that cannot be modified. My code is below, and I am hoping to get clarity on how I can accept the small key and process the request.
Thank you all in advance
import requests
from requests.auth import HTTPBasicAuth
import warnings
import urllib3

warnings.filterwarnings("ignore")
requests.packages.urllib3.disable_warnings()
# note the leading ':' so the appended ciphers are separated from the existing defaults
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS += ':HIGH:!DH:!aNULL'
#requests.packages.urllib3.contrib.pyopenssl.DEFAULT_SSL_CIPHER_LIST += ':HIGH:!DH:!aNULL'

url = "https://x.x.x.x/place/stuff"
userName = 'stuff'
passW = 'otherstuff'
dataR = requests.get(url, auth=HTTPBasicAuth(userName, passW), verify=False)
print(dataR.text)
The problem with too-small DH keys is discussed at length at https://weakdh.org, with various remediations.
Now, in your case it depends on the OpenSSL that Python uses under the hood, which hardcodes the rejection of too-small values.
Have a look at: How to reject weak DH parameters in an OpenSSL client?
Currently OpenSSL in client mode stops the handshake only if the key length of the server-selected DH parameters is less than 768 bits (hardcoded in the source).
Based on the answer there, you could use SSL_CTX_set_tmp_dh_callback and SSL_set_tmp_dh_callback to control things more to your liking... except that at the time it did not seem to work on the client side, only the server side.
Based on http://openssl.6102.n7.nabble.com/How-to-enforce-DH-field-size-in-the-client-td60442.html it seems that some work was done on this problem in the 1.1.0 branch. It also seems to hint at a commit 2001129f096d10bbd815936d23af3e97daf7882d in 1.0.2, so first maybe try a newer version of OpenSSL (you did not specify which versions you are using).
However, even if you manage to get everything working with OpenSSL, you still need your Python to use it (so you probably have to compile Python yourself), and then the relevant API has to be exposed inside Python... To be honest, I think you will lose far less time fixing the service (even if you say you cannot modify it) than trying to basically cripple the client, as rejecting small keys is a good thing (for the reasons explained in the first link).
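That said, if crippling the client really is the only option, one commonly shown sketch (the adapter name is mine, and it assumes a reasonably recent requests/urllib3) is to exclude DHE ciphers through a custom transport adapter, so the weak DH exchange is never negotiated at all:
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context

class NoDHAdapter(HTTPAdapter):
    """Transport adapter whose SSL context leaves out DHE ciphers."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = create_urllib3_context(ciphers="DEFAULT:!DH")
        ctx.check_hostname = False       # mirrors verify=False from the question
        ctx.verify_mode = ssl.CERT_NONE
        kwargs["ssl_context"] = ctx
        return super(NoDHAdapter, self).init_poolmanager(*args, **kwargs)

sess = requests.Session()
sess.mount("https://", NoDHAdapter())
# dataR = sess.get(url, auth=HTTPBasicAuth(userName, passW), verify=False)  # variables from the question
Scoping the cipher change to one Session this way avoids touching urllib3's global DEFAULT_CIPHERS, but all the caveats above still apply.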

Python web scraping : urllib.error.URLError: urlopen error [Errno 11001] getaddrinfo failed

This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used this code, which works fine with URLs such as https://www.google.co.in or https://www.pythonprogramming.net.
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password#url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue is here, as I am not able to understand it.
Also while searching for the above error, I read something about absolute URLs. Is that related to it?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getAddrInfo) fails. Hence the error.
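A quick way to see what your own resolver returns for the name (the proxy's resolver is the one actually failing here, but this shows the two can disagree):
import socket
# what the local resolver answers; the proxy's resolver may answer differently, or not at all
print(socket.gethostbyname('www.genecards.org'))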
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped.
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to provide its information in an easier format is one thing - stealing that information is quite another.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Does it respond? The problem seems related to security on the server side.

Python SSLError: VERSION_TOO_LOW

I'm having some trouble using urllib to fetch some web content on my Debian server. I use the following code to get the contents of most websites without problems:
import urllib.request as request
url = 'https://www.metal-archives.com/'
req = request.Request(url, headers={'User-Agent': "foobar"})
response = request.urlopen(req)
response.read()
However, if the website is using an older encryption protocol, the urlopen function will throw the following error:
ssl.SSLError: [SSL: VERSION_TOO_LOW] version too low (_ssl.c:748)
I have found a way to work around this problem, which consists of creating an SSL context and passing it as an argument to the urlopen function, so the previous code has to be modified:
...
context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
response = request.urlopen(req, context=context)
...
Which will work, provided the protocol specified matches the website I'm trying to access. However, this does not seem like the best solution since:
If the site owners ever update their cryptography methods, the code will stop working
The code above will only work for this site, and I would have to create special cases for every website I visit in the entire program, since every site could be using a different version of the protocol. That would lead to pretty messy code.
The first solution I posted (the one without the ssl context) oddly seems to work on my ArchLinux machine, even though they both have the same versions of everything
Does anyone know about a generic solution that would work for every TLS version? Am I missing something here?
PS: For completeness, I will add that I'm using Debian 9, python v3.6.2, openssl v1.1.0f and urllib3 v1.22
In the end, I've opted to wrap the method call inside a try-except, so I can use the older SSL version as a fallback. The final code is this:
url = 'https://www.metal-archives.com'
req = request.Request(url, headers={"User-Agent": "foobar"})
try:
    response = request.urlopen(req)
except (ssl.SSLError, URLError):
    # Try to use the older TLSv1 to see if we can fix the problem
    context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
    response = request.urlopen(req, context=context)
I have only tested this code on a dozen websites and it seems to work so far, but I'm not sure it will work every time. Also, this solution seems inefficient, since it can end up making two HTTP requests, which can be very slow.
Improvements are still welcome :)
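One possible way to package the same fallback so the call sites stay clean (just a sketch of the idea above, nothing new protocol-wise):
import ssl
import urllib.request as request
from urllib.error import URLError

def fetch(url, headers=None):
    # Try the default TLS settings first; fall back to TLSv1 only if the handshake fails.
    req = request.Request(url, headers=headers or {'User-Agent': 'foobar'})
    try:
        return request.urlopen(req)
    except (ssl.SSLError, URLError):
        context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
        return request.urlopen(req, context=context)

response = fetch('https://www.metal-archives.com')
print(response.status)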

Httplib2 ssl error

Today I faced an interesting issue.
I'm using the foursquare-recommended Python library httplib2, and it raises
SSLHandshakeError(SSLError(1, '_ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed'),)
while trying to request an OAuth token at
response, body = h.request(url, method, headers=headers, body=data)
in the _process_request_with_httplib2 function.
Does anyone know why this happens?
If you know that the site you're trying to get is a "good guy", you can try creating your "opener" like this:
import httplib2
if __name__ == "__main__":
    h = httplib2.Http(".cache", disable_ssl_certificate_validation=True)
    resp, content = h.request("https://site/whose/certificate/is/bad/", "GET")
(the interesting part is disable_ssl_certificate_validation=True)
From the docs:
http://bitworking.org/projects/httplib2/doc/html/libhttplib2.html#httplib2.Http
EDIT 01:
Since your question was actually why does this happen, you can check this or this.
EDIT 02:
Seeing how this answer has been visited by more people than I expected, I'd like to explain a bit when disabling certificate validation could be useful.
First, a bit of light background on how these certificates work. There's quite a lot of information in the links provided above, but here it goes, anyway.
The SSL certificates need to be verified by a well known (at least, well known to your browser) Certificate Authority. You usually buy the whole certificate from one of those authorities (Symantec, GoDaddy...)
Broadly speaking, the idea is: those Certificate Authorities (CAs) give you a certificate that also contains the CA information in it. Your browser has a list of well known CAs, so when it receives a certificate, it will do something like: "HmmmMMMmmm.... [the browser makes a suspicious face here] ... I received a certificate, and it says it's verified by Symantec. Do I know that "Symantec" guy? [the browser then goes to its list of well known CAs and checks for Symantec] Oh, yeah! I do. Ok, the certificate is good!"
You can see that information yourself if you click on the little lock by the URL in your browser.
However, there are cases in which you just want to test the HTTPS, and you create your own Certificate Authority using a couple of command line tools and you use that "custom" CA to sign a "custom" certificate that you just generated as well, right? In that case, your browser (which, by the way, in the question is httplib2.Http) is not going to have your "custom" CA among the list of trusted CAs, so it's going to say that the certificate is invalid. The information is still going to travel encrypted, but what the browser is telling you is that it doesn't fully trust that is traveling encrypted to the place you are supposing it's going.
For instance, let's say you created a set of custom keys and CAs and all the mumbo-jumbo following this tutorial for your localhost FQDN, and that your CA certificate file is located in the current directory. You could very well have a server running on https://localhost:4443 using your custom certificates and whatnot. Now, your CA certificate file is located in the current directory, in the file ./ca.crt (the same directory your Python script is going to run in). You could use httplib2 like this:
h = httplib2.Http(ca_certs='./ca.crt')
response, body = h.request('https://localhost:4443')
print(response)
print(body)
... and you wouldn't see the warning anymore. Why? Because you told httplib2 to go look for the CA's certificate in ./ca.crt.
However, since Chrome (to cite a browser) doesn't know about this CA's certificate, it will consider it invalid.
Also, certificates expire. There's a chance you are working in a company which uses an internal site with SSL encryption. It works ok for a year, and then your browser starts complaining. You go to the person that is in charge of the security, and ask "Yo!! I get this warning here! What's happening?" And the answer could very well be "Oh boy!! I forgot to renew the certificate! It's ok, just accept it from now, until I fix that." (true story, although there were swearwords in the answer I received :-D )
Recent versions of httplib2 default to their own certificate store.
# Default CA certificates file bundled with httplib2.
CA_CERTS = os.path.join(
    os.path.dirname(os.path.abspath(__file__)), "cacerts.txt")
If you're using Ubuntu/Debian, you can explicitly pass the path to the system certificate file, like:
httplib2.HTTPSConnectionWithTimeout(HOST, ca_certs="/etc/ssl/certs/ca-certificates.crt")
Maybe this could be the case:
I got the same problem, and while debugging the Google library I found out that the reason was that I was using an older version of httplib2 (0.9.2). When I updated to the most recent one (0.14.0), it worked.
If you have already installed the most recent version, make sure that some other library is not installing an older version of httplib2 among its dependencies.
When you see this error with a self-signed certificate, as often happens behind a corporate proxy, you can point httplib2 to your custom certificate bundle using an environment variable, for example when you don't want to (or can't) modify the code to pass the ca_certs parameter.
You can also do this when you don't want to modify the system certificate store to append your CA cert.
export HTTPLIB2_CA_CERTS="\path\to\your\CA_certs_bundle"
