Catching SSLError due to insecure URL with requests in Python?

I have a list of a few thousand URLs and noticed that one of them throws an SSLError when passed to requests.get(). Below is my attempt to work around this, using both a solution suggested in a similar question and a failed attempt to catch the error with a try/except block on ssl.SSLError:
import ssl
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://archyworldys.com/lidl-recalls-puff-pastry/'

session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

try:
    response = session.get(url, allow_redirects=False, verify=True)
except ssl.SSLError:
    pass
The error returned at the very end is:
SSLError: HTTPSConnectionPool(host='archyworldys.com', port=443): Max retries exceeded with url: /lidl-recalls-puff-pastry/ (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
When I open the URL in Chrome, I get a "Not Secure" / "Privacy Error" page that blocks the site. However, if I try the URL with HTTP instead of HTTPS (e.g. 'http://archyworldys.com/lidl-recalls-puff-pastry/') it works just fine in my browser. Per this question, setting verify to False solves the problem, but I'd prefer a more secure workaround.
While I understand a simple solution would be to remove the URL from my data, I'm trying to find a solution that lets me proceed (e.g. inside a for loop) by simply skipping the bad URL and moving on to the next one.

The error I get when running your code is:
requests.exceptions.SSLError:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed
(_ssl.c:645)
Based on this, one needs to catch requests.exceptions.SSLError rather than ssl.SSLError, i.e.:
try:
    response = session.get(url, allow_redirects=False, verify=True)
except requests.exceptions.SSLError:
    pass
While it looks like the error you get is different, that is probably because the code you show is not exactly the code you are running. In any case, look at the exact error message you get and work out from it which exception to catch. You can also catch a more general exception first and use it to find the exact exception class you need to catch:
try:
    response = session.get(url, allow_redirects=False, verify=True)
except Exception as x:
    print(type(x), x)
    pass
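If the goal is to skip a bad URL and carry on with the rest of the list, a minimal sketch along those lines (assuming the same session as above and a list named urls) could be:
failed_urls = []
for url in urls:
    try:
        # timeout added so one slow host does not stall the whole loop
        response = session.get(url, allow_redirects=False, verify=True, timeout=10)
    except (requests.exceptions.SSLError, requests.exceptions.ConnectionError) as exc:
        # record the failure and move on to the next URL
        failed_urls.append((url, str(exc)))
        continue
    # ... process the successful response here ...
Note that verify also accepts a path to a CA bundle (e.g. verify='/path/to/ca-bundle.pem'), which is a more secure alternative to verify=False when a site's certificate chain simply is not in the default bundle.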

Related

How to resolve this? HTTPConnectionPool Max retries exceeded (NewConnectionError Failed to establish new connection: [Errno 11002] getaddrinfo failed)

I'm trying to create a function that goes through all the links it finds on a page, tests each one to see if it returns a 404 status code, and then returns a list of those that do. However, as the title of this question suggests, I'm coming across an error which I am stuck on resolving. Here is the relevant part of the function in question:
def find_404(urlSoup):
    s = requests.Session()
    s.headers['User-Agent'] = 'My program'
    retry_strategy = Retry(connect=3, backoff_factor=1)
    adapter = HTTPAdapter(max_retries=retry_strategy)
    s.mount('http://', adapter)
    s.mount('https://', adapter)
    for link in urlSoup.find_all('a', href=True):
        search_links.append(link.get('href'))
    print(search_links)
    for search_link in search_links:
        try:
            if not search_link.startswith('#') and not search_link.startswith('/'):
                broken_query = s.get(search_link, allow_redirects=3)
                if broken_query.status_code == 404:
                    broken_links.append(search_link)
        except requests.exceptions.ConnectionError as exc:
            print('Error: {0}'.format(exc))
The whole requests.Session() bit preceding the first for loop was added in an effort to resolve the error as some prior Stack Overflow posts I researched recommended that solution, but it hasn't fixed the error.
When I run the program with the above code (and a bit more which I left out of the above) I get the error mentioned in the title. I also don't think the function returns all the 404 pages it should be. I am using this page for testing because it links to various 404 pages:
https://optinmonster.com/best-404-page-examples/.
The above function only returns five 404 links, although I can tell from the page and the search_links list that there are other 404 pages it should be recognising.
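One way to keep the loop going past hosts that fail DNS resolution (the getaddrinfo error) is to catch the broader requests.exceptions.RequestException per link, and to resolve relative hrefs instead of skipping them. A rough sketch of that idea, assuming Python 3 and a base_url for the page being scanned:
from urllib.parse import urljoin

import requests

def check_links(search_links, base_url):
    broken_links = []
    s = requests.Session()
    s.headers['User-Agent'] = 'My program'
    for search_link in search_links:
        # resolve relative and fragment-only links against the page URL
        full_link = urljoin(base_url, search_link)
        try:
            broken_query = s.get(full_link, allow_redirects=True, timeout=10)
        except requests.exceptions.RequestException as exc:
            print('Skipping {0}: {1}'.format(full_link, exc))
            continue
        if broken_query.status_code == 404:
            broken_links.append(full_link)
    return broken_links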

SSL and NewConnectionError

I want to crawl a given list from the Alexa Top 1 Million to check which websites still offer access via http:// and do not redirect to https://.
If the webpage does not redirect to an https:// domain, it should be written into a CSV file.
The problem occurs when I add a larger batch of URLs. Then I get two errors:
ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1056
or
requests.exceptions.ConnectionError: HTTPConnectionPool(host='17ok.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed')
I have tried the approaches mentioned in the following threads and documentation:
https://2.python-requests.org//en/latest/user/advanced/#ssl-cert-verification
Edit: the sample URL https://requestb.in actually raises a 404 error; it probably no longer exists (?)
Python Requests throwing SSLError
Python Requests: NewConnectionError
requests.exceptions.SSLError: HTTPSConnectionPool: (Caused by SSLError(SSLError(336445449, '[SSL] PEM lib (_ssl.c:3816)')))
and some other suggested solutions.
Setting verify=False helps when using it for a few URLs, but with a list of more than about 10 URLs the program breaks. I tried my program on a Win10 machine as well as on Ubuntu 16.04.
As expected, it's the same issue. I also tried the option using Sessions and installed the certificate library as suggested.
If I am just calling three pages like 'http://www.example.com', 'https://www.github.com' and 'http://www.python.org', it's not a big deal and the suggested solutions work. The headache starts when using a bunch of URLs from the Alexa list.
Here is my code, which works when using it for only 3-4 URLs:
import requests
from requests.utils import urlparse

urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/']

with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
    for url in urls:
        r = requests.get(url, stream=True, verify=False)
        parsed_url = urlparse(r.url)
        print("URL: ", url)
        print("Redirected to: ", r.url)
        print("Status Code: ", r.status_code)
        print("Scheme: ", parsed_url.scheme)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')
I expect to crawl a list of at least 100 URLs. The code should write URLs which are accessible by http:// and do not redirect to https:// into a CSV file or a complementary database, and ignore all URLs with https://.
Because it works for a few URLs, I would expect a stable approach for a larger scan.
But two errors arise and break the program. Is it worth trying a workaround using pytest? Any other suggestions? Thanks in advance.
EDIT:
This is a list which will raise the errors. Just for clarification, this list comes from a study based on the Alexa Top 1 Million.
urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/',
        'http://51sole.com',
        'http://58.com',
        'http://9gag.com',
        'http://abs-cbn.com',
        'http://academia.edu',
        'http://accuweather.com',
        'http://addroplet.com',
        'http://addthis.com',
        'http://adf.ly',
        'http://adhoc2.net',
        'http://adobe.com',
        'http://1688.com',
        'http://17ok.com',
        'http://17track.net',
        'http://1and1.com',
        'http://1tv.ru',
        'http://2ch.net',
        'http://360.cn',
        'http://39.net',
        'http://4chan.org',
        'http://4pda.ru']
I double checked; the last time, the errors started with the URL 17ok.com. But I have also tried different lists of URLs. Thanks for your support.
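Since a single host that fails DNS resolution or the TLS handshake currently aborts the entire run, a rough sketch of per-URL error handling (reusing the urls list and CSV path from the code above) would be:
import ssl

import requests
from requests.utils import urlparse

with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
    for url in urls:
        try:
            r = requests.get(url, stream=True, timeout=10)
        except (ssl.SSLError, requests.exceptions.RequestException) as exc:
            # skip hosts with handshake failures or unresolvable names
            print("Skipping {}: {}".format(url, exc))
            continue
        parsed_url = urlparse(r.url)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')
This way each failure is logged but the scan continues over the full list.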

Requests SSLError: HTTPSConnectionPool(host='www.recruit.com.hk', port=443): Max retries exceeded with url

I'm getting really confused over this.
Here's what I'm using.
requests 2.18.4
python 2.7.14
I'm building a scraper and trying to use requests.get() to connect to a URL.
This is a link from Indeed that redirects to another link.
Here is the code:
r = rqs.get('https://www.indeed.hk/rc/clk?jk=ab794b2879313f04&fccid=a659206a7e1afa15')
Here's the error raised:
File "/Users/cecilialee/anaconda/envs/py2/lib/python2.7/site-packages/requests/adapters.py", line 506, in send
raise SSLError(e, request=request)
SSLError: HTTPSConnectionPool(host='www.recruit.com.hk', port=443): Max retries exceeded with url: /jobseeker/JobDetail.aspx?jobOrder=L04146652 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:661)'),))
Setting verify = False does not solve this error.
I've searched online but couldn't find a solution that can help to fix my issue. Can anyone help?
You can use HTTP (but not https) to get info from the site.
>>> response = requests.get('http://www.recruit.com.hk')
>>> response.status_code
200
>>> len(response.text)
I tried your code, it's OK:
>>> r = requests.get('https://www.indeed.hk/rc/clk?jk=ab794b2879313f04&fccid=a659206a7e1afa15')
>>> r.status_code
200
>>> len(r.text)
34272
My environment:
python 2.7.10
requests==2.5.0

python SSL error when using request

I need to write a simple test script for a REST GET using Python. What I have is:
import requests

url = 'http://myurl......net'
headers = {'content-type': 'application/xml'}
r = requests.get(url, headers=headers)
This gives me the following SSL error:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
So I did some research and added verify=False to the end of my last line of code, but now I am stuck with: "InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised."
What can I do to get rid of this message?
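For reference, a minimal sketch of the two usual ways to deal with that warning, either silencing it explicitly if you accept the risk, or keeping verification by pointing verify at a CA bundle such as the one shipped with certifi (the URL below is just a placeholder):
import requests
import urllib3
import certifi

url = 'https://myurl.example.net'  # hypothetical placeholder URL
headers = {'content-type': 'application/xml'}

# Option 1: keep verify=False but silence InsecureRequestWarning (not recommended)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
r = requests.get(url, headers=headers, verify=False)

# Option 2: keep certificate verification, using certifi's CA bundle
r = requests.get(url, headers=headers, verify=certifi.where())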

Max retries exceeded with URL in requests

I'm trying to get the content of App Store > Business:
import requests
from lxml import html

page = requests.get("https://itunes.apple.com/in/genre/ios-business/id6000?mt=8")
tree = html.fromstring(page.text)

flist = []
plist = []

for i in range(0, 100):
    app = tree.xpath("//div[@class='column first']/ul/li/a/@href")
    ap = app[0]
    page1 = requests.get(ap)
When I try the range with (0,2) it works, but when I put the range in 100s it shows this error:
Traceback (most recent call last):
File "/home/preetham/Desktop/eg.py", line 17, in <module>
page1 = requests.get(ap)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 383, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='itunes.apple.com', port=443): Max retries exceeded with url: /in/app/adobe-reader/id469337564?mt=8 (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)
Just use requests features:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
session.get(url)
This will GET the URL and retry 3 times in case of requests.exceptions.ConnectionError. backoff_factor helps apply delays between attempts, to avoid failing again in case of a periodic request quota.
Take a look at urllib3.util.retry.Retry; it has many options to simplify retries.
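For illustration, a slightly fuller Retry configuration that also retries on common transient HTTP status codes could look like the following sketch (the exact counts and status codes here are just examples):
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=5,                                     # overall cap on retries
    connect=3,                                   # retries for connection errors
    read=2,                                      # retries for read errors
    backoff_factor=0.5,                          # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these responses
)
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('https://example.com/', timeout=10)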
What happened here is that the iTunes server refuses your connection (you're sending too many requests from the same IP address in a short period of time):
Max retries exceeded with url: /in/app/adobe-reader/id469337564?mt=8
The error trace is misleading; it should be something like "No connection could be made because the target machine actively refused it".
There is an issue about the python-requests lib on GitHub; check it out here.
To overcome this issue (not so much an issue as a misleading debug trace) you should catch connection-related exceptions like so:
try:
    page1 = requests.get(ap)
except requests.exceptions.ConnectionError:
    r.status_code = "Connection refused"
Another way to overcome this problem is to leave a big enough time gap between requests to the server. This can be achieved with the sleep(timeinsec) function in Python (don't forget to import sleep):
from time import sleep
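For example, a minimal sketch of that idea, reusing the app list of links from the question, would pause between requests like this:
for ap in app:
    page1 = requests.get(ap)
    sleep(1)  # wait 1 second before the next request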
All in all, requests is an awesome Python lib; hope that solves your problem.
Just do this: paste the following code in place of page = requests.get(url):
import time

page = ''
while page == '':
    try:
        page = requests.get(url)
        break
    except:
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        time.sleep(5)
        print("Was a nice sleep, now let me continue...")
        continue
You're welcome :)
I got a similar problem, but the following code worked for me:
url = <some REST url>
page = requests.get(url, verify=False)
verify=False disables SSL certificate verification. Try and except can be added as usual.
pip install pyopenssl seemed to solve it for me.
https://github.com/requests/requests/issues/4246
Specifying the proxy in a corporate environment solved it for me.
page = requests.get("http://www.google.com:80", proxies={"http": "http://111.233.225.166:1234"})
The full error is:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
It is always good to implement exception handling. It not only helps to avoid an unexpected exit of the script, but can also help to log errors and info notifications. When using Python requests I prefer to catch exceptions like this:
try:
    res = requests.get(adress, timeout=30)
except requests.ConnectionError as e:
    print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
    print(str(e))
    renewIPadress()
    continue
except requests.Timeout as e:
    print("OOPS!! Timeout Error")
    print(str(e))
    renewIPadress()
    continue
except requests.RequestException as e:
    print("OOPS!! General Error")
    print(str(e))
    renewIPadress()
    continue
except KeyboardInterrupt:
    print("Someone closed the program")
Here renewIPadress() is a user-defined function which can change the IP address if it gets blocked. You can go without this function.
Adding my own experience for those who are experiencing this in the future. My specific error was
Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'
It turns out that this was actually because I had reached the maximum number of open files on my system. It had nothing to do with failed connections, or even a DNS error as indicated.
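If you suspect the same cause, a quick sketch for inspecting (and, on Unix-like systems, raising) the per-process open-file limit from Python:
import resource  # Unix-only module

# current soft and hard limits on open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open-file limit: soft={}, hard={}".format(soft, hard))

# raise the soft limit up to the hard limit for this process
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
Closing responses or reusing a single Session also helps keep the number of open sockets down.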
When I was writing a Selenium browser test script, I encountered this error when calling driver.quit() before a JS API call was made. Remember that quitting the webdriver is the last thing to do!
I wasn't able to make it work on Windows even after installing pyopenssl and trying various Python versions (while it worked fine on Mac), so I switched to urllib and it works on Python 3.6 (from python.org) and 3.7 (Anaconda):
import urllib
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
contents = html.read()
print(contents)
Just import time and add:
time.sleep(6)
somewhere in the for loop, to avoid sending too many requests to the server in a short time.
The number 6 means 6 seconds.
Keep testing numbers starting from 1 until you reach the minimum number of seconds that avoids the problem.
It could also be a network configuration issue, in which case you need to reconfigure your network settings.
For Ubuntu:
sudo vim /etc/network/interfaces
Add 8.8.8.8 under dns-nameservers and save it.
Reset your network: /etc/init.d/networking restart
Now try again.
Adding my own experience: I got this when I tried to download a file specified in a URL, using:
r = requests.get(download_url)
The error was:
HTTPSConnectionPool(host, port=443): Max retries exceeded with url (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
I corrected it by adding verify=False to the call, as follows:
r = requests.get(download_url + filename, verify=False)
open(filename, 'wb').write(r.content)
Check your network connection. I had this and the VM did not have a proper network connection.
I had the same error when I ran the route in the browser, but it worked fine in Postman. The issue with mine was that there was an extra / after the route, before the query string.
127.0.0.1:5000/api/v1/search/?location=Madina raised the error, and removing the / after search worked for me.
This happens when you send too many requests to the public IP address of https://itunes.apple.com. As you can see, it is caused by something that blocks or does not allow access to the public IP address mapping of https://itunes.apple.com. One better solution is the following Python script, which determines the public IP address of any domain and writes that mapping to the /etc/hosts file.
import re
import socket
import subprocess
from typing import Tuple

ENDPOINT = 'https://anydomainname.example.com/'
ENDPOINT = 'https://itunes.apple.com/'


def get_public_ip() -> Tuple[str, str, str]:
    """
    Command to get public_ip address of host machine and endpoint domain

    Returns
    -------
    my_public_ip : str
        Ip address string of host machine.
    end_point_ip_address : str
        Ip address of endpoint domain host.
    end_point_domain : str
        domain name of endpoint.
    """
    # bash_command = """host myip.opendns.com resolver1.opendns.com | \
    #     grep "myip.opendns.com has" | awk '{print $4}'"""
    # bash_command = """curl ifconfig.co"""
    # bash_command = """curl ifconfig.me"""
    bash_command = """curl icanhazip.com"""
    my_public_ip = subprocess.getoutput(bash_command)
    my_public_ip = re.compile("[0-9.]{4,}").findall(my_public_ip)[0]
    end_point_domain = (
        ENDPOINT.replace("https://", "")
        .replace("http://", "")
        .replace("/", "")
    )
    end_point_ip_address = socket.gethostbyname(end_point_domain)
    return my_public_ip, end_point_ip_address, end_point_domain


def set_etc_host(ip_address: str, domain: str) -> str:
    """
    A function to write mapping of ip_address and domain name in /etc/hosts.

    Ref: https://stackoverflow.com/questions/38302867/how-to-update-etc-hosts-file-in-docker-image-during-docker-build

    Parameters
    ----------
    ip_address : str
        IP address of the domain.
    domain : str
        domain name of endpoint.

    Returns
    -------
    str
        Message to identify success or failure of the operation.
    """
    bash_command = """echo "{} {}" >> /etc/hosts""".format(ip_address, domain)
    output = subprocess.getoutput(bash_command)
    return output


if __name__ == "__main__":
    my_public_ip, end_point_ip_address, end_point_domain = get_public_ip()
    output = set_etc_host(ip_address=end_point_ip_address, domain=end_point_domain)
    print("My public IP address:", my_public_ip)
    print("ENDPOINT public IP address:", end_point_ip_address)
    print("ENDPOINT Domain Name:", end_point_domain)
    print("Command output:", output)
You can call the above script before running your desired function :)
My situation is rather special. I tried the answers above, and none of them worked. I suddenly wondered whether it had something to do with my Internet proxy. You know, I'm in mainland China, and I can't access sites like Google without a proxy. Then I turned off my Internet proxy and the problem was solved.
In my case, I am deploying some Docker containers inside the Python script and then calling one of the deployed services. The error is fixed when I add some delay before calling the service. I think it needs time to get ready to accept connections.
from time import sleep

# deploy containers
# get URL of the container
sleep(5)
response = requests.get(url, verify=False)
print(response.json())
First I ran the run.py file and then I ran the unit_test.py file; it works for me.
Add headers for this request.
headers = {
    'Referer': 'https://itunes.apple.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}

requests.get(ap, headers=headers)
I am coding a test with Gauge and encountered this error as well; it was because I was trying to request an internal URL without activating the VPN.
