How can I retrieve this file online with Python? SSL error

So I have been trying to download the dataset from this page with a Python program.
The methods I have tried are requests and urllib.request.
I used a page as a reference to solve the SSL error, but it didn't work.
My code here:
import pandas as pd
import requests
import shutil

# 2017 School Quality Report
FileLink = 'https://data.cityofnewyork.us/api/views/cxrnzyvb/files/35e2893e-75ed-4449-8e7e-d6360a3386a1?download=true&filename=2017_School_Quality_Report_DD.xlsx'

requests.packages.urllib3.disable_warnings()
response = requests.get(FileLink, verify='gd_bundle-g2-g1.crt', auth=('user', 'pass'), stream=True)
response.raw.decode_content = True
with open("2017_School_Quality_Report_DD.xlsx", 'wb') as f:
    shutil.copyfileobj(response.raw, f)

# import urllib.request
# urllib.request.urlretrieve(FileLink, '2017_School_Quality_Report_DD.xlsx')

data = pd.ExcelFile('2017_School_Quality_Report_DD.xlsx')  # ExcelFile (not read_excel) exposes sheet_names
print(data.sheet_names)
This is the error message, which I don't know how to solve:
SSLError: HTTPSConnectionPool(host='data.cityofnewyork.us', port=443): Max retries exceeded with url: /api/views/cxrn-zyvb/files/35e2893e-75ed-4449-8e7e-d6360a3386a1?download=true&filename=2017_School_Quality_Report_DD.xlsx (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))
Please let me know how I can solve the error, or show me how you would do this task. I am fairly new to Python. Thank you.
NOTE: I found a solution on this page, which worked for me.
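For reference, a commonly suggested fix for this kind of CERTIFICATE_VERIFY_FAILED error is to verify against a complete, up-to-date CA bundle (for example the one shipped with the certifi package) instead of a single intermediate certificate file. The sketch below only illustrates that idea, assuming certifi is installed; it is not necessarily the solution from the page linked above.
import shutil

import certifi
import requests

FileLink = 'https://data.cityofnewyork.us/api/views/cxrnzyvb/files/35e2893e-75ed-4449-8e7e-d6360a3386a1?download=true&filename=2017_School_Quality_Report_DD.xlsx'

# Verify the server certificate against the Mozilla CA bundle shipped with
# certifi rather than against a single intermediate certificate file.
response = requests.get(FileLink, stream=True, verify=certifi.where())
response.raise_for_status()
response.raw.decode_content = True

with open('2017_School_Quality_Report_DD.xlsx', 'wb') as f:
    shutil.copyfileobj(response.raw, f)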

Related

Python Geopy with Google Maps API SSL Error

I found the following link:
https://towardsdatascience.com/pythons-geocoding-convert-a-list-of-addresses-into-a-map-f522ef513fd6
It shows a quick walk-through on how to use the Google Maps API to get latitude/longitude. However, when I use the provided code I get an SSL error. I have a working API key, since the URL produced by the second code block below works when I open it.
Code:
from geopy.geocoders import GoogleV3
AUTH_KEY = "HIDDEN"
geolocator = GoogleV3(api_key=AUTH_KEY)
print(geolocator.geocode("1 Apple Park Way, Cupertino, CA").point) #Apple
Error:
HTTPSConnectionPool(host='maps.googleapis.com', port=443): Max retries exceeded with url: /maps/api/geocode/json?address=1+Apple+Park+Way%2C+Cupertino%2C+CA&key=HIDDEN (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)')))
I also tried using the following:
Code:
import json
import urllib.parse

import requests

# https://developers.google.com/maps/documentation/geocoding/intro
base_url = "https://maps.googleapis.com/maps/api/geocode/json?"
AUTH_KEY = "HIDDEN"

# set up your search parameters - address and API key
parameters = {"address": "1 Apple Park Way, Cupertino, CA",
              "key": AUTH_KEY}

# urllib.parse.urlencode turns the parameters into a URL query string
print(f"{base_url}{urllib.parse.urlencode(parameters)}")
r = requests.get(f"{base_url}{urllib.parse.urlencode(parameters)}")
I get the exact same error. Oddly, though, the URL produced by print(f"{base_url}{urllib.parse.urlencode(parameters)}") works fine when I click on it.
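A "self signed certificate in certificate chain" error usually means something on the local machine or network (a corporate proxy, VPN client, or antivirus) is intercepting TLS, rather than a problem on Google's side. Assuming that is the case, one approach is to export that interceptor's root certificate and tell both requests and geopy to trust it. The sketch below is only an illustration; corporate_ca.pem is a hypothetical path.
import ssl

import requests
from geopy.geocoders import GoogleV3

AUTH_KEY = "HIDDEN"
CA_FILE = "corporate_ca.pem"  # hypothetical: root certificate exported from the intercepting proxy

# Plain requests call: verify against the exported root certificate.
r = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"address": "1 Apple Park Way, Cupertino, CA", "key": AUTH_KEY},
    verify=CA_FILE,
)

# geopy: recent versions accept an ssl_context that trusts the same certificate.
ctx = ssl.create_default_context(cafile=CA_FILE)
geolocator = GoogleV3(api_key=AUTH_KEY, ssl_context=ctx)
print(geolocator.geocode("1 Apple Park Way, Cupertino, CA").point)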

SSL and NewConnectionError

I want to crawl a given list from the Alexa Top 1 Million, to check which websites still offer access via http:// and do not redirect to https://.
If the webpage does not redirect to an https:// domain, it should be written into a CSV file.
The problem occurs when I add a larger batch of URLs. Then I get two errors:
ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1056)
or
requests.exceptions.ConnectionError: HTTPConnectionPool(host='17ok.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed')
I have tried the approaches mentioned in the following threads and documentation:
https://2.python-requests.org//en/latest/user/advanced/#ssl-cert-verification
Edit: the sample URL https://requestb.in actually raises a 404 error now; it probably does not exist any more (?)
Python Requests throwing SSLError
Python Requests: NewConnectionError
requests.exceptions.SSLError: HTTPSConnectionPool: (Caused by SSLError(SSLError(336445449, '[SSL] PEM lib (_ssl.c:3816)')))
and some other suggested solutions.
Setting verify=False helps when using it for a few URLs, but not with a list of more than 10 URLs; then the program breaks. I tried my program on a Win10 machine as well as on Ubuntu 16.04.
As expected, it's the same issue. I also tried the option of using Sessions and installed the certificate library as suggested.
If I only call three pages like 'http://www.example.com', 'https://www.github.com' and 'http://www.python.org', it's not a big deal and the suggested solutions work. The headache starts when using a larger batch of URLs from the Alexa list.
Here is my code, which works when using it for only 3-4 URLs:
import requests
from requests.utils import urlparse

urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/']

with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
    for url in urls:
        r = requests.get(url, stream=True, verify=False)
        parsed_url = urlparse(r.url)
        print("URL: ", url)
        print("Redirected to: ", r.url)
        print("Status Code: ", r.status_code)
        print("Scheme: ", parsed_url.scheme)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')
I expect to crawl a list of at least 100 URLs. The code should write URLs which are accessible via http:// and do not redirect to https:// into a CSV file or a complementary database, and ignore all https:// URLs.
Because it works for a few URLs, I would expect it to be stable for a larger scan as well.
But the two errors above arise and break the program. Is it worth trying a workaround using pytest? Any other suggestions? Thanks in advance.
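One way to keep such a scan alive when individual hosts fail the TLS handshake or cannot be resolved is to wrap each request in try/except and skip to the next URL instead of letting the exception end the program. A minimal sketch under that assumption (the file path, timeout value and shortened URL list are placeholders):
import ssl

import requests
from requests.utils import urlparse

urls = ['http://www.example.com', 'http://17ok.com', 'http://1688.com']  # placeholder sample

with open('http.csv', 'w') as f:
    for url in urls:
        try:
            r = requests.get(url, stream=True, timeout=10)
        except (requests.exceptions.RequestException, ssl.SSLError) as exc:
            # Covers SSLError, ConnectionError (e.g. getaddrinfo failures) and timeouts.
            print("Skipping", url, "->", type(exc).__name__, exc)
            continue
        parsed_url = urlparse(r.url)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')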
EDIT:
This is a list which will raise the errors. Just for clarification, this list comes from a study based on the Alexa Top 1 Million.
urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/',
        'http://51sole.com',
        'http://58.com',
        'http://9gag.com',
        'http://abs-cbn.com',
        'http://academia.edu',
        'http://accuweather.com',
        'http://addroplet.com',
        'http://addthis.com',
        'http://adf.ly',
        'http://adhoc2.net',
        'http://adobe.com',
        'http://1688.com',
        'http://17ok.com',
        'http://17track.net',
        'http://1and1.com',
        'http://1tv.ru',
        'http://2ch.net',
        'http://360.cn',
        'http://39.net',
        'http://4chan.org',
        'http://4pda.ru']
I double-checked; the last time, the errors started with the URL 17ok.com. But I have also tried different lists of URLs. Thanks for your support.

Catching SSLError due to insecure URL with requests in Python?

I have a list of a few thousand URLs and noticed that one of them throws an SSLError when passed into requests.get(). Below is my attempt to work around this, using both a solution suggested in this similar question and a failed attempt to catch the error with a try/except block using ssl.SSLError:
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = 'https://archyworldys.com/lidl-recalls-puff-pastry/'

session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

try:
    response = session.get(url, allow_redirects=False, verify=True)
except ssl.SSLError:
    pass
The error returned at the very end is:
SSLError: HTTPSConnectionPool(host='archyworldys.com', port=443): Max retries exceeded with url: /lidl-recalls-puff-pastry/ (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
When I open the URL in Chrome, I get a "Not Secure" / "Privacy Error" page that blocks the site. However, if I try the URL with HTTP instead of HTTPS (e.g. 'http://archyworldys.com/lidl-recalls-puff-pastry/'), it works just fine in my browser. Per this question, setting verify to False solves the problem, but I would prefer a more secure workaround.
While I understand a simple solution would be to remove the URL from my data, I'm trying to find a solution that lets me proceed (e.g. in a for loop) by simply skipping the bad URL and moving on to the next one.
The error I get when running your code is:
requests.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)
Based on this, one needs to catch requests.exceptions.SSLError and not ssl.SSLError, i.e.:
try:
    response = session.get(url, allow_redirects=False, verify=True)
except requests.exceptions.SSLError:
    pass
While it looks like the error you get is different, this is probably because the code you show is not exactly the code you are running. In any case, look at the exact error message you get and work out from it which exception to catch. You might also catch a more general exception like the one below and, by doing so, find the exact exception class you need to catch:
try:
    response = session.get(url, allow_redirects=False, verify=True)
except Exception as x:
    print(type(x), x)
    pass
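Applied to the original goal of skipping bad URLs in a loop over a few thousand entries, the same except clause can simply continue. A minimal sketch, assuming the session setup from the question and a placeholder list_of_urls:
import requests

list_of_urls = ['https://archyworldys.com/lidl-recalls-puff-pastry/',
                'https://www.example.com/']  # stand-ins for the full list

session = requests.Session()
results = {}
for url in list_of_urls:
    try:
        results[url] = session.get(url, allow_redirects=False, verify=True)
    except requests.exceptions.SSLError:
        # Skip URLs whose certificate verification fails and move on.
        continue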

Requests SSLError: HTTPSConnectionPool(host='www.recruit.com.hk', port=443): Max retries exceeded with url

I'm getting really confused over this.
Here's what I'm using.
requests 2.18.4
python 2.7.14
I'm building a scraper and trying to use requests.get() to connect to a URL.
This is a link from Indeed that redirects to another link.
Here is the code:
import requests as rqs
r = rqs.get('https://www.indeed.hk/rc/clk?jk=ab794b2879313f04&fccid=a659206a7e1afa15')
Here's the error raised:
File "/Users/cecilialee/anaconda/envs/py2/lib/python2.7/site-packages/requests/adapters.py", line 506, in send
raise SSLError(e, request=request)
SSLError: HTTPSConnectionPool(host='www.recruit.com.hk', port=443): Max retries exceeded with url: /jobseeker/JobDetail.aspx?jobOrder=L04146652 (Caused by SSLError(SSLEOFError(8, u'EOF occurred in violation of protocol (_ssl.c:661)'),))
Setting verify=False does not solve this error.
I've searched online but couldn't find a solution that fixes my issue. Can anyone help?
You can use HTTP (but not HTTPS) to get info from the site.
>>> response = requests.get('http://www.recruit.com.hk')
>>> response.status_code
200
>>> len(response.text)
I tried your code; it's OK:
>>> r = requests.get('https://www.indeed.hk/rc/clk?jk=ab794b2879313f04&fccid=a659206a7e1afa15')
>>> r.status_code
200
>>> len(r.text)
34272
My environment:
python 2.7.10
requests==2.5.0
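"EOF occurred in violation of protocol" on an older Python/OpenSSL build is often a TLS version or SNI mismatch with the server, which would explain why the same code works in a different environment. If that is the cause, one option described in the requests advanced documentation is to mount a transport adapter that forces a specific TLS version. The sketch below pins TLS 1.2 and is only an illustration; whether it helps depends on the local OpenSSL build.
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager

class Tls12Adapter(HTTPAdapter):
    """Transport adapter that forces TLS 1.2 for the handshake."""
    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        self.poolmanager = PoolManager(num_pools=connections, maxsize=maxsize,
                                       block=block, ssl_version=ssl.PROTOCOL_TLSv1_2,
                                       **pool_kwargs)

session = requests.Session()
session.mount('https://', Tls12Adapter())
r = session.get('https://www.indeed.hk/rc/clk?jk=ab794b2879313f04&fccid=a659206a7e1afa15')
print(r.status_code)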

Python SSL error when using requests

I need to write a simple test script for a REST GET request using Python. What I have is:
import requests

url = 'http://myurl......net'
headers = {'content-type': 'application/xml'}
r = requests.get(url, headers=headers)
This gives me the following SSL error:
[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)
So, I did some research and added verify=False to the end of my last line of code, but now I am stuck with: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
What should I do to get rid of this message?
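Two common ways to handle this, sketched below: the preferred one is to keep verification enabled and point requests at an up-to-date CA bundle such as the one shipped with certifi; if you consciously accept unverified HTTPS instead, you can silence the warning explicitly. The URL below is a hypothetical stand-in for the one in the question.
import certifi
import requests
import urllib3

url = 'https://myurl.example.net'  # hypothetical stand-in for the real endpoint
headers = {'content-type': 'application/xml'}

# Preferred: keep certificate verification, using certifi's CA bundle.
r = requests.get(url, headers=headers, verify=certifi.where())

# Last resort: skip verification and explicitly silence the warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
r = requests.get(url, headers=headers, verify=False)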
