I am trying to use urllib for Python to open a URL but I get an error with one specific URL which looks normal compared to the others and also works well in a browser.
The code that generates the error is:
import cStringIO
import imghdr
import urllib2
response = urllib2.urlopen('https://news.artnet.com/wp-content/news-upload/2015/08/Brad_Pitt_Fury_2014-e1440597554269.jpg')
However, if I do the exact same thing with a similar URL I don't get the error:
import cStringIO
import imghdr
import urllib2
response = urllib2.urlopen('https://upload.wikimedia.org/wikipedia/commons/d/d4/Brad_Pitt_June_2014_(cropped).jpg')
The error I get in the first example is:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
response = self._open(req, data)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 449, in _open
'_open', req)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1240, in https_open
context=self._context)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1197, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:590)>
This is a site using Cloudflare SSL and it needs Server Name Indication (SNI). Without SNI access to this site will show the behavior you can see here, i.e. trigger a tlsv1 alert. SNI was added to Python 2.7 only with 2.7.9 and you are probably using an older version of Python.
I found a very good explanation in the urllib3 documentation.
The following code solved the issue with Python 2.7.10:
import urllib3
import ssl
import certifi
urllib3.contrib.pyopenssl.inject_into_urllib3()
# Open connection
http = urllib3.PoolManager(
cert_reqs='CERT_REQUIRED', # Force certificate check.
ca_certs=certifi.where(), # Path to the Certifi bundle.
)
# Make verified HTTPS requests.
try:
resp = http.request('GET', url_photo)
except urllib3.exceptions.SSLError as e:
# Handle incorrect certificate error.
print "error with https certificates"
Related
All I want to do is scrape some data about earthquakes from a website. In fact, I just want Python to be able to extract data from URL's. For some reason, even the simplest code which only opens a url and uses '.readlines()' is met with a wall of errors. It doesn't seem to understand the 'openurl' command, nor most anything else.
I don't know what to even try, because I can't parse the errors that it's giving me. I was hoping, before I had to do something drastic like re-download python or something, that someone would have an answer for me.
import urllib.request
def urltest():
url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
f = urllib.request.urlopen(url)
allLines = f.readlines()
f.close()
line = allLines[0].decode()
print(line)
This is the code I've used to simply test it. The URL goes to a website which holds a .csv file, which python should easily acquire and read through.
If anyone wants, I can actually post the entire wall of errors that this code returns. There looks to be at least 6 different ones, but this is the final line that it spits back:
urllib.error.URLError: <urlopen error unknown url type: https>
Looking through the urllib.requests module it loads a collection of handlers. we can see this code snippet in urllib.request.py
if hasattr(http.client, "HTTPSConnection"):
default_classes.append(HTTPSHandler)
skip = set()
for klass in default_classes:
for check in handlers:
if isinstance(check, type):
if issubclass(check, klass):
skip.add(klass)
elif isinstance(check, klass):
skip.add(klass)
for klass in skip:
default_classes.remove(klass)
for klass in default_classes:
opener.add_handler(klass())
So the https handler class is only loaded if the http.client.py has the attribute HTTPSConnection. If we look in the http.client.py we can see the following code for setting this attribute.
try:
import ssl
except ImportError:
pass
else:
class HTTPSConnection(HTTPConnection):
"This class allows communication via SSL."
default_port = HTTPS_PORT
So the HTTPSConnection class is only created if the ssl module can successfully be imported. If you system doesnt have the ssl module then http.client wont load the HTTPSConnection class which in turn will not add the attribute and as such urllib wont load a handler for https.
While the code you provided worked on my system. I added the following code before it to cause my system to not be able to locate the ssl module.
#load then remove the ssl module from the system
import sys
import ssl
del ssl
sys.modules['ssl']=None
import urllib.request
def urltest():
url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
f = urllib.request.urlopen(url)
allLines = f.readlines()
f.close()
line = allLines[0].decode()
print(line)
urltest()
Doing this i get the same error you were getting
C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\python.exe C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py
Traceback (most recent call last):
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 19, in <module>
urltest()
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 13, in urltest
f = urllib.request.urlopen(url)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 548, in _open
'unknown_open', req)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: https>
So i suspect you have installed python without ssl configured. You should be able to verify this easly by just trying to import ssl from the python command line import ssl if you get an error like
>>> import ssl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ssl'
Then that will be the cause of your issues. You would have to either reinstall python with ssl configured or somehow build the ssl module from source
It looks like the problem is a network(dns/proxy/firewall) issue.
https://github.com/pbugnion/gmaps/issues/245
You can use Pandas:
import pandas as pd
data = pd.read_csv('http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv')
print (data)
On Windows Vista SP2 + Python 2.7.10 I can connect to https://www.python.org, but not to https://codereview.appspot.com
The script:
HOST1 = 'https://www.python.org'
HOST2 = 'https://codereview.appspot.com'
import urllib2
print HOST1
urllib2.urlopen(HOST1)
print HOST2
urllib2.urlopen(HOST2)
And the output:
E:\>py test.py
https://www.python.org
https://codereview.appspot.com
Traceback (most recent call last):
File "test.py", line 9, in <module>
urllib2.urlopen(HOST2)
File "C:\Python27\lib\urllib2.py", line 158, in urlopen
return opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 435, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 453, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 413, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1244, in https_open
context=self._context)
File "C:\Python27\lib\urllib2.py", line 1201, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>
How can I troubleshoot, what exactly is wrong with https://codereview.appspot.com/ ?
My guess is that it is related to the alternative chain handling in OpenSSL, as described in detail in Python Urllib2 SSL error. Although Python uses the windows CA store to get the trusted root certificates the validation of the trust chain itself is done within OpenSSL.
According to "Python 2.7.10 Released" Python 2.7.10 on Windows includes OpenSSL 1.0.2a but the fixes regarding alternative chains were done in 1.0.2b only (and had to be fixed fast afterwards because they contained a serious security bug).
If you look at the SSLLabs report for codereview.appspot.com you can see that there are multiple trust chains which probably causes the problem. Contrary to that python.org only has a single trust chain.
To work around the problem it might be necessary to use your own root CA store which must contain the certificate for "/C=US/O=Equifax/OU=Equifax Secure Certificate Authority" to verify codereview.appspot.com correctly. The certificate can be found here and you can give it with the cafile parameter to urllib2.urlopen.
I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with python.
At the moment I'm using following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for timeout error:
Traceback (most recent call last):
File "user/src/finnkinoParser.py", line 26, in <module>
main()
File "user/src/finnkinoParser.py", line 13, in main
getNowPlayingMovies()
File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
req = urllib2.urlopen(baseURL,None,2.5)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the url with my browser it works fine. So could someone tell me what makes that site that much different so the urllib2 is unable to load the page. I suppose it has something to do with the site being aimed to mobile users. With "regular" sites urllib2 works fine. Is there any other kind of sites to which the basic urlopen(URL) doesn't work?
Thanks for help
Following snippet works fine.
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server has verified several request vars. After tested some times, here is conclusion:
http protocol must be HTTP/1.1.
if request headers have Connection prop, its value should be keep-alive.
request headers must have User-Agent prop, whatever its value.
While in urllib2, Connection prop in HTTPHandler has been set to Close by default (L1127 in urllib2.py). you can use urlgrabber or other HTTP handler which supports HTTP/1.1 and keep-alive.
This work fine:
import urllib2
opener = urllib2.build_opener(
urllib2.HTTPHandler(),
urllib2.HTTPSHandler(),
urllib2.ProxyHandler({'http': 'http://user:pass#proxy:3128'}))
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.google.com').read()
But, if http change to https:
...
print urllib2.urlopen('https://www.google.com').read()
There are errors:
Traceback (most recent call last):
File "D:\Temp\6\tmp.py", line 13, in <module>
print urllib2.urlopen('https://www.google.com').read()
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 389, in open
response = self._open(req, data)
File "C:\Python26\lib\urllib2.py", line 407, in _open
'_open', req)
File "C:\Python26\lib\urllib2.py", line 367, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 1154, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "C:\Python26\lib\urllib2.py", line 1121, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 10060]
Why and how solve this problem?
Change this line:
urllib2.ProxyHandler({'http': 'http://user:pass#proxy:3128'}))
to this:
urllib2.ProxyHandler({'https': 'http://user:pass#proxy:3128'}))
It works fine for me.
On Windows, errno 10060 is a winsock error meaning the connection timed out. Are you able to reach https://www.google.com from the same machine using a web browser with a proxy set to http://user:pass#proxy:3128 ? Are you sure your proxy server can handle both https and http on the same port?
The documentation for urllib2 says the following:
Note: Currently urllib2 does not support fetching of https locations
through a proxy. However, this can be enabled by extending urllib2 as
shown in this recipe.
I must admit above recipe didn't work right away for Jython 2.5.3, but I'm still trying.
UPDATE: I applied this patch to Jython 2.5.3, and it worked for me. I can fetch HTTPS resources over a proxy server now.
UPDATE2: Here is the code to query HTTPS resources with Basic authentication over HTTP Proxy (DON'T FORGET TO INSTALL PATCH FIRST (see previous update)):
from suds.client import Client
from suds.transport.https import HttpAuthenticated
credentials = dict(username='...', password='...', proxy={'https': 'host:port', 'http': 'host:port'})
t = HttpAuthenticated(**credentials)
url = 'https://example.com/service?wsdl'
client = Client(url, transport=t)
print client.service.getFoo()
I installed Python 2.6.2 earlier on a Windows XP machine and run the following code:
import urllib2
import urllib
page = urllib2.Request('http://www.python.org/fish.html')
urllib2.urlopen( page )
I get the following error.
Traceback (most recent call last):<br>
File "C:\Python26\test3.py", line 6, in <module><br>
urllib2.urlopen( page )<br>
File "C:\Python26\lib\urllib2.py", line 124, in urlopen<br>
return _opener.open(url, data, timeout)<br>
File "C:\Python26\lib\urllib2.py", line 383, in open<br>
response = self._open(req, data)<br>
File "C:\Python26\lib\urllib2.py", line 401, in _open<br>
'_open', req)<br>
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain<br>
result = func(*args)<br>
File "C:\Python26\lib\urllib2.py", line 1130, in http_open<br>
return self.do_open(httplib.HTTPConnection, req)<br>
File "C:\Python26\lib\urllib2.py", line 1105, in do_open<br>
raise URLError(err)<br>
URLError: <urlopen error [Errno 11001] getaddrinfo failed><br><br><br>
import urllib2
response = urllib2.urlopen('http://www.python.org/fish.html')
html = response.read()
You're doing it wrong.
Have a look in the urllib2 source, at the line specified by the traceback:
File "C:\Python26\lib\urllib2.py", line 1105, in do_open
raise URLError(err)
There you'll see the following fragment:
try:
h.request(req.get_method(), req.get_selector(), req.data, headers)
r = h.getresponse()
except socket.error, err: # XXX what error?
raise URLError(err)
So, it looks like the source is a socket error, not an HTTP protocol related error. Possible reasons: you are not on line, you are behind a restrictive firewall, your DNS is down,...
All this aside from the fact, as mcandre pointed out, that your code is wrong.
Name resolution error.
getaddrinfo is used to resolve the hostname (python.org)in your request. If it fails, it means that the name could not be resolved because:
It does not exist, or the records are outdated (unlikely; python.org is a well-established domain name)
Your DNS server is down (unlikely; if you can browse other sites, you should be able to fetch that page through Python)
A firewall is blocking Python or your script from accessing the Internet (most likely; Windows Firewall sometimes does not ask you if you want to allow an application)
You live on an ancient voodoo cemetery. (unlikely; if that is the case, you should move out)
Windows Vista, python 2.6.2
It's a 404 page, right?
>>> import urllib2
>>> import urllib
>>>
>>> page = urllib2.Request('http://www.python.org/fish.html')
>>> urllib2.urlopen( page )
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python26\lib\urllib2.py", line 389, in open
response = meth(req, response)
File "C:\Python26\lib\urllib2.py", line 502, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python26\lib\urllib2.py", line 427, in error
return self._call_chain(*args)
File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
result = func(*args)
File "C:\Python26\lib\urllib2.py", line 510, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>>
DJ
First, I see no reason to import urllib; I've only ever seen urllib2 used to replace urllib entirely and I know of no functionality that's useful from urllib and yet is missing from urllib2.
Next, I notice that http://www.python.org/fish.html gives a 404 error to me. (That doesn't explain the backtrace/exception you're seeing. I get urllib2.HTTPError: HTTP Error 404: Not Found
Normally if you just want to do a default fetch of a web pages (without adding special HTTP headers, doing doing any sort of POST, etc) then the following suffices:
req = urllib2.urlopen('http://www.python.org/')
html = req.read()
# and req.close() if you want to be pedantic