This works fine:
import urllib2
opener = urllib2.build_opener(
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
    urllib2.ProxyHandler({'http': 'http://user:pass@proxy:3128'}))
urllib2.install_opener(opener)
print urllib2.urlopen('http://www.google.com').read()
But if I change http to https:
...
print urllib2.urlopen('https://www.google.com').read()
I get the following error:
Traceback (most recent call last):
  File "D:\Temp\6\tmp.py", line 13, in <module>
    print urllib2.urlopen('https://www.google.com').read()
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 389, in open
    response = self._open(req, data)
  File "C:\Python26\lib\urllib2.py", line 407, in _open
    '_open', req)
  File "C:\Python26\lib\urllib2.py", line 367, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 1154, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "C:\Python26\lib\urllib2.py", line 1121, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10060]
Why does this happen, and how can I solve it?
Change this line:
urllib2.ProxyHandler({'http': 'http://user:pass@proxy:3128'}))
to this:
urllib2.ProxyHandler({'https': 'http://user:pass@proxy:3128'}))
It works fine for me.
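The key in the ProxyHandler dictionary names the URL scheme it applies to, so an entry under 'http' alone is never used for https:// URLs. If you need the proxy for both schemes, a minimal sketch (the proxy host and credentials are placeholders):

import urllib2

# Each dict key is the URL scheme that proxy entry applies to, so the
# same proxy can be listed under both 'http' and 'https'.
proxy = urllib2.ProxyHandler({
    'http': 'http://user:pass@proxy:3128',
    'https': 'http://user:pass@proxy:3128',
})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
print urllib2.urlopen('https://www.google.com').read()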
On Windows, errno 10060 is a Winsock error meaning the connection timed out. Are you able to reach https://www.google.com from the same machine using a web browser with the proxy set to http://user:pass@proxy:3128? Are you sure your proxy server can handle both https and http on the same port?
The documentation for urllib2 says the following:
Note: Currently urllib2 does not support fetching of https locations
through a proxy. However, this can be enabled by extending urllib2 as
shown in this recipe.
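For reference, the core idea behind such recipes is to send the HTTP CONNECT method to the proxy yourself and then start TLS over the tunneled socket. A minimal sketch of that idea only (the proxy address and target are placeholders, and real code needs proper reply parsing and error handling):

import socket
import ssl

proxy_host, proxy_port = 'proxy', 3128            # placeholder proxy
target_host, target_port = 'www.google.com', 443  # placeholder target

# Ask the proxy to open a raw TCP tunnel to the target.
s = socket.create_connection((proxy_host, proxy_port))
s.sendall('CONNECT %s:%d HTTP/1.0\r\n\r\n' % (target_host, target_port))
reply = s.recv(4096)
if '200' not in reply.split('\r\n')[0]:
    raise IOError('proxy refused CONNECT: %r' % reply)

# The tunnel is now transparent; run TLS over it as usual.
tls = ssl.wrap_socket(s)
tls.sendall('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % target_host)
print tls.recv(4096)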
I must admit the recipe linked from the docs didn't work right away for Jython 2.5.3, but I'm still trying.
UPDATE: I applied this patch to Jython 2.5.3, and it worked for me. I can fetch HTTPS resources over a proxy server now.
UPDATE 2: Here is the code to query HTTPS resources with Basic authentication over an HTTP proxy (DON'T FORGET TO INSTALL THE PATCH FIRST; see the previous update):
from suds.client import Client
from suds.transport.https import HttpAuthenticated
credentials = dict(username='...', password='...', proxy={'https': 'host:port', 'http': 'host:port'})
t = HttpAuthenticated(**credentials)
url = 'https://example.com/service?wsdl'
client = Client(url, transport=t)
print client.service.getFoo()
I am trying to use urllib for Python to open a URL, but I get an error with one specific URL that looks normal compared to the others and also works fine in a browser.
The code that generates the error is:
import cStringIO
import imghdr
import urllib2
response = urllib2.urlopen('https://news.artnet.com/wp-content/news-upload/2015/08/Brad_Pitt_Fury_2014-e1440597554269.jpg')
However, if I do the exact same thing with a similar URL I don't get the error:
import cStringIO
import imghdr
import urllib2
response = urllib2.urlopen('https://upload.wikimedia.org/wikipedia/commons/d/d4/Brad_Pitt_June_2014_(cropped).jpg')
The error I get in the first example is:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 431, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 449, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1240, in https_open
    context=self._context)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1197, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [SSL: TLSV1_ALERT_INTERNAL_ERROR] tlsv1 alert internal error (_ssl.c:590)>
This site uses Cloudflare SSL and requires Server Name Indication (SNI). Without SNI, access to this site triggers exactly the behavior you see, i.e. a tlsv1 alert. SNI was only added to Python 2.7 with version 2.7.9, and you are probably using an older version.
I found a very good explanation in the urllib3 documentation.
The following code solved the issue with Python 2.7.10:
import urllib3
import urllib3.contrib.pyopenssl  # the contrib module must be imported explicitly
import certifi

# Use pyOpenSSL under the hood; it supports SNI on pre-2.7.9 Pythons.
urllib3.contrib.pyopenssl.inject_into_urllib3()

# Open connection
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',  # Force certificate check.
    ca_certs=certifi.where(),   # Path to the Certifi bundle.
)

# Make verified HTTPS requests.
url_photo = 'https://news.artnet.com/wp-content/news-upload/2015/08/Brad_Pitt_Fury_2014-e1440597554269.jpg'
try:
    resp = http.request('GET', url_photo)
except urllib3.exceptions.SSLError as e:
    # Handle incorrect certificate error.
    print "error with https certificates"
OS: Windows 7; Python 2.7.3 using the Python GUI Shell
I'm trying to read a website through Python, and several authors use the urllib and urllib2 libraries. To store the site in a variable, I've seen a similar approach proposed:
import urllib
import urllib2
g = "http://www.google.com/"
read = urllib2.urlopen(g)
The last line generates an error after 120+ seconds:
Traceback (most recent call last):
  File "<pyshell#27>", line 1, in <module>
    r = urllib2.urlopen(o)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 418, in _open
    '_open', req)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python27\lib\urllib2.py", line 1177, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
I tried bypassing the g variable and calling urlopen("http://www.google.com/") directly, with no success either (it generates the same error after the same length of time).
Error code 10060 means the client cannot connect to the remote peer. It might be a network problem, but it is most often a settings issue, such as a proxy setting.
You could try to connect to the same host with other tools (such as ncat) and/or from another PC within your local network to find out where the problem is occurring.
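You can also test raw reachability from Python itself, bypassing urllib2 entirely; a quick sketch (the host and port are whatever you are trying to reach):

import socket

# A bare TCP connection: if this also hangs with 10060, the problem
# is the network or proxy, not urllib2.
s = socket.create_connection(('www.google.com', 80), timeout=10)
print 'connected to', s.getpeername()
s.close()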
For proxy issues, there is some material here:
Using an HTTP PROXY - Python
Why can't I get Python's urlopen() method to work on Windows?
Hope it helps!
Answer (back to basics!):
Error: 10060
Adding a timeout parameter to the request solved the issue for me.
Example 1
import urllib
import urllib2
g = "http://www.google.com/"
read = urllib2.urlopen(g, timeout=20)
Example 2
A similar error also occurred while I was making a GET request with the requests library. Again, passing a timeout parameter solved the 10060 error:
import requests

response = requests.get(param_url, timeout=20)
This is because of the proxy settings.
I also had the same problem, and because of it I could not use any of the modules that fetch data from the internet.
There are simple steps to follow:
1. Open the Control Panel.
2. Open Internet Options.
3. Under the Connections tab, open LAN settings.
4. Go to the advanced settings and uncheck everything, deleting every proxy entry in there. Or you can just uncheck the proxy server checkbox, which does the same thing.
5. Save all the settings by clicking OK.
You are done. Try to run the program again; it should work now. It worked for me, at least.
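If you would rather not change the system-wide settings, a sketch of a script-local alternative: passing an empty dict to ProxyHandler tells urllib2 to use no proxies at all for that opener.

import urllib2

# An empty mapping disables proxies for this opener only, overriding
# any system or environment proxy settings.
opener = urllib2.build_opener(urllib2.ProxyHandler({}))
print opener.open('http://www.google.com/').read()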
Just change your internet connection; it should work then.
I've been trying to get Tor to work with Python, but I've been hitting a brick wall: I simply can't get any of the examples to work. Here is one from Stack Overflow:
import urllib2
proxy = urllib2.ProxyHandler({'http':'127.0.0.1:8118'})
opener = urllib2.build_opener(proxy)
print opener.open('http://check.torproject.org/').read()
I've installed Tor and it works fine while browsing through Aurora. However, running this Python script I get:
Traceback (most recent call last):
  File "/home/x/Tor.py", line 4, in <module>
    print opener.open('http://check.torproject.org/').read()
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
I've searched the web but have been unable to find people with similar problems. Am I missing something totally obvious?
I've written an article showing how to use Tor with Python (using SOCKS) at http://blog.databigbang.com/distributed-scraping-with-multiple-tor-circuits/
Hope it helps.
I have the same problem but cannot find a solution!
I am running Ubuntu. I can open Tor (latest version) with Vidalia and surf the web correctly, so Vidalia works and is connected.
If I use TorCtl in Python, I get a response from Tor saying it is live and running!
However, if I try to open a page using urllib2 as described by Loko, I get the same answer.
If someone has a good idea, that would be really nice!
Tor acts as a SOCKS5 proxy, so you need to configure your script with that in mind. Google "socks.py".
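A minimal sketch of that approach, assuming the SocksiPy module (socks.py) is importable and Tor is listening on its default SOCKS port 9050 (port 8118 is usually Privoxy, not Tor itself):

import socks
import socket
import urllib2

# Route every new socket through Tor's SOCKS5 proxy.
socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket

print urllib2.urlopen('http://check.torproject.org/').read()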
I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with Python.
At the moment I'm using the following code:
import urllib2

URL = 'http://m.finnkino.fi/events/now_showing'
req = urllib2.urlopen(URL, None, 2.5)  # data=None, timeout=2.5 seconds
page = req.read()
print page
Here is the traceback for the timeout error:
Traceback (most recent call last):
  File "user/src/finnkinoParser.py", line 26, in <module>
    main()
  File "user/src/finnkinoParser.py", line 13, in main
    getNowPlayingMovies()
  File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
    req = urllib2.urlopen(baseURL,None,2.5)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser, it works fine. Could someone tell me what makes that site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users; urllib2 works fine with "regular" sites. Are there any other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine:
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server checks several properties of the request. After some testing, here are the conclusions:
1. The protocol must be HTTP/1.1.
2. If the request headers include a Connection header, its value should be keep-alive.
3. The request headers must include a User-Agent header, whatever its value.
urllib2 won't do here, because its HTTPHandler sets the Connection header to close by default (line 1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
I installed Python 2.6.2 earlier on a Windows XP machine and ran the following code:
import urllib2
import urllib
page = urllib2.Request('http://www.python.org/fish.html')
urllib2.urlopen( page )
I get the following error:
Traceback (most recent call last):
  File "C:\Python26\test3.py", line 6, in <module>
    urllib2.urlopen( page )
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 383, in open
    response = self._open(req, data)
  File "C:\Python26\lib\urllib2.py", line 401, in _open
    '_open', req)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 1130, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Python26\lib\urllib2.py", line 1105, in do_open
    raise URLError(err)
URLError: <urlopen error [Errno 11001] getaddrinfo failed>
import urllib2
response = urllib2.urlopen('http://www.python.org/fish.html')
html = response.read()
You're doing it wrong.
Have a look in the urllib2 source, at the line specified by the traceback:
File "C:\Python26\lib\urllib2.py", line 1105, in do_open
raise URLError(err)
There you'll see the following fragment:
try:
    h.request(req.get_method(), req.get_selector(), req.data, headers)
    r = h.getresponse()
except socket.error, err: # XXX what error?
    raise URLError(err)
So it looks like the source is a socket error, not an HTTP protocol related error. Possible reasons: you are not online, you are behind a restrictive firewall, your DNS is down, ...
All this aside from the fact, as mcandre pointed out, that your code is wrong.
Name resolution error.
getaddrinfo is used to resolve the hostname (python.org) in your request. If it fails, it means that the name could not be resolved because:
It does not exist, or the records are outdated (unlikely; python.org is a well-established domain name)
Your DNS server is down (unlikely; if you can browse other sites, you should be able to fetch that page through Python)
A firewall is blocking Python or your script from accessing the Internet (most likely; Windows Firewall sometimes does not ask you if you want to allow an application)
You live on an ancient voodoo cemetery. (unlikely; if that is the case, you should move out)
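A quick way to test name resolution in isolation (a sketch that calls the same resolver function urllib2 relies on):

import socket

# Ask the resolver directly, bypassing urllib2 entirely.
try:
    print socket.getaddrinfo('www.python.org', 80)
except socket.gaierror, e:
    print 'name resolution failed:', e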
Windows Vista, python 2.6.2
It's a 404 page, right?
>>> import urllib2
>>> import urllib
>>>
>>> page = urllib2.Request('http://www.python.org/fish.html')
>>> urllib2.urlopen( page )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python26\lib\urllib2.py", line 389, in open
    response = meth(req, response)
  File "C:\Python26\lib\urllib2.py", line 502, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python26\lib\urllib2.py", line 427, in error
    return self._call_chain(*args)
  File "C:\Python26\lib\urllib2.py", line 361, in _call_chain
    result = func(*args)
  File "C:\Python26\lib\urllib2.py", line 510, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
>>>
First, I see no reason to import urllib; I've only ever seen urllib2 used to replace urllib entirely, and I know of no functionality that's useful in urllib yet missing from urllib2.
Next, I notice that http://www.python.org/fish.html gives a 404 error for me. (That doesn't explain the traceback/exception you're seeing; I get urllib2.HTTPError: HTTP Error 404: Not Found.)
Normally, if you just want to do a default fetch of a web page (without adding special HTTP headers, doing any sort of POST, etc.), then the following suffices:
req = urllib2.urlopen('http://www.python.org/')
html = req.read()
# and req.close() if you want to be pedantic
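Since the page in question 404s, a sketch of handling that case explicitly (HTTPError is a subclass of URLError, so catch it first):

import urllib2

try:
    html = urllib2.urlopen('http://www.python.org/fish.html').read()
except urllib2.HTTPError, e:
    # The server responded, but with an error status such as 404.
    print 'server returned', e.code, e.msg
except urllib2.URLError, e:
    # No HTTP response at all (DNS failure, connection refused, ...).
    print 'failed to reach server:', e.reason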