How to fix an SSL certificate error with urllib? - python

I'm developing a Python program to grab all images from a website, download them to a folder, and create a CSV file to store all of this information. I'm using urllib and keep getting an error about an SSL certificate failure. I'm running on Jupyter Notebook, Windows 10, and Python 3.7.
I tried pip installing certifi and urllib, but those requirements are already satisfied. I've tried restarting Jupyter and that does not fix the problem. I'm not really sure where to start, as I'm not super familiar with urllib.
I expect this to download the images and write to the CSV file. It does write to the CSV file, but the images won't download when I get this error:
The error doesn't halt the program, but it does prevent the program from doing what it's meant to do.

If you are using the urllib library, pass a context parameter when you make the request to open the URL. Here is an implementation:
import urllib.request
import ssl

# Ignore SSL certificate errors
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

html = urllib.request.urlopen(url, context=ssl_context).read()
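For the asker's actual goal (saving an image), the same context can be passed to urlopen and the bytes written to disk. A minimal sketch, assuming a hypothetical image_url found by the scraper:
import urllib.request
import ssl

# Hypothetical image URL; substitute whatever your scraper extracted
image_url = "https://example.com/picture.jpg"

# Ignore SSL certificate errors, as above
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

# Read the raw image bytes over the unverified connection and save them
with urllib.request.urlopen(image_url, context=ssl_context) as resp:
    with open("picture.jpg", "wb") as out:
        out.write(resp.read())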

If you are able to use the requests library instead of urllib, you can just do:
import requests
response = requests.get('your_url', verify=False)
But also consider the warning this raises, noted below.
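verify=False makes requests emit an InsecureRequestWarning on every call. If you accept the risk, it can be silenced through urllib3; a sketch, not a recommendation:
import requests
import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

response = requests.get('your_url', verify=False)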

Related

How to download an image properly from the web using 'urllib.request' in Python 3.7?

I tried to download an image from the web using Python 3.7, but I got an error in my code and I cannot understand what is wrong or how to fix it.
I use PyCharm 3.4 on Mac OS X:
My Code:
import urllib.request
urllib.request.urlretrieve("http://www.digimouth.com/news/media/2011/09/google-logo.jpg", "local-filename.jpg")
Error
urllib.error.URLError: <urlopen error [Errno 65] No route to host>
Your approach is correct. However, the link itself is dead; hence the error.
Simply use urllib.request.urlretrieve(url=link, filename=output); your approach is correct. If the URL points to an image, you download an image. If the URL points to an HTML file, you download an HTML file.
Your error, urllib.error.URLError: <urlopen error [Errno 65] No route to host>, occurs because your link is broken; urlretrieve only works for live links. Note also that urlretrieve is considered part of the legacy interface.
Unfortunately, there is nothing you can do to fix the URL "http://www.digimouth.com/news/media/2011/09/google-logo.jpg", and it also appears to be suspicious now.
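Incidentally, since urlretrieve is part of the legacy interface, the non-legacy equivalent is a plain urlopen plus a file copy. A sketch, using a placeholder URL:
import shutil
import urllib.request

# Placeholder URL; substitute a live image link
url = "https://www.python.org/static/img/python-logo.png"

# Stream the response straight into a local file
with urllib.request.urlopen(url) as response, open("logo.png", "wb") as out_file:
    shutil.copyfileobj(response, out_file)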
The code you provided works for a different picture, for example:
import urllib.request
urllib.request.urlretrieve("https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwi7nsiIqqXgAhUnuqQKHY6uDa4QjRx6BAgBEAU&url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FGiraffe&psig=AOvVaw1g8lkjuT8Ly2FxVhGp1vp6&ust=1549481373274429", "giraffe.jpg")
The problem is with your link, as http://www.digimouth.com/news/media/2011/09/google-logo.jpg seems to be dead. Even wget http://www.digimouth.com/news/media/2011/09/google-logo.jpg does not work in the terminal, and Chrome cannot open that link properly. So I suggest choosing a different image.
For the SSL error, see: https://stackoverflow.com/a/28052583/8565438
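Since a dead link surfaces as a URLError only at runtime, it may also help to wrap the call so one broken image doesn't stop a whole scrape. A minimal sketch with a placeholder URL:
import urllib.error
import urllib.request

url = "http://www.example.com/some-image.jpg"  # placeholder, possibly dead

try:
    urllib.request.urlretrieve(url, "local-filename.jpg")
except urllib.error.URLError as e:
    # Catches dead hosts (e.g. [Errno 65] No route to host) and HTTP errors alike
    print("Download failed:", e)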

Pytrends: The request failed: Google returned a response with code 429

I'm using Pytrends to extract Google trends data, like:
from pytrends.request import TrendReq
pytrend = TrendReq()
pytrend.build_payload(kw_list=['bitcoin'], cat=0, timeframe=from_date+' '+today_date)
And it returns an error:
ResponseError: The request failed: Google returned a response with code 429.
It worked yesterday, and for some reason it doesn't work now! The sample code from the GitHub page failed too:
pytrends = TrendReq(hl='en-US', tz=360, proxies = {'https': 'https://34.203.233.13:80'})
How can I fix this? Thanks a lot!
TL;DR: I solved the problem with a custom patch
Explanation
The problem comes from Google's bot-recognition system. As other similar systems do, it stops serving requests that come too frequently from suspicious clients. Among the features used to recognize trustworthy clients is the presence of specific headers generated by the JavaScript code running on the web pages. Unfortunately, the Python requests library does not provide that level of camouflage against such bot-recognition systems, since the JavaScript code is not even executed.
So the idea behind my patch is to reuse the headers generated by my browser while interacting with Google Trends. Those headers are generated while I am logged in with my Google account; in other words, they are linked to my account, so to Google I am trustworthy.
Solution
I solved in the following way:
First of all you must use google trends from your web browser while you are logged in with your Google Account;
In order to track the actual HTTP GET being made (I am using Chromium), go into "More Tools" -> "Developer Tools" -> "Network" tab.
Visit the Google Trends page and perform a search for a trend; it will trigger a lot of HTTP requests in the left sidebar of the "Network" tab;
Identify the GET request (in my case it was /trends/explore?q=topic&geo=US), right-click on it and select Copy -> Copy as cURL;
Then go to a cURL-to-Python converter page, paste the cURL command on the left side, and copy the "headers" dictionary from the Python script generated on the right side of the page;
Then go to your code and subclass the TrendReq class, so you can pass the custom header just copied:
import requests
from pytrends.request import TrendReq as UTrendReq

GET_METHOD = 'get'

headers = {
    ...
}

class TrendReq(UTrendReq):
    def _get_data(self, url, method=GET_METHOD, trim_chars=0, **kwargs):
        return super()._get_data(url, method=GET_METHOD, trim_chars=trim_chars, headers=headers, **kwargs)
Remove any other "import TrendReq" from your code, since it will now use the class you just created;
Retry again;
If the error message ever comes back, repeat the procedure: you need to update the headers dictionary with fresh values, and it may trigger the captcha mechanism.
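A usage sketch for the patched class, assuming the headers dictionary above has been filled in with the values copied from your browser:
# Uses the TrendReq subclass and headers defined above
pytrends = TrendReq(hl='en-US', tz=360)
pytrends.build_payload(kw_list=['bitcoin'], cat=0, timeframe='today 3-m')
df = pytrends.interest_over_time()
print(df.head())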
This one took a while, but it turned out the library just needed an update. You can check out a few of the approaches I posted here, both of which resulted in Status 429 responses:
https://github.com/GeneralMills/pytrends/issues/243
Ultimately, I was able to get it working again by running the following command from my bash prompt:
Run:
pip install --upgrade --user git+https://github.com/GeneralMills/pytrends
to get the latest version.
Hope that works for you too.
EDIT:
If you can't upgrade from source you may have some luck with:
pip install pytrends --upgrade
Also, make sure you're running git as an administrator if on Windows.
I had the same problem even after updating the module with pip install --upgrade --user git+https://github.com/GeneralMills/pytrends and restarting Python.
But the issue was solved with the method below.
Instead of:
pytrends = TrendReq(hl='en-US', tz=360, timeout=(10,25), proxies=['https://34.203.233.13:80',], retries=2, backoff_factor=0.1, requests_args={'verify':False})
I just ran:
pytrend = TrendReq()
Hope this can be helpful!
After running the upgrade command via pip install, you should restart the Python kernel and reload the pytrends library.
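In a Jupyter notebook that means restarting the kernel from the menu and re-importing; alternatively, a reload can be forced in place. A sketch:
import importlib
import pytrends.request

# Force Python to pick up the freshly upgraded module code
importlib.reload(pytrends.request)
from pytrends.request import TrendReq
pytrend = TrendReq()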

'requests' in python is not downloading file completely

I am using the requests library of Python to download a file of approx. 40 MB in size. But with my code I am getting a file of only 14 MB. It is not showing any error (a few warnings before the download, though).
Here is my code:
import requests

file_url = "https://file_url.tar"
user = 'username'
passw = 'password'

r = requests.get(file_url, auth=(user, passw), verify=False, stream=True)
with open("c.tar", "wb") as code:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            code.write(chunk)
I tried without stream=True too, but that is also not working.
When I put this URL in a browser, I get the complete file of 40 MB.
I tried this script on some other machine and it works fine there (and I get these warnings there too).
These are the warnings i am getting:
SNIMissingWarning: An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
SNIMissingWarning
InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html
InsecureRequestWarning)
But I don't think the problem is because of these warnings, since running the script on another system produces the same warnings and there the script works fine.
I am using urllib instead of requests (note this is the Python 2 urllib; in Python 3 the equivalent is urllib.request.urlopen):
import urllib

# Python 2 API: urllib.urlopen
url = "http://62.138.7.10/downloads/Bh2g2m.Bh2g.06.DR.M0viesC0unter.mp4?st=6MVZyTUL7X22v7ILOtB2XA&e=1502823147"
file_name = 'trial_video.mp4'

response = urllib.urlopen(url)
with open(file_name, 'wb') as f:
    f.write(response.read())
Hope this will help you.
I have experienced similar problems with Requests. Requests is great for fancy JSON API POST requests and the like, but for ordinary file downloads, pycurl is a much better tool. Because of its complicated dependency on libcurl, you shouldn't try installing pycurl with pip; instead, get a copy from your distro or use one of the prebuilt win32 modules from their site.
For what it's worth, when I was using requests for file downloads, I also set up logging, and I got some "broken pipe" errors. Maybe Requests disconnects early for performance reasons or something? I didn't have the patience to figure it out when I knew there was an alternative solution that works reliably.
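If you would rather stay with requests, one way to detect a truncated download is to compare the bytes written against the Content-Length header, when the server sends one. A hedged sketch based on the question's code:
import requests

file_url = "https://file_url.tar"  # placeholder from the question

r = requests.get(file_url, stream=True)
expected = int(r.headers.get("Content-Length", 0))

written = 0
with open("c.tar", "wb") as out:
    for chunk in r.iter_content(chunk_size=8192):
        if chunk:
            out.write(chunk)
            written += len(chunk)

# If the server declared a length, verify we actually received all of it
if expected and written != expected:
    raise IOError("Incomplete download: got %d of %d bytes" % (written, expected))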

urllib.request SSL Connection Python 3

I'm trying to parse the data from this url:
https://www.chemeo.com/search?q=show%3Ahfus+tf%3A275%3B283
But I think this is failing because the website uses SSL TLS 1.3. How can I enable my Python script, below, to connect using SSL in urllib.request?
I've tried using an SSL context but this doesn't seem to work.
This is the Python 3.6 code I have:
import urllib.request
import ssl
from bs4 import BeautifulSoup
scontext = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
chemeo_search_url = "https://www.chemeo.com/search?q=show%3Ahfus+tf%3A275%3B283"
print(chemeo_search_url)
with urllib.request.urlopen(chemeo_search_url, context=scontext) as f:
    print(f.read(200))
Try:
ssl.PROTOCOL_TLS
From the docs on "PROTOCOL_SSLv23":
Deprecated since version 2.7.13: Use PROTOCOL_TLS instead.
Note:
Be sure to have the CA certificate bundles installed; on a minimal build of Alpine Linux or BusyBox, the certs have to be installed separately. If Python wasn't compiled with SSL support, it may also be necessary to rebuild it with SSL enabled. Finally, the version of OpenSSL that Python was compiled against determines which SSL features are usable.
Also note the chemeo site doesn't use TLSv1.3: it is still experimental and not all that secure at the time of this writing. They currently support TLS 1.0, 1.1 and 1.2, using Let's Encrypt as their cert provider.
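A minimal sketch of the suggested fix; PROTOCOL_TLS lets OpenSSL negotiate the highest version both sides support:
import ssl
import urllib.request

# PROTOCOL_TLS replaces the deprecated PROTOCOL_SSLv23
scontext = ssl.SSLContext(ssl.PROTOCOL_TLS)
scontext.load_default_certs()  # use the system CA bundle

chemeo_search_url = "https://www.chemeo.com/search?q=show%3Ahfus+tf%3A275%3B283"
with urllib.request.urlopen(chemeo_search_url, context=scontext) as f:
    print(f.read(200))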

Python 3.5 urllib won't open webpage in browser

I tried the following code in VS2015, Eclipse and Spyder:
import urllib.request
with urllib.request.urlopen('https://www.python.org/') as response:
    html = response.read()
In all cases it won't open the webpage in the browser. I am not sure what the problem is; debugging doesn't help. In VS2015 the program exits with code 0, which I suppose means success.
You are using the wrong library for the job. The urllib module provides functions to send HTTP requests and capture the result in your program. It has nothing to do with a web browser. What you are looking for is the webbrowser module. Here is an example:
import webbrowser
webbrowser.open('http://google.com')
This will show the web page in your browser.
urllib is a module that is used to send requests to web pages and read their contents, whereas webbrowser is used to open the desired URL.
It can be used as follows:
import webbrowser
webbrowser.open('http://docs.python.org/lib/module-webbrowser.html')
which usually reuses an existing browser window.
To open in new window:
webbrowser.open_new('http://docs.python.org/lib/module-webbrowser.html')
To open in new tab:
webbrowser.open_new_tab('http://docs.python.org/lib/module-webbrowser.html')
To access it via the command-line interface:
$ python -m webbrowser -t "http://www.python.org"
-n: open new window
-t: open new tab
Here is the Python documentation for webbrowser:
Python 3.6
Python 2.7
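If you really do want both behaviours (fetch the page with urllib, then view it), you can combine the two modules. A small sketch, assuming network access:
import tempfile
import urllib.request
import webbrowser

# Fetch the HTML with urllib
with urllib.request.urlopen('https://www.python.org/') as response:
    html = response.read()

# Write it to a temporary file and open that file in the browser
with tempfile.NamedTemporaryFile('wb', suffix='.html', delete=False) as f:
    f.write(html)
webbrowser.open('file://' + f.name)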
