Pandas read_csv fails to download CSV with SSL Error - python

Here is the code to reproduce it:
import pandas as pd
url = 'https://info.gesundheitsministerium.gv.at/data/timeline-faelle-bundeslaender.csv'
df = pd.read_csv(url)
it fails with the following traceback:
URLError: <urlopen error [SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:1129)>
Here is a link to check the URL. The download works from the browser, e.g. if you embed the link in a markdown cell in Jupyter.
Any ideas to make this "just work"?
Update:
as per the question suggested by Florin C. below,
This solution solves the issue when downloading via requests:
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
requests.get(url)
It would be just a matter of forcing Pandas to do the same, somehow.
My Environment:
Python implementation: CPython
Python version : 3.9.7
IPython version : 7.28.0
requests : 2.25.1
seaborn : 0.11.2
json : 2.0.9
numpy : 1.20.3
plotly : 5.4.0
matplotlib: 3.5.0
lightgbm : 3.3.1
pandas : 1.3.4
Watermark: 2.2.0

Here is a solution if you need to automate the process and don't want to download the CSV manually first and then read it from a file.
import io
import requests
import pandas as pd
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
url = 'https://info.gesundheitsministerium.gv.at/data/timeline-faelle-bundeslaender.csv'
res = requests.get(url)
df = pd.read_csv(io.BytesIO(res.content), sep=';')
It should be noted that it may not be safe to change the default ciphers to SECLEVEL=1 at the OS level, but this temporary, process-local change should be OK.
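If you would rather not touch the urllib3 defaults at all, here is a minimal sketch of the same idea applied directly to pandas, assuming the server merely needs OpenSSL's security level lowered: build a dedicated SSL context and hand the opened response straight to read_csv (sep=';' as in the answer above).
import ssl
import urllib.request
import pandas as pd

url = 'https://info.gesundheitsministerium.gv.at/data/timeline-faelle-bundeslaender.csv'
# Lower the security level for this one context only,
# leaving the process-wide defaults untouched.
ctx = ssl.create_default_context()
ctx.set_ciphers('DEFAULT@SECLEVEL=1')
with urllib.request.urlopen(url, context=ctx) as resp:
    df = pd.read_csv(resp, sep=';')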

Related

pandas profiling import error : not able to import pandas_profiling package

I cannot import name DataError from pandas.core.base
When I import package:
from pandas_profiling import ProfileReport
it shows error:
cannot import name 'DataError' from 'pandas.core.base'
The new version of Python (3.11) does not work with pandas_profiling. However, the old version works fine.
My version is 3.8.8, and it works from the command line:
import pandas as pd
import pandas_profiling as pp
data = pd.read_csv("-----.csv")
pp.ProfileReport(data)
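If changing the Python version is not an option, here is a heavily hedged workaround sketch. The assumption (not confirmed in this thread) is that newer pandas releases expose DataError in pandas.errors instead of pandas.core.base, while pandas_profiling still imports it from the old location; aliasing it back before the import may unblock it:
import pandas.core.base
try:
    from pandas.core.base import DataError  # old location, still present on older pandas
except ImportError:
    # Assumption: on newer pandas, DataError lives in pandas.errors.
    from pandas.errors import DataError
    pandas.core.base.DataError = DataError
from pandas_profiling import ProfileReport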

Python Pandas to convert CSV to Parquet using Fastparquet

I am using Python 3.6 interpreter in my PyCharm venv, and trying to convert a CSV to Parquet.
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')
Error-1
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
Solution-1
Installed fastparquet 0.2.1
Error-2
File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
(algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']
I installed python-snappy 0.5.3 but am still getting the same error. Do I need to install any other library?
If I use PyArrow 0.12.0 engine, I don't experience the issue.
In fastparquet, snappy compression is an optional feature.
To quickly check a conversion from csv to parquet, you can execute the following script (only requires pandas and fastparquet):
import pandas as pd
from fastparquet import ParquetFile

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
# df.head() # Test your initial value
df.to_csv("/tmp/test_csv", index=False)
df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head() # Test your intermediate value
df_csv.to_parquet("/tmp/test_parquet", engine="fastparquet", compression="GZIP")
df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head() # Test your final value
However, if you need to write or read using snappy compression you might follow this answer about installing snappy library on ubuntu.
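Once python-snappy is installed, a minimal sketch of a snappy round trip (the /tmp path is assumed for illustration); pinning the engine and codec explicitly makes a missing dependency fail fast instead of silently falling back:
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "d"]})
# Requires python-snappy in addition to fastparquet.
df.to_parquet("/tmp/test_snappy.parquet", engine="fastparquet", compression="snappy")
print(pd.read_parquet("/tmp/test_snappy.parquet", engine="fastparquet").head())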
I've used the following versions:
Python 3.10.9, fastparquet==2022.12.0, pandas==1.5.2
This code works seamlessly for me:
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet', engine="fastparquet")
I'd recommend you move away from Python 3.6, as it has reached end of life and is no longer supported.

How can I open a .snappy.parquet file in python?

How can I open a .snappy.parquet file in python 3.5? So far, I used this code:
import numpy
import pyarrow
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pyarrow.parquet.read_table(filename).to_pandas()
But, it gives this error:
AttributeError: module 'pyarrow' has no attribute 'compat'
P.S. I installed pyarrow this way:
pip install pyarrow
I got the same issue and managed to solve it by following the solution proposed in https://github.com/dask/fastparquet/issues/366.
1) Install python-snappy using conda install (for some reason I couldn't download it with pip install).
2) Add the snappy_decompress function.
from fastparquet import ParquetFile
import snappy

def snappy_decompress(data, uncompressed_size):
    return snappy.decompress(data)

pf = ParquetFile('filename') # filename includes .snappy.parquet extension
dff = pf.to_pandas()
The error AttributeError: module 'pyarrow' has no attribute 'compat' is sadly a bit misleading. To execute the to_pandas() function on a pyarrow.Table instance, you need pandas installed. The above error is a symptom of the missing requirement.
pandas is not a hard requirement of pyarrow; most of pyarrow's functionality is usable with just Python built-ins and NumPy. Thus pyarrow users who don't need pandas can work with it without having pandas pre-installed.
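With pandas installed alongside pyarrow, a minimal sketch of the pyarrow route; note that pyarrow.parquet is imported explicitly as a submodule, which also sidesteps the bare import pyarrow pitfall from the question:
import pyarrow.parquet as pq  # pandas must also be installed for to_pandas()

filename = "/Users/T/Desktop/data.snappy.parquet"
table = pq.read_table(filename)
df = table.to_pandas()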
You can use pandas to read snappy.parquet files into a pandas DataFrame.
import pandas as pd
filename = "/Users/T/Desktop/data.snappy.parquet"
df = pd.read_parquet(filename)
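pd.read_parquet dispatches to whichever engine happens to be installed, so pinning the engine makes the dependency explicit; a small sketch under the assumption that pyarrow is the engine you have:
import pandas as pd

filename = "/Users/T/Desktop/data.snappy.parquet"
# engine="fastparquet" works too, provided python-snappy is available.
df = pd.read_parquet(filename, engine="pyarrow")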

Error with pulling data from Yahoo Finance

I am trying to pull data from Yahoo Finance via pandas. I have done similar pulls before, but have never faced this issue.
import pandas as pd
import numpy as np
import datetime as dt
from dateutil import parser
from pandas_datareader import data
from dateutil.relativedelta import relativedelta
end_date=dt.datetime.today()
begdate = end_date + relativedelta(years=-10)
data1 = data.get_data_yahoo('^DJI',begdate,end_date,interval='m')
This is the error I am getting
RemoteDataError: Unable to read URL: http://ichart.finance.yahoo.com/table.csv
I am using Python 3.5
EDIT:
This issue has been fixed as of v0.5.0 of pandas-datareader. The fix below no longer applies.
As pointed out by others, the API endpoint has changed and a patch has been made but hasn't been merged to the master branch of pandas-datareader yet (as of 2017-05-21 6:19 UTC). The fix is at this branch by Rob Kimball (Issue | PR). For a temporary fix (until the patch is merged into master), try:
$ pip install git+https://github.com/rgkimball/pandas-datareader#fix-yahoo --upgrade
Or, in case you want to tweak the source code:
$ git clone https://github.com/rgkimball/pandas-datareader
$ cd pandas-datareader
$ git checkout fix-yahoo
$ pip install -e .
On Python:
import pandas_datareader as pdr
print(pdr.__version__) # Make sure it is '0.4.1'.
pdr.get_data_yahoo('^DJI')

PyPI API - How to get stable package version

How does pip determine which version is the stable version of a package? For example, the current stable release of Django is 1.7.5 (as of 2-27-15), and that is the version installed by the command pip install django.
But when I go to the PyPI JSON API for Django (https://pypi.python.org/pypi/Django/json), it resolves to the most recent release (including development versions):
"version": "1.8b1",
There is a key in the JSON response that looks like it would indicate stable:
"stable_version": null,
but the value is null on all the packages I tried in the API. There is this line in the JSON response:
"classifiers": [
"Development Status :: 4 - Beta",
But that is an awkward line to parse. It would be nice if there were a line like "stable_version": true or false. How can I determine the default pip-installed version using the PyPI JSON API?
The version scheme is defined in PEP 440. The packaging module can handle version parsing and comparison.
I came up with this function to get the latest stable version of a package:
import requests
try:
    from packaging.version import parse
except ImportError:
    from pip._vendor.packaging.version import parse

URL_PATTERN = 'https://pypi.python.org/pypi/{package}/json'

def get_version(package, url_pattern=URL_PATTERN):
    """Return version of package on pypi.python.org using json."""
    req = requests.get(url_pattern.format(package=package))
    version = parse('0')
    if req.status_code == requests.codes.ok:
        j = req.json()  # avoids re-encoding req.text by hand
        releases = j.get('releases', {})
        for release in releases:
            ver = parse(release)
            if not ver.is_prerelease:
                version = max(version, ver)
    return version

if __name__ == '__main__':
    print("Django==%s" % get_version('Django'))
When executed, this produces the following result:
$ python v.py
Django==2.0
Just a quick note (as I can't add a reply to a previous answer yet) that sashk's answer could return an incorrect result, as max() doesn't really understand versioning: e.g., as of now on SQLAlchemy it thinks 1.1.9 is higher than 1.1.14 (which is actually the latest stable release).
A quick solution:
import json
import urllib.request
import packaging.version
import distutils.version

data = json.loads(urllib.request.urlopen('https://pypi.python.org/pypi/SQLAlchemy/json').read().decode('utf-8'))
max([distutils.version.LooseVersion(release) for release in data['releases'] if not packaging.version.parse(release).is_prerelease])
which correctly returns
LooseVersion ('1.1.14')
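Since packaging's parse() already returns objects that order by PEP 440 rules, the (now deprecated) distutils LooseVersion isn't strictly needed; a minimal sketch of the same query using only packaging:
import json
import urllib.request
from packaging.version import parse

with urllib.request.urlopen('https://pypi.python.org/pypi/SQLAlchemy/json') as resp:
    data = json.loads(resp.read().decode('utf-8'))
# Version objects compare correctly, so max() picks the true latest stable release.
stable = max(v for v in map(parse, data['releases']) if not v.is_prerelease)
print(stable)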
