Trouble with Python3 string encoding - python

I have a link to a CSV file encoded in windows-1251. The area_name field in this file is a string. In Jupyter Lab on my laptop I run:
import pandas as pd
df = pd.read_csv(target_link, encoding="windows-1251", delimiter=";")
df['area_name'] = [el.encode('utf-8').decode('utf-8').replace('/', '') for el in df.area_name.values]
df.to_sql(...)
After this, the data in the database is correct.
But when I run the same code in Apache Airflow on a server, the data ends up in the database with the wrong encoding.
On my laptop:
OS - macOS Monterey 12.1
Python version 3.9.7
Pandas version 1.3.4
On server:
Python version 3.7.10
Pandas version 1.1.4
Airflow version 1.10.15
In both cases I use the same database. It has encoding UTF8.
How can I fix the encoding in pandas?
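A minimal sketch of what the encoding steps in the snippet above actually do, using only the standard library. The sample string is hypothetical. It suggests (but does not prove) that once read_csv has decoded the file with the right encoding, the strings are ordinary Unicode and the encode/decode round trip changes nothing, so the difference between the laptop and the server may lie elsewhere, for example in the database connection or the environment's locale settings.

```python
# Hypothetical sample value; any Cyrillic text representable in windows-1251 works.
sample = "Москва/область"

# Bytes as they would appear in a windows-1251 encoded CSV file.
raw_cp1251 = sample.encode("windows-1251")

# Decoding with the correct codec yields a normal Unicode str,
# which is what read_csv(..., encoding="windows-1251") produces.
decoded = raw_cp1251.decode("windows-1251")

# The round trip used in the question is a no-op on a str.
round_tripped = decoded.encode("utf-8").decode("utf-8")

# Decoding the same bytes with the wrong codec gives mojibake instead.
wrong = raw_cp1251.decode("latin-1")


def strip_slashes(value):
    """Remove forward slashes, as the list comprehension in the question does."""
    return value.replace("/", "")


print(decoded == round_tripped)   # the round trip changed nothing
print(strip_slashes(decoded))
```

If the values look right in the DataFrame on the server but wrong in the database, checking the client encoding of the database connection would be a reasonable next step.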

Related

Pandas read_csv fails to download CSV with SSL Error

Here is the code to reproduce it:
import pandas as pd
url = 'https://info.gesundheitsministerium.gv.at/data/timeline-faelle-bundeslaender.csv'
df = pd.read_csv(url)
It fails with the following traceback:
URLError: <urlopen error [SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:1129)>
Here is a link to check the url. The download works from the same browser, if you embed the link on a markdown cell in Jupyter.
Any ideas to make this "just work"?
Update:
As per the question suggested by Florin C. below, this solution solves the issue when downloading via requests:
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
requests.get(url)
It would just be a matter of forcing pandas to do the same, somehow.
My Environment:
Python implementation: CPython
Python version : 3.9.7
IPython version : 7.28.0
requests : 2.25.1
seaborn : 0.11.2
json : 2.0.9
numpy : 1.20.3
plotly : 5.4.0
matplotlib: 3.5.0
lightgbm : 3.3.1
pandas : 1.3.4
Watermark: 2.2.0
Here is a solution if you need to automate the process and don't want to have to download the csv first and then read from file.
import requests
import urllib3
import io
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
res = requests.get(url)
pd.read_csv(io.BytesIO(res.content), sep=';')
Note that it may not be safe to lower the default ciphers to SECLEVEL=1 at the OS level, but this temporary, process-local change should be OK.
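As an alternative sketch, the lowered security level can be limited to a single SSL context instead of changing a library-wide default. This uses only the standard library; the helper names make_low_seclevel_context and fetch_bytes are hypothetical, and the "DEFAULT:@SECLEVEL=1" cipher string assumes Python is linked against OpenSSL 1.1.0 or later.

```python
import ssl
import urllib.request


def make_low_seclevel_context():
    """Build a verifying SSL context that accepts weaker DH keys (hypothetical helper)."""
    ctx = ssl.create_default_context()      # certificate verification stays enabled
    ctx.set_ciphers("DEFAULT:@SECLEVEL=1")  # lower OpenSSL's security level for this context only
    return ctx


def fetch_bytes(url, timeout=30):
    """Download a URL using the relaxed context and return the raw body."""
    ctx = make_low_seclevel_context()
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read()
```

The result can then be passed to pandas in the same way as above, for example pd.read_csv(io.BytesIO(fetch_bytes(url)), sep=';').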

Python Pandas to convert CSV to Parquet using Fastparquet

I am using Python 3.6 interpreter in my PyCharm venv, and trying to convert a CSV to Parquet.
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet')
Error-1
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
Solution-1
Installed fastparquet 0.2.1
Error-2
File "/Users/python parquet/venv/lib/python3.6/site-packages/fastparquet/compression.py", line 131, in compress_data
(algorithm, sorted(compressions)))
RuntimeError: Compression 'snappy' not available. Options: ['GZIP', 'UNCOMPRESSED']
I installed python-snappy 0.5.3 but I am still getting the same error. Do I need to install any other library?
If I use PyArrow 0.12.0 engine, I don't experience the issue.
In fastparquet snappy compression is an optional feature.
To quickly check a conversion from csv to parquet, you can execute the following script (only requires pandas and fastparquet):
import pandas as pd
from fastparquet import write, ParquetFile
df = pd.DataFrame({"col1": [1,2,3,4], "col2": ["a","b","c","d"]})
# df.head() # Test your initial value
df.to_csv("/tmp/test_csv", index=False)
df_csv = pd.read_csv("/tmp/test_csv")
df_csv.head() # Test your intermediate value
df_csv.to_parquet("/tmp/test_parquet", compression="GZIP")
df_parquet = ParquetFile("/tmp/test_parquet").to_pandas()
df_parquet.head() # Test your final value
However, if you need to write or read using snappy compression, you can follow this answer about installing the snappy library on Ubuntu.
I've used the following versions:
python 3.10.9 fastparquet==2022.12.0 pandas==1.5.2
This code works seamlessly for me:
import pandas as pd
df = pd.read_csv('/parquet/drivers.csv')
df.to_parquet('output.parquet', engine="fastparquet")
I'd recommend you move away from python 3.6 as it has reached end of life and is no longer supported.
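To see ahead of time which engine and compression are likely to be usable, a small check like the following may help. It is only a sketch: pick_parquet_engine is a hypothetical helper, and it tests whether the modules can be found, not whether they work correctly together.

```python
import importlib.util


def pick_parquet_engine(preferred=("pyarrow", "fastparquet")):
    """Return the first importable engine name, or None if none is installed."""
    for name in preferred:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# python-snappy is imported under the module name "snappy".
snappy_installed = importlib.util.find_spec("snappy") is not None

engine = pick_parquet_engine()
# Fall back to GZIP when the snappy bindings cannot be found.
compression = "snappy" if snappy_installed else "gzip"
print(engine, compression)
```

The chosen values could then be passed on as df.to_parquet('output.parquet', engine=engine, compression=compression).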

Pandas read_excel

I struggled for a few hours trying to read an Excel file with pd.read_excel where the path is a website address. I figured out that the link doesn't point directly to the file but just triggers a download. Is there an easy way to solve this?
Part of code:
link_energy = 'http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls'
df_energy = pd.read_excel(link_energy)
Error message:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\n\n\n<!DOC'
It's probably not a problem with pandas but my own lack of skill in how to do it.
For me, everything works as expected with the following code:
import pandas as pd
link_energy = 'http://unstats.un.org/unsd/environment/excel_file_tables/2013/Energy%20Indicators.xls'
df_energy = pd.read_excel(link_energy)
df_energy
without errors on the following env:
The version of the notebook server is: 5.2.2
The server is running on this version of Python:
Python 3.6.3 | packaged by conda-forge | (default, Nov 4 2017, 10:10:56)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
Current Kernel Information:
Python 3.6.3 | packaged by conda-forge | (default, Nov 4 2017, 10:10:56)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.2.1 -- An enhanced Interactive Python. Type '?' for help.
However, I cannot access the URL you posted.
But pd.read_excel won't work here, and you need to use pd.read_csv:
import pandas as pd
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls')
Next, inspect the file to see what it contains and which separator it uses. If any columns contain extraneous values, those need to be skipped in order to load and read the useful data.
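The error message above shows that the response began with b'\n\n\n<!DOC', which points to an HTML page rather than a spreadsheet. A rough way to check what a link really returns is to look at the first bytes of the downloaded content. The sketch below uses a hypothetical helper, sniff_payload, and relies on the well-known signatures of legacy .xls (OLE2 compound file) and .xlsx (ZIP container) files.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .xls files start with this
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are ZIP archives


def sniff_payload(head):
    """Guess the payload type from the first bytes of a download."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doc") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"


# The bytes reported in the XLRDError message from the question.
print(sniff_payload(b"\n\n\n<!DOC"))
```

If the result is "html", the URL is serving a web page (for example a redirect or download page), and the real file location has to be found before pd.read_excel can read it.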

IndexError: pop from empty stack (python)

I tried to import an Excel file of about 71 MB with the following code, and when running it I get "IndexError: pop from empty stack". Please help me with this.
Code:
import pandas as pd
df1 = pd.read_excel('F:/Test PCA/Week-7-MachineLearning/weather.xlsx',
sheetname='Sheet1', header=0)
Data: https://www.dropbox.com/s/zyrry53li55hvha/weather.xlsx?dl=0
Using the latest pandas and xlrd this works fine to read the "weather.xlsx" file you provided:
df1 = pd.read_excel('weather.xlsx',sheet_name='Sheet1')
Can you try running:
pip install --upgrade pandas
pip install --upgrade xlrd
To ensure you have the latest version of the modules for reading the file?
I tried the same code you provided with the versions of pandas and xlrd below, and it works fine. I only changed the sheetname argument to sheet_name.
pandas==0.22.0
xlrd==1.1.0
df=pd.read_excel('weather.xlsx',sheet_name='Sheet1',header=0)

Getting Spark 1.5 to run on mac local

I downloaded the prebuilt spark for hadoop 2.4 and I'm getting the following error when I try to fire up a SparkContext in python:
ClassNotFoundException: org.apache.spark.launcher.Main
The following code should be correct:
import sys, os
os.environ['SPARK_HOME'] = '/spark-1.5.1-bin-hadoop2.4/'
sys.path.insert(0, '/spark-1.5.1-bin-hadoop2.4/python/')
os.environ['PYTHONPATH'] = '/usr/local/Cellar/python/2.7.9/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/'
import pyspark
from pyspark import SparkContext
sc = SparkContext('local[2]')
It turns out my issue was that the default JDK on my Mac is Java 1.6, and Spark 1.5 dropped support for Java 1.6 (reference). I upgraded to Java 1.8 with the installer from Oracle, and it fixed the problem.
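A quick way to confirm which Java version Python will pick up before creating a SparkContext is to parse the output of java -version. This is only a sketch: parse_java_major and java_version_output are hypothetical helpers, and the parser assumes the usual version "1.x.y_z" or version "N.x.y" format.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Legacy scheme: "1.6.0_65" means Java 6; newer scheme: "11.0.2" means Java 11.
    return second if first == 1 else first


def java_version_output():
    """Run `java -version`; the JVM prints its version banner to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr or result.stdout


print(parse_java_major('java version "1.6.0_65"'))
```

On a machine like the one described above, a result of 6 would indicate that the old JDK is still the default.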
