Pandas ignores set encoding in read_csv? - python

Using Linux, Pandas 1.0.1 and Python 3.6 I get a strange error in production:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 199, in run
new_deps = self._run_get_new_deps()
File "/opt/app-root/lib/python3.6/site-packages/luigi/worker.py", line 141, in _run_get_new_deps
task_gen = self.task.run()
File "/opt/app-root/src/import_validation/validate_csv.py", line 275, in run
validate(temp_csv, self.query_id)
File "/opt/app-root/src/import_validation/validate_csv.py", line 263, in validate
pandas.read_csv(path, encoding='latin1', sep=sep)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/opt/app-root/lib/python3.6/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1253, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas/_libs/parsers.pyx", line 1268, in pandas._libs.parsers.TextReader._string_convert
File "pandas/_libs/parsers.pyx", line 1458, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 12: invalid continuation byte
As you can see in the traceback, I am already setting the encoding to latin1:
pandas.read_csv(path, encoding='latin1', sep=sep)
Why does pandas try to decode UTF-8 when I have specified latin1 as the encoding? I have tried other aliases for latin1; they give the same result.
Any idea why pandas seems to be ignoring my encoding setting?
Edit: Removed the comment about Windows behaving differently. The same error happened there; I had just passed the file in a different way.

The problem was a few too many layers of abstraction. I had a wrapper around this call that tried to decompress the file if its name ended with 'gz', and then handed pandas not a path but an open temporary file. That file object already has an encoding set, so the encoding argument to pandas is ignored. The solution is/was to pass the encoding when opening the temporary file or, as I did, to simply pass the original path to pandas, since it handles decompression of files automatically.
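For reference, a minimal sketch of the fix (file name and contents are hypothetical): when pandas gets a path rather than an open file object, it infers gzip compression from the extension and applies the requested encoding to the decompressed bytes.

```python
import gzip

import pandas as pd

# Write a small latin1-encoded, gzip-compressed CSV for demonstration.
with gzip.open("demo.csv.gz", "wt", encoding="latin1") as f:
    f.write("name;city\nÅsa;Växjö\n")

# Passing the path (not an open file object): pandas infers gzip from the
# '.gz' extension and decodes the decompressed bytes as latin1.
df = pd.read_csv("demo.csv.gz", encoding="latin1", sep=";")
```

Passing an already-open text-mode file here would reproduce the original bug: decoding would happen with the file object's own encoding, and the `encoding` argument would have no effect.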

Related

Errors when trying to read csv file into pandas data frame

I am fairly new to working with Python, so forgive me if this is an obvious issue, but when attempting to read my CSV file into a data frame I get the following errors:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 11: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 688, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 460, in _read
data = parser.read(nrows)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 1198, in read
ret = self._engine.read(nrows)
File "C:\Users\keegan.olson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 2157, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1126, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1244, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1259, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1450, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 11: invalid start byte
Any help would be greatly appreciated!

Getting memory errors from Pandas dataframe from CSV

I have never received this error before in Python, and I was wondering why it occurs and what to do about it. The file is 11.7 MB.
relationships = pd.read_csv('relationships.tsv')
File "/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
The documentation for read_csv says:
sep : str, default ','
If you have a tab-separated file, you need to pass the separator explicitly:
df = pd.read_csv('relationships.tsv', sep='\t')
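A quick demonstration of the difference (the file name is made up): with the default comma separator pandas sees one wide column, while sep='\t' splits the fields properly.

```python
import pandas as pd

# Build a tiny tab-separated demo file.
with open("relationships_demo.tsv", "w") as f:
    f.write("parent\tchild\nA\tB\nA\tC\n")

wrong = pd.read_csv("relationships_demo.tsv")            # one column named 'parent\tchild'
right = pd.read_csv("relationships_demo.tsv", sep="\t")  # two columns, as intended
```

If some rows of such a file happen to contain commas, the comma-separated parse produces rows with more fields than the one-column header, which is exactly the "Expected 1 fields in line 6, saw 2" error in the traceback above.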

Merging CSV files in python [duplicate]

This question already has answers here:
UnicodeDecodeError when reading CSV file in Pandas with Python
(25 answers)
Closed 3 years ago.
I have been trying to merge several CSV files into one, but it's showing me an error. I am new to Python; your help will be highly appreciated.
Following is my code:
import pandas as pd
import numpy as np
import glob
all_data_csv = pd.read_csv("C:/Users/Am/Documents/A.csv", encoding='utf-8')
for f in glob.glob('*.csv'):
    df = pd.read_csv(f, encoding='utf-8')
    all_data_csv = pd.merge(all_data_csv, df, how='outer')
print(all_data_csv)
and the error shown:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1169, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/internship/j.py", line 8, in <module>
df = pd.read_csv(f, encoding='utf-8')
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 435, in _read
data = parser.read(nrows)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "C:\Users\Amreeta Koner\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1176, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1299, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1315, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1553, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x88 in position 1: invalid start byte
It seems like you have a non-ASCII character in your CSV file. I would check out the answer here. Hope it helps.
# run the same code with a small addition: an explicit encoding
pd.read_csv("C:/Users/Am/Documents/A.csv", header=0, encoding="ISO-8859-1")
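If the files in the glob have mixed encodings, a small fallback helper keeps the merge loop simple. This is a sketch (the helper name and demo file are made up): try UTF-8 first and fall back to ISO-8859-1, which can decode any byte sequence.

```python
import pandas as pd

def read_csv_tolerant(path):
    """Try UTF-8 first, then fall back to ISO-8859-1 (which never fails)."""
    for enc in ("utf-8", "ISO-8859-1"):
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")

# Demo: a file containing the 0x88 byte from the traceback above.
with open("bad_demo.csv", "wb") as f:
    f.write(b"col\n\x88abc\n")

df = read_csv_tolerant("bad_demo.csv")  # UTF-8 fails, ISO-8859-1 succeeds
```

Note that ISO-8859-1 maps every byte to *some* character, so this never raises a decode error; it may silently misinterpret text that was actually, say, Windows-1252.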

exception reading in large tab separated file chunked

I have a 350 MB tab-separated text file. If I try to read it all into memory I get an out-of-memory exception, so I am trying something along these lines (i.e. only reading in a few columns):
import pandas as pd
input_file_and_path = r'C:\Christian\ModellingData\X.txt'
column_names = [
    'X1'
    # , 'X2'
]
raw_data = pd.DataFrame()
for chunk in pd.read_csv(input_file_and_path, names=column_names, chunksize=1000, sep='\t'):
    raw_data = pd.concat([raw_data, chunk], ignore_index=True)
print(raw_data.head())
Unfortunately, I get this:
Traceback (most recent call last):
File "pandas\_libs\parsers.pyx", line 1134, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/xxxx/EdaDataPrepRange1.py", line 17, in <module>
for chunk in pd.read_csv(input_file_and_path, header=None, names=column_names, chunksize=1000, sep='\t'):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1007, in __next__
return self.get_chunk()
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1070, in get_chunk
return self.read(nrows=size)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xae in position 5: invalid start byte
Any ideas? By the way, how can I generally deal with large files and, for example, impute missing variables? Ultimately, I need to read in everything to determine, for example, the median to impute.
Use encoding="utf-8" when calling pd.read_csv.
Here they have used this encoding; see if it works: open(file_path, encoding='windows-1252')
Reference: 'utf-8' codec can't decode byte 0xa0 in position 4276: invalid start byte
Working solution: use encoding="ISO-8859-1"
Regarding your large-file problem, just use a file handle and a context manager:
with open("your_file.txt") as fileObject:
    for line in fileObject:
        do_something_with(line)
# No need to close the file: 'with' automatically does that
This won't load the whole file into memory. Instead, it loads one line at a time and 'forgets' previous lines unless you store a reference.
Also, regarding your encoding problem, just pass encoding="utf-8" to pd.read_csv.
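The two suggestions can be combined in pandas itself (the demo file and its latin1 encoding are assumptions): read the file in chunks with an explicit encoding, collect the chunks, and concatenate once at the end instead of inside the loop, which avoids repeatedly copying the growing frame.

```python
import pandas as pd

# Build a small tab-separated demo file containing a non-UTF-8 byte (0xe9).
with open("X_demo.txt", "wb") as f:
    f.write("X1\tX2\ncaf\xe9\t1\n".encode("latin1"))

# Chunked read with an explicit encoding; usecols limits memory further.
chunks = []
for chunk in pd.read_csv("X_demo.txt", sep="\t", encoding="latin1",
                         usecols=["X1"], chunksize=1):
    chunks.append(chunk)

# One concat at the end avoids quadratic copying.
raw_data = pd.concat(chunks, ignore_index=True)
```

This mirrors the loop in the question but fixes both failure modes at once: the encoding error and the per-iteration concat.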

Import strings with line terminator from csv to dask dataframe

I have a CSV with strings containing line terminators. I can import it with pandas using this code:
df_desc = pd.read_csv(import_desc, sep="|")
But when I try to import it into a dask dataframe:
import dask.dataframe as ddf
import_desc = "data/info.csv"
df_desc = ddf.read_csv(import_desc, sep="|", blocksize=None, dtype='str')
I get this error :
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/data_extraction_dask.py", line 10, in <module>
df_desc = ddf.read_table(import_desc, sep="|", blocksize=None, dtype='str')
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 323, in read
**kwargs)
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 243, in read_pandas
head = reader(BytesIO(b_sample), **kwargs)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 411, in _read
data = parser.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 982, in read
ret = self._engine.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1719, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 130
The documentation mention :
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
That's why I used blocksize=None, but this function uses a sampling strategy that reads the first bytes of the file to determine the column types, and I think that is what generates this error.
I can't skip the sampling step, even by indicating the types with dtype.
Is there any workaround?
