Import strings with line terminator from csv to dask dataframe - python

I have a CSV whose quoted strings contain line terminators. I can import it with pandas using this code:
df_desc = pd.read_csv(import_desc, sep="|")
But when I try to import it into a dask dataframe:
import dask.dataframe as ddf
import_desc = "data/info.csv"
df_desc = ddf.read_csv(import_desc, sep="|", blocksize=None, dtype='str')
I get this error:
Traceback (most recent call last):
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1578, in <module>
globals = debugger.run(setup['file'], None, None, is_module)
File "/Applications/PyCharm CE.app/Contents/helpers/pydev/pydevd.py", line 1015, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "/data_extraction_dask.py", line 10, in <module>
df_desc = ddf.read_table(import_desc, sep="|", blocksize=None, dtype='str')
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 323, in read
**kwargs)
File "/anaconda2/lib/python2.7/site-packages/dask/dataframe/io/csv.py", line 243, in read_pandas
head = reader(BytesIO(b_sample), **kwargs)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 411, in _read
data = parser.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 982, in read
ret = self._engine.read(nrows)
File "/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1719, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas/_libs/parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas/_libs/parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas/_libs/parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas/_libs/parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas/_libs/parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 130
The documentation mentions:
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
That's why I used blocksize=None, but this function also uses a sampling strategy that reads the first bytes of the file to infer column types, and I think that is what generates this error.
I can't skip the sampling step, even by specifying the types with dtype.
Is there any workaround?

Related

ParserError when using Pandas script

I'm getting the error below when trying to convert a CSV file to XLSX format using a pandas script.
I tried running the script below:
import os
os.chdir("/opt/alb_test/alb/albt1/Source/alb/al/conversion/scripts")
# Reading the csv file
import pandas as pd
print(pd.__file__)
df_new = pd.read_csv("sourcefile.csv", sep="|", header=None).dropna(axis=1, how="all")
# saving xlsx file
df_new.to_excel("sourcefile.xlsx", index=False)
I am getting the error mentioned below:
Traceback (most recent call last):
File "/opt/alb_test/alb/albt1/Source/alb/al/conversion/scripts/pythn.py", line 13, in <module>
df = pd.read_csv("ff_mdm_reject_report.csv", lineterminator='\n')
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 211, in wrapper
return func(*args, **kwargs)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/util/_decorators.py", line 331, in wrapper
return func(*args, **kwargs)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
return _read(filepath_or_buffer, kwds)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 611, in _read
return parser.read(nrows)
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/readers.py", line 1778, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/opt/infa/MDW-env/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 808, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 866, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 852, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1973, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 30488, saw 2
Can anyone guide me how to fix it?
Thanks in advance!
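"Expected 1 fields in line 30488, saw 2" usually means a row contains more delimiters than the header. One possible fix (a sketch with a made-up file; assumes pandas >= 1.3, where on_bad_lines was introduced) is to skip the malformed rows instead of raising:

```python
import os
import tempfile

import pandas as pd

# A file where one row has an extra delimiter: the usual cause of
# "Expected N fields in line X, saw M".
path = os.path.join(tempfile.mkdtemp(), "sourcefile.csv")
with open(path, "w") as f:
    f.write("a|b\n1|2\n3|4|5\n6|7\n")

# on_bad_lines="skip" drops malformed rows instead of raising a ParserError.
df = pd.read_csv(path, sep="|", on_bad_lines="skip")
print(len(df))  # 2 good data rows remain
```

Note the traceback also shows a read_csv call without sep="|"; passing the same separator in every call is worth checking before skipping rows.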

lxml.etree.XMLSyntaxError while trying to read_excel using pandas

I have a spreadsheet (~50 MB) with multiple sheets, and I'm trying to read it using pandas.
import pandas as pd
df = pd.read_excel('compiled_output.xlsx', sheet_name='Sheet1')
I'm not sure why it's throwing lxml.etree.XMLSyntaxError; I've done this many times before. I also tried passing engine='openpyxl' and downgrading to pandas==1.2.4, but I get the same error:
df = pd.read_excel('compiled_output.xlsx', sheet_name='Sheet1')
File "/usr/local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 1131, in __init__
self._reader = self._engines[engine](self._io, storage_options=storage_options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 475, in __init__
super().__init__(filepath_or_buffer, storage_options=storage_options)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_base.py", line 391, in __init__
self.book = self.load_workbook(self.handles.handle)
File "/usr/local/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 486, in load_workbook
return load_workbook(
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 317, in load_workbook
reader.read()
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 282, in read
self.read_worksheets()
File "/usr/local/lib/python3.9/site-packages/openpyxl/reader/excel.py", line 216, in read_worksheets
rels = get_dependents(self.archive, rels_path)
File "/usr/local/lib/python3.9/site-packages/openpyxl/packaging/relationship.py", line 131, in get_dependents
node = fromstring(src)
File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
File "src/lxml/parser.pxi", line 1784, in lxml.etree._parseDoc
File "src/lxml/parser.pxi", line 1141, in lxml.etree._BaseParser._parseDoc
File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
File "<string>", line 2
lxml.etree.XMLSyntaxError: internal error: Huge input lookup, line 2, column 12753697

Getting Errors memory from Pandas dataframe from CSV

I have never received this error before in Python, and I was wondering why it occurs and what to do about it. The file is 11.7 MB.
relationships = pd.read_csv('relationships.tsv')
File "/usr/local/lib/python3.7/dist-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 955, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2172, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
The documentation for read_csv says:
sep : str, default ','
If you have a tab-separated file, you need to pass the separator explicitly:
df = pd.read_csv('relationships.tsv', sep='\t')

I am getting error when opening a CSV file in pycharm

I am using PyCharm, and when I run code that opens a CSV file with pandas I get a "does not exist" error.
I saved the CSV file in my project directory and read it with pandas.
import pandas as pd
df = pd.read_csv("E:\\students")
print(df)
The error when i run the code:
Traceback (most recent call last):
File "E:/untitled232/file1.py", line 2, in <module>
df = pd.read_csv("E:\\students")
File "E:\untitled232\venv\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "E:\untitled232\venv\lib\site-packages\pandas\io\parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "E:\untitled232\venv\lib\site-packages\pandas\io\parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "E:\untitled232\venv\lib\site-packages\pandas\io\parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "E:\untitled232\venv\lib\site-packages\pandas\io\parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
File "pandas\_libs\parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'E:\\students' does not exist
It turned out I had to add the .csv extension to the file name.
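pandas opens exactly the path it is given, so the extension must be part of it. A minimal sketch of the fix (using a temporary directory instead of E:\ so it runs anywhere):

```python
import os
import tempfile

import pandas as pd

# Create a sample students.csv; pandas will not guess or append an extension.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "students.csv")
with open(path, "w") as f:
    f.write("name,grade\nAda,90\nAlan,85\n")

df = pd.read_csv(path)  # full file name, including .csv
print(len(df))  # 2
```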

forcing dtype upon read_table in pandas -- get NAs for misformatted records, don't break

I am trying to specify data types when reading in data with pandas.read_table. My main reason is not speed but to tolerate misformatted records, which unfortunately occur. Instead of populating such records with NAs, the script simply breaks, and I found no switch of pandas.read_table that would force the conversion.
This is on pandas 0.17.1
What is there to do?
The relevant line is in the error message:
Traceback (most recent call last):
File "/Users/laszlo.sandor/Downloads/mock_monthly_inpatient_treatments.py", line 31, in <module>
treatments = pd.read_table(filename,usecols=[0,3,4,6], engine='c', dtype={'LopNr':np.uint16,'INDATUMA':np.uint16,'UTDATUMA':np.uint16,'DIAGNOS':object})
File "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 498, in parser_f
return _read(filepath_or_buffer, kwds)
File "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 285, in _read
return parser.read()
File "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 747, in read
ret = self._engine.read(nrows)
File "//anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1197, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
File "pandas/parser.pyx", line 865, in pandas.parser.TextReader._read_rows (pandas/parser.c:9261)
File "pandas/parser.pyx", line 972, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:10654)
File "pandas/parser.pyx", line 1053, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12010)
ValueError: cannot safely convert passed user dtype of <u2 for object dtyped data in column 3
The misformatted value in column 3 here is "2008o730" (a letter o where a zero should be).
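The C parser raises rather than coerces when a forced dtype cannot be applied. One standard workaround (a sketch with inline sample data and the question's column names) is to read the fragile column as object and convert afterwards with pd.to_numeric(errors='coerce'), which turns unparseable values into NaN instead of breaking:

```python
import io

import pandas as pd

# A sample with one misformatted date-like value ("2008o730": letter o, not zero).
data = "LopNr\tINDATUMA\n1\t20080730\n2\t2008o730\n"

# Read the fragile column as a plain string first ...
df = pd.read_table(io.StringIO(data), dtype={"INDATUMA": object})

# ... then coerce: misformatted records become NaN instead of raising.
df["INDATUMA"] = pd.to_numeric(df["INDATUMA"], errors="coerce")
print(df["INDATUMA"].isna().sum())  # 1
```

A narrow dtype like np.uint16 would overflow on eight-digit dates anyway; coercing first and downcasting later (if needed) is the safer order.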
