Dask read_csv fails where pandas doesn't

Dask read_csv fails where pandas doesn't - python

Trying to use dask's read_csv on file where pandas's read_csv like this
dd.read_csv('data/ecommerce-new.csv')
fails with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The file is csv file of scraped data using scrapy with two columns, one with the url and the other with the html(which is stored multiline using " as delimiter char). Being actually parsed by pandas means it should be actually well-formatted.
html,url
https://google.com,"<a href=""link"">
</a>"
Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it actually fails when trying to infer the datatypes(also there's this issue which
should have been solved https://github.com/dask/dask/issues/1284)
Has anyone encountered this problem before? Is there a fix/workaround?
EDIT: Apparently this is a known problem with dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read it all in memory:
dd.from_pandas(pd.read_csv(input_file), chunksize=25)
This works, but at the cost of parallelism. Any other solution?

For people coming here in 2020, the dd.read_csv works directly for newlines inside quotes. It has been fixed. Update to the latest version of Dask (2.18.1 and above) to get these benefits.
import dask.dataframe as dd
df = dd.read_csv('path_to_your_file.csv')
print(df.compute())
Gives,
html url
0 https://google.com \n
OR
For people who want to use an older version for some reason, as suggested by #mdurant you might wanna pass blocksize=None to dd.read_csv which will be at a cost of parallel loading.

Related

Trailing newline in file when reading JSON in Pyspark results in empty line

When loading JSON data into Spark (v2.4.2) on AWS EMR from S3 using Pyspark, I've observed that a trailing line separator (\n) in the file results in an empty row being created on the end of the Dataframe. Thus, a file with 10,000 lines in it will produce a Dataframe with 10,001 rows, the last of which is empty/all nulls.
The file looks like this:
{line of JSON}\n
{line of JSON}\n
... <-- 9996 similar lines
{line of JSON}\n
{line of JSON}\n
There are no newlines in the JSON itself, i.e. I don't need to read the JSON as multi-line. I am reading it with the following Pyspark command:
df = spark.read.json('s3://{bucket}/{filename}.json.gz')
df.count()
-> 10001
My understanding of this quote from http://jsonlines.org/:
The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.
... is that that last empty line should not be considered. Am I missing something? I haven't seen anyone else on SO or elsewhere having this problem, yet it seems very obvious in practice. I don't see an option in the Spark Python API docs for suppressing empty lines, nor have I been able to work around it by trying different line separators and specifying them in the load command.
I have verified that removing the final line separator results in a Dataframe that has the correct number of lines.

I found the problem. The file I was uploading had an unexpected encoding (UCS-2 LE BOM instead of UTF-8). I should have thought to check it, but didn't. After I switched the encoding to the expected one (UTF-8) the load worked as intended.

Python = dask Vs pandas, error in read_csv

I've got an error on reading a file with dask, which work with pandas :
import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)
while dask gives me an error :
df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999
Answer :
Adding "blocksize=None" make it work :
df = dd.read_csv("./tous_les_docs.csv", blocksize=None)

The documentation says that this could happen
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
It seems Dask chops the file in chunks by line terminator but without scanning the whole file from the start, to see if a line terminator is in a string.

Reading old HDF5 stores created by pandas

I'm having some trouble reading and old HDF5 file that I made with pandas in python 2.7.
At the time I was using the to_hdf method to append groups to the file (e.g. db.to_hdf('File.h5', 'groupNameA', mode='a', data_columns=True, format='table'))
Now when I open the store and get the keys of the groups I find that each one has a slash added to the name ('/groupNameA' in the example above). Attempting to access those groups with store['/groupNameA'], store.select('/groupNameA'), etc. produces TypeError: getattr(): attribute name must be string. Getting that error seems correct (slashes should not be used in these keys) but that doesn't help me get my data into a python 3 environment.
If there's a way to get around this problem in python 3, that'd be great.
Alternatively, I can still load the data in my 2.7 environment. So changing the code for writing the store so that slashes don't get added would probably solve the issue as well.

Which newline character is in my CSV?

We receive a .tar.gz file from a client every day and I am rewriting our import process using SSIS. One of the first steps in my process is to unzip the .tar.gz file which I achieve via a Python script.
After unzipping we are left with a number of CSV files which I then import into SQL Server. As an aside, I am loading using the CozyRoc DataFlow Task Plus.
Most of my CSV files load without issue but I have five files which fail. By reading the log I can see that the process is reading the Header and First line as though there is no HeaderRow Delimiter (i.e. it is trying to import the column header as ColumnHeader1ColumnValue1
I took one of these CSVs, copied the top 5 rows into Excel, used Text-To-Columns to delimit the data then saved that as a new CSV file.
This version imported successfully.
That makes me think that somehow the original CSV isn't using {CR}{LF} as the row delimiter but I don't know how to check. Any suggestions?

I ended up using the suggestion commented by #vahdet because I already had notepad++ installed. I can't find the same option in EmEditor but it may exist
For those who are curious, the files are using {LF} which is consistent with the other files. My investigation continues...

Seeing that you have EmEditor, you can use EmEditor to find the eol character in two ways:
Use View > Character Code Value... at the end of a line to display a dialog box showing information about the character at the current position.
Go to View > Marks and turn on Newline Characters and CR and LF with Different Marks to show the eol while editing. LF is displayed with a down arrow while CRLF is a right angle.
Some other things you could try checking for are: file encoding, wrong type of data for a field and an inconsistent number of columns.

python pandas.DataFrame.to_csv newline issue

Greeting Dear community.
I need to write a python pandas.DataFrame into csv
I try to use something like this:
dfPRR.to_csv(prrDumpName,index=False,quotechar="'",quoting=csv.QUOTE_ALL)
It works fine in some sample but for other sample with long string. I encounter the issue that one record breaks into 2 or 3 different line.
what I want my output file:
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id','demo_part_typ','demo_l_id','1','0','longdemofilename'
Instead what I get...the file breaking into two seperate line::
'RcdLn','GrpPIR','w_id','fwf_id','part_typ','l_id','head_num','site_num','filename'
'2','0','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'1100','1','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'2198','2','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
'3296','3','01','demo_fwf_id
','demo_part_typ','demo_l_id','1','0','longdemofilename'
Is there an option to tell to_csv use a specific record delimitor ?
I do not see that option in documentation of to_csv
What my goal is to create a csv and then a loader program will load the csv
As for now, the loader program cannot load the file when this happens. As it is not able to tell the record is finish or not..?
I see other sample file that the record does not break into 2 or 3 lines when the string is not as long. This is the desired behavior.
How I can enforce this ??

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Dask read_csv fails where pandas doesn't - python

Related

Trailing newline in file when reading JSON in Pyspark results in empty line

Python = dask Vs pandas, error in read_csv

Reading old HDF5 stores created by pandas

Which newline character is in my CSV?

python pandas.DataFrame.to_csv newline issue

Categories

Resources