I've got an error reading a file with dask that works fine with pandas:
import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)
while dask gives me an error:
df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999
Answer:
Adding "blocksize=None" makes it work:
df = dd.read_csv("./tous_les_docs.csv", blocksize=None)
The documentation says that this can happen:
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
It seems Dask splits the file into chunks at line terminators without scanning the whole file from the start, so it cannot tell whether a given terminator falls inside a quoted string.
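If you still need parallelism afterwards, one option (a sketch, not part of the original answer; it assumes pyarrow or fastparquet is installed, and the Parquet path and npartitions value are illustrative) is to let pandas do the one slow, correct parse and persist the result in a format Dask can split safely:
import dask.dataframe as dd
import pandas as pd

pdf = pd.read_csv("./tous_les_docs.csv")          # pandas handles the quoted newlines correctly
dd.from_pandas(pdf, npartitions=8).to_parquet("./tous_les_docs.parquet")  # 8 partitions is arbitrary
df = dd.read_parquet("./tous_les_docs.parquet")   # later loads stay parallel
print(df.describe().compute())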
Related
I am getting an error while reading a big file, around 3 million lines, using the read_json function from pandas.
The error that I get is given below for reference:
ValueError: Unmatched ''"' when when decoding 'string'
I was able to identify the issue: there was an incomplete line in the file, and that was giving me the error. The line is given below for reference,
{"Asset":"
The correct line would look something like this,
{"Asset":"somesite", "Stream":"Company", "Tag":"SomeTag"}
I read the big file using read_json like this:
for lines in pd.read_json(file_name, lines=True, convert_dates=False, chunksize=100000):
    # do some processing on lines
    ...
I am not able to use try/except around the loop because the error happens at the for loop line itself.
Just want to know if there is an effective way to handle this error without correcting the line in the big json file.
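One possible workaround (a sketch, not from the thread; the file name and chunk size are illustrative) is to skip lines that fail to parse before handing them to pandas, instead of letting read_json hit the truncated record:
import json
import pandas as pd

def valid_chunks(path, chunksize=100000):
    # Parse line by line, skipping lines that are not complete JSON objects
    # (e.g. the truncated {"Asset":" line), and yield DataFrames chunk by chunk.
    batch = []
    with open(path) as f:
        for line in f:
            try:
                batch.append(json.loads(line))
            except ValueError:  # json.JSONDecodeError is a subclass of ValueError
                continue
            if len(batch) == chunksize:
                yield pd.DataFrame.from_records(batch)
                batch = []
    if batch:
        yield pd.DataFrame.from_records(batch)

for lines in valid_chunks("big_file.json"):
    ...  # do some processing on lines, as before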
I have a Windows 10 laptop and I am trying to read in a csv file in Python.
I have tried this code:
import pandas as pd
df = pd.read_csv("C:\Users\dcolu\OneDrive\Documents\tennis.csv")
I copied this path above directly from my Files Explorer.
I have also tried:
import pandas as pd
df = pd.read_csv("tennis.csv")
and both still give me the same error message: No such file or directory
If the text in the question's snippet was cut-n-pasted from the editor to the browser as-is, then we can see there's hidden unicode data in the line of source code.
>>> r"""df = pd.read_csv("C:\Users\dcolu\OneDrive\Documents\tennis.csv")"""
'df = pd.read_csv(\u202a"C:\\Users\\dcolu\\OneDrive\\Documents\\tennis.csv")'
Note the \u202a above, right after the left parenthesis. The \u marks it as a Unicode code point, i.e. a character (here U+202A, an invisible directional-formatting character).
Which causes the SyntaxError:
df = pd.read_csv("C:\Users\dcolu\OneDrive\Documents\tennis.csv")
File "<input>", line 1
df = pd.read_csv("C:\Users\dcolu\OneDrive\Documents\tennis.csv")
^
SyntaxError: invalid character in identifier
The Python interpreter thinks it's parsing an identifier.
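One quick way to confirm this (a small sketch; the string below reproduces the pasted line) is to list any non-ASCII code points hiding in it:
line = 'df = pd.read_csv(\u202a"C:\\Users\\dcolu\\OneDrive\\Documents\\tennis.csv")'
hidden = [hex(ord(c)) for c in line if ord(c) > 127]
print(hidden)  # ['0x202a'] -> U+202A, invisible in most editors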
import pandas as pd
df = pd.read_csv(r"C:\Users\dcolu\OneDrive\Documents\tennis.csv")
Be sure to put the letter "r" before the file path to make it a raw string, so the backslashes are not interpreted as escape sequences.
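Alternatively (a sketch using the path from the question), forward slashes or pathlib sidestep backslash escaping entirely:
import pandas as pd
from pathlib import Path

df = pd.read_csv("C:/Users/dcolu/OneDrive/Documents/tennis.csv")            # forward slashes work on Windows
df = pd.read_csv(Path("C:/Users/dcolu/OneDrive/Documents") / "tennis.csv")  # or build the path with pathlib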
I am getting the following error when I try to load a csv file after deleting a few rows in Numbers on Mac:
ParserError: Error tokenizing data. C error: Expected 1 fields in line
5, saw 2
To read the file I am using
df=pd.read_csv('path/file_name.csv')
Do you know the reason why I am getting that error message? Rows seem to be ok.
Thanks
Hard to say without a sample of the data, but you could try either of the following:
set the sep parameter if your file is not separated by a comma , (which is the default value)
switch the engine to Python by setting the engine="python" parameter.
df = pd.read_csv('path/file_name.csv', sep=';', engine='python')
But maybe it's a problem in the file itself, and one or more rows have more fields than the others. In this case you can drop them instead of raising an error by setting error_bad_lines to False.
df = pd.read_csv('path/file_name.csv', error_bad_lines=False)
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned.
-- pandas.read_csv
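Note that on pandas 1.3 and later error_bad_lines is deprecated; the replacement is on_bad_lines (a sketch, the separator is only an example):
# pandas >= 1.3
df = pd.read_csv('path/file_name.csv', sep=';', on_bad_lines='skip')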
Try with:
df = pd.read_csv("path/file_name.csv", sep="<separator>", names="<columns>", error_bad_lines=<True/False>)
Could you provide more info?
When loading JSON data into Spark (v2.4.2) on AWS EMR from S3 using Pyspark, I've observed that a trailing line separator (\n) in the file results in an empty row being created on the end of the Dataframe. Thus, a file with 10,000 lines in it will produce a Dataframe with 10,001 rows, the last of which is empty/all nulls.
The file looks like this:
{line of JSON}\n
{line of JSON}\n
... <-- 9996 similar lines
{line of JSON}\n
{line of JSON}\n
There are no newlines in the JSON itself, i.e. I don't need to read the JSON as multi-line. I am reading it with the following Pyspark command:
df = spark.read.json('s3://{bucket}/{filename}.json.gz')
df.count()
-> 10001
My understanding of this quote from http://jsonlines.org/:
The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.
... is that that last empty line should not be considered. Am I missing something? I haven't seen anyone else on SO or elsewhere having this problem, yet it seems very obvious in practice. I don't see an option in the Spark Python API docs for suppressing empty lines, nor have I been able to work around it by trying different line separators and specifying them in the load command.
I have verified that removing the final line separator results in a Dataframe that has the correct number of lines.
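One partial workaround (a sketch; it only hides the symptom rather than fixing the cause) is to drop all-null rows after loading:
df = spark.read.json('s3://{bucket}/{filename}.json.gz')
df = df.na.drop(how='all')   # drop rows where every column is null
df.count()                   # 10000 if the trailing row was the only all-null one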
I found the problem. The file I was uploading had an unexpected encoding (UCS-2 LE BOM instead of UTF-8). I should have thought to check it, but didn't. After I switched the encoding to the expected one (UTF-8) the load worked as intended.
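If converting the file up front isn't convenient, Spark's JSON reader can also be told the encoding explicitly (a sketch; the encoding name has to match the actual file, assumed here to be UTF-16 little-endian, and some versions also want lineSep set for non-UTF-8 line-delimited JSON):
df = (spark.read
      .option("encoding", "UTF-16LE")   # match the real file encoding
      .option("lineSep", "\n")          # may be required for non-UTF-8 per-line JSON
      .json('s3://{bucket}/{filename}.json.gz'))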
Trying to use dask's read_csv on a file that pandas's read_csv reads fine. Calling it like this:
dd.read_csv('data/ecommerce-new.csv')
fails with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The file is a CSV of data scraped with scrapy, with two columns: one with the URL and the other with the HTML (which is stored multi-line, using " as the quoting character). Since pandas parses it successfully, it should be well-formatted.
html,url
https://google.com,"<a href=""link"">
</a>"
Making the sample argument big enough to load the entire file into memory seems to work, which makes me believe it actually fails when trying to infer the datatypes (there's also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
Has anyone encountered this problem before? Is there a fix/workaround?
EDIT: Apparently this is a known problem with dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read it all into memory:
dd.from_pandas(pd.read_csv(input_file), chunksize=25)
This works, but at the cost of parallelism. Any other solution?
For people coming here in 2020: dd.read_csv now handles newlines inside quotes directly. It has been fixed; update to a recent version of Dask (2.18.1 and above) to get this behavior.
import dask.dataframe as dd
df = dd.read_csv('path_to_your_file.csv')
print(df.compute())
Gives,
html url
0 https://google.com \n
OR
For people who want to use an older version for some reason: as suggested by @mdurant, you might want to pass blocksize=None to dd.read_csv, which will come at the cost of parallel loading.
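For completeness, that older-version workaround looks like this (same caveat: the whole file becomes a single partition):
import dask.dataframe as dd
df = dd.read_csv('path_to_your_file.csv', blocksize=None)  # no mid-file splits, so quoted newlines are safe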