I am getting the following error when I try to upload a csv file after deleting a few rows in Numbers on Mac:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 2
To read the file I am using
df=pd.read_csv('path/file_name.csv')
Do you know the reason why I am getting that error message? The rows look fine to me.
Thanks
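One quick way to locate the row that trips the parser (a sketch only, assuming the path above and comma-separated data) is to count the fields on each line with the csv module:
import csv
# Print every line whose field count differs from 1, the count pandas expected.
with open('path/file_name.csv', newline='') as fh:
    for line_no, row in enumerate(csv.reader(fh), start=1):
        if len(row) != 1:
            print(line_no, row)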
Hard to say without a sample of the data, but you could try either to
set the sep parameter if your file is not comma-separated (comma is the default), or
switch the engine to Python by passing engine="python".
df = pd.read_csv('path/file_name.csv', sep=';', engine='python')
But maybe it's a problem in the file itself, and one or more rows have more fields than the others. In that case you can drop them instead of raising an error by setting error_bad_lines to False.
df = pd.read_csv('path/file_name.csv', error_bad_lines=False)
Lines with too many fields (e.g. a csv line with too many commas) will by default cause an exception to be raised, and no DataFrame will be returned. If False, then these "bad lines" will be dropped from the DataFrame that is returned.
-- pandas.read_csv
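Note that in newer pandas releases (1.3 and later) error_bad_lines is deprecated in favour of on_bad_lines; a sketch of the equivalent call, assuming the same file as above:
import pandas as pd
# 'skip' silently drops rows that have too many fields (pandas >= 1.3).
df = pd.read_csv('path/file_name.csv', on_bad_lines='skip')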
Try with:
df = pd.read_csv("path/file_name.csv", sep="<separator>", names="<columns>", error_bad_lines=<True/False>)
Could you write more info?
Related
I am getting an error while reading a big file, around 3 million lines, using the read_json function from pandas.
The error that I get is given below for reference:
ValueError: Unmatched ''"' when when decoding 'string'
I was able to identify the issue: there was an incomplete line in the file, and that was giving me the error. The line is given below for reference:
{"Asset":"
The correct line would look something like this,
{"Asset":"somesite", "Stream":"Company", "Tag":"SomeTag"}
I read the big file using read_json like this:
for lines in pd.read_json(file_name, lines=True, convert_dates=False, chunksize=100000):
    # do some processing on lines
I am not able to use try/except inside the for loop because the error happens at the for loop line itself.
Just want to know if there is an effective way to handle this error without correcting the line in the big json file.
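One possible approach, sketched here under the assumption that dropping the malformed lines is acceptable, is to filter out lines that do not parse as JSON before handing the stream to pd.read_json. json.loads is used only as a validity check, and the filtered stream is buffered in memory, which may matter for a 3-million-line file:
import io
import json
import pandas as pd

def valid_json_lines(path):
    # Yield only the lines that parse as complete JSON objects.
    with open(path) as fh:
        for line in fh:
            try:
                json.loads(line)
                yield line
            except ValueError:
                continue  # skip incomplete lines such as {"Asset":"

clean = io.StringIO("".join(valid_json_lines(file_name)))
for lines in pd.read_json(clean, lines=True, convert_dates=False, chunksize=100000):
    ...  # same processing as before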
So I've been trying to upload a dataframe to a specific table in MSSQL using the bcpandas library. However, there's an issue with the data, which has a lot of strings that contain many different characters.
The code that I'm using is the following:
from bcpandas import SqlCreds, to_sql
creds = SqlCreds(
    'server',
    'dbo',
    'username',
    'password'
)
to_sql(df, 'targeted_table', creds, index=False, if_exists='append', schema='test')
However, any time I try to upload the data it yields this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\jdc33\AppData\Local\Programs\Python\Python39\lib\site-packages\bcpandas\main.py", line 394, in to_sql
delim = get_delimiter(df) if delimiter is None else delimiter
File "C:\Users\jdc33\AppData\Local\Programs\Python\Python39\lib\site-packages\bcpandas\constants.py", line 68, in get_delimiter
raise BCPandasValueError(error_msg.format(typ="delimiter", opts=_DELIMITER_OPTIONS))
bcpandas.constants.BCPandasValueError: Data contains all of the possible delimiter characters (',', '|', '\t'),
cannot use BCP to import it. Replace one of the possible delimiter characters in
your data, or use another method besides bcpandas.
Further background:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/specify-field-and-row-terminators-sql-server#characters-supported-as-terminators
What I think is happening is that the rows contain a lot of strings with all of the delimiters (',', '|', '\t') mentioned in the error above, which creates an issue with how the data is uploaded. I've tried to restrict the delimiter to just one of them by ingesting the file like this:
testdf= pd.read_csv('data.csv',delimiter=',')
But the error keeps showing up.
Has anyone encountered this error and figured out how to fix it?
Any assistance would be really helpful.
So I managed to fix the issue with a simple change. As described in the question above, the problem was that the delimiter characters were present in the data in some columns. After a deep dive into the data, and so as not to compromise its integrity, I went through and replaced all instances of the "," string so the data could be ingested with bcpandas.
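A sketch of that kind of clean-up in pandas; replacing commas with an empty string follows the answer above, while the choice to touch every string column is an assumption:
# Strip commas from every string column before handing the frame to bcpandas.
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.replace(',', '', regex=False))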
When loading JSON data into Spark (v2.4.2) on AWS EMR from S3 using Pyspark, I've observed that a trailing line separator (\n) in the file results in an empty row being created on the end of the Dataframe. Thus, a file with 10,000 lines in it will produce a Dataframe with 10,001 rows, the last of which is empty/all nulls.
The file looks like this:
{line of JSON}\n
{line of JSON}\n
... <-- 9996 similar lines
{line of JSON}\n
{line of JSON}\n
There are no newlines in the JSON itself, i.e. I don't need to read the JSON as multi-line. I am reading it with the following Pyspark command:
df = spark.read.json('s3://{bucket}/{filename}.json.gz')
df.count()
-> 10001
My understanding of this quote from http://jsonlines.org/:
The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.
... is that that last empty line should not be considered. Am I missing something? I haven't seen anyone else on SO or elsewhere having this problem, yet it seems very obvious in practice. I don't see an option in the Spark Python API docs for suppressing empty lines, nor have I been able to work around it by trying different line separators and specifying them in the load command.
I have verified that removing the final line separator results in a Dataframe that has the correct number of lines.
I found the problem. The file I was uploading had an unexpected encoding (UCS-2 LE BOM instead of UTF-8). I should have thought to check it, but didn't. After I switched the encoding to the expected one (UTF-8) the load worked as intended.
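A sketch of the kind of check that would have caught it, assuming a local copy of the gzipped file (the path here is a placeholder): peek at the first bytes of the decompressed stream and look for a byte-order mark.
import gzip
# b'\xff\xfe' at the start indicates UTF-16/UCS-2 LE with a BOM,
# b'\xef\xbb\xbf' indicates a UTF-8 BOM; plain UTF-8 usually has neither.
with gzip.open('local_copy.json.gz', 'rb') as fh:
    print(fh.read(4))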
I've got an error when reading a file with Dask that works fine with pandas:
import dask.dataframe as dd
import pandas as pd
pdf = pd.read_csv("./tous_les_docs.csv")
pdf.shape
(20140796, 7)
while Dask gives me an error:
df = dd.read_csv("./tous_les_docs.csv")
df.describe().compute()
ParserError: Error tokenizing data. C error: EOF inside string starting at line 192999
Answer:
Adding blocksize=None makes it work:
df = dd.read_csv("./tous_les_docs.csv", blocksize=None)
The documentation says that this can happen:
It should also be noted that this function may fail if a CSV file
includes quoted strings that contain the line terminator. To get
around this you can specify blocksize=None to not split files into
multiple partitions, at the cost of reduced parallelism.
http://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
It seems Dask splits the file into chunks at line terminators without scanning the whole file from the start, so it cannot tell whether a terminator falls inside a quoted string.
Trying to use Dask's read_csv on a file that pandas's read_csv handles fine, like this:
dd.read_csv('data/ecommerce-new.csv')
fails with the following error:
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 2
The file is a CSV of data scraped with Scrapy, with two columns: one with the URL and the other with the HTML (which is stored as multi-line text quoted with "). Since pandas actually parses it, the file should be well-formed.
html,url
https://google.com,"<a href=""link"">
</a>"
Making the sample argument big enough to load the entire file in memory seems to work, which makes me believe it actually fails when trying to infer the datatypes (there's also this issue, which should have been solved: https://github.com/dask/dask/issues/1284).
Has anyone encountered this problem before? Is there a fix/workaround?
EDIT: Apparently this is a known problem with dask's read_csv if the file contains a newline character between quotes. A solution I found was to simply read it all in memory:
dd.from_pandas(pd.read_csv(input_file), chunksize=25)
This works, but at the cost of parallelism. Any other solution?
For people coming here in 2020: dd.read_csv now handles newlines inside quotes directly. It has been fixed; update to a recent version of Dask (2.18.1 and above) to get this behaviour.
import dask.dataframe as dd
df = dd.read_csv('path_to_your_file.csv')
print(df.compute())
Gives,
html url
0 https://google.com \n
OR
For people who want to use an older version for some reason, as suggested by @mdurant, you might want to pass blocksize=None to dd.read_csv, which will come at the cost of parallel loading.
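A sketch of that call for older Dask versions, using the same file path as in the question:
import dask.dataframe as dd
# blocksize=None keeps each file in a single partition, so quoted newlines are
# never split across chunk boundaries, at the cost of parallel loading.
df = dd.read_csv('data/ecommerce-new.csv', blocksize=None)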