Error When Reading CSV With C Engine - python

I have a large data file that I'm trying to read into a pandas DataFrame.
If I read it using the following code:
df = pd.read_csv(file_name,
                 sep='|',
                 compression='gzip',
                 skiprows=54,
                 comment='#',
                 names=column_names,
                 header=None,
                 usecols=column_numbers,
                 engine='python',
                 nrows=15347,
                 na_values=["None", " "])
It works perfectly, but not quickly. If I try to use the C engine to speed up the import, though, I get an error message:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 0 fields in line 55, saw 205
It looks like something goes wrong when I change the engine: the parser doesn't work out how many columns, or which ones, it should be using. What I can't figure out is why. None of the arguments I pass are supported only by the Python engine.
The problem only appeared after I upgraded pandas from version 0.14.1 to 0.16.0.
I can't attach a copy of the data, because it contains confidential information.

Related

Getting Python error: Too many columns specified

I am running Python through conda and my terminal. I was given a script that should be able to run without error. The script imports a url and reads it as a csv. This is what I have been given:
url = 'https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html'
data, storm, stormList = readHURDAT2(url)
columnnames= ['a,b,c,etc']
The error begins with the next line:
for line in pd.read_csv(url, header=None, names=columnnames, chunksize=1):
The computer runs several iterations before outputting this error message:
Too many columns specified: expected 20 and found 1
This happens because https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html serves the data as an HTML page, not as CSV. I'd recommend copying the data into a local CSV file and calling read_csv on that. Also, because the file format splits the document into HEADER LINES and DATA LINES, each with a different number of columns, you'd need to set engine='python' to read it. Finally, there is a maximum of 21 columns, not 20.
The code should look something like this:
for line in pd.read_csv('hurdat2.csv',  # <- Here
                        engine='python',  # <- Here
                        header=None,
                        names=columnnames,
                        chunksize=1,
                        ):
    ...  # process each one-row chunk here
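To illustrate the point about HEADER LINES and DATA LINES, here is a minimal sketch on a small made-up sample (not real HURDAT2 data; the column names c0 to c4 are placeholders). It assumes that, with engine='python' and an explicit names list, rows that have fewer fields than names are padded with NaN, which gives one possible way to tell the two kinds of line apart:

```python
import io

import pandas as pd

# Made-up sample: one short "header" line (3 fields) followed by
# two "data" lines (5 fields). This is NOT real HURDAT2 content.
sample = (
    "AL011851,UNNAMED,2\n"
    "18510625,0000,HU,28.0N,94.8W\n"
    "18510625,0600,HU,28.0N,95.4W\n"
)

# Placeholder column names, one per field of the widest line.
names = ["c0", "c1", "c2", "c3", "c4"]

df = pd.read_csv(io.StringIO(sample), header=None, names=names, engine="python")

# Short lines are padded with NaN, so a missing value in a column that
# only data lines fill marks the row as a header line.
is_header = df["c3"].isna()
data_rows = df[~is_header]
```

The same idea should carry over to the real file once names lists all 21 columns.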

Dask ParserError: Error tokenizing data when reading CSV

I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:
ddf = dd.read_csv(
"s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
parse_dates=["pickup_datetime", "dropoff_datetime"],
blocksize=None,
dtype={
"tolls_amount": "float64",
"store_and_fwd_flag": "object",
},
)
ddf.to_parquet(
"s3://coiled-datasets/nyc-tlc/2010",
engine="pyarrow",
compression="snappy",
write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Adding blocksize=None helps sometimes, see here for example, and I'm not sure why it's not solving my issue.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something odd in the 2010 taxi data that's causing this issue.
The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (one too many commas). This is the offending line (middle) and its neighbours:
VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001
Some of the options are:
set the on_bad_lines kwarg of pandas.read_csv to 'warn' or 'skip' (this should also be possible with dask.dataframe);
fix the raw file (knowing where the error is) with something like sed, assuming you can modify the raw files, or fix it on the fly by reading the file line by line.
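The "on the fly" option could look roughly like the sketch below. It is only an illustration on made-up data: the helper name keep_well_formed and the three-column sample are assumptions, and it drops malformed lines instead of repairing them. It also assumes no quoted field contains an embedded newline.

```python
import csv
import io

import pandas as pd


def keep_well_formed(lines, expected_fields):
    """Yield only the lines that parse to exactly `expected_fields` CSV fields."""
    for line in lines:
        fields = next(csv.reader([line]))
        if len(fields) == expected_fields:
            yield line


# Made-up stand-in for the raw file; the third line has one comma too many.
raw = io.StringIO(
    "vendor,passengers,fare\n"
    "VTS,5,11.7\n"
    "CMT,1,,29.3\n"
    "VTS,1,9.3\n"
)

# Filter line by line, then hand the cleaned text to pandas.
cleaned = io.StringIO("".join(keep_well_formed(raw, expected_fields=3)))
df = pd.read_csv(cleaned)
```

For files as large as the taxi data, writing the filtered lines to a new file rather than holding them in memory would probably be the safer choice.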

List all Pandas ParserError when using pd.read_csv()

I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I keep adding each failing line to skiprows and re-running. The file has over 30K lines, so I would prefer to find all the bad lines at once rather than running the cell in Jupyter Notebook, getting a new error, and updating the list. Alternatively, I wish it would just skip the bad lines and parse the rest. I've tried googling for a way to do that, but all the SO answers were too complicated for me to follow and reproduce for my data.
P.S. Why is it that when using skiprows with just one line, like 177, I can enter skiprows=177, but when using skiprows with a list, I have to use the errored line number minus 1? Why does the counting change?
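On the P.S.: as far as the read_csv documentation describes it, an integer skiprows means "skip this many lines from the start of the file", while a list holds 0-indexed line numbers to skip. The error message counts lines from 1, hence the minus 1 for the list form. A small sketch on made-up data shows the difference:

```python
import io

import pandas as pd

# Made-up file: a header line "x" followed by four data lines.
sample = "x\n10\n20\n30\n40\n"

# Integer: skip the first 2 lines of the file (the header and "10"),
# so "20" ends up being read as the header.
df_int = pd.read_csv(io.StringIO(sample), skiprows=2)

# List: skip only the line with 0-based index 2 (the header is index 0),
# which drops the "20" line and keeps everything else.
df_list = pd.read_csv(io.StringIO(sample), skiprows=[2])
```

So skiprows=177 does not skip line 177 alone; it discards everything up to and including it.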
pandas ≥ 1.3
You should use the on_bad_lines parameter of read_csv (pandas ≥ 1.3.0):
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip' you skip the lines without warning. The default value of on_bad_lines='error' raises an error for the first issue and aborts.
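If the goal is to list every bad line in one pass, on_bad_lines also accepts a callable in newer pandas (1.4 and later, and only with engine='python', according to the documentation). A minimal sketch on made-up data, where remember_bad_line is a hypothetical helper name:

```python
import io

import pandas as pd

# Made-up sample: the third line has one field too many.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

bad_rows = []


def remember_bad_line(fields):
    """Record the offending line (already split into fields) and drop it."""
    bad_rows.append(fields)
    return None  # returning None tells pandas to skip this line


df = pd.read_csv(
    io.StringIO(sample),
    on_bad_lines=remember_bad_line,
    engine="python",  # the callable form requires the Python engine
)
```

After the call, bad_rows holds every rejected line, so they can be inspected or repaired separately.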
pandas < 1.3
On older versions, the equivalent parameters are error_bad_lines=False and warn_bad_lines=True.

Pandas Read Irregular Data From Clipboard

I am trying to copy data from this CSV file and read it via Pandas.read_clipboard().
I keep getting this error:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 6, saw 7
Is it possible to read in data like this? It works with read_csv (encoding='latin-1') but not read_clipboard.
Thanks in advance!
Use the skiprows=6 parameter to ignore the lines at the top of the file, which look like a header rather than part of the core dataframe:
df = pd.read_clipboard(sep='\t', skiprows=6)

Error Tokenizing Data

I have a csv file from a collaborator. He told me I could read it into Python using
import csv
t = []
# Binary mode is the Python 2 idiom; on Python 3 use open("measles.csv", newline="")
f = open("measles.csv", "rb")
d = csv.reader(f, quotechar='"', delimiter="\t", lineterminator='\r\n')
for row in d:
    t.append(row)
I tried to make a dataframe out of the data by using pd.DataFrame(t[1:], columns=t[0]), which was successful. However, when I write the resulting dataframe to csv and then try to read it back in using pd.read_csv, I get the following error:
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I suspect it has to do with the way the data was originally given to me. I've tried error_bad_lines=False, but that doesn't seem to work. Any advice?
Try passing the engine='python' parameter to pd.read_csv, for example:
df = pd.read_csv(file_name , engine='python')
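If you want to keep the speed of the C engine for files that parse cleanly, one option is to fall back to the Python engine only when the C engine fails. The sketch below is an illustration under assumptions: read_csv_with_fallback is a hypothetical helper, the file is a throwaway example, and it catches pd.errors.ParserError (older pandas versions named this CParserError).

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, **kwargs):
    """Try the fast C engine first; retry with the Python engine on a parse error."""
    try:
        return pd.read_csv(path, engine="c", **kwargs)
    except pd.errors.ParserError:
        return pd.read_csv(path, engine="python", **kwargs)


# Usage on a small temporary file written just for this example.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv")
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_csv(path, index=False)
    df = read_csv_with_fallback(path)
```

The fallback only helps with input the Python engine tolerates; a genuinely malformed file will still raise from the second call.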
