I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I am just updating skiprows to include the newly errored line and re-running until the CSV parses. I have over 30K lines and would prefer to do this all at once rather than hitting Run in Jupyter Notebook, getting a new error, and updating again. Ideally, I would like it to just skip the bad lines and parse the rest. I've tried googling for a solution along those lines, but all the SO answers were too complicated for me to follow and reproduce for my data structure.
P.S. Why is it that when skipping just one line, like 177, I can simply enter skiprows=177, but when passing skiprows a list, I have to use the errored line number minus 1? Why does the counting change?
pandas ≥ 1.3
Use the on_bad_lines parameter of read_csv (available since pandas 1.3.0):
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip' you skip the lines without warning. The default value of on_bad_lines='error' raises an error for the first issue and aborts.
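For example, to drop the bad rows silently instead of warning:
df1 = pd.read_csv('df1.csv', on_bad_lines='skip')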
pandas < 1.3
The parameters are error_bad_lines=False and warn_bad_lines=True.
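For completeness, a minimal sketch of the equivalent call with those legacy keyword arguments (deprecated since 1.3.0 in favour of on_bad_lines):
df1 = pd.read_csv('df1.csv', error_bad_lines=False, warn_bad_lines=True)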
Related
I am running Python through conda and my terminal. I was given a script that should run without error. The script fetches a URL and reads it as a CSV. This is what I have been given:
url = 'https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html'
data, storm, stormList = readHURDAT2(url)
columnnames= ['a,b,c,etc']
The error begins with the next line:
for line in pd.read_csv(url, header=None, names=columnnames, chunksize=1):
The computer runs several iterations before outputting this error message:
Too many columns specified: expected 20 and found 1
This happens because the data at https://www.aoml.noaa.gov/hrd/hurdat/hurdat2.html is in HTML format. I'd recommend you copy and paste the data into a local CSV file and run read_csv on that. Also, because the file has a specific format that splits the document into HEADER LINES and DATA LINES, each with a different number of columns, you need to set engine='python' to read it. Finally, the maximum number of columns is 21, not 20.
The code should look something like this:
for line in pd.read_csv('hurdat2.csv',    # <- Here
                        engine='python',  # <- Here
                        header=None,
                        names=columnnames,
                        chunksize=1,
                        ):
    ...  # process each 1-row chunk here
I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)

ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Adding blocksize=None sometimes helps (see here for an example), but I'm not sure why it isn't solving my issue.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something odd in the 2010 taxi data that's causing this issue.
The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (one too many commas). This is the offending line (middle) and its neighbours:
VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001
Some of the options are:
the on_bad_lines kwarg to pandas can be set to 'warn' or 'skip' (this should also be possible with dask.dataframe; see the sketch after this list);
fix the raw file (knowing where the error is) with something like sed (assuming you can modify the raw files) or on the fly by reading the file line by line.
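For option 1, a minimal sketch (assuming pandas ≥ 1.3 is installed, since dask.dataframe.read_csv forwards unknown keyword arguments such as on_bad_lines straight to pandas.read_csv):
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
    on_bad_lines="warn",  # or "skip" to drop the malformed rows silently
)
For option 2, a sketch of the on-the-fly repair; the local file name and the field count of 18 are only illustrative, taken from the error message:
expected_fields = 18
with open("yellow_tripdata_2010-02.csv") as src, open("fixed.csv", "w") as dst:
    for line in src:
        # keep only rows with exactly 18 fields (17 commas)
        if line.count(",") == expected_fields - 1:
            dst.write(line)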
I am using tabula-py to read data from some pdfs, but keep getting this error.
Exception has occurred: ParserError
Error tokenizing data. C error: Expected 1 fields in line 51, saw 2
The PDF I am reading from is almost exactly the same as the one I built this code around: I developed the code while testing against another PDF, and am now switching to a new, updated one that has the same format and style, but the code now fails and throws this error.
Not sure what I am doing wrong / why this code that previously worked no longer works.
Code snippet:
tabula.convert_into_by_batch("-----", stream=True, output_format='csv', pages='11-57')
path = "-------"
filenamelist = os.listdir(path)
updated_path = path + "\\" + filenamelist[0]
new_frame = pd.read_csv(updated_path, skiprows=2, encoding='ISO-8859-1')  # error thrown here
Converting a PDF to CSV is not a perfect transformation. Converting anything out of a PDF is actually quite difficult and can be finicky no matter which library you're using. Your error says that on line 51 of the converted CSV there is a comma that pandas did not expect to see. In all of the rows leading up to the "bad" row there was only a single value (no commas), so pandas expected to see 1 field. Then on row 51 it encountered either 2 values, or a value with a trailing comma, which makes this an improperly formatted CSV.
import pandas as pd
import io
bad_csv_file = io.StringIO("""
A
1
2
3
99
50,
100
""".strip())
pd.read_csv(bad_csv_file)
Output:
Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Note that there's an extra comma on line 6 that leads to the above error. Simply removing that extra trailing comma resolves this error.
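If hand-editing each converted CSV is not practical, a workaround in the same spirit as the first answer above (a sketch, assuming pandas ≥ 1.3 and reusing the updated_path variable from the question) is to let the parser report or drop the malformed rows:
new_frame = pd.read_csv(updated_path,
                        skiprows=2,
                        encoding='ISO-8859-1',
                        on_bad_lines='warn')  # or 'skip' to drop the bad rows silently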
I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my data to perform this.
I then save this data frame into a CSV file. When the CSV is loaded elsewhere, some rows end up attached to the row above rather than holding all of their data on a single row. I discovered this issue when loading the CSV into R and uploading it into Domo.
When I run the following command in a Jupyter notebook, the CSV loads fine:
sb_2019 = pd.read_csv('flat_tweets.csv',lineterminator='\n',low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step that eliminates the need for the lineterminator argument. I need to open the CSV in platforms and languages that do not have this option. How might I go about doing this?
Note:
I am working with over 700k tweets. json_normalize works great on small pieces of my data, including the pieces where the issues appear; it is only when I run json_normalize on the whole dataset that I hit this issue.
Try using '\r' as the lineterminator instead of '\n' (note that read_csv only accepts a single-character lineterminator, so '\r\n' cannot be used).
This solution may be helpful too, opening the file in universal-newline mode (note that the 'U' mode flag is deprecated in Python 3 and was removed in 3.11; newline=None, the default for open(), gives the same behaviour):
sb_2019 = pd.read_csv(open('flat_tweets.csv','rU'), encoding='utf-8', low_memory=False)
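For the post-processing step asked about above, one possible approach (a sketch; it assumes the stray carriage returns and newlines inside the tweet text are what break the row structure for other readers, and the output file name is illustrative) is to load the file once with the workaround, clean the text columns, and write a new CSV that R, Domo, etc. can read without any special line-terminator handling:
import pandas as pd

# Load once with the lineterminator workaround.
sb_2019 = pd.read_csv('flat_tweets.csv', lineterminator='\n', low_memory=False)

# Replace stray carriage returns / newlines inside string fields with spaces.
text_cols = sb_2019.select_dtypes(include='object').columns
sb_2019[text_cols] = sb_2019[text_cols].replace({'\r': ' ', '\n': ' '}, regex=True)

# Write a clean CSV that other tools can read with their default settings.
sb_2019.to_csv('flat_tweets_clean.csv', index=False)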
I have a large data file that I'm trying to read into a Pandas Dataframe.
If I try to read it using the following code:
df = pd.read_csv(file_name,
                 sep='|',
                 compression='gzip',
                 skiprows=54,
                 comment='#',
                 names=column_names,
                 header=None,
                 usecols=column_numbers,
                 engine='python',
                 nrows=15347,
                 na_values=["None", " "])
It works perfectly, but not quickly. If I try to use the C engine to speed the import up though, I get an error message:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 0 fields in line 55, saw 205
It looks like something goes wrong when I change the engine: the parser isn't figuring out how many (or which) columns it should be using. What I can't figure out is why; none of the input arguments are only supported by the Python engine.
The problem only occurred after I upgraded from pandas 0.14.1 to 0.16.0.
I can't attach a copy of the data, because it contains confidential information.