Dask ParserError: Error tokenizing data when reading CSV - python

I am getting the same error as this question, but the recommended solution of setting blocksize=None isn't solving the issue for me. I'm trying to convert the NYC taxi data from CSV to Parquet and this is the code I'm running:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    dtype={
        "tolls_amount": "float64",
        "store_and_fwd_flag": "object",
    },
)
ddf.to_parquet(
    "s3://coiled-datasets/nyc-tlc/2010",
    engine="pyarrow",
    compression="snappy",
    write_metadata_file=False,
)
Here's the error I'm getting:
"ParserError: Error tokenizing data. C error: Expected 18 fields in line 2958, saw 19".
Adding blocksize=None helps sometimes, see here for example, and I'm not sure why it's not solving my issue.
Any suggestions on how to get past this issue?
This code works for the 2011 taxi data, so there must be something weird in the 2010 taxi data that's causing this issue.

The raw file s3://nyc-tlc/trip data/yellow_tripdata_2010-02.csv contains an error (one too many commas). This is the offending line (middle) and its neighbours:
VTS,2010-02-16 08:02:00,2010-02-16 08:14:00,5,4.2999999999999998,-73.955112999999997,40.786718,1,,-73.924710000000005,40.841335000000001,CSH,11.699999999999999,0,0.5,0,0,12.199999999999999
CMT,2010-02-24 16:25:18,2010-02-24 16:52:14,1,12.4,-73.988956000000002,40.736567000000001,1,,,-73.861762999999996,40.768383999999998,CAS,29.300000000000001,1,0.5,0,4.5700000000000003,35.369999999999997
VTS,2010-02-16 07:58:00,2010-02-16 08:09:00,1,2.9700000000000002,-73.977469999999997,40.779359999999997,1,,-74.004427000000007,40.742137999999997,CRD,9.3000000000000007,0,0.5,1.5,0,11.300000000000001
Some of the options are:
the on_bad_lines kwarg to pandas can be set to 'warn' or 'skip' (this should also be possible with dask.dataframe; see the sketch after this list);
fix the raw file (knowing where the error is) with something like sed (assuming you can modify the raw files), or fix it on the fly by reading the file line by line.
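For the first option, here's a minimal sketch. It assumes dd.read_csv forwards the extra keyword to pandas.read_csv, which requires pandas >= 1.3:
import dask.dataframe as dd

ddf = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2010-*.csv",
    parse_dates=["pickup_datetime", "dropoff_datetime"],
    blocksize=None,
    on_bad_lines="warn",  # or "skip" to drop malformed rows silently
    dtype={"tolls_amount": "float64", "store_and_fwd_flag": "object"},
)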

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This is the script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File" in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows", as you can see. The data in the file is separated column-wise by vertical lines--at least, that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, as well as a load of vertical bars between each value. I assume it is somehow not interpreting the table correctly and doesn't understand what's a column and what's a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of "pd.read_csv", but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:, 0]. With iloc you can extract a column or row by its index; [:, 0] selects all rows of the first column.
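For example, a hedged sketch: the sep='|' is an assumption based on the vertical bars you describe, and the path is the placeholder from your question:
import pandas as pd

input_file = "location\\filename"
# sep='|' tells pandas that columns are delimited by vertical bars
dataset = pd.read_csv(input_file, sep='|', skiprows=12)
first_column = dataset.iloc[:, 0]   # all rows of the first column
two_columns = dataset.iloc[:, 0:2]  # all rows of the first two columns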
The file is incorrectly named.
I expect that you are reading a csv, xlsx, or txt file, so the (Windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'

List all Pandas ParserErrors when using pd.read_csv()

I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I have just been updating skiprows to include each errored line and continuing to parse the csv. I have over 30K lines and would prefer to do this all at once rather than hitting run in Jupyter Notebook, getting a new error, and updating. Otherwise, I wish it would just skip the errors and parse the rest; I've tried googling a solution along those lines, but all the SO responses were too complicated for me to follow and reproduce for my data structures.
P.S. Why is it that when using skiprows with just one line, like 177, I can just enter skiprows = 177, but when using skiprows with a list, I have to enter the errored line minus 1? Why does the counting change?
pandas ≥ 1.3
You should use the on_bad_lines parameter of read_csv (pandas ≥ 1.3.0)
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip', the lines are skipped without a warning. The default, on_bad_lines='error', raises an error on the first bad line and aborts.
pandas < 1.3
The parameters are error_bad_lines=False and warn_bad_lines=True.
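For example (these two parameters were deprecated in pandas 1.3 and removed in 2.0):
import pandas as pd

df1 = pd.read_csv('df1.csv', error_bad_lines=False, warn_bad_lines=True)
As for the P.S.: an integer skiprows means "skip this many lines at the start of the file", while a list gives the 0-indexed line numbers to skip, hence the off-by-one:
pd.read_csv('df1.csv', skiprows=177)    # skip the first 177 lines
pd.read_csv('df1.csv', skiprows=[176])  # skip only the line at index 176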

Unable to identify the cause of this error: Exception has occurred: ParserError Error tokenizing data. C error: Expected 1 fields in line 51, saw 2

I am using tabula-py to read data from some pdfs, but keep getting this error.
Exception has occurred: ParserError
Error tokenizing data. C error: Expected 1 fields in line 51, saw 2
The PDF I am reading from is almost exactly the same as the one I built this code around: I built the code while testing with another PDF, and am now switching to a new, updated one with the same format and style, but the code now fails and throws this error.
I'm not sure what I am doing wrong, or why this code that previously worked no longer does.
Code snippet:
import os
import pandas as pd
import tabula

tabula.convert_into_by_batch("-----", stream=True, output_format='csv', pages='11-57')
path = "-------"
filenamelist = os.listdir(path)
updated_path = path + "\\" + filenamelist[0]
new_frame = pd.read_csv(updated_path, skiprows=2, encoding='ISO-8859-1')  # error thrown here
The conversion of pdfs to csvs is not a perfect transformation. Converting anything away from a pdf is actually quite difficult and can be finicky no matter what library you're using. Your error is telling me that on line 51 of your converted csv there is a comma that pandas did not expect to see. In all of the rows leading up to the "bad" row there were no commas, so pandas expected to see exactly 1 value per line. Then on row 51 it encountered either 2 values, or a value with a comma at the end, which makes this an improperly formatted csv.
import pandas as pd
import io
bad_csv_file = io.StringIO("""
A
1
2
3
99
50,
100
""".strip())
pd.read_csv(bad_csv_file)
output
Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Note that there's an extra comma on line 6 that leads to the above error. Simply removing that extra trailing comma resolves this error.
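If editing the file isn't practical, the on_bad_lines approach from the previous answer works here as well (pandas >= 1.3):
bad_csv_file.seek(0)  # rewind the buffer after the failed read
pd.read_csv(bad_csv_file, on_bad_lines='skip')  # drops the malformed '50,' line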

Prevent New Line Formation in json_normalize Data Frames

I am working to flatten some tweets into a wide data frame. I simply use the pandas.json_normalize function on my tweets to perform this.
I then save this data frame to a CSV file. When the CSV is uploaded, some rows end up split across several lines and associated with the row above, rather than holding all the data on a single row. I discovered this issue when uploading the CSV into R and into Domo.
When I run the following command in a jupyter notebook the CSV loads fine,
sb_2019 = pd.read_csv('flat_tweets.csv',lineterminator='\n',low_memory=False)
Without the lineterminator I see this error:
Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Needs:
I am looking for a post-processing step to eliminate the need for the lineterminator, because I need to open the CSV in platforms and languages that do not offer this option. How might I go about doing this?
Note:
I am working with over 700k tweets. The json_normalize function works great on the small pieces of my data where the issues are being found; it is when I run json_normalize on the whole dataset that I encounter this issue.
Try using '\r\n' or '\r' as the lineterminator, rather than '\n'.
Opening the file in universal-newline mode may also help:
sb_2019 = pd.read_csv(open('flat_tweets.csv','rU'), encoding='utf-8', low_memory=False)
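If you need a true post-processing step so other platforms can read the file, one hedged option is to strip embedded newlines from the string columns before writing, so every record stays on one physical line (the output filename is made up for illustration):
import pandas as pd

sb_2019 = pd.read_csv('flat_tweets.csv', lineterminator='\n', low_memory=False)
# Replace carriage returns/newlines inside text fields with a space
text_cols = sb_2019.select_dtypes(include='object').columns
sb_2019[text_cols] = sb_2019[text_cols].replace(r'[\r\n]+', ' ', regex=True)
sb_2019.to_csv('flat_tweets_clean.csv', index=False)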

Error When Reading CSV With C Engine

I have a large data file that I'm trying to read into a Pandas Dataframe.
If I try to read it using the following code:
df = pd.read_csv(file_name,
                 sep='|',
                 compression='gzip',
                 skiprows=54,
                 comment='#',
                 names=column_names,
                 header=None,
                 usecols=column_numbers,
                 engine='python',
                 nrows=15347,
                 na_values=["None", " "])
It works perfectly, but not quickly. If I try to use the C engine to speed the import up though, I get an error message:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 0 fields in line 55, saw 205
It looks like something is going wrong when I change the engine, and the parser isn't figuring out how many/which columns it should be using. What I can't figure out is why: none of the input arguments are only supported by the Python engine.
The problem only occurred after I upgraded pandas from version 0.14.1 to 0.16.0.
I can't attach a copy of the data, because it contains confidential information.
