Comma in numbers causing problem reading csv - python

Upon reading a csv file I am getting the following error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2
I opened my csv file, went to that line, and saw that the error occurs because one of the numbers has decimals but uses a comma as the separator.
That entire column of my csv file has whole numbers but also decimal numbers that look like the following:
385433,4
Not sure how I can resolve this error when reading the csv file using pandas

It sounds like you have European-formatted CSV. Since you haven't provided a real sample of your CSV as requested, I will guess. If this doesn't solve your issue, edit your question to provide an actual sample:
Given test.csv:
c1;c2;c3
1,2;3,4;5,6
3,4;5,6;7,8
Then:
import pandas as pd
data = pd.read_csv('test.csv',decimal=',',delimiter=';')
print(data)
Produces:
c1 c2 c3
0 1.2 3.4 5.6
1 3.4 5.6 7.8
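The same approach can be rehearsed without a file on disk by feeding the text through io.StringIO, which makes it easy to confirm that the columns come out numeric:

```python
import io
import pandas as pd

# Reproduce test.csv in memory.
csv_text = "c1;c2;c3\n1,2;3,4;5,6\n3,4;5,6;7,8"

# decimal=',' tells pandas to treat the comma as the decimal mark,
# and delimiter=';' matches the European-style field separator.
data = pd.read_csv(io.StringIO(csv_text), decimal=',', delimiter=';')

print(data)
print(data.dtypes)  # all three columns should parse as float64
```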

Related

Pandas Error Tokenizing but when troubleshooting the data points are not separated

When trying to open the file it gives me ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 22152
What I tried:
mat = pd.read_csv('/Users/csb/Desktop/Zebrafish_scRNA/sample_Control_WTA_1_RSEC_MolsPerCell.csv', sep = '\t')
Output:
Instead of the data points appearing in separate columns, they all got combined into one column.
Output I want is:
The first 7 rows of the output dataframe dropped, with the data split into separate columns. I want to automate this since there will be several files that come in this format.
How can I resolve this issue and automate it to get the output that I want to achieve?
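A hedged sketch of one possible fix, assuming the files are comma-separated (so sep='\t' is wrong) and have 7 metadata lines before the real header row; the exact layout can't be confirmed without a sample, so adjust skiprows to match your files:

```python
import io
import pandas as pd

# Simulate a file with 7 metadata lines before a comma-separated header,
# as the *_MolsPerCell.csv exports described in the question appear to have
# (an assumption about the format).
sample = io.StringIO(
    "####\n" * 7 +            # 7 metadata/comment lines to drop
    "Cell_Index,GeneA,GeneB\n"
    "1,5,0\n"
    "2,0,3\n"
)

# skiprows=7 discards the metadata block; the default sep=',' then
# splits the data into separate columns instead of one combined column.
mat = pd.read_csv(sample, skiprows=7)
print(mat)
```

For the real files, the same call would be pd.read_csv(path, skiprows=7) inside a loop over the filenames, which automates the per-file handling.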

List all Pandas ParserErrors when using pd.read_csv()

I have a csv with multiple lines that produce the following error:
df1 = pd.read_csv('df1.csv', skiprows=[176, 2009, 2483, 3432, 7486, 7608, 7990, 11992, 12421])
ParserError: Error tokenizing data. C error: Expected 52 fields in line 12541, saw 501
As you can probably notice, I have multiple lines that produce a ParserError.
To work around this, I am just updating skiprows to include each errored line and re-running the parse. I have over 30K lines and would prefer to do this all at once rather than hitting run in Jupyter Notebook, getting a new error, and updating. Better yet, I wish it would just skip the errors and parse the rest; I've tried googling a solution that way, but all the SO responses were too complicated for me to follow and reproduce for my data structures.
P.S. why is it that when using skiprows with just one line, like 177, I can just enter skiprows=177, but when using skiprows with a list, I have to use the errored line minus 1? Why does the counting change?
pandas ≥ 1.3
You should use the on_bad_lines parameter of read_csv (pandas ≥ 1.3.0)
df1 = pd.read_csv('df1.csv', on_bad_lines='warn')
This will skip the invalid lines and give you a warning. If you use on_bad_lines='skip' you skip the lines without warning. The default value of on_bad_lines='error' raises an error for the first issue and aborts.
pandas < 1.3
The parameters are error_bad_lines=False and warn_bad_lines=True.
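A small runnable demonstration of on_bad_lines='skip' (pandas ≥ 1.3), using an in-memory CSV where the third line has too many fields:

```python
import io
import pandas as pd

# A small CSV where the line "3,4,5" has 3 fields but 2 are expected.
bad = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines='skip' drops the malformed line silently; 'warn' would
# keep parsing but emit a warning for it instead.
df = pd.read_csv(io.StringIO(bad), on_bad_lines='skip')
print(df)  # only the two well-formed data rows survive
```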

Unable to identify the cause of this error: Exception has occurred: ParserError Error tokenizing data. C error: Expected 1 fields in line 51, saw 2

I am using tabula-py to read data from some pdfs, but keep getting this error.
Exception has occurred: ParserError
Error tokenizing data. C error: Expected 1 fields in line 51, saw 2
The PDF I am reading from is almost exactly the same as the one I built this code around: I built and tested the code with another PDF, and am now changing to a new, updated one with the same format and style, but the code now fails and throws this error.
Not sure what I am doing wrong / why this code that previously worked no longer works.
Code snippet:
tabula.convert_into_by_batch("-----", stream = True, output_format='csv', pages='11-57')
path = ("-------")
filenamelist = os.listdir(path)
updated_path = path+ "\\" + filenamelist[0]
new_frame = pd.read_csv(updated_path, skiprows=2, encoding='ISO-8859-1') #error thrown here
The conversion of pdfs to csvs is not a perfect transformation. Converting anything away from a pdf is actually quite difficult and can be finicky no matter what library you're using. Your error is telling me that on line 51 of one of your converted csvs there is a comma that pandas did not expect to see. In all of the rows leading up to the "bad" row there were no commas at all, so pandas expected to see a single field per line. Then on row 51 it encountered either two values, or a value with a trailing comma, which makes this an improperly formatted csv.
import pandas as pd
import io
bad_csv_file = io.StringIO("""
A
1
2
3
99
50,
100
""".strip())
pd.read_csv(bad_csv_file)
output
Error tokenizing data. C error: Expected 1 fields in line 6, saw 2
Note that there's an extra comma on line 6 that leads to the above error. Simply removing that extra trailing comma resolves this error.
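If hand-editing every converted csv is impractical, the same on_bad_lines option mentioned in the previous answer can drop such rows automatically (a workaround for the parse error, not a fix for the PDF conversion itself; pandas ≥ 1.3):

```python
import io
import pandas as pd

# Same malformed csv as above: the "50," row has a trailing comma,
# so the parser sees 2 fields where it expects 1.
bad_csv_file = io.StringIO("""
A
1
2
3
99
50,
100
""".strip())

# Skip the malformed line instead of aborting the whole read.
df = pd.read_csv(bad_csv_file, on_bad_lines='skip')
print(df)
```

Note that the value 50 is lost entirely with this approach, so it only makes sense when dropping a few damaged rows is acceptable.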

How to read an excel file in pandas when a column has more than one row of text within each cell

I am trying to read an excel file with pandas. But my excel has one column, called Error, that has more than one row of text within each cell. Example below:
Row  Error
1    Bank error
     Try again
2    Limit error
     Cancell
When I read this file into python, I only get the first row of text from each Error cell. My dataframe looks like this:
Row Error
0 Bank error
1 Limit error
My code below:
import pandas as pd
df = pd.read_excel('/content/drive/My Drive/error.xlsx')
How can I fix this and read whole cell to python? Thank you.
I also added an image of the first two rows of the excel file.
With specific problems in I/O operations, you are supposed to give us a small data sample (the original excel file with the problem). Otherwise we can't tell what you are talking about.
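Without the file this is only a guess, but one common cause of this symptom is merged cells in the key column: read_excel then fills only the first spreadsheet row of each merged block and leaves NaN below it. If that is what is happening here, a forward-fill plus groupby can reconstruct one row per record (the DataFrame below simulates that assumed read_excel result rather than reading a real file):

```python
import pandas as pd

# Simulated read_excel output when 'Row' cells are merged across several
# spreadsheet rows (an assumption about the question's file): the merged
# key appears once, followed by NaN.
df = pd.DataFrame({
    'Row': [1, None, 2, None],
    'Error': ['Bank error', 'Try again', 'Limit error', 'Cancell'],
})

# Forward-fill the merged key, then join each group's lines back together.
df['Row'] = df['Row'].ffill().astype(int)
combined = df.groupby('Row')['Error'].apply(' '.join).reset_index()
print(combined)
```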

Pandas Read Irregular Data From Clipboard

I am trying to copy data from this CSV file and read it via Pandas.read_clipboard().
I keep getting this error:
ParserError: Error tokenizing data. C error: Expected 5 fields in line 6, saw 7
Is it possible to read in data like this? It works with read_csv (encoding='latin-1') but not read_clipboard.
Thanks in advance!
Let's use the skiprows=6 parameter to ignore the data at the top of the file, which looks like a header and not part of the core dataframe:
df = pd.read_clipboard(sep='\t', skiprows=6)
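Since read_clipboard forwards its keyword arguments to read_csv, the same fix can be rehearsed against the raw text before touching the clipboard (the 6 header lines and tab separator here mirror the answer's assumptions about the file):

```python
import io
import pandas as pd

# Simulate the copied text: 6 header-ish lines, then tab-separated data.
text = "junk header line\n" * 6 + "a\tb\tc\n1\t2\t3\n"

# The same sep/skiprows combination that read_clipboard would pass through.
df = pd.read_csv(io.StringIO(text), sep='\t', skiprows=6)
print(df)
```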
