How can I fix "Error tokenizing data" on pandas csv reader? - python

I'm trying to read a csv file with pandas.
This file actually has only one row but it causes an error whenever I try to read it.
Something seems to go wrong at line 8, but I can't find an 8th line since the file clearly has only one row.
I do this:
import codecs
import pandas as pd

with codecs.open("path_to_file", "rU", "Shift-JIS", "ignore") as file:
    df = pd.read_csv(file, header=None, sep="\t")
df
Then I get:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 3
I don't get what's really going on, so any of your advice will be appreciated.

I struggled with this for almost half a day. I opened the csv with Notepad, noticed that the separator is a TAB, not a comma, and then tried the combination below.
df = pd.read_csv('C:\\myfile.csv',sep='\t', lineterminator='\r')

Try df = pd.read_csv(file, header=None, error_bad_lines=False)
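Note that error_bad_lines has since been deprecated and later removed; on newer pandas (1.3+) the rough equivalent is on_bad_lines:
import pandas as pd

# 'skip' drops rows with too many fields instead of raising ParserError;
# use 'warn' to keep a message about each skipped line.
df = pd.read_csv("path_to_file", sep="\t", header=None, on_bad_lines="skip")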

The existing answer (using error_bad_lines=False) will not include these additional lines in your dataframe. If you'd like your dataframe to be as wide as its widest row, you can use the following:
delimiter = ','
# the widest row has max_columns - 1 separators, i.e. max_columns fields
with open(path_name, 'r') as f:
    max_columns = max(line.count(delimiter) for line in f) + 1
df = pd.read_csv(path_name, header=None, skiprows=1, names=list(range(max_columns)))
Set skiprows=1 only if there's actually a header; you can always retrieve the header column names later.
You can also identify rows that have more columns populated than the number of column names in the original header.
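To do that, one hedged sketch (assuming the header is the first physical line and uses the same delimiter) compares each row's populated width against the header width:
# Columns beyond the header width carry the integer names assigned above,
# so any non-null value there means the row had more fields than the header.
with open(path_name, 'r') as f:
    header_width = f.readline().count(delimiter) + 1
too_wide = df.iloc[:, header_width:].notna().any(axis=1)
print(df[too_wide])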

Pandas use column names if do not exist

Is there a way, without reading the file twice, to check if a column exists and otherwise use the column names passed in? I have files of the same structure, but some do not contain a header for some reason.
Example with header:
Field1 Field2 Field3
data1 data2 data3
Example without header:
data1 data2 data3
When trying to use the example below, if the file has a header it will make it the first row instead of replacing the header.
pd.read_csv('filename.csv', names=col_names)
When trying to use the below, it will drop the first row of data if there is no header in the file.
pd.read_csv('filename.csv', header=0, names=col_names)
My current workaround is to load the file, check if the columns exist or not, and if they don't, read the file again.
df = pd.read_csv('filename.csv')
if 'Field1' not in df.columns:
    del df
    df = pd.read_csv('filename.csv', names=col_names)
Is there a better way to handle this data set that doesn't involve potentially reading the file twice?
Just modify your logic so the first time through only reads the first row:
# Load first row and set up keyword args if necessary
kw_args = {}
first = pd.read_csv('filename.csv', nrows=1)
if 'Field1' not in first.columns:
    kw_args["names"] = col_names

# Load data
df = pd.read_csv('filename.csv', **kw_args)
You can do this with the seek method of the file object:
with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # return file pointer to the beginning of the file
    # do stuff here
    if 'Field1' in headers:
        ...
    else:
        ...
    df = pd.read_csv(csvfile, ...)
The file is read only once.
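A hedged sketch of that pattern applied to the Field1 / col_names case from the question:
import pandas as pd

with open('filename.csv') as csvfile:
    headers = pd.read_csv(csvfile, nrows=0).columns.tolist()
    csvfile.seek(0)  # rewind so the full file can be read again
    if 'Field1' in headers:
        df = pd.read_csv(csvfile)
    else:
        df = pd.read_csv(csvfile, names=col_names)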

remove last empty line from pandas to csv using lineterminator

I have a requirement where I have to split some columns into a first row and the remaining columns into a second row.
I have stored them in one dataframe such as:
columnA columnB columnC columnD
A B C D
which I want to write to a text file sample.txt as:
A,B
C,D
This is the code :
cleaned_data.iloc[:, 0:1].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, line_terminator='')
cleaned_data.iloc[:,1:].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, mode='a', line_terminator='')
It should produce the expected output in sample.txt. However, there is a third line which is empty and I don't want it to exist. I tried lineterminator='': it does not work with '', but it does work with values such as ' ' or 'abc'.
I'm sure there is a better way of producing the sample text file than what I've written; I'm open to alternatives.
Still, how can I remove the last empty line? I'm using Python 3.8.
I'm not able to reproduce your issue, but it might be that the strings in your dataframe contain trailing line breaks. I'm running pandas 0.23.4 on Linux:
import pandas
print(pandas.__version__)
I created what I think your dataframe contains using the command
df = pandas.DataFrame({'colA':['A'], 'colB': ['B'], 'colC':['C'], 'colD':['D']})
To check the contents of a cell, you could use df['colA'][0].
The indexing I needed to grab the first and second columns was
df.iloc[:, 0:2]
and the way I got to a CSV did not rely on lineterminator
df.iloc[:, 0:2].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False)
df.iloc[:,2:].to_csv("report_csv.txt", encoding='utf-8', index=False, header=False, mode='a')
When I run
with open('report_csv.txt','r') as file_handle:
    dat = file_handle.read()
I get 'A,B\nC,D\n' from dat.
To get no trailing newline on the last line, use to_string()
with open('output.txt','w') as file_handle:
    file_handle.write(df.iloc[:, 0:2].to_string(header=False,index=False)+"\n")
    file_handle.write(df.iloc[:,2:].to_string(header=False,index=False))
Then we can verify the file is formatted as desired by running
with open('output.txt','r') as file_handle:
    dat = file_handle.read()
Now dat contains 'A B\nC D'. If spaces are not an acceptable delimiter, they could be replaced with a comma prior to writing to the file.
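For example, a small hedged sketch of that replacement (it assumes the cell values themselves contain no spaces):
# Rebuild each line with commas instead of the whitespace padding from to_string()
row1 = ','.join(df.iloc[:, 0:2].to_string(header=False, index=False).split())
row2 = ','.join(df.iloc[:, 2:].to_string(header=False, index=False).split())
with open('output.txt', 'w') as file_handle:
    file_handle.write(row1 + "\n" + row2)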

Reading header containing dates in Python pandas

I have an Excel sheet:
31-12-2019 31-01-2020 28-02-2020 (these are formatted in Excel as 31-Dec-19, 31-Jan-20, etc.; not sure if that's relevant)
1 -0,36% 0,12% -0,09%
2 -0,18% 0,06% -0,07%
3 0,05% 0,04% 0,14%
To be clear, the problem is not in reading the file, but the issue described below.
I want to read this file with pandas in Python and have the dates in the header as strings, so that later I can refer to any column with something like df['31-12-2019'].
When I read the Excel file now, I get a KeyError, because the format of the dates in the header has changed. I read it like this now:
curve = pd.read_excel("Monthly curves.xlsx", sheet_name = "swap", skiprows = 1, index_col = 0)
I receive the error when selecting, for instance, column 31-12-2019: "KeyError: '31-12-2019'". Any help would be much appreciated!
Also, the first column does not have a header; how can I name it myself as 'years'?
It worked when I used this:
import pandas as pnd
file = 'excelfile.xlsx'
df = pnd.read_excel(file,sheet_name=0,index_col=0)
df.head()
I don't know about naming the headers though...
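If the remaining issue is that the header cells come back as datetimes rather than the expected strings, a hedged sketch (assuming the df read above and the dd-mm-yyyy format from the question) might look like:
# Hedged sketch: convert any datetime-like column labels back to 'dd-mm-yyyy'
# strings and name the otherwise unnamed index column 'years'.
df.columns = [c.strftime('%d-%m-%Y') if hasattr(c, 'strftime') else c for c in df.columns]
df.index.name = 'years'
print(df['31-12-2019'])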
I worked around my problem by reading the file as follows:
curve = pd.read_excel("Monthly Curves.xlsx", sheet_name = "swap", index_col = 0, skiprows = 2, header = None)
Then to select, for instance, the 91st column I used .loc (because .ix is deprecated), and I did that in the following way:
M12 = curve.loc[:, 91]
Hope that helps others as well!

Reading bad csv files with garbage values

I wish to read a csv file which has the following format using pandas:
atrrth
sfkjbgksjg
airuqghlerig
Name Roll
airuqgorqowi
awlrkgjabgwl
AAA 67
BBB 55
CCC 07
As you can see, if I use pd.read_csv, I get the fairly obvious error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
But I wish to get the entire data into a dataframe. Using error_bad_lines = False will remove the important stuff and leave only the garbage values.
These are some of the possible column names, as given below:
Name : [Name , NAME , Name of student]
Roll : [Rollno , Roll , ROLL]
How to achieve this?
Open the csv file and find the row where the column names start:
with open(r'data.csv') as fp:
    skip = next(filter(
        lambda x: x[1].startswith(('Name','NAME')),
        enumerate(fp)
    ))[0]
The row index will be stored in the skip variable:
import pandas as pd
df = pd.read_csv('data.csv', skiprows=skip)
Works in Python 3.X
I would like to suggest a slight modification/simplification to #RahulAgarwal's answer. Rather than closing and re-opening the file, you can continue loading the same stream directly into pandas. Instead of recording the number of rows to skip, you can record the header line and split it manually to provide the column names:
with open(r'data.csv') as fp:
    names = next(line for line in fp if line.casefold().lstrip().startswith('name'))
    df = pd.read_csv(fp, names=names.strip().split())
This has an advantage for files with large numbers of trash lines.
A more detailed check could be something like this:
def isheader(line):
    items = line.strip().split()
    if len(items) != 2:
        return False
    items = sorted(map(str.casefold, items))
    return items[0].startswith('name') and items[1].startswith('roll')
This function will handle all your possibilities, in any order, and will also skip trash lines that contain spaces. You would use it as a filter:
names = next(line for line in fp if isheader(line))
If that's indeed the structure (and not just an example of what sort of garbage one can get), you can simply use the skiprows argument to indicate how many lines should be skipped. In other words, you should read your dataframe like this:
import pandas as pd
df = pd.read_csv('your.csv', skiprows=3)
Mind that skiprows can do much more. Check the docs.
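For example, skiprows also accepts a list of row indices or a callable evaluated against each row index; a small hedged sketch of both forms:
import pandas as pd

# Skip specific rows by 0-based index
df = pd.read_csv('your.csv', skiprows=[0, 1, 2])

# Or skip rows with a callable, e.g. everything before a known header position
df = pd.read_csv('your.csv', skiprows=lambda i: i < 3)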

pandas read_excel select rows

Thanks to StackOverflow (so basically all of you) I've managed to solve almost all my issues regarding reading Excel data into a DataFrame, except one... My code goes like this:
df = pd.read_excel(
    fileName,
    sheetname=sheetName,
    header=None,
    skiprows=3,
    index_col=None,
    skip_footer=0,
    parse_cols='A:J,AB:CC,CE:DJ',
    na_values='')
The thing is that in the Excel files I'm parsing, the last row of data I want to load is in a different position every time. The only way I can identify the last row that interests me is to look for the word "SUMA" in the first column of each sheet; the last row I want to load into df is the row just above the one containing "SUMA". The rows below SUMA also contain some information that is irrelevant for me, and there can be quite a lot of them, so I want to avoid loading them.
If you do it with generators, you could do something like this. It loads the complete DataFrame, but afterwards filters out the rows from 'SUMA' onwards, using the trick that True == 1 with a cumulative sum, so you only keep the relevant info. You might need some work afterwards to get the dtypes correct.
def read_files(files):
    sheetname = 'my_sheet'
    for file in files:
        yield pd.read_excel(
            file,
            sheetname=sheetname,
            header=None,
            skiprows=3,
            index_col=None,
            skip_footer=0,
            parse_cols='A:J,AB:CC,CE:DJ',
            na_values='')

def clean_files(dataframes):
    summary_text = 'SUMA'
    for df in dataframes:
        # True == 1, so the cumulative sum is 0 before 'SUMA' and >= 1 from that row on
        after_suma = df.iloc[:, 0].astype(str).str.startswith(summary_text).cumsum()
        yield df.loc[after_suma == 0, :]
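A hedged usage sketch of chaining the two generators (the file names here are placeholders, and pd.concat is just one way to combine the results):
import pandas as pd

# Hypothetical file list; read_files yields raw sheets, clean_files trims them at 'SUMA'.
files = ['report_2019.xlsx', 'report_2020.xlsx']
cleaned = clean_files(read_files(files))
combined = pd.concat(cleaned, ignore_index=True)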
