I'm trying to save a file with text information in CSV format.
However, after I save the file with pandas' to_csv (without specifying any arguments) and then use pd.read_csv to reopen it, I get this error message:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
But if I read the CSV file with pd.read_csv('file.csv', lineterminator='\n'), the file opens properly.
However, I need to use another piece of software to process these text files, and that software opens the file the same way Excel opens a CSV file; I cannot specify lineterminator='\n' there like I did in Python. If I open the CSV file using that software, some text from the column spills into other rows.
The text between index 378 and 379 is supposed to be together in row 378. Instead, it spills into the following rows and into the index column:
id text
378 1 Good morning. This row's text goes to the following rows
Dot dot NaN NaN
HELLO NaN NaN
Apple NaN NaN
379 2 This row is correct
Does anyone know how to solve this problem when I use pandas.to_csv to save the DataFrame? What should I specify if I want to open the file properly in software like Excel?
Try this:
df = pd.read_csv('file.csv', error_bad_lines=False)
Thanks for the replies. I have found the problem: it's the '\r' characters inside the text. I removed all '\r' from the text and now it works. Thanks!
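For anyone hitting the same issue, here is a minimal sketch of that fix, assuming the 'text' column from the example above (adapt the column name to your own DataFrame):

import pandas as pd

df = pd.read_csv('file.csv', lineterminator='\n')
# Strip stray carriage returns so Excel-style readers no longer
# split the field across multiple rows.
df['text'] = df['text'].str.replace('\r', '', regex=False)
df.to_csv('file_clean.csv', index=False)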
I am trying to read a CSV file, which I downloaded from the City of Chicago, into a DataFrame. However, for many rows all of the data gets read into the first column, like in row 2. If I delete those rows, only 5% of the data is left. Does anyone have an idea what to do?
Also, when opening the CSV file as text, the rows that do not read correctly have a leading '"'. I don't know if that could cause the issue.
crime_df = pd.read_csv('ChicagoCrime.csv')
crime_df.head(10)
Edit: I downloaded the file as a CSV for Excel and didn't have the issue. I would recommend this unless you really don't want to download it again.
My first suggestion is to specify more arguments in the read_csv function to help it out. You may try specifying a delimiter, specifying the engine to use, setting parse_dates to True, etc. Check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for all available arguments.
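For example, here is a sketch of arguments worth experimenting with; the exact combination depends on the file, but quotechar is a likely suspect here because the broken rows start with a '"':

import pandas as pd

crime_df = pd.read_csv(
    'ChicagoCrime.csv',
    sep=',',           # be explicit about the delimiter
    quotechar='"',     # honor quoted fields that may contain commas or newlines
    engine='python',   # the Python engine is more forgiving than the C engine
    parse_dates=True,  # let pandas try to parse date-like values
)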
I have exported an Excel file to CSV using the pandas .to_csv method on a 9-column DataFrame successfully, and I can also read the created file back with .read_csv with no errors whatsoever, using the following code:
dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
sep=';', decimal=',', index_col=0, parse_dates=True,
encoding='utf-8', engine='python')
However, upon modifying the same CSV file manually using Notepad (which also extends to simply opening the file and saving it without making any actual alterations), pandas won't read it anymore and gives the following error message:
ParserError: Expected 2 fields in line 2, saw 9
In the case of the modified CSV, if the index_col=0 parameter is removed from the code, pandas is able to read the DataFrame again; however, the first 8 columns become the index as a tuple and only the last column is read as a field.
Could anyone point out why I am unable to read the DataFrame after modifying it? Also, why does removing index_col enable reading it again, with nearly all the columns as the index?
Have you tried opening and saving the file with some other text editor? Notepad really isn't that great; it is probably adding some special characters when it saves the file, or maybe the file already contains those characters and Notepad just doesn't show them, hence pandas can't parse the file correctly.
Try Notepad++ or a more advanced editor/IDE like Atom, VSCode or PyCharm.
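One quick sanity check (a sketch; the path is taken from the question above) is to inspect the raw first bytes of the file Notepad saved. A UTF-8 BOM or changed line endings would explain why the parser suddenly sees a different number of fields:

# Print the raw leading bytes; look for b'\xef\xbb\xbf' (a BOM)
# or unexpected b'\r\n' line endings.
with open('C:/Users/MyUser/Documents/Scripts/Base.csv', 'rb') as f:
    print(f.read(64))

If a BOM shows up, passing encoding='utf-8-sig' to read_csv usually resolves it.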
I have a series of .csv files that I'm reading with pandas.read_csv. From a bunch of columns, I only read two (the 2nd and 15th columns).
datafiles = glob.glob(mypath)
for dfile in datafiles:
    data = pd.read_csv(dfile, header=6, usecols=['Reading', 'Value'])
The CSV has a few lines of header at the top. Every once in a while pandas reads one of these numbers as a NaN. Excel has no trouble reading these values, and when visually inspecting the file I don't see what causes the problem. Specifically in this case, for the row indexed as 265 in the file (263 in the data frame), the 'Value' column reads a NaN when it should be ~27.4.
>>> data['Value'][264]
nan
This problem is consistent and doesn't change with the number of files I read. In many of the files the problem is not present; in the rest, it will read just one random number as a NaN, in either one of the columns. I've tried changing from the automatic float64 to np.float128 using dtype, but this doesn't fix it. Any ideas on how to fix this?
Update: A grep search shows that the newline character is ^M (a carriage return), with only 4 exceptions: lines at the beginning of every file, before the header. On further inspection, this specific point [264] is treated differently across the failing files: in 5/12 files it's fine, in 2/12 files it's read as 27.0, in 3/12 it's read as NaN, and in 2/12 files it's read as 2.0. One of the files (one that reads a 27.0) is available for download here.
It looks like you randomly have null characters ('\0') throughout your CSV files, and they are causing the problem. What you need to do to fix this is replace '\0' with nothing.
Here's an example of how to do so. The StringIO import is needed because we load the cleaned data from a string instead of from a file.
import glob
import sys
import pandas as pd

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

datafiles = glob.glob(mypath)
for dfile in datafiles:
    # Read the raw text, dropping NUL characters as we go.
    st = ''
    with open(dfile, 'r') as f:
        for line in f:
            st += line.replace('\0', '')
    # Feed the cleaned text to read_csv as if it were a file.
    data = pd.read_csv(StringIO(st), header=6, usecols=['Reading', 'Value'])
It would be cool if pandas had a function to do this by default when you load data into the DataFrame, but it appears that there is no function like that as of now.
I have a text file that is tab delimited for the first 80 rows, and these are the only rows I need in the file. I would normally open the file like this:
df=pd.read_csv(r'file.txt', sep='\t')
but this returns the error:
CParserError: Error tokenizing data. C error: Expected 7 fields in line 84, saw 81
because somewhere along the way it is no longer tab delimited, I'm pretty sure. If I manually delete everything in the file except for the first 80 rows, I can set the tab delimiter and it reads fine, but I need to do this for lots of files. I know I can select only the first 80 rows after reading using this:
df=df.iloc[:80,:]
but then my DataFrame has \t separating every column instead of a space like I want. Is there a way to select only the first 80 rows while opening the file, so that I can set sep='\t' without the error?
You can specify just to read the first 80 rows using param nrows:
df=pd.read_csv(r'file.txt', sep='\t', nrows=80)
You can set the error_bad_lines parameter to False, which will drop blank or malformed lines. nrows is not suitable in my view, because you would have to add the row count manually for each file.
df = pd.read_csv(r'file.txt', sep='\t', error_bad_lines=False)
You can also look at these parameters, combined in the sketch below:
warn_bad_lines
skip_blank_lines
Follow this link to read more.
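A sketch combining the parameters mentioned above (note this assumes an older pandas API: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 in favor of on_bad_lines):

df = pd.read_csv(r'file.txt', sep='\t',
                 error_bad_lines=False,  # drop malformed lines instead of raising
                 warn_bad_lines=True,    # but print a warning for each dropped line
                 skip_blank_lines=True)  # ignore blank lines entirely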
I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using Python?
I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
1. How big is the file in MB? ["Huge" is not a useful answer.]
2. What software created the file?
3. How much memory do you have on your computer?
4. Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
5. Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message.
6. What operating system, what version of Python, what version of xlrd?
7. Do you know how many worksheets there are in the file?
It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension, then xlrd should work for reading the file; but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here, since you can open the file with EditPad Lite), you would have to find an alternate method, because xlrd reads the entire workbook into memory.
Assuming the first case:
import xlrd

wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'

wb = xlrd.open_workbook(wb_path)
ws = wb.sheets()[0]  # assuming you want to work with the first sheet in the workbook

with open(output_path, 'w') as output_file:
    for i in xrange(ws.nrows):
        row = [cell.value for cell in ws.row(i)]
        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            # cell values may be floats, so convert them before joining
            output_file.write('\t'.join(unicode(v) for v in row) + '\r\n')
This will give you a tab-delimited output file that should open in Excel.
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
I haven't used it, but xlrd looks like it does a good job reading Excel data.
Your problem is that you are using Excel 2003. You need a more recent version to be able to read this file: Excel 2003 will not open worksheets with more than 65,536 rows, while Excel 2007 and later support up to 1,048,576.