Unable to read modified csv file with pandas - python

I have successfully exported an Excel file to CSV using the pandas .to_csv method on a 9-column DataFrame, and I can likewise read the created file back with the .read_csv method, with no errors whatsoever, using the following code:
dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
                     sep=';', decimal=',', index_col=0, parse_dates=True,
                     encoding='utf-8', engine='python')
However, upon modifying the same CSV file manually using Notepad (which also extends to simply opening the file and saving it without making any actual alterations), pandas won't read it anymore, giving the following error message:
ParserError: Expected 2 fields in line 2, saw 9
In the case of the modified CSV, if the index_col=0 parameter is removed from the code, pandas is able to read the DataFrame again, but the first 8 columns become the index (as a tuple) and only the last column is read as a data field.
Could anyone point out why I am unable to read the file after modifying it? Also, why does removing index_col enable reading it again, with nearly all the columns as the index?

Have you tried opening and saving the file with some other text editor? Notepad really isn't that great; it is probably adding some special characters when it saves, or the file may already contain characters that Notepad doesn't show, which is why pandas can't parse it correctly.
Try Notepad++ or a more advanced editor/IDE like Atom, VS Code or PyCharm.
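If the special character turns out to be a byte order mark that Notepad prepends when it saves the file as UTF-8, switching to the utf-8-sig codec may be enough. A minimal sketch under that assumption, reusing the question's own call:
import pandas as pd

# Assumes the problem is a UTF-8 BOM added by Notepad on save;
# encoding='utf-8-sig' strips it before parsing.
dfBase = pd.read_csv('C:/Users/MyUser/Documents/Scripts/Base.csv',
                     sep=';', decimal=',', index_col=0, parse_dates=True,
                     encoding='utf-8-sig', engine='python')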

Related

Pandas save to csv with lineterminator='\n'

I'm trying to save a file with text information into csv format.
However, after I save the file with to_csv in pandas (without specifying anything) and then reopen it with pd.read_csv, it gives me this error message:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
But if I read the csv file in pandas with pd.read_csv('file.csv', lineterminator='\n'), it opens the file properly.
However, I need to use another piece of software to process these text files, and that software opens the file the same way Excel opens a csv file; I cannot specify lineterminator='\n' there like I did in Python. If I open the csv file using that software, some text from the column goes to other rows.
The text between index 378 and 379 is supposed to be together in row 378. However, it spills into other rows and into the index column.
id text
378 1 Good morning. This row's text goes to the following rows
Dot dot NaN NaN
HELLO NaN NaN
Apple NaN NaN
379 2 This row is correct
Does anyone know how to solve this problem when I use pandas.to_csv to save the dataframe? What should I specify so that the file opens properly in software like Excel?
Try this:
df = pd.read_csv('file.csv', error_bad_lines=False)
Thanks for the replies. I have found the problem. It's the '\r' inside the text. I removed all '\r' characters from the text and now it works. Thanks!
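For reference, a small sketch of that fix, assuming the offending column is named text (a hypothetical name):
import pandas as pd

# Assumes the free-text column is called 'text'; embedded '\r' characters
# are removed before writing, so Excel no longer splits one cell across rows.
df = pd.DataFrame({'id': [1, 2],
                   'text': ['Good morning.\rDot dot\rHELLO\rApple',
                            'This row is correct']})
df['text'] = df['text'].str.replace('\r', ' ', regex=False)
df.to_csv('file.csv', index=False)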

read_csv reads all input into first column for some rows

I am trying to read a csv file, which I downloaded from the City of Chicago, into a dataframe. However, for many rows all the data gets read into the first column, like row 2. If I delete those rows, only 5% of the data is left. Does anyone have an idea what to do?
Also, when opening the csv file as txt, the rows that do not read correctly have a leading ". I don't know if that could cause the issue.
crime_df = pd.read_csv('ChicagoCrime.csv')
crime_df.head(10)
Edit: I downloaded the file as a CSV for Excel and didn't get an issue. I would recommend this unless you really don't want to download it again.
My first suggestion is to specify more arguments in the read_csv function to help it out. You may try specifying a delimiter, specifying the engine to use, setting parse_dates to True, etc. Check out https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html for all available arguments.
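For example, a minimal sketch along those lines; the specific values are assumptions about this particular export, not verified against the City of Chicago file:
import pandas as pd

# Being explicit with parser options, as suggested above; adjust to the file.
crime_df = pd.read_csv('ChicagoCrime.csv',
                       sep=',',          # explicit delimiter
                       quotechar='"',    # fields with embedded commas are quoted
                       engine='python',  # more tolerant tokenizer
                       parse_dates=True)
crime_df.head(10)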

Error when reading csv with merged cells

I have a txt file that I open in Excel that has merged cells (see image).
These cause an error message when reading the file:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 1883, saw 2
At the moment I'm manually taking them out in Excel. I'm sure there could be a way to take these out when reading the file, but I can't find anything on SO. I'm not sure if I'm using the right terminology though.
Using Excel may also be an option. I just wanted to see if there was a method using Python.
If you just want to skip the headers, you might look at this SO answer which suggests the following:
data = pd.read_csv('file1.csv', error_bad_lines=False)
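If the merged cells sit in a fixed number of rows at the top of the file, skiprows can drop them before parsing; also note that error_bad_lines has been replaced in newer pandas. A sketch under those assumptions:
import pandas as pd

# Assumes the merged-cell rows are the first two lines of the file
# (adjust skiprows to match your file); they are dropped before parsing.
data = pd.read_csv('file1.csv', skiprows=2)

# error_bad_lines (shown above) is deprecated since pandas 1.3 and removed
# in 2.0; on_bad_lines='skip' is the current equivalent.
data = pd.read_csv('file1.csv', on_bad_lines='skip')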

How to open a .data file extension

I am working on some side projects where the data provided is in a .data file. How do I open a .data file to see what the data looks like, and how do I read from a .data file programmatically in Python? I am on Mac OS X.
NOTE: The Data I am working with is for one of the KDD cup challenges
Kindly try using Notepad or Gedit to check the delimiters in the file (.data files are text files too). Once you have confirmed this, you can use the read_csv method from the pandas library in Python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")
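If the .data file has no header row (often the case with these datasets), you may also want to stop pandas from treating the first record as column names. A small sketch of that assumption:
import pandas as pd

# Assumes the file has no header line; column names can be assigned later.
file_path = "~/AI/datasets/wine/wine.data"
wine_data = pd.read_csv(file_path, delimiter=",", header=None)
wine_data.head()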
It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (the csv Python module) or an XML parsing library (an example of which is lxml).
After further info from the above and looking at the page, the format is:
Data Format
The datasets use a format similar to the text export format from relational databases:
One header line with the variable names
One line per instance
Tab separators between the values
There are missing values (consecutive tabs)
Therefore see this answer:
parsing a tab-separated file in Python
I would advise trying to process one line at a time rather than loading the whole file, but if you have the RAM, why not...
I suspect it doesn't open in Sublime because the file is huge, but that is just a guess.
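For example, a minimal sketch of that line-at-a-time approach, combining the csv module mentioned earlier with the tab-separated format described above (the filename is a placeholder):
import csv

# The first line holds the variable names, each later line is one instance,
# and consecutive tabs show up as empty strings (missing values).
with open("dataset.data", "r", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    for row in reader:
        record = dict(zip(header, row))
        # ...process one record at a time here...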
To get a quick overview of what the file may contain, you could do this within a terminal, using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
In case you forget to pass the -v option to cat and it is a binary file, you could mess up your terminal and therefore need to reset it:
$ reset
I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right-clicking it. macOS recommended I open it using Xcode, so I tried that but it did not work.
Next I tried opening it using a program named "Brackets". It is a text editing program primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")
inf.close()
It works for me.
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
I mean just treat it as a csv file if it is separated with ','.
solution from #mustious.

Csv blank rows problem with Excel

I have a csv file which contains rows from a sqlite3 database. I wrote the rows to the csv file using Python.
When I open the csv file with MS Excel, a blank row appears below every row, but the file in Notepad is fine (without any blanks).
Does anyone know why this is happening and how I can fix it?
Edit: I used the strip() function for all the attributes before writing a row.
Thanks.
You're using open('file.csv', 'w') -- try open('file.csv', 'wb') instead.
On Python 2, the csv module requires output files to be opened in binary mode.
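On Python 3 the csv documentation instead asks for newline='' on the file object, which has the same effect and also removes the extra blank rows in Excel:
import csv

# On Python 3, open the output with newline='' instead of binary mode;
# the csv writer then controls line endings itself, so Excel no longer
# shows a blank row after every record.
with open('file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['col1', 'col2'])
    writer.writerow(['a', 'b'])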
The first thing that comes to my mind (just an idea) is that you might have used "\r\n" as the row delimiter (which is shown as one line break in Notepad), but Excel expects only "\n" or only "\r", and so it interprets it as two line breaks.