Python to_csv: my CSV file data is separated and pushed back

I'm saving my pd.DataFrame with
df.to_csv('df.csv', encoding='utf-8-sig')
but my CSV file has a problem.
Please see the rows containing content2-1, content2-2, and content2-3 in the attached picture.
Before saving with to_csv there was no problem: all the data was in the right columns and 'content2' was not split. But after writing the DataFrame to CSV...
'content2' is split across rows, and the remaining fields of 'id2' are allocated to the wrong columns.
"2018-04-21" should be in column D, the zeros in columns E, F, and G, and the URL in column I.
Why does this happen? Is it because the CSV file is large (774,740KB)? Because of the language (Korean)? Or can CSV not handle the Enter key? (All the problematic data, such as content2, was split at newlines.)
How can I resolve this? I have no idea.

Unfortunately I never figured out the reason for this. I assumed it was something to do with the size of the data I was working with and Excel not liking it.
What worked for me, though, was using .to_excel() instead of .to_csv(). I know it's far from a perfect answer, but I thought I'd put it here in case it's enough for your case.
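If the real culprit is the embedded newlines, there may be a way to keep using CSV. pandas quotes fields that contain newlines, so the file itself is usually valid; the splitting typically happens in viewers such as Excel. A minimal sketch, with made-up data standing in for the question's frame, that replaces the newlines before saving:

```python
import pandas as pd

# Made-up stand-in for the question's frame; one field contains newlines
df = pd.DataFrame({
    "id": ["id1", "id2"],
    "content": ["content1", "content2-1\ncontent2-2\ncontent2-3"],
    "date": ["2018-04-20", "2018-04-21"],
})

# Replace embedded newlines so every record stays on one physical line
df["content"] = df["content"].str.replace("\n", " ", regex=False)
df.to_csv("df.csv", index=False, encoding="utf-8-sig")
```

If the newlines must be preserved, reading the file back with pd.read_csv (rather than opening it directly in Excel) should keep the quoted multi-line fields intact.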


Can't merge dataframes because of index type mismatch

I have loaded some data from CSV files into two dataframes, X and Y that I intend to perform some analysis on. After cleaning them up a bit I can see that the indexes of my dataframes appear to match (they're just sequential numbers), except one has index with type object and the other has index with type int64. Please see attached image for a clearer idea of what I'm talking about.
I have tried manually altering this using X.index.astype('int64') and also X.reindex(Y.index) but neither seem to do anything here. Could anyone suggest anything?
Edit: Adding some additional info in case it is helpful. X was imported as row data from the csv file and transposed whereas Y was imported directly with the index set from the first column of the csv file.
So I've realised what I've done and it was pretty dumb. I should have written
X.index = X.index.astype('int64')
instead of just
X.index.astype('int64')
Oh well, the more you know.
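For anyone hitting the same thing: Index.astype returns a new Index rather than modifying the existing one in place, so the result has to be assigned back. A small sketch with made-up frames reproducing the mismatch:

```python
import pandas as pd

# X's index holds strings (dtype object), Y's holds integers (int64)
X = pd.DataFrame({"a": [1, 2]}, index=["0", "1"])
Y = pd.DataFrame({"b": [3, 4]}, index=[0, 1])

# astype returns a NEW Index; without the assignment, X is unchanged
X.index = X.index.astype("int64")

merged = X.join(Y)  # the indexes now align
```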

Convert csv column into list using pandas

I'm currently working on a project that takes a csv list of student names who attended a meeting, and converts it into a list (later to be compared to full student roster list, but one thing at a time). I've been looking for answers for hours but I still feel stuck. I've tried using both pandas and the csv module. I'd like to stick with pandas, but if it's easier in the csv module that works too. CSV file example and code below.
The file is autogenerated by our video call software- so the formatting is a little weird.
Attendance.csv
See the sample as an image; I can't insert images yet.
Code:
import pandas
data = pandas.read_csv("2A Attendance Report.csv", header=3)
AttendanceList = data['A'].to_list()
print(str(AttendanceList))
However, this is raising KeyError: 'A'
Any help is really appreciated, thank you!!!
As seen in the sample image, you have column headers in the first row itself, so header=3 treats the fourth row as the header and there is no column named 'A'. Remove header=3 from your read_csv call: either replace it with header=0 or don't specify any explicit header value at all.
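A minimal sketch of the fix, using an in-memory sample in place of the real file (the header name "Name" is made up for illustration):

```python
import io

import pandas as pd

# In-memory stand-in for the attendance file; headers are on the first row
csv_text = "Name\nAlice\nBob\n"

# header=0 is the default, so no header argument is needed
data = pd.read_csv(io.StringIO(csv_text))
attendance_list = data["Name"].to_list()
```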

How to keep both good and bad lines when loading text file?

I am trying to load a large text file into a pandas DataFrame. One thing I noticed is that if I want to load it successfully, I have to drop all the bad lines. But I would like to load all rows first, take a look, and then clean them up manually. Is there a way to do that?
data = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False, engine='python')
Here are the warnings I got. It's a common error, but all the solutions just skip the bad lines, and I really need to load all rows... any thoughts?
Skipping XXX line: Expected 28 fields in line XXX, saw 29
Without knowing more about the specific CSV file, it looks like there is either:
Too many columns in that row (an extra comma)
Quoting is off, meaning there's a comma that should be quoted but isn't
The best way to remedy this is to fix the problem in the CSV file.
Technically you're not just loading the file, but also parsing it at the same time. It looks like you've handled the delimiter properly, so as you may have guessed you have too many columns or too few in some of your rows. That may actually be the case, or perhaps you have tabs within text fields that are being interpreted as delimiters.
In any case, pandas isn't going to parse those inconsistent lines.
A typical approach is to open the file in a robust text editor and look at the lines that are erroring out in Pandas. See what's actually wrong and either fix it in the text editor, or use python's native open() function to load the entire file and iterate line by line, with logic that fixes whatever the problem is.
Once you're certain that you have the same number of columns in every row, load it with pandas.
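One way to sketch that manual approach, assuming tab-delimited data: read the raw rows with the csv module, keep the extra fields of over-long rows, pad everything out to the widest row, and only then build the DataFrame. The sample data and the extra_* column names are made up:

```python
import csv
import io

import pandas as pd

# Made-up tab-delimited sample: one row too long, one too short
raw = "a\tb\tc\n1\t2\t3\n4\t5\t6\t7\n8\t9\n"

rows = list(csv.reader(io.StringIO(raw), delimiter="\t"))
header, body = rows[0], rows[1:]
width = max(len(r) for r in rows)

# Name the overflow columns so the header matches the widest row
header = header + [f"extra_{i}" for i in range(width - len(header))]

# Pad short rows with None so nothing gets skipped
body = [r + [None] * (width - len(r)) for r in body]

df = pd.DataFrame(body, columns=header)
```

From here every original row is present, and the rows with values in the extra_* columns are exactly the ones read_csv would have skipped, so they can be inspected and fixed by hand.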

Any idea how to import this data set?

I have the following dataset:
https://github.com/felipe0216/fdfds/blob/master/sheffield_weather_station.csv
I know it is a CSV, so I can use the pd.read_csv() function. However, if you look at the file, it is neither comma-separated nor tab-separated, and I am not sure what separator it uses exactly. Any ideas about how to open it?
The proper way to do this is as follows:
df = pd.read_csv("sheffield_weather_station.csv", comment="#", delim_whitespace=True)
You should first download the .csv file. You have to tell pandas that there will be comments and that the columns are separated by runs of spaces. The number of spaces does not matter.
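Equivalently, a regex separator can be used in place of delim_whitespace=True (newer pandas versions prefer sep=r"\s+"). A sketch on made-up in-memory lines shaped like the station file:

```python
import io

import pandas as pd

# Made-up sample: a comment header, then space-aligned columns
text = "# Sheffield weather station\nyyyy  mm  tmax\n1883   1   6.2\n"

# comment="#" skips the comment line; sep=r"\s+" splits on any whitespace run
df = pd.read_csv(io.StringIO(text), comment="#", sep=r"\s+")
```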

Trailing delimiter confuses pandas read_csv

A csv (comma delimited) file, where lines have an extra trailing delimiter, seems to confuse pandas.read_csv. (The data file is [1])
It treats the extra delimiter as if there's an extra column. So there's one more column than what headers require. Then pandas.read_csv takes the first column as row labels. The overall effect is that columns and headers are not aligned any more - the first column becomes row labels, the second column is named by first header, etc.
It is quite annoying. Any idea how to tell pandas.read_csv do the right thing? I couldn't find one.
Great book, BTW.
[1]: 2012 FEC Election Database from chapter 9 of the book Python for Data Analysis
For everyone who is still finding this: Wes wrote a blog post about it. The problem is that if there is one value too many in a row, the first value is treated as the row's name.
This behaviour can be changed by setting index_col=False as an option to read_csv.
I created a GitHub issue to have a look at handling this issue automatically:
https://github.com/pydata/pandas/issues/2442
I think the FEC file format changed slightly, causing this annoying issue. If you use the one posted here http://github.com/pydata/pydata-book you hopefully won't have that problem.
Well, there's a very simple workaround: add a dummy column to the header when reading the CSV file in:
cols = ...
cols.append('')
records = pandas.read_csv('filename.txt', skiprows=1, names=cols)
Then the columns and header are aligned again.
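To illustrate the index_col=False fix on a tiny made-up sample with trailing commas:

```python
import io

import pandas as pd

# Two header fields but three data fields per row (trailing comma)
text = "a,b\n1,2,\n3,4,\n"

# Default: pandas uses the first column as row labels, misaligning everything
bad = pd.read_csv(io.StringIO(text))

# index_col=False disables the implicit index, so columns match the header
good = pd.read_csv(io.StringIO(text), index_col=False)
```

Here good has columns a and b aligned correctly, while bad has 1 and 3 shifted into the row labels.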
