I am reading a csv file that has two adjacent columns containing dates like this:
29/11/2004 00:00,29/11/2005 00:00,2,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
When I read this using read_csv and then write it back to csv using the to_csv method, it gets converted to
29/11/2004 00:00,00:00.0,2.0,,,,,,,,
I have two questions about this: why does it read the first date okay but think the second, which seems to have exactly the same format, is 0? And why do the NULLs get converted to empty strings?
Here is the code I am using:
df = pandas.read_csv(filepath, sep = ",")
df.to_csv("C:\\tmp\\test.csv")
I'm not sure of the reason for the mangled second date; I think it's influenced by values in other rows when pandas infers the column type.
For the NULL string problem, keep_default_na can help you avoid that:
df = pd.read_csv('test.csv', sep=',', keep_default_na=False)
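As a minimal sketch (assuming the file from your question is at filepath), keep_default_na=False keeps the literal 'NULL' strings, so they round-trip through to_csv unchanged:
import pandas as pd
# keep_default_na=False stops pandas from turning the literal string "NULL" into NaN
df = pd.read_csv(filepath, sep=",", keep_default_na=False)
print(df.iloc[0].tolist())            # the 'NULL' values are still plain strings
# nothing was converted to NaN, so the NULLs are written back out as-is
df.to_csv("C:\\tmp\\test.csv", index=False)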
Related
I have a Pandas dataframe that I'm outputting to csv. I would like to keep the data types (i.e. not convert everything to string). I need to format the date properly and there are other non-float columns.
How do I remove trailing zeros from the floats while not changing datatypes? This is what I've tried:
pd.DataFrame(myDataFrame).to_csv("MyOutput.csv", index=False, date_format='%m/%d/%Y', float_format="%.8f")
For example, this:
09/26/2022,43.27334000,2,111.37000000
09/24/2022,16.25930000,5,73.53000000
Should be this:
09/26/2022,43.27334,2,111.37
09/24/2022,16.2593,5,73.53
Any help would be greatly appreciated!
You can write your file like this, without the float_format argument. Also, if the myDataFrame variable is already a DataFrame object, you don't need the pd.DataFrame(...) wrapper; you can just do the following:
myDataFrame.to_csv("MyOutput.csv", index=False, date_format='%m/%d/%Y')
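For illustration, here is a minimal sketch with made-up column names (date, a, n and b are assumptions, not your real headers) showing that the default float formatting drops the trailing zeros while the integer column stays an integer:
import pandas as pd
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-09-26", "2022-09-24"]),
    "a": [43.27334, 16.2593],
    "n": [2, 5],
    "b": [111.37, 73.53],
})
# without float_format, pandas writes each float with its shortest repr
df.to_csv("MyOutput.csv", index=False, date_format="%m/%d/%Y")
# MyOutput.csv now contains e.g. 09/26/2022,43.27334,2,111.37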
I am importing a file that is semicolon delimited. My code:
df = pd.read_csv('bank-full.csv', sep = ';')
print(df.shape)
When I use this in Jupyter Notebooks and Spyder I get a shape output of (45211, 1). When I print my dataframe the data looks like this at this point:
<bound method NDFrame.head of age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"
0 58;"management";"married";"tertiary";"no";2143...
I can get the correct shape by using
df = pd.read_csv('bank-full.csv', sep = '[;]')
print(df.shape)
or
df = pd.read_csv('bank-full.csv', sep = '\;')
print(df.shape)
However, when I do this the data seems to get pulled in as though each row were a single string. The first and last columns get wrapped in leading and trailing double quotes respectively, and nothing I try strips them, so either way I am stuck with many of my columns typed as object and unable to force them into integers when needed. My data comes out like this:
"age ""job"" ""marital"" ""education"" ""default"" \
0 "58 ""management"" ""married"" ""tertiary"" ""no""
with final column:
""y"""
0 ""no"""
I have reached out to those in my class and had them send me their .csv file, restarted from scratch, tried a different UI, and even copy/pasted their line of code to read and shape the data and get nothing. I have used every resource except asking this here and am out of ideas.
CSVs are usually separated by commas, but sometimes the cells are separated by a different character or characters. Since I don't have access to your exact dataset, I will give you advice that should help you overall.
First, look at the CSV and assess what character(s) are separating each value, then use that as the value in "sep" during your pd.read_csv() call.
Then, whatever columns you want to convert to numeric, you can use pd.to_numeric() to convert the data type. This may present problems if any of the values in the column cannot be converted to numeric, and you will then need to do additional data cleaning.
Below is an example of how to do this to a particular column that I am calling "col":
import pandas as pd
df = pd.read_csv('bank-full.csv', sep = '[;]')
df['col'] = pd.to_numeric(df['col'])
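If some values cannot be parsed (the additional data cleaning mentioned above), one option, shown here as a sketch rather than as part of the original answer, is errors='coerce', which turns unconvertible values into NaN so you can inspect them:
df['col'] = pd.to_numeric(df['col'], errors='coerce')
print(df[df['col'].isna()])    # the rows whose values could not be converted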
Let me know if you have further questions, or better yet, share the data with me if you can't get this to work for you.
I have a column in a dataframe that has values in the format XX/XX (Ex: 05/23, 4/22, etc.) When I convert it to a csv, it converts to a date. How do I prevent this from happening?
I tried putting an equals sign in front, but then it executes like division (e.g. =4/20 comes out to 0.2).
df['unique_id'] = '=' + df['unique_id']
I want the output to be in the original format XX/XX (Ex: 5/23 stays 5/23 in the csv file in Excel).
Check the data types of your dataframe with df.dtypes. I assume your column is being interpreted as a date. Then you can do df[col] = df[col].astype(np_type_you_want).
If that doesn't bring the desired result, check why the column is interpreted as a date when the df is created. The solution depends on where you get the data from.
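As a minimal sketch (assuming the column is called unique_id, as in your question):
print(df.dtypes)                                   # see how pandas parsed each column
df['unique_id'] = df['unique_id'].astype(str)      # force plain strings before writing to CSV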
The issue is not with Python or pandas. The issue is that Excel thinks it's clever and assumes it knows your data type. You were close with trying to put an = before your data, but the data also needs to be wrapped in quotes and prefixed with the =. I can't claim to have come up with this answer myself; I obtained it from this answer.
The following code will write a CSV file that then opens in Excel without any formatting trying to convert it to a date or execute division. However, it should be noted that this is only really a strategy if you will only ever open the CSV in Excel, since you are wrapping formatting info around your data which Excel then strips out. If you are using this CSV in any other software you might need to rethink it.
import pandas as pd
import csv
data = {'key1': [r'4/5']}
df = pd.DataFrame.from_dict(data)
df['key1'] = '="' + df['key1'] + '"'
print(df)
print(df.dtypes)
with open(r'C:\Users\cd00119621\myfile.csv', 'w') as output:
    df.to_csv(output)
RAW OUTPUT in file
,key1
0,"=""4/5"""
EXCEL OUTPUT (screenshot omitted; the cell displays as the literal text 4/5 rather than being converted to a date)
I have an Excel file, and in it one row of the column Model has the value "9-3", which is a string. I double-checked the Excel file to make sure the column datatype is plain text instead of Date. But still, when I use read_excel and convert it into a data frame, the value is shown as 2017-09-03 00:00:00 instead of the string "9-3".
Here is how I read the excel file:
table = pd.read_excel('ManualProfitAdjustmentUpdates.xlsx' , header=0, converters={'Model': str})
Any idea why pandas is not treating the value as a string even when I set the converter to str?
The plain-text setting in the Excel file affects only how the data is shown in Excel.
The str setting in the converter affects only how pandas treats the data it receives, which is already a date.
To force the excel file to return the data as string, the cell's first character should be an apostrophe.
Change "9-3" to "'9-3".
The problem may be with Excel. Make sure the entire column is stored as text, not just the single value you are talking about. If Excel ever saved the column as a date, it will store a year in that cell no matter what is displayed or what the datatype is changed to. pandas is going to read the entire column as one data type, so if you have dates above 9-3 they will be converted as well. Changing dates to strings without years can be tricky. It may be better to save the Excel sheet as a CSV once it is in the proper format you like and then use pd.read_csv(). I made a test workbook "book1.xlsx":
9-3 1 Hello
12-1 2 World
1-8 3 Test
Then ran
import pandas as pd
df = pd.read_excel('book1.xlsx',header=0)
print(df)
and got back my data frame correctly. Thus, I am led to believe it is Excel. Sorry this isn't the best answer, but I don't believe it is a pandas error.
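If you do go the CSV route, a hedged sketch (the file name ManualProfitAdjustmentUpdates.csv is an assumption about what you would save the sheet as) could be:
import pandas as pd
# dtype keeps the Model column as text such as '9-3' instead of letting pandas guess
df = pd.read_csv('ManualProfitAdjustmentUpdates.csv', dtype={'Model': str})
print(df.dtypes)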
I've looked for help on this one and didn't find the answer (I'm sure I'm asking the wrong question).
I have a CSV file with dates in it, and when I read it in, the date conversion doesn't happen.
import pandas as pd
df = pd.read_csv('file', index_col='Sequence', parse_dates='Date')
CSV file
Sequence,Date,Unit,Name,Indexed,Arbitrated,Redo
1,2013-01-01,Aloha,first last,831,0,0
df.Date is a bunch of strings not datetime values
You need to pass the column to parse as a list, not a string:
df = pd.read_csv('file', index_col='Sequence', parse_dates=['Date'])
The docstring explanation for parse_dates says "list of ints or names", as in this way you can specify multiple columns to parse. But I have to agree that for one column it is a bit surprising.
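A quick check against the one-row CSV shown above (a sketch) confirms the dtype once the column name is passed as a list:
import pandas as pd
df = pd.read_csv('file', index_col='Sequence', parse_dates=['Date'])
print(df['Date'].dtype)    # datetime64[ns] instead of object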