I'm trying to convert a pandas DataFrame to Parquet, but I'm getting the error ("Expected bytes, got a 'int' object", 'Conversion failed for column xxxxxxxx with type object').
The table in Excel mixes numbers and strings, so the column comes in as dtype 'object', and the conversion still fails. I've tried df['xxxxxxxx'].astype(str) and df['xxxxxxxx'].astype('data_type'), but neither works.
I tried to convert to Parquet with both AWS Wrangler and PyArrow,
As mentioned in this other question
A general type for the whole column could work, so try:
df['xxxxxxxx'] = df['xxxxxxxx'].astype(str)
df.to_parquet(path)
However, this is not good practice, as it hides the type error. Consider fixing the column's type by separating the data, or at least be aware that this column has mixed types. Pandas includes a warning for this kind of error:
Columns (# of column) have mixed types. Specify dtype option on import or set low_memory=False.
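As a minimal sketch (the column name and values here are made up to match the question), casting a mixed int/str column to str gives PyArrow a single, unambiguous type:

```python
import pandas as pd

# Hypothetical data: mixing ints and strings forces dtype 'object'
df = pd.DataFrame({"xxxxxxxx": [1, "two", 3]})
print(df["xxxxxxxx"].dtype)  # object

# Cast everything to str so the column has one consistent type
df["xxxxxxxx"] = df["xxxxxxxx"].astype(str)
# df.to_parquet("out.parquet")  # should now succeed with the pyarrow engine
```

Note this turns the integers into the strings "1" and "3", which is exactly the information loss the warning above is about.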
Did you try:
df['xxxxxxxx'] = df['xxxxxxxx'].astype(bytes)
Hi, I have looked on Stack Overflow but haven't found a solution for my problem. Any help is highly appreciated.
After importing a csv I noticed that all the columns have dtype object rather than float.
My goal is to convert all columns except the YEAR column to float. I have read that you first have to strip the columns to remove blanks, then convert NaNs to 0, and then convert the strings to floats. But with the code below I'm getting an error.
My code in Jupyter notes is:
And I get the following error.
How do I have to change the code?
All columns except the YEAR column have to be set to float.
If you could also help me set the YEAR column to datetime, that would be very nice. But my main problem is getting the data right so I can start making calculations.
Thanks
Runy
Easiest would be
df = df.astype(float)
df['YEAR'] = df['YEAR'].astype(int)
Also, your code fails because you have two columns with the same name BBPWN, so df['BBPWN'] returns a DataFrame containing both columns, and df['BBPWN'].str then fails.
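To make the duplicate-label point concrete, here's a small sketch (values invented):

```python
import pandas as pd

# Two columns deliberately share the label 'BBPWN'
df = pd.DataFrame([["1 ", "2 ", "2001"]], columns=["BBPWN", "BBPWN", "YEAR"])

sub = df["BBPWN"]   # a duplicated label selects BOTH columns at once
print(type(sub))    # a DataFrame, not a Series, so sub.str raises AttributeError

# the float cast itself works even with the duplicate labels
df = df.astype(float)            # float parsing tolerates the stray whitespace
df["YEAR"] = df["YEAR"].astype(int)
```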
I'm reading an Excel file into Python and then splitting the data into numbers (integers and floats) and everything else. There are numerous columns of different types.
I currently bring in the data with
pd.read_excel
and then split the data up with
DataFrame.select_dtypes("number")
When users upload a time (say 12:30:00) they expect it to be recognized as a time; however, pandas currently treats it as dtype object.
If I specify the column with parse_dates then it works, but since I don't know the data in advance, I ideally want this to happen automatically. I've tried setting parse_dates=True, but it doesn't seem to make a difference.
I'm not sure if there is a way to recognize the datetime after the file is uploaded. Again, I would want this done without having to specify the column (so anything that can be converted is).
Many Thanks
If your data contains only one column with dtype object (I assume it is a string) you can do the following:
1) filter the column with dtype object
import pandas as pd
datatime_col = df.select_dtypes(object)
2) convert it to seconds
datetime_col_in_seconds = pd.to_timedelta(datatime_col.loc[0]).dt.total_seconds()
Then you can re-append the converted column to your original data and/or do whatever processing you want.
Eventually, you can convert it back to datetime.
datetime_col = pd.to_datetime(datetime_col_in_seconds, unit='s')
If you have more than one column with dtype object you might have to do some more pre-processing, but I think this is a good way to start tackling your particular case.
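The steps above can be sketched end to end (the column name here is invented):

```python
import pandas as pd

df = pd.DataFrame({"time_col": ["12:30:00", "01:05:10"]})  # read in as object

# 1) filter the object-typed column(s)
obj_cols = df.select_dtypes(object)

# 2) parse the strings as timedeltas and express them in seconds
seconds = pd.to_timedelta(obj_cols["time_col"]).dt.total_seconds()
print(seconds.iloc[0])  # 45000.0 for 12:30:00

# optionally convert back to datetimes (anchored at the Unix epoch)
as_dt = pd.to_datetime(seconds, unit="s")
```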
This does what I need
for column_name in df.columns:
    try:
        df.loc[:, column_name] = pd.to_timedelta(df.loc[:, column_name].astype(str))
    except ValueError:
        pass
This tries to convert every column to a timedelta format. If a column can't be converted, pandas raises a ValueError, which is caught before moving on to the next column.
After being run any columns that could be recognized as a timedelta format are transformed.
When I import a csv file in pandas, I get a DtypeWarning:
Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
How do I find out what the dtype of each cell is? I think there might be an issue with the data, which is why the warning appears, but the file has ~5 million rows so it's hard to identify the culprit.
Is it good practice to specify dtype on import? And if that is done, will it not result in "loss" of data?
I agree with piRSquared. Just adding to his comments, I had a similar problem. My column was supposed to have string values, but one value was a float value (with a NaN value).
There are some things you can do to help with your analysis. Suppose your dataframe is df. You can check each column's type with:
df.dtypes
For each column of type 'object', you can inspect further by recording each cell's type:
df['type'] = df['mycolumn'].apply(lambda x: type(x).__name__)
If your column is supposed to be string valued, you can check which cells are not string with:
df[df.type != 'str']
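Putting those steps together on a made-up column:

```python
import pandas as pd

# Hypothetical column that should be all strings but has a stray float
df = pd.DataFrame({"mycolumn": ["a", "b", 3.5]})
print(df["mycolumn"].dtype)  # object

# Record the concrete Python type of each cell
df["type"] = df["mycolumn"].apply(lambda x: type(x).__name__)

# The rows breaking the expected dtype
offenders = df[df["type"] != "str"]
print(offenders)
```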
I have a dataframe in pandas that I'm reading in from a csv.
One of my columns has values that include NaN, floats, and scientific notation, i.e. 5.3e-23
My trouble is that as I read in the csv, pandas treats these data as object dtype, not the float32 it should be. I guess that's because it thinks the scientific-notation entries are strings.
I've tried to convert the dtype using df['speed'].astype(float) after it's been read in, and tried to specify the dtype as it's being read in using df = pd.read_csv('path/test.csv', dtype={'speed': np.float64}, na_values=['n/a']). This throws the error ValueError: cannot safely convert passed user dtype of <f4 for object dtyped data in column ...
So far neither of these methods have worked. Am I missing something that is an incredibly easy fix?
This question seems to suggest I can specify known values that might throw an error, but I'd prefer to convert the scientific notation back to a float if possible.
EDITED TO SHOW DATA FROM CSV AS REQUESTED IN COMMENTS
7425616,12375,28,2015-08-09 11:07:56,0,-8.18644,118.21463,2,0,2
7425615,12375,28,2015-08-09 11:04:15,0,-8.18644,118.21463,2,NaN,2
7425617,12375,28,2015-08-09 11:09:38,0,-8.18644,118.2145,2,0.14,2
7425592,12375,28,2015-08-09 10:36:34,0,-8.18663,118.2157,2,0.05,2
65999,1021,29,2015-01-30 21:43:26,0,-8.36728,118.29235,1,0.206836151554794,2
204958,1160,30,2015-02-03 17:53:37,2,-8.36247,118.28664,1,9.49242000872744e-05,7
384739,,32,2015-01-14 16:07:02,1,-8.36778,118.29206,2,Infinity,4
275929,1160,30,2015-02-17 03:13:51,1,-8.36248,118.28656,1,113.318511172611,5
It's hard to say without seeing your data, but it seems the problem is that your rows contain something other than numbers and 'n/a' values. You could load your dataframe and then convert it to numeric, as shown in the answers to that question. If you have pandas version >= 0.17.0 you can use the following:
df1 = df.apply(pd.to_numeric, args=('coerce',))
Then you could drop row with NA values with dropna or fill them with zeros with fillna
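A minimal sketch of that approach (values loosely modelled on the sample rows; errors="coerce" is the keyword form of the positional 'coerce' above):

```python
import pandas as pd

# Object-dtype column mixing plain floats, scientific notation, and junk
s = pd.Series(["5.3e-23", "0.14", "n/a"], dtype=object)

nums = pd.to_numeric(s, errors="coerce")  # unparseable values become NaN
clean = nums.fillna(0)                    # or nums.dropna() to discard them
print(clean)
```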
I realised it was the Infinity value causing the issue in my data. Removing it with a find-and-replace worked.
@Anton Protopopov's answer also works, as did @DSM's comment pointing out that I wasn't assigning the result: df['speed'] = df['speed'].astype(float).
Thanks for the help.
In my case, using pandas' round() method worked.
df['column'] = df['column'].round(2)
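For instance (made-up values), once the column is numeric, round() trims it to a fixed precision:

```python
import pandas as pd

df = pd.DataFrame({"column": [1.23456, 2.71828]})
df["column"] = df["column"].round(2)  # round to 2 decimal places
print(df["column"].tolist())
```

Note that round() only helps after the column is already numeric; it won't fix an object column that still contains strings.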