pandas DataFrame combine_first method converts boolean to floats - python

I'm running into a strange issue where the combine_first method causes values stored as bool to be upcast to float64.
Example:
In [1]: import pandas as pd
In [2]: df1 = pd.DataFrame({"a": [True]})
In [3]: df2 = pd.DataFrame({"b": ['test']})
In [4]: df2.combine_first(df1)
Out[4]:
     a     b
0  1.0  test
This problem was already reported in a post three years ago: pandas DataFrame combine_first and update methods have strange behavior. That issue was reportedly fixed, but I still see this behaviour under pandas 0.18.1.
thank you for your help

Somewhere along the chain of events that produces the combined dataframe, potential missing values have to be addressed. I'm aware that nothing is actually missing in your example, but None and np.nan are neither int nor bool. So, to have a common dtype that can hold both a bool and a None or np.nan, the column has to be cast to either object or float. As float, a large number of operations become far more efficient, so it is a decent choice. It obviously isn't the best choice all of the time, but a choice has to be made nonetheless, and pandas tries to infer the best one.
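A minimal sketch of that upcast (nothing beyond stock pandas assumed; the exact resulting dtype varies by version): as soon as a bool column has to represent a missing value, it can no longer stay bool.
import numpy as np
import pandas as pd

s = pd.Series([True, False])
print(s.dtype)                       # bool
# Reindexing introduces a missing slot; np.nan cannot live in a bool array,
# so pandas upcasts the column (to object or float64, depending on version)
print(s.reindex([0, 1, 2]).dtype)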
A workaround:
Setup
df1 = pd.DataFrame({"a": [True]})
df2 = pd.DataFrame({"b": ['test']})
df3 = df2.combine_first(df1)
df3
Solution
# Collect the original dtype of each column from df1 and df2, then restore them on df3
dtypes = df1.dtypes.combine_first(df2.dtypes)
for k, v in dtypes.iteritems():   # use .items() on pandas >= 2.0
    df3[k] = df3[k].astype(v)
df3
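A quick check that the loop restores the original dtypes (exact output may vary slightly by pandas version):
print(df3.dtypes)
# expected: a -> bool, b -> object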

I ran into the same issue. This specific case does not seem to be fixed in Pandas yet. I've filed a bug report:
https://github.com/pandas-dev/pandas/issues/20699

Clarification about Replace and Dtypes with Pandas

This is a strange problem, and I have not been able to produce a minimal reproducible example (MVE).
I have two datasets in pandas. They contain some Series that can have three values: "Yes", "No", and NaN. These Series have dtype object at this point.
I want to remove the NaNs from them, and to prepare them to be used by ML algorithms, so I do this:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0})
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0})
In final_df1 the dtype of the Series I mentioned automatically becomes int64 after dropping the NaN values and replacing. In final_df2 this does not happen, even though both contain the same values (0 and 1), so I really do not understand it.
In order to create a minimal reproducible example, I tried to:
Isolate the Series, do the transformation on them one by one and check the results
Take only a small portion of the Dataframes
Save the DFs on disk and work on them from another script to recreate the problem
But in none of those attempts could I reproduce the difference: either both DFs ended up with object dtype Series, or both ended up with int64.
This matters to me because later on I need the dummies of those DFs, and if a Series is int in one DF but object in the other, the dummy columns will not match. That problem is easy to solve by casting explicitly, but I still have one doubt I would like to confirm:
If I replace the contents of an object Series (without NaNs) with numbers, can it non-deterministically be cast to int64?
I see this as the only explanation for what I am facing.
Thanks in advance. If you find any way to clarify my question, please edit or comment
EDIT 1: Screenshots from Spyder
This is the code. I am printing the most relevant data in the console: dtype, values, and number of nulls.
This is the output before the drop/replace. I could have printed something nicer to read, but the idea is simple: before the drop/replace they both have null values, they both have "Yes" and "No" values, and they are both object dtype Series.
And this is after the drop/replace. As you can see, neither has nulls now, they both contain 1/0, but one of them is an object Series and the other is an int64 Series.
I really do not understand: they were the same type before!
Here is a sample to reproduce.
If you change the '0' in col_1 to the integer 0, the dtype changes: with the string '0' left in place, col_1 still contains a string after the replace and therefore stays object, while col_2 is fully converted to integers and becomes int64.
import pandas as pd
import numpy as np
data = {'col_1': ['Yes', 'No', np.nan, '0'], 'col_2': [np.nan, 'Yes', 'Yes', 'No']}
df=pd.DataFrame.from_dict(data)
d1=df[['col_1']]
d2=df[['col_2']]
print(d1.dtypes)
print(d2.dtypes)
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0})
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0})
print(final_df1.dtypes)
print(final_df2.dtypes)
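To see why the two columns end up with different dtypes, a quick check of the element types after the replace (building on the sample above) can help:
# col_1 still holds the string '0', so the column stays object;
# col_2 was fully replaced with integers, so pandas can use an integer dtype
print(final_df1['col_1'].map(type).unique())
print(final_df2['col_2'].map(type).unique())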
You can also convert the data types directly in the final_df definition:
final_df1 = d1.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)
final_df2 = d2.dropna(how='any').replace({"Yes":1, "No":0}).astype(int)

Memory handling in pandas DataFrame

I need to perform some operations on a pandas DataFrame in order to evaluate some measure, but leave my original DataFrame as is. So I thought I should start by duplicating it in memory:
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3]})
df2 = pd.DataFrame(df1)
When printing
print(id(df1), id(df2))
I get two different memory addresses, so to my mind these are two different instances of DataFrame.
However, if I do:
df2['b'] = [4,5,6]
print(df1)
df1 appears with a 'b' column, although I only added it in df2.
Why is this happening?
How can I really duplicate my DataFrame so that operations on one do not modify the other?
I am on Python 3.5 and pandas 0.24.2
You need to use pd.DataFrame.copy
df2 = df1.copy()
An assignment, even to a new variable, still references the same data/indices in memory, which means a manipulation of df1 or df2 changes the same underlying data. With copy, however, df2 gets its own copy of the data that can be manipulated independently.
Explanation:
Why do you get two different memory addresses when calling pd.DataFrame on an existing DataFrame?
Simply put, pandas.DataFrame is a wrapper around numpy.ndarray. When you call pd.DataFrame with the df1 dataframe as input, a new pd.DataFrame wrapper is created (hence the different memory address), but the underlying data is exactly the same. You can verify that with the following code:
In [2]: import pandas as pd
...: df1 = pd.DataFrame({'a':[1,2,3]})
...: df2 = pd.DataFrame(df1)
...:
In [3]: print(id(df1), id(df2))
(4665009296, 4665009360)
In [4]: df1._data
Out[4]:
BlockManager
Items: Index([u'a'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
In [5]: id(df1._data)
Out[5]: 4522343248
In [6]: id(df2._data)
Out[6]: 4522343248
As you can see, the memory address for df1._data and df2._data is exactly the same.
This is also clear when you read the DataFrame source code on GitHub: at the beginning of the constructor, the new dataframe simply references the same data:
if isinstance(data, DataFrame):
    data = data._data
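For contrast, a small sketch of the independent-copy behaviour the question is after (using copy, as recommended above):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = df1.copy()          # deep copy by default: new data blocks in memory

df2['b'] = [4, 5, 6]
print(df1)                # unchanged: only column 'a'
print(df2)                # has columns 'a' and 'b'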

pandas dataframe copy of slice warning

I'm fairly new to pandas, and was getting the infamous SettingWithCopyWarning in a large piece of code. I boiled it down to the following:
import pandas as pd
df = pd.DataFrame([[0,3],[3,3],[3,1],[1,1]], columns=list('AB'))
df
df = df.loc[(df.A>1) & (df.B>1)]
df['B'] = 10
When I run this I get the warning:
__main__:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The strange thing is that if I leave off the "df" line it runs without a warning. Is this intended behavior?
In general, if I want to filter a DataFrame by the values across various columns, do I need to do a copy() to avoid the SettingWithCopyWarning?
thanks very much
Assuming your DataFrame looks as below after the filtering in your question, this will avoid the SettingWithCopyWarning. There is a GitHub discussion with a solution suggested by pandas developer Jeff:
df
   A  B
1  3  3
It is best to do it this way:
df['B'] = df['B'].replace(3, 10)
df
   A   B
1  3  10
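If you do want to keep the filter-then-assign pattern from the question, a common alternative (a sketch, not from the linked discussion) is to take an explicit copy after filtering, which also answers the copy() question above:
import pandas as pd

df = pd.DataFrame([[0, 3], [3, 3], [3, 1], [1, 1]], columns=list('AB'))

# .copy() makes the filtered frame independent of df, so the assignment
# below does not raise a SettingWithCopyWarning
filtered = df.loc[(df.A > 1) & (df.B > 1)].copy()
filtered['B'] = 10
print(filtered)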

append pandas dataframe automatically cast as float but want int

How do I get pandas to append an integer and keep the integer data type? I realize I can apply df.test.astype(int) to the entire column after I have put in the data, but if I could do it at the time I append the data, that seems like a better way. Here is a sample:
from bitstring import BitArray
import pandas as pd
df = pd.DataFrame()
test = BitArray('0x01')
test = int(test.hex)
print(test)
df = df.append({'test':test, 'another':5}, ignore_index=True)
print(df.test)
print(df.another)
Here is the output:
1
0 1.0
Name: test, dtype: float64
0 5.0
Name: another, dtype: float64
It is changing the integers to floats.
It's because your initial dataframe is empty. Initialize it with integer columns instead:
df = pd.DataFrame(dict(A=[], test=[], another=[]), dtype=int)
df.append(dict(A=3, test=4, another=5), ignore_index=True)
Had I instead started with an empty DataFrame, as in your example:
df = pd.DataFrame()
df.append(dict(A=3, test=4, another=5), ignore_index=True)
the appended values would have come back as float64.
You can use convert_dtypes if you are using pandas 1.0.0 or above. Refer to the linked documentation for a description of convert_dtypes:
df = df.convert_dtypes()
df = df.append({'test':test, 'another':5}, ignore_index=True)
As in this issue, df.append should retain columns type if same type #18359, the append method retains column types since pandas 0.23.0. So upgrading pandas to 0.23.0 or newer solves this problem.
There are two workarounds I found. The first is to upgrade to pandas >= 0.23.0. But if you cannot change the pandas version, for example in production code where a version change might affect other scripts, the one-liner below is a quick workaround:
df = df.astype(int)
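For newer pandas versions (append was removed in pandas 2.0), a rough equivalent using pd.concat, which keeps the integer dtype when both frames already have integer columns (a sketch with made-up values):
import pandas as pd

df = pd.DataFrame({'test': [1], 'another': [5]})        # int64 columns
new_row = pd.DataFrame({'test': [2], 'another': [7]})   # also int64

df = pd.concat([df, new_row], ignore_index=True)
print(df.dtypes)   # both columns stay int64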

Replacing pandas or numpy NaN with None to use with MySQLdb

I am trying to write a pandas dataframe (or a numpy array) to a MySQL database using MySQLdb. MySQLdb doesn't seem to understand 'nan', and my database throws an error saying nan is not in the field list. I need to find a way to convert the 'nan' into a NoneType.
Any ideas?
@bogatron has it right, you can use where; it's worth noting that you can do this natively in pandas:
df1 = df.where(pd.notnull(df), None)
Note: this changes the dtype of all columns to object.
Example:
In [1]: df = pd.DataFrame([1, np.nan])
In [2]: df
Out[2]:
0
0 1
1 NaN
In [3]: df1 = df.where(pd.notnull(df), None)
In [4]: df1
Out[4]:
0
0 1
1 None
Note: what you cannot do is recast the DataFrame's dtype to allow all data types (using astype) and then use the DataFrame fillna or replace method:
df1 = df.astype(object).replace(np.nan, 'None')
Unfortunately neither this, nor using replace, works with None; see this (closed) issue.
As an aside, it's worth noting that for most use cases you don't need to replace NaN with None, see this question about the difference between NaN and None in pandas.
However, in this specific case it seems you do (at least at the time of this answer).
df = df.replace({np.nan: None})
Note: For pandas versions <1.4, this changes the dtype of all affected columns to object.
To avoid that, use this syntax instead:
df = df.replace(np.nan, None)
Credit goes to this GitHub issue and Killian Huyghe's comment.
You can replace nan with None in your numpy array:
>>> import numpy as np
>>> x = np.array([1, np.nan, 3])
>>> y = np.where(np.isnan(x), None, x)
>>> print(y)
[1.0 None 3.0]
>>> print(type(y[1]))
<class 'NoneType'>
After stumbling around, this worked for me:
df = df.astype(object).where(pd.notnull(df),None)
Another addition: be careful when replacing multiple values and converting the type of the column back from object to float. If you want to be certain that your None values won't flip back to np.nan, apply @andy-hayden's suggestion of using pd.DataFrame.where.
Illustration of how replace can still go 'wrong':
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({"a": [1, np.NAN, np.inf]})
In [4]: df
Out[4]:
a
0 1.0
1 NaN
2 inf
In [5]: df.replace({np.NAN: None})
Out[5]:
a
0 1
1 None
2 inf
In [6]: df.replace({np.NAN: None, np.inf: None})
Out[6]:
a
0 1.0
1 NaN
2 NaN
In [7]: df.where((pd.notnull(df)), None).replace({np.inf: None})
Out[7]:
a
0 1.0
1 NaN
2 NaN
Just an addition to @Andy Hayden's answer:
DataFrame.mask is the opposite twin of DataFrame.where; they have exactly the same signature but opposite meaning:
DataFrame.where is useful for replacing values where the condition is False.
DataFrame.mask is used for replacing values where the condition is True.
So in this question, using df.mask(df.isna(), other=None, inplace=True) might be more intuitive.
Replacing np.nan with None is accomplished differently across different versions of pandas:
from packaging import version   # provides version.parse

if version.parse(pd.__version__) >= version.parse('1.3.0'):
    df = df.replace({np.nan: None})
else:
    df = df.where(pd.notnull(df), None)
This handles the issue that, for pandas versions < 1.3.0, if the values in df are already None then df.replace({np.nan: None}) toggles them back to np.nan (and vice versa).
Quite old, yet I stumbled upon the very same issue.
Try doing this:
df['col_replaced'] = df['col_with_npnans'].apply(lambda x: None if np.isnan(x) else x)
I believe the cleanest way would be to make use of the na_value argument in the pandas.DataFrame.to_numpy() method (docs):
na_value : Any, optional
The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
New in version 1.1.0.
You could e.g. convert to dictionaries with NaN's replaced by None using
columns = df.columns.tolist()
dicts_with_nan_replaced = [
    dict(zip(columns, x))
    for x in df.to_numpy(na_value=None)
]
Convert numpy NaN to pandas NA before replacing with the where statement:
df = df.replace(np.NaN, pd.NA).where(df.notnull(), None)
Do you have a code block to review by chance?
Using .loc, pandas can access records based on logical conditions (filtering) and act on them (when using =). Setting a .loc mask equal to some value changes the underlying array in place (so be a touch careful here; I suggest testing on a copy of the df before using it in your code block).
df.loc[df['SomeColumn'].isna(), 'SomeColumn'] = None
The outer operation is df.loc[row_label, column_label] = None. We build a boolean mask for row_label by using the .isna() method to find missing values in the column SomeColumn.
.isna() returns a boolean array over the rows of SomeColumn, which we use as our row_label: df['SomeColumn'].isna(). It isolates all rows where SomeColumn holds any of the missing-value types pandas checks for with .isna().
We use the column_label both when masking the dataframe for the row_label and to identify the column we want to act on in the .loc assignment.
Finally, we set the .loc selection equal to None, so the selected rows/records are changed to None based on the masked index.
Below are links to pandas documentation regarding .loc & .isna().
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html
After finding that neither the recommended answer nor the suggested alternative worked for my application after a pandas update to 1.3.2, I settled for safety with a brute-force approach:
import json

buf = df.to_json(orient='records')
recs = json.loads(buf)
Yet another option, that actually did the trick for me:
df = df.astype(object).replace(np.nan, None)
Astoundingly, none of the previous answers worked for me, so I had to do it for each column:
for column in df.columns:
    df[column] = df[column].where(pd.notnull(df[column]), None)
Doing it by hand is the only way that is working for me right now.
The answer from @rodney cox worked for me in almost every case.
The following code sets all columns to the object data type and then replaces any null value with None. Setting the column data type to object is crucial because it prevents pandas from changing the type further.
for col in df.columns:
    df[col] = df[col].astype(object)
    df.loc[df[col].isnull(), col] = None
Warning: this solution is not efficient, because it processes columns that might not contain any np.nan values.
Sometimes it is better to use this code. Note that np refers to numpy:
df = df.fillna(np.nan).replace([np.nan], [None])
This worked for me:
df = df.fillna(0)
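Putting the thread together for the original MySQLdb use case, a rough sketch (the table and column names are hypothetical, and it assumes one of the NaN-to-None conversions above has already been applied):
import MySQLdb
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})
df = df.astype(object).where(pd.notnull(df), None)   # NaN -> None

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='mydb')
cur = conn.cursor()
# 'my_table', 'a' and 'b' are placeholder names for illustration only
cur.executemany("INSERT INTO my_table (a, b) VALUES (%s, %s)",
                list(df.itertuples(index=False, name=None)))
conn.commit()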
