python pandas Ignore NaN in integer comparisons

I am trying to create dummy variables based on integer comparisons in series where NaN is common. A > comparison raises errors if there are any NaN values, but I want the comparison to return NaN. I understand that I could use fillna() to replace NaN with a value that I know will be false, but I would hope there is a more elegant way to do this. I would need to change the value in fillna() if I used less than, or used a variable that could be positive or negative, and that is one more opportunity to create errors. Is there any way to make 30 < NaN evaluate to NaN?
To be clear, I want this:
df['var_dummy'] = df[df['var'] >= 30].astype('int')
to return a null if var is null, 1 if it is 30+, and 0 otherwise. Currently I get ValueError: cannot reindex from a duplicate axis.

Here's a way:
s1 = pd.Series([1, 3, 4, 2, np.nan, 5, np.nan, 7])
s2 = pd.Series([2, 1, 5, 5, np.nan, np.nan, 2, np.nan])
(s1 < s2).mask(s1.isnull() | s2.isnull(), np.nan)
Out:
0 1.0
1 0.0
2 1.0
3 1.0
4 NaN
5 NaN
6 NaN
7 NaN
dtype: float64
This masks the boolean array returned from (s1 < s2) wherever either s1 or s2 is NaN, returning NaN in those positions. But you cannot have NaNs in a boolean array, so the result is cast to float.
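Applied to the original dummy-variable question, a minimal sketch might look like this (the column name 'var' is taken from the question; the sample values are made up):
import numpy as np
import pandas as pd
df = pd.DataFrame({'var': [10.0, 35.0, np.nan, 42.0]})  # illustrative values
# build the 0/1 dummy, then mask it back to NaN where 'var' is missing
df['var_dummy'] = (df['var'] >= 30).astype(int).mask(df['var'].isnull(), np.nan)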

Solution 1
df['var_dummy'] = 1 * df.loc[~pd.isnull(df['var']), 'var'].ge(30)
Solution 2
df['var_dummy'] = df['var'].apply(lambda x: np.nan if x!=x else 1*(x>30))
x != x is True only when x is NaN, so it is equivalent to math.isnan(x)
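A quick sanity check of that property (the values here are purely illustrative):
import math
import numpy as np
x = np.nan
print(x != x)          # True: only NaN compares unequal to itself
print(math.isnan(x))   # True
print(3.0 != 3.0)      # False for any ordinary number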

You can use the notna() method. Here is an example:
import pandas as pd
list1 = [12, 34, -4, None, 45]
list2 = ['a', 'b', 'c', 'd', 'e']
# Calling DataFrame constructor on above lists
df = pd.DataFrame(list(zip(list1, list2)), columns =['var1','letter'])
#Assigning new dummy variable:
df['var_dummy'] = df['var1'][df['var1'].notna()] >= 30
# or you can also use: df['var_dummy'] = df.var1[df.var1.notna()] >= 30
df
Will produce the below output:
var1 letter var_dummy
0 12.0 a False
1 34.0 b True
2 -4.0 c False
3 NaN d NaN
4 45.0 e True
So the new dummy variable has NaN value for the original variable's NaN rows.
The only thing that does not match your request is that the dummy variable takes False and True values instead of 0 and 1, but you can easily reassign the values.
One thing you cannot change, however, is that the new dummy variable has to be of float type, because it contains NaN, which is itself a special float value.
More information about the NaN float value can be found here:
How can I check for NaN values?
and here:
https://towardsdatascience.com/navigating-the-hell-of-nans-in-python-71b12558895b
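Following up on the reassignment note above, a minimal sketch (assuming the df built in this answer): casting the mixed True/False/NaN column to float turns True into 1.0 and False into 0.0 while keeping NaN:
# convert True/False to 1.0/0.0; NaN rows stay NaN (column remains float)
df['var_dummy'] = df['var_dummy'].astype(float)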

Related

float type appears when using the replace method in a pandas dataframe

import pandas as pd
import numpy as np
df = pd.DataFrame(
    [
        [np.nan, 'None', 3],
        [np.nan, 5, 6],
        [7, 8, 9]
    ], columns=['a', 'b', 'c']
)
df.replace({np.nan: None}, inplace=True)
print(df)
df.replace({'None': None}, inplace=True)
print(df)
a b c
0 None None 3
1 None 5 6
2 7 8 9
a b c
0 NaN NaN 3
1 NaN 5.0 6
2 7.0 8.0 9
This is a small example of my case.
I want to replace both NaN and the string "None" with None, so I call replace twice.
The first replace works as I expected, but NaN comes back in the second replace and all the ints are changed to float because of the NaN. I have no idea why NaN reappears after df.replace({'None': None}, inplace=True). How can I fix it?
If you want integers in a column with missing values you need to use pd.NA instead. NaN is a float and will force an array of integers to become floating point. Check out the documentation.
Solution
df = pd.DataFrame(
    [
        [np.nan, None, 3],
        [np.nan, 5, 6],
        [7, 8, 9]
    ],
    columns=['a', 'b', 'c'],
)
# replace np.nan with pd.NA
# then convert columns types to Int32
df.fillna(pd.NA).astype('Int32')
Out[11]:
a b c
0 <NA> <NA> 3
1 <NA> 5 6
2 7 8 9
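Putting it together for the original two-replace question, a sketch (starting from the question's original df, which contains both np.nan and the string 'None', and assuming nullable integers are the desired end state):
# replace the string 'None' with NaN, fill all NaNs with pd.NA, then convert to nullable Int32
out = df.replace({'None': np.nan}).fillna(pd.NA).astype('Int32')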
Another option is to change the dtype to object:
out = df.replace({'None': None}).astype(object)
Out[10]:
a b c
0 NaN NaN 3
1 NaN 5 6
2 7 8 9
It sounds like you want the column type to be an integer instead of a float. You can use the nullable integer dtype introduced in pandas version 0.24.0.
A column with a regular integer dtype is automatically converted to a float dtype if it gets a null value. If you use the pandas nullable integer dtype instead, the column will not become a float and the null value will be represented as <NA>, the pandas.NA value.
Read more in the docs.
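For example, a minimal sketch of constructing a nullable-integer column directly (the values are illustrative):
import pandas as pd
s = pd.Series([1, 2, None], dtype='Int64')  # capital-I nullable integer dtype
print(s)
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64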

delete columns where values are not increasing in pandas

I have a df where some columns have values which are increasing and other columns have values which are either decreasing or not changing. I want to delete those columns. I tried to use is_monotonic, but that returns True when the values are increasing and does not let me exclude columns where the values remain the same.
data = [{'a': 1, 'b': 2, 'c':33}, {'a':10, 'b': 2, 'c': 30}]
df = pd.DataFrame(data)
In the above example I want to keep only column 'a', as the other two columns have values that stay the same or decrease. Can anyone help me please?
Get the difference of all columns, remove the first row (which is all NaNs), and test whether all remaining values are greater than 0:
df = df.loc[:, df.diff().iloc[1:].gt(0).all()]
print (df)
a
0 1
1 10
Details:
print (df.diff())
a b c
0 NaN NaN NaN
1 9.0 0.0 -3.0
print (df.diff().iloc[1:])
a b c
1 9.0 0.0 -3.0
print (df.diff().iloc[1:].gt(0))
a b c
1 True False False
print (df.diff().iloc[1:].gt(0).all())
a True
b False
c False
dtype: bool
Or, as mentioned in the comments, invert the logic: test whether any difference is less than or equal to 0 and negate the mask with ~:
df = df.loc[:, ~df.diff().le(0).any()]
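An alternative sketch, not from the answer above, applied to the question's original df: combining is_monotonic_increasing (non-strict) with is_unique amounts to a strictly-increasing check per column:
# keep only columns whose values are strictly increasing
keep = df.apply(lambda col: col.is_monotonic_increasing and col.is_unique)
df = df.loc[:, keep]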

Pandas fillna() not filling values from series

I'm trying to fill missing values in a column in a DataFrame with the value from another DataFrame's column. Here's the setup:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [2, 3, 5, np.nan, np.nan],
    'b': [10, 11, 13, 14, 15]
})
df2 = pd.DataFrame({
    'x': [1]
})
I can of course do this and it works:
df['a'] = df['a'].fillna(1)
However, this results in the missing values not being filled:
df['a'] = df['a'].fillna(df2['x'])
And this results in an error:
df['a'] = df['a'].fillna(df2['x'].values)
How can I use the value from df2['x'] to fill in missing values in df['a']?
If you can guarantee df2['x'] only has a single element, then use .item:
df['a'] = df['a'].fillna(df2.values.item())
Or,
df['a'] = df['a'].fillna(df2['x'].item())
df
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15
Otherwise, this isn't possible unless they're either the same length and/or index-aligned.
As a rule of thumb, either
pass a scalar, or
pass a dictionary mapping the index of the NaN value to its replacement value (e.g., df.a.fillna({3 : 1, 4 : 1})), or
pass an index-aligned series (see the sketch below).
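A minimal sketch of the index-aligned case (the fill series and its values are made up for illustration):
# the fill values must share df['a']'s index so they line up with the NaNs
fill = pd.Series(1, index=df.index)
df['a'] = df['a'].fillna(fill)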
I think one general solution is to select the first value with [0] to get a scalar:
print (df2['x'].values[0])
1
df['a'] = df['a'].fillna(df2['x'].values[0])
#similar solution for select by loc
#df['a'] = df['a'].fillna(df2.loc[0, 'x'])
print (df)
a b
0 2.0 10
1 3.0 11
2 5.0 13
3 1.0 14
4 1.0 15

Return list of indices/index where a min/max value occurs in a pandas dataframe

I'd like to search a pandas DataFrame for minimum values. I need the min in the entire dataframe (across all values), analogous to df.min().min(). However I also need to know the index of the location(s) where this value occurs.
I've tried a number of different approaches:
df.where(df == (df.min().min())),
df.where(df == df.min().min()).notnull(), and
val_mask = df == df.min().min(); df[val_mask].
These return a dataframe of NaNs for non-min values (or a boolean mask), but I can't figure out a way to get the (row, col) locations of the minima.
Is there a more elegant way of searching a dataframe for a min/max and returning a list containing all of the locations of the occurrence(s)?
import pandas as pd
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
list_of_lowest = []
for column_name, column in df.iteritems():
    if len(df[column == df.min().min()]) > 0:
        print(column_name, column.where(column == df.min().min()).dropna())
        list_of_lowest.append([column_name, column.where(column == df.min().min()).dropna()])
list_of_lowest
output: [['x', 2 -1.0
Name: x, dtype: float64]]
Based on your revised update:
In [209]:
keys = ['x', 'y', 'z']
vals = [[1,2,-1], [3,5,-1], [4,2,3]]
data = dict(zip(keys,vals))
df = pd.DataFrame(data)
df
Out[209]:
x y z
0 1 3 4
1 2 5 2
2 -1 -1 3
Then the following would work:
In [211]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna()
Out[211]:
x y
2 -1.0 -1.0
So this uses the boolean mask on the df:
In [212]:
df[df==df.min().min()]
Out[212]:
x y z
0 NaN NaN NaN
1 NaN NaN NaN
2 -1.0 -1.0 NaN
and we call dropna with thresh=1, which drops columns that don't have at least 1 non-NaN value:
In [213]:
df[df==df.min().min()].dropna(axis=1, thresh=1)
Out[213]:
x y
0 NaN NaN
1 NaN NaN
2 -1.0 -1.0
It's probably safer to call the second dropna with thresh=1 as well:
In [214]:
df[df==df.min().min()].dropna(axis=1, thresh=1).dropna(thresh=1)
Out[214]:
x y
2 -1.0 -1.0
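An alternative sketch, not from the answers above: stacking the frame gives a Series with a (row, column) MultiIndex, so the locations drop out directly:
s = df.stack()
locations = s[s == s.min()].index.tolist()  # list of (row, column) tuples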

Replace values in a dataframe column based on condition

I have a seemingly easy task. Dataframe with 2 columns: A and B. If values in B are larger than values in A, replace those values with the values of A. I used to do this with df.B[df.B > df.A] = df.A, however a recent upgrade of pandas started giving a SettingWithCopyWarning when encountering this chained assignment. The official documentation recommends using .loc.
Okay, I said, and did it with df.loc[df.B > df.A, 'B'] = df.A, and it all works fine, unless column B contains only NaN values. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfies the condition, then all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, an array, or a scalar, and lhs could be a slice, a tuple, a scalar, or an array,
and there is a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs (this is a bit complicated). For example, say you don't set all of the elements on the lhs and the column was integer; then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float).
Suffice to say this was a missing edge case.
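As a workaround sketch for the original replacement (not from this answer): Series.mask sidesteps the .loc assignment and leaves an all-NaN column untouched, since NaN > A evaluates to False:
# replace B with A wherever B exceeds A; rows where B is NaN stay NaN
df['B'] = df['B'].mask(df['B'] > df['A'], df['A'])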
