Delete values from pandas dataframe based on logical operation - python

I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but is a bit slow for a large dataframe, and I feel like there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
A B
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
Here is my current approach with apply and lambda. How can this be done without them?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5

Use a boolean mask against the df (the outputs below assume the df has only column 'A'):
In[21]:
df[df<3]
Out[21]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
Here, where the boolean condition is not met, False is returned; this just masks out the corresponding df values, replacing them with NaN.
If you actually want to overwrite df with this masked result, then self-assign:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
A
0 1
1 2
If you want NaN in the removed rows, you can use a trick: double square brackets return a single-column df, which we can then use to mask the df:
In[25]:
df[df[['A']]<3]
Out[25]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
If you have multiple columns, the above won't work because the boolean mask has to match the shape of the original df; in that case you can reindex against the original df index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
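The reindex can also be avoided by handing where a row-wise Series condition, which pandas broadcasts across the columns (a small sketch of the same result):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# Keep rows where A < 3; every other row becomes NaN in all columns
print(df.where(df['A'] < 3))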
EDIT
You've updated your question again; if you want to just overwrite the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
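As an aside, the same single-column overwrite can be written with Series.where or Series.mask, which are vectorised and avoid apply entirely (a small sketch):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})

# Keep values of A below 3, replace the rest with NaN; B is untouched
df['A'] = df['A'].where(df['A'] < 3)
# or equivalently: df['A'] = df['A'].mask(df['A'] >= 3)
print(df)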

Related

Create a new column in Pandas Dataframe based on the 'NaN' values in another column

I have a dataframe:
A B
1 NaN
2 3
4 NaN
5 NaN
6 7
I want to create a new column C containing the values from B that aren't NaN, otherwise the values from A. This would be a simple matter in Excel; is it easy in Pandas?
Yes, it's simple. Use pandas.Series.where:
df['C'] = df['A'].where(df['B'].isna(), df['B'])
Output:
>>> df
   A    B    C
0  1  NaN  1.0
1  2  3.0  3.0
2  4  NaN  4.0
3  5  NaN  5.0
4  6  7.0  7.0
A bit cleaner (this works here because C already exists; to create a brand-new column, use bracket assignment, df['C'] = ...):
df.C = df.A.where(df.B.isna(), df.B)
Alternatively:
df.C = df.B.where(df.B.notna(), df.A)
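For this particular fallback pattern, Series.fillna reads naturally as well (a sketch on the question's data; note the result is float because column B is float):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 4, 5, 6], 'B': [np.nan, 3, np.nan, np.nan, 7]})

# Take B where it has a value, otherwise fall back to A
df['C'] = df['B'].fillna(df['A'])
print(df)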

Is mask working differently using inplace?

I've just found out about this strange behaviour of mask; could someone explain it to me?
A)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3, inplace=True)
[output]
     A    B   C
0  NaN  NaN  hi
1  NaN  3.0  hi
2  4.0  5.0  hi
3  6.0  7.0  hi
4  8.0  9.0  hi
B)
[input]
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
[output]
     A    B    C
0  NaN  NaN  NaN
1  NaN  3.0  NaN
2  4.0  5.0  NaN
3  6.0  7.0  NaN
4  8.0  9.0  NaN
Thank you in advance
The root cause of the different results is that you pass a boolean DataFrame that does not have the same shape as the DataFrame you want to mask. df.mask() fills the missing part of the condition with the value of inplace.
From the source code, you can see that pandas.DataFrame.mask() calls pandas.DataFrame.where() internally. pandas.DataFrame.where() then calls a _where() method that replaces values where the condition is False.
Taking df.where() as the example, here is the example code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])
df1 = df.where(df[['A', 'B']]<3)
df.where(df[['A', 'B']]<3, inplace=True)
In this example, the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df[['A', 'B']]<3, the value of cond argument, is
A B
0 True True
1 False False
2 False False
3 False False
Digging into the _where() method, the following lines are the key part:
def _where(...):
    # align the cond to same shape as myself
    cond = com.apply_if_callable(cond, self)
    if isinstance(cond, NDFrame):
        cond, _ = cond.align(self, join="right", broadcast_axis=1)
    ...
    # make sure we are boolean
    fill_value = bool(inplace)
    cond = cond.fillna(fill_value)
Since the shapes of cond and df are different, cond.align() fills the missing column with NaN. After that, cond looks like
A B C
0 True True NaN
1 False False NaN
2 False False NaN
3 False False NaN
Then, with cond.fillna(fill_value), the NaN values are replaced with the value of inplace, so the condition for column C ends up equal to the inplace value.
There is still some code (L9048 and L9124-L9145) related to inplace, but we needn't care about the details: the aim of those lines is simply to replace values where the condition is False.
Recall that the df is
A B C
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
df1 = df.where(df[['A', 'B']]<3): the cond C column is False, since the default value of inplace is False. After df.where(), the df C column is set to the value of the other argument, which is NaN by default.
df.where(df[['A', 'B']]<3, inplace=True): the cond C column is True. After df.where(), the df C column stays the same.
# print(df1)
A B C
0 0.0 1.0 NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
# print(df) after df.where(df[['A', 'B']]<3, inplace=True)
A B C
0 0.0 1.0 2
1 NaN NaN 5
2 NaN NaN 8
3 NaN NaN 11
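To see the effect of that fill value without reading the internals, you can rebuild the aligned condition yourself with the public API (a sketch; reindex with fill_value stands in for what _where does):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(-1, 3), columns=['A', 'B', 'C'])

# Extend the condition to all columns with an explicit fill value for C
cond_false = (df[['A', 'B']] < 3).reindex(columns=df.columns, fill_value=False)
cond_true = (df[['A', 'B']] < 3).reindex(columns=df.columns, fill_value=True)

print(df.where(cond_false))  # C replaced by NaN, like the non-inplace call
print(df.where(cond_true))   # C kept, like the inplace=True result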
Think of it simply.
df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] ='hi'
df.mask(df[['A', 'B']]<3)
The last code line asks for the full dataframe (df). The condition was applied to columns ['A', 'B'], so, since column 'C' was not part of the condition, NaN is returned for column C.
The following is the same as df.mask(df[['A', 'B']]<3):
>>> df[["A","B","C"]].mask(df[['A', 'B']]<3)
A B C
0 NaN NaN NaN
1 NaN 3.0 NaN
2 4.0 5.0 NaN
3 6.0 7.0 NaN
4 8.0 9.0 NaN
>>>
And df.mask(df[['A', 'B', 'C']]<3) will raise an error, because column 'C' is of string type:
TypeError: '<' not supported between instances of 'str' and 'int'
Finally, to return only columns "A" and "B"
>>> df[["A","B"]].mask(df[['A', 'B']]<3)
A B
0 NaN NaN
1 NaN 3.0
2 4.0 5.0
3 6.0 7.0
4 8.0 9.0
When you apply the command inplace, nothing happens to column C: the missing part of the condition (the NaN) is filled with the inplace value, which for mask amounts to 'do nothing' there.
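If you want the inplace and non-inplace calls to agree, one option is to pass a condition that already covers every column, so the fill value never comes into play (a sketch, assuming column C should be left alone):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
df['C'] = 'hi'

# Condition spans all columns; C is explicitly never masked
cond = (df[['A', 'B']] < 3).reindex(columns=df.columns, fill_value=False)
print(df.mask(cond))  # same result with or without inplace=True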

transform a big dataframe with many None values to smaller one with indication of non null columns

I have a big dataframe with 4 columns that often has 3 null values in every row. Sometimes there are 2, 1, or even 0 null values, but often 3.
I want to transform it into a two-column dataframe that has, in each row, the non-null value and the name of the column from which it was extracted.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
to this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd
# reading the csv
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
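If you also want the column names and the row order shown in the desired resultDF, melt can keep the original index so you can sort on it afterwards (a sketch; the ignore_index argument needs pandas >= 1.1):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan, np.nan],
                   'c': [np.nan, np.nan, 3.0, 1.0],
                   'd': [np.nan, np.nan, 2.0, np.nan]})

result = (df.melt(var_name='columnName', value_name='value', ignore_index=False)
            .dropna()
            .sort_index(kind='mergesort')   # stable sort restores the original row order
            .reset_index(drop=True)
            [['value', 'columnName']])
print(result)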

Interpolate NaN values over a DataFrame as a ring

I need to interpolate the NaN values of a DataFrame, but I want the interpolation to use the first values of the DataFrame when the NaN is the last value, i.e. treat the DataFrame as a ring. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
a b
0 1 1.0
1 2 2.0
2 3 NaN
But when I interpolate the nan values like:
df.interpolate(method="linear", inplace=True)
I got:
a b
0 1 1.0
1 2 2.0
2 3 2.0
The interpolation doesn't use the first value. My desired output would fill in the value 1.5, because of that circular interpolation.
One possible solution is to append the first row, interpolate, and remove the last row (pd.concat is used here because DataFrame.append was removed in pandas 2.0):
df = pd.concat([df, df.iloc[[0]]]).interpolate(method="linear").iloc[:-1]
print (df)
a b
0 1 1.0
1 2 2.0
2 3 1.5
EDIT:
More general solution:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
a b
0 1 1.333333
1 2 1.000000
2 3 2.000000
3 4 1.666667
Or, if you need to work only with the first and last non-missing values:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
a b
0 1 1.5
1 2 1.0
2 3 2.0
3 4 1.5
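For reuse, the tiling idea can be wrapped in a small helper (a minimal sketch; ring_interpolate is just an illustrative name, and it relies on purely positional linear interpolation):
import numpy as np
import pandas as pd

def ring_interpolate(df):
    # Tile the frame three times, interpolate positionally, keep the middle copy
    tiled = pd.concat([df] * 3).interpolate(method="linear")
    return tiled.iloc[len(df):-len(df)]

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, np.nan]})
print(ring_interpolate(df))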

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign a value based on the condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default inserts NaN where the condition is met:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) in the rows where the condition df.B > 5 holds.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
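As a side note, the rows turn to float (1.0, 4.0, ...) because NumPy's NaN forces an upcast; if you would rather keep integers, pandas' nullable Int64 dtype can hold missing values (a sketch, assuming the nullable dtype is acceptable in your pipeline):
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 8], 'B': [4, 5, 6, 7]}).astype('Int64')

# Blank out whole rows where B > 5 while keeping integer dtype
df.loc[df['B'] > 5, :] = pd.NA
print(df)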
