DataFrameGroupBy diff() on condition - python

Suppose I have a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a','b','c','b','b','a','b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
which looks like
CATEGORY VALUE
0 a NaN
1 b 1
2 c 0
3 b 0
4 b 5
5 a 0
6 b 4
I group it:
df = df.groupby(by='CATEGORY')
Now let me show what I want, using group 'b' as an example:
df.get_group('b')
group b:
CATEGORY VALUE
1 b 1
3 b 0
4 b 5
6 b 4
I need: within the scope of each group, compute diff() between the VALUE values, skipping all NaNs and 0s. So the result should be:
CATEGORY VALUE DIFF
1 b 1 -
3 b 0 -
4 b 5 4
6 b 4 -1

You can use diff to subtract consecutive values after dropping the 0 and NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CATEGORY': ['a','b','c','b','b','a','b'],
                   'VALUE': [np.nan, 1, 0, 0, 5, 0, 4]})
grouped = df.groupby("CATEGORY")

# per group: turn 0 into NaN, drop the NaNs, then diff the survivors
diff = lambda x: x["VALUE"].replace(0, np.nan).dropna().diff()

# apply yields a (CATEGORY, index) MultiIndex; drop the CATEGORY level
# so the result aligns back to df's original index
df["DIFF"] = grouped.apply(diff).reset_index(0, drop=True)
print(df)
CATEGORY VALUE DIFF
0 a NaN NaN
1 b 1.0 NaN
2 c 0.0 NaN
3 b 0.0 NaN
4 b 5.0 4.0
5 a 0.0 NaN
6 b 4.0 -1.0
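For what it's worth, a minimal sketch of the same idea without groupby.apply (assuming the same df as above): mask the unwanted values first, then let index alignment write the per-group diff back.
s = df['VALUE'].replace(0, np.nan).dropna()
# group the surviving values by their categories; diff stays inside each group
df['DIFF'] = s.groupby(df.loc[s.index, 'CATEGORY']).diff()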

Sounds like a job for a pd.Series.shift() operation along with a notnull mask.
First we remove the unwanted values before we group the data (the .copy() avoids a SettingWithCopyWarning when we add columns later):
nonull_df = df[(df['VALUE'] != 0) & df['VALUE'].notnull()].copy()
groups = nonull_df.groupby(by='CATEGORY')
Now we can shift within the groups and calculate the diff; shift(1) hands each row the previous value in its group:
nonull_df['prev_value'] = groups['VALUE'].shift(1)
nonull_df['diff'] = nonull_df['VALUE'] - nonull_df['prev_value']
Lastly, and optionally, you can join the new columns back to the original dataframe:
df = df.join(nonull_df[['prev_value', 'diff']])
df
CATEGORY VALUE prev_value diff
0 a NaN NaN NaN
1 b 1.0 NaN NaN
2 c 0.0 NaN NaN
3 b 0.0 NaN NaN
4 b 5.0 1.0 4.0
5 a 0.0 NaN NaN
6 b 4.0 5.0 -1.0
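As a side note, the shift-and-subtract pair above can be collapsed into a single GroupBy.diff call, since diff computes exactly VALUE minus the previous VALUE within each group:
nonull_df['diff'] = groups['VALUE'].diff()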

Duplicate positions from group

I have the following Dataset:
col value
0 A 1
1 A NaN
2 B NaN
3 B NaN
4 B NaN
5 B 1
6 C 3
7 C NaN
8 C NaN
9 D 5
10 E 6
There is only one value set per group; the rest is NaN. What I want to do now is fill the NaNs with the value of the group. If a group has no NaNs, I just want to ignore it.
Outcome should look like this:
col value
0 A 1
1 A 1
2 B 1
3 B 1
4 B 1
5 B 1
6 C 3
7 C 3
8 C 3
9 D 5
10 E 6
What I've tried so far is the following:
df["value"] = df.groupby(col).transform(lambda x: x.fillna(x.mean()))
However, this method is not only super slow, but doesn't give me the wished result.
Anybody an idea?
It depends on the data. If there is always exactly one non-missing value per group, you can sort and then fill with GroupBy.ffill; this also works well when some groups contain only NaNs:
df = df.sort_values(['col','value'])
df["value"] = df.groupby('col')["value"].ffill()
# if there is always exactly one non-missing value per group, a plain
# ffill would also work, but it fails if some group contains only NaNs
#df["value"] = df["value"].ffill()
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
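If the original row order matters, note that the sort is only needed for the fill; a small sketch to restore the order afterwards:
# put the rows back into their original positions
df = df.sort_index()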
Or, if there are multiple values per group and you need to replace NaNs with the mean, improve performance over your solution by computing the mean with GroupBy.transform and passing it to Series.fillna:
df["value"] = df["value"].fillna(df.groupby('col')["value"].transform('mean'))
print (df)
col value
0 A 1.0
1 A 1.0
5 B 1.0
2 B 1.0
3 B 1.0
4 B 1.0
6 C 3.0
7 C 3.0
8 C 3.0
9 D 5.0
10 E 6.0
You can use ffill, which is the same as fillna() with method='ffill' (see the docs). Be aware that without the groupby it can pull values across group boundaries, as the sketch below shows:
df["value"] = df["value"].ffill()

Pandas Groupby (shift) function to return null for first entry

For the following code, in Python, I'm trying to get the difference between the latest rating (by date) and the previous rating, which I have done in 'orinc'.
However, where there's no previous rating - i.e. the first entry for an 'H_Name' - it returns the current rating. Is there anything to add to this code so it would return 'null' or 'NaN' instead?
df2['orinc'] = df2['HIR_OfficialRating'] - df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
Sure - check whether the values in column H_Name are unique. If a value is unique, shift returns a missing value for it, because there is no previous row to shift from.
The first value of each group is NaN for the same reason:
df2 = pd.DataFrame({
'H_Name': ['a','a','a','a','e','b','b','c','d'],
'HIR_OfficialRating': list(range(9))})
df2['new'] = df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
print (df2)
H_Name HIR_OfficialRating new
0 a 0 NaN < first value of group a
1 a 1 0.0
2 a 2 1.0
3 a 3 2.0
4 e 4 NaN <-unique e
5 b 5 NaN < first value of group b
6 b 6 5.0
7 c 7 NaN <-unique c
8 d 8 NaN <-unique d
df2['orinc'] = df2['HIR_OfficialRating'] - df2.groupby('H_Name')['HIR_OfficialRating'].shift(1)
print (df2)
H_Name HIR_OfficialRating orinc
0 a 0 NaN
1 a 1 1.0
2 a 2 1.0
3 a 3 1.0
4 e 4 NaN
5 b 5 NaN
6 b 6 1.0
7 c 7 NaN
8 d 8 NaN
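As a side note, subtracting a shifted column is exactly what GroupBy.diff computes, so the same orinc can be written in one call:
df2['orinc'] = df2.groupby('H_Name')['HIR_OfficialRating'].diff()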

How to fill a column in pandas dataframe based on some conditions set upon two different columns?

Suppose I have a df like below:
A B C
null 0 null
null 4 null
5 6 null
0 0 0
Now, I want to fill column C based on a condition over columns A and B:
only when column A is null and column B is 0 should column C stay null; in all other cases copy column B to column C. This means that I want my df to look like this:
A B C
null 0 null
null 4 4
5 6 6
0 0 0
How can I achieve this in pandas?
Any help would be appreciated, as I am quite new to Python and pandas.
Use numpy.where with the conditions chained by & (bitwise AND):
import numpy as np
m1 = df.A.isna()
m2 = df.B.eq(0)
df['C'] = np.where(m1 & m2, np.nan, df.B)
print (df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
Use Series.fillna + Series.mask:
df['C'] = df['C'].fillna(df['B'].mask(df['B'].eq(0)))
print(df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
or using Series.where:
df['C'] = df['B'].mask(df['B'].eq(0)).where(df['C'].isnull(), df['C'])
print(df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
Using fillna and checking whether A + B > 0; where that holds, fill C with B using loc (this assumes the values are non-negative; see the caveat below):
mask = df['A'].fillna(0) + df['B'] > 0
df.loc[mask, 'C'] = df['B']
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
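One caveat, assuming the data can contain negative numbers: the sum test can then misfire. A hypothetical counterexample:
# A = -4 is present and B = 4, so C should become 4,
# but fillna(0) + B sums to 0 and the row is skipped
row = pd.DataFrame({'A': [-4.0], 'B': [4], 'C': [np.nan]})
mask = row['A'].fillna(0) + row['B'] > 0   # False, although A is not null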

Find the index of first occurrence in DataFrame

I have a dataframe which looks like this:
0 1 2 3 4 5 6
0 a(A) b c c d a a
1 b h w k d c(A) k
2 g e(A) s g h s f
3 f d s h(A) c w n
4 e g s b c e w
I want to get the index of the cell which contains (A) in each column.
0 0
1 2
2 NaN
3 3
4 NaN
5 1
6 NaN
I tried this code, but the result doesn't meet my expectations.
df.apply(lambda x: (x.str.contains(r'(A)')==True).idxmax(), axis=0)
Result looks like this:
0 0
1 2
2 0
3 3
4 0
5 1
6 0
I think it returns the first index if there is no (A) in that column.
How should I fix it?
Use Series.where to set missing values for columns with no match, overwriting the default 0 that DataFrame.idxmax returns:
mask = df.apply(lambda x: x.str.contains('A'))
s1 = mask.idxmax().where(mask.any())
print (s1)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
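A small aside on the pattern itself: in a regex, (A) is a capture group that matches a bare A, and str.contains warns about patterns with match groups. It happens to work on this data because A only ever appears inside (A), but matching the literal text is safer:
mask = df.apply(lambda x: x.str.contains(r'\(A\)'))            # escaped regex
# or skip regex entirely
mask = df.apply(lambda x: x.str.contains('(A)', regex=False))  # literal match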
You could do what you're doing but explicitly check whether the columns contain any matches:
In [51]: pred = df.applymap(lambda x: '(A)' in x)
In [52]: pred.idxmax() * np.where(pred.any(), 1, np.nan)
Out[52]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
Or alternatively, using DataFrame.where directly:
In [211]: pred.where(pred).idxmax()
Out[211]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64
A slightly cheatier one-liner is to use DataFrame.where on the identity:
In [78]: df.apply(lambda x: x.str.contains('A')).where(lambda x: x).idxmax()
Out[78]:
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
Add an if condition at the end of the apply:
>>> df.apply(lambda x: x.str.contains('A').idxmax() if 'A' in x[x.str.contains('A').idxmax()] else np.nan)
0 0.0
1 2.0
2 NaN
3 3.0
4 NaN
5 1.0
6 NaN
dtype: float64

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make the whole row NaN if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign the value by condition (note the integer columns become float, because NaN is a float):
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default inserts NaNs where the condition holds:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thanks to Bharath Shetty:
df = df.where(~(df['B'] > 5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In plain language, df.loc[df.B > 5, :] = np.nan translates to: assign np.nan to every column (:) of the dataframe (df) in the rows where the condition df.B > 5 holds.
Or using reindex, which drops the masked rows and then reinserts them as all-NaN rows:
df.loc[df.B <= 5, :].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
