Pandas - Set value based on idxmax of group, including NaN

I am trying to set the value of column "C" based on the idxmax() of column "B" within a groupby on column "A". To make it a bit more complicated, in the event of a NaN or 0 I would like it to return the min value, excluding the NaN or 0, if such a value exists. Here is an example dataframe:
Index  A  B    C
0      1  5    False
1      1  10   False
2      2  9    False
3      2  NaN  False
4      3  3    False
5      3  5    False
6      4  NaN  False
7      4  NaN  False
8      5  0    False
9      5  5    False
I am trying to set column "C" to True for the idxmax() of column B, split by a groupby on column "A":
Index  A  B    C
0      1  5    True
1      1  10   False
2      2  9    True
3      2  NaN  False
4      3  3    True
5      3  5    False
6      4  NaN  True
7      4  NaN  False
8      5  0    False
9      5  5    True
Thanks!

Let's use groupby with transform like this:
df['C_new'] = df.groupby('A')['B'].transform('idxmax') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False False
1 1 1 10.0 False True
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False False
5 5 3 5.0 False True
or, with idxmin:
df['C_new'] = df.groupby('A')['B'].transform('idxmin') == df.index
Output:
Index A B C C_new
0 0 1 5.0 False True
1 1 1 10.0 False False
2 2 2 9.0 False True
3 3 2 NaN False False
4 4 3 3.0 False True
5 5 3 5.0 False False
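Note that neither call alone reproduces the asker's expected output, which flags the per-group minimum of B while skipping NaN and 0, and falls back to the group's first row when no valid value exists. A minimal sketch of that rule (the helper name pick is my own; data copied from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'B': [5, 10, 9, np.nan, 3, 5, np.nan, np.nan, 0, 5],
})

def pick(group):
    # values that are neither NaN nor 0
    valid = group[group.notna() & group.ne(0)]
    # fall back to the group's first row when nothing is valid
    return valid.idxmin() if not valid.empty else group.index[0]

df['C'] = df.index.isin(df.groupby('A')['B'].apply(pick))
This flags rows 0, 2, 4, 6 and 9, matching the expected output above; swap idxmin for idxmax if the maximum is what you actually want.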

Related

Isolate sequence of positive numbers in a pandas dataframe

I would like to identify what I call "periods" of data stored in a pandas dataframe.
Let's say I have these values:
values
1 0
2 8
3 1
4 0
5 5
6 6
7 4
8 7
9 0
10 2
11 9
12 1
13 0
I would like to identify sequences of strictly positive numbers with a length greater than or equal to 3. Any non-strictly-positive number ends an ongoing sequence.
This would give :
values period
1 0 None
2 8 None
3 1 None
4 0 None
5 5 1
6 6 1
7 4 1
8 7 1
9 0 None
10 2 2
11 9 2
12 1 2
13 0 None
Using Boolean arithmetic:
N = 3
m1 = df['values'].le(0)   # non-positive values end a sequence
# each cumsum group keeps its leading non-positive row, so a run of at
# least 3 positives shows up as a group with more than N rows
m2 = df.groupby(m1.cumsum())['values'].transform('count').gt(N)
df['period'] = (m1 & m2).cumsum().where((~m1) & m2)
output:
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
intermediates:
values m1 m2 CS(m1) m1&m2 CS(m1&m2) (~m1)&m2 period
1 0 True False 1 False 0 False NaN
2 8 False False 1 False 0 False NaN
3 1 False False 1 False 0 False NaN
4 0 True True 2 True 1 False NaN
5 5 False True 2 False 1 True 1.0
6 6 False True 2 False 1 True 1.0
7 4 False True 2 False 1 True 1.0
8 7 False True 2 False 1 True 1.0
9 0 True True 3 True 2 False NaN
10 2 False True 3 False 2 True 2.0
11 9 False True 3 False 2 True 2.0
12 1 False True 3 False 2 True 2.0
13 0 True False 4 False 2 False NaN
You can try:
import numpy as np

sign = np.sign(df['values'])
m = sign.ne(sign.shift()).cumsum()       # label consecutive runs of the same sign
df['period'] = (df[sign.eq(1)]           # exclude non-positive numbers
                .groupby(m)['values']
                .filter(lambda col: len(col) >= 3)   # keep only runs of length >= 3
                .groupby(m)
                .ngroup() + 1)
print(df)
values period
1 0 NaN
2 8 NaN
3 1 NaN
4 0 NaN
5 5 1.0
6 6 1.0
7 4 1.0
8 7 1.0
9 0 NaN
10 2 2.0
11 9 2.0
12 1 2.0
13 0 NaN
A simple solution:
count = 0
n_groups = 0
seq_idx = [None] * len(df)
for i in range(len(df)):
    if df.iloc[i]['values'] > 0:
        count += 1
    else:
        if count >= 3:
            n_groups += 1
            seq_idx[i - count:i] = [n_groups] * count
        count = 0
if count >= 3:   # flush a run that reaches the end of the frame
    n_groups += 1
    seq_idx[len(df) - count:] = [n_groups] * count
df['period'] = seq_idx
Output:
values period
0 0 NaN
1 8 NaN
2 1 NaN
3 0 NaN
4 5 1.0
5 6 1.0
6 4 1.0
7 7 1.0
8 0 NaN
9 2 2.0
10 9 2.0
11 1 2.0
12 0 NaN
One simple approach using find_peaks to find the plateaus (runs of consecutive positive values) of at least size 3:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

df = pd.DataFrame.from_dict({'values': {0: 0, 1: 8, 2: 1, 3: 0, 4: 5, 5: 6, 6: 4, 7: 7, 8: 0, 9: 2, 10: 9, 11: 1, 12: 0}})
# find plateaus of True (value > 0) that are at least 3 samples wide
_, plateaus = find_peaks((df["values"] > 0).to_numpy(), plateau_size=3)
# mark each row that falls inside one of the detected plateaus
indices = np.arange(len(df["values"]))[:, None]
indices = (indices >= plateaus["left_edges"]) & (indices <= plateaus["right_edges"])
# number the plateaus 1..k; rows outside every plateau get 0
res = (indices * (np.arange(indices.shape[1]) + 1)).sum(axis=1)
df["periods"] = res
print(df)
Output
values periods
0 0 0
1 8 0
2 1 0
3 0 0
4 5 1
5 6 1
6 4 1
7 7 1
8 0 0
9 2 2
10 9 2
11 1 2
12 0 0
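For intuition, you can inspect the plateau edges that find_peaks reports on the boolean mask (a quick check on the same df). One caveat worth knowing: find_peaks cannot report a peak that touches either end of the array, so a qualifying run at the very start or end of the data would be missed.
mask = (df["values"] > 0).to_numpy().astype(int)
_, props = find_peaks(mask, plateau_size=3)
print(props["left_edges"], props["right_edges"])   # [4 9] [7 11] for the sample data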
def function1(dd: pd.DataFrame):
    dd.loc[:, 'period'] = None
    # a group of length >= 4 is a leading non-positive row plus at least 3 positives
    if len(dd) >= 4:
        dd.iloc[1:, 2] = dd.iloc[1:, 1]   # copy the group number (col1) into 'period'
    return dd

# df1 holds the 'values' column from the question
df1.assign(col1=df1.le(0).cumsum().sub(1)).groupby('col1').apply(function1)
out:
values col1 period
0 0 0 None
1 8 0 None
2 1 0 None
3 0 1 None
4 5 1 1
5 6 1 1
6 4 1 1
7 7 1 1
8 0 2 None
9 2 2 2
10 9 2 2
11 1 2 2
12 0 3 None

Group by sequence of True

I have the following df:
df = pd.DataFrame({"val_a":[True,True,False,False,False,True,False,False,True,True,True,True,False,True,True]})
val_a
0 True
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
10 True
11 True
12 False
13 True
14 True
and I wish to have the following result:
val_a tx
0 True 0
1 True 0
2 False None
3 False None
4 False None
5 True 1
6 False None
7 False None
8 True 2
9 True 2
10 True 2
11 True 2
12 False None
13 True 3
14 True 3
Explanation: each run of consecutive True values counts as one group, so indexes 0 and 1 share the same tx (0); the next True (index 5) starts a new group, so it is marked 1.
What have I tried: I know that cumsum and groupby must come into play here but I couldn't figure out how.
g = (df['val_a']==True).cumsum()
df['tx'] = df.groupby(g).ffill()
Identify the groups with cumsum, then filter the rows having True values and use factorize to assign an ordinal number to each unique group:
m = df['val_a']
df.loc[m, 'tx'] = (~m).cumsum()[m].factorize()[0]
Alternatively, you can also use groupby + ngroup:
m = df['val_a']
df['tx'] = m[m].groupby((~m).cumsum()).ngroup()
val_a tx
0 True 0.0
1 True 0.0
2 False NaN
3 False NaN
4 False NaN
5 True 1.0
6 False NaN
7 False NaN
8 True 2.0
9 True 2.0
10 True 2.0
11 True 2.0
12 False NaN
13 True 3.0
14 True 3.0
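If you prefer the None/integer look from the expected output over floats with NaN, one option (assuming pandas >= 1.0 for the nullable integer dtype) is:
df['tx'] = df['tx'].astype('Int64')   # NaN becomes <NA>; the numbers stay integers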

Is there a function in pandas like cumsum() but for the mean? I need to apply it based on a condition

I need to compute the cumulative mean only while ColumnA is different from zero. Each time it is zero, the cumulative mean should restart. Thanks so much in advance; I am not so good with Python.
Input:
ColumnA
0 5
1 6
2 7
3 0
4 0
5 1
6 2
7 3
8 0
9 5
10 10
11 15
Expected Output:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
You can try cumsum to make the groups and then expanding + mean to compute the cumulative mean:
groups=df.ColumnA.eq(0).cumsum()
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
Details:
Make groups when column is equal to 0 with eq and cumsum, since eq gives you a mask with True and False values, and with cumsum these values are taken as 1 or 0:
groups=df.ColumnA.eq(0).cumsum()
groups
0 0
1 0
2 0
3 1
4 2
5 2
6 2
7 2
8 3
9 3
10 3
11 3
Name: ColumnA, dtype: int32
Then group by those groups and use apply to take the cumulative mean over the elements different from 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean())
ColumnA
0 5.0
1 5.5
2 6.0
3 NaN
4 NaN
5 1.0
6 1.5
7 2.0
8 NaN
9 5.0
10 7.5
11 10.0
And finally use fillna to fill the NaN values with 0:
df.groupby(groups).apply(lambda x: x[x.ne(0)].expanding().mean()).fillna(0)
ColumnA
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0
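For reference, a variant that avoids the lambda by masking the zeros up front, so they never enter the mean (a sketch; it relies on expanding().mean() ignoring NaN, which is the pandas default):
s = df['ColumnA'].where(df['ColumnA'].ne(0))          # zeros become NaN and stay out of the mean
df['CumulativeMean'] = (s.groupby(df['ColumnA'].eq(0).cumsum())
                         .expanding().mean()
                         .droplevel(0)                # drop the group level added by groupby
                         .fillna(0))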
You can use boolean indexing to compare rows that are == 0 and != 0 against the previous rows with .shift(). Then just take the .cumsum() to separate the rows into groups, according to where the zeros are within ColumnA.
df['CumulativeMean'] = (df.groupby((((df.shift()['ColumnA'] != 0) & (df['ColumnA'] == 0)) |
                                    ((df.shift()['ColumnA'] == 0) & (df['ColumnA'] != 0)))
                                   .cumsum())['ColumnA']
                          .apply(lambda x: x.expanding().mean()))
Out[6]:
ColumnA CumulativeMean
0 5 5.0
1 6 5.5
2 7 6.0
3 0 0.0
4 0 0.0
5 1 1.0
6 2 1.5
7 3 2.0
8 0 0.0
9 5 5.0
10 10 7.5
11 15 10.0
Below, I've broken the boolean indexing inside the .groupby statement down into multiple columns that build up to the final abcd_cumsum column. From there, ['ColumnA'].apply(lambda x: x.expanding().mean()) takes the mean of the group up to any given row in that group. For example, the second row (index 1) takes the grouped mean of the first and second rows, but excludes the third row.
df['a'] = (df.shift()['ColumnA'] != 0)
df['b'] = (df['ColumnA'] == 0)
df['ab'] = (df['a'] & df['b'])
df['c'] = (df.shift()['ColumnA'] == 0)
df['d'] = (df['ColumnA'] != 0)
df['cd'] = (df['c'] & df['d'])
df['abcd'] = (df['ab'] | df['cd'])
df['abcd_cumsum'] = (df['ab'] | df['cd']).cumsum()
df['CumulativeMean'] = (df.groupby(df['abcd_cumsum'])['ColumnA'].apply(lambda x: x.expanding().mean()))
Out[7]:
ColumnA a b ab c d cd abcd abcd_cumsum \
0 5 True False False False True False False 0
1 6 True False False False True False False 0
2 7 True False False False True False False 0
3 0 True True True False False False True 1
4 0 False True False True False False False 1
5 1 False False False True True True True 2
6 2 True False False False True False False 2
7 3 True False False False True False False 2
8 0 True True True False False False True 3
9 5 False False False True True True True 4
10 10 True False False False True False False 4
11 15 True False False False True False False 4
CumulativeMean
0 5.0
1 5.5
2 6.0
3 0.0
4 0.0
5 1.0
6 1.5
7 2.0
8 0.0
9 5.0
10 7.5
11 10.0

Pandas cumulative sum based on True/False condition

I'm using Python and need to compute a cumulative sum of the Value column that resets each time the Bool column changes from True back to False. How can I solve this task?
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3 << reset from here
5 False 5 8
6 True 2 10
....
Thanks, all!
You can try this:
a = df.Bool.eq(True).cumsum().shift().fillna(0)   # shift so each True row stays in the group it closes
df['Expected_cumsum'] = df.groupby(a)['Value'].cumsum()
df
Output
Bool Value Expected_cumsum
0 False 1 1
1 False 2 3
2 False 4 7
3 True 1 8
4 False 3 3
5 False 5 8
6 True 2 10
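A quick self-contained check of the recipe with the question's data (a sketch):
import pandas as pd

df = pd.DataFrame({
    'Bool':  [False, False, False, True, False, False, True],
    'Value': [1, 2, 4, 1, 3, 5, 2],
})
a = df.Bool.eq(True).cumsum().shift().fillna(0)   # a new group starts on the row after each True
df['Expected_cumsum'] = df.groupby(a)['Value'].cumsum()
print(df)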

How to create a new column in pandas, containing index difference to the previous specific value?

Having the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(10).reshape(10, 1), columns=['A'])
df.loc[2, 'A'] = 0
df.loc[6, 'A'] = 0
A
0 1
1 1
2 0
3 1
4 1
5 1
6 0
7 1
8 1
9 1
I am trying to add a new column B that, at the last row of each run of "1"s in column A, contains the count of consecutive "1"-occurrences since the previous "0". Expected output should be like this:
A B
0 1 0
1 1 2
2 0 0
3 1 0
4 1 0
5 1 3
6 0 0
7 1 0
8 1 0
9 1 3
Any efficient vectorized way to do this?
You can use:
a = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
print (a)
0 1
1 2
2 1
3 1
4 2
5 3
6 1
7 1
8 2
9 3
dtype: int64
b = ((~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool)))
print (b)
0 False
1 True
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 True
Name: A, dtype: bool
df['B'] = ( a * b )
print (df)
A B
0 1.0 0
1 1.0 2
2 0.0 0
3 1.0 0
4 1.0 0
5 1.0 3
6 0.0 0
7 1.0 0
8 1.0 0
9 1.0 3
Explanation:
#difference with shifted A
df['C'] = df.A != df.A.shift()
#cumulative sum
df['D'] = (df.A != df.A.shift()).cumsum()
#cumulative count each group
df['a'] = df.A.groupby((df.A != df.A.shift()).cumsum()).cumcount() + 1
#invert and convert to boolean
df['F'] = ~df.A.astype(bool)
#shift
df['G'] = (~df.A.astype(bool)).shift(-1)
#fill last nan
df['b'] = (~df.A.astype(bool)).shift(-1).fillna(df.A.iat[-1].astype(bool))
print (df)
A B C D a F G b
0 1.0 0 True 1 1 False False False
1 1.0 2 False 1 2 False True True
2 0.0 0 True 2 1 True False False
3 1.0 0 True 3 1 False False False
4 1.0 0 False 3 2 False False False
5 1.0 3 False 3 3 False True True
6 0.0 0 True 4 1 True False False
7 1.0 0 True 5 1 False False False
8 1.0 0 False 5 2 False False False
9 1.0 3 False 5 3 False NaN True
The last NaN is problematic, so I check the last value of column A with df.A.iat[-1] and convert it to boolean: if it is 0, the fill value is False and the result is 0; if it is 1, the fill value is True and the last value of a is used.
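For what it's worth, on modern pandas the NaN workaround can be avoided with shift's fill_value argument (available since pandas 0.24). A sketch of the same computation:
nxt = df.A.shift(-1, fill_value=0)        # next row's value of A, treating the end of the frame as a 0
run_end = df.A.ne(0) & nxt.eq(0)          # last row of each run of 1s
run_len = df.A.groupby(df.A.ne(df.A.shift()).cumsum()).cumcount() + 1
df['B'] = run_len.where(run_end, 0)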
