Pandas: Conditional Rolling Block Count - python

I have a series that looks like the following:
Time Step
0 0
1 1
2 2
3 2
4 2
5 3
6 0
7 1
8 2
9 2
10 2
11 3
I want to use Pandas to perform a conditional rolling count of each block of time that contains step = 2 and output the count to a new column. I have found answers on how to do conditional rolling counts (Pandas: conditional rolling count), but I cannot figure out how to count the sequential runs of each step as a single block. The output should look like this:
Time Step Run_count
0 0
1 1
2 2 RUN1
3 2 RUN1
4 2 RUN1
5 3
6 0
7 1
8 2 RUN2
9 2 RUN2
10 2 RUN2
11 3

Let's try:
s = df.Step.where(df.Step.eq(2))
df['Run_count'] = s.dropna().groupby(s.isna().cumsum()).ngroup()+1
Output:
Time Step Run_count
0 0 0 NaN
1 1 1 NaN
2 2 2 1.0
3 3 2 1.0
4 4 2 1.0
5 5 3 NaN
6 6 0 NaN
7 7 1 NaN
8 8 2 2.0
9 9 2 2.0
10 10 2 2.0
11 11 3 NaN
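If you want the literal RUN1 / RUN2 labels from the expected output rather than 1.0 / 2.0, a small variation on the same idea works. This is only a sketch, with the column names taken from the example:
import pandas as pd

df = pd.DataFrame({'Time': range(12),
                   'Step': [0, 1, 2, 2, 2, 3, 0, 1, 2, 2, 2, 3]})

is_two = df['Step'].eq(2)
# a new block starts wherever Step == 2 and the previous row was not 2
block_id = (is_two & ~is_two.shift(fill_value=False)).cumsum()
df['Run_count'] = ('RUN' + block_id.astype(str)).where(is_two, '')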

Related

Pandas: conditional shift in blocks with reset

I am trying to shift data in a Pandas dataframe in the following manner from this:
time  value
1     1
2     2
3     3
4     4
5     5
1     6
2     7
3     8
4     9
5     10
To this:
time  value
1
2
3     1
4     2
5     3
1
2
3     6
4     7
5     8
In short, I want the values to move down within each cycle so that they start again from the third row (a shift of two rows) each time a new cycle of the time block begins.
I have not been able to find a solution for this; my English is limited, so it is hard to describe the problem without an example.
Edit:
Both solutions work. Thank you.
IIUC, you can shift per group:
df['value_shift'] = df.groupby(df['time'].eq(1).cumsum())['value'].shift(2)
output:
time value value_shift
0 1 1 NaN
1 2 2 NaN
2 3 3 1.0
3 4 4 2.0
4 5 5 3.0
5 1 6 NaN
6 2 7 NaN
7 3 8 6.0
8 4 9 7.0
9 5 10 8.0
Try with groupby:
df["value"] = df.groupby(df["time"].diff().lt(0).cumsum())["value"].shift(2)
>>> df
time value
0 1 NaN
1 2 NaN
2 3 1.0
3 4 2.0
4 5 3.0
5 1 NaN
6 2 NaN
7 3 6.0
8 4 7.0
9 5 8.0
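Both answers build the same per-cycle groups, just with different keys; reconstructing the example frame makes the two grouping keys easy to compare (a sketch):
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# a new group starts whenever time restarts at 1 ...
df['time'].eq(1).cumsum().tolist()         # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
# ... or, equivalently for this data, whenever time decreases
df['time'].diff().lt(0).cumsum().tolist()  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]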

How to create a Pandas boolean column indicating if a value will have an n-fold increase in x periods ahead?

I have a single column in a DataFrame containing only numbers, and I need to create a boolean column to indicate if the value will have an n-fold increase in x periods ahead.
I developed a solution using two for loops, but it doesn't seem pythonic enough for me.
Is there a better, more efficient way of doing it? Maybe something with map() or apply()?
Find below my code with an MRE.
import pandas as pd

df = pd.DataFrame([1, 2, 2, 1, 3, 2, 1, 3, 4, 1, 2, 3, 4, 4, 5, 1], columns=['column'])
df['double_in_5_periods_ahead'] = 'n/a'
periods_ahead = 5
for i in range(0, len(df) - periods_ahead):
    for j in range(1, periods_ahead):
        if df['column'].iloc[i + j] / df['column'].iloc[i] >= 2:
            df['double_in_5_periods_ahead'].iloc[i] = 1
            break
        else:
            df['double_in_5_periods_ahead'].iloc[i] = 0
This is the output:
column double_in_5_periods_ahead
0 1 1
1 2 0
2 2 0
3 1 1
4 3 0
5 2 1
6 1 1
7 3 0
8 4 0
9 1 1
10 2 1
11 3 n/a
12 4 n/a
13 4 n/a
14 5 n/a
15 1 n/a
Let us try rolling:
import numpy as np

n = 5
df['new'] = (df['column'].iloc[::-1].rolling(n).max() / df['column']).gt(2).astype(int)
df.iloc[-n:,1]=np.nan
df
Out[146]:
column new
0 1 1.0
1 2 0.0
2 2 0.0
3 1 1.0
4 3 0.0
5 2 0.0
6 1 1.0
7 3 0.0
8 4 0.0
9 1 1.0
10 2 1.0
11 3 NaN
12 4 NaN
13 4 NaN
14 5 NaN
15 1 NaN
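Note that the rolling answer includes the current row in its window and uses a strict > 2 test, so its row 5 comes out 0.0 where the loop in the question produces 1. A sketch that mirrors the loop more closely (ratio >= 2 over the next periods_ahead - 1 rows only):
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 2, 1, 3, 2, 1, 3, 4, 1, 2, 3, 4, 4, 5, 1], columns=['column'])
periods_ahead = 5

# max of the next periods_ahead - 1 values, excluding the current row
future_max = df['column'][::-1].rolling(periods_ahead - 1).max()[::-1].shift(-1)
df['double_in_5_periods_ahead'] = (future_max / df['column']).ge(2).astype(float)
df.iloc[-periods_ahead:, 1] = np.nan  # last rows have no full look-ahead window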

How to group by data using one column perform some operation on another column and assign new groups pandas

I have a dataframe as below :
distance_along_path ID
0 0 1
1 2.2 1
2 4.5 1
3 7.0 1
4 0 2
5 0 3
6 3.0 2
7 5.0 3
8 0 4
9 2.0 4
10 5.0 4
11 0 5
12 3.0 5
11 7.0 4
I want to be able to group these by ID first and then by the distance_along_path values: every time a 0 appears in distance_along_path for an ID, a new group is created, and until the next 0 all of those rows belong to that group, as indicated below.
distance_along_path ID group
0 0 1 1
1 2.2 1 1
2 4.5 1 1
3 7.0 1 1
4 0 1 2
5 0 2 3
6 3.0 1 2
7 5.0 2 3
8 0 2 4
9 2.0 2 4
10 5.0 2 4
11 0 1 5
12 3.0 1 5
13 7.0 1 5
14 0 1 6
15 0 2 7
16 3.0 1 6
17 5.0 2 7
18 1.0 2 7
Thank you
try the following:
# start a new block each time distance_along_path returns to 0 within an ID,
# then number the (ID, block) pairs in order of first appearance
block = df.groupby('ID')['distance_along_path'].transform(lambda s: s.eq(0).cumsum())
df['group'] = df.groupby(['ID', block], sort=False).ngroup() + 1
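Note that sort=False matters here: it makes ngroup hand out numbers in the order each (ID, block) pair first appears in the frame, which is how the running group numbers in the expected output appear to be assigned (that reading of the expected output is an assumption; with the default sort=True the groups would be numbered by key instead).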

Interrupt conditional cumulative sum pandas python

I have the following dataframe with attempted spendings (or transactions) from different users; every attempt has a date and an amount.
user date amount
1 1 6
1 2 5
1 3 2
1 4 3
1 5 1
2 1 11
2 2 12
2 3 5
2 4 8
2 5 1
Let's say I want to impose an arbitrary limit on the total amount spent and check which transactions go through (because the user has not surpassed the limit) and which ones do not; say the limit is 10.
The desired result would be:
user date amount approved spent remaining_credit
1 1 6 1 6 4
1 2 5 0 6 4
1 3 2 1 8 2
1 4 3 0 8 2
1 5 1 1 9 1
2 1 11 0 0 10
2 2 12 0 0 10
2 3 5 1 5 5
2 4 8 0 5 5
2 5 1 1 6 4
Any way to calculate any of the last 3 columns works to solve my problem.
The first one (approved, column 4) has a 1 each time the amount of the operation is less than the limit minus the amount spent previously.
The second one (spent) has the cumulative spending of the approved transactions.
The third one (remaining_credit) has the remaining credit after each attempted spending.
I tried with:
d1['spent'] = d1.sort_values('date').groupby('user')['amount'].cumsum()
d1['spent'] = d1.sort_values(['user','date']).spent.mask(d1.spent > limit).fillna(method='pad')
but then I don't know how to restart the cumulative sum when the limit isn't surpassed again.
This can be done by writing your own function that iterates through the data to build each column, then applying it per user with groupby.apply:
import numpy as np
import pandas as pd

def calcul_spendings(ser, val_max=1):
    arr_am = ser.to_numpy()
    arr_sp = np.cumsum(arr_am)
    arr_ap = np.zeros(len(ser))
    for i in range(len(arr_am)):
        if arr_sp[i] > val_max:  # check if the limit would be exceeded
            arr_sp[i:] -= arr_am[i]  # rejected: remove this amount from the running total
        else:
            arr_ap[i] = 1
    return pd.DataFrame({'approved': arr_ap,
                         'spent': arr_sp,
                         'remaining_credit': val_max - arr_sp},
                        index=ser.index)

df[['approved','spent','remaining_credit']] = df.sort_values('date').groupby('user')['amount'].apply(calcul_spendings, val_max=10)
print (df)
user date amount approved spent remaining_credit
0 1 1 6 1.0 6 4
1 1 2 5 0.0 6 4
2 1 3 2 1.0 8 2
3 1 4 3 0.0 8 2
4 1 5 1 1.0 9 1
5 2 1 11 0.0 0 10
6 2 2 12 0.0 0 10
7 2 3 5 1.0 5 5
8 2 4 8 0.0 5 5
9 2 5 1 1.0 6 4
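If the groupby.apply pattern feels opaque, the same bookkeeping can be written as a plain loop over the users; a sketch, with the limit of 10 taken from the example:
limit = 10
df = df.sort_values(['user', 'date'])
approved, spent, remaining = [], [], []
for _, g in df.groupby('user'):
    total = 0
    for amount in g['amount']:
        ok = total + amount <= limit  # the attempt goes through only if it fits within the limit
        if ok:
            total += amount
        approved.append(int(ok))
        spent.append(total)
        remaining.append(limit - total)
df['approved'], df['spent'], df['remaining_credit'] = approved, spent, remaining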

create new col based on transformation on some group based on condition

Is there a more performant way to do something like the following after group by?
For each group, I'd like to get the max value for which time is <= 3
import numpy as np
import pandas as pd
d = dict(group=[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], times=[0,1,2,3,4]*3, values=np.random.rand(15))
df = pd.DataFrame.from_dict(d)
# e.g.:
group times values
0 1 0 0.277623
1 1 1 0.227311
2 1 2 0.798941
3 1 3 0.861006
4 1 4 0.486385
5 2 0 0.543527
6 2 1 0.347159
7 2 2 0.138165
8 2 3 0.152132
9 2 4 0.402830
10 3 0 0.688038
11 3 1 0.450904
12 3 2 0.351267
13 3 3 0.195594
14 3 4 0.834823
The following seems to work, but is a little slow and not very concise:
for label, group in df.groupby(['group']):
    rows = group.index
    df.loc[rows, 'new_value'] = group.loc[group.times <= 3, 'values'].max()
I think you can use where before grouping. For better performance, use transform:
df['new_value'] = df['values'].where(df.times <= 3).groupby(df.group).transform('max')
df
group times values new_value
0 1 0 0.271137 0.751412
1 1 1 0.262456 0.751412
2 1 2 0.751412 0.751412
3 1 3 0.364099 0.751412
4 1 4 0.462447 0.751412
5 2 0 0.022403 0.792396
6 2 1 0.792396 0.792396
7 2 2 0.181434 0.792396
8 2 3 0.106931 0.792396
9 2 4 0.226425 0.792396
10 3 0 0.425845 0.535085
11 3 1 0.527567 0.535085
12 3 2 0.535085 0.535085
13 3 3 0.194340 0.535085
14 3 4 0.958947 0.535085
This is exactly what your current code returns.
where ensures we do not consider values for times > 3, because max ignores NaNs. The groupby is computed on this intermediate result.
df['values'].where(df.times <= 3)
0 0.271137
1 0.262456
2 0.751412
3 0.364099
4 NaN
5 0.022403
6 0.792396
7 0.181434
8 0.106931
9 NaN
10 0.425845
11 0.527567
12 0.535085
13 0.194340
14 NaN
Name: values, dtype: float64
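If you only need one max per group rather than a value broadcast to every row, the same masked series can be aggregated instead of transformed; a sketch:
df['values'].where(df.times <= 3).groupby(df.group).max()
This returns a single row per group, indexed by the group label.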
