Interrupt conditional cumulative sum pandas python

I have the following dataframe with attempted spendings (or transactions) from different users, every attempt has a date and an amount.
user date amount
1 1 6
1 2 5
1 3 2
1 4 3
1 5 1
2 1 11
2 2 12
2 3 5
2 4 8
2 5 1
Let's say I want to impose an arbitrary limit on the total amount spent and check which transactions go through (because the user isn't surpassing the limit) and which ones do not. Let's say the limit is 10.
The desired result would be:
user date amount approved spent remaining_credit
1 1 6 1 6 4
1 2 5 0 6 4
1 3 2 1 8 2
1 4 3 0 8 2
1 5 1 1 9 1
2 1 11 0 0 10
2 2 12 0 0 10
2 3 5 1 5 5
2 4 8 0 5 5
2 5 1 1 6 4
Calculating any one of the last 3 columns would solve my problem.
The first one (approved, column 4) has a 1 each time the amount of the operation is less than the limit minus the sum of the amounts spent previously.
The second one (spent) has the cumulative spending of the approved transactions.
The third one (remaining_credit) has the remaining credit after each attempted spending.
I tried with:
d1['spent'] = d1.sort_values('date').groupby('user')['amount'].cumsum()
d1['spent'] = d1.sort_values(['user','date']).spent.mask(d1.spent > limit).fillna(method='pad')
but then I don't know how to restart the cumulative sum when the limit isn't surpassed again.

This can be done by writing your own function that iterates through each user's transactions to build the three columns, then applying it per user with groupby.apply. A loop is needed because whether a transaction is approved depends on which earlier transactions were approved, so the running total has to be updated sequentially:
import numpy as np
import pandas as pd

def calcul_spendings(ser, val_max=1):
    arr_am = ser.to_numpy()
    arr_sp = np.cumsum(arr_am)
    arr_ap = np.zeros(len(ser))
    for i in range(len(arr_am)):
        if arr_sp[i] > val_max:      # check if the running total exceeds the limit
            arr_sp[i:] -= arr_am[i]  # reject the attempt and remove it from the running total
        else:
            arr_ap[i] = 1            # otherwise the attempt is approved
    return pd.DataFrame({'approved': arr_ap,
                         'spent': arr_sp,
                         'remaining_credit': val_max - arr_sp},
                        index=ser.index)

df[['approved','spent','remaining_credit']] = (df.sort_values('date')
                                                 .groupby('user')['amount']
                                                 .apply(calcul_spendings, val_max=10))
print(df)
user date amount approved spent remaining_credit
0 1 1 6 1.0 6 4
1 1 2 5 0.0 6 4
2 1 3 2 1.0 8 2
3 1 4 3 0.0 8 2
4 1 5 1 1.0 9 1
5 2 1 11 0.0 0 10
6 2 2 12 0.0 0 10
7 2 3 5 1.0 5 5
8 2 4 8 0.0 5 5
9 2 5 1 1.0 6 4
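For reference, a minimal sketch of how the sample frame can be constructed to reproduce the output above (the layout is taken from the question's table). Note that approved comes back as a float column because it is built from a NumPy array of zeros, so it can be cast back to int if preferred:

import pandas as pd

# sample data from the question
df = pd.DataFrame({'user':   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'date':   [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'amount': [6, 5, 2, 3, 1, 11, 12, 5, 8, 1]})

# after running the groupby.apply above, optionally:
# df['approved'] = df['approved'].astype(int)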

Related

how to select group of rows from a dataframe if all rows follow a sequence

I'm currently working on a dataframe that contains processes (identified by ID) which may or may not reach the end of the process. The end of the process is defined as the activity with index = 6. What I need to do is filter the processes (IDs) that are completed, meaning all 6 activities were done (so the process contains activities with index equal to 1, 2, 3, 4, 5 and 6, in this specific order).
the dataframe is structured as follows:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
2 activity7 1
2 activity8 2
2 activity9 3
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
And the resulting dataframe should be:
ID A index
1 activity1 1
1 activity2 2
1 activity3 3
1 activity4 4
1 activity5 5
1 activity6 6
3 activity10 1
3 activity11 2
3 activity12 3
3 activity13 4
3 activity14 5
3 activity15 6
I've tried working with sum(): creating a new column 'a' and checking with gt() whether the sum of every group is greater than 20 (i.e. keeping groups whose sum is at least 21, which is 1+2+3+4+5+6).
df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)]
Probably this isn't the best approach, so other approaches are more than welcome.
Any idea on how to select rows based on this condition?
Another possible solution:
out = (df.groupby('ID')
         .filter(lambda g: (len(g['index']) == 6) and
                           (g['index'].eq([*range(1, 7)]).all())))
print(out)
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
This may not be the fastest method, especially on a large dataframe, but it does the job:
df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]
Or this other variation:
df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]
Output:
ID A index
0 1 activity1 1
1 1 activity2 2
2 1 activity3 3
3 1 activity4 4
4 1 activity5 5
5 1 activity6 6
9 3 activity10 1
10 3 activity11 2
11 3 activity12 3
12 3 activity13 4
13 3 activity14 5
14 3 activity15 6
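A possibly faster, fully vectorized variation (my own sketch, not from the answers above; it assumes the rows of each ID are already ordered and that the required sequence is exactly 1..6):

# keep groups that have exactly 6 rows and whose 'index' matches its
# positional rank 1..6 within the group
sizes = df.groupby('ID')['index'].transform('size')
rank = df.groupby('ID').cumcount() + 1
ok = sizes.eq(6) & df['index'].eq(rank)
out = df[ok.groupby(df['ID']).transform('all')]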

counting consecutive duplicate elements in a dataframe and storing them in a new column

I am trying to count the consecutive elements in a data frame and store them in a new column. I don't want to count the total number of times an element appears overall in the list, but how many times it appeared consecutively. I used this:
a=[1,1,3,3,3,5,6,3,3,0,0,0,2,2,2,0]
df = pd.DataFrame(list(zip(a)), columns =['Patch'])
df['count'] = df.groupby('Patch').Patch.transform('size')
print(df)
This gave me a result like this:
Patch count
0 1 2
1 1 2
2 3 5
3 3 5
4 3 5
5 5 1
6 6 1
7 3 5
8 3 5
9 0 4
10 0 4
11 0 4
12 2 3
13 2 3
14 2 3
15 0 4
However, I want the result to be like this:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
df = (
    df.groupby((df.Patch != df.Patch.shift(1)).cumsum())
      .agg({"Patch": ("first", "count")})
      .reset_index(drop=True)
      .droplevel(level=0, axis=1)
      .rename(columns={"first": "Patch"})
)
print(df)
Prints:
Patch count
0 1 2
1 3 3
2 5 1
3 6 1
4 3 2
5 0 3
6 2 3
7 0 1
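The grouper (df.Patch != df.Patch.shift(1)).cumsum() is what makes this work: it increments every time the value changes, so each run of consecutive duplicates gets its own group id. A minimal illustration with the list from the question:

import pandas as pd

a = [1, 1, 3, 3, 3, 5, 6, 3, 3, 0, 0, 0, 2, 2, 2, 0]
s = pd.Series(a)
run_id = (s != s.shift(1)).cumsum()
# run_id: 1 1 2 2 2 3 4 5 5 6 6 6 7 7 7 8  -> one id per consecutive block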

Pandas: Conditional Rolling Block Count

I have a series that looks like the following:
Time Step
0 0
1 1
2 2
3 2
4 2
5 3
6 0
7 1
8 2
9 2
10 2
11 3
I want to use Pandas to perform a conditional rolling count of each block of time that contains step = 2 and output the count to a new column. I have found answers on how to do conditional rolling counts (Pandas: conditional rolling count), but I cannot figure out how to count the sequential runs of each step as a single block. The output should look like this:
Time Step Run_count
0 0
1 1
2 2 RUN1
3 2 RUN1
4 2 RUN1
5 3
6 0
7 1
8 2 RUN2
9 2 RUN2
10 2 RUN2
11 3
Let's try:
s = df.Step.where(df.Step.eq(2))                                       # keep Step only where it equals 2, NaN elsewhere
df['Run_count'] = s.dropna().groupby(s.isna().cumsum()).ngroup() + 1   # number each consecutive block of 2s
Output:
Time Step Run_count
0 0 0 NaN
1 1 1 NaN
2 2 2 1.0
3 3 2 1.0
4 4 2 1.0
5 5 3 NaN
6 6 0 NaN
7 7 1 NaN
8 8 2 2.0
9 9 2 2.0
10 10 2 2.0
11 11 3 NaN
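If the literal RUN1/RUN2 labels from the desired output are preferred, one possible follow-up (a sketch, not part of the original answer) is to format the non-null counts:

# turn 1.0 / 2.0 into 'RUN1' / 'RUN2', leaving the NaN gaps untouched
df['Run_count'] = df['Run_count'].map(lambda v: f'RUN{int(v)}', na_action='ignore')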

how to get the average of values for one column based on another column value in python (pandas, jupyter)

The image shows the test dataset I am using to verify that the right averages are being calculated.
I want to be able to get the average of the corresponding values in the 'G' column based on the filtered values in the 'T' column.
So I pick the values of the 'T' column for which I want to sum the values in the 'G' column, then divide the total by the count to get an average, which is appended to a list.
However, the average is not calculated correctly; see the screenshot below.
total = 0
g_avg = []
output = []
counter = 0
for i, row in df_new.iterrows():
    if row['T'] > 2:
        counter += 1
        total += row['G']
    if counter != 0 and row['T'] == 10:
        g_avg.append(total / counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data, as there is repetition in the 'T' values, so I would need a counter in order to get my average of the G values when the T value is in a certain range, i.e. from 2am to 10am etc.
Sorry, it won't allow me to just paste the dataset, so I've taken a snip of it.
If you want the average of column G values when T is between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
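Since the same mask is used twice, it can also be computed once (a small variation with the same behaviour):

mask = (df['T'] > 2) & (df['T'] < 7)
df.loc[mask, 'G'] = df.loc[mask, 'G'].mean()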
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
m = df['T'].between(0, 5, inclusive='neither')
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g).mean()['G'].tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0, 5, inclusive='neither')
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1
1 2 2 2
2 3 3 2
3 3 1 2
4 3 2 2
5 10 4 4
6 2 5 5
7 2 5 5
8 2 5 5
9 10 5 5

Pandas assign the groupby sum value to the last row in the original table

For example, I have a table
A
id price sum
1 2 0
1 6 0
1 4 0
2 2 0
2 10 0
2 1 0
2 5 0
3 1 0
3 5 0
What I want is this (the last row of 'sum' for each group should hold the sum of that group's 'price'):
id price sum
1 2 0
1 6 0
1 4 12
2 2 0
2 10 0
2 1 0
2 5 18
3 1 0
3 5 6
What I can do is find out the sum using
A['price'].groupby(A['id']).transform('sum')
However I don't know how to assign this to the sum column (last row).
Thanks
Use last_valid_index to locate rows to fill
g = df.groupby('id')
l = pd.DataFrame.last_valid_index                 # unbound method: returns a frame's last index label
df.loc[g.apply(l), 'sum'] = g.price.sum().values  # write each group's total on its last row
df
id price sum
0 1 2 0
1 1 6 0
2 1 4 12
3 2 2 0
4 2 10 0
5 2 1 0
6 2 5 18
7 3 1 0
8 3 5 6
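A note on the .values in the assignment above (my reading, not stated in the answer): g.price.sum() is indexed by id (1, 2, 3), while the rows being written are labelled 2, 6 and 8 in the original index, so assigning the Series directly would mis-align; .values drops the index so the three totals are written positionally.

print(g.price.sum())
# id
# 1    12
# 2    18
# 3     6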
You could do this:
df.assign(sum=df.groupby('id')['price'].transform('sum').drop_duplicates(keep='last')).fillna(0)
OR
df['sum'] = (df.groupby('id')['price']
               .transform('sum')
               .mask(df.id.duplicated(keep='last'), 0))
Output:
id price sum
0 1 2 0.0
1 1 6 0.0
2 1 4 12.0
3 2 2 0.0
4 2 10 0.0
5 2 1 0.0
6 2 5 18.0
7 3 1 0.0
8 3 5 6.0
