Conditional cumsum in pandas [duplicate]

I have the following DataFrame in pandas:
code rank quant sales
123 1 0 2
123 1 12 2
123 1 0 2
123 2 0 1
123 2 10 1
I want to do a conditional cumulative sum of sales, grouped by rank: where quant is not zero, add it to the cumulative sum on the same row.
code rank quant sales cumsum
123 1 0 2 2
123 1 12 2 16
123 1 0 2 18
123 2 0 1 1
123 2 10 1 12
How can I do this in pandas?

Add the two columns first and then use GroupBy.cumsum, grouping by the df['rank'] Series:
df['cumsum'] = df['quant'].add(df['sales']).groupby(df['rank']).cumsum()
Or sum both columns:
df['cumsum'] = df[['quant', 'sales']].sum(axis=1).groupby(df['rank']).cumsum()
An alternative is to create the new column before the groupby:
df['cumsum'] = (df.assign(cumsum=df['quant'].add(df['sales']))
                  .groupby('rank')['cumsum'].cumsum())
print (df)
code rank quant sales cumsum
0 123 1 0 2 2
1 123 1 12 2 16
2 123 1 0 2 18
3 123 2 0 1 1
4 123 2 10 1 12
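
For a self-contained run, the sample frame and the first approach can be reproduced like this (a minimal sketch; the values come from the tables above):
import pandas as pd

df = pd.DataFrame({'code':  [123, 123, 123, 123, 123],
                   'rank':  [1, 1, 1, 2, 2],
                   'quant': [0, 12, 0, 0, 10],
                   'sales': [2, 2, 2, 1, 1]})

# each row contributes quant + sales; the cumulative sum restarts per rank
df['cumsum'] = df['quant'].add(df['sales']).groupby(df['rank']).cumsum()
print(df['cumsum'].tolist())  # [2, 16, 18, 1, 12]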

Related

Pandas groupby with each group treated as a unique group

How do I get the cumsum of a pandas groupby when my data consists of 0s and 1s? I want each run of consecutive 0s or 1s treated as its own group, with the count resetting whenever the value changes.
I currently have this, which sums up all the 1s and 0s:
df['grp'] = df.groupby("dir")["dir"].cumsum()
My desired output:
df = pd.DataFrame({"dir":[1,1,1,1,0,0,0,1,1,1,1,0,0,0],
"grp": [1,2,3,4,1,2,3,1,2,3,4,1,2,3,]})
Use:
In [1495]: df['grp'] = df.groupby((df['dir'] != df['dir'].shift(1)).cumsum()).cumcount()+1
In [1496]: df
Out[1496]:
dir grp
0 1 1
1 1 2
2 1 3
3 1 4
4 0 1
5 0 2
6 0 3
7 1 1
8 1 2
9 1 3
10 1 4
11 0 1
12 0 2
13 0 3
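
The key idea: (df['dir'] != df['dir'].shift(1)).cumsum() builds a run identifier that increments every time the value changes, and cumcount() + 1 then numbers the rows within each run starting at 1. A sketch of the intermediate step (the change/run_id names are just for illustration):
import pandas as pd

df = pd.DataFrame({'dir': [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0]})

# True wherever the value differs from the previous row
# (the first row compares against NaN, so it is always True)
change = df['dir'] != df['dir'].shift(1)

# cumulative sum of the flags gives one id per consecutive run:
# 1,1,1,1,2,2,2,3,3,3,3,4,4,4
run_id = change.cumsum()

# number the rows within each run, starting at 1
df['grp'] = df.groupby(run_id).cumcount() + 1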

pandas create category column based on sequence repetition in another column

This is very likely a duplicate, but I'm not sure what to search for to find it.
I have a column in a dataframe that cycles from 0 to some value a number of times (in my example it cycles to 4 three times). I want to create another column that simply shows which cycle it is. Example:
import pandas as pd
df = pd.DataFrame({'A':[0,1,2,3,4,0,1,2,3,4,0,1,2,3,4]})
df['desired_output'] = [0,0,0,0,0,1,1,1,1,1,2,2,2,2,2]
print(df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
I was thinking maybe something along the lines of a groupby(), cumsum() and transform(), but I'm not quite sure how to implement it. Could be wrong though.
Compare with 0 using Series.eq, then take the Series.cumsum, and finally subtract 1:
df['desired_output'] = df['A'].eq(0).cumsum() - 1
print (df)
A desired_output
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 0 1
6 1 1
7 2 1
8 3 1
9 4 1
10 0 2
11 1 2
12 2 2
13 3 2
14 4 2
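
Note this assumes every new cycle is marked by a literal 0 in A. If cycles can instead restart at an arbitrary smaller value, one hedged alternative (not from the original answer) is to start a new cycle whenever A decreases, reusing the df built in the question:
# a new cycle starts wherever A drops below the previous value;
# diff() is NaN on the first row and NaN < 0 is False, so the first cycle gets id 0
df['desired_output'] = df['A'].diff().lt(0).cumsum()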

Groupby, Shift and Sum

I have the following dataframe:
product Week_Number Sales
1 1 10
2 1 15
1 2 20
I would like to group by product and week number and create a column with the sales of the next week for that product:
product Week_Number Sales next_week
1 1 10 20
2 1 15 0
1 2 20 0
Use DataFrame.sort_values with DataFrameGroupBy.shift:
# if the data is not already sorted by both columns
df = df.sort_values(['product','Week_Number'])
# pandas 0.24+
df['next_week'] = df.groupby('product')['Sales'].shift(-1, fill_value=0)
# pandas below 0.24
#df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print (df)
product Week_Number Sales next_week
0 1 1 10 20
1 2 1 15 0
2 1 2 20 0
If duplicates are possible and you need to aggregate by sum first in your real data:
df = df.groupby(['product','Week_Number'], as_index=False)['Sales'].sum()
df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print (df)
product Week_Number Sales next_week
0 1 1 10 20
1 1 2 20 0
2 2 1 15 0
First sort the data, then apply shift using transform:
df = pd.DataFrame(data={'product': [1, 2, 1],
                        'week_number': [1, 1, 2],
                        'sales': [10, 15, 20]})
df.sort_values(['product','week_number'],inplace=True)
df['next_week'] = df.groupby(['product'])['sales'].transform(pd.Series.shift,-1,fill_value=0)
print(df)
product week_number sales next_week
0 1 1 10 20
2 1 2 20 0
1 2 1 15 0
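
Both answers rely on row order: after sorting, "next week" is simply the next row within each product. If week numbers may contain gaps, an explicit merge-based lookup keyed on Week_Number + 1 is a possible alternative (a sketch under that assumption, not part of the original answers):
import pandas as pd

df = pd.DataFrame({'product': [1, 2, 1],
                   'Week_Number': [1, 1, 2],
                   'Sales': [10, 15, 20]})

# look up each row's (product, Week_Number + 1) sales in the same frame:
# decrementing the key makes week w's sales align with week w - 1
nxt = df.rename(columns={'Sales': 'next_week'})
nxt['Week_Number'] -= 1
out = df.merge(nxt, on=['product', 'Week_Number'], how='left')
out['next_week'] = out['next_week'].fillna(0).astype(int)
print(out)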

Deleting rows in pandas until a specific value first occurs

For each unique user in the DataFrame, I would like to delete the row where val equals 1 for the first time, together with all of that user's previous rows.
For instance, given the following DataFrame, I would like to get another DataFrame that, for each user, drops the row where 1 first occurs in the "val" column and all rows before it.
user val
0 1 0
1 1 1
2 1 0
3 1 1
4 2 0
5 2 0
6 2 1
7 2 0
8 3 1
9 3 0
10 3 0
11 3 0
12 3 1
Desired output:
user val
0 1 0
1 1 1
2 2 0
3 3 0
4 3 0
5 3 0
6 3 1
Sample Data
import pandas as pd
s = [1,1,1,1,2,2,2,2,3,3,3,3,3]
t = [0,1,0,1,0,0,1,0,1,0,0,0,1]
df = pd.DataFrame(zip(s,t), columns=['user', 'val'])
Group by user, then use cummax and shift to build a mask removing all rows up to and including the first 1 in the 'val' column per user.
Assuming your values are only 1 or 0, it is also possible to create the mask with a double cumsum.
m = df.groupby('user').val.apply(lambda x: x.eq(1).cummax().shift().fillna(False))
# m = df.groupby('user').val.apply(lambda x: x.cumsum().cumsum().gt(1))
df.loc[m]
Output:
user val
2 1 0
3 1 1
7 2 0
9 3 0
10 3 0
11 3 0
12 3 1
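
To see why the mask works, trace the intermediates for a single user: eq(1) flags the 1s, cummax() turns everything from the first 1 onward into True, and shift() pushes that flag down one row so the first 1 itself is still dropped. A sketch using the sample data above (the flag/seen/keep names are illustrative):
s = df.loc[df['user'].eq(1), 'val']   # user 1: [0, 1, 0, 1]

flag = s.eq(1)                        # [False, True, False, True]
seen = flag.cummax()                  # [False, True, True, True]   True from the first 1 onward
keep = seen.shift().fillna(False)     # [False, False, True, True]  drops up to and incl. the first 1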

How to add incremental numbers to a DataFrame using pandas

I have this original DataFrame:
ID T value
1 0 1
1 4 3
2 0 0
2 4 1
2 7 3
For the rows that get filled in, the value should stay the same as the previous row.
The output should be like:
ID T value
1 0 1
1 1 1
1 2 1
1 3 1
1 4 3
2 0 0
2 1 0
2 2 0
2 3 0
2 4 1
2 5 1
2 6 1
2 7 3
... ... ...
I tried a loop, but it takes a long time to process.
Any idea how to solve this for a large DataFrame?
Thanks!
This solution requires unique integer values in T within each group.
Use groupby with a custom function: for each group, reindex to the full range of T and then replace NaNs in the value column by forward filling with ffill:
import numpy as np

df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T')
                           .reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print (df1)
ID T value
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
4 1 4 3
5 2 0 0
6 2 1 0
7 2 2 0
8 2 3 0
9 2 4 1
10 2 5 1
11 2 6 1
12 2 7 3
If you get the error:
ValueError: cannot reindex from a duplicate axis
it means there are duplicated T values within a group, like:
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 1 <- 4 is duplicated within group 2
4 2 4 3 <- 4 is duplicated within group 2
5 2 7 3
The solution is to aggregate the values first so that T is unique per group, e.g. by sum:
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()
print (df)
ID T value
0 1 0 1
1 1 4 3
2 2 0 0
3 2 4 4
4 2 7 3
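
Putting the two steps together for data with duplicate T values, a minimal end-to-end sketch (same approach as above, with the numpy import included):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':    [1, 1, 2, 2, 2, 2],
                   'T':     [0, 4, 0, 4, 4, 7],
                   'value': [1, 3, 0, 1, 3, 3]})

# 1) aggregate so T is unique within each ID
df = df.groupby(['ID', 'T'], as_index=False)['value'].sum()

# 2) reindex each group to the full integer range of T, then forward-fill
df1 = (df.groupby('ID')[['T', 'value']]
         .apply(lambda x: x.set_index('T')
                           .reindex(np.arange(x['T'].min(), x['T'].max() + 1)))
         .ffill()
         .astype(int)
         .reset_index())
print(df1)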
