Cumulative sum of a dataframe column with restart - python

I would like to perform the following operation on a dataframe:
calculate the cumulative sum of a column, with two caveats:
It looks at previous rows only, not including the current one, so the very first value is zero, as there is no previous data to look at.
When the cumulation breaks, i.e. the previous value is zero, the count restarts.
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
I know there is an expanding function, but it doesn't restart when it sees a zero.

IIUC, this works by making groups according to whether the previous row was 0, then getting the cumulative count:
>>> df
Number
0 1
1 1
2 1
3 0
4 0
5 1
6 1
7 0
df['Cumulative'] = df.groupby(df.Number.shift().eq(0).cumsum()).cumcount()
>>> df
Number Cumulative
0 1 0
1 1 1
2 1 2
3 0 3
4 0 0
5 1 0
6 1 1
7 0 2
Alternatively, if it really is a cumsum you want, apply cumsum with a similar grouping (this time built from the unshifted column), then shift the result down by 1:
df['Cumulative'] = df.Number.groupby(df.Number.eq(0).cumsum()).cumsum().shift().fillna(0)
>>> df
Number Cumulative
0 1 0.0
1 1 1.0
2 1 2.0
3 0 3.0
4 0 0.0
5 1 0.0
6 1 1.0
7 0 2.0
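Putting the first approach together as a self-contained, runnable sketch (using the example data above):

```python
import pandas as pd

df = pd.DataFrame({'Number': [1, 1, 1, 0, 0, 1, 1, 0]})

# A new group starts whenever the *previous* row is 0;
# cumcount() then numbers the rows within each group starting from 0.
df['Cumulative'] = df.groupby(df.Number.shift().eq(0).cumsum()).cumcount()
print(df['Cumulative'].tolist())  # [0, 1, 2, 3, 0, 0, 1, 2]
```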

Grouping a set of numbers that reoccur in a pandas DataFrame

Say I have the following dataframe
holder
0
1
2
0
1
2
0
1
0
1
2
I want to group each run of numbers that starts at 0 and ends at that run's maximum value, and assign each run its own group number.
So
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 4
1 4
2 4
I tried:
n=3
df['group'] = [int(i/n) for i,x in enumerate(df.holder)]
But this returns
holder group
0 1
1 1
2 1
0 2
1 2
2 2
0 3
1 3
0 3
1 4
2 4
Assuming each group's numbers always increase, you can check whether each number is less than or equal to the one before it, then take the cumulative sum, which turns the booleans into group numbers.
df['group'] = df['holder'].diff().le(0).cumsum() + 1
Result:
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4
(I'm using <= specifically instead of < in case of two adjacent 0s.)
This was inspired by Nickil Maveli's answer on "Groupby conditional sum of adjacent rows" but the cleaner method was posted by d.b in a comment here.
Assuming holder is monotonically nondecreasing until another 0 occurs, you can identify the zeroes and create groups by taking the cumulative sum.
df = pd.DataFrame({'holder': [0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2]})
# identify 0s and create groups
df['group'] = df['holder'].eq(0).cumsum()
print(df)
holder group
0 0 1
1 1 1
2 2 1
3 0 2
4 1 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
10 2 4
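For completeness, the diff-based variant as a runnable sketch (with the sample data above):

```python
import pandas as pd

df = pd.DataFrame({'holder': [0, 1, 2, 0, 1, 2, 0, 1, 0, 1, 2]})

# A number <= its predecessor marks the start of a new run;
# cumsum turns those booleans into group labels (+1 so groups start at 1).
df['group'] = df['holder'].diff().le(0).cumsum() + 1
print(df['group'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
```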

Calculate cumulative sum from last non-zero values for each column of a dataframe in python

Say I have a dataframe below. For each column, I have many zeros with some non-zero values. I would like to calculate cumulative sum for each column, but I want the cumsum to be reset when a zero value occurs.
My original dataframe:
pd.DataFrame({'a':[1,0,1,0,1,0,1,1],'b':[1,0,0,0,0,1,1,1]})
a b
0 1 1
1 0 0
2 1 0
3 0 0
4 1 0
5 0 1
6 1 1
7 1 1
I would like to have a cumulative sum like this:
a b
0 1 1
1 0 0
2 1 0
3 0 0
4 1 0
5 0 1
6 1 2
7 2 3
Is it possible to do this without a loop in Python? Thank you!
One way would be to create a custom grouper for each column by checking for element-wise equality with 0 and taking the cumsum of the resulting boolean series, then transforming each column with cumsum within those groups:
g = df.eq(0).cumsum()
df.apply(lambda x: x.groupby(g[x.name]).transform('cumsum'))
a b
0 1 1
1 0 0
2 1 0
3 0 0
4 1 0
5 0 1
6 1 2
7 2 3
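If you would rather avoid apply entirely, a fully vectorised alternative (a sketch, not from the answer above) is to subtract a forward-filled snapshot of the running total taken at each zero:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 1, 1],
                   'b': [1, 0, 0, 0, 0, 1, 1, 1]})

c = df.cumsum()
# Snapshot the running total wherever a 0 occurs, carry it forward,
# and subtract it: the cumsum effectively restarts after every zero.
out = c - c.where(df.eq(0)).ffill().fillna(0)
print(out)
```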

Calculate difference of adjacent rows (decimal numbers) in a data frame for each group defined in a different column

I have a data frame with three columns of interest, 'time', 'peak' and 'cycle'. I want to calculate the time elapsed between each row for a given cycle.
time peak cycle
0 1 1 1
1 2 0 1
2 3.5 0 1
3 3.8 1 2
4 5 0 2
5 6.2 0 2
6 7 0 2
I want to add a fourth column, so the data frame would look like this when complete:
time peak cycle time_elapsed
0 1 1 1 0
1 2 0 1 1
2 3.5 0 1 1.5
3 3.8 1 2 0
4 5 0 2 1.2
5 6.2 0 2 1.2
6 7 0 2 0.8
The cycle number is calculated based on the peak information, so I don't think I need to refer to both columns.
data['time_elapsed'] = data['time'] - data['time'].shift()
Applying the above code I get:
time peak cycle time_elapsed
0 1 1 1 0
1 2 0 1 1
2 3.5 0 1 1.5
3 3.8 1 2 0.3
4 5 0 2 1.2
5 6.2 0 2 1.2
6 7 0 2 0.8
Is there a way to "reset" the calculation every time the value in 'peak' is 1? Any tips or advice would be appreciated!
Subtract the first value of each group, broadcast back to the original rows with GroupBy.transform('first'):
df['time_elapsed'] = df['time'].sub(df.groupby('cycle')['time'].transform('first'))
print (df)
time peak cycle time_elapsed
0 1 1 1 0
1 2 0 1 1
2 3 0 1 2
3 4 1 2 0
4 5 0 2 1
5 6 0 2 2
6 7 0 2 3
To make the groups reset, add a new Series built with Series.cumsum - this works if the peak column contains only 1s and 0s:
s = df['peak'].cumsum()
df['time_elapsed'] = df['time'].sub(df.groupby(['cycle', s])['time'].transform('first'))
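Note that the desired output in the question is the difference between adjacent rows, restarting per cycle, which GroupBy.diff gives directly - a sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3.5, 3.8, 5, 6.2, 7],
                   'peak': [1, 0, 0, 1, 0, 0, 0],
                   'cycle': [1, 1, 1, 2, 2, 2, 2]})

# Row-to-row difference within each cycle; the first row of a cycle gets 0.
df['time_elapsed'] = df.groupby('cycle')['time'].diff().fillna(0)
print(df['time_elapsed'].round(2).tolist())  # [0.0, 1.0, 1.5, 0.0, 1.2, 1.2, 0.8]
```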

Identify first non-zero element within group composed of multiple columns in pandas

I have a dataframe that looks like the following. The rightmost column is my desired column:
Group1 Group2 Value Target_Column
1 3 0 0
1 3 1 1
1 4 1 1
1 4 1 0
2 5 5 5
2 5 1 0
2 6 0 0
2 6 1 1
2 6 9 0
How do I identify the first non-zero value in a group made up of two columns (Group1 & Group2), and then create a column that keeps that first non-zero value and sets everything else to zero?
This question is very similar to one posed earlier here:
Identify first non-zero element within a group in pandas
but that solution gives an error for groups based on multiple columns.
I have tried:
import pandas as pd
dt = pd.DataFrame({'Group1': [1,1,1,1,2,2,2,2,2], 'Group2': [3,3,4,4,5,5,6,6,6], 'Value': [0,1,1,1,5,1,0,1,9]})
dt['Newcol']=0
dt.loc[dt.Value.ne(0).groupby(dt['Group1','Group2']).idxmax(),'Newcol']=dt.Value
Setup
df['flag'] = df.Value.ne(0)
Using numpy.where and assign:
df.assign(
    target=np.where(df.index.isin(df.groupby(['Group1', 'Group2']).flag.idxmax()),
                    df.Value, 0)
).drop(columns='flag')
Using loc and assign
df.assign(
    target=df.loc[df.groupby(['Group1', 'Group2']).flag.idxmax(), 'Value']
).fillna(0).astype(int).drop(columns='flag')
Both produce:
Group1 Group2 Value target
0 1 3 0 0
1 1 3 1 1
2 1 4 1 1
3 1 4 1 0
4 2 5 5 5
5 2 5 1 0
6 2 6 0 0
7 2 6 1 1
8 2 6 9 0
The numbers may be off - when a group contains two identical values, I do not know which one you need.
Using user3483203's setup:
df['flag'] = df.Value.ne(0)
df['Target']=df.sort_values(['flag'],ascending=False).drop_duplicates(['Group1','Group2']).Value
df['Target'].fillna(0,inplace=True)
df
Out[20]:
Group1 Group2 Value Target_Column Target
0 1 3 0 0 0.0
1 1 3 1 1 1.0
2 1 4 1 1 1.0
3 1 4 1 0 0.0
4 2 5 5 5 5.0
5 2 5 1 0 0.0
6 2 6 0 0 0.0
7 2 6 1 1 1.0
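Another way to express the same idea without idxmax (a sketch under the question's setup): mark the row where the running count of non-zero values within each (Group1, Group2) pair first reaches 1, and zero out everything else:

```python
import pandas as pd

df = pd.DataFrame({'Group1': [1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'Group2': [3, 3, 4, 4, 5, 5, 6, 6, 6],
                   'Value':  [0, 1, 1, 1, 5, 1, 0, 1, 9]})

nonzero = df['Value'].ne(0)
# Within each group, the first non-zero row is where the running count
# of non-zero values equals 1 (and the value itself is non-zero).
first = nonzero.groupby([df['Group1'], df['Group2']]).cumsum().eq(1) & nonzero
df['target'] = df['Value'].where(first, 0)
print(df['target'].tolist())  # [0, 1, 1, 0, 5, 0, 0, 1, 0]
```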

Apply a value to all instances of a number based on conditions

I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any ids that have a 1 anywhere in the number column and a zero to those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the total column for every instance of that ID.
My desired output would look as such
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I was trying to think of a way to leverage applymap for this, but I'm not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID').transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, convert that to a dictionary, and then use the dictionary to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
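A compact variant of the same idea (a sketch): check per ID whether any Number equals 1, broadcast the boolean back with transform('any'), and multiply by 5:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 3, 3],
                   'Number': [0, 0, 1, 0, 0, 1, 1, 0]})

# True for every row of an ID that contains a 1 anywhere; bool * 5 -> 0 or 5.
df['Total'] = df['Number'].eq(1).groupby(df['ID']).transform('any') * 5
print(df['Total'].tolist())  # [5, 5, 5, 0, 0, 5, 5, 5]
```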