Pandas dataframe rolling sum column with groupby - python

I'm trying to create a new column that gives a rolling sum of the values in the 'Value' column. The rolling sum covers 4 rows, i.e. the current row and the next three rows. I want to do this separately for each type in the 'Type' column.
However, if there are fewer than 4 rows before the next type starts, I want the rolling sum to use only the remaining rows. For example, if there are 2 rows after the current row for the current type, a total of 3 rows is used for the rolling sum. See the table below showing what I'm currently getting and what I expect.
Index  Type      Value  Current Rolling Sum  Expected Rolling Sum
1      left      5      22                   22
2      left      9      34                   34
3      left      0      NaN                  25
4      left      8      NaN                  25
5      left      17     NaN                  17
6      straight  7      61                   61
7      straight  4      77                   77
8      straight  0      86                   86
9      straight  50     97                   97
10     straight  23     NaN                  47
11     straight  13     NaN                  24
12     straight  11     NaN                  11
The following line of code is what I'm currently using to get the rolling sum.
rolling_sum = df.groupby('Type', sort=False)['Value'].rolling(4, min_periods = 3).sum().shift(-3).reset_index()
rolling_sum = rolling_sum.rename(columns={'Value': 'Rolling Sum'})
extracted_col = rolling_sum['Rolling Sum']
df = df.join(extracted_col)
I would really appreciate your help.

You can try running the rolling sum on the reversed values for each group and then reverse back afterward, using a min_periods of 1:
df['Rolling Sum'] = df.groupby('Type', sort=False)['Value'].apply(lambda x: x[::-1].rolling(4, min_periods=1).sum()[::-1])
Result:
Index Type Value Rolling Sum
0 1 left 5 22.0
1 2 left 9 34.0
2 3 left 0 25.0
3 4 left 8 25.0
4 5 left 17 17.0
5 6 straight 7 61.0
6 7 straight 4 77.0
7 8 straight 0 86.0
8 9 straight 50 97.0
9 10 straight 23 47.0
10 11 straight 13 24.0
11 12 straight 11 11.0
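
For reference, here is a self-contained sketch of the same reverse-roll-reverse idea. It rebuilds the question's data by hand and uses groupby(...).transform instead of apply, so the result is aligned to the original index; the helper name forward_sum is made up for this illustration.

```python
import pandas as pd

# Data copied from the question's table
df = pd.DataFrame({
    'Type': ['left'] * 5 + ['straight'] * 7,
    'Value': [5, 9, 0, 8, 17, 7, 4, 0, 50, 23, 13, 11],
})

def forward_sum(values, window=4):
    # Reverse the group, take an ordinary trailing rolling sum,
    # then reverse back so each row sums itself and the rows after it.
    return values[::-1].rolling(window, min_periods=1).sum()[::-1]

# transform returns a Series aligned to df's index, one group at a time
df['Rolling Sum'] = df.groupby('Type', sort=False)['Value'].transform(forward_sum)
print(df)
```

With min_periods=1 the window simply shrinks near the end of each group, which is the behaviour the question asks for.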

Related

resample data based on group and calculate rolling sum

I would like to create an additional column in my DataFrame without having to loop through the rows.
The column is built in the following steps:
1. Start from the end of the data. For each date, pick every nth row (in this case every 5th) counting back from that date.
2. Take the sum of the first x of those numbers (x=2).
A worked example:
11/22: 5, 7, 3, 2 (every 5th row being picked), but x=2, so 5+7=12
11/15: 6, 5, 2 (every 5th row being picked), but x=2, so 6+5=11
date        value  cumulative
8/30/2019   2
9/6/2019    4
9/13/2019   1
9/20/2019   2
9/27/2019   3      5
10/4/2019   3      7
10/11/2019  5      6
10/18/2019  5      7
10/25/2019  7      10
11/1/2019   4      7
11/8/2019   9      14
11/15/2019  6      11
11/22/2019  5      12
Let's assume we have a set of 15 integers:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], columns=['original_data'])
We define n, the row step (every nth row is added), and x, the number of rows that go into each sum:
n = 5
x = 2
(
    df
    # Add x - 1 columns, each shifted by a further n rows
    .assign(**{
        'n{} x{}'.format(n, i): df['original_data'].shift(n * i)
        for i in range(1, x)})
    # take the row-wise sum
    .sum(axis=1)
)
Output (the intermediate frame after assign, before the row-wise sum):
original_data n5 x1
0 1 NaN
1 2 NaN
2 3 NaN
3 4 NaN
4 5 NaN
5 6 1.0
6 7 2.0
7 8 3.0
8 9 4.0
9 10 5.0
10 11 6.0
11 12 7.0
12 13 8.0
13 14 9.0
14 15 10.0
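
As a cross-check against the question's own table, here is a minimal sketch of the same shift-and-sum idea. It rests on one observation, not confirmed by the asker: the picked rows in the worked example (5, 7, 3, 2 for 11/22) sit 4 rows apart, so a shift step of 4 is what reproduces the 'cumulative' column. The names step and shifted are illustrative.

```python
import pandas as pd

# Data copied from the question's table
dates = ['8/30/2019', '9/6/2019', '9/13/2019', '9/20/2019', '9/27/2019',
         '10/4/2019', '10/11/2019', '10/18/2019', '10/25/2019', '11/1/2019',
         '11/8/2019', '11/15/2019', '11/22/2019']
df = pd.DataFrame({'value': [2, 4, 1, 2, 3, 3, 5, 5, 7, 4, 9, 6, 5]}, index=dates)

step = 4  # assumption: picked rows are 4 positions apart in the example
x = 2     # number of picked rows that go into each sum

# One column per picked row: shift 0 is the current row, shift `step` the row 4 back, ...
shifted = pd.concat([df['value'].shift(step * i) for i in range(x)], axis=1)

# min_count=x leaves NaN where fewer than x values are available
df['cumulative'] = shifted.sum(axis=1, min_count=x)
print(df)
```

Using min_count keeps the first rows empty, as in the question's table, instead of returning partial sums.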

Calculating simple moving average pandas for loop

I'm currently trying to calculate the simple moving average on a dataset of several stocks. I'm testing the code on just two companies (and a 4-day window) for simplicity, but there seems to be a problem with the output. Below is my code.
for index, row in df3.iloc[4:].iterrows():
    if df3.loc[index, 'CompanyId'] == df3.loc[index-4, 'CompanyId']:
        df3['SMA4'] = df3.iloc[:, 1].rolling(window=4).mean()
    else:
        df3['SMA4'] = 0
And here is the output: [image: Output]
The DataFrame is sorted by date and company id. When the company ids are not equal, as checked in the code, the output should be zero, since I can't calculate a moving average across two different companies. Instead, it outputs a moving average over both companies, as in rows 7, 8 and 9.
Use groupby.rolling
df['SMA4'] = df.groupby('CompanyId', sort=False)['Price'].rolling(window=4).mean().reset_index(level=0, drop=True)
print(df)
CompanyId Price SMA4
0 1 75 NaN
1 1 74 NaN
2 1 77 NaN
3 1 78 76.00
4 1 80 77.25
5 1 79 78.50
6 1 80 79.25
7 0 10 NaN
8 0 9 NaN
9 0 12 NaN
10 0 11 10.50
11 0 11 10.75
12 0 8 10.50
13 0 9 9.75
14 0 8 9.00
15 0 8 8.25
16 0 11 9.00
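
Here is a self-contained sketch of the per-company moving average, with the data typed in from the printed output above. It uses transform rather than groupby.rolling so the result comes back already aligned to the original index; that is a design choice for this illustration, not part of the answer itself.

```python
import pandas as pd

# Data copied from the printed output above
df = pd.DataFrame({
    'CompanyId': [1] * 7 + [0] * 10,
    'Price': [75, 74, 77, 78, 80, 79, 80, 10, 9, 12, 11, 11, 8, 9, 8, 8, 11],
})

# The rolling window restarts for every company, so the first three rows
# of each company are NaN instead of mixing prices across companies.
df['SMA4'] = (
    df.groupby('CompanyId', sort=False)['Price']
      .transform(lambda s: s.rolling(window=4).mean())
)
print(df)
```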
While ansev is right that you should use the specialized function, because manual loops are much slower, I want to show why your code didn't work:
Both the if branch and the else branch assign to the entire SMA4 column (df3['SMA4']), so every iteration overwrites the whole column. On the last pass through the loop the if condition is true, so the earlier else assignments have no lasting effect and SMA4 is never 0. To fix that, you could first create the column populated with rolling averages (note that this is not in a for loop):
df3['SMA4'] = df3.iloc[:, 1].rolling(window=4).mean()
Then run the loop to set the invalid rows to 0 (though NaN would be better; I kept the other bugs in, assuming that the numbers in ansev's answer are correct):
for index, row in df3.iloc[4:].iterrows():
    if df3.loc[index, 'CompanyId'] != df3.loc[index-4, 'CompanyId']:
        df3.loc[index, 'SMA4'] = 0
Output (probably still buggy):
CompanyId Price SMA4
0 1 75 NaN
1 1 74 NaN
2 1 77 NaN
3 1 78 76.00
4 1 80 77.25
5 1 79 78.50
6 1 80 79.25
7 2 10 0.00
8 2 9 0.00
9 2 12 0.00
10 2 11 0.00
11 2 11 10.75
12 2 8 10.50
13 2 9 9.75
14 2 8 9.00
15 2 8 8.25
16 2 11 9.00

How to create a column displaying last recorded peak value in DataFrame?

I am trying to create a new column that will list down the last recorded peak values, until the next peak comes along. For example, suppose this is my existing DataFrame:
index values
0 10
1 20
2 15
3 17
4 15
5 22
6 20
I want to get something like this:
index values last_recorded_peak
0 10 10
1 20 20
2 15 20
3 17 17
4 15 17
5 22 22
6 20 22
So far, I have tried with np.maximum.accumulate, which 'accumulates' the max value but not quite the "peaks" (some peaks might be lower than the max value).
I have also tried with scipy.signal.find_peaks which returns an array of indexes where my peaks are (in the example, index 1, 3, 5), which is not what I'm looking for.
I'm relatively new to coding, any pointer is very much appreciated!
You're on the right track: scipy.signal.find_peaks is the way I would go. You just need to do a little more work on the result:
from scipy import signal
# find_peaks returns (peak_positions, properties); keep only the positions
peaks = signal.find_peaks(df['values'])[0]
df['last_recorded_peak'] = (df.assign(last_recorded_peak=float('nan'))
                              .last_recorded_peak
                              .combine_first(df.loc[peaks, 'values'])  # peak values at peak rows
                              .ffill()                                 # carry each peak forward
                              .combine_first(df['values']))            # rows before the first peak
print(df)
index values last_recorded_peak
0 0 10 10.0
1 1 20 20.0
2 2 15 20.0
3 3 17 17.0
4 4 15 17.0
5 5 22 22.0
6 6 20 22.0
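
The same carry-forward logic can be sketched without SciPy. This version swaps find_peaks for a plain "strictly greater than both neighbours" test, which is an assumption made for this illustration: it matches find_peaks on the example data but treats plateaus differently.

```python
import pandas as pd

df = pd.DataFrame({'values': [10, 20, 15, 17, 15, 22, 20]})
v = df['values']

# A row counts as a peak if it is strictly greater than both neighbours.
# Comparisons against the NaN produced by shift() are False, so the first
# and last rows are never flagged.
is_peak = (v > v.shift(1)) & (v > v.shift(-1))

df['last_recorded_peak'] = (
    v.where(is_peak)   # keep peak values, NaN everywhere else
     .ffill()          # carry the last peak forward
     .fillna(v)        # rows before the first peak fall back to their own value
)
print(df)
```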
If I understand you correctly, you are looking for a rolling max.
Note: you might have to play around with the window size, which I set to 2 for your example DataFrame.
df['last_recorded_peak'] = df['values'].rolling(2).max().fillna(df['values'])
Output
values last_recorded_peak
0 10 10.0
1 20 20.0
2 15 20.0
3 17 17.0
4 15 17.0
5 22 22.0
6 20 22.0

pandas dataframe sort columns according to column totals

I was able to sort rows according to the last column. However, I also have a row at the bottom of the dataframe which has the totals of each column. I couldn't find a way to sort the columns according to the totals in the last row. The table looks like the following:
A B C T
0 9 9 9 27
1 9 10 4 23
2 7 4 8 19
3 2 6 9 17
T 27 29 30
I want this table to be sorted so that the order of columns will be from left to right C, B, A from highest total to lowest. How can this be done?
Use DataFrame.sort_values by index value T with axis=1:
df = df.sort_values('T', axis=1, ascending=False)
print (df)
C B A T
0 9 9 9 27.0
1 4 10 9 23.0
2 8 4 7 19.0
3 9 6 2 17.0
T 30 29 27 NaN
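
To try this out, here is a small reproducible sketch that types in the table from the question, including the empty cell where the T row meets the T column (stored as NaN here, which is an assumption about how the original frame was built).

```python
import pandas as pd

df = pd.DataFrame(
    {
        'A': [9, 9, 7, 2, 27],
        'B': [9, 10, 4, 6, 29],
        'C': [9, 4, 8, 9, 30],
        'T': [27, 23, 19, 17, float('nan')],
    },
    index=[0, 1, 2, 3, 'T'],
)

# With axis=1 the `by` argument is a row label: the columns are reordered
# by the values found in row 'T'. NaN sorts last by default, so the T
# column stays on the right.
sorted_df = df.sort_values('T', axis=1, ascending=False)
print(sorted_df)
```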

Variable shift in a pandas dataframe

import pandas as pd
df = pd.DataFrame({'A':[3,5,3,4,2,3,2,3,4,3,2,2,2,3],
'B':[10,20,30,40,20,30,40,10,20,30,15,60,20,15]})
A B
0 3 10
1 5 20
2 3 30
3 4 40
4 2 20
5 3 30
6 2 40
7 3 10
8 4 20
9 3 30
10 2 15
11 2 60
12 2 20
13 3 15
I'd like to append a C column, containing rolling average of B (rolling period = A).
For example, the C value at row index(2) should be df.B.rolling(3).mean() = mean(10,20,30), and the C value at row index(4) should be df.B.rolling(2).mean() = mean(40,20).
Probably stupid slow... but this gets it done:
def crazy_apply(row):
    p = df.index.get_loc(row.name)  # ordinal position of the current row
    a = row.A                       # window length for this row
    return df.B.iloc[p-a+1:p+1].mean()
df.apply(crazy_apply, 1)
0 NaN
1 NaN
2 20.000000
3 25.000000
4 30.000000
5 30.000000
6 35.000000
7 26.666667
8 25.000000
9 20.000000
10 22.500000
11 37.500000
12 40.000000
13 31.666667
dtype: float64
Explanation
apply iterates through each column or each row. We iterate through each row because we used the parameter axis=1 (see 1 as the second argument in the call to apply). So every iteration of apply passes a pandas Series object that represents the current row. The current index value is in the name attribute of the row. The index of the row object is the same as the columns of df.
So, df.index.get_loc(row.name) finds the ordinal position of the current index value held in row.name. row.A is the value of column A for that row.
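
If the row-wise apply turns out to be too slow, one possible alternative (a different technique from the answer above, sketched here under the assumption that every window is a trailing window of length A) is to use a cumulative sum: the sum of any window is the difference of two cumulative sums.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 20, 30, 40, 10, 20, 30, 15, 60, 20, 15]})

a = df['A'].to_numpy()
b = df['B'].to_numpy(dtype=float)
pos = np.arange(len(df))

# csum[k] holds the sum of the first k values of B (csum[0] == 0)
csum = np.concatenate(([0.0], np.cumsum(b)))

start = pos - a + 1   # first position of each row's window
valid = start >= 0    # windows that would reach before row 0 stay NaN

c = np.full(len(df), np.nan)
c[valid] = (csum[pos[valid] + 1] - csum[start[valid]]) / a[valid]
df['C'] = c
print(df)
```

On the example data this gives the same values as crazy_apply, including NaN for the first two rows where the window does not fit.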
