How to Reverse Rolling Sum? - python

I have a rolling sum calculated on a grouped data frame, but it's adding up the wrong way: it is a sum of the future, when I need a sum of the past.
What am I doing wrong here?
I import the data and sort by Dimension and Date (I have already tried removing the date sort):
df = pd.read_csv('Input.csv', parse_dates=True)
df.sort_values(['Dimension','Date'])
print(df)
I then create a new column from the grouped rolling window (the result is a Series with a MultiIndex):
new_column = df.groupby('Dimension').Value1.apply(lambda x: x.rolling(window=3).sum())
I then reset the index to be the same as the original
df['Sum_Value1'] = new_column.reset_index(level=0, drop=True)
print(df)
I have also tried reversing the index before the calculation, but that also failed.
Input
Dimension,Date,Value1,Value2
1,4/30/2002,10,20
1,1/31/2002,10,20
1,10/31/2001,10,20
1,7/31/2001,10,20
1,4/30/2001,10,20
1,1/31/2001,10,20
1,10/31/2000,10,20
2,4/30/2002,10,20
2,1/31/2002,10,20
2,10/31/2001,10,20
2,7/31/2001,10,20
2,4/30/2001,10,20
2,1/31/2001,10,20
2,10/31/2000,10,20
3,4/30/2002,10,20
3,1/31/2002,10,20
3,10/31/2001,10,20
3,7/31/2001,10,20
3,1/31/2001,10,20
3,10/31/2000,10,20
Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 NaN
1 1 1/31/2002 10 20 NaN
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 30.0
6 1 10/31/2000 10 20 30.0
7 2 4/30/2002 10 20 NaN
8 2 1/31/2002 10 20 NaN
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 30.0
13 2 10/31/2000 10 20 30.0
Goal Output:
Dimension Date Value1 Value2 Sum_Value1
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 NaN
13 2 10/31/2000 10 20 NaN

You need a backward sum, so reverse your series before applying the rolling sum:
lambda x: x[::-1].rolling(window=3).sum()
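A minimal sketch of how that lambda slots into the question's frame; group_keys=False is my addition so the result keeps the original row index and aligns on assignment:
import pandas as pd

df = pd.read_csv('Input.csv')  # the file from the question

# Reverse each group, roll forward over the reversed order, and let pandas
# realign the (reversed-index) result to df's index on assignment.
df['Sum_Value1'] = df.groupby('Dimension', group_keys=False)['Value1'].apply(
    lambda x: x[::-1].rolling(window=3).sum()
)
This reproduces the goal output above: the NaNs land on the two oldest dates of each Dimension.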

You can shift the result by -(window - 1) to get left-aligned results:
df["sum_value1"] = (df.groupby('Dimension').Value1
.apply(lambda x: x.rolling(window=3).sum().shift(-2)))

Rolling backwards is the same as rolling forward and then shifting the result:
x.rolling(window=3).sum().shift(-2)
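A quick check of that equivalence on a toy Series (my own sketch, not part of the answer):
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Reverse, roll, flip back ...
backward = s[::-1].rolling(window=3).sum()[::-1]
# ... versus roll forward and shift by -(window - 1).
shifted = s.rolling(window=3).sum().shift(-2)

print(backward.equals(shifted))  # True: both are 6, 9, 12, NaN, NaN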

A small helper does the same thing: roll over the reversed series, then flip the result back into the original row order:
def reverse_rolling(series, window, func):
    # min_periods=1 (as in the original) gives partial results at the
    # tail of each series instead of NaN.
    rolled = series.iloc[::-1].rolling(window, min_periods=1).apply(func)
    return rolled.iloc[::-1]
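Applied per group to the question's frame it might look like this (a sketch; np.sum as the window function and group_keys=False are my choices, not part of the original answer):
import numpy as np

# Note: min_periods=1 in the helper means the two oldest rows of each group
# get partial sums (20.0 and 10.0) instead of the NaNs in the goal output.
df['Sum_Value1'] = df.groupby('Dimension', group_keys=False)['Value1'].apply(
    lambda g: reverse_rolling(g, 3, np.sum)
)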

You can use FixedForwardWindowIndexer from pandas.api.indexers:
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer
df = pd.read_csv(r'C:\Users\xxxx\python\data.txt')
indexer = FixedForwardWindowIndexer(window_size=3)
df1 = df.join(
    df.groupby('Dimension')['Value1']
      .rolling(indexer, min_periods=3)
      .sum()
      .to_frame()
      .reset_index(),
    rsuffix='_sum')
del df1['Dimension_sum']
del df1['level_1']
df1
Input:
Dimension Date Value1 Value2
0 1 4/30/2002 10 20
1 1 1/31/2002 10 20
2 1 10/31/2001 10 20
3 1 7/31/2001 10 20
4 1 4/30/2001 10 20
5 1 1/31/2001 10 20
6 1 10/31/2000 10 20
7 2 4/30/2002 10 20
8 2 1/31/2002 10 20
9 2 10/31/2001 10 20
10 2 7/31/2001 10 20
11 2 4/30/2001 10 20
12 2 1/31/2001 10 20
13 2 10/31/2000 10 20
14 3 4/30/2002 10 20
15 3 1/31/2002 10 20
16 3 10/31/2001 10 20
17 3 7/31/2001 10 20
18 3 1/31/2001 10 20
19 3 10/31/2000 10 20
OUTPUT:
Dimension Date Value1 Value2 Value1_sum
0 1 4/30/2002 10 20 30.0
1 1 1/31/2002 10 20 30.0
2 1 10/31/2001 10 20 30.0
3 1 7/31/2001 10 20 30.0
4 1 4/30/2001 10 20 30.0
5 1 1/31/2001 10 20 NaN
6 1 10/31/2000 10 20 NaN
7 2 4/30/2002 10 20 30.0
8 2 1/31/2002 10 20 30.0
9 2 10/31/2001 10 20 30.0
10 2 7/31/2001 10 20 30.0
11 2 4/30/2001 10 20 30.0
12 2 1/31/2001 10 20 NaN
13 2 10/31/2000 10 20 NaN
14 3 4/30/2002 10 20 30.0
15 3 1/31/2002 10 20 30.0
16 3 10/31/2001 10 20 30.0
17 3 7/31/2001 10 20 30.0
18 3 1/31/2001 10 20 NaN
19 3 10/31/2000 10 20 NaN
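To see what the forward indexer does in isolation (a throwaway Series, my own sketch): each window starts at the current row and extends window_size rows forward, so the last rows run out of data and become NaN under min_periods=3.
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

s = pd.Series([10, 10, 10, 10, 10])
indexer = FixedForwardWindowIndexer(window_size=3)

# Rows 0-2 see three values ahead (30.0); rows 3-4 see fewer than
# min_periods=3 values, so they come out as NaN.
print(s.rolling(indexer, min_periods=3).sum())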

Just had to do the same thing myself and came up with a simple one-liner:
df['Sum_Value1'] = df['Value1'].iloc[::-1].rolling(window = 3).sum()

Related

How to do pandas rolling window in both forward and backward at the same time

I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take the previous 2 values, the current value and the next 2 values (a window of 5), sum them, and store the result in a new column. Desired output:
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried,
df['A_sum'] = df['A'].rolling(2).sum()
I also tried shift, but everything works either only forward or only backward; I'm looking for a combination of both.
Use a rolling window of 5 and add the parameters center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
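min_periods=1 is what keeps the edge rows from turning into NaN: the centered window simply shrinks where it runs off the ends. A small sketch of the effective window sizes (my addition):
# Effective number of values in each centered window of nominal size 5.
print(df['A'].rolling(5, center=True, min_periods=1).count().tolist())
# [3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 3.0]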
If you are allowed to use numpy, then you might use numpy.convolve to get the desired output:
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
You can use shift for this (straightforward if not elegant); you need the two neighbours on each side plus the current value:
df["A_sum"] = df.A + df.A.shift(2).fillna(0) + df.A.shift(1).fillna(0) + df.A.shift(-1).fillna(0) + df.A.shift(-2).fillna(0)
output:
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0

How to use each vector entry to fill NaNs of separate groups in a dataframe

Say I have a vector valsHR which looks like this:
valsHR = [78.8, 82.3, 91.0]
And I have a dataframe mainData:
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using this:
mainData['HR'] = mainData['HR'].fillna(valsHR)
but it fills all the NaNs with the first value in the vector.
I've also tried to use this:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR)
which fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary from the Patient values that still have missing values, map it to the original column, and fill only the missing values:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
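The key step is the dictionary built only from the patients that still have NaNs; in the second example it maps patients 1 and 3, which is why patient 3 ends up with 82.3 rather than 91.0. A sketch of the intermediates (my addition):
p = df.loc[df.HR.isna(), 'Patient'].unique()
print(p)                      # [1 3]  -> patient 2 has no missing HR
print(dict(zip(p, valsHR)))   # {1: 78.8, 3: 82.3}  (91.0 goes unused)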
It is simply a mapping, if all of the NaNs should be replaced:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0

Averaging every 10 rows of one column within a dataframe, pulling every tenth item from the others?

Let's say I have the following sample dataframe:
import random
import pandas as pd

df = pd.DataFrame({'depth': list(range(0, 21)),
                   'time': list(range(0, 21)),
                   'metric': random.choices(range(10), k=21)})
df
Out[65]:
depth time metric
0 0 0 2
1 1 1 3
2 2 2 8
3 3 3 0
4 4 4 8
5 5 5 9
6 6 6 5
7 7 7 1
8 8 8 6
9 9 9 6
10 10 10 7
11 11 11 2
12 12 12 7
13 13 13 0
14 14 14 6
15 15 15 0
16 16 16 5
17 17 17 6
18 18 18 9
19 19 19 6
20 20 20 8
I want to average every ten rows of the "metric" column (preserving the first row as is) and pull every tenth item from the depth and time columns. For example:
depth time metric
0 0 0 2
10 10 10 5.3
20 20 20 4.9
I know that groupby is usually used in these situations, but I do not know how to tweak it to get my desired outcome:
df[['metric']].groupby(df.index //10).mean()
Out[66]:
metric
0 4.8
1 4.8
2 8.0
@BENY's answer is on the right track but not quite right. It should be:
df.groupby((df.index+9)//10).agg({'depth':'last','time':'last','metric':'mean'})
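The (df.index + 9) // 10 trick puts row 0 in a group of its own and then groups rows 1-10 and 11-20, so 'last' picks up rows 0, 10 and 20 while 'mean' averages each block. A sketch of the labels (my addition):
print(((df.index + 9) // 10).tolist())
# [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]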
You can do rolling with reindex + fillna:
df.rolling(10).mean().reindex(df.index[::10]).fillna(df)
depth time metric
0 0.0 0.0 2.0
10 5.5 5.5 5.3
20 15.5 15.5 4.9
Or to match output for depth and time:
out = (df.assign(metric=df['metric'].rolling(10).mean()
                          .reindex(df.index[::10])
                          .fillna(df['metric']))
         .dropna(subset=['metric']))
print(out)
depth time metric
0 0 0 2.0
10 10 10 5.3
20 20 20 4.9
Let us do agg
g = df.index.isin(df.index[::10]).cumsum()[::-1]
df.groupby(g).agg({'depth':'last','time':'last','metric':'mean'})
Out[263]:
depth time metric
1 20 20 4.9
2 10 10 5.3
3 0 0 2.0

Python - pandas: rolling_mean command is not working

dataframe = pd.DataFrame(data={'user': [1,1,1,1,1,2,2,2,2,2], 'usage':
[12,18,76,32,43,45,19,42,9,10]})
dataframe['mean'] = dataframe.groupby('user'['usage'].apply(pd.rolling_mean, 2))
Why is this code not working?
I am getting an error that the rolling_mean attribute is not found in pandas.
Use groupby with rolling (pd.rolling_mean was deprecated and later removed in favor of the .rolling() method):
dataframe['mean'] = (dataframe.groupby('user')['usage']
.rolling(2)
.mean()
.reset_index(level=0, drop=True))
print (dataframe)
user usage mean
0 1 12 NaN
1 1 18 15.0
2 1 76 47.0
3 1 32 54.0
4 1 43 37.5
5 2 45 NaN
6 2 19 32.0
7 2 42 30.5
8 2 9 25.5
9 2 10 9.5
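The reset_index(level=0, drop=True) at the end is needed because the grouped rolling result comes back indexed by (user, original row); dropping the user level lets it align with the original dataframe. A sketch of the intermediate index (my addition):
roll = dataframe.groupby('user')['usage'].rolling(2).mean()
print(roll.index[:3].tolist())  # [(1, 0), (1, 1), (1, 2)]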

pandas pct_change() in reverse

Suppose we have a dataframe and we calculate the percent change between rows:
y_axis = [1,2,3,4,5,6,7,8,9]
x_axis = [100,105,115,95,90,88,110,100,0]
DF = pd.DataFrame({'Y':y_axis, 'X':x_axis})
DF = DF[['Y','X']]
DF['PCT'] = DF['X'].pct_change()
Y X PCT
0 1 100 NaN
1 2 105 0.050000
2 3 115 0.095238
3 4 95 -0.173913
4 5 90 -0.052632
5 6 88 -0.022222
6 7 110 0.250000
7 8 100 -0.090909
8 9 0 -1.000000
That way it starts from the first row.
I want to calculate pct_change() starting from the last row.
One way to do it:
DF['Reverse'] = list(reversed(x_axis))
DF['PCT_rev'] = DF['Reverse'].pct_change()
pct_rev = DF.PCT_rev.tolist()
DF['_PCT_'] = list(reversed(pct_rev))
DF2 = DF[['Y','X','PCT','_PCT_']]
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
But that is a very ugly and inefficient solution.
I was wondering if there are more elegant solutions?
DF.assign(_PCT_=DF.X.pct_change(-1))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
Series.pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
periods : int, default 1 Periods to shift for forming percent change
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
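So periods=-1 simply compares each row with the next one; it is equivalent to this explicit shift (my sketch):
# pct_change(-1) is X[i] / X[i+1] - 1, the change relative to the next row.
DF['_PCT_'] = DF['X'] / DF['X'].shift(-1) - 1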
I deleted my other answer because @su79eu7k's is way better.
You can cut your time in half by using the underlying arrays. But you also have to suppress a warning.
import numpy as np

a = DF.X.values
DF.assign(_PCT_=np.append((a[:-1] - a[1:]) / a[1:], np.nan))
Y X PCT _PCT_
0 1 100 NaN -0.047619
1 2 105 0.050000 -0.086957
2 3 115 0.095238 0.210526
3 4 95 -0.173913 0.055556
4 5 90 -0.052632 0.022727
5 6 88 -0.022222 -0.200000
6 7 110 0.250000 0.100000
7 8 100 -0.090909 inf
8 9 0 -1.000000 NaN
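The warning mentioned above is NumPy's divide-by-zero RuntimeWarning triggered by the 0 in X; one way to silence it just around the division (my sketch) is np.errstate:
import numpy as np

a = DF.X.values
with np.errstate(divide='ignore', invalid='ignore'):
    rev_pct = (a[:-1] - a[1:]) / a[1:]

DF = DF.assign(_PCT_=np.append(rev_pct, np.nan))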
