Pandas Rolling Groupby Shift back 1, Trying to lag rolling sum - python

I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row. My attempt looked like the code below, where i is the column name. There has to be a way to do this, but this method doesn't seem to work.
for i in df.columns.values:
    df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3

I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd
# Reproducible datasets are your friend!
d = pd.DataFrame({'grp': pd.Series(['A']*4 + ['B']*5 + ['C']*6),
                  'x': pd.Series(range(15))})
print(d)
   grp   x
0    A   0
1    A   1
2    A   2
3    A   3
4    B   4
5    B   5
6    B   6
7    B   7
8    B   8
9    C   9
10   C  10
11   C  11
12   C  12
13   C  13
14   C  14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
   grp   x     y
0    A   0   NaN
1    A   1   NaN
2    A   2   NaN
3    A   3   3.0
4    B   4   NaN
5    B   5   NaN
6    B   6   NaN
7    B   7  15.0
8    B   8  18.0
9    C   9   NaN
10   C  10   NaN
11   C  11   NaN
12   C  12  30.0
13   C  13  33.0
14   C  14  36.0
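A note on why the chain above is safe: .shift(1) is applied per group, but .rolling(3).sum() runs over the whole column. Windows that cross a group boundary still come out NaN because the shift inserts a NaN at the first row of each group, and the default min_periods equals the window size. If you would rather keep both steps inside the group, a minimal sketch:
# Per-group shift AND per-group rolling, so no window can span two groups
# regardless of min_periods:
d['y2'] = d.groupby('grp', group_keys=False)['x'].apply(
    lambda s: s.shift(1).rolling(3).sum())
print(d['y'].equals(d['y2']))  # True for this data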

Related

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want to keep only the records whose "Total" column is not NaN, and to drop records where more than two of columns A~E are NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
..i.e. something like df.dropna(....) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and no error is raised either.
Please help! Thanks
Use boolean indexing: count the missing values per row across all columns except Total, and combine that with a check that Total itself is not missing:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
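If you would rather stay close to the df.dropna(...) form the question asks about, here is a chained sketch (assuming thresh=3, i.e. keep rows with at least three non-missing values among A:E, which is the same as at most two NaN):
import pandas as pd
import numpy as np

# Reproducible construction of the example frame, values read off the table above:
df = pd.DataFrame({'A': [1, 1, 3, 2],
                   'B': [1, 4, 6, 2],
                   'C': [3, 3, np.nan, 5],
                   'D': [5, 5, np.nan, 9],
                   'E': [5, 5, np.nan, np.nan],
                   'Total': [8, np.nan, 6, 8]})

# Rows with Total present, then rows with at most two NaN in A:E;
# keeps rows 0 and 3, matching the output above:
out = df.dropna(subset=['Total']).dropna(subset=list('ABCDE'), thresh=3)
print(out)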

python: inserting row at specific index from one dataframe to another

I have two dataframes as follows:
df1:
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
2 0 7 6 5 8
df2:
M N O P Q R S T
0 1 2 3
1 4 5 6
2 7 8 9
3 8 6 5
4 5 4 3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert data_1 into df2 at the specific location (row, column) = (0, 'P'). Is there any way to do it? I do not want to disturb the other columns in df2.
I can extract individual values of each cell and do it, but since I have to do it for a large dataset, it's not feasible to do it cell-wise.
Cellwise method:
>var1 = df1.iat[0,1]
>var2 = df1.iat[0,0]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2
If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
M N O P Q R S T
0 1 2 3 8.0 6.0 4.0 9.0 7.0
1 4 5 6 2.0 6.0 3.0 8.0 5.0
2 7 8 9
3 8 6 5
4 5 4 3
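A slice-based variant of the same assignment, in case you'd rather not type out the column list (a sketch, assuming the target columns are contiguous in df2):
df2.loc[0:1, 'P':'T'] = df1.loc[0:1].values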
You can rename the columns and index values to match the second DataFrame, which makes it possible to use DataFrame.update correctly; the target position is specified by the tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
                       index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
P Q R S T
2 8 6 4 9 7
3 2 6 3 8 5
df2.update(data_1)
print (df2)
M N O P Q R S T
0 1 2 3 NaN NaN NaN NaN NaN
1 4 5 6 NaN NaN NaN NaN NaN
2 7 8 9 8.0 6.0 4.0 9.0 7.0
3 8 6 5 2.0 6.0 3.0 8.0 5.0
4 5 4 3 NaN NaN NaN NaN NaN
How the rename works: the idea is to select all columns from the specified column onward, and all index values from the specified index onward, with loc; then zip each with data_1's own columns and index and convert to dictionaries. The rename then replaces both the index and the column names of data_1 with the corresponding values from df2.
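For illustration, under the data above the two mappings that rename builds look like this (a sketch):
# Columns of df2 from 'P' onward, paired with data_1's columns:
dict(zip(['A', 'B', 'C', 'D', 'E'], ['P', 'Q', 'R', 'S', 'T']))
# -> {'A': 'P', 'B': 'Q', 'C': 'R', 'D': 'S', 'E': 'T'}

# Index values of df2 from position 2 onward, paired with data_1's index
# (zip stops at the shorter sequence):
dict(zip([0, 1], [2, 3, 4]))
# -> {0: 2, 1: 3}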

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row - the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
df.cumsum() is the opposite of df.diff().
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling (due to a MultiIndex or otherwise), you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The issue is that the first n+1 results will be NaN, one more than the equivalent rolling sum produces.
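A quick check of that claim, as a sketch using the series from above:
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
n = 1  # inverse of .diff(1), i.e. a two-row rolling sum

a = s.cumsum().diff(n + 1)   # n + 1 leading NaNs
b = s.rolling(n + 1).sum()   # n leading NaNs

# Identical everywhere past the extra leading NaN:
print((a == b).iloc[n + 1:].all())  # True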

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with a rolling count of historical data that is close to the peak. I wonder if there is an easier way to simplify or, ideally, vectorise the calculation.
This is my code, written in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]], columns=list('ABC'))
df['5D_Peak']=df['C'].rolling(window=5,center=False).max()
for i in range(5, len(df.A)):
    val = 0
    for j in range(i-5, i):
        if df.loc[j,'C'] > df.loc[i,'5D_Peak']-2 and df.loc[j,'C'] < df.loc[i,'5D_Peak']+2:
            val += 1
    df.loc[i,'5D_Close_to_Peak_Count'] = val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0
I believe this is what you want. You can set the two values below:
'''the window within which to search "close-to-peak" values'''
lkp_rng = 5
'''how close is close?'''
closeness_measure = 2
'''function to count the number of "close-to-peak" values in the lkp_rng'''
fc = lambda x: np.count_nonzero(np.where(x >= x.max() - closeness_measure))
'''apply fc to the column you choose'''
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_rng, center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 3.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 2.0
I am guessing what you mean by "historical data".
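One subtlety worth flagging in fc: np.count_nonzero(np.where(cond)) counts the nonzero indices that np.where returns, not the matches themselves, so a match sitting at position 0 of the window is silently dropped (the output above reflects that). If you want the plain count of matches, the direct form would be:
fc = lambda x: np.count_nonzero(x >= x.max() - closeness_measure)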

Rolling sum in subgroups of a dataframe (pandas)

I have a sessions dataframe that contains E-mail and Sessions (int) columns.
I need to calculate rolling sum of sessions per email (i.e. not globally).
Now, the following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
    email_sessions = sessions[sessions['E-mail'] == em]
    email_sessions.is_copy = False
    email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
    ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(either that or some other way of making this faster)
Setup
np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
                   'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling.sum, which can be used on the result of a pd.Series groupby object.
# Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
# \--------------/
# Method you want
E-Mail
A 0 NaN
2 NaN
4 11.0
5 7.0
7 10.0
12 16.0
15 16.0
17 16.0
18 17.0
19 18.0
B 1 NaN
3 NaN
6 18.0
8 14.0
9 16.0
10 12.0
11 13.0
13 16.0
14 20.0
16 22.0
Name: Session, dtype: float64
Details
df
E-Mail Session
0 A 9
1 B 7
2 A 1
3 B 3
4 A 1
5 A 5
6 B 8
7 A 4
8 B 3
9 B 5
10 B 4
11 B 4
12 A 7
13 B 8
14 B 8
15 A 5
16 B 6
17 A 4
18 A 8
19 A 6
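If you want the rolling sum back on the original frame as a column, one approach is a sketch like this (the groupby-rolling result carries E-Mail as an extra index level, which has to be dropped so the values align on the original index):
df['Session_Rolling_Sum'] = (df.groupby('E-Mail').Session
                               .rolling(3).sum()
                               .reset_index(level=0, drop=True))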
Say you start with
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
E-Mail Session
0 foo 0
1 foo 1
2 foo 2
3 bar 3
4 bar 4
5 bar 5
6 foo 6
7 foo 7
8 foo 8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
Session
E-Mail
bar 3 NaN
4 NaN
5 12.0
foo 0 NaN
1 NaN
2 3.0
6 9.0
7 15.0
8 21.0
Incidentally, note that I just rearranged your rolling_sum, but it has been deprecated - you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())
