I need to get the rolling 2nd largest value of a df.
To get the largest value I do
max = df.sort_index(ascending=True).rolling(10).max()
But when I try the same pattern for the second largest value, pandas throws an error:
max = df.sort_index(ascending=True).rolling(10).nlargest(2)
AttributeError: 'Rolling' object has no attribute 'nlargest'
Is this a bug? What else can I use that is performant?
I'd do something like this:
df.rolling(10).apply(lambda x: pd.Series(x).nlargest(2).iloc[-1])
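For context, a minimal self-contained sketch of that idea (using a window of 3 on a toy Series so the result is easy to verify; the window size is just a placeholder):
import pandas as pd

s = pd.Series([8, 18, 5, 15, 12, 10])
# nlargest(2).iloc[-1] picks the 2nd largest value inside each window of 3
second = s.rolling(3).apply(lambda x: pd.Series(x).nlargest(2).iloc[-1])
print(second)
# 0     NaN
# 1     NaN
# 2     8.0
# 3    15.0
# 4    12.0
# 5    12.0
# dtype: float64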
Use np.sort in descending order and select the second value:
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame({
    'B': np.random.randint(20, size=15)
})
print (df)
B
0 8
1 18
2 5
3 15
4 12
5 10
6 16
7 16
8 7
9 5
10 19
11 12
12 16
13 18
14 5
a = df.rolling(10).apply(lambda x: -np.sort(-x)[1])
#alternative
#a = df.rolling(10).apply(lambda x: np.sort(x)[-2])
print (a)
B
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 16.0
10 18.0
11 16.0
12 16.0
13 18.0
14 18.0
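Since the question also asks about performance: rolling.apply calls a Python function once per window, so it will never be as fast as a built-in like rolling.max. One thing that usually helps (assuming a pandas version new enough to support the raw= argument, 0.23+) is raw=True, which passes each window as a plain NumPy array instead of building a Series per window. Reusing the df built above:
# pass raw ndarrays to the lambda instead of per-window Series
a = df.rolling(10).apply(lambda x: np.sort(x)[-2], raw=True)
On newer pandas with numba installed, engine='numba' (together with raw=True) can speed this up further.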
I am trying to add rows to my dataframes based on a condition, but I am having difficulty achieving this.
Currently, I have pandas dataframes in a list (see the toy data below).
The objective is to add rows that keep a fixed 'ID', increase 'month' in steps of 3, and fill 'num' with NaN for the new months.
For example, for this[1] I want it to add rows so that it looks like the following:
ID | month | num
6 | 0 | 5
6 | 3 | NaN
6 | 6 | 4
6 | 9 | NaN
6 | 12 | 3
...
6 | 36 | 1
I am trying to create a function that takes a dataframe from the list, the maximum month of that dataframe, and the increment I want for the month (3). It would look something like:
def add_rows(df, max_mon, res):
    if max_mon > res:
        # add rows with the fixed ID and NaN num,
        # skipping the months that already exist
        ...

final = []
for i in range(len(this)):
    final.append(add_rows(this[i], this[i]['month'].max(), 3))
I have tried inserting rows, but I did not manage to get it to work.
The toy data:
d = {'ID':[5,5,5,5,5], 'month':[0,6,12,24,36], 'num':[5,4,3,2,1]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6], 'month':[0,6,12,18,36], 'num':[5,4,3,2,1]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
I would really appreciate it if I could get help building this function!
You can use:
for i, df in enumerate(this):
this[i] = (df
.set_index('month')
.groupby('ID')
.apply(lambda x: x.drop(columns='ID')
.reindex(range(x.index.min(), x.index.max()+3, 3))
)
.reset_index()[df.columns]
)
Updated this:
[ ID month num
0 5 0 5.0
1 5 3 NaN
2 5 6 4.0
3 5 9 NaN
4 5 12 3.0
5 5 15 NaN
6 5 18 NaN
7 5 21 NaN
8 5 24 2.0
9 5 27 NaN
10 5 30 NaN
11 5 33 NaN
12 5 36 1.0,
ID month num
0 6 0 5.0
1 6 3 NaN
2 6 6 4.0
3 6 9 NaN
4 6 12 3.0
5 6 15 NaN
6 6 18 2.0
7 6 21 NaN
8 6 24 NaN
9 6 27 NaN
10 6 30 NaN
11 6 33 NaN
12 6 36 1.0]
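If you prefer a function with the signature you sketched, a minimal alternative sketch (assuming each dataframe in this holds a single ID, as in the toy data; step is just a placeholder name for the 3-month increment) is to build the full month grid and left-merge the original rows onto it:
import numpy as np
import pandas as pd

def add_rows(df, step=3):
    # full month range for this ID, in increments of `step`
    months = np.arange(df['month'].min(), df['month'].max() + step, step)
    grid = pd.DataFrame({'ID': df['ID'].iloc[0], 'month': months})
    # existing months keep their num, new months get NaN
    return grid.merge(df, on=['ID', 'month'], how='left')

this = [add_rows(df) for df in this]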
I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take the previous 2 values, the current value, and the next 2 values (a window of 5), take the sum, and store it in a new column. Desired output:
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried:
df['A_sum'] = df['A'].rolling(2).sum()
I also tried shift, but that only looks either forward or backward; I'm looking for a combination of both.
Use a rolling window of 5 and pass center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
If you are allowed to use numpy, then you might use numpy.convolve to get the desired output:
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
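As a side note, 'same' mode zero-pads both ends of the array, so the edge sums only include the values that actually exist; that is why this matches the min_periods=1 behaviour of the rolling solution above. The only visible difference is that convolve keeps the integer dtype while rolling returns floats.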
You can use shift for this (straightforward if not elegant):
df["A_sum"] = df.A + df.A.shift(2).fillna(0) + df.A.shift(1).fillna(0) + df.A.shift(-1).fillna(0) + df.A.shift(-2).fillna(0)
output:
   A  A_sum
0  1    6.0
1  2   10.0
2  3   15.0
3  4   20.0
4  5   25.0
5  6   30.0
6  7   28.0
7  8   27.0
8  2   21.0
9  4   14.0
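If the window size might change, the same idea can be written once as a sum over shifts (a small generalisation of the line above, not a different method):
# centered window of 5: offsets -2..2 relative to each row
df["A_sum"] = sum(df["A"].shift(k).fillna(0) for k in range(-2, 3))
which reproduces the rolling(5, center=True, min_periods=1) result.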
I am trying to get a rolling sum of the past 3 rows for the same ID, but lagging this by 1 row. My attempt looks like the code below, where i is the column. There has to be a way to do this, but this method doesn't seem to work.
for i in df.columns.values:
df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3
I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd
# Reproducible datasets are your friend!
d = pd.DataFrame({'grp':pd.Series(['A']*4 + ['B']*5 + ['C']*6),
'x':pd.Series(range(15))})
print(d)
grp x
A 0
A 1
A 2
A 3
B 4
B 5
B 6
B 7
B 8
C 9
C 10
C 11
C 12
C 13
C 14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
grp x y
A 0 NaN
A 1 NaN
A 2 NaN
A 3 3.0
B 4 NaN
B 5 NaN
B 6 NaN
B 7 15.0
B 8 18.0
C 9 NaN
C 10 NaN
C 11 NaN
C 12 30.0
C 13 33.0
C 14 36.0
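One subtlety worth knowing: in the chained version only the shift is group-aware; the .rolling(3).sum() afterwards runs over the whole shifted Series. It still gives per-group results here because shift(1) puts a NaN at the start of every group, which turns any window crossing a group boundary into NaN. If you add min_periods or change the lag, it is safer to keep both steps inside the groupby, for example with transform:
# both the lag and the rolling window stay within each group
d['y'] = d.groupby('grp')['x'].transform(lambda s: s.shift(1).rolling(3).sum())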
I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of each row with the previous row, the opposite of the .diff() function, which takes the difference.
This is how I'm currently solving the problem:
df = pd.DataFrame({'c': ['dd', 'ee', 'ff', 'gg', 'hh'], 'd': [1, 2, 3, 4, 5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
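Note that the .shift(-1) here (like the shift(-1) in the question) pairs each row with the next one. If you literally want the sum of each row with the previous row, the inverse of .diff(), just drop the shift:
df['f'] = df['d'].rolling(2).sum()   # first value NaN, then d[i] + d[i-1]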
df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling (due to a MultiIndex or otherwise), you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, .cumsum().diff(n) gives you the sum of n consecutive values, so it plays the role of an inverse of .diff(). The catch is that the first n results come out as NaN, one more than with rolling(n).sum().
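A quick check of that claim on the same data (the only difference is the extra leading NaN from diff):
import pandas as pd

s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
print(s.rolling(2).sum().tolist())   # [nan, 7.0, 9.0, 12.0, 14.0, 35.0, 131.0, 109.0]
print(s.cumsum().diff(2).tolist())   # [nan, nan, 9.0, 12.0, 14.0, 35.0, 131.0, 109.0]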
I have a sessions dataframe that contains E-mail and Sessions (int) columns.
I need to calculate rolling sum of sessions per email (i.e. not globally).
Now, the following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
email_sessions = sessions[sessions['E-mail'] == em]
email_sessions.is_copy = False
email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(either that or some other way of making this faster)
Setup
np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling.sum, which can be used on the result of a pd.Series groupby object.
# Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
# \--------------/
# Method you want
E-Mail
A 0 NaN
2 NaN
4 11.0
5 7.0
7 10.0
12 16.0
15 16.0
17 16.0
18 17.0
19 18.0
B 1 NaN
3 NaN
6 18.0
8 14.0
9 16.0
10 12.0
11 13.0
13 16.0
14 20.0
16 22.0
Name: Session, dtype: float64
Details
df
E-Mail Session
0 A 9
1 B 7
2 A 1
3 B 3
4 A 1
5 A 5
6 B 8
7 A 4
8 B 3
9 B 5
10 B 4
11 B 4
12 A 7
13 B 8
14 B 8
15 A 5
16 B 6
17 A 4
18 A 8
19 A 6
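If you want this back on the original frame as the Session_Rolling_Sum column from the question, note that the grouped result carries E-Mail as an extra index level, so drop that level before assigning (a sketch, assuming df keeps its default RangeIndex):
roll = df.groupby('E-Mail').Session.rolling(3).sum()
df['Session_Rolling_Sum'] = roll.reset_index(level=0, drop=True)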
Say you start with
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
E-Mail Session
0 foo 0
1 foo 1
2 foo 2
3 bar 3
4 bar 4
5 bar 5
6 foo 6
7 foo 7
8 foo 8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
Session
E-Mail
bar 3 NaN
4 NaN
5 12.0
foo 0 NaN
1 NaN
2 3.0
6 9.0
7 15.0
8 21.0
Incidentally, note that I just rearranged your rolling_sum, but it has since been deprecated; you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())