I have the following dataframe.
How do I compute a rolling quantile(0.4) over a window of 3 rows that pools the values from 2 columns?
### Sample Dataframe
import numpy as np
import pandas as pd
np.random.seed(0) # Freeze randomness
a = pd.DataFrame(np.random.randint(1,10,size=(20, 1)), columns=list('A'))
b = pd.DataFrame(np.random.randint(50,90,size=(20, 1)), columns=list('B'))
df = pd.concat([a,b], axis=1)
df
A B quantile_AB (expected ans)
0 6 75 NaN
1 1 63 NaN
2 4 58 6.0
3 4 59 4.0
40th percentile of (6,1,4,75,63,58) should give me 6.0.
The formula below gives me the rolling quantile for the 2 columns separately.
df.rolling(3)[['A','B']].quantile(0.4)
Use stack with rolling quantile
cols = len(df.columns) # number of columns pooled into each window
df.stack(dropna=False).rolling(window=3*cols).\
quantile(0.4)[cols-1::cols].reset_index(-1, drop=True)
Dataframe
A B
0 6 75
1 1 63
2 4 58
3 4 59
Output:
0 NaN
1 NaN
2 6.0
3 4.0
dtype: float64
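To see why the stacking trick works, here is a minimal sketch on the four sample rows. It assumes row-major flattening with `ravel` as a stand-in for `stack`, which interleaves the columns in the same order (A0, B0, A1, B1, ...); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [6, 1, 4, 4], "B": [75, 63, 58, 59]})
window, q = 3, 0.4
cols = len(df.columns)

# Row-major flattening interleaves the columns: A0, B0, A1, B1, ...
flat = pd.Series(df.to_numpy().ravel())

# A window of window*cols flattened values that ends on the last column
# of a row covers exactly `window` complete rows.
pooled = flat.rolling(window * cols).quantile(q)

# Keep every `cols`-th result, starting at the last column of row 0.
result = pd.Series(pooled.to_numpy()[cols - 1 :: cols], index=df.index)
print(result)
```

Windows that end on any other position mix partial rows, which is why only every `cols`-th value is kept.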
IIUC, use numpy and sliding_window_view:
from numpy.lib.stride_tricks import sliding_window_view
m = df[['A', 'B']].to_numpy()
W = 3
N = m.shape[1]
Q = 0.4
q = np.quantile(np.reshape(sliding_window_view(m, (W, N)), (-1, W*N)), q=Q, axis=1)
# The first complete window ends at row W-1, so the results align with df.index[W-1:]
df['quantile_AB'] = pd.Series(q, index=df.index[W-1:])
Output:
>>> df
A B quantile_AB
0 6 75 NaN
1 1 63 NaN
2 4 58 6.0
3 4 59 4.0
4 8 70 8.0
5 4 66 8.0
6 6 55 8.0
7 3 65 6.0
8 5 50 6.0
9 8 68 8.0
10 7 85 8.0
11 9 74 9.0
12 9 79 9.0
13 2 69 9.0
14 7 69 9.0
15 8 64 8.0
16 8 89 8.0
17 9 82 9.0
18 2 51 9.0
19 6 59 9.0
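The same idea can be wrapped in a small function so that the window size and the index offset stay in sync. This is only a sketch: `rolling_pooled_quantile` is a hypothetical helper name, and it assumes NumPy 1.20 or later for `sliding_window_view`.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_pooled_quantile(frame, window, q):
    """Quantile over all values in each `window`-row block (hypothetical helper)."""
    m = frame.to_numpy()
    n_cols = m.shape[1]
    # Shape (n_rows - window + 1, 1, window, n_cols); flatten each block to one row.
    blocks = sliding_window_view(m, (window, n_cols)).reshape(-1, window * n_cols)
    values = np.quantile(blocks, q, axis=1)
    # The first complete window ends at row position window - 1;
    # reindex pads the earlier rows with NaN.
    return pd.Series(values, index=frame.index[window - 1:]).reindex(frame.index)

df = pd.DataFrame({"A": [6, 1, 4, 4], "B": [75, 63, 58, 59]})
result = rolling_pooled_quantile(df, window=3, q=0.4)
print(result)
```

Offsetting the index by `window - 1` rather than by the number of columns keeps the alignment correct when the two differ.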
Related
I need to get the rolling 2nd largest value of a df.
To get the largest value I do
max = df.sort_index(ascending=True).rolling(10).max()
When I try this, python throws an error
max = df.sort_index(ascending=True).rolling(10).nlargest(2)
AttributeError: 'Rolling' object has no attribute 'nlargest'
Is this a bug? What else can I use that is performant?
I'd do something like this:
df.rolling(10).apply(lambda x: pd.Series(x).nlargest(2).iloc[-1])
Use np.sort in descending order and select second value:
np.random.seed(2019)
df = pd.DataFrame({
'B': np.random.randint(20, size=15)
})
print (df)
B
0 8
1 18
2 5
3 15
4 12
5 10
6 16
7 16
8 7
9 5
10 19
11 12
12 16
13 18
14 5
a = df.rolling(10).apply(lambda x: -np.sort(-x)[1])
#alternative
#a = df.rolling(10).apply(lambda x: np.sort(x)[-2])
print (a)
B
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 16.0
10 18.0
11 16.0
12 16.0
13 18.0
14 18.0
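Since the question asks about performance, one possible variation is to avoid a full sort. The sketch below assumes `np.partition` and `raw=True` are acceptable here; `second_largest` is an illustrative name, and the data is the same 15-value column as above, typed in by hand.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": [8, 18, 5, 15, 12, 10, 16, 16, 7, 5, 19, 12, 16, 18, 5]})

def second_largest(window):
    # np.partition puts the element that would sit at sorted position -2
    # at index -2, without fully sorting the window.
    return np.partition(window, -2)[-2]

# raw=True passes each window as a NumPy array instead of a Series,
# which avoids building a Series for every window.
a = df.rolling(10).apply(second_largest, raw=True)
print(a)
```

Whether this is measurably faster depends on the window size and data, so it is worth timing on real input.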
I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take the previous 2 values, the current value and the next 2 values (a window of 5), sum them, and store the result in a new column. Desired output:
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried,
df['A_sum'] = df['A'].rolling(2).sum()
I also tried shift, but everything I found looks either forward or backward; I'm looking for a combination of both.
Use a rolling window of 5 and pass center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
If you are allowed to use numpy, you can use numpy.convolve to get the desired output:
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
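As a quick cross-check, the convolution and the centered rolling sum can be compared directly. This sketch assumes an odd window width, so the window is symmetric around the current row; `width` is an illustrative name.

```python
import numpy as np
import pandas as pd

A = [1, 2, 3, 4, 5, 6, 7, 8, 2, 4]
width = 5  # odd width keeps the window symmetric

# 'same' mode pads the ends with zeros and returns len(A) values.
via_convolve = np.convolve(A, np.ones(width, dtype=int), "same")

# min_periods=1 lets the truncated windows at both ends produce a sum.
via_rolling = pd.Series(A).rolling(width, center=True, min_periods=1).sum()

print(np.array_equal(via_convolve, via_rolling.to_numpy()))
```

The two agree for sums because a zero-padded neighbour contributes nothing; the same shortcut would not hold for a mean, where the truncated windows have fewer elements.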
You can use shift for this (straightforward if not elegant). Each of the four neighbours needs its own shifted term, including shift(2):
df["A_sum"] = df.A + df.A.shift(-2).fillna(0) + df.A.shift(-1).fillna(0) + df.A.shift(1).fillna(0) + df.A.shift(2).fillna(0)
output:
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
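The shift approach can be generalised to any symmetric window by summing over the offsets, which also makes it harder to forget a term. A minimal sketch, assuming `fill_value=0` is acceptable for the missing neighbours; `half_width` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 7, 8, 2, 4]})
half_width = 2  # two rows before and two rows after the current one

# One shifted copy per offset from -half_width to +half_width (0 is the row itself).
# fill_value=0 replaces the missing neighbours at the edges and keeps integer dtype.
df["A_sum"] = sum(
    df["A"].shift(k, fill_value=0) for k in range(-half_width, half_width + 1)
)
print(df)
```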
I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.
Number the children within each household with groupby().cumcount(), then unstack that counter into columns:
df_out = df.set_index([df.groupby('Id_household').cumcount()+1,
'Id_household',
'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN
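The output above numbers children in order of appearance, while the expected result in the question lists them from youngest to oldest. One way to get that ordering, sketched here under the assumption that sorting by age first is what is wanted (the `child` column is a hypothetical helper column):

```python
import pandas as pd

df = pd.DataFrame({
    "Id_household": [1, 1, 1, 1, 2, 3, 3],
    "Age_Father": [30, 30, 30, 30, 27, 40, 40],
    "Age_child": [2, 4, 4, 1, 4, 14, 18],
})

# Sort first so that cumcount numbers the children from youngest to oldest.
ordered = df.sort_values(["Id_household", "Age_child"])
ordered["child"] = ordered.groupby("Id_household").cumcount() + 1

# Move the per-household child counter into the columns.
df_out = (ordered.set_index(["Id_household", "Age_Father", "child"])["Age_child"]
          .unstack("child"))
df_out.columns = [f"Age_child_{n}" for n in df_out.columns]
df_out = df_out.reset_index("Age_Father")
print(df_out)
```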
I have a dataframe df containing the population p assigned to some buildings b
df
p b
0 150 3
1 345 7
2 177 4
3 267 2
and a dataframe df1 that associates some other buildings b1 to the buildings in df
df1
b1 b
0 17 3
1 9 7
2 13 7
For the buildings that have an association in df1, I want to split the population evenly across the building and its associated buildings. This generates df2, which assigns a population of 150/2=75 to each of buildings 3 and 17, and a population of 345/3=115 to each of buildings 7, 9 and 13.
df2
p b
0 75 3
1 75 17
2 115 7
3 115 9
4 115 13
5 177 4
6 267 2
IIUC, you can merge both dataframes on b, stack() the b and b1 columns into a single b column, and drop the duplicates. Then group on p, count the rows in each group with transform, and divide p by that count:
m=(df.merge(df1,on='b',how='left').set_index('p').stack().reset_index(name='b')
.drop_duplicates().drop(columns='level_1').sort_values('p'))
m.p=m.p/m.groupby('p')['p'].transform('count')
print(m.sort_index())
p b
0 75.0 3.0
1 75.0 17.0
2 115.0 7.0
3 115.0 9.0
5 115.0 13.0
6 177.0 4.0
7 267.0 2.0
Another way uses pd.concat. After concatenating, fill the missing b1 values with b and the missing p values with 0. Then take the group mean of p per b with transform, and use the filled b1 as the building column of the final dataframe:
df2 = pd.concat([df, df1], sort=True).sort_values('b')
df2['b1'] = df2.b1.fillna(df2.b)
df2['p'] = df2.p.fillna(0)
df2.groupby('b').p.transform('mean').to_frame().assign(b=df2.b1).reset_index(drop=True)
Output:
p b
0 267.0 2.0
1 75.0 3.0
2 75.0 17.0
3 177.0 4.0
4 115.0 7.0
5 115.0 9.0
6 115.0 13.0
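A more explicit variant of the same calculation, as a sketch: count how many buildings share each population, divide, then append the associated buildings. The names `share`, `per_building` and `extra` are illustrative, and the row order differs from the outputs above.

```python
import pandas as pd

df = pd.DataFrame({"p": [150, 345, 177, 267], "b": [3, 7, 4, 2]})
df1 = pd.DataFrame({"b1": [17, 9, 13], "b": [3, 7, 7]})

# Buildings sharing each population: the building itself plus its associated ones.
share = df["b"].map(df1["b"].value_counts()).fillna(0) + 1
per_building = df.assign(p=df["p"] / share)

# Associated buildings inherit the divided population of their parent building.
extra = df1.merge(per_building, on="b")[["p", "b1"]].rename(columns={"b1": "b"})
df2 = pd.concat([per_building, extra], ignore_index=True)
print(df2)
```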
dataframe = pd.DataFrame(data={'user': [1,1,1,1,1,2,2,2,2,2], 'usage':
[12,18,76,32,43,45,19,42,9,10]})
dataframe['mean'] = dataframe.groupby('user')['usage'].apply(pd.rolling_mean, 2)
Why is this code not working?
I am getting an error saying that the rolling_mean attribute is not found in pandas.
pd.rolling_mean is no longer available in current pandas versions. Use groupby with rolling instead:
dataframe['mean'] = (dataframe.groupby('user')['usage']
.rolling(2)
.mean()
.reset_index(level=0, drop=True))
print (dataframe)
user usage mean
0 1 12 NaN
1 1 18 15.0
2 1 76 47.0
3 1 32 54.0
4 1 43 37.5
5 2 45 NaN
6 2 19 32.0
7 2 42 30.5
8 2 9 25.5
9 2 10 9.5
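An alternative that avoids the reset_index step is to run the rolling mean inside `transform`, which returns a result already aligned with the original index. A minimal sketch on the same data:

```python
import pandas as pd

dataframe = pd.DataFrame({
    "user": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "usage": [12, 18, 76, 32, 43, 45, 19, 42, 9, 10],
})

# transform applies the function per user and keeps the original row index,
# so the result can be assigned straight back to the frame.
dataframe["mean"] = (dataframe.groupby("user")["usage"]
                     .transform(lambda s: s.rolling(2).mean()))
print(dataframe)
```

The lambda is called once per group, so for many small groups the groupby().rolling() form may be the faster of the two.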