Add rolling window to columns in each row in pandas - python

I have a time series in a dataframe and would like to add a rolling window of size n to each of the rows.
df['A'].rolling(window=6)
This means each row would have 6 additional columns containing the respective values of the rolling window for that column. How can I achieve this using the rolling function of pandas, automatically naming the columns [t-1, t-2, t-3...t-n]?

df.shift(n) is what you're looking for:
n = 6  # max t-n
for t in range(1, n+1):
    df[f'A-{t}'] = df['A'].shift(t)
Ref. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.shift.html
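If you prefer to build all the lagged columns in one go instead of assigning them one by one, here is a minimal sketch of the same idea (the column name A and n = 6 follow the question):

import pandas as pd

df = pd.DataFrame({'A': range(10)})
n = 6
# build every lagged column at once; the dict keys become the column names
lags = pd.concat({f'A-{t}': df['A'].shift(t) for t in range(1, n + 1)}, axis=1)
df = df.join(lags)
# df now has columns A, A-1, ..., A-6; rows without enough history contain NaN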

Related

Pandas - Understanding how rolling averages work

So I'm trying to calculate rolling averages, based on some column and some groupby columns.
In my case:
rolling column = RATINGS,
groupby_columns = ["DEMOGRAPHIC","ORIGINATOR","START_ROUND_60","WDAY","PLAYBACK_PERIOD"]
one group of my data looks like this:
my code to compute the rolling average is:
df['rolling'] = df.groupby(groupby_columns)['RATINGS'].\
    apply(lambda x: x.shift().rolling(10, min_periods=1).mean())
What I don't understand is what happens when the RATINGS values start to be NaN.
As my window size is 10, I would expect the second number in the test (index 11) to be:
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
But it is instead 171.9444, and the same applies to the next numbers.
What is happening here?
And how should I calculate the next rolling averages the way I want (simply averaging the last 10 ratings, and if a rating is NaN, taking the calculated average of the previous row instead)?
Any help will be appreciated.
np.mean([178,479,72,272,158,37,85.5,159,107,164.55]) = 171.205
Where does the 164.55 come from? The rest of those values are from the "RATINGS" column and the 164.55 is from the "rolling" column. Maybe I am misunderstanding what the rolling function does.
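A likely explanation, consistent with the numbers quoted above: rolling(10, min_periods=1).mean() ignores NaN entries inside the window rather than substituting the previously computed rolling value, so the 164.55 from the 'rolling' column never enters the calculation. A minimal check, assuming the tenth value in that window is the NaN rating:

import numpy as np

# the nine non-NaN RATINGS in the window plus the NaN one
window = [178, 479, 72, 272, 158, 37, 85.5, 159, 107, np.nan]
print(np.nanmean(window))  # 171.944..., matching the observed rolling value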

How to get consecutive averages of the column values based on the condition from another column in the same data frame using pandas

I have large data frame in pandas which has two columns Time and Values. I want to calculate consecutive averages for values in column Values based on the condition which is formed from the column Time.
I want to calculate the average of the first l values in column Values, then the next l values from the same column, and so on till the end of the data frame. The value l is the number of values that go into every average, and it is determined by the time difference in column Time. The starting data frame looks like this:
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
The second part of the question is the same calculation of averages, but when the number l is known in advance. I tried this:
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s = 1
l = 0
while df['Time'][s] - df['Time'][0] <= 2:
    s += 1
    l += 1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p = 0
df1 = pd.DataFrame(columns=['Time', 'Averages'])
for w in range(0, len(df)-1, 2):
    df1['Averages'][p] = df['Values'].iloc[w:w+2].mean()
    p = p + 1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns, Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time, by calculating how many rows fall inside a time difference of 2 seconds. Once I have determined that value, for example 2, I average the first two values from the column Values, then the next 2, and so on till the end of the data frame. At the end, I store these averages in a separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then about groups of consecutive rows.
If you want to group by consecutive rows and get the mean of Time and Value, this does it for you. You really need to show by example what you are actually trying to achieve.
import datetime as dt
import random
import pandas as pd

d = list(pd.date_range(dt.datetime(2020, 7, 1), dt.datetime(2020, 7, 2), freq="15min"))
df = pd.DataFrame({"Time": d,
                   "Value": [round(random.uniform(0, 1), 6) for x in d]})
df
n = 5
df.assign(grp=df.index // n).groupby("grp").agg({"Time": lambda s: s.mean(), "Value": "mean"})
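If the grouping should instead follow the 2-second rule from the question, using the asker's original df (epoch-second Time and a Values column), a sketch along the same lines could look like this, assuming each bin is counted from the first timestamp:

# integer bin number: how many whole 2-second intervals have passed since the first timestamp
bins = ((df['Time'] - df['Time'].iloc[0]) // 2).astype(int)
df1 = (df.groupby(bins)
         .agg(Time=('Time', 'first'), Averages=('Values', 'mean'))
         .reset_index(drop=True))
# df1 has one row per 2-second interval, with columns Time and Averages

This handles a varying number of rows l per interval, since rows are binned by time rather than counted.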

pandas rolling average with a rolling mask / excluding entries

I have a pandas dataframe with a time index like this
import pandas as pd
import numpy as np
idx = pd.date_range(start='2000',end='2001')
df = pd.DataFrame(np.random.normal(size=(len(idx),2)),index=idx)
which looks like this:
0 1
2000-01-01 0.565524 0.355548
2000-01-02 -0.234161 0.888384
I would like to compute a rolling average like
df_avg = df.rolling(60).mean()
but always excluding the entries corresponding to (let's say) 10 days before, +- 2 days. In other words, for each date df_avg should contain the mean (exponential with ewm, or flat) of the previous 60 entries, but excluding the entries from t-48 to t-52. I guess I need a kind of rolling mask but I don't know how. I could also compute two separate averages and obtain the result as a difference, but that looks dirty and I wonder if there is a better way that generalizes to other non-linear computations...
Many thanks!
You can use apply to customize your function:
# select indexes you want to average over
avg_idx = [idx for idx in range(60) if idx not in range(8, 13)]
# do rolling computation, calculating average only on the specified indexes
df_avg = df.rolling(60).apply(lambda x: x.iloc[avg_idx].mean())
The x Series passed to apply will always have 60 entries, so you can specify your positional index based on this, knowing that the first entry (position 0) is t-60.
I am not entirely sure about your exclusion logic, but you can easily modify my solution for your case.
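The difference-of-two-averages idea mentioned in the question also works for a plain mean and avoids the Python-level apply. A rough sketch, assuming the five excluded lags line up with a 5-wide window shifted by 48 rows (adjust the shift to match however you count the lags):

# mean over the 55 included rows = (sum of the last 60) minus (sum of the excluded 5), divided by 55
excluded = df.shift(48).rolling(5).sum()   # covers roughly t-52 .. t-48
df_avg = (df.rolling(60).sum() - excluded) / 55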
Unfortunately, not. From pandas source code:
df.rolling(window, min_periods=None, freq=None, center=False, win_type=None,
           on=None, axis=0, closed=None)

window : int, or offset
    Size of the moving window. This is the number of observations used for
    calculating the statistic. Each window will be a fixed size.
    If its an offset then this will be the time period of each window. Each
    window will be a variable sized based on the observations included in
    the time-period.
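For reference, the offset form mentioned in that docstring does work with the DatetimeIndex from the question, giving a time-based window rather than a fixed row count (this is only a window-size variant, not an exclusion mask):

# 60 calendar days per window; the number of rows in each window can vary
df_avg_time = df.rolling('60D').mean()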

Filling dataframe with lagged values in Python

I am trying to write a loop which fills the elements in a dataframe or matrix with the values of the previous year. The columns represent different years within the 50 year horizon. The rows represent different discrete ages (up to 50 years old). The initial distribution in year 1 (green vector) is given. I would like to successively move the elements through the df or matrix. Hence, element 1,1 depicts the surface of age 1 in year 1. As a consequence, that element moves to 2,2; 3,3 and so on. The last row should move to the first row in the next year (indicated by the blue arrow).
I have tried to iterate through the dataframe, but I think the KeyError has to do with the fact that [index-1] has to be bounded?
import numpy as np
import pandas as pd
years = np.arange(50)
a_vector = np.arange(50)
pop_matrix = pd.DataFrame(0, index=a_vector, columns=years)
#Initial vector (green)
A0 = 5000000
for a, rows in pop_matrix.iterrows():
    pop_matrix[0][a] = A0 / len(pop_matrix)

# Incorrect attempt
for t in years:
    for a, rows in pop_matrix.iterrows():
        if t-1 >= 0 and a-1 >= 0:
            pop_matrix[t][a] = pop_matrix[t-1][a-1]
I think the best way is to use the numpy roll function.
Extract the values of your index and then apply numpy roll each time with a different shift. Example:
for year in years:
    col = pop_matrix.columns.tolist()[year]
    pop_matrix[col] = np.roll(a_vector, shift=year+1)
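Applied to the population matrix from the question, the same idea can also be written against the first (year-1) column once it has been filled. A sketch, assuming the wrap-around from the last age back to the first row is intended, as the blue arrow suggests:

init = pop_matrix[0].to_numpy()        # the initial (green) vector
for t in years[1:]:
    # age a in year t is what sat at age a-t in year 1, wrapping around at the last row
    pop_matrix[t] = np.roll(init, t)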

How to decile python pandas dataframe by column value, and then sum each decile?

Say a dataframe has only one numeric column, ordered descending.
What I want is a new dataframe with 10 rows, where row 1 is the sum of the smallest 10% of values and row 10 is the sum of the largest 10% of values.
I can calculate this in a non-pythonic way, but I guess there must be a more fashionable and pythonic way to achieve this.
Any help?
Thanks!
You can do this with pd.qcut:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randn(100)})
# pd.qcut(df.A, 10) will bin into deciles
# you can group by these deciles and take the sums in one step like so:
df.groupby(pd.qcut(df.A, 10))['A'].sum()
# A
# (-2.662, -1.209] -16.436286
# (-1.209, -0.866] -10.348697
# (-0.866, -0.612] -7.133950
# (-0.612, -0.323] -4.847695
# (-0.323, -0.129] -2.187459
# (-0.129, 0.0699] -0.678615
# (0.0699, 0.368] 2.007176
# (0.368, 0.795] 5.457153
# (0.795, 1.386] 11.551413
# (1.386, 3.664] 20.575449
pandas.qcut documentation
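If you'd rather have the rows labelled 1 through 10 (smallest to largest decile) instead of the interval edges, pd.qcut also accepts a labels argument; a small variant of the same idea:

# label the deciles 1..10 so the result reads as "row 1 = smallest 10%, row 10 = largest 10%"
df.groupby(pd.qcut(df.A, 10, labels=list(range(1, 11))))['A'].sum()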
