Hey, I have a question about the pandas rolling function.
I am currently using it to get the mean of the last 10 days of my time series data.
Example df:
column
2020-12-04 14
2020-12-05 15
2020-12-06 16
2020-12-07 17
2020-12-08 18
2020-12-09 19
2020-12-13 20
2020-12-14 11
2020-12-16 12
2020-12-17 13
Usage:
df['column'].rolling('10D').mean()
But the function calculates the rolling mean over 10 calendar days: if the current row's date is 2020-12-17, it only looks back as far as 2020-12-07.
However, I would like the rolling mean over the last 10 dates that actually appear in the dataframe, i.e. back to 2020-12-04.
How can I achieve it?
Edit: I can also have a datetime index at 15-minute intervals, so window=10 does not help in that case, though it works here.
As said in the comments by @cs95, if you want to consider only the rows that are in the dataframe, you can ignore the fact that your data is a time series and specify a window sized by a number of rows instead of a number of days. In essence:
df['column'].rolling(window=10).mean()
Just one little detail to remember: you have missing dates in your dataframe. You should fill those in, otherwise it will not be a 10-day window but a 10-dates rolling window, which would be pretty meaningless if dates are randomly missing.
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
which gives you the dataframe:
Date column
0 2020-12-04 14.0
1 2020-12-05 15.0
2 2020-12-06 16.0
3 2020-12-07 17.0
4 2020-12-08 18.0
5 2020-12-09 19.0
6 2020-12-10 0.0
7 2020-12-11 0.0
8 2020-12-12 0.0
9 2020-12-13 20.0
10 2020-12-14 11.0
11 2020-12-15 0.0
12 2020-12-16 12.0
13 2020-12-17 13.0
Then applying:
df1['Mean']=df1['column'].rolling(window=10).mean()
returns
Date column Mean
0 2020-12-04 14.0 NaN
1 2020-12-05 15.0 NaN
2 2020-12-06 16.0 NaN
3 2020-12-07 17.0 NaN
4 2020-12-08 18.0 NaN
5 2020-12-09 19.0 NaN
6 2020-12-10 0.0 NaN
7 2020-12-11 0.0 NaN
8 2020-12-12 0.0 NaN
9 2020-12-13 20.0 11.9
10 2020-12-14 11.0 11.6
11 2020-12-15 0.0 10.1
12 2020-12-16 12.0 9.7
13 2020-12-17 13.0 9.3
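Putting the pieces together, here is a minimal end-to-end sketch of this answer, rebuilding the sample data with `Date` as a regular column (which is what the reindexing snippet assumes):

```python
import pandas as pd

# Rebuild the sample data from the question, with Date as a column
df1 = pd.DataFrame({
    'Date': pd.to_datetime([
        '2020-12-04', '2020-12-05', '2020-12-06', '2020-12-07', '2020-12-08',
        '2020-12-09', '2020-12-13', '2020-12-14', '2020-12-16', '2020-12-17']),
    'column': [14, 15, 16, 17, 18, 19, 20, 11, 12, 13],
})

# Fill the missing calendar days with 0, then take a 10-row rolling mean
r = pd.date_range(start=df1.Date.min(), end=df1.Date.max())
df1 = df1.set_index('Date').reindex(r).fillna(0).rename_axis('Date').reset_index()
df1['Mean'] = df1['column'].rolling(window=10).mean()
```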
I have a large dataset containing a date column that starts in 2019. I want to generate, in a separate column, the number of weeks contained in those dates.
Here is what the date column looks like:
import pandas as pd
data = {'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04', '2019-11-28',
'2019-12-02', '2020-01-24', '2020-01-29', '2020-02-05',
'2020-02-12', '2020-02-14', '2020-02-24', '2020-03-11',
'2020-03-16', '2020-03-17', '2020-03-18', '2021-09-14',
'2021-09-30', '2021-10-07', '2021-10-08', '2021-10-12',
'2021-10-14', '2021-10-15', '2021-10-19', '2021-10-21',
'2021-10-26', '2021-10-28', '2021-10-29', '2021-11-02',
'2021-11-15', '2021-11-16', '2021-12-01', '2021-12-07',
'2021-12-09', '2021-12-10', '2021-12-14', '2021-12-15',
'2022-01-13', '2022-01-14', '2022-01-21', '2022-01-24',
'2022-01-25', '2022-01-27', '2022-01-31', '2022-02-01',
'2022-02-10', '2022-02-11', '2022-02-16', '2022-02-24']}
df = pd.DataFrame(data)
Starting from the first day this data was collected, I want to count 7 days using the date column and make a week out of them. For example, if the first week contains the first 7 dates, I create a column entry and call it week one. I want to repeat this process up to the last week the data was collected.
It may also be a good idea to sort the dates in order, from the first date to the current one.
I have tried this, but it does not generate the weeks in order and actually produces repeated week numbers:
pd.to_datetime(df['date'], errors='coerce').dt.week
My intention is: starting from the first date the data was collected, count 7 days and store that as week one, then continue incrementing until the last week, say week number 66.
Here is the expected column of weeks created from the date column
import pandas as pd
week_df = {'weeks': ['1', '2', "3", "5", '6']}
df_weeks = pd.DataFrame(week_df)
IIUC use:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].sub(df['date'].iat[0]).dt.days // 7 + 1
print(df.head(10))
date week
0 2019-09-10 1.0
1 NaT NaN
2 2019-10-07 4.0
3 2019-11-04 8.0
4 2019-11-28 12.0
5 2019-12-02 12.0
6 2020-01-24 20.0
7 2020-01-29 21.0
8 2020-02-05 22.0
9 2020-02-12 23.0
You have more than 66 weeks here, so either you want the real week count since the beginning or you want a dummy week rank. See below for both solutions:
# convert to week period
s = pd.to_datetime(df['date']).dt.to_period('W')
# get real week number
df['week'] = s.sub(s.iloc[0]).dropna().apply(lambda x: x.n).add(1)
# get dummy week rank
df['week2'] = s.rank(method='dense')
output:
date week week2
0 2019-09-10 1.0 1.0
1 NaN NaN NaN
2 2019-10-07 5.0 2.0
3 2019-11-04 9.0 3.0
4 2019-11-28 12.0 4.0
5 2019-12-02 13.0 5.0
6 2020-01-24 20.0 6.0
7 2020-01-29 21.0 7.0
8 2020-02-05 22.0 8.0
9 2020-02-12 23.0 9.0
10 2020-02-14 23.0 9.0
11 2020-02-24 25.0 10.0
12 2020-03-11 27.0 11.0
13 2020-03-16 28.0 12.0
14 2020-03-17 28.0 12.0
15 2020-03-18 28.0 12.0
16 2021-09-14 106.0 13.0
17 2021-09-30 108.0 14.0
18 2021-10-07 109.0 15.0
19 2021-10-08 109.0 15.0
...
42 2022-01-27 125.0 26.0
43 2022-01-31 126.0 27.0
44 2022-02-01 126.0 27.0
45 2022-02-10 127.0 28.0
46 2022-02-11 127.0 28.0
47 2022-02-16 128.0 29.0
48 2022-02-24 129.0 30.0
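As a compact sketch of the same arithmetic on a hand-picked subset of the dates (an assumed sample; the numbers follow the `days // 7` formula from the first answer):

```python
import pandas as pd

# Assumed sample: a few dates from the question, including the 'NaN' string
df = pd.DataFrame({'date': ['2019-09-10', 'NaN', '2019-10-07', '2019-11-04']})
dates = pd.to_datetime(df['date'], errors='coerce')

# Real week count since the first date (1-based)
week = dates.sub(dates.iloc[0]).dt.days // 7 + 1

# Dummy week rank: consecutive numbering of the calendar weeks present
week2 = dates.dt.to_period('W').rank(method='dense')
```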
I am trying to fill all missing values until the end of the dataframe, but am unable to do so. In the example below, I take the average of the last three values. My code only fills until 2017-01-10, whereas I want to fill until 2017-01-14. For 1/14, I want to use the values from 1/11, 1/12 and 1/13. Please help.
import pandas as pd
df = pd.DataFrame([
{"ds":"2017-01-01","y":3},
{"ds":"2017-01-02","y":4},
{"ds":"2017-01-03","y":6},
{"ds":"2017-01-04","y":2},
{"ds":"2017-01-05","y":7},
{"ds":"2017-01-06","y":9},
{"ds":"2017-01-07","y":8},
{"ds":"2017-01-08","y":2},
{"ds":"2017-01-09"},
{"ds":"2017-01-10"},
{"ds":"2017-01-11"},
{"ds":"2017-01-12"},
{"ds":"2017-01-13"},
{"ds":"2017-01-14"}
])
df["y"].fillna(df["y"].rolling(3,min_periods=1).mean(),axis=0,inplace=True)
Result:
ds y
0 2017-01-01 3.0
1 2017-01-02 4.0
2 2017-01-03 6.0
3 2017-01-04 2.0
4 2017-01-05 7.0
5 2017-01-06 9.0
6 2017-01-07 8.0
7 2017-01-08 2.0
8 2017-01-09 5.0
9 2017-01-10 2.0
10 2017-01-11 NaN
11 2017-01-12 NaN
12 2017-01-13 NaN
13 2017-01-14 NaN
Desired output:
You can iterate over the values in y and, whenever a NaN value is encountered, take the mean of the 3 earlier values and use .at[] to set it as the new value:
import numpy as np

for index, value in df['y'].items():
    if np.isnan(value):
        # df.at[index, 'y'] writes directly, avoiding chained assignment
        df.at[index, 'y'] = df['y'].iloc[index - 3: index].mean()
Resulting dataframe for the missing values:
7 2017-01-08 2.000000
8 2017-01-09 6.333333
9 2017-01-10 5.444444
10 2017-01-11 4.592593
11 2017-01-12 5.456790
12 2017-01-13 5.164609
13 2017-01-14 5.071331
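A self-contained version of that loop (the positional iloc slicing assumes the default 0-based RangeIndex):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ds': pd.date_range('2017-01-01', '2017-01-14').strftime('%Y-%m-%d'),
    'y': [3, 4, 6, 2, 7, 9, 8, 2] + [np.nan] * 6,
})

# Each NaN becomes the mean of the three values just before it, so later
# fills reuse earlier fills (a recursive forward fill)
for index, value in df['y'].items():
    if np.isnan(value):
        df.at[index, 'y'] = df['y'].iloc[index - 3:index].mean()
```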
I'm looking to take the most recent value in a rolling window and divide it by the mean of all numbers in that window.
What I tried:
df.a.rolling(window=7).mean()/df.a[-1]
This doesn't work because df.a[-1] is always the most recent of the entire dataset. I need the last value of the window.
I've done a ton of searching today. I may be searching the wrong terms, or not understanding the results, because I have not gotten anything useful.
Any pointers would be appreciated.
Aggregating (e.g. with mean()) over a rolling window returns a pandas Series with the same index as the original column. You can simply aggregate the rolling window and then divide one by the other element-wise.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(30), columns=['A'])
df
# returns:
A
0 0
1 1
2 2
...
27 27
28 28
29 29
You can use a rolling mean to get a series with the same index.
df.A.rolling(window=7).mean()
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 3.0
7 4.0
...
26 23.0
27 24.0
28 25.0
29 26.0
Because the result is indexed the same way, you can simply divide by df.A to get your desired results.
df.A.rolling(window=7).mean() / df.A
# returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0.500000
7 0.571429
8 0.625000
9 0.666667
10 0.700000
11 0.727273
12 0.750000
13 0.769231
14 0.785714
15 0.800000
16 0.812500
17 0.823529
18 0.833333
19 0.842105
20 0.850000
21 0.857143
22 0.863636
23 0.869565
24 0.875000
25 0.880000
26 0.884615
27 0.888889
28 0.892857
29 0.896552
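Note the question asked for the most recent value divided by the window mean, which is the inverse of the ratio above; since both series share the same index, you can simply flip it:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30), columns=['A'])

# Most recent value in each 7-row window divided by that window's mean
ratio = df.A / df.A.rolling(window=7).mean()
```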
I have a dataframe that includes two columns like the following:
date value
0 2017-05-01 1
1 2017-05-08 4
2 2017-05-15 9
Each row shows the Monday of a week, and I have a value only for that specific day. I want to carry this value across the whole week, up to the next Monday, and get the following output:
date value
0 2017-05-01 1
1 2017-05-02 1
2 2017-05-03 1
3 2017-05-04 1
4 2017-05-05 1
5 2017-05-06 1
6 2017-05-07 1
7 2017-05-08 4
8 2017-05-09 4
9 2017-05-10 4
10 2017-05-11 4
11 2017-05-12 4
12 2017-05-13 4
13 2017-05-14 4
14 2017-05-15 9
15 2017-05-16 9
16 2017-05-17 9
17 2017-05-18 9
18 2017-05-19 9
19 2017-05-20 9
20 2017-05-21 9
The question in this link shows how to select the range in a DataFrame, but I don't know how to fill the value column as I explained.
Here is a solution using pandas reindex and ffill:
# Make sure dates is treated as datetime
df['date'] = pd.to_datetime(df['date'], format = "%Y-%m-%d")
from pandas.tseries.offsets import DateOffset
# Create target dates: all days in the weeks in the original dataframe
new_index = pd.date_range(start=df['date'].iloc[0],
end=df['date'].iloc[-1] + DateOffset(6),
freq='D')
# Temporarily set dates as index, conform to target dates and forward fill data
# Finally reset the index as in the original df
out = df.set_index('date')\
.reindex(new_index).ffill()\
.reset_index(drop=False)\
.rename(columns = {'index' : 'date'})
Which gives the expected result:
date value
0 2017-05-01 1.0
1 2017-05-02 1.0
2 2017-05-03 1.0
3 2017-05-04 1.0
4 2017-05-05 1.0
5 2017-05-06 1.0
6 2017-05-07 1.0
7 2017-05-08 4.0
8 2017-05-09 4.0
9 2017-05-10 4.0
10 2017-05-11 4.0
11 2017-05-12 4.0
12 2017-05-13 4.0
13 2017-05-14 4.0
14 2017-05-15 9.0
15 2017-05-16 9.0
16 2017-05-17 9.0
17 2017-05-18 9.0
18 2017-05-19 9.0
19 2017-05-20 9.0
20 2017-05-21 9.0
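The same pipeline can be written as one self-contained block; this sketch uses rename_axis instead of the rename-after-reset, which is purely a style choice:

```python
import pandas as pd
from pandas.tseries.offsets import DateOffset

df = pd.DataFrame({'date': pd.to_datetime(['2017-05-01', '2017-05-08', '2017-05-15']),
                   'value': [1, 4, 9]})

# Daily index covering each week, including the 6 days after the last Monday
new_index = pd.date_range(start=df['date'].iloc[0],
                          end=df['date'].iloc[-1] + DateOffset(days=6),
                          freq='D')

# Conform to the daily index and forward fill each Monday's value
out = (df.set_index('date')
         .reindex(new_index)
         .ffill()
         .rename_axis('date')
         .reset_index())
```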
With a DataFrame like the following:
timestamp value
0 2012-01-01 3.0
1 2012-01-05 3.0
2 2012-01-06 6.0
3 2012-01-09 3.0
4 2012-01-31 1.0
5 2012-02-09 3.0
6 2012-02-11 1.0
7 2012-02-13 3.0
8 2012-02-15 2.0
9 2012-02-18 5.0
What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 5 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 10 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
The important part of the problem is not necessarily the use of time deltas. Any solution that matches a particular row with the previous row of identical value and computes something out of those two rows (here, a difference) will be valid.
Note: not interested in apply or loop-based approaches.
A simple, clean and elegant groupby will do the trick (with timestamp already converted to datetime):
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
Gives:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
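A quick self-contained check of that groupby diff, selecting the timestamp column explicitly (it must already be datetime for the diffs to come out as timedeltas); this uses the first few rows of the sample:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2012-01-01', '2012-01-05', '2012-01-06',
                                 '2012-01-09', '2012-01-31', '2012-02-09']),
    'value': [3.0, 3.0, 6.0, 3.0, 1.0, 3.0],
})

# Within each group of identical values, diff the timestamps
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
```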
Here is a solution using pandas groupby:
out = df.groupby(df['value'])\
.apply(lambda x: pd.to_datetime(x['timestamp'], format = "%Y-%m-%d").diff())\
.reset_index(level = 0, drop = False)\
.reindex(df.index)\
.rename(columns = {'timestamp' : 'time_since_last_identical'})
out = pd.concat([df['timestamp'], out], axis = 1)
That gives the following output:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
It does not exactly match your desired output, but I think that is a matter of convention (e.g. whether or not to include the current day). Happy to refine if you provide more details.