I have found plenty of information on moving averages when the data is sampled at regular intervals (e.g. 1 min, 5 min, etc.). However, I need a solution for a time series dataset with irregular time intervals.
The dataset contains two columns, Timestamp and Price. Timestamp goes down to the millisecond, and there is no set interval between rows. I need to take my dataframe and add three moving average columns:
1 min
5 min
10 min
I do not want to resample the data; I want the end result to have the same number of rows, but with the three columns filled as applicable (i.e., NaN until the 1/5/10 minute interval has elapsed for each column, respectively).
I feel like I am getting close, but cannot figure out how to pass a time-based moving average window to this function:
import pandas as pd
import numpy as np
# Load IBM data from CSV
df = pd.read_csv("C:/Documents/Python Scripts/MA.csv",
                 names=['Timestamp', 'Price'])
# Create three moving average signals
df['Timestamp'] = pd.to_datetime(df['Timestamp'], errors='coerce')
df.set_index('Timestamp', inplace=True)
def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    return smas
MA_1M = movingaverage(df, 1)
MA_5M = movingaverage(df, 5)
MA_10M = movingaverage(df, 10)
print(MA_1M)
Example Data:
Timestamp Price
2018-10-08 04:00:00.013 152.59
2018-10-08 04:00:00.223 156.34
2018-10-08 04:01:00.000 154.73
2018-10-08 04:05:00.127 155.34
2018-10-08 04:10:00.000 153.73
Expected Output:
Timestamp Price MA_1M MA_5M MA_10M
2018-10-08 04:00:00.013 152.59 N/A N/A N/A
2018-10-08 04:00:00.223 156.34 N/A N/A N/A
2018-10-08 04:01:00.000 154.73 154.55 N/A N/A
2018-10-08 04:05:00.127 155.34 155.34 155.47 N/A
2018-10-08 04:10:00.000 153.73 153.73 154.54 154.55
At each row, each MA column takes that row's timestamp, looks back 1, 5, or 10 minutes, and calculates the average of the prices in that window. What makes this difficult is that rows can be generated at any millisecond. In my code above I am simply trying to get a moving average to work with a time-based window; I am assuming that as long as the row counts match I can use that logic to add a column to my df.
The following works, except for the NaNs - I don't know how attached you are to those (this assumes Timestamp is still a column rather than the index):
foo = df.apply(lambda x: df[(df['Timestamp'] <= x['Timestamp']) & (df['Timestamp'] > x['Timestamp'] - pd.Timedelta('5 min'))]['Price'].mean(), axis=1)
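If performance matters, a time-based rolling window is another option. This is only a sketch, not the asker's code; it assumes df still has the sorted DatetimeIndex created by set_index('Timestamp') above:
import pandas as pd

df = df.sort_index()
# Offset windows look back in time from each row, so irregular spacing is fine.
# Note: with an offset window, min_periods defaults to 1, so the earliest rows
# get a partial-window value instead of NaN; mask them afterwards if you need
# the NaN behaviour from the expected output.
df['MA_1M'] = df['Price'].rolling('1min').mean()
df['MA_5M'] = df['Price'].rolling('5min').mean()
df['MA_10M'] = df['Price'].rolling('10min').mean()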
I'm currently working with a dataframe which is routinely grouped into a MultiIndex of three or more levels, with fiscal Quarter always at the top level. Necessarily, a few calculated fields are added to the frame as year/year percent change within each unique index, easily obtained with a groupby up to but not including Quarter, and pd.pct_change().
Unfortunately, this only returns accurate values if a value exists for each possible Quarter. If I have a point for 2021Q1 and my next is 2021Q4, I need to pad in a row with zeroes for 2021Q2 and 2021Q3 so that the year/year at 2021Q4 does not return 2021Q4/2021Q1. My problem is that I often have at least six and up to fifty unique index values at each level of the MultiIndex, and to pad it correctly I need as many rows as there are unique combinations, which quickly becomes a combinatorial explosion that makes the code unusable.
My question: Is it possible to take a Quarter/Quarter value respecting the MultiIndex without padding out every missing quarter on the index?
Reproducible example:
import datetime as dt
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
                                  ['Chihuahua', 'Samoyed', 'Shorthair'],
                                  [dt.datetime(2021,4,1), dt.datetime(2021,7,1),
                                   dt.datetime(2021,10,1), dt.datetime(2022,1,1)]],
                                 names=['species', 'breed', 'cyq'])
data=pd.DataFrame(index=idx)
data.loc[:,'paid']=np.random.randint(100,200,24)
correct_ex=data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
correct_ex.loc[('Canine','Samoyed',dt.datetime(2021,7,1)),'paid']=0
incorrect_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
correct_ex.loc[:,'paid_change']=correct_ex.groupby(['species','breed'])['paid'].pct_change()
correct_ex=correct_ex.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
incorrect_ex.loc[:,'paid_change']=incorrect_ex.groupby(['species','breed'])['paid'].pct_change()
Correct Results:
Incorrect Results:
The correct_ex frame above contains the values that I would want to see if Samoyeds had no data for 7/1/2021, but the only way to get it is to keep a row with paid value 0 for that date. The incorrect_ex frame above is what I get if I attempt pct_change without the added row.
Thanks for the help!
You can calculate the percentage changes in a groupby apply and mask rows where the difference between the cyq dates (= last level of the index) is more than 93 days with np.inf:
import numpy as np
import pandas as pd
import datetime as dt
idx = pd.MultiIndex.from_product([['Canine', 'Feline'],
['Chihuahua','Samoyed','Shorthair'],
[dt.datetime(2021,4,1),dt.datetime(2021,7,1),dt.datetime(2021,10,1),dt.datetime(2022,1,1)]],
names=['species','breed','cyq'])
np.random.seed(0)
data = pd.DataFrame({'paid': np.random.randint(100,200,24)}, index=idx)
data = data.drop([('Canine','Samoyed',dt.datetime(2021,7,1))])
data = data.drop([('Canine','Shorthair'),('Feline','Chihuahua'),('Feline','Samoyed')])
data['paid_change'] = data.groupby(['species','breed']).paid.apply(
    lambda x: x.pct_change().mask(
        x.index.get_level_values(-1).to_series().diff().gt(pd.Timedelta(days=93)),
        np.inf
    )
)
Result:
paid paid_change
species breed cyq
Canine Chihuahua 2021-04-01 144 NaN
2021-07-01 147 0.020833
2021-10-01 164 0.115646
2022-01-01 167 0.018293
Samoyed 2021-04-01 167 NaN
2021-10-01 183 inf
2022-01-01 121 -0.338798
Feline Shorthair 2021-04-01 181 NaN
2021-07-01 137 -0.243094
2021-10-01 125 -0.087591
2022-01-01 177 0.416000
I have a dataset with a weekly index, and a list of dates for which I need interpolated data. For example, I have the following df with weekly aggregation:
data value
1/01/2021 10
7/01/2021 10
14/01/2021 10
28/01/2021 10
and a list of dates that do not coincide with the df indexed dates, for example:
list_dates = ['12/01/2021', '13/01/2021', ...]
I need to get what the interpolated values would be for every date in list_dates, but within a given window (for example: using only 4 values from the df to calculate the interpolation, split between before and after, i.e. the 2 dates before the list date and the 2 dates after it).
To get the interpolated value for the list date 12/01/2021, I would need to use:
1/1/2021
7/1/2021
14/1/2021
28/1/2021
The output would then be:
data value
1/01/2021 10
7/01/2021 10
12/01/2021 10
13/01/2021 10
14/01/2021 10
28/01/2021 10
I have successfully coded an example of this, but it fails when there are multiple consecutive NaNs (for example 12/01 and 13/01). I also can't concat each interpolated value before running the next one in the list, as that would use an already-interpolated date to calculate the new interpolated date (for example, using 12/01 to calculate 13/01).
Any advice on how to do this?
Use interpolate to get the expected outcome, but first you have to prepare your dataframe as shown below.
I slightly modified your input data to demonstrate time-based interpolation with a DatetimeIndex (method='time'):
import pandas as pd

# Input data
df = pd.DataFrame({'data': ['1/01/2021', '7/01/2021', '14/01/2021', '28/01/2021'],
                   'value': [10, 10, 17, 10]})
list_dates = ['12/01/2021', '13/01/2021']
# Conversion of dates
df['data'] = pd.to_datetime(df['data'], format='%d/%m/%Y')
new_dates = pd.to_datetime(list_dates, format='%d/%m/%Y')
# Set datetime column as index and append new dates
df = df.set_index('data')
df = df.reindex(df.index.append(new_dates)).sort_index()
# Interpolate with method='time'
df['value'] = df['value'].interpolate(method='time')
Output:
>>> df
value
2021-01-01 10.0
2021-01-07 10.0
2021-01-12 15.0 # <- time interpolation
2021-01-13 16.0 # <- time interpolation
2021-01-14 17.0 # <- changed from 10 to 17
2021-01-28 10.0
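A usage note on the consecutive-NaN worry from the question: method='time' interpolates between the surrounding original observations, weighted by time distance, so 12/01 and 13/01 are each computed from 7/01 and 14/01 rather than from each other. If only the requested dates are needed, they can be selected back out afterwards; a small sketch using the new_dates index defined above:
# Keep only the rows for the dates that were asked for
interpolated_only = df.loc[new_dates]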
Given a pandas dataframe in the following format:
toy = pd.DataFrame({
'id': [1,2,3,
1,2,3,
1,2,3],
'date': ['2015-05-13', '2015-05-13', '2015-05-13',
'2016-02-12', '2016-02-12', '2016-02-12',
'2018-07-23', '2018-07-23', '2018-07-23'],
'my_metric': [395, 634, 165,
144, 305, 293,
23, 395, 242]
})
# Make sure 'date' has datetime format
toy.date = pd.to_datetime(toy.date)
The my_metric column contains some (random) metric I wish to compute a time-dependent moving average of, conditional on the column id and within some specified time interval that I specify myself. I will refer to this time interval as the "lookback time"; it could be 5 minutes or 2 years. To determine which observations are included in the lookback calculation, we use the date column (which could be the index if you prefer).
To my frustration, I have discovered that such a procedure is not easily performed using pandas built-ins, since I need to perform the calculation conditionally on id while at the same time only including observations within the lookback time (checked using the date column). Hence, the output dataframe should consist of one row for each id-date combination, with the my_metric column now being the average of all observations contained within the lookback time (e.g. 2 years, including today's date).
For clarity, I have included a figure with the desired output format (apologies for the oversized figure) when using a 2-year lookback time:
I have a solution, but it does not make use of specific pandas built-in functions and is likely sub-optimal (a combination of a list comprehension and a single for-loop). The solution I am looking for should not use a for-loop and would thus be more scalable/efficient/fast.
Thank you!
Calculating lookback time: (Current_year - 2 years)
from dateutil.relativedelta import relativedelta
from dateutil import parser
import datetime
In [1691]: dt = '2018-01-01'
In [1695]: dt = parser.parse(dt)
In [1696]: lookback_time = dt - relativedelta(years=2)
Now, filter the dataframe on the lookback time and calculate the rolling average:
In [1722]: toy['new_metric'] = ((toy.my_metric + toy[toy.date > lookback_time].groupby('id')['my_metric'].shift(1))/2).fillna(toy.my_metric)
In [1674]: toy.sort_values('id')
Out[1674]:
date id my_metric new_metric
0 2015-05-13 1 395 395.0
3 2016-02-12 1 144 144.0
6 2018-07-23 1 23 83.5
1 2015-05-13 2 634 634.0
4 2016-02-12 2 305 305.0
7 2018-07-23 2 395 350.0
2 2015-05-13 3 165 165.0
5 2016-02-12 3 293 293.0
8 2018-07-23 3 242 267.5
So, after some tinkering I found an answer that generalizes adequately. I used a slightly different 'toy' dataframe (slightly more relevant to my case). For completeness' sake, here is the data:
Consider now the following code:
# Define a custom function which groups by time (using the index)
def rolling_average(x, dt):
    xt = x.sort_index().groupby(lambda x: x.time()).rolling(window=dt).mean()
    xt.index = xt.index.droplevel(0)
    return xt
dt='730D' # rolling average window: 730 days = 2 years
# Group by the 'id' column
g = toy.groupby('id')
# Apply the custom function
df = g.apply(rolling_average, dt=dt)
# Massage the data to appropriate format
df.index = df.index.droplevel(0)
df = df.reset_index().drop_duplicates(keep='last', subset=['id', 'date'])
The result is as expected:
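As an alternative sketch (not the approach above, and assuming the toy frame from the question with date already converted to datetime), newer pandas versions can express the same per-id, time-windowed mean with groupby plus a time-based rolling window:
# Sketch: 730-day (2-year) backward-looking mean of my_metric within each id
out = (toy.set_index('date')
          .sort_index()
          .groupby('id')['my_metric']
          .rolling('730D')
          .mean()
          .reset_index())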
I have daily data, and also monthly numbers. I would like to normalize the daily data by the monthly number - so for example the first 31 days of 2017 are all divided by the number corresponding to January 2017 from another data set.
import pandas as pd
import datetime as dt
N=100
start=dt.datetime(2017,1,1)
df_daily=pd.DataFrame({"a":range(N)}, index=pd.date_range(start, start+dt.timedelta(N-1)))
df_monthly=pd.Series([1, 2, 3], index=pd.PeriodIndex(["2017-1", "2017-2", "2017-3"], freq="M"))
df_daily["a"] / df_monthly # ???
I was hoping the time series data would align in a one-to-many fashion and do the required operation, but instead I get a lot of NaN.
How would you do this one-to-many data alignment correctly in Pandas?
I might also want to concat the data, in which case I expect the monthly data to duplicate values within one month.
You can extract the information with to_period('M') and then use map.
df_daily["month"] = df_daily.index.to_period('M')
df_daily['a'] / df_daily["month"].map(df_monthly)
Without creating the month column, you can use
df_daily['a'] / df_daily.index.to_period('M').to_series(index=df_daily.index).map(df_monthly)
You can create a temporary key from the index's month, then merge both the dataframe on the key i.e
df_monthly = df_monthly.to_frame().assign(key=df_monthly.index.month)
df_daily = df_daily.assign(key=df_daily.index.month)
df_new = df_daily.merge(df_monthly, how='left').set_index(df_daily.index).drop('key', axis=1)
a 0
2017-01-01 0 1.0
2017-01-02 1 1.0
2017-01-03 2 1.0
2017-01-04 3 1.0
2017-01-05 4 1.0
For division you can then simply do :
df_new['b'] = df_new['a'] / df_new[0]
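Regarding the "concat" idea from the question (duplicating the monthly value within each month), a small sketch along the same lines: the to_period/map trick can broadcast the monthly numbers onto the daily index as their own column:
# Broadcast each month's value to every day of that month
df_daily['monthly'] = df_daily.index.to_period('M').map(df_monthly)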
I'm trying to set up a 5-day (adjustable) running mean for ratings I made for various dates with a Python Pandas DataFrame.
I can easily get the average mean by day using the following code
import pandas as pd
import datetime as dt
RTC = pd.read_csv(...loads file, has 'Date' and 'Rating' columns...)

daterange = RTC['Date'].max() - RTC['Date'].min()

days_means = []
for day_offset in range(daterange.days + 1):
    filldate = (RTC['Date'].min() + dt.timedelta(days=day_offset)).strftime('%Y-%m-%d')
    days_means.append(RTC[RTC['Date'] == filldate]['Rating'].mean())
I'm thinking that the most natural way to extend this would be to make filldate a list (or a series?) and then have a new mask like
RTC['Date'] in filldate
But if I do this I get an error that states
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
I'd guess I'd want to put an any-statement somewhere in this, but I cannot get it working.
Does anyone have advice on how to make this work properly?
Thanks!
EDIT:
For reference, here's what my data would look like
Date Rating OtherColumns...
1-1-2014 5 ...
1-2-2014 6
1-2-2014 7
1-3-2014 8
1-3-2014 2
1-4-2014 3
1-6-2014 6
...
So the 5-day mean for 1-3-2014 would be (5+6+7+8+2+3)/6. Note that there are two entries each for 1-2-2014 and 1-3-2014, and nothing for 1-5-2014.
Updating answer based on the new information:
#set up new example frame
rng = pd.DatetimeIndex(pd.to_datetime(['1/1/2014','1/2/2014','1/2/2014', '1/3/2014', '1/3/2014','1/4/2014','1/6/2014']))
df = pd.DataFrame({'rating':[5,6,7,8,2,3,6],'date':rng})
#set date as datetime index
df.set_index('date',inplace=True)
Calculating the 5-day centered mean. Because the dataframe can have missing days or days with more than one observation, the data needs to be resampled to a daily frequency, with blank days filled with 0:
df.resample('1D').sum().fillna(0).rolling(5, min_periods=1, center=True).mean()
This returns:
2014-01-01 9.333333
2014-01-02 7.750000
2014-01-03 6.200000
2014-01-04 6.400000
2014-01-05 4.750000
2014-01-06 3.000000
Note that the 5-day moving average for 2014-01-03 is 31/5, not 31/6.
Adding a solution that gives the average using the number of observations in a 5 day centered window rather than a five day rolling average.
# 'date' needs to be an ordinary column again for this approach
df = df.reset_index()

# create timedeltas for the window
forward = pd.Timedelta('2 days')
back = pd.Timedelta('-2 days')

def f(x):
    five_day_obs = df.rating[(df.date >= x + back) & (df.date < x + forward)]
    return five_day_obs.sum().astype(float) / five_day_obs.count()
This returns:
df.date.apply(f)
0 6.000000
1 5.600000
2 5.600000
3 5.166667
4 5.166667
5 5.200000
6 4.500000
Name: date, dtype: float64
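For reference, on pandas >= 1.3 (where center=True is supported for datetime-like windows) a rough equivalent can be written with an offset rolling window; the window edges are not exactly the same as in f() above, so treat it as a sketch rather than a drop-in replacement:
# Approximate per-observation 5-day centered mean using an offset window
alt = (df.set_index('date')
         .sort_index()['rating']
         .rolling('5D', center=True)
         .mean())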