I have a pandas time series y that does not work well with statsmodels functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMAX
mod = sm.tsa.statespace.SARIMAX(y, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my time series is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set, for example, freq='M' or freq='D'. What is the workaround in this case?
I am new to working with time series; any advice on how to keep my index from being ignored during forecasting would help, since this currently prevents any predictions.
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the time series is effectively an aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month). If so, you have two options:
You can disaggregate your data using either equal weighting or, even better, interpolation, to create a continuous and relatively smooth time series. You can then fit your model on the daily time series and generate predictions, which will naturally be daily as well. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is with the code below.
import pandas as pd

# ts is a DataFrame with a 'Date' column and a 'Rain' column (one row per bucket)
ts['Date'] = pd.to_datetime(ts['Date'], format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days   # days until the next observation
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']             # rainfall change until the next observation
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']            # per-day gradient within each bucket
ts.set_index('timesteps', inplace=True)
ts = ts.resample('d').ffill()                                    # upsample to daily rows, forward-filling
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days   # linear interpolation between bucket values
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']           # spread over the days in the bucket
print(ts.head(50))
daily_rain is now the target column and the index i.e. timesteps is the timestamp.
The other option is to approximate the 1-10, 10-20, 20-(EOM) date ranges as roughly 10 days each, so the timesteps can be treated as equal. Of course statsmodels won't accept that directly, so you would need to reset the index to a mock datetime index for which you maintain a mapping back to your original dates. Below is what you would pass to statsmodels as y (a rough sketch of the remapping follows the example output). The freq will be 'D' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
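A rough sketch of that remapping (the start date and the daily frequency here are arbitrary choices for illustration, not something prescribed by statsmodels):

import pandas as pd

# Build an equally spaced mock index and keep a mapping back to the real dates
mock_index = pd.date_range('2019-09-01', periods=len(y), freq='D')
date_mapping = dict(zip(mock_index, y.index))   # mock date -> original date
y_mock = pd.Series(y.values, index=mock_index, name=y.name)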
I would recommend the first option, though, as it is simply more accurate. You can also try out other aggregation levels during model training as well as for your predictions. More control!
The second scenario is that the data represents measurements for the date itself only, not for a range. In that case you technically do not have enough information to construct an accurate time series: your timesteps are not equidistant and you don't know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach you would need to interpolate, but given that the target variable is rainfall, which has a lot of variation, I would strongly discourage this!
As far as I can see, the package relies on the frequency as a premise for everything, since it is a time-series problem.
So you will not be able to use it with data at an irregular frequency. In practice, you will have to make an assumption in your analysis to adapt your data for this use. Some options are:
1) Run 3 separate analyses (1st days, 10th days, 20th days individually) and use a 30-day frequency.
2) As your observations are roughly 10 days apart, you can consider some kind of interpolation and then resample to a daily (1d) frequency; a minimal sketch follows below. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
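Here is that sketch, assuming y is the irregular series from the question (plain pandas interpolation, nothing specific to statsmodels):

# Upsample to a daily grid and fill the gaps by time-based interpolation
y_daily = y.resample('D').interpolate(method='time')
# y_daily now has a regular daily frequency and can be passed to SARIMAX directly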
Either way, I would just like to point out that how you model your problem and your data is key when dealing with time series, and with data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data comes from) that you get a feel for which approach will work better.
I have a dataset that contains 300 rows and 4 columns: Date, Hour, Counts (how many ads were aired on TV during that hour), and Visits (how many visits were made to the website during that hour). Here is an example of the data:
If I want to test the effect of the TV spots on visits to the website, should I treat it as a time series and use regression, for example? And what should the input table look like in that case? I know that I have to split the date into day and month, but how should I treat the Counts column: leave it as it is, given that my y is to be the number of visits?
Thanks
Just to avoid a single-input, single-output regression model, you could use Hour and Counts as inputs and predict Visits.
I don't know what format the hours are in; if they are in 12-hour format, convert them to 24-hour format before feeding them to your model.
If you want to predict the next dates and hours in the time series, regression models or classical time-series models such as ARIMA, ARMA, or exponential smoothing would be useful.
But since you need to predict the effectiveness of the TV spots, I recommend generating features based on Counts with the tsfresh library in Python, to remove the time effect, and then using a machine learning model such as SVR or gradient boosting for the prediction.
In your problem:
from tsfresh import extract_features

extracted_features = extract_features(df,
                                      column_id="Hour",
                                      column_kind=None,
                                      column_value="Counts",
                                      column_sort="time")
So, your target table will be:
Hour Feature_1 Feature_2 ... Visits(Avg)
0 min(Counts) max(Counts) ... mean(Visits)
1 min(Counts) max(Counts) ... mean(Visits)
2 min(Counts) max(Counts) ... mean(Visits)
min() and max() are just example features; tsfresh can extract many other features. See the tsfresh documentation for more information.
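For instance, a hypothetical way to assemble that target table, assuming df contains the Hour and Visits columns from the question:

# Average visits per hour as the target, joined onto the extracted features
target = df.groupby('Hour')['Visits'].mean().rename('Visits_avg')
model_table = extracted_features.join(target)   # extracted_features is indexed by Hour (the column_id)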
I've been reading about time-series decomposition, and have a fairly good idea of how it works on simple examples, but am having trouble extending the concepts.
For example, consider some simple synthetic data I've been playing with.
So there is no actual time associated with this data. It could be sampled every second or every year. Whatever the sampling frequency, the period is roughly 160 time steps, and using this as the period argument yields the expected results:
from statsmodels.tsa.seasonal import STL
import matplotlib.pyplot as plt

# seasonal=13 based on the example in the statsmodels user guide
decomp = STL(synth.value, period=160, seasonal=13).fit()

fig, ax = plt.subplots(3, 1, figsize=(12, 6))
decomp.trend.plot(title='Trend', ax=ax[0])
decomp.seasonal.plot(title='Seasonal', ax=ax[1])
decomp.resid.plot(title='Residual', ax=ax[2])
plt.tight_layout()
plt.show()
But looking at other datasets, it's not really that easy to see the period of the seasonality, so it leads me to a couple of questions:
How do you find the correct arguments for real-world, messy data, particularly the period argument but also the others? Is it just a parameter search that you perform until the decomposition looks sane?
Parameters
endog : array_like
    Data to be decomposed. Must be squeezable to 1-d.
period : {int, None}, optional
    Periodicity of the sequence. If None and endog is a pandas Series or DataFrame, attempts to determine from endog. If endog is an ndarray, period must be provided.
seasonal : int, optional
    Length of the seasonal smoother. Must be an odd integer, and should normally be >= 7 (default).
trend : {int, None}, optional
    Length of the trend smoother. Must be an odd integer. If not provided, uses the smallest odd integer greater than 1.5 * period / (1 - 1.5 / seasonal), following the suggestion in the original implementation.
I had the same question. After tracing some of their codebase, I found the following; it may help:
Statsmodels expects a DataFrame with a DatetimeIndex.
This DatetimeIndex can have a frequency. You can either resample your data with pandas or explicitly set a frequency on your index. Check df.index and look for the freq attribute.
This leads to two situations:
Your index has frequency set
If you have set a frequency in your index, statsmodels will inherit this frequency and automatically use this to determine a period.
It makes use of the freq_to_period method internally, defined here in the tsatools submodule.
To summarise what this does: the period is the expected periodicity of your seasonal component, translated back to a year.
In other words: "how often your seasonal cycle will repeat itself in a year".
For reference, read the note on the freq_to_period method definition:
Annual maps to 1, quarterly maps to 4, monthly to 12, weekly to 52.
This is both done for the method seasonal_decompose here, as well as for STL here.
Your index has no frequency set
It gets a bit more complicated if your data does not have a freq attribute set.
The seasonal_decompose checks whether it can find an inferred_freq attribute of your index set here, STL takes the same approach here.
This inferred_freq is set using the pandas function infer_freq, which is defined in the pandas package here and tries to "infer the most likely frequency given the input index". Pandas automatically gives a DataFrame with a DatetimeIndex an index.inferred_freq attribute by default, if you have at least 3 elements.
TLDR: The period parameter should be set to the number of times you expect the seasonal cycle to recur within a year. You can set it explicitly, or statsmodels will infer it automatically from the freq attribute of your DatetimeIndex. If the freq attribute is None, it will rely on pandas' index.inferred_freq attribute to determine the frequency and then convert that to a preset periodicity.
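As a small illustration of that inference chain (exact frequency aliases and internals can vary across pandas/statsmodels versions):

import pandas as pd
from statsmodels.tsa.tsatools import freq_to_period

idx = pd.date_range('2020-01-31', periods=36, freq='M')   # month-end index
print(idx.freq)             # explicit frequency set on the index (<MonthEnd>)
print(idx.inferred_freq)    # frequency inferred from the dates, e.g. 'M'
print(freq_to_period('M'))  # statsmodels maps monthly to a period of 12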
I'm currently working with CESM Large Ensemble data on the cloud (ala https://medium.com/pangeo/cesm-lens-on-aws-4e2a996397a1) using xarray and Dask and am trying to plot the trends in extreme precipitation in each season over the historical period (Dec-Jan-Feb and Jun-Jul-Aug specifically).
Eg. If one had a daily time-series data split into months like:
1920: J,F,M,A,M,J,J,A,S,O,N,D
1921: J,F,M,A,M,J,J,A,S,O,N,D
...
My aim is to group together the JJA days in each year and then take the maximum value within that group of days for each year. Ditto for DJF, however here you have to be careful because DJF is a year-skipping season; the most natural way to define it is 1921's DJF = 1920 D + 1921 JF.
Using iris this would be simple (though quite inefficient): you could just add auxiliary time coordinates for season and season_year, group by those two coordinates, and take a maximum. This would give you a (year, lat, lon) output where each year contains the maximum of the precipitation field in the chosen season (e.g. the maximum DJF precip in 1921 in each lat, lon pixel).
However in xarray this operation is not as natural because you can't natively groupby multiple coordinates, see https://github.com/pydata/xarray/issues/324 for further info on this. However, in this github issue someone suggests a simple, nested workaround to the problem using xarray's .apply() functionality:
def nested_groupby_apply(dataarray, groupby, apply_fn):
    if len(groupby) == 1:
        return dataarray.groupby(groupby[0]).apply(apply_fn)
    else:
        return dataarray.groupby(groupby[0]).apply(nested_groupby_apply, groupby=groupby[1:], apply_fn=apply_fn)
I'd be quite keen to try and use this workaround myself, but I have two main questions beforehand:
1) I can't seem to work out how to groupby coordinates such that I don't take the maximum of DJF in the same year?
Eg. If one simply applies the function like (for a suitable xr_max() function):
outp = nested_groupby_apply(daily_prect, ['time.season', 'time.year'], xr_max)
outp_djf = outp.sel(season='DJF')
Then you effectively define 1921's DJF as 1921 D + 1921 JF, which isn't actually what you want to look at! This is because the 'time.year' grouping doesn't account for the year-skipping behaviour of seasons like DJF. I'm not sure how to work around this.
2) This nested groupby function is incredibly slow! As such, I was wondering if anyone in the community had found a more efficient solution to this problem, with similar functionality?
Thanks ahead of time for your help, all! Let me know if anything needs clarifying.
EDIT: Since posting this, I've discovered there already is a workaround for this in the specific case of taking DJF/JJA means each year (Take maximum rainfall value for each season over a time period (xarray)), however I'm keeping this question open because the general problem of an efficient workaround for multi-coord grouping is still unsolved.
Could you please help me with this issue? I have searched a lot but cannot solve it. I have a multivariate dataframe of electricity consumption and I am forecasting with a VAR (vector autoregression) time-series model.
I made the predictions, but I need to invert the transformation on the time series (energy_log_diff), as I applied a seasonal log difference to make the series stationary, in order to get back the real energy values:
df['energy_log'] = np.log(df['energy'])
df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(1)
For that, I did first:
df['energy'] = np.exp(df['energy_log_diff'])
This is supposed to give the energy difference between two values lagged by 365 days, but I am not sure about this either.
How can I do this?
The reason we use log differences is that they are additive, so we can take a cumulative sum and then multiply by the last observed value.
last_energy = df['energy'].iloc[-1]
df['energy'] = np.exp(df['energy'].cumsum()) * last_energy
As for seasonality: if you de-seasoned the log differences, simply add (or multiply) the seasonal component back before you do the step above; if you de-seasoned the original series, add it back afterwards.
Short answer - you have to run inverse transformations in the reversed order which in your case means:
Inverse transform of differencing
Inverse transform of log
How to convert differenced forecasts back is described e.g. here (the question has an R flag, but there is no code and the idea is the same for Python). In your post you calculate the exponential, but you have to reverse the differencing first before doing that.
You could try this:
energy_log_diff_rev = []
v_prev = v_0   # v_0 is the original (log-transformed) value before differencing, see the note below
for v in df['energy_log_diff']:
    v_prev += v
    energy_log_diff_rev.append(v_prev)
Or, if you prefer pandas way, you can try this (only for the first order difference):
energy_log_diff_rev = df['energy_log_diff'].expanding(min_periods=0).sum() + v_0
Note the v_0 value, which is the original value (after the log transformation, before differencing); it is described in the link above.
Then, after this step, you can do the exponential (inverse of log):
energy_orig = np.exp(energy_log_diff_rev)
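As a sanity check, here is a toy round trip on made-up numbers (not the OP's data) showing that the cumulative sum plus v_0 followed by the exponential recovers the original series:

import numpy as np
import pandas as pd

energy = pd.Series([100.0, 110.0, 120.0, 115.0, 130.0])
energy_log = np.log(energy)
energy_log_diff = energy_log - energy_log.shift(1)   # first-order difference, first value is NaN

v_0 = energy_log.iloc[0]                             # the value lost by differencing
rev = energy_log_diff.fillna(0).cumsum() + v_0       # undo the differencing
recovered = np.exp(rev)                              # undo the log
print(np.allclose(recovered, energy))                # True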
Notes/Questions:
You mention values lagged by 365 days, but you are shifting the data by 1. Does that mean you have yearly data? Or would you like to do df['energy_log_diff'] = df['energy_log'] - df['energy_log'].shift(365) instead (in the case of daily granularity)?
You want to get the back-transformed time series for your predictions, is that right? Or am I missing something? In that case you would apply the inverse transformations to the predictions, not to the data I used above for explanation.
I have a dataset which has a date variable, and a boolean variable.
The date variable has already been broken down into 'Year' and 'Month', so I have two fields corresponding to the date, and the boolean indicates whether a particular record is late (1) or not (0).
Here is a snapshot of the data:
Date(Index) Date_Year_Key Date_Month_Key Is_Late
2014-01-01 2014 1 1
2014-01-03 2014 1 1
2014-01-03 2014 1 1
2014-01-03 2014 1 1
I want to plot the data over time to see whether any trend or pattern exists in the orders being late, and whether I can predict future orders using time-series modeling.
I have tried plotting the data using an aggregate function.
temp = big_cust_tm_series.groupby(['Date_Year_Key', 'Date_Month_Key'])['Is_Late'].mean()
temp.plot(figsize=(15, 5), title='Late records(Monthwise)', fontsize=14)
I also tried the following code, but it gave me an error:
import statsmodels.api as sm
sm.tsa.seasonal_decompose(temp).plot()
result = sm.tsa.stattools.adfuller(temp)
plt.show()
AttributeError: 'MultiIndex' object has no attribute 'inferred_freq'
I don't see any increasing or decreasing trend in the plot, nor any patterns, so I am not even sure whether this is a suitable example for time-series analysis.
I'll start by noting that I'm more of a general stats person and don't know much about time series modelling.
also that "binomial data" like this isn't very informative, so you need a surprising amount to detect changes. e.g. with 300 rows and your mean of 0.05 we'd expect a standard deviation of sqrt(0.05 * (1-0.05) / 300) = ~0.013, so we'd expect to see values around 0.05 +-2 * 0.013 = (0.024, 0.076).
What you're plotting looks sensible, but I'd suggest "truncating" the date to the month rather than working with (year, month) tuples, as it's easier to turn into nice plots. For example, if you have a datetime64 column called Date in a pandas DataFrame called df, you could do:
df['Month'] = df['Date'].dt.to_period('M').dt.start_time
The .dt accessors are a bit verbose, but they signal the intent to work with datetime-like properties. After doing this, you can make your plot as before:
plt.plot(df.groupby('Month')['Is_Late'].mean())
and you'll get date labels on the x-axis. Note that your plot is basically in line with the maths at the top (i.e. the lowest value is around 0.03 and the largest is around 0.08).
I'm not aware of being able to set a "link function" on any of the statsmodels time-series analysis tools, which means you probably can't do much with those. That said, you might get some mileage out of fitting a "generalised linear model" to your data, for example:
import statsmodels.api as sm
import statsmodels.formula.api as smf

df['month'] = df.Date.dt.month
df['dayofweek'] = df.Date.dt.dayofweek

# binomial GLM with categorical day-of-week and month effects
mod = smf.glm('Is_Late ~ C(dayofweek) + C(month)', df, family=sm.families.Binomial())
res = mod.fit()
res.summary()
This would allow you to fit a "day of the week" effect to the data. Note that such an effect would have shown up in your aggregation/plot above as added noise. Also note that, in general, fitting time-series data with a regression like the one above is not recommended, but given your use of statsmodels and the lack of an obvious way to specify a link function, it serves as a low-tech fallback.
Note that there are many more stats/time-series packages and code available for the programming language R.