Get a better fit on test data using auto_arima - Python

I am using the AirPassengers dataset to predict a time series. For the model, I chose auto_arima to select the forecast orders. However, it seems that the order chosen by auto_arima is unable to fit the data well, as the attached chart shows.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from pmdarima import auto_arima

df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers": "Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month', inplace=True)

# Hold out the last 24 months as a validation set
train, test = df[:-24], df[-24:]

model = auto_arima(train, trace=True, error_action='ignore', suppress_warnings=True)
model.fit(train)

forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast, index=test.index, columns=['Prediction'])

plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()

from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'], forecast['Prediction']))
Thank you for reading. Any advice is appreciated.

This series is not stationary, and no amount of differencing will make it so: notice that the amplitude of the seasonal variations keeps increasing. However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality as well (m=12, as I suggest in the comment) and taking logs produces a fit that is essentially perfect.
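A minimal sketch of that combination, reusing the train/test split from the question (the log transform turns the growing seasonal amplitude into an additive one, and np.exp inverts it afterwards):

import numpy as np
import pandas as pd
from pmdarima import auto_arima

# Fit a seasonal model (m=12) on the log of the series
model = auto_arima(np.log(train['Passengers']), m=12, seasonal=True,
                   trace=True, error_action='ignore', suppress_warnings=True)

# Forecast on the log scale, then invert the transform
log_forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(np.exp(np.asarray(log_forecast)),
                        index=test.index, columns=['Prediction'])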

The problem was that I did not specify m. In this case I set m to 12, denoting a monthly cycle, i.e. that each data row is one month. That's how I understand it, at least (source).
Feel free to comment, I'm not entirely sure as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.
Result: the forecast now tracks the validation data closely.

Related

SHAP plotting waterfall using an index value in dataframe

I am working on a binary classification using the random forest algorithm. Currently, I am trying to explain the model's predictions using SHAP values. I referred to this useful post and tried the below:
from shap import TreeExplainer, Explanation
from shap.plots import waterfall

explainer = TreeExplainer(model)  # model is the fitted RandomForestClassifier, defined earlier
sv = explainer(ord_test_t)
exp = Explanation(
    sv.values[:, :, 1],
    sv.base_values[:, 1],
    data=ord_test_t.values,
    feature_names=ord_test_t.columns,
)
idx = 20
waterfall(exp[idx])
I like the above approach, as it allows me to display the feature values along with the waterfall plot, so I wish to keep it.
However, this doesn't help me get the waterfall plot for a specific row of ord_test_t (the test data).
For example, suppose ord_test_t.index.tolist() returns 3, 5, 8, 9, etc.
Now I want to plot the waterfall plot for ord_test_t.loc[[9]], but when I pass exp[9], it just gets the 9th positional row, not the row whose index label is 9.
When I try exp.iloc[[9]], it throws an error because the Explanation object doesn't have iloc.
Can you help me with this, please?
My suggestion is as follows:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from shap import TreeExplainer, Explanation
from shap.plots import waterfall
import shap

print(shap.__version__)

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

idx = 9
model = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X, y)
explainer = TreeExplainer(model)
sv = explainer(X.loc[[idx]])  # corrected: pass only the row of interest, as a DataFrame
exp = Explanation(
    sv.values[:, :, 1],       # class 1 is the class to explain
    sv.base_values[:, 1],
    data=X.loc[[idx]].values, # corrected: pass the row of interest
    feature_names=X.columns,
)
waterfall(exp[0])  # the Explanation now holds a single row, so slice position 0
0.40.0
Proof:
model.predict_proba(X.loc[[idx]]) # corrected
array([[0.95752656, 0.04247344]])
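Alternatively, if you want to keep explaining all of ord_test_t at once as in your original code, the missing piece is to translate the index label into a position before slicing, since an Explanation slices positionally. A sketch, where ord_test_t and exp are the objects from the question:

pos = ord_test_t.index.get_loc(9)  # position of the row whose index label is 9
waterfall(exp[pos])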

ARCH modelling, DataScaleWarning: y is poorly scaled

I'm currently facing an issue with GARCH modelling in Python: I came across a data-scale warning saying that y is poorly scaled. I would really appreciate an explanation of the warning and perhaps a fix. The GARCH model still runs, but warns that the optimizer failed to converge.
(Screenshots omitted: the warning message, the y-value dataset, and the GARCH fit output.)
The y-values are the residuals of an ARIMA model which I fitted.
Update: after setting rescale=False, the optimizer instead fails with "inequality constraints incompatible".
Minimal reproducible example:
import pandas as pd
import pandas_datareader.data as pdr
import numpy as np
import arch
from statsmodels.tsa.arima.model import ARIMA

# Extract data, create a log-returns column
eurusd = pdr.DataReader('DEXUSEU', 'fred', start='1/1/2010', end='31/12/2019')
eurusd.index = pd.DatetimeIndex(eurusd.index).to_period('D')
eurusd = eurusd.to_timestamp()
eurusd['LR'] = np.log(eurusd['DEXUSEU']) - np.log(eurusd['DEXUSEU'].shift(1))

# ARIMA model on the log returns
arima_model = ARIMA(eurusd.LR.dropna(), order=(1,0,1)).fit()
print(arima_model.summary())

# GARCH model on the ARIMA residuals
am = arch.arch_model(arima_model.resid)
res = am.fit()
print(res.summary())
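Not an authoritative fix, but a common remedy for this warning is to rescale the series so that its typical magnitude lands roughly between 1 and 1000, e.g. by multiplying the residuals by 100 before fitting; a sketch under that assumption:

# Rescale the ARIMA residuals by hand, then disable arch's automatic rescaling
scaled_resid = 100 * arima_model.resid
am = arch.arch_model(scaled_resid, rescale=False)
res = am.fit()
print(res.summary())  # note: variance estimates are now on the 100x scale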

Partial fit or incremental learning for autoregressive model

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not have this functionality directly. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it, specifically the update method. Per their documentation: "Update the model fit with additional observed endog/exog values.".
See example below around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model parameters with the new parameters
model.update(data_2)
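After the update, forecasts simply continue from the end of data_2; for instance (n_periods is pmdarima's forecast horizon):

forecast = model.predict(n_periods=5)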
I don't know how to achieve that with AutoReg, but I think it can be done somehow; you would need to manually add the data and re-evaluate the results.
In ARIMA and SARIMAX, however, it's already implemented, and it's simple.
For incremental learning there are three related functions, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can optionally refit. I don't know the exact difference, though.
Here is my example, which is different but similar:
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)

# Re-use the fitted parameters on completely new data
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)

print(prediction)      # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction)  # [800. 801. 802. 803. 804. 805. 806.]

apply replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.
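For completeness, a sketch of the other two methods on the same fitted_model (my understanding: extend returns results for the new observations only, while append keeps the old data too; both reuse the fitted parameters unless refit=True):

# New observations that continue the applied sample, which ended at 799
more_data = np.array(range(800, 810))
extended = fitted_model.extend(more_data)               # results cover more_data only
appended = fitted_model.append(more_data, refit=False)  # results cover old + new data
print(appended.forecast(3))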

How to use exponential smoothing to smooth a time series in Python?

I am trying to use exponential smoothing to smooth a time series.
Suppose my time series looks like this:
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
I want to get the smoothed version of that time series.
If I do something like this:
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2, optimized=False)
fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
it only outputs the three forecasted values, not the smoothed version of my original time series. Is there a way to get the smoothed version of my original time series?
I am happy to provide more details if needed.
You can get the smoothed values in the fittedvalues attribute of the model, apparently.
import pandas as pd
data = [446.6565, 454.4733, 455.663 , 423.6322, 456.2713, 440.5881, 425.3325, 485.1494, 506.0482, 526.792 , 514.2689, 494.211 ]
index= pd.date_range(start='1996', end='2008', freq='A')
oildata = pd.Series(data, index)
from statsmodels.tsa.api import SimpleExpSmoothing
fit1 = SimpleExpSmoothing(oildata).fit(smoothing_level=0.2,optimized=False)
# fcast1 = fit1.forecast(3).rename(r'$\alpha=0.2$')
import matplotlib.pyplot as plt
plt.plot(oildata)
plt.plot(fit1.fittedvalues)
plt.show()
It yields a plot of the original series with the smoothed fitted values overlaid.
The documentation states:
fittedvalues: ndarray
An array of the fitted values. Fitted by the Exponential Smoothing model.
Note that you can also use the fittedfcast attribute which contains all values + the first forecast, or the fcastvalues attribute which contains the forecast only.
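For example, on the fit above (my reading of the docs; exact shapes may vary by statsmodels version):

print(fit1.fittedvalues[-3:])  # smoothed values over the observed sample
print(fit1.fittedfcast[-3:])   # fitted values with the first forecast appended
print(fit1.fcastvalues)        # the first forecast only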
ExponentialSmoothing is not a tool to smooth time series data; it is a time series forecasting method.
The fit() function returns an instance of the HoltWintersResults class that contains the learned coefficients. The forecast() or predict() function can then be called on the result object to make a forecast.
So by calling predict, the class provides a forecast using the learned coefficients.
To smooth the time series, however, you can use the fittedvalues attribute, as @smarie points out.
That said, I'd go with a more appropriate smoothing tool, such as scipy's savgol_filter:
from scipy.signal import savgol_filter
savgol_filter(oildata, 5, 3)
array([444.87816 , 461.58666 , 444.99296 , 441.70785143,
442.40769143, 438.36852857, 441.50125714, 472.05622571,
512.20891429, 521.74822857, 517.63141429, 493.37037143])
As mentioned in the comments, the Savitzky-Golay filter fits a local polynomial of a given polyorder over a sliding window of window_length points, which results in a smoothing of the time series.
Here's what it would look like with the above set up:
plt.plot(oildata)
plt.plot(pd.Series(savgol_filter(oildata, 5, 3), index=oildata.index))
plt.show()

Python: Random intercept model (have to replicate R code)

I'm trying to replicate code from R that estimates a random intercept model. The R code is:
fit=lmer(resid~-1+(1|groupid),data=df)
I'm using the lmer command of the lme4 package to estimate random intercepts for the variable resid for observations in different groups (defined by groupid). There is no fixed-effects part, hence no variable before (1|groupid). Moreover, I do not want a global constant estimated, so that I get a separate intercept for each group.
I'm not sure how to do a similar estimation in Python. I tried something like:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(25, 4), columns=list('ABCD'))
df['groupid'] = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5]
df['groupid'] = df['groupid'].astype('category')
###Random intercepts models
md = smf.mixedlm('A~B-1',data=df,groups=df['groupid'])
mdf = md.fit()
print(mdf.random_effects)
A is resid from the earlier example, and groupid is the same.
1) I am not sure whether mdf.random_effects are the random intercepts I am looking for.
2) I cannot remove the variable B, which I understand is the fixed-effects part. If I try:
md = smf.mixedlm('A~-1',data=df,groups=df['groupid'])
I get an error saying "Arrays cannot be empty".
I'm just trying to estimate the exact same model as in the R code. Any advice will be appreciated.
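A sketch of the closest formulation I know of, with the caveat that it is not identical to the R model: statsmodels' MixedLM appears to require at least one fixed-effects column (hence the "Arrays cannot be empty" error), so this keeps the global intercept and reads the per-group deviations from random_effects:

# Intercept-only fixed effects; each group gets a random deviation around it
md = smf.mixedlm('A ~ 1', data=df, groups=df['groupid'])
mdf = md.fit()
print(mdf.params['Intercept'])  # the global (fixed) intercept
for group, effect in mdf.random_effects.items():
    # group-specific intercept = global intercept + random deviation
    print(group, mdf.params['Intercept'] + effect.iloc[0])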
