Partial fit or incremental learning for autoregressive model - python

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic

Statsmodels does not directly have this functionality. As Kevin S mentioned, though, pmdarima has a wrapper that provides it, specifically the update method. Per its documentation: "Update the model fit with additional observed endog/exog values."
See example below around your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model parameters with the new observations
model.update(data_2)
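If pmdarima is not an option, a plain AR(p) with a constant can also be estimated from both periods at once by least squares: build the lagged regressors separately inside each period, so that no lag reaches across the gap, then stack the rows and solve one regression. The sketch below is only an illustration under that assumption, not a statsmodels feature; lagged_design, fit_pooled_ar and simulate_ar1 are hypothetical helper names, and the simulated series stand in for data_1 and data_2.

```python
import numpy as np

def lagged_design(y, p):
    """Return (X, target) for an AR(p) regression with a constant on one segment."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column k holds y[t-k] for t = p .. n-1; the first column is the constant.
    lags = [y[p - k:n - k] for k in range(1, p + 1)]
    X = np.column_stack([np.ones(n - p)] + lags)
    return X, y[p:]

def fit_pooled_ar(segments, p):
    """Least-squares AR(p) fit pooled over independent segments (hypothetical helper)."""
    parts = [lagged_design(seg, p) for seg in segments]
    X = np.vstack([X_seg for X_seg, _ in parts])
    target = np.concatenate([t_seg for _, t_seg in parts])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef  # [constant, phi_1, ..., phi_p]

def simulate_ar1(phi, n, rng):
    """Simulate a zero-mean AR(1) series with unit-variance noise."""
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + rng.normal()
    return y

rng = np.random.default_rng(0)
seg_1 = simulate_ar1(0.6, 500, rng)
seg_2 = simulate_ar1(0.6, 500, rng)
coef = fit_pooled_ar([seg_1, seg_2], p=1)
print(coef)
```

This gives point estimates only; quantities such as the AIC would have to be computed from the pooled residuals.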

I don't know how to achieve that in AutoReg. I think it can be done somehow, but you would need to evaluate the results manually or find a way to add the data yourself.
In ARIMA and SARIMAX, however, it's already implemented and it's simple.
For incremental learning there are three related methods on the fitted results object, described in the statsmodels documentation. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append. append can optionally refit. I don't know the exact difference between them, though.
Here is my example that is different but similar...
import numpy as np
from statsmodels.tsa.api import ARIMA
data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction) # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction) # [800. 801. 802. 803. 804. 805. 806.]
This replaces all the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.


Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az
d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values
with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)
with model:
    smry = az.summary(samp, var_names = ["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": range(2)})
or this to return the summary of b[0]:
with model:
    smry = az.summary(samp, var_names = ["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)
To use coords for this, you need a version of ArviZ newer than 0.11.2: either a later release or the development version from GitHub (which may still report 0.11.2). Up to and including 0.11.2, the coords argument of summary was not used to subset the data, as it is in all the plotting functions; it was only taken into account when the input was not already InferenceData, in which case it was passed to the converter.
With older versions, you need to use xarray to subset the data before passing it to summary, which means explicitly converting the trace to InferenceData beforehand. In the example above it would look like:
with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)
az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.
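As a rough sketch of both ideas together, the snippet below subsets with xarray and then requests only the statistics through the kind argument of summary. The synthetic Dataset is an assumption standing in for idata.posterior, so no PyMC3 model is needed to try it.

```python
import numpy as np
import xarray as xr
import arviz as az

# Synthetic draws standing in for idata.posterior
# (assumption: 2 chains, 500 draws, 3 coefficients).
rng = np.random.default_rng(0)
posterior = xr.Dataset(
    {"b": (("chain", "draw", "b_dim_0"), rng.normal(size=(2, 500, 3)))},
    coords={"chain": [0, 1], "draw": np.arange(500), "b_dim_0": [0, 1, 2]},
)

# Subset with xarray first, then ask summary for the statistics only (no diagnostics).
subset = posterior[["b"]].sel(b_dim_0=[0, 1])
smry = az.summary(subset, kind="stats")
print(smry)
```

With 170,000 parameters, skipping the diagnostics this way should also cut the cost noticeably, since the effective sample size and R-hat computations are not run.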

Passing Two List to a Python function

I am trying to run the Python package pyabc (Approximate Bayesian Computation) for model selection between two lists of values, i.e. model_1=[2,3,4,5] and model_2=[3,4,2,5]. The main entry point of pyabc is ABCSMC, which is defined as
Definition : ABCSMC(models: Union[List[Model], Model], parameter_priors:
Union[List[Distribution], Distribution, Callable], distance_function: Union[Distance,
Callable]=None, population_size: Union[PopulationStrategy, int]=100, summary_statistics:
Callable[[model_output], dict]=identity, model_prior: RV=None)
I don't know where to define and pass my two lists model_1 and model_2 in the code below. I have tried several times without success, as I am new to Python. I am following an example, and its code is given below.
import os
import tempfile
import scipy.stats as st
import pyabc
# Define a gaussian model
sigma = .5
def model(parameters):
    # sample from a gaussian
    y = st.norm(parameters.x, sigma).rvs()
    # return the sample as dictionary
    return {"y": y}
# We define two models, but they are identical so far
models = [model, model]
# However, our models' priors are not the same.
# Their mean differs.
mu_x_1, mu_x_2 = 0, 1
parameter_priors = [
    pyabc.Distribution(x=pyabc.RV("norm", mu_x_1, sigma)),
    pyabc.Distribution(x=pyabc.RV("norm", mu_x_2, sigma))
]
abc = pyabc.ABCSMC(
    models, parameter_priors,
    pyabc.PercentileDistance(measures_to_use=["y"]))
# y_observed was not defined in the snippet as posted;
# the value 1 is an assumed placeholder for the actual observation
y_observed = 1
db_path = ("sqlite:///" +
           os.path.join(tempfile.gettempdir(), "test.db"))
history = abc.new(db_path, {"y": y_observed})
print("ABC-SMC run ID:", history.id)
# We run the ABC until either criterion is met
history = abc.run(minimum_epsilon=0.2, max_nr_populations=5)
Model selection in pyABC aims to decide which of several candidate models best describes a common set of observed data. In your code above, you would therefore typically put different models in the list, e.g. models = [model1, model2]. I am not sure what you mean by model selection over lists.
For the underlying algorithm and problem that it tries to solve, see also the original paper https://doi.org/10.1093/bioinformatics/btp619.

Get better fit on test data using Auto_Arima

I am using the AirPassengers dataset to forecast a time series. I chose auto_arima to select the model and produce the forecasts. However, the order chosen by auto_arima does not seem to fit the data well, as the chart produced by the code below shows.
What can I do to get a better fit?
My code for those that want to try:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
from pmdarima import auto_arima
df = pd.read_csv("https://raw.githubusercontent.com/AileenNielsen/TimeSeriesAnalysisWithPython/master/data/AirPassengers.csv")
df = df.rename(columns={"#Passengers":"Passengers"})
df.Month = pd.to_datetime(df.Month)
df.set_index('Month',inplace=True)
train,test=df[:-24],df[-24:]
model = auto_arima(train,trace=True,error_action='ignore', suppress_warnings=True)
model.fit(train)
forecast = model.predict(n_periods=24)
forecast = pd.DataFrame(forecast,index = test.index,columns=['Prediction'])
plt.plot(train, label='Train')
plt.plot(test, label='Valid')
plt.plot(forecast, label='Prediction')
plt.show()
from sklearn.metrics import mean_squared_error
print(mean_squared_error(test['Passengers'],forecast['Prediction']))
Thank you for reading. Any advice is appreciated.
This series is not stationary, and no amount of differencing will make it so (notice that the amplitude of the variations keeps increasing). However, transforming the data first by taking logs should do better (experiment shows that it does do better, but not what I would call well). Setting the seasonality with m=12, as I suggest in the comment, and taking logs produces a fit that is essentially perfect.
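Why logs help can be seen on a toy series. The sketch below uses a synthetic monthly series with an exponential trend and multiplicative seasonality (an assumption standing in for the passenger counts) and compares the within-year range before and after the log transform.

```python
import numpy as np

# Synthetic monthly series with exponential trend and multiplicative seasonality
# (an assumption standing in for the passenger counts).
t = np.arange(144)
y = 100 * np.exp(0.01 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12))

def yearly_range(series):
    """Max minus min within each block of 12 months."""
    years = series.reshape(-1, 12)
    return years.max(axis=1) - years.min(axis=1)

raw_range = yearly_range(y)
log_range = yearly_range(np.log(y))
print(raw_range[0], raw_range[-1])   # grows with the level of the series
print(log_range[0], log_range[-1])   # the same in every year
```

On the log scale the seasonal swing no longer depends on the level, which is the kind of series a seasonal ARIMA can describe. In practice this would mean fitting on np.log(train) and applying np.exp to the forecasts.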
The problem was that I did not specify m. In this case I set m to 12, denoting a monthly cycle in which each data row is one month. That's how I understand it.
Feel free to comment, I'm not entirely sure as I am new to using ARIMA.
Code:
model = auto_arima(train,m=12,trace=True,error_action='ignore', suppress_warnings=True)
Just add m=12 to denote that the data is monthly.

Getting transformed X values from OLS model using statsmodels

I am trying to do a linear regression. With the results I want to multiply each x with its own estimated coefficient: xi·βi.
However, I am doing a lot of transformations on xi.
For example:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np
def log_plus_1(x):
    return np.log(x + 1.0)
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
formule = 'Lottery ~ pow(Literacy,2) + log_plus_1(Wealth)'
mod = smf.ols(formula=formule, data=df)
res = mod.fit()
res.params
Now I need pow(Literacy, 2) and log_plus_1(Wealth). Since they go into the model, I was hoping to get them back out of it, instead of transforming the original dataset again.
In R I would use res$model to get it.
The data is stored as attributes of the model, e.g. the design matrix is mod.exog, the dependent or response variable is mod.endog.
(I'm not sure I remember the details of the following correctly: the data that patsy returns after creating the transformed design matrix should, in this case, be a pandas DataFrame, and should be stored in mod.data.orig_exog or something like that.)
res.predict automatically handles the transformation, i.e. patsy uses the formula information to transform the data for the explanatory variables in prediction in the same way as the data was transformed in creating the model.
predict only returns the prediction, not the internally transformed exog used to compute it.
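A small sketch of how this can be used to get the per-term products xi·βi. The synthetic frame and the formula terms I(Literacy ** 2) and np.log1p(Wealth) are assumptions standing in for the Guerry data and the custom transformations in the question.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Small synthetic frame standing in for the Guerry data (assumption for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Literacy": rng.uniform(10, 80, size=50),
    "Wealth": rng.uniform(1, 90, size=50),
})
df["Lottery"] = 0.01 * df["Literacy"] ** 2 + 5 * np.log1p(df["Wealth"]) + rng.normal(size=50)

mod = smf.ols("Lottery ~ I(Literacy ** 2) + np.log1p(Wealth)", data=df)
res = mod.fit()

design = mod.data.orig_exog          # transformed regressors as a DataFrame
contributions = design * res.params  # x_i * beta_i, aligned on column names
print(mod.exog_names)
print(contributions.head())
```

Summing the contributions across columns reproduces res.fittedvalues, which is a quick check that the design matrix and the parameters line up.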

statsmodels - printing summary of ARMA fit throws error

I want to fit an ARMA(p,q) model to simulated data, y, and check the effect of different estimation methods on the results. However, fitting a model to the same object like so
model = tsa.ARMA(y,(1,1))
result_mle = model.fit(trend='c', method='mle', disp=False)
result_css = model.fit(trend='c', method='css', disp=False)
and printing the results
print result_mle.summary()
print result_css.summary()
generates the following error
File "C:\Anaconda\lib\site-packages\statsmodels\tsa\arima_model.py", line 1572, in summary
smry.add_table_params(self, alpha=alpha, use_t=False)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 885, in add_table_params
use_t=use_t)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 475, in summary_params
exog_idx]
IndexError: index 3 is out of bounds for axis 0 with size 3
If, instead, I do this
model1 = tsa.ARMA(y,(1,1))
model2 = tsa.ARMA(y,(1,1))
result_mle = model1.fit(trend='c',method='css-mle',disp=False)
print result_mle.summary()
result_css = model2.fit(trend='c',method='css',disp=False)
print result_css.summary()
no error occurs. Is this the intended behaviour, or a bug that should be fixed?
BTW, I generated the ARMA process as follows
from __future__ import division
import statsmodels.tsa.api as tsa
import numpy as np
# generate arma
a = -0.7
b = -0.7
c = 2
s = 10
y1 = np.random.normal(c/(1-a),s*(1+(a+b)**2/(1-a**2)))
e = np.random.normal(0,s,(100,))
y = [y1]
for t in xrange(e.size-1):
    arma = c + a*y[-1] + e[t+1] + b*e[t]
    y.append(arma)
y = np.array(y)
You could report this as a bug, even though it looks like a consequence of the current design.
Some attributes of the model change when the estimation method is changed, which should in general be avoided. Since both results instances access the same model, the older one is inconsistent with it in this case.
http://www.statsmodels.org/dev/pitfalls.html#repeated-calls-to-fit-with-different-parameters
In general, statsmodels tries to keep all parameters that need to change the model in the model.__init__ and not as arguments in fit, and attach the outcome of fit and results to the Results instance.
However, this is not followed everywhere, especially not in older models that gained new options along the way.
trend is an example that is supposed to go into the ARMA.__init__ because it is now handled together with the exog (which is an ARMAX model), but wasn't in pure ARMA. The estimation method belongs in fit and should not cause problems like these.
Aside: There is a helper function to simulate an ARMA process that uses scipy.signal.lfilter and should be much faster than an iteration loop in Python.
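The helper referred to here is most likely arma_generate_sample from statsmodels.tsa.arima_process (an assumption, since the answer does not name it). A minimal sketch using the parameters from the question:

```python
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

a, b, c, s = -0.7, -0.7, 2, 10

# Lag polynomials include the zero lag, and the AR coefficient enters with the opposite sign:
# y_t = a*y_{t-1} + e_t + b*e_{t-1}  ->  ar = [1, -a], ma = [1, b]
ar = np.array([1, -a])
ma = np.array([1, b])

np.random.seed(0)
# The generated sample has mean zero; adding c / (1 - a) shifts it to the process mean.
y = arma_generate_sample(ar, ma, nsample=100, scale=s, burnin=50) + c / (1 - a)
print(y[:5])
```

The burnin argument discards initial draws so the sample starts close to the stationary distribution, which replaces the hand-written draw of the first observation in the loop version.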
