Refit Python statsmodels ARIMA model parameters to new data and predict

I've stored the intercept, AR, and MA coefficients of an ARIMA model fitted with the statsmodels package:
x = df_sku
x_train = x['Weekly_Volume_Sales']
x_train_log = np.log(x_train)
x_train_log[x_train_log == -np.inf] = 0
x_train_mat = x_train_log.as_matrix()
model = ARIMA(x_train_mat, order=(1,1,1))
model_fit = model.fit(disp=0)
res = model_fit.predict(start=1, end=137, exog=None, dynamic=False)
print(res)
params = model_fit.params
But I'm unable to find any statsmodels documentation that explains how to apply the fitted parameters to a new set of data and predict N steps ahead.
Has anyone been able to refit the model and predict out-of-time samples?
I'm trying to accomplish something similar to this in R:
# Refit the old model with testData
new_model <- Arima(as.ts(testData.zoo), model = old_model)

Here is some code you can use:
def ARIMAForecasting(data, best_pdq, start_params, step):
    model = ARIMA(data, order=best_pdq)
    model_fit = model.fit(start_params=start_params)
    prediction = model_fit.forecast(steps=step)[0]
    # This returns only the last step
    return prediction[-1], model_fit.params

# Get the starting parameters on the training data
best_pdq = (3, 1, 3)  # fixed here, but you can search for the best order
model = ARIMA(train_data, best_pdq)
model_fit = model.fit()
start_params = model_fit.params
data = list(train_data)
predictions = list()
step = 1  # number of steps to forecast at each iteration
for t in range(len(test_data)):
    real_value = test_data[t]
    # reuse the latest parameters as starting values for the next fit
    prediction, start_params = ARIMAForecasting(data, best_pdq, start_params, step)
    predictions.append(prediction)
    data.append(real_value)
# Afterwards you can compare test_data with predictions
You can check the details here:
https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_model.ARIMA.fit.html#statsmodels.tsa.arima_model.ARIMA.fit
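The loop above re-estimates the coefficients at every step and uses the stored ones only as starting values. If the goal is instead to keep the stored coefficients fixed and simply run them over new data, the idea can be sketched without any library. This is a minimal illustration for an ARIMA(1,1,1); the function name, the zero-start residual recursion, and the reading of the constant as the mean of the differenced series are all assumptions of this sketch, not how statsmodels does it internally.

```python
def forecast_with_fixed_arima111(series, const, phi, theta, steps):
    """Forecast `steps` values ahead with fixed ARIMA(1,1,1) coefficients.

    Assumed model on the differenced series d_t = y_t - y_{t-1}:
        d_t - const = phi * (d_{t-1} - const) + e_t + theta * e_{t-1}
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # First difference of the new data.
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]

    # Run the fixed coefficients over the new data to recover the last
    # deviation and residual (conditional recursion, starting from zero).
    prev_dev, prev_err = 0.0, 0.0
    for d in diffs:
        dev = d - const
        err = dev - phi * prev_dev - theta * prev_err
        prev_dev, prev_err = dev, err

    # Forecast the differences, then integrate back to levels.
    level = series[-1]
    forecasts = []
    for _ in range(steps):
        next_dev = phi * prev_dev + theta * prev_err
        level = level + const + next_dev
        forecasts.append(level)
        prev_dev, prev_err = next_dev, 0.0  # future shocks have mean zero
    return forecasts


# Hypothetical stored coefficients and new data, for illustration only.
new_data = [10.0, 10.5, 11.2, 11.6, 12.3]
print(forecast_with_fixed_arima111(new_data, const=0.5, phi=0.3, theta=0.2, steps=3))
```

If the model was fitted on log-transformed values, as in the question, the new data would need the same transform before this step and the forecasts would need to be exponentiated afterwards.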

Great question. I found an example of this: https://alkaline-ml.com/pmdarima/develop/auto_examples/arima/example_add_new_samples.html
Briefly:
import pmdarima as pmd
...
### split data as train/test:
train, test = ...
### fit initial model on `train` data:
arima = pmd.auto_arima(train)
...
### update initial fit with `test` data:
arima.update(test)
...
### create forecast using updated fit for N steps:
new_preds = arima.predict(n_periods=10)

Related

Getting the same prediction when using the PyMC3 data container to generate Bayesian regression prediction using new data

I built a Bayesian regression using the PyMC3 package, and I'm trying to generate predictions from new data. I used the data container pm.Data() to train the model with the training data, then passed the new data to pm.set_data() before calling pm.sample_posterior_predictive(). The prediction was what I would expect from the training data, not the new data.
Here's my model:
df_train = df.drop(['Unnamed: 0', 'DATE_AT'], axis=1)
with Model() as model:
    response_mean = []
    x_ = pm.Data('features', df_train) # a data container, can be changed
    t = np.transpose(x_.get_value())
    # intercept
    y = Normal('y', mu=0, sigma=6000)
    response_mean.append(y)
    # channels that can have DECAY and SATURATION effects
    for channel_name in delay_channels:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)
        print(f'Adding Delayed Channels: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            channel_b = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            channel_b = Normal(f'beta_{channel_name}', mu=c, sigma=s)
        alpha = Beta(f'alpha_{channel_name}', alpha=3, beta=3)
        channel_mu = Gamma(f'mu_{channel_name}', alpha=3, beta=1)
        response_mean.append(logistic_function(
            geometric_adstock_tt(xx, alpha), channel_mu) * channel_b)
    # channels that have SATURATION effects only
    for channel_name in non_lin_channels:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)
        print(f'Adding Non-Linear Logistic Channel: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            channel_b = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            channel_b = Normal(f'beta_{channel_name}', mu=c, sigma=s)
        # logistic reach curve
        channel_mu = Gamma(f'mu_{channel_name}', alpha=3, beta=1)
        response_mean.append(logistic_function(xx, channel_mu) * channel_b)
    # continuous external features
    for channel_name in control_vars:
        i = df_train.columns.get_loc(channel_name)
        xx = t[i].astype(float)
        print(f'Adding control: {channel_name}')
        c = coef.loc[coef['features']==channel_name, 'coef'].values[0]
        s = abs(c*0.015)
        if c <= 0:
            control_beta = HalfNormal(f'beta_{channel_name}', sd=s)
        else:
            control_beta = Normal(f'beta_{channel_name}', mu=c, sigma=s)
        channel_contrib = control_beta * xx
        response_mean.append(channel_contrib)
    # categorical control variables
    for var_name in index_vars:
        i = df_train.columns.get_loc(var_name)
        shape = len(np.unique(t[i]))
        x = t[i].astype('int')
        print(f'Adding Index Variable: {var_name}')
        ind_beta = Normal(f'beta_{var_name}', sd=6000, shape=shape)
        channel_contrib = ind_beta[x]
        response_mean.append(channel_contrib)
    # noise
    sigma = Exponential('sigma', 10)
    # define likelihood
    likelihood = Normal(outcome, mu=sum(response_mean), sd=sigma, observed=df[outcome].values)
    trace = pm.sample(tune=3000, cores=4, init='advi')
Here are the betas from the model. Notice that ADWORDS_SEARCH is one of the most important features:
Betas
When I zeroed out the ADWORDS_SEARCH feature, I got a practically identical prediction, which cannot be right:
with model:
    y_pred = sample_posterior_predictive(trace)
mod_channel = 'ADWORDS_SEARCH'
df_mod = df_train.copy(deep=True)
df_mod.iloc[12:-12, df_mod.columns.get_loc(mod_channel)] = 0
with model:
    pm.set_data({'features':df_mod})
    y_pred_mod = pm.sample_posterior_predictive(trace)
Predictions Plot
By zeroing out ADWORDS_SEARCH, I would expect the prediction to be significantly lower than the original one, since ADWORDS_SEARCH is one of the most important features according to the betas.
I started questioning the model, but it seems to perform well:
MAPE = 6.3%
r2 = 0.7
I also tried passing the original training data set to pm.set_data() and got very similar results.
This is the difference between the prediction from the training data and the prediction from the new data:
y1-y2
This is the difference between the prediction from the training data and the prediction from the same training data passed through pm.set_data():
y1-y3
Anyone know what I'm doing wrong?
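One possible cause, offered as a hedged guess rather than a confirmed diagnosis: `x_.get_value()` returns a plain NumPy array, so everything computed from `t` is fixed when the model is built, and a later `pm.set_data()` updates the container without touching those values. The toy below mimics that behaviour in plain Python; `Container` and both builder functions are hypothetical stand-ins, not PyMC3 classes or API.

```python
class Container:
    """Hypothetical stand-in for a mutable data container."""

    def __init__(self, values):
        self._values = list(values)

    def get_value(self):
        # Returns a copy: later updates are not reflected in it.
        return list(self._values)

    def set_value(self, values):
        self._values = list(values)


def build_from_snapshot(container, beta):
    snapshot = container.get_value()  # evaluated once, at build time
    return lambda: [beta * x for x in snapshot]


def build_from_container(container, beta):
    # Reads the container every time a prediction is requested.
    return lambda: [beta * x for x in container.get_value()]


features = Container([1.0, 2.0, 3.0])
frozen = build_from_snapshot(features, beta=2.0)
live = build_from_container(features, beta=2.0)

features.set_value([0.0, 0.0, 0.0])  # "zero out" the feature
print(frozen())  # still computed from the original values
print(live())    # reflects the update
```

If this is what is happening, the thing to try would be building the model from `x_` itself (for example, indexing its columns symbolically) rather than from the array returned by `get_value()`.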

Implementing root mean square log error as evaluation metric for LightGBM implementation

I created the following functions to use as an evaluation metric for hyperparameter tuning:
# function to calculate the RMSLE
def get_msle(true, predicted):
    return np.sqrt(msle(true, predicted))

# custom evaluation metric function for LightGBM
def custom_eval(preds, dtrain):
    labels = dtrain.get_label().astype(np.int)
    preds = preds.clip(min=0)
    return [('rmsle', get_msle(labels, preds))]
I created the following function to train and validate the hyperparameter:
def get_n_estimators(evaluation_set, min_r, max_r):
    results = []
    for n_est in (range(min_r, max_r, 20)):
        x = {}
        SCORE_TRAIN = []
        SCORE_VALID = []
        for train, valid in (evaluation_set):
            # separate the independent and target variables from the train and validation sets
            train_data_x = train.drop(columns= ['WEEK_END_DATE', 'UNITS'])
            train_data_y = train['UNITS']
            valid_data_x = valid.drop(columns= ['WEEK_END_DATE', 'UNITS'])
            valid_data_y = valid['UNITS']
            # evaluation sets
            # we will evaluate our model on both train and validation data
            e_set = [(train_data_x, train_data_y), (valid_data_x, valid_data_y)]
            # define the LGBMRegressor model
            model = lgb.LGBMRegressor(n_estimators = n_est,
                                      learning_rate = 0.01,
                                      n_jobs=4,
                                      random_state=0,
                                      objective='regression')
            # fit the model
            model.fit(train_data_x, train_data_y, eval_metric= custom_eval ,eval_set= e_set, verbose=False)
            # store the RMSLE on the train and validation sets in separate lists
            # so that we can calculate the mean of the results at the end
            SCORE_TRAIN.append(model.evals_result_['validation_0']['rmsle'][-1])
            SCORE_VALID.append(model.evals_result_['validation_1']['rmsle'][-1])
        # calculate the mean rmsle on train and valid
        mean_score_train = np.mean(SCORE_TRAIN)
        mean_score_valid = np.mean(SCORE_VALID)
        print('With N_ESTIMATORS:\t'+ str(n_est) + '\tMEAN RMSLE TRAIN:\t' + str(mean_score_train) + "\tMEAN RMSLE VALID: "+str(mean_score_valid))
        x['n_estimators'] = n_est
        x['mean_rmsle_train'] = mean_score_train
        x['mean_rmsle_valid'] = mean_score_valid
        results.append(x)
    return pd.DataFrame.from_dict(results)
However, when I call the function I get an error. This is the call:
n_estimators_result = get_n_estimators(evaluation_set,min_r = 20, max_r = 901)
This is the error I am getting:
'numpy.ndarray' object has no attribute 'get_label'
Can you please help me resolve this error? I have been stuck on this for two days now.
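A likely cause, assuming the scikit-learn wrapper is used as in the code above: `LGBMRegressor.fit` calls a custom `eval_metric` with `(y_true, y_pred)` arrays and expects a `(name, value, is_higher_better)` tuple, whereas the `(preds, dtrain)` signature belongs to `lgb.train`. A minimal sketch of the metric in that form, using only NumPy (the function name `rmsle_eval` is made up here):

```python
import numpy as np


def rmsle_eval(y_true, y_pred):
    """Custom metric in the (y_true, y_pred) form used by the sklearn API."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions so the logarithm is defined.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    # (metric name, value, is_higher_better)
    return "rmsle", float(rmsle), False


name, value, higher_is_better = rmsle_eval([3, 5, 2], [3, 5, 2])
print(name, value, higher_is_better)
```

It would then be passed as `eval_metric=rmsle_eval` in the `model.fit(...)` call.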

Forecasting using exogenous variables in ARIMAX in python

I am trying to forecast a variable called yield spread ("yieldsp") using several macroeconomic variables. "yieldsp" is a column in a dataframe called "stat2" with a datetime index. Initially, I forecast "yieldsp" with an ARIMA model using the following code:
# fit the model on the train set and generate prediction for each element on the test set.
# perform a rolling forecast : re-create the ARIMA forecast when each new observation is received.
# forecast(): performs a one-step forecast from the model
# history - list created to track all the observations seeded with the training set
# => after each iteration, all new observations are appended to the list "history",
yieldsp = stat2["yieldsp"]
X = yieldsp.values
size = int(len(X) * 0.95)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(1,0,4))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
It worked and generated predicted and expected values. Part of the results is shown below:
predicted=0.996081, expected=0.960000
predicted=0.959644, expected=0.940000
predicted=0.937272, expected=0.930000
predicted=0.932651, expected=0.970000
predicted=0.976372, expected=0.960000
predicted=0.961283, expected=0.940000
But now I want to use multiple variables to forecast yieldsp. These variables in "stat2" are:
yieldsp = stat2[['ffr', 'house_st_change','rwage', 'epop_diff2','ipi_change_diff2', 'sahm_diff2']]
              ffr  house_st_change     rwage     epop_diff2  ipi_change_diff2  sahm_diff2  yieldsp
Date
1982-03-31  14.68       -28.713629  0.081837  -4.000000e-01         -3.614082    0.227545     0.19
1982-04-30  14.94       -32.573529  0.081789   2.000000e-01          0.838893   -0.061298     0.72
1982-05-31  14.45       -10.087719  0.081752   2.000000e-01         -0.765399   -0.062888     1.74
1982-06-30  14.15       -13.684211  0.080928  -2.000000e-01          0.421589   -0.039439     1.08
1982-07-31  12.59        12.007685  0.081026  -1.421085e-14         -0.141606   -0.032772     3.11
So, I attempted the following:
yieldsp = stat2[['ffr', 'house_st_change','rwage', 'epop_diff2','ipi_change_diff2', 'sahm_diff2', 'yieldsp']]
X = yieldsp.values
size = int(len(X) * 0.8)
train, test = X[0:size], X[size:len(X)]
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(1,0,1))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))
But an error occurred:
ValueError: could not broadcast input array from shape (7) into shape (1)
I am unsure how to fix that. I think that to forecast "yieldsp" we would also need the forecasted values of the exogenous variables. I also think we need to modify the following code:
history = [x for x in train]
predictions = list()
# walk-forward validation
for t in range(len(test)):
    model = ARIMA(history, order=(1,0,4))
I would appreciate any kind of help.
(Reference: https://machinelearningmastery.com/make-sample-forecasts-arima-python/)
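The broadcast error is consistent with `history` holding 7-value rows where ARIMA expects a single series: the target and the exogenous columns need to be kept apart, and each forecast needs the exogenous values for the period being forecast. The bookkeeping can be sketched with a swapped-in, much simpler model, an AR(1) regression with exogenous inputs fitted by ordinary least squares in NumPy. This is not ARIMA and not statsmodels; the synthetic data, the coefficients, and the function name `arx_one_step` are assumptions for illustration only.

```python
import numpy as np


def arx_one_step(y_hist, exog_hist, exog_next):
    """One-step forecast from y_t = c + a*y_{t-1} + b'x_t, fitted by least squares."""
    y_hist = np.asarray(y_hist, dtype=float)
    exog_hist = np.asarray(exog_hist, dtype=float)
    # Design matrix: intercept, lagged target, same-period exogenous values.
    design = np.column_stack([np.ones(len(y_hist) - 1), y_hist[:-1], exog_hist[1:]])
    coef, *_ = np.linalg.lstsq(design, y_hist[1:], rcond=None)
    next_row = np.concatenate([[1.0, y_hist[-1]], np.asarray(exog_next, dtype=float)])
    return float(next_row @ coef)


# Synthetic data: y plays the role of "yieldsp", exog the other columns.
rng = np.random.default_rng(0)
n = 120
exog = rng.normal(size=(n, 2))
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 2.0 * exog[t, 0] - 1.0 * exog[t, 1] + 0.01 * rng.normal()

size = int(n * 0.8)
y_history, x_history = list(y[:size]), list(exog[:size])
predictions = []
# Walk-forward validation: target and exogenous histories grow together.
for t in range(size, n):
    predictions.append(arx_one_step(y_history, x_history, exog[t]))
    y_history.append(y[t])
    x_history.append(exog[t])

max_error = float(np.max(np.abs(np.array(predictions) - y[size:])))
print(f"largest one-step error: {max_error:.4f}")
```

With statsmodels, the same split would mean passing the target as `endog` and the remaining columns as `exog`, and supplying the next period's exogenous values when forecasting. In a real forecast those future values are not known and would have to be forecast or assumed, as the question suspects.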

How exactly is RMSE computed in tensorflow?

The RMSE computed by TensorFlow does not match the RMSE I computed manually. The relevant code is pasted below:
# Train a linear regression model.
tf.logging.set_verbosity(tf.logging.INFO)
OUTDIR = 'sample_model_metadata'
import shutil
shutil.rmtree(OUTDIR, ignore_errors=True)
model = tf.estimator.LinearRegressor(feature_columns=make_feature_cols(), model_dir=OUTDIR)
model.train(make_train_input_fn(train_data, num_epochs=1))
#Make predictions on the validation data set.
predictions_vals = np.zeros(len(validation_data))
predictions = model.predict(input_fn = make_train_input_fn(validation_data, 1))
i = 0
for items in predictions:
    predictions_vals[i] = items['predictions'][0]
    i += 1
evaluated_rmse = np.sqrt(mean_squared_error(predictions_vals, validation_data['Y']))
print(evaluated_rmse)
def print_rmse(model, df):
    metrics = model.evaluate(input_fn = make_train_input_fn(df, 1))
    print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))
print_rmse(model, validation_data)
i = 0
for items in predictions:
    predictions_vals[i] = items['predictions'][0]
You are saving all the predictions into the same location of the NumPy array predictions_vals, i.e. at i = 0. You are missing i += 1 here, unless you copied the code incorrectly.
The problem was with my input function, which had shuffling turned on by default. This jumbled the validation data and resulted in an erroneous validation score.
def make_train_input_fn(df, num_epochs=1, shuffle=True):
    return tf.estimator.inputs.pandas_input_fn(
        x=df,
        y=df['Y'],
        batch_size=128,
        num_epochs=num_epochs,
        shuffle=shuffle,
        queue_capacity=2000)
I have made sure that shuffling is turned off while doing validation and this has fixed the issue.
predictions = model.predict(input_fn = make_train_input_fn(validation_data, 1, False))
Thanks
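The effect described above can be reproduced with plain NumPy: RMSE depends on predictions and labels being paired row by row, so shuffling one side inflates it. The data below is synthetic and only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
labels = rng.normal(loc=10.0, scale=3.0, size=1000)
# Pretend predictions: the labels plus a little noise.
predictions = labels + rng.normal(scale=0.5, size=labels.size)


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


aligned = rmse(labels, predictions)
# Shuffling the predictions breaks the row-by-row pairing with the labels.
shuffled = rmse(labels, rng.permutation(predictions))

print(f"aligned RMSE:  {aligned:.3f}")
print(f"shuffled RMSE: {shuffled:.3f}")
```

The same mismatch arises when predictions come from a shuffled input function but are compared against labels in their original order.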

lightGBM predicts same value

I have a problem with lgb. When I write
lgb.train(.......)
it finishes in less than a millisecond (for a dataset of shape (10 000, 25)),
and when I call predict, all the output values are the same.
train = pd.read_csv('data/train.csv', dtype = dtypes)
test = pd.read_csv('data/test.csv')
test.head()
X = train.iloc[:10000, 3:-1].values
y = train.iloc[:10000, -1].values
sc = StandardScaler()
X = sc.fit_transform(X)
#pca = PCA(0.95)
#X = pca.fit_transform(X)
d_train = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
params['sub_feature'] = 0.5
params['num_leaves'] = 10
params['min_data'] = 50
params['max_depth'] = 10
num_round = 10
clf = lgb.train(params, d_train, num_round, verbose_eval=1000)
X_test = sc.transform(test.iloc[:100,3:].values)
pred = clf.predict(X_test, num_iteration = clf.best_iteration)
When I print pred, all the values are 0.49.
It's my first time using the lightgbm module. Is there an error in my code, or should I look for mismatches in the dataset?
Your num_round is too small: the model has only just started to learn when training stops. Also, make verbose_eval smaller so that you can follow the results during training. I suggest trying lgb.train as below:
clf = lgb.train(params, d_train, num_boost_round=5000, verbose_eval=10, early_stopping_rounds = 3500)
Always use early_stopping_rounds, since training should stop when there is no evident learning or the model starts to overfit. Note that early stopping needs at least one validation set, passed through valid_sets.
Do not hesitate to ask more. Have fun.
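A rough way to see why 10 rounds at a learning rate of 0.003 barely move the predictions: in an idealized squared-error boosting setup where every tree fits the current residual perfectly, each round removes only a fraction `learning_rate` of what is left. This is a simplification (the question uses a binary log-loss objective, and real trees are not perfect fits), so the numbers are an order-of-magnitude illustration only; the function name is made up for this sketch.

```python
def fraction_learned(learning_rate, num_rounds):
    """Share of the initial residual removed after `num_rounds` idealized rounds."""
    return 1.0 - (1.0 - learning_rate) ** num_rounds


for rounds in (10, 1000, 5000):
    share = fraction_learned(0.003, rounds)
    print(f"{rounds:>5} rounds: {share:.1%} of the initial residual removed")
```

Under these assumptions, 10 rounds remove only about 3% of the initial residual, which fits with every prediction staying close to the same starting value.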
