Prediction intervals for ARMA.predict - python

The summary of an ARMA prediction for a time series (print(arma_mod.summary())) shows some numbers about the confidence interval. Is it possible to use these numbers as prediction intervals in the plot that shows the predicted values?
ax = indexed_df.loc[:].plot(figsize=(12, 8))  # .loc replaces the deprecated .ix
ax = predict_price.plot(ax=ax, style='rx', label='Dynamic Prediction');
ax.legend();
I guess the code:
from statsmodels.sandbox.regression.predstd import wls_prediction_std
prstd, iv_l, iv_u = wls_prediction_std(results)
found here: Confidence intervals for model prediction
...does not apply here as it is made for OLS rather then for ARMA forecasting. I also checked github but did not find any new stuff which might relate to time series prediction.
(Making forecasts requires forecasting intervals i guess, especially when it comes to an out-of sample forecast.)
Help appreciated.

I suppose that, for out-of-sample ARMA prediction, you can use ARMA.forecast from statsmodels.tsa.
It returns three arrays: the predicted values, their standard errors, and the confidence intervals for the prediction.
Example with ARMA(1,1), time series y and prediction 1 step ahead:
import statsmodels.api as sm
arma_res = sm.tsa.ARMA(y, order=(1,1)).fit()
preds, stderr, ci = arma_res.forecast(1)
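If you want to draw these as a band around the forecast on your plot, here is a minimal sketch continuing the example above (forecast_index is just an illustrative integer index, not part of the original code):
import numpy as np
import matplotlib.pyplot as plt

steps = 10
preds, stderr, ci = arma_res.forecast(steps)  # ci has shape (steps, 2): lower, upper

# illustrative integer index for the forecast horizon; use your own date index instead
forecast_index = np.arange(len(y), len(y) + steps)

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.arange(len(y)), y, label='observed')
ax.plot(forecast_index, preds, 'rx', label='forecast')
ax.fill_between(forecast_index, ci[:, 0], ci[:, 1], color='grey', alpha=0.3, label='95% interval')
ax.legend()
plt.show()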

Related

Making point predictions using Cox proportional hazard

I am using the pysurvival library to model with the Cox proportional hazards model (CPH). Instead of getting the survival curves, I am interested in getting point predictions. In the library, the function predict_survival returns an array-like representing the prediction of the survival function, which I assume I can use to get the expected values, but I just can't find the right way.
Below I've attached a dummy example.
# Imports assumed from pysurvival (simulation model and Cox PH model)
from pysurvival.models.simulations import SimulationModel
from pysurvival.models.semi_parametric import CoxPHModel

# Initializing the simulation model
sim = SimulationModel(survival_distribution='log-logistic',
                      risk_type='linear',
                      censored_parameter=10.1,
                      alpha=0.1, beta=3.2)

# Generating N random samples
N = 200
dataset = sim.generate_data(num_samples=N, num_features=4)

# Defining the features
features = sim.features

# Creating the X, T and E inputs
X, T, E = dataset[features], dataset['time'].values, dataset['event'].values

# Building the model
coxph = CoxPHModel()
coxph.fit(X, T, E, lr=0.5, l2_reg=1e-2, init_method='zeros')
When applying the function:
coxph.predict_survival(x=X)
It returns an array of arrays with the shape (200, 87) - why does it give exactly 87 values for every observation?
As far as I understand, I should be able to get the expected value by taking the integral under the survival curve.
To do this I need to calculate the area under the curve, which I think can be done using trapz from the numpy library, but I need to know how the spacing between the points is determined.
As mentioned, we can use the function predict_survival to get the estimated survival probabilities. Furthermore, by calling coxph.times we obtain the time corresponding to every estimated survival probability, so we can, for example, plot the individual survival curve for every observation and calculate the area under it. Using the auc function from the sklearn.metrics library, the following function gives point predictions for the training and test sets, given a CPH model, the X_test data and the X_train data:
## Get point predictions
from sklearn.metrics import auc

def point_pred(model, X_test, X_train):
    T_pred = []
    T_pred_train = []

    # Get survival curves
    cph_pred = model.predict_survival(X_test)
    cph_pred_train = model.predict_survival(X_train)

    # Get the times of the survival predictions
    time = model.times

    # Test set
    for i in range(len(cph_pred)):
        T_pred.append(auc(time, cph_pred[i]))

    # Train set
    for i in range(len(cph_pred_train)):
        T_pred_train.append(auc(time, cph_pred_train[i]))

    return T_pred, T_pred_train
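A minimal usage sketch (the train/test split here is hypothetical, since the question only defines X):
# hypothetical split of the simulated data into train and test halves
X_train, X_test = X.iloc[:150], X.iloc[150:]

T_pred_test, T_pred_train = point_pred(coxph, X_test, X_train)
print(T_pred_test[:5])  # expected event times (area under each survival curve)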

Linear regression with outliers for Machine Learning

Python (jupyter notebook to be exact), using numpy and sklearn only
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(16)
x = np.arange(100)
yp = 3*x + 3 + 2*(np.random.poisson(3*x + 3, 100) - (3*x + 3))

np.random.seed(12)
# Choose how many outliers
out = np.random.choice(100, 15)
yp_wo = np.copy(yp)
np.random.seed(12)  # set again
yp_wo[out] = yp_wo[out] + 5*np.random.rand(15)*yp[out]

# With outliers
plt.scatter(x, yp_wo)
# Without outliers
plt.scatter(x, yp)
For the data above (wo means "with outliers"), I need to find:
The best coefficients for two more losses: the MAE and the MAPE (Median Absolute Percentage Error)
Plot the best fit line for the MSE loss, the MAE loss, and the MAPE loss.
Apply Ridge Regression to the same data, and use cross validation to choose the optimal parameter alpha (you can use values of alpha = 10^-5, 10^-4, 10^-3, ... 10^3). Which value gives you the lowest MSE?
What confuses me is having to plot the best-fit line for two or more losses.
I can follow the code from class and try to get the values, but I don't know what's meant by coefficients.
Any help / guidance?
This is for a homework problem I am trying to figure out (no, I am not asking for the solutions).
Please excuse any formatting errors; I am very new to Stack Overflow.
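As a point of reference only (not the homework solution): for a straight-line model y = w1*x + w0, the "coefficients" are simply the slope w1 and the intercept w0. Under the MSE loss they can be read off with np.polyfit; the MAE and MAPE fits need a different optimizer:
# best-fit coefficients under the MSE (ordinary least squares) loss only,
# shown just to illustrate what "coefficients" means here
w1, w0 = np.polyfit(x, yp_wo, deg=1)
plt.scatter(x, yp_wo)
plt.plot(x, w1*x + w0, color='red', label='MSE best fit')
plt.legend()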

How to distinguish real outliers with PyOD?

I am working on an anomaly detection project on call detail records for a telephone operator. I have prepared a sample of 10000 observations with 80 dimensions, which represents all the observations for one day of traffic; the data look as follows:
(this is a small part of the whole dataset)
I decided to use the PyOD library, which offers many unsupervised outlier detection algorithms, and to start with KNN:
from pyod.models.knn import KNN

knn = KNN(contamination=0.1)
result = knn.fit_predict(conso)
Then, to visualize the result, I reduced the sample to 2 dimensions and displayed it as a scatter plot, with the observations KNN predicted as inliers in blue and the predicted outliers in red.
from sklearn.manifold import TSNE

result_f = TSNE(n_components=2).fit_transform(df_final_2)
result_f = pd.DataFrame(result_f)

# result_list holds the 0/1 labels returned by knn.fit_predict above
color = ['red' if row == 1 else 'blue' for row in result_list]
'df_final_2' is the DataFrame version of 'conso'.
Then I plotted everything with the corresponding colors:
import matplotlib.pyplot as plt
plt.scatter(result_f[0],result_f[1], s=1, c=color)
What bothers me about the graph is that the observations predicted as outliers are not really outliers: normally outliers sit at the extremities of the plot rather than grouped with the normal behavior, and when I inspect these supposedly aberrant observations in the original dataset they also look normal. I have tried other PyOD algorithms and modified the parameters of each one, but I get roughly the same result. I must have made a mistake somewhere, but I cannot see where.
Thanks.
There are several things to check:
KNN, LOF, and similar models rely on distance measures, so the data should be standardized first (using sklearn's StandardScaler); a sketch follows the code below
t-SNE may not work well in this case, and the dimensionality reduction could be off
maybe do not use fit_predict, but do this instead (use y_train_pred):
# train kNN detector
clf_name = 'KNN'
clf = KNN(contamination=0.1)
clf.fit(X)  # X is your feature matrix (conso in the question)
# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores
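For the first point, a minimal standardization sketch (assuming conso is the feature matrix from the question):
from sklearn.preprocessing import StandardScaler
from pyod.models.knn import KNN

# scale each feature to zero mean and unit variance before fitting the distance-based model
conso_scaled = StandardScaler().fit_transform(conso)

clf = KNN(contamination=0.1)
clf.fit(conso_scaled)
y_train_pred = clf.labels_  # 0: inliers, 1: outliers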
If none of these work, feel free to open an issue report on GitHub and we will investigate further.

Getting correct exogenous least squares prediction in Python statsmodels

I am having trouble getting reasonable prediction behavior from least squares fits in statsmodels version 0.6.1; it does not seem to provide a sensible value.
Consider the following data
import numpy as np
import statsmodels.api as sm

xx = np.array([1.1,2.2,3.3,4.4]) # Independent variable
XX = sm.add_constant(xx) # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6]) # Dependent variable
ww = np.array([0.1,1,3,0.5]) # Weights to try
wn = ww/ww.sum() # Normalized weights
zz = 1.9 # Independent variable value to predict for
We can use numpy to do a weighted fit and prediction
np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))
and we find a prediction of 2.263636.
As a sanity check, we can also see what R has to say about the matter
import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri
rpy2.robjects.pandas2ri.activate()
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.Inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))
and again we find a prediction of 2.263636. My problem is that we do not get that result from statsmodels OLS
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))
Instead I obtain a value 1.695814 (similar things happen with WLS()). Either there is a bug, or using statsmodels for prediction has some trick too obscure for me to find. What is going on?
The results classes have a predict method that provides the prediction for new values of the explanatory variables:
>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]
The first return of wls_prediction_std is the standard error for the prediction not the prediction itself.
>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:
wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
calculate standard deviation and confidence interval for prediction
applies to WLS and OLS, not to general GLS,
that is independently but not identically distributed observations
Parameters
----------
res : regression result instance
results of WLS or OLS regression required attributes see notes
exog : array_like (optional)
exogenous variables for points to predict
weights : scalar or array_like (optional)
weights as defined for WLS (inverse of variance of observation)
alpha : float (default: alpha = 0.05)
confidence level for two-sided hypothesis
Returns
-------
predstd : array_like, 1d
standard error of prediction
same length as rows of exog
interval_l, interval_u : array_like
lower und upper confidence bounds
The sandbox function will be replaced by a new method get_prediction of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.
http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_prediction.html
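For example, with a recent statsmodels the same point prediction plus its intervals can be obtained like this (a sketch continuing the example above):
pred = owls.get_prediction(np.array([[1, zz]]))
print(pred.predicted_mean)             # point prediction, 2.26363636
print(pred.conf_int())                 # confidence interval for the mean
print(pred.summary_frame(alpha=0.05))  # includes obs_ci_lower/upper (prediction interval)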

Extending regressions beyond data in Matplotlib

I'm using Matplotlib and Numpy to plot linear regressions on time series plots in order to predict the trends in the future.
Generating the regressions doesn't seem to be particularly difficult, but getting the regression line to extend past the last data point is proving challenging:
How can I extend the regressions?
When you evaluate your regression model, you're predicting a value of submissions for the input date. To predict a wider range, you need to increase the range of dates that you're evaluating the model on. I'd also use np.polyval instead of the list comprehension, just because it's more compact:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate data like the question
observed_dates = pd.date_range("jan 2004", "april 2013", freq="M")
submissions = np.random.normal(5000, 100, len(observed_dates))
submissions += np.arange(len(observed_dates)) * 10
submissions[::12] += 800

# Plot the observed data
plt.plot(observed_dates, submissions, marker="o")

# Fit a model and predict future dates
predict_dates = pd.date_range("jan 2004", "jan 2020", freq="M")
model = np.polyfit(observed_dates.asi8, submissions, 1)
predicted = np.polyval(model, predict_dates.asi8)

# Plot the model
plt.plot(predict_dates, predicted, lw=3)
If you want to extend the regression line beyond the data, for example, to cover the entire x range, you can do something like this (just change the last 3 lines):
import numpy as np

# xmin, xmax: the x range you want the line to cover
# beta1, beta2, beta3: the fitted (here quadratic) coefficients
X = np.arange(xmin, xmax, 50)
line = beta1*X**2 + beta2*X + beta3
plt.plot(X, line, 'r-', lw=5.)
