Predicting from inferred parameters in pymc3 - python

I am trying to understand this from a non-Bayesian background.
In linear regression or blackbox machine learning tools the work flow is something like the following.
1. Get data.
2. Prepare data.
3. Model the data (learn from it, or from part of it, the training set).
4. Test the model (usually on a test set).
5. If the model is good according to some metric, go to 6; otherwise investigate and revise.
6. The model is good enough; use it to predict/classify, etc.
So let's say I use pymc3 to try to understand the relationship between advertising expenditure and revenue from goods sold. If stages 1 to 5 go well, then in the frequentist statistics used in R, or in machine learning packages such as scikit-learn, I only need to pass new unseen data to the learned model and invoke the predict method. This usually returns a predicted value of Y (revenue from goods sold) for some unseen value(s) of X (advertising expenditure), with confidence intervals or some other margin of error still taken into account.
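For contrast, here is a minimal sketch of that scikit-learn pattern (the numbers are made-up placeholders for expenditure and revenue):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # advertising expenditure (one feature)
y = np.array([2.1, 4.2, 5.9, 8.1])           # revenue from goods sold
model = LinearRegression().fit(X, y)         # learn from the training data
print(model.predict(np.array([[5.0]])))      # point prediction for an unseen X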
How would one go about doing that in pymc3? If I end up with many slopes and many betas, which should I use for prediction? And wouldn't taking the mean of all the slopes and all the betas be like throwing away a lot of otherwise useful learned knowledge?
I find it difficult to understand how sampling from the posterior can help here. One can imagine bosses who need to be told an expected revenue-from-goods-sold figure Y for a given advertising expenditure X, with some confidence and error margins. Aside from plotting, I don't see how sampling from the posterior can be incorporated into a management report and made useful for cash flow planning by interested parties.
I know some of us are spoiled coming from R and maybe scikit-learn, but wouldn't it be nice if there were a predict method that dealt with this in a more uniform and standardized way?
Thanks

One way of taking into account the uncertainty in parameters when making predictions with a model is to use the posterior predictive distribution. This distribution tells you the probability of a new observation, conditioned on the data that you used to constrain the model parameters. If the revenue is Y, the advertising expenditure is X, the model parameters are theta and the data used to constrain the model are X', then you can write
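p(Y | X, X') = ∫ p(Y | X, theta) p(theta | X') d(theta)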
The left hand side is the probability of attaining revenue Y given an expenditure X, and the data used to constrain the model X'. This is the posterior predictive distribution of your model, and should be used when making predictions. p(Y | X, theta) is the probability of revenue Y given some set of model parameters theta and the expenditure X. p(theta | X') is the posterior distribution on the model parameters given the data that you used to constrain the model.
When using software like pymc3, you obtain samples from p(theta | X'). You can use these to approximate the integral above in a Monte Carlo fashion. If you have N samples from the posterior in your MCMC chain, then you can compute the sum
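p(Y | X, X') ≈ (1/N) Σ_{n=1}^{N} p(Y | X, theta_n),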
in other words, you compute p(Y | X, theta_n) for every set of parameters in your MCMC chain and then take the average. (Note that this isn't the same as "taking the mean of all slopes and all betas" as you mentioned in your question, because you are averaging a pdf rather than the parameters themselves.) In practice this should be easy to code: you just need to implement the function p(Y | X, theta), plug in your posterior parameter samples, and take the mean at the end. This gives you the fairest representation of your model prediction given your MCMC sampling.
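A minimal sketch of what this looks like in PyMC3 (assuming a simple linear model; the data and variable names below are placeholders). pm.sample_posterior_predictive draws a Y from p(Y | X, theta_n) for every posterior sample theta_n, so summarizing those draws (mean, percentiles) gives predictions that account for parameter uncertainty:

import numpy as np
import pymc3 as pm

# Made-up stand-in data: X = advertising expenditure, Y = revenue
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 10, size=100)
Y_obs = 2.0 + 3.0 * X_obs + rng.normal(0, 1.0, size=100)
X_new = np.array([2.5, 5.0, 7.5])               # unseen expenditure values

with pm.Model() as model:
    x = pm.Data("x", X_obs)
    y = pm.Data("y", Y_obs)
    alpha = pm.Normal("alpha", mu=0, sigma=10)  # intercept
    beta = pm.Normal("beta", mu=0, sigma=10)    # slope
    sigma = pm.HalfNormal("sigma", sigma=5)     # noise scale
    mu = alpha + beta * x
    obs = pm.Normal("obs", mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(2000, tune=1000, return_inferencedata=False)

# Swap in the unseen X and draw from p(Y | X_new, theta_n) for every posterior sample
with model:
    pm.set_data({"x": X_new, "y": np.zeros_like(X_new)})
    ppc = pm.sample_posterior_predictive(trace)

y_pred = ppc["obs"]                              # shape (n_draws, len(X_new))
print(y_pred.mean(axis=0))                       # point predictions for the report
print(np.percentile(y_pred, [5, 95], axis=0))    # 90% predictive intervals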

Related

Why does auto arima generate a best model with q order outside my pre-set range?

I would like to create an auto arima model that automatically selects the best parameter values. The value range that I set for q is [1, 2]. However, the best q value generated by auto arima is 0. Does anyone know why?
Below is my code
from pmdarima import auto_arima

sarimax_model = auto_arima(
    df_x_train['y'],
    exogenous=df_x_train[['black_friday_ind', 'holiday_season_ind', 'covid_ind']],
    start_p=0, d=1, start_q=1, max_p=1, max_d=1, max_q=2,
    start_P=0, D=1, start_Q=1, max_P=1, max_D=1, max_Q=1,
    m=seasonal_periods,
    information_criterion='aic',
    stepwise=True,
)
This is the best model generated by auto arima
You are selecting via AIC (Akaike Information Criterion), and the model with the lowest AIC wins the automated model selection. start_q only sets where the stepwise search begins, not a lower bound, so the search is free to settle on a q order of 0.
If you can share the output of sarimax_model.summary(), we can see the AIC of the selected model.
But you need to be careful with automated model selection. It is very dependent on your input data and on the parameters given to the function, and the parameters need to match the data (seasonality, seasonal period, seasonal difference order, etc.).
You can verify your P and Q orders manually via ACF and PACF plots, and it is also good to check diagnostic plots for normality and correlation of the residuals.
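A minimal sketch of those checks (assuming the df_x_train and sarimax_model from the question, and pmdarima's fitted-model methods):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

print(sarimax_model.summary())                    # selected orders and their AIC

sarimax_model.plot_diagnostics(figsize=(10, 8))   # residual normality / correlation
plt.show()

# Eyeball the differenced series to sanity-check the chosen p/q and P/Q orders
plot_acf(df_x_train['y'].diff().dropna(), lags=40)
plot_pacf(df_x_train['y'].diff().dropna(), lags=40)
plt.show()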

Is it possible to do a restricted VAR-X model using python?

I have seen some similar questions but they didn't work for my situation.
Here is the model I am trying to implement.
VAR model
I suppose I would need to be able to set the coefficient of stockxSign to 0 when calculating Stock, and do the same for CDSxSign when calculating CDS.
Does someone have any idea how I could do this?
It is possible now with the package that I just wrote.
https://github.com/fstroes/ridge-varx
You can fit coefficients for specific lags only, by supplying the list of lags for which coefficient matrices should be fitted. Providing "lags=[1,12]", for instance, would only use the variables at lags 1 and 12.
In addition you can use Ridge regularization if you are not sure what lags should be included. If Ridge is not used, the model is fitted using the usual multivariate least squares approach.

How to build a Gaussian Process regression model for observations that are constrained to be positive

I'm currently trying to train a GP regression model in GPflow which will predict precipitation values given some meteorological inputs. I'm using a Linear+RBF+WhiteNoise kernel, which seems appropriate given the set of predictors I'm using.
My problem at the moment is that when I get the model to predict new values, it has a tendency to predict negative precipitation - see attached figure.
How can I "enforce" physical constraints when building the model? The training data doesn't contain any negative precipitation values, but it does contain a lot of values close to zero, which I assume means the GPR model isn't learning the "precipitation must be >=0" constraint very well.
If there's a way of explicitly enforcing a constraint like this it'd be perfect, but I'm not sure how that would work. Would this require a different optimization algorithm? Or is it possible to somehow build this constraint into the kernel structure?
This is more of a question for CrossValidated ... A Gaussian process is essentially a distribution over functions with Gaussian marginals: the predictive distribution of f(x) at any point is, by construction, a Gaussian, which is not constrained to be positive. E.g. if you have lots of observations close to zero, your model expects that something just below zero must also be very likely.
If your observations are strictly positive, you could use a different likelihood, e.g. Exponential (gpflow.likelihoods.Exponential) or Beta (gpflow.likelihoods.Beta). Note that model.predict_y() always returns mean and variance, and for non-Gaussian likelihoods the variance may not actually be what you want. In practice, you're more likely to care about quantiles (e.g. 10%-90% confidence interval); there is an open issue on the GPflow github that relates to this. Which likelihood you use is part of your modelling choice, and depends on your data.
The simplest practical answer to your problem is to consider modelling the log-precipitation: if your original dataset is X and Y (with Y > 0 for all entries), compute logY = np.log(Y) and create your GP model e.g. using gpflow.models.GPR((X, logY), kernel). You then predict logY at test points, and can then convert it back from log-precipitation into precipitation space. (This is equivalent to a LogNormal likelihood, which isn't currently implemented in GPflow, though this would be straightforward.)
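A minimal sketch of the log-precipitation route (assuming GPflow 2, and that X, Y, X_test are numpy arrays with Y of shape (N, 1) and strictly positive):

import numpy as np
import gpflow

logY = np.log(Y)
kernel = gpflow.kernels.Linear() + gpflow.kernels.RBF()   # white noise is handled by GPR's Gaussian likelihood
model = gpflow.models.GPR(data=(X, logY), kernel=kernel)

gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

# Predict in log space, then map quantiles back; exp(mean) is the median, not the mean
mean_log, var_log = model.predict_y(X_test)
mean_log, var_log = mean_log.numpy(), var_log.numpy()
lower = np.exp(mean_log - 1.96 * np.sqrt(var_log))   # ~2.5% quantile of precipitation
median = np.exp(mean_log)
upper = np.exp(mean_log + 1.96 * np.sqrt(var_log))   # ~97.5% quantile, always >= 0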

XGBoost (Python) Prediction for Survival Model

The docs for XGBoost imply that the output of a model trained using the Cox PH loss is the exponential of each individual's predicted multiplier (against the baseline hazard). Is there no way to extract the baseline hazard from this model in order to predict the entire survival curve per person?
survival:cox: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).
No, I think not. A workaround would be to fit the baseline hazard in another package, e.g. from sksurv.linear_model import CoxPHSurvivalAnalysis in Python, or require(survival) in R. Then you can use the predicted output from XGBoost as multipliers on the fitted baseline. Just remember that if the baseline is on the log scale, then use output_margin=True and add the predictions rather than multiplying.
I hope the authors of XGBoost will soon provide some examples of how to use this objective.
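A minimal sketch of that workaround (assuming scikit-survival's cum_baseline_hazard_ step function and placeholder arrays X_train, X_test, time, event; treat it as an approximation, since the baseline comes from a separate Cox fit rather than from XGBoost itself):

import numpy as np
import xgboost as xgb
from sksurv.linear_model import CoxPHSurvivalAnalysis

# XGBoost's survival:cox expects the label to be the time, negated for censored rows
y_xgb = np.where(event, time, -time)
bst = xgb.train({"objective": "survival:cox", "eta": 0.1},
                xgb.DMatrix(X_train, label=y_xgb), num_boost_round=200)

# Baseline cumulative hazard H0(t) from a separate Cox fit
y_sksurv = np.array(list(zip(event, time)), dtype=[("event", bool), ("time", float)])
H0 = CoxPHSurvivalAnalysis().fit(X_train, y_sksurv).cum_baseline_hazard_

# Per-person hazard ratio from XGBoost; output_margin=True returns the log-HR
log_hr = bst.predict(xgb.DMatrix(X_test), output_margin=True)

# Survival curves: S_i(t) = exp(-H0(t) * HR_i)
grid = np.linspace(time.min(), time.max(), 100)
surv = np.exp(-np.outer(np.exp(log_hr), H0(grid)))   # shape (n_test, len(grid))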

Polynomial Regression of a Noisy Dataset

I was wondering if I could get some help on a problem.
I am creating a tool for a former lab of mine which uses data from a physics-based machine (a lot of noise) that outputs simple x, y coordinates. I want to identify local maxima of the dataset; however, since there is a lot of noise in the set, you cannot just check the slope between points to determine a peak.
In order to solve this, I was thinking of using polynomial regression to somewhat "smooth out" the dataset, then determine the local maxima from the resulting model.
I've run through this link:
http://scikit-learn.org/stable/auto_examples/linear_model/plot_polynomial_interpolation.html
However, it only tells you how to create a model that is a close fit. It doesn't tell you whether there is a built-in metric for deciding which model is best. Should I do this through chi-squared? Or is there some other metric that works better or is built into scikit-learn?
The link provided essentially shows you how to build a Ridge regression on top of polynomial features. Consequently it is not a "tight fit", as you can control it through regularization (the alpha parameter), which acts as a prior over the parameters. Now, what do you mean by "best model"? There are infinitely many possible criteria for being the best regression, each tested through a different measure. You need to answer for yourself what measure you are interested in. Should it be some kind of "golden ratio" between smoothness and closeness of fit? Or maybe you want a model of at most some smoothness which minimizes some error measure (mean squared distance to the points?)? Yet another option is to test how well it captures the underlying process, through some kind of typical validation (like cross-validation, etc.) where you repeatedly build the model on a subset of the data and check the error on the held-out part. There are many possible (and completely valid!) approaches - everything depends on the exact question you want to answer. "What is the best model" is not a good question, unfortunately.
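As a concrete instance of the cross-validation option, a minimal sketch (with made-up noisy data standing in for your x, y measurements): pick the polynomial degree and Ridge alpha by held-out mean squared error, then read local maxima off the smoothed curve.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from scipy.signal import argrelextrema

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = np.sin(x) + 0.3 * rng.normal(size=x.size)         # noisy stand-in data

pipe = make_pipeline(PolynomialFeatures(), Ridge())
param_grid = {
    "polynomialfeatures__degree": range(2, 12),
    "ridge__alpha": [1e-3, 1e-2, 1e-1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(x[:, None], y)

y_smooth = search.predict(x[:, None])
peaks = argrelextrema(y_smooth, np.greater)[0]        # indices of local maxima
print(search.best_params_, x[peaks])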
