How to provide custom gradients to HMC sampler in tensorflow-probability? - python

I am trying to use the built-in HMC sampler of tensorflow-probability to generate samples from the posterior. According to the documentation, one provides the (possibly unnormalized) log density of the posterior as the target_log_prob_fn callable, and tensorflow-probability automatically computes its gradient (with respect to the parameters being inferred) to perform the Hamiltonian MCMC updates.
However, for my application the likelihood and the gradient of the resulting posterior are computed outside of tensorflow (they involve the solution of a partial differential equation, which I can compute efficiently with another Python library). So I was wondering: is there a way to pass target_log_prob_fn both the (unnormalized) log density of the posterior and its gradient? In other words, can I ask the HMC sampler to use the gradients I provide when performing the MCMC updates?
I found a related question over here, but it does not exactly answer my question.
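One pattern that is sometimes suggested for this (not taken from the question or any quoted documentation) is to wrap the external solver in tf.custom_gradient, so the HMC kernel sees an ordinary target_log_prob_fn whose gradient is the one you supply. A minimal sketch under that assumption, where external_log_prob and external_grad are hypothetical stand-ins for the PDE-based code:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

# Hypothetical stand-ins for the external PDE-based code: they return the
# unnormalized log posterior and its gradient as NumPy values.
def external_log_prob(theta):
    return np.float64(-0.5 * np.sum(theta ** 2))

def external_grad(theta):
    return -theta

@tf.custom_gradient
def target_log_prob(theta):
    # Run the external code outside the TF graph via numpy_function;
    # numpy_function is not differentiable, which is why custom_gradient
    # has to supply the gradient explicitly.
    logp = tf.numpy_function(external_log_prob, [theta], tf.float64)
    logp = tf.ensure_shape(logp, [])

    def grad(dy):
        g = tf.numpy_function(external_grad, [theta], tf.float64)
        return dy * tf.ensure_shape(g, theta.shape)

    return logp, grad

kernel = tfp.mcmc.HamiltonianMonteCarlo(
    target_log_prob_fn=target_log_prob,
    step_size=0.1,
    num_leapfrog_steps=3)

samples = tfp.mcmc.sample_chain(
    num_results=500,
    num_burnin_steps=200,
    current_state=tf.zeros([2], dtype=tf.float64),
    kernel=kernel,
    trace_fn=None)
Because the external calls go through tf.numpy_function they run in plain Python, so this sketch will not work under XLA compilation, but the HMC kernel itself only needs the value and gradient returned here.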

Related

What is gradient descent? Can gradient descent give better results than sklearn's linear regression algorithm?

Why do we use gradient descent when sklearn can automatically find the best-fit line for our data? What is the purpose of gradient descent?
https://scikit-learn.org/stable/modules/sgd.html
If you want to use a gradient descent approach, you should consider using SGDRegressor in sklearn, because sklearn gives two approaches to linear regression. The first is the LinearRegression class, which uses the ordinary least squares solver from scipy;
the other is the SGDRegressor class, which is an implementation of the stochastic gradient descent algorithm. So, to answer your question: if you are using SGDRegressor in sklearn, then you are using an implementation of the gradient descent algorithm behind the scenes.
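For illustration, here is a small comparison of the two classes on toy data (the data and settings below are made up for the example):
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Toy data: y = 3x + 2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + 2 + rng.normal(scale=0.5, size=200)

# Closed-form ordinary least squares.
ols = LinearRegression().fit(X, y)

# The same problem solved by stochastic gradient descent.
sgd = SGDRegressor(max_iter=1000, tol=1e-3, random_state=0).fit(X, y)

print("OLS:", ols.coef_, ols.intercept_)
print("SGD:", sgd.coef_, sgd.intercept_)
Both should recover roughly the same slope and intercept; the SGD estimate depends on the learning-rate schedule and number of iterations.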
From Wikipedia itself:
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.
Gradient descent is just one approach for converging on a solution, for example when maximizing a likelihood. There are other alternatives, each with its own limitations.
The LinearRegression model in sklearn is just a fancy wrapper of the least squares solver (scipy.linalg.lstsq) built into scipy.
A gradient descent isn't much more than just following the slope locally.
If you're using gradient descent, you're essentially taking into account how much the value of the function you're trying to minimize changes as a function of one or more parameters. You can use that information to make a better guess about which parameter values are likely to work best to minimize the function at the next step.
Picture this in 2D, so only two parameters determine the value of the function you want to minimize. It remains conceptually the same (just with harder maths behind the scenes) with as many parameters as you want to use.
So a gradient descent, essentially, means taking your bicycle and letting the local slope lead you to the nearest local minimum.
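To make the "following the slope" idea concrete, here is a bare-bones gradient descent loop for a one-parameter function (a toy example written for this answer):
# Minimize f(x) = (x - 4)**2 by repeatedly stepping against the gradient.
def f_prime(x):
    return 2 * (x - 4)          # derivative of f at x

x = 0.0                         # starting guess
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * f_prime(x)

print(x)                        # approaches 4, the minimizer of f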
To answer your question directly: gradient descent can get a solution for many models, from logistic regression to neural networks (called multi-layer perceptrons, MLPs, in sklearn).
If you are solving only a simple linear model, then using gradient descent (as in Basilisk's answer) has some minor benefits at the cost of performance: it is more flexible, but slower. I used it when I needed more flexibility.
As an aside, note that this question is not about programming but about machine learning, and would be a better fit for Cross Validated than Stack Overflow - though you might also want to start with the fundamentals (think about this: what do we mean by "best fit line"?).

Distributed Lag Model in Python

I have looked for a distributed lag model in statsmodels but can't find one. The closest thing is the VAR model. Can I transform a VAR model into a distributed lag model, and how? It would be great if another package already has a distributed lag model. Please let me know if so.
Thanks!
If you are using a finite distributed lag model, just use OLS or FGLS, with the lagged predictors forming the covariate matrix, and some parameterized model of autocorrelation (if using FGLS).
If your target variable is vector-valued, then the same advice applies and it just becomes a multiple regression problem, with a separate regression for each component of the output, and possibly additional covariance structure if there is correlation between error terms across components of the target.
It does not appear there is a standard statistics package in Python that implements this directly, likely because it would boil down to FGLS in almost any practical situation.
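As a concrete illustration of the OLS route, here is a sketch with made-up data and made-up lag-column names (the series, weights, and number of lags are all assumptions for the example):
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Toy series in which y depends on x at lags 0..3 with weights beta.
rng = np.random.default_rng(1)
n, n_lags = 300, 3
x = rng.normal(size=n)
beta = np.array([0.5, 0.3, 0.2, 0.1])
y = np.convolve(x, beta)[:n] + rng.normal(scale=0.1, size=n)

# Build the lagged covariate matrix and drop the rows lost to lagging.
df = pd.DataFrame({"y": y, "x": x})
for k in range(n_lags + 1):
    df[f"x_lag{k}"] = df["x"].shift(k)
df = df.dropna()

X = sm.add_constant(df[[f"x_lag{k}" for k in range(n_lags + 1)]])
fdl = sm.OLS(df["y"], X).fit()
print(fdl.params)       # estimates of the intercept and the lag weights
Swapping sm.OLS for sm.GLSAR with an AR error model is one simple way to get an FGLS-style fit when the errors are autocorrelated.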

Statsmodels - Negative Binomial doesn't converge while GLM does converge

I'm trying to do a Negative Binomial regression using Python's statsmodels package. The model estimates fine when using the GLM routine i.e.
model = smf.glm(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df, family=sm.families.NegativeBinomial()).fit()
model.summary()
However, the GLM routine doesn't estimate alpha, the dispersion term. I tried to use the Negative Binomial routine directly (which does estimate alpha) i.e.
nb = smf.negativebinomial(formula="Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed", data=df).fit()
nb.summary()
But this doesn't converge. Instead I get the message:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: nan
Iterations: 0
Function evaluations: 1
Gradient evaluations: 1
My question is:
Do the two routines use different methods of estimation? Is there a way to make the smf.NegativeBinomial routine use the same estimation methods as the GLM routine?
discrete.NegativeBinomial uses either a Newton method (the default in statsmodels) or the scipy optimizers. The main problem is that the exponential mean function can easily produce overflow problems, or problems from large gradients and Hessians, while we are still far away from the optimum. There are some attempts in the fit method to get good starting values, but this does not always work.
A few possibilities that I usually try:
check that no regressor has large values, e.g. rescale so the maximum is below 10
use method='nm' (Nelder-Mead) as the initial optimizer and switch to newton or bfgs after some iterations or after convergence (a sketch of this two-stage fit follows the list)
try to come up with better starting values (see, for example, the GLM approach below)
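A minimal sketch of that two-stage fit, reusing the formula and DataFrame df from the question (the column names come from the question, not from me):
import statsmodels.formula.api as smf

formula = "Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed"

# Stage 1: derivative-free Nelder-Mead to get into a reasonable region.
nb_nm = smf.negativebinomial(formula=formula, data=df).fit(
    method="nm", maxiter=2000, disp=False)

# Stage 2: restart a gradient-based optimizer from the Nelder-Mead solution.
nb = smf.negativebinomial(formula=formula, data=df).fit(
    start_params=nb_nm.params, method="bfgs", maxiter=1000, disp=False)
print(nb.summary())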
GLM uses iteratively reweighted least squares (IRLS) by default, which is standard only for one-parameter families, i.e. it takes the dispersion parameter as given. So the same method cannot be used directly for the full MLE in the discrete NegativeBinomial.
GLM negative binomial still specifies the full loglikelihood, so it is possible to do a grid search over the dispersion parameter, using GLM.fit() to estimate the mean parameters for each value of the dispersion parameter (a grid-search sketch is at the end of this answer). This should be equivalent to the corresponding discrete NegativeBinomial version (nb2? I don't remember). It could also be used to provide start_params for the discrete version.
In the statsmodels master version, there is now a connection that allows arbitrary scipy optimizers instead of the ones that were hardcoded. scipy recently gained trust-region Newton methods, and will gain more in the future, which should work for more cases than the simple Newton method in statsmodels.
(However, most likely that does not work currently for discrete NegativeBinomial, I just found out about a possible problem https://github.com/statsmodels/statsmodels/issues/3747 )
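A rough sketch of the grid-search idea mentioned above; the formula and df are from the question, and the grid bounds are arbitrary choices for the example:
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

y, X = dmatrices("Sales_Focus_2016 ~ Sales_Focus_2015 + A_Calls + A_Ed",
                 df, return_type="dataframe")

# Profile the NB dispersion parameter alpha on a grid, fitting the mean
# parameters with IRLS (GLM.fit) at each grid point.
alphas = np.linspace(0.05, 5.0, 100)
loglikes = [sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=a)).fit().llf
            for a in alphas]
alpha_hat = alphas[int(np.argmax(loglikes))]
print("profile-likelihood estimate of alpha:", alpha_hat)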

Lasso Generalized linear model in Python

I would like to fit a generalized linear model with negative binomial link function and L1 regularization (lasso) in python.
Matlab provides the nice function:
lassoglm(X,y, distr)
where distr can be poisson, binomial etc.
I had a look at both statsmodels and scikit-learn, but I did not find any ready-to-use function or example that could direct me towards a solution.
In matlab it seems they minimize this:
min (1/N * Deviance(β0,β) + λ * sum(abs(β)) )
where deviance depends on the link function.
Is there a way to implement this easily with scikit-learn or statsmodels, or should I go for cvxopt?
statsmodels has had for some time a fit_regularized for the discrete models including NegativeBinomial.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html
which doesn't have the docstring (I just saw). The docstring for Poisson has the same information http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html
and there should be some examples available in the documentation or unit tests.
It uses an interior point algorithm with either scipy's slsqp or, optionally, cvxopt if it is installed. In contrast to steepest descent or coordinate descent methods, this is only appropriate for cases where the number of features/explanatory variables is not too large.
Coordinate descent with elastic net for GLM is a work-in-progress pull request and will most likely be available in statsmodels 0.8.
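A hedged sketch of what the existing fit_regularized looks like for the discrete NegativeBinomial model; the data are synthetic, and the alpha below is the L1 penalty weight, not the NB dispersion parameter:
import numpy as np
import statsmodels.api as sm

# Synthetic count data standing in for the real design matrix and response.
rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 5)))
true_beta = np.array([0.5, 1.0, 0.0, 0.0, -0.5, 0.0])
y = rng.poisson(lam=np.exp(X @ true_beta))

model = sm.NegativeBinomial(y, X)
# method='l1' gives the lasso-type penalty; alpha is the penalty weight.
fit_l1 = model.fit_regularized(method="l1", alpha=0.1, disp=False)
print(fit_l1.params)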

Defining priors and marginalizing over priors in pymc

I am going through the tutorial on the Markov chain Monte Carlo (MCMC) process with the pymc library. I am also a newbie with pymc and am trying to set up my own MCMC process. I have faced a couple of questions that I couldn't find proper answers to in the pymc tutorial:
First: how can we define priors with pymc and then marginalize over those priors in the chain process?
My second question is about the Dirichlet distribution: how is it related to prior information in MCMC, and how should it be defined?
I recommend following the PyMC user's guide. It explicitly shows you how to specify your model (including priors). With MCMC, you end up getting marginals of all posterior values, so you don't need to know how to marginalize over priors.
The Dirichlet is often used as a prior to multinomial probabilities in Bayesian models. The values of the Dirichlet parameters can be used to encode prior information, typically in terms of a notional number of prior events corresponding to each element of the multinomial. For example, a Dirichlet with a vector of ones as the parameters is just a generalization of a Beta(1,1) prior to multinomial quantities.
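A short sketch of a Dirichlet prior over multinomial probabilities, written against the current PyMC API rather than the older pymc2 interface the tutorial uses; the counts are made up for the example:
import numpy as np
import pymc as pm

# Observed counts for three categories (toy numbers).
counts = np.array([12, 7, 31])

with pm.Model():
    # Dirichlet(1, 1, 1): the multinomial generalization of a flat Beta(1, 1).
    p = pm.Dirichlet("p", a=np.ones(3))
    pm.Multinomial("obs", n=counts.sum(), p=p, observed=counts)
    # MCMC gives draws from the joint posterior; the marginal of each
    # component of p is just that component of the draws.
    idata = pm.sample(1000, tune=1000)

print(idata.posterior["p"].mean(dim=("chain", "draw")))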
