I'm trying to replicate a data analysis from a paper by Richard McElreath, in which he fits the data with a hierarchical zero-inflated Gamma model. The data consist of the hunting returns of around 15,000 hunting trips made by about 150 hunters over twenty years. Because a good many hunting trips have zero returns, the model assumes each trip has probability pi of a zero return, and probability 1 - pi of a positive return that follows a Gamma distribution with parameters alpha and beta.
The predictor variable is age; the model uses an age polynomial (up to order 3) to model both pi and alpha. Since the 15,000 trips belong to 150 individual hunters, each hunter has coefficients of his own, and all the coefficients follow a common multivariate normal distribution. For details of the model, please refer to the code below. The model specification seems alright, but NUTS has trouble getting started: it produced only about 10 samples after about 20 minutes, then the sampler just stalled and told me it would take hundreds of hours to finish. I want to know what is causing the problem.
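In other words, the likelihood that the custom density class below is meant to encode is (my summary, in the same notation):
p(y) = pi                                  if y = 0
p(y) = (1 - pi) * Gamma(y | alpha, beta)   if y > 0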
The usual imports:
import pymc3 as pm
import numpy as np
from pymc3.distributions import Continuous, Gamma
import theano.tensor as tt
The data can be obtained from GitHub; d below is the pandas DataFrame holding it.
n_trip = len(d)
n_hunter = len(d['hunter.id'].unique())
idx_hunter = d['hunter.id'].values
y = d['kg.meat'].values
age = d['age.s'].values
age2 = (d['age.s'].values)**2
age3 = (d['age.s'].values)**3
The log probability density function for the zero-inflated Gamma:
class ZeroInflatedGamma(Continuous):
    def __init__(self, alpha, beta, pi, *args, **kwargs):
        super(ZeroInflatedGamma, self).__init__(*args, **kwargs)
        self.alpha = alpha
        self.beta = beta
        self.pi = pi = tt.as_tensor_variable(pi)
        self.gamma = Gamma.dist(alpha, beta)

    def logp(self, value):
        # Mixture: log(pi) at zero, log(1 - pi) plus the Gamma log-density for positive values.
        return tt.switch(value > 0,
                         tt.log(1 - self.pi) + self.gamma.logp(value),
                         tt.log(self.pi))
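As a quick sanity check of the density (my own, not from the original post), the logp should return log(pi) at zero and log(1 - pi) plus the Gamma log-density for positive values:

# Hypothetical check on a free-standing instance of the custom distribution.
zig = ZeroInflatedGamma.dist(alpha=2.0, beta=1.0, pi=0.3)
print(zig.logp(0.0).eval())   # should equal np.log(0.3)
print(zig.logp(1.5).eval())   # should equal np.log(0.7) + Gamma.dist(2.0, 1.0).logp(1.5).eval()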
This is a matrix used to index the correlation-matrix prior into a 9x9 matrix; the LKJ prior in pymc3 is given as a one-dimensional vector of the upper-triangular elements.
dim = 9
n_elem = dim * (dim - 1) // 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
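To see what the indexing trick does (a small illustration of mine, not in the original post), here is the same construction for dim = 3:

import numpy as np

dim = 3
n_elem = dim * (dim - 1) // 2          # 3 off-diagonal correlations
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)

c = np.array([0.1, 0.2, 0.3])          # stand-in for a sampled LKJCorr vector
C = c[tri_index]                       # expand the flat vector into a symmetric matrix
np.fill_diagonal(C, 1.0)
print(C)
# [[1.  0.1 0.2]
#  [0.1 1.  0.3]
#  [0.2 0.3 1. ]]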
And here is the model
with pm.Model() as Vary9_model:
    # hyper-priors
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sigma_a = pm.HalfCauchy('sigma_a', 5, shape=9)

    # build the covariance matrix
    C_triu = pm.LKJCorr('C_triu', n=2, p=9)
    C = tt.fill_diagonal(C_triu[tri_index], 1)
    sigma_diag = tt.nlinalg.diag(sigma_a)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)

    # priors for each hunter and all the linear components, 9 dimensional Gaussian
    a = pm.MvNormal('a', mu=mu_a, cov=cov, shape=(n_hunter, 9))

    # linear function
    mupi = (a[:, 0][idx_hunter] + a[:, 1][idx_hunter] * age
            + a[:, 2][idx_hunter] * age2 + a[:, 3][idx_hunter] * age3)
    mualpha = (a[:, 4][idx_hunter] + a[:, 5][idx_hunter] * age
               + a[:, 6][idx_hunter] * age2 + a[:, 7][idx_hunter] * age3)

    pi = pm.Deterministic('pi', pm.math.sigmoid(mupi))
    alpha = pm.Deterministic('alpha', pm.math.exp(mualpha))
    beta = pm.Deterministic('beta', pm.math.exp(a[:, 8][idx_hunter]))

    y_obs = ZeroInflatedGamma('y_obs', alpha, beta, pi, observed=y)

    Vary9_trace = pm.sample(6000, njobs=2)
And this is the status of the model:
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -28,366: 100%|██████████| 200000/200000 [15:36<00:00, 213.57it/s]
Finished [100%]: Average ELBO = -28,365
0%| | 22/6000 [15:51<63:49:25, 38.44s/it]
I have some thoughts on the problem, but I'm not sure which is the actual cause.
Is the nine-dimensional Gaussian too difficult to sample from? I previously modeled only the intercepts of mualpha and mupi as a bivariate Gaussian; that was slow but it worked (the model fitting took about 20 minutes).
Is it the probability density that's causing the problem? I wrote the density function myself and am not sure it's functioning well. I think the density function is not differentiable at zero; will this cause trouble for the NUTS sampler?
Is it because the predictor variables are highly correlated? The linear components in this model are polynomials of age up to the third degree, so naturally the predictors are highly correlated.
Or maybe it's because of something else?
As a side note, I tried the Metropolis sampler; my computer ran out of memory and the chains still hadn't converged.
The ZeroInflatedGamma looks fine. The density function is differentiable with respect to pi, alpha and beta. That is all you need for an observed variable. You only need the derivatives with respect to the value if you are trying to estimate the values.
There was an issue in the implementation of LKJCorr:
https://github.com/pymc-devs/pymc3/pull/1863
You could try again on master. Sadly, pymc3 does not have support for using MvNormal and LKJCorr in a Cholesky-decomposed parametrization; that might help, too. There is a work-in-progress pull request for this on github:
https://github.com/pymc-devs/pymc3/pull/1875
To improve convergence you could try a non-centered parameterization for a. Something along the lines of
a_raw = pm.Normal('a_raw', shape=(9, n_hunter))
# Shift and scale the standard-normal draws; transpose so a has shape (n_hunter, 9) as before.
a = (mu_a[:, None] + tt.dot(tt.slinalg.cholesky(cov), a_raw)).T
Of course this would be faster if we had that cholesky LKJCorr...
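For concreteness, here is a sketch (my own assembly, not from the original answer) of how the non-centered prior block would slot into the model from the question; the linear predictors and the ZeroInflatedGamma likelihood stay exactly as before:

with pm.Model() as Vary9_noncentered:
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sigma_a = pm.HalfCauchy('sigma_a', 5, shape=9)
    C_triu = pm.LKJCorr('C_triu', n=2, p=9)
    C = tt.fill_diagonal(C_triu[tri_index], 1)
    cov = tt.nlinalg.matrix_dot(tt.nlinalg.diag(sigma_a), C, tt.nlinalg.diag(sigma_a))
    # Non-centered parameterization: standard-normal draws, shifted and scaled by the Cholesky factor.
    a_raw = pm.Normal('a_raw', mu=0, sd=1, shape=(9, n_hunter))
    chol = tt.slinalg.cholesky(cov)
    a = pm.Deterministic('a', (mu_a[:, None] + tt.dot(chol, a_raw)).T)  # shape (n_hunter, 9)
    # ... mupi, mualpha, pi, alpha, beta and y_obs as in the original model ...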
Related
I'm trying to infer 2 parameters (beta and gamma) given a deterministic equation and simulated noisy data. For some reason, the equation I'm using seems to be problematic, as I just copied the basic pymc3 tutorial and used my own deterministic equation. Here is the model I'm using:
# True parameter values
beta, gamma = 0.21, 0.07

# Size of dataset
days = 50

# Predictor variable
time = np.arange(0, days, 1)

# Simulate outcome variable
data = []
for t in time:
    data.append((beta/((beta-gamma))*(np.exp(t*(beta-gamma))-1)+1) + np.random.normal(0, 1))

basic_model = pm.Model()

def smodel(beta, gamma):
    s = beta/((beta-gamma))*(tt.exp(time*(beta-gamma))-1)+1
    return s

with basic_model:
    # Priors for unknown model parameters
    beta = pm.Normal("beta", mu=0, sigma=10)
    gamma = pm.Normal("gamma", mu=0, sigma=10)

    # Expected value of outcome
    #smodel_pm = pm.Deterministic('smodel', smodel(inputParam))
    y_obs = pm.Normal('obs', mu=smodel(beta, gamma), sigma=1, observed=data)

    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis())
However, when I run a summary of the trace, I'm getting 0's for everything. Anyone know what the issue is?
The parameterization here is rather poor (correlated variables, symmetric solutions in the domain), and Metropolis-Hastings simply needs to run for a long time, whereas the default settings assume NUTS.
Here's a suggested alternative parameterization (substituting a = beta/(beta-gamma) and b = beta-gamma, so that beta = a*b and gamma = (a-1)*b), plus tuning and draw counts that are more reasonable for this sampling strategy:
basic_model = pm.Model()

def smodel(a, b):
    s = a*(tt.exp(b*time)-1)+1
    return s

with basic_model:
    # priors for pre-transformed model parameters
    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.HalfNormal("b", sigma=10)

    # (transformed) parameters of interest
    beta = pm.Deterministic("beta", a*b)
    gamma = pm.Deterministic("gamma", (a-1)*b)

    # expected value of outcome
    y_obs = pm.Normal('obs', mu=smodel(a, b), sigma=1, observed=data)

    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis(), tune=100000, draws=50000)
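As a rough check (my addition, not part of the original answer), the recovered values can be inspected with a posterior summary; beta and gamma show up because they were wrapped in pm.Deterministic above:

print(pm.summary(trace))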
One could probably increase the draws even further, since the effective sample sizes (ESS) are quite small, but the numbers for the parameters of interest are about where they should be.
Looking at the pairs plot, one can see that the correlation between a and b is still very high, which explains why the samples are so highly auto-correlated.
The traces and densities (per chain) look decent to me.
I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm from hmmlearn is much better than my hand-written one (its error rate is below 0.5%, compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code, where I first generate the state and observation sequences Z and X, predict Z with hmmlearn, and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm
# Generate a sequence
# (pi, A, obs_means, obs_covars and T are the start probabilities, transition matrix,
#  emission means/covariances and sequence length, defined earlier.)
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)
## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)
# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1, var_1).pdf(X), st.norm(mean_2, var_2).pdf(X)), axis=1)
delta = np.zeros(shape=(T, 2))
psi = np.zeros(shape=(T, 2))

### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))

### Take everything in log space since values get very low as t -> T
for t in range(1, T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis=1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis=1)

### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. Looking at the falsely predicted states, it seems the emission probability B has too much influence: it causes the state to switch too frequently, even when the transition probability to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I stupidly used the variances instead of the standard deviations to determine the emission probabilities. After fixing this, I get the same result as the implemented Viterbi algorithm.
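For reference, that fix amounts to (my restatement of the edit above):

# Use the standard deviations (square roots of the variances) for the Gaussian emissions.
B = np.concatenate((st.norm(mean_1, np.sqrt(var_1)).pdf(X),
                    st.norm(mean_2, np.sqrt(var_2)).pdf(X)), axis=1)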
I've recently started learning pymc3 after exclusively using emcee for ages and I'm running into some conceptual problems.
I'm practising with Chapter 7 of Hogg's Fitting a model to data. This involves an MCMC fit of a straight line with arbitrary 2D uncertainties. I've accomplished this quite easily in emcee, but pymc is giving me some problems.
It essentially boils down to using a multivariate Gaussian likelihood.
Here is what I have so far.
from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt

size = 200
true_intercept = 1
true_slope = 2
true_x = np.linspace(0, 1, size)

# y = a + b*x
true_regression_line = true_intercept + true_slope * true_x

# add noise
# here the errors are all the same, but in the real world they usually are not!
std_y, std_x = 0.1, 0.1
y = true_regression_line + np.random.normal(scale=std_y, size=size)
x = true_x + np.random.normal(scale=std_x, size=size)

y_err = np.ones_like(y) * std_y
x_err = np.ones_like(x) * std_x

data = dict(x=x, y=y)

with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    intercept = Normal('Intercept', 0, sd=20)
    gradient = Normal('gradient', 0, sd=20)

    # Define likelihood
    likelihood = MvNormal('y', mu=intercept + gradient * x,
                          tau=1./(np.stack((y_err, x_err))**2.), observed=y)

    # start the mcmc!
    start = find_MAP()            # Find starting value by optimization
    step = NUTS(scaling=start)    # Instantiate MCMC sampling algorithm
    trace = sample(2000, step, start=start, progressbar=False)  # draw 2000 posterior samples using NUTS sampling
This raises the error: LinAlgError: Last 2 dimensions of the array must be square
So I'm trying to pass MvNormal the measured values for x and y (the mus) and their associated measurement uncertainties (y_err and x_err). But it appears that it does not like the 2D tau argument.
Any ideas? This must be possible.
Thanks
You may try adapting the following model. It is a "regular" linear regression, but x and y have been replaced by Gaussian distributions. Here I am assuming we have not only the measured values of the input and output variables but also a reliable estimate of their errors (for example, as provided by a measurement device). If you do not trust those error values, you may instead try to estimate them from the data.
with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)
    obs_x = pm.Normal('obs_x', mu=x, sd=x_err, shape=len(x))
    obs_y = pm.Normal('obs_y', mu=y, sd=y_err, shape=len(y))
    likelihood = pm.Normal('y', mu=intercept + gradient * obs_x,
                           sd=epsilon, observed=obs_y)
    trace = pm.sample(2000)
If you are estimating the errors from the data, it could be reasonable to assume they are correlated, and hence, instead of using two separate Gaussians, you can use a multivariate Gaussian. In that case you will end up with a model like the following:
df_data = pd.DataFrame(data)
cov = df_data.cov()

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)
    obs_xy = pm.MvNormal('obs_xy', mu=df_data, tau=pm.matrix_inverse(cov), shape=df_data.shape)
    yl = pm.Normal('yl', mu=intercept + gradient * obs_xy[:, 0],
                   sd=epsilon, observed=obs_xy[:, 1])
    mu, sds, elbo = pm.variational.advi(n=20000)
    step = pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)
    trace = pm.sample(1000, step=step, start=mu)
Notice that in the previous model the covariance matrix was computed from the data. If you are going to do that, then I think it is better to go with the first model; but if you are instead going to estimate the covariance matrix, then the second model could be a sensible approach.
For the second model I use ADVI to initialize it. ADVI can be a good way to initialize models; it often works much better than find_MAP().
You may also want to check this repository by David Hogg, and the book Statistical Rethinking, where McElreath discusses the problem of doing linear regression including errors in the input and output variables.
I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).
The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (degrees celsius) (think ice cream sales vs temperature, or something similar).
sales temp
215 14.20
325 16.40
185 11.90
332 15.20
406 18.50
522 22.10
412 19.40
614 25.10
544 23.40
421 18.10
445 22.60
408 17.20
And this is my data after it has been normalized:
sales temp
0.06993007 0.174242424
0.326340326 0.340909091
0 0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1 1
0.836829837 0.871212121
0.55011655 0.46969697
0.606060606 0.810606061
0.51981352 0.401515152
My code for the algorithm:
import numpy as np
import pandas as pd
from scipy import stats
class SLRegression(object):
    def __init__(self, learnrate=.01, tolerance=.000000001, max_iter=10000):
        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data : array-like, shape = [m_observations, 2_columns]
        # Initialize local variables.
        converged = False
        m = data.shape[0]
        # Track number of iterations.
        self.iter_ = 0
        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0
        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: %s' % J)
        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])
            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)
            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1
            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s' % new_J)
            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!' % self.iter_)
            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1
            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')
        return self

    def point_forecast(self, x):
        # Given feature value x, return the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))
    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()
    # Call the fit function and pass in the data.
    slregression.fit(data)
    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.'
          % (slregression.iter_, slregression.theta0_, slregression.theta1_))
    # Compare our model to the scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:, 1], data[:, 0])
    print('Scipy linear regression gives intercept: %s and slope = %s.' % (intercept, slope))
    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.' % slregression.point_forecast(.87))  # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')
I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument..), wouldn't I want them to simply be able to plug in the un-normalized x-value?
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?
Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005
That's kind of to be expected the way you have your algorithm set up.
The normalization of the data makes it so the y-intercept of the best fit is around 0.0. Otherwise, you could have a y-intercept thousands of units off of the starting guess, and you'd have to trek there before you ever really started the optimization part.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?
No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m, b) = (0, 0)). Your learnrate may also be too small if you don't normalize your data, and the same goes for your tolerance.
Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?
Apply whatever transformation you applied to the original data to get the normalized data to your new x-value. (The normalization code is outside of what you have shown.) If this test point falls within the (min_x, max_x) range of your original data, then once transformed it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are the m and b of the slope-intercept form of the equation of a line).
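For example, assuming min-max normalization was used (which matches the normalized table in the question), a sketch of forecasting for a new temperature and mapping the prediction back to dollars might look like this (slregression is the fitted model from the question):

import numpy as np

temp = np.array([14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2])
sales = np.array([215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408])

def normalize(value, col):
    # Map a raw value into [0, 1] using the training data's min and max.
    return (value - col.min()) / (col.max() - col.min())

def denormalize(value, col):
    # Map a [0, 1] prediction back to the original units.
    return value * (col.max() - col.min()) + col.min()

x_new = normalize(22.0, temp)                 # e.g. forecast for 22 degrees celsius
y_norm = slregression.point_forecast(x_new)   # prediction in normalized units
y_new = denormalize(y_norm, sales)            # prediction in dollars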
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.
For a well-formed problem, if you're in fact diverging it often means your step size is too large. Try lowering it.
If it's simply not converging before it hits the maximum iterations, that could be due to a few issues:
Your step size is too small,
Your tolerance is too small,
Your max iterations is too small,
Your starting point is poorly chosen.
In your case, using the non-normalized data results in your starting point of (0, 0) being very far off (the (m, b) of the non-normalized data is around (-159, 30), while the (m, b) of your normalized data is (0.10, 0.79)), so most if not all of your iterations are being used just getting to the area of interest.
The problem with this is that increasing the step size in order to reach the area of interest faster also makes it less likely to find convergence once it gets there.
To account for this, some gradient descent algorithms have dynamic step size (or learnrate) such that large steps are taken at the beginning, and smaller ones as it nears convergence.
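A minimal sketch of such a schedule (hypothetical, not part of the original answer; the decay constant is arbitrary):

def decayed_rate(initial_rate, iteration, decay=1e-3):
    # Large steps early on, progressively smaller steps as the iteration count grows.
    return initial_rate / (1.0 + decay * iteration)

# Inside fit(), the update would then use decayed_rate(self.learnrate, self.iter_)
# in place of the fixed self.learnrate.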
It may also be helpful to keep a history of the theta pairs throughout the algorithm and then plot them. You'll immediately see the difference between using normalized and non-normalized input data.
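For instance (a hypothetical sketch, assuming you add a history_ list in fit() and append (self.theta0_, self.theta1_) on every iteration), the descent path can then be plotted:

import matplotlib.pyplot as plt

theta0_path, theta1_path = zip(*slregression.history_)
plt.plot(theta0_path, theta1_path, marker='.')
plt.xlabel('theta0')
plt.ylabel('theta1')
plt.title('Gradient descent path')
plt.show()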
Hi,
Trying to learn pymc3 (never learned pymc2, so jumping into the new stuff), and I suspect there is a very simple example/pseudocode for what I'm trying to do. Wondering if someone can help me out, as the past few hours I've not made much progress...
My problem is to sample from a posterior in a rather straightforward manner. Let "x" be a vector, "t(x)" be a function (R^n --> R^n map) of that vector, and "D" be some observed data. I want to sample vectors x from
P( x | D ) \propto P( D | x ) P(x)
Usual Bayesian stuff. An example of how to do this using NUTS would be spectacular! My main problem seems to be getting the function t(x) to work appropriately and having the model return samples from the posterior (rather than the prior).
Any and all help/hints appreciated. In the mean time I'll continue to try stuff out.
Best,
TJ
Your notation is a little confusing to me, but if I understand correctly, you want to sample from the likelihood (some function of the parameters and data) times the prior. And I agree - that's typical Bayesian stuff.
I think Bayesian logistic regression is a good example since we can't solve it analytically. Let's say our model is the following:
B ~ Normal(0, sigma2 * I)
p(y_i | B) = p_i ^ {y_i} (1 - p_i) ^{1 - y_i}
Where y_i is observed and p_i = 1 / (1 + exp(-z_i)) and
z_i = B_0 + B_1 * x_i
We'll assume sigma2 is known. After we load data into numpy arrays x and y, we can sample from the posterior with the following:
import pymc3 as pm
import theano.tensor as t

# (x and y are assumed to be already loaded as numpy arrays.)
with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)

    # Likelihood
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-(b0 + b1*x))), observed=y)

    # Sample from the posterior
    trace = pm.sample(10000, pm.NUTS(), progressbar=False)
To see a full example, check out this iPython notebook:
http://nbviewer.ipython.org/gist/jbencook/9295751c917941208349
pymc3 also has a nice glm syntax. You can see how that works here:
http://jbencook.github.io/portfolio/bayesian_logistic_regression.html