Posterior Sampling in pymc3 - python

Hi,
Trying to learn pymc3 (I never learned pymc2, so I'm jumping straight into the new stuff), and I suspect there is a very simple example/pseudocode for what I'm trying to do. Wondering if someone can help me out, as I've made little progress over the past few hours...
My problem is to sample from a posterior in a rather straightforward manner. Let "x" be a vector, "t(x)" be a function (R^n --> R^n map) of that vector, and "D" be some observed data. I want to sample vectors x from
P(x | D) ∝ P(D | x) P(x)
Usual Bayesian stuff. An example of how to do this using NUTS would be spectacular! My main problem seems to be getting the function t(x) to work appropriately and having the model return samples from the posterior (rather than the prior).
Any and all help/hints appreciated. In the meantime I'll continue to try things out.
Best,
TJ

Your notation is a little confusing to me, but if I understand correctly, you want to sample from the likelihood (some function of the parameters and data) times the prior. And I agree - that's typical Bayesian stuff.
I think Bayesian logistic regression is a good example since we can't solve it analytically. Let's say our model is the following:
B ~ Normal(0, sigma2 * I)
p(y_i | B) = p_i^{y_i} (1 - p_i)^{1 - y_i}
where y_i is observed, p_i = 1 / (1 + exp(-z_i)), and
z_i = B_0 + B_1 * x_i
We'll assume sigma2 is known. After we load data into numpy arrays x and y, we can sample from the posterior with the following:
import pymc3 as pm
import theano.tensor as t

with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)
    # Likelihood
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-(b0 + b1*x))), observed=y)
    # Sample from the posterior
    trace = pm.sample(10000, pm.NUTS(), progressbar=False)
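To convince yourself these are draws from the posterior rather than the prior, you can run the model on synthetic data and check that the posterior means recover the generating coefficients. A minimal self-contained sketch (the generating values -1 and 2 and the sample size are made up for illustration):

import numpy as np
import pymc3 as pm
import theano.tensor as t

# Made-up generating process: logistic regression with b0 = -1, b1 = 2
np.random.seed(0)
x = np.random.randn(500)
y = np.random.binomial(1, 1 / (1 + np.exp(-(-1.0 + 2.0 * x))))

with pm.Model() as model:
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-(b0 + b1 * x))), observed=y)
    trace = pm.sample(10000, pm.NUTS(), progressbar=False)

# Posterior means should land near -1 and 2; prior draws would not.
print(trace["b0"].mean(), trace["b1"].mean())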
To see a full example, check out this IPython notebook:
http://nbviewer.ipython.org/gist/jbencook/9295751c917941208349
pymc3 also has a nice glm syntax. You can see how that works here:
http://jbencook.github.io/portfolio/bayesian_logistic_regression.html

Related

How to add an independent variable constant to OLS formula in statsmodels?

I'm currently working on a data set that is defined by the quadratic function
y = b0 - b1*(x + c)**2
where b0, b1, and c are non-zero constants I'm hoping to find.
Whilst statsmodels' formula ols can fit a curve through my x and y values with the formula 'y ~ I(x**2)', the fitted parabola is stubbornly stuck with its turning point at x = 0.
For example, here is a graph showing the ols curve fit (orange) alongside random data with no residual error and the original function (blue):
[graph]
I've been reading the patsy documentation, but I have not been able to find anything of help so far.
I'd really appreciate your help.
I realised after stepping away for 5 minutes how to "solve" this.
The (x + c)**2 can be expanded to x**2 + 2*c*x + c**2, so the model is just a general quadratic in x (terms in x**2, x, and a constant), which ols can fit with the formula 'y ~ x + I(x**2)'.
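Concretely, here is a sketch with made-up constants (b0 = 5, b1 = 2, c = 3): fit the expanded quadratic and map the fitted coefficients back to the original parametrization.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# y = b0 - b1*(x + c)**2 expands to a0 + a1*x + a2*x**2 with
# a2 = -b1, a1 = -2*b1*c, a0 = b0 - b1*c**2
b0, b1, c = 5.0, 2.0, 3.0
x = np.linspace(-10, 4, 50)
df = pd.DataFrame({'x': x, 'y': b0 - b1 * (x + c) ** 2})

fit = smf.ols('y ~ x + I(x**2)', data=df).fit()
a0, a1, a2 = fit.params  # Intercept, x, I(x**2)

# Map back to the original parametrization
b1_hat = -a2
c_hat = a1 / (2 * a2)
b0_hat = a0 + b1_hat * c_hat ** 2
print(b0_hat, b1_hat, c_hat)  # ~ 5.0, 2.0, 3.0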

Coursera ML - Implementing regularized logistic regression cost function in python

I am working through Andrew Ng's Machine Learning on Coursera by implementing all the code in python rather than MATLAB.
In Programming Exercise 3, I implemented my regularized logistic regression cost function in a vectorized form:
def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta**2)
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X @ theta))
                             - (1-y) @ np.log(1 - sigmoid(X @ theta))) + reg
On the following test inputs:
theta_test = np.array([-2,-1,1,2])
X_test = np.concatenate((np.ones((5,1)),
                         np.fromiter((x/10 for x in range(1,16)), float).reshape((3,5)).T), axis = 1)
y_test = np.array([1,0,1,0,1])
lambda_test = 3
the above cost function outputs 3.734819396109744. However, according to the skeleton MATLAB code provided to us, the correct output should be 2.534819. I cannot find anything wrong with my cost function, yet I believe it has a bug. In fact, I used the same implementation for the binary classification case in Programming Exercise 2 and it worked fine, giving a result close to the expected value.
I thought one reason could be that I constructed my *_test input arrays wrongly by misinterpreting the provided skeleton MATLAB code, which is:
theta_t = [-2; -1; 1; 2];
X_t = [ones(5,1) reshape(1:15,5,3)/10];
y_t = ([1;0;1;0;1] >= 0.5);
lambda_t = 3;
However, I ran them through an Octave interpreter to see what they actually are, and made sure I matched them exactly in python.
Furthermore, the gradient computed from these inputs by my own vectorized and regularized gradient function is also correct. Lastly, I decided to just proceed with the computation and examine the prediction results. The accuracy of my predictions was far lower than expected, which gives all the more reason to suspect that something wrong in my cost function is making everything else go wrong.
Help please! Thank you.
If you recall from regularization, you do not regularize the bias coefficient: not only do you leave it out of the regularization term of the gradient when performing gradient descent, you also do not include it in the regularization term of the cost function. You have a slight mistake where you are including it as part of the sum (see cell #18 of the notebook you linked: the sum should start from j = 1, but you have it as j = 0). Therefore, you need to sum over theta from the second element to the end, not the first. You can verify this on page 9 of the ex2.pdf assignment in your Github repo. This explains the inflated cost, as you are including the bias unit in the regularization.
Therefore, when computing the regularization term reg, index theta so that you start from the second element onwards:
def compute_cost_regularized(theta, X, y, lda):
    reg = lda/(2*len(y)) * np.sum(theta[1:]**2)  # Change here
    return 1/len(y) * np.sum(-y @ np.log(sigmoid(X @ theta))
                             - (1-y) @ np.log(1 - sigmoid(X @ theta))) + reg
Once I do this, define your test values, and define your sigmoid function, I get the answer you're expecting:
In [8]: def compute_cost_regularized(theta, X, y, lda):
   ...:     reg = lda/(2*len(y)) * np.sum(theta[1:]**2)
   ...:     return 1/len(y) * np.sum(-y @ np.log(sigmoid(X @ theta))
   ...:                              - (1-y) @ np.log(1 - sigmoid(X @ theta))) + reg
   ...:

In [9]: def sigmoid(z):
   ...:     return 1 / (1 + np.exp(-z))
   ...:

In [10]: theta_test = np.array([-2,-1,1,2])
    ...: X_test = np.concatenate((np.ones((5,1)),
    ...:                          np.fromiter((x/10 for x in range(1,16)), float).reshape((3,5)).T), axis = 1)
    ...: y_test = np.array([1,0,1,0,1])
    ...: lambda_test = 3
    ...:

In [11]: compute_cost_regularized(theta_test, X_test, y_test, lambda_test)
Out[11]: 2.5348193961097438

Gradient Descent Algorithm in Python

I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i], feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
(get_numpy_data is just a function to convert everything into arrays and works as intended.)
Update: I fixed the formula to:
derivative = 2 * np.dot(errors, feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors, feature_matrix)
and I'm not sure why that sign did not give the correct answer. (Most likely the course derivation defines the errors as output - predictions, the opposite of the errors = predictions - output computed above, which flips the sign of the gradient.)
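For reference, here is the whole function after the fix, fully vectorized; a sketch that keeps my errors = predictions - output convention:

import numpy as np

def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    weights = np.array(initial_weights, dtype=float)
    while True:
        errors = np.dot(feature_matrix, weights) - output  # predictions - output
        gradient = 2 * np.dot(errors, feature_matrix)      # gradient of RSS w.r.t. weights
        weights -= step_size * gradient
        if np.sqrt(np.dot(gradient, gradient)) < tolerance:
            return weights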
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is determined by trial and error: the "natural" step size α = 1 might lead to divergence, so one can lower the value (e.g. taking α = 1/2, α = 1/4, etc.) until convergence is achieved. Don't start with a very small step size.
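For instance, a generic sketch of that trial-and-error loop (grad stands in for whatever computes your gradient; divergence is detected with a crude finiteness check):

import numpy as np

def find_step_size(grad, w0, alpha=1.0, n_steps=1000):
    # Halve alpha until a full run stays finite, i.e. does not diverge.
    while alpha > 1e-16:
        w = np.array(w0, dtype=float)
        for _ in range(n_steps):
            w -= alpha * grad(w)
            if not np.all(np.isfinite(w)):  # diverged: try a smaller step
                break
        else:
            return alpha, w
        alpha /= 2
    raise ValueError("no suitable step size found")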

PYMC3: NUTS has difficulty sampling from a hierarchical zero inflated gamma model

I'm trying to replicate the data analysis from a paper by Richard McElreath, in which he fitted the data with a hierarchical zero inflated Gamma model. The data cover the hunting returns of around 15000 hunting trips by about 150 hunters over twenty years. Because a good many hunting trips have zero returns, the model assumes each trip has probability pi of zero returns and probability 1 - pi of positive returns, which follow a Gamma distribution with parameters alpha and beta.
The predictor variable is age; the model uses an age polynomial (up to order 3) to model pi and alpha. And since the 15000 trips belong to 150 individual hunters, each hunter has coefficients of his own, and all the coefficients follow a common multivariate normal distribution. For details of the model please refer to the following code. The model specification seems alright, but NUTS has trouble getting started: it produced only about 10 samples after about 20 minutes, the sampler just stalled there, and it estimated hundreds of hours to finish the sampling. I want to know what is causing the problem.
The usual imports
import pymc3 as pm
import numpy as np
from pymc3.distributions import Continuous, Gamma
import theano.tensor as tt
The data can be obtained from github
n_trip = len(d)
n_hunter = len(d['hunter.id'].unique())
idx_hunter = d['hunter.id'].values
y = d['kg.meat'].values
age = d['age.s'].values
age2 = (d['age.s'].values)**2
age3 = (d['age.s'].values)**3
The log probability density function for Zero inflated Gamma.
class ZeroInflatedGamma(Continuous):
    def __init__(self, alpha, beta, pi, *args, **kwargs):
        super(ZeroInflatedGamma, self).__init__(*args, **kwargs)
        self.alpha = alpha
        self.beta = beta
        self.pi = pi = tt.as_tensor_variable(pi)
        self.gamma = Gamma.dist(alpha, beta)

    def logp(self, value):
        return tt.switch(value > 0,
                         tt.log(1 - self.pi) + self.gamma.logp(value),
                         tt.log(self.pi))
This is an index matrix used to expand the LKJ correlation prior, which pymc3 returns as a one-dimensional vector, into a 9x9 matrix
dim = 9
n_elem = dim * (dim - 1) // 2  # integer division so the element count is an int
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
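To see what this index matrix does, here is the same construction for dim = 3:

dim = 3
n_elem = dim * (dim - 1) // 2  # 3
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
print(tri_index)
# [[0 0 1]
#  [0 0 2]
#  [1 2 0]]
# C_triu[tri_index] spreads the packed correlations into a symmetric matrix;
# the meaningless diagonal entries are then overwritten by tt.fill_diagonal(..., 1).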
And here is the model
with pm.Model() as Vary9_model:
    # hyper-priors
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sigma_a = pm.HalfCauchy('sigma_a', 5, shape=9)

    # build the covariance matrix
    C_triu = pm.LKJCorr('C_triu', n=2, p=9)
    C = tt.fill_diagonal(C_triu[tri_index], 1)
    sigma_diag = tt.nlinalg.diag(sigma_a)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)

    # priors for each hunter and all the linear components, 9 dimensional Gaussian
    a = pm.MvNormal('a', mu=mu_a, cov=cov, shape=(n_hunter, 9))

    # linear function
    mupi = a[:,0][idx_hunter] + a[:,1][idx_hunter] * age + a[:,2][idx_hunter] * age2 + a[:,3][idx_hunter] * age3
    mualpha = a[:,4][idx_hunter] + a[:,5][idx_hunter] * age + a[:,6][idx_hunter] * age2 + a[:,7][idx_hunter] * age3

    pi = pm.Deterministic('pi', pm.math.sigmoid(mupi))
    alpha = pm.Deterministic('alpha', pm.math.exp(mualpha))
    beta = pm.Deterministic('beta', pm.math.exp(a[:,8][idx_hunter]))

    y_obs = ZeroInflatedGamma('y_obs', alpha, beta, pi, observed=y)

    Vary9_trace = pm.sample(6000, njobs=2)
And this is the status of the model:
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -28,366: 100%|██████████| 200000/200000 [15:36<00:00, 213.57it/s]
Finished [100%]: Average ELBO = -28,365
0%| | 22/6000 [15:51<63:49:25, 38.44s/it]
I have some thoughts on the problem, but am not sure which might be the reason.
Is the nine dimensional Gaussian too difficult to sample from? I previously modeled only the intercepts for mualpha and mupi as a bivariate Gaussian; it was slow but worked (the model fitting took about 20 minutes).
Is it the probability density that's causing the problem? I wrote the density function myself and am not sure it's behaving well. I think the density function is not differentiable at zero; will this cause trouble for the NUTS sampler?
Is it because the predictor variables are highly correlated? The linear components in this model are polynomials in age up to the third degree, so the predictors are naturally highly correlated.
Or maybe it's because of something else?
As a side note, I tried the Metropolis sampler: my computer ran out of memory and the chains still hadn't converged.
The ZeroInflatedGamma looks fine. The density function is differentiable with respect to pi, alpha and beta. That is all you need for an observed variable. You only need the derivatives with respect to the value if you are trying to estimate the values.
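If you want to double-check the density numerically, you can evaluate its logp outside a model and compare against a manual computation. A sketch (the parameter values are made up, and this assumes the class and imports from the question):

import numpy as np
from scipy import stats

zig = ZeroInflatedGamma.dist(alpha=2.0, beta=1.0, pi=0.3)
print(zig.logp(0.0).eval())  # should equal log(pi)
print(zig.logp(1.5).eval())  # should equal log(1 - pi) + Gamma(2, 1).logp(1.5)

# Manual reference values (beta is a rate, so scale = 1/beta)
print(np.log(0.3))
print(np.log(0.7) + stats.gamma.logpdf(1.5, a=2.0, scale=1.0))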
There was an issue in the implementation of LKJCorr:
https://github.com/pymc-devs/pymc3/pull/1863
You could try again on master. Sadly, pymc3 does not yet support using MvNormal and LKJCorr in a Cholesky decomposed parametrization, which might help, too. There is a work-in-progress pull request for this on github:
https://github.com/pymc-devs/pymc3/pull/1875
To improve convergence you could try a non-centered parameterization for a. Something along the lines of
a_raw = pm.Normal('a_raw', shape=(9, n_hunter))
a = mu_a[None, :] + tt.dot(tt.slinalg.cholesky(cov), a_raw).T
(the transpose keeps a at shape (n_hunter, 9), matching how the model indexes it).
Of course this would be faster if we had that cholesky LKJCorr...

Maximum likelihood linear regression tensorflow

After implementing a least-squares estimation with gradient descent for a simple linear regression problem, I'm now trying to do the same with maximum likelihood.
I used this equation from wikipedia. The maximum has to be found.
train_X = np.random.rand(100, 1) # all values [0-1)
train_Y = train_X
X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
var = tf.Variable(0.5)
hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
lhf = 1 * (50 * np.log(2*np.pi) + 50 * tf.log(var) + (1/(2*var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2)))
op = tf.train.GradientDescentOptimizer(0.01).minimize(lhf)
This code works, but I still have some questions about it:
If I change the lhf function from 1 * to -1 * and minimize -lhf (according to the equation), it does not work. But why?
The value for lhf goes up and down during optimization. Shouldn't it only change in one direction?
The value for lhf sometimes is a NaN during optimization. How can I avoid that?
In the equation, σ² is the variance of the error (right?). My values are perfectly on a line. Why do I get a value of var above 100?
The symptoms in your question indicate a common problem: the learning rate or step size might be too high for the problem.
The zig-zag behaviour, where the function to be maximized goes up and down, is usual when the learning rate is too high, especially when you get NaNs.
The simplest solution is to lower the learning rate by dividing your current learning rate by 10 until the learning curve is smooth and there are no NaNs or up-and-down behaviour.
As you are using TensorFlow, you can also try AdamOptimizer, as it adjusts the learning rate dynamically as you train.
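For example, a sketch using the same TF 0.x-era API as your code; the var = exp(log_var) parametrization is my addition to keep the variance positive (which also guards against NaNs from tf.log), and the noise term makes the ML variance estimate finite:

import numpy as np
import tensorflow as tf

train_X = np.random.rand(100, 1)
train_Y = train_X + 0.1 * np.random.randn(100, 1)  # noise so var does not collapse to 0

X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
log_var = tf.Variable(0.0)
var = tf.exp(log_var)

hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
# Negative log-likelihood up to an additive constant; minimizing it maximizes the likelihood.
nll = 50 * tf.log(var) + (1 / (2 * var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2))
train_op = tf.train.AdamOptimizer(0.05).minimize(nll)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for _ in range(5000):
        sess.run(train_op, feed_dict={X: train_X, Y: train_Y})
    print(sess.run([theta_0, theta_1, var]))  # roughly 0, 1, 0.01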
