Multivariate linear regression in pymc3 - python

I've recently started learning pymc3 after exclusively using emcee for ages, and I'm running into some conceptual problems.
I'm practising with Chapter 7 of Hogg's Fitting a Model to Data, which involves an MCMC fit to a straight line with arbitrary two-dimensional uncertainties. I've accomplished this quite easily in emcee, but pymc3 is giving me some problems.
It essentially boils down to using a multivariate Gaussian likelihood.
Here is what I have so far:
from pymc3 import *
import numpy as np
import matplotlib.pyplot as plt
size = 200
true_intercept = 1
true_slope = 2
true_x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * true_x
# add noise
# here the errors are all the same, but in the real world they usually are not!
std_y, std_x = 0.1, 0.1
y = true_regression_line + np.random.normal(scale=std_y, size=size)
x = true_x + np.random.normal(scale=std_x, size=size)
y_err = np.ones_like(y) * std_y
x_err = np.ones_like(x) * std_x
data = dict(x=x, y=y)
with Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    intercept = Normal('Intercept', 0, sd=20)
    gradient = Normal('gradient', 0, sd=20)

    # Define likelihood
    likelihood = MvNormal('y', mu=intercept + gradient * x,
                          tau=1. / (np.stack((y_err, x_err)) ** 2.), observed=y)

    # start the mcmc!
    start = find_MAP()          # Find starting value by optimization
    step = NUTS(scaling=start)  # Instantiate MCMC sampling algorithm
    trace = sample(2000, step, start=start, progressbar=False)  # draw 2000 posterior samples using NUTS sampling
This raises the error: LinAlgError: Last 2 dimensions of the array must be square
So I'm trying to pass MvNormal the measured values of x and y (as the mus) along with their associated measurement uncertainties (y_err and x_err), but it does not seem to like the 2-d tau argument.
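For reference, a quick shape check on what I am passing (just a diagnostic sketch, using the arrays defined above) confirms the mismatch:
import numpy as np

tau_arg = 1. / (np.stack((y_err, x_err)) ** 2.)
print(tau_arg.shape)  # (2, 200) -- not square, which is what the LinAlgError complains about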
Any ideas? This must be possible. Thanks!

You may try adapting the following model. It is a "regular" linear regression, but x and y have been replaced by Gaussian distributions. Here I am assuming not only the measured values of the input and output variables but also a reliable estimate of their errors (for example, as provided by a measurement device). If you do not trust those error values, you may instead try to estimate them from the data.
import pymc3 as pm

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)
    obs_x = pm.Normal('obs_x', mu=x, sd=x_err, shape=len(x))
    obs_y = pm.Normal('obs_y', mu=y, sd=y_err, shape=len(y))
    likelihood = pm.Normal('y', mu=intercept + gradient * obs_x,
                           sd=epsilon, observed=obs_y)
    trace = pm.sample(2000)
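As a quick sanity check (optional, and just a sketch using the standard PyMC3 summary and trace-plot helpers), you can compare the recovered intercept and gradient with the simulated values:
print(pm.summary(trace))
pm.traceplot(trace)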
If you are estimating the errors from the data, it is reasonable to assume they could be correlated, and hence, instead of using two separate Gaussians, you can use a multivariate Gaussian. In that case you will end up with a model like the following:
import pandas as pd

df_data = pd.DataFrame(data)
cov = df_data.cov()

with pm.Model() as model:
    intercept = pm.Normal('intercept', 0, sd=20)
    gradient = pm.Normal('gradient', 0, sd=20)
    epsilon = pm.HalfCauchy('epsilon', 5)
    obs_xy = pm.MvNormal('obs_xy', mu=df_data, tau=pm.matrix_inverse(cov),
                         shape=df_data.shape)
    yl = pm.Normal('yl', mu=intercept + gradient * obs_xy[:, 0],
                   sd=epsilon, observed=obs_xy[:, 1])

    mu, sds, elbo = pm.variational.advi(n=20000)
    step = pm.NUTS(scaling=model.dict_to_array(sds), is_cov=True)
    trace = pm.sample(1000, step=step, start=mu)
Notice that in the previous model the covariance matrix was computed from the data. If you are going to do that, then I think it is better to go with the first model; but if instead you are going to estimate the covariance matrix, then the second model could be a sensible approach.
For the second model I use ADVI to initialize it. ADVI can be a good way to initialize models; it often works much better than find_MAP().
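As a side note, in more recent PyMC3 releases you can also request ADVI initialization directly from pm.sample (a minimal sketch, assuming a PyMC3 version that supports the init argument):
with model:
    trace = pm.sample(1000, init='advi', tune=1000)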
You may also want to check this repository by David Hogg, and the book Statistical Rethinking, where McElreath discusses the problem of doing linear regression including errors in the input and output variables.

Related

How to write a basic Pymc3 model? A minor substitution in the tutorial is problematic

I'm trying to infer two parameters (beta and gamma) given a deterministic equation and simulated noisy data. For some reason, the equation I'm using seems to be problematic, as I just copied the basic pymc3 tutorial and substituted my own deterministic equation. Here is the model I'm using:
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# True parameter values
beta, gamma = 0.21, 0.07

# Size of dataset
days = 50

# Predictor variable
time = np.arange(0, days, 1)

# Simulate outcome variable
data = []
for t in time:
    data.append(beta / (beta - gamma) * (np.exp(t * (beta - gamma)) - 1) + 1
                + np.random.normal(0, 1))

basic_model = pm.Model()

def smodel(beta, gamma):
    s = beta / (beta - gamma) * (tt.exp(time * (beta - gamma)) - 1) + 1
    return s

with basic_model:
    # Priors for unknown model parameters
    beta = pm.Normal("beta", mu=0, sigma=10)
    gamma = pm.Normal("gamma", mu=0, sigma=10)

    # Expected value of outcome
    # smodel_pm = pm.Deterministic('smodel', smodel(inputParam))
    y_obs = pm.Normal('obs', mu=smodel(beta, gamma), sigma=1, observed=data)

    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis())
However, when I run a summary of the trace, I'm getting 0's for everything. Does anyone know what the issue is?
The parameterization is rather poor here (correlated variables, symmetric solutions in the domain), and on top of that Metropolis-Hastings simply needs to run for a long time, whereas the default settings assume NUTS.
Here's a suggested alternative parameterization, plus tuning and draw counts more reasonable for this sampling strategy:
basic_model = pm.Model()

def smodel(a, b):
    s = a * (tt.exp(b * time) - 1) + 1
    return s

with basic_model:
    # priors for pre-transformed model parameters
    a = pm.Normal("a", mu=0, sigma=10)
    b = pm.HalfNormal("b", sigma=10)

    # (transformed) parameters of interest
    beta = pm.Deterministic("beta", a * b)
    gamma = pm.Deterministic("gamma", (a - 1) * b)

    # expected value of outcome
    y_obs = pm.Normal('obs', mu=smodel(a, b), sigma=1, observed=data)

    # Draw the specified number of samples
    trace = pm.sample(step=pm.Metropolis(), tune=100000, draws=50000)
One could probably increase the draws even further, since the effective sample sizes (ESS) are so small, but the numbers for the parameters of interest come out about where they should be.
Looking at the pair plots, one can see that the correlation between a and b is still very high, which explains why the samples are so strongly auto-correlated.
The traces and densities (per chain) look decent to me.
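For completeness, here is a sketch of the diagnostics referred to above (assuming ArviZ is installed alongside PyMC3):
import arviz as az

with basic_model:
    print(az.summary(trace, var_names=["beta", "gamma"]))  # effective sample sizes and r_hat
    az.plot_pair(trace, var_names=["a", "b"])               # shows the a-b correlation mentioned above
    az.plot_trace(trace, var_names=["beta", "gamma"])       # traces and densities per chain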

MLE in python for 2 parameters

I have a data set X which I need to use to estimate the parameters by MLE. I have the log-likelihood function
def llh(alpha, beta):
    a = [0] * 999
    for i in range(1, 1000):
        a[i-1] = (-0.5) * (((1 / beta) * (X[i] - np.sin(alpha * X[i-1]))) ** 2)
    return sum(a)
I need to maximise this, but I have no idea how. I can only think of plotting 3D graphs to find the maximum point, but that gives me weird answers that are not what I want.
This is the plot I got.
Is there any other way to get the maximising parameters, or am I going about this the wrong way? My data follow the model X_k = sin(alpha * X_{k-1}) + beta * W_k, where W_k is normally distributed with mean 0 and sigma 1. Any help would be appreciated. Thank you!
You have to find the maximum of your likelihood numerically. In practice this is done by computing the negative log-likelihood and using numerical minimization to find the parameters of your model that best describe your data. Use scipy.optimize.minimize for the minimization.
I implemented a short example for normally distributed data:
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_llh(popt, X):
    return -np.log(norm.pdf(X, loc=popt[0], scale=popt[1])).sum()

# example data
X = np.random.normal(loc=5, scale=2, size=1000)

# minimize the negative log-likelihood
res = minimize(neg_llh, x0=[2, 2], args=(X,))
print(res.x)
# array([5.10023503, 2.01174199])
Since you are summing individual terms, the function you defined above is already the log-likelihood (up to additive constants); what you want to minimize is its negative:
def neg_llh(popt, X):
    alpha, beta = popt
    resid = X[1:] - np.sin(alpha * X[:-1])   # X_k - sin(alpha * X_{k-1})
    # Gaussian negative log-likelihood (up to a constant); the log(beta) term
    # is needed so that the estimate of beta does not run off to infinity
    return np.sum(0.5 * (resid / beta) ** 2 + np.log(beta))
Try minimizing this negative log-likelihood. Using your plot you can make a good initial guess (x0) for the values of alpha and beta.
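A minimal sketch of the minimization step (assuming X is your data array; the initial guess x0 used here is a placeholder that you would read off your plot):
from scipy.optimize import minimize

# Nelder-Mead is gradient-free, which is convenient for this hand-written likelihood
res = minimize(neg_llh, x0=[1.0, 1.0], args=(X,), method="Nelder-Mead")
alpha_hat, beta_hat = res.x
print(alpha_hat, beta_hat)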

Incorporate known data moments into GPFlow fitting

I have recently been working with gpflow, in particular Gaussian process regression, to model a process for which I have access to approximated moments for each input. I have a vector of input values X of size (N,1) and a vector of responses Y of size (N,1). However, I also know, for each (x, y) pair, an approximation of the associated variance, skewness, kurtosis and so on of the particular y value.
From this, I know properties that tell me which likelihoods are appropriate for each data point.
In the simplest case, I just assume all likelihoods are Gaussian and specify the variance at each point. I've created a minimal example of my code by adapting the tutorial at https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/advanced/varying_noise.ipynb#Demo-2:-grouped-noise-variances.
import numpy as np
import gpflow

def generate_data(N=100):
    X = np.random.rand(N)[:, None] * 10 - 5   # Inputs, shape N x 1
    F = 2.5 * np.sin(6 * X) + np.cos(3 * X)   # Mean function values
    groups = np.arange(0, N, 1).reshape(-1, 1)
    NoiseVar = np.array([i / 100.0 for i in range(N)])[groups]
    Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)  # Noisy data
    return X, Y, groups, NoiseVar

# Get data
X, Y, groups, NoiseVar = generate_data()
Y_data = np.hstack([Y, groups])

# Generate one likelihood per data-point
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])])

# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)

# Specify the likelihood as non-trainable
model.likelihood.set_trainable(False)

# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
session = model.enquire_session()
session.run(natgrad_tensor)

# update the cache of the variational parameters in the current session
model.anchor(session)

# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False

# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)

for i in range(200):
    session.run(natgrad_tensor)
    session.run(adam_tensor)

# update the cache of the parameters in the current session
model.anchor(session)

print(model)
The above code works for a Gaussian likelihood with known variances. Inspecting my real data, I see that it is often skewed, and as a result I want to use non-Gaussian likelihoods to model it, but I am unsure how to specify the parameters of these other likelihoods given what I know.
So my question is: given this setup, how can I adapt my code to include non-Gaussian likelihoods at each step, in particular specifying and fixing their parameters based on the known variance, skewness, kurtosis and so on associated with each individual y value?
Firstly, you will need to choose which non-Gaussian likelihood you use. GPflow includes various ones in likelihoods.py. You then need to adapt the line
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
to give a list of your non-Gaussian likelihoods.
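For example, if a heavier-tailed likelihood is a reasonable choice for your data, the per-point construction could look like this (just a sketch, assuming the same GPflow 1.x API as in your code, with the Student-t scale fixed from the known per-point variances):
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.StudentT(scale=np.sqrt(NoiseVar[i])) for i in range(Y.shape[0])]
)
# note: a Student-t with df degrees of freedom has variance scale**2 * df / (df - 2),
# so the scale may need adjusting if you want it to match the known variance exactly;
# it is also symmetric, so genuinely skewed data may need a custom likelihood (see below)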
Which likelihood can take advantage of your skewness and kurtosis information is a statistical question. Depending on what you come up with, you may need to implement your own likelihood class, which can be done by inheriting from Likelihood. You should be able to follow some other examples from likelihoods.py.
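If you do end up writing your own class, a very rough skeleton might look like the following (a sketch only, assuming the same GPflow 1.x API as in the question; the class name is made up, and the Gaussian density inside logp is just a placeholder for whatever skewed distribution you settle on, with its parameters fixed from your known moments):
import numpy as np
import tensorflow as tf
import gpflow

class FixedMomentLikelihood(gpflow.likelihoods.Likelihood):
    def __init__(self, fixed_scale, **kwargs):
        super().__init__(**kwargs)
        # known from the data moments, so stored as a plain constant rather
        # than a trainable gpflow Parameter
        self.fixed_scale = fixed_scale

    def logp(self, F, Y):
        # placeholder Gaussian log-density; replace with your chosen skewed distribution
        return (-0.5 * tf.square((Y - F) / self.fixed_scale)
                - np.log(self.fixed_scale) - 0.5 * np.log(2 * np.pi))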

Bad quality of Viterbi Algorithm (HMM)

I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that hmmlearn's Viterbi algorithm does much better than my hand-written one (its error rate is below 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code, where I first generate the state and observation sequences Z and X, predict Z with hmmlearn, and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm

# (pi, A, obs_means, obs_covars, T, mean_1, var_1, mean_2, var_2 are defined earlier and omitted here)

# Generate a sequence
model = hmm.GaussianHMM(n_components=2, covariance_type="spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)

## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)

# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1, var_1).pdf(X), st.norm(mean_2, var_2).pdf(X)), axis=1)
delta = np.zeros(shape=(T, 2))
psi = np.zeros(shape=(T, 2))

### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))

### Take everything in log space since values get very low as t -> T
for t in range(1, T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis=1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis=1)

### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure whether I have a big error in my code or whether hmmlearn just uses a more refined Viterbi algorithm. Looking at the falsely predicted states, I have noticed that the impact of the emission probability B seems to be too big, as it causes the states to change too frequently even when the transition probability to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I foolishly used the variances instead of the standard deviations to determine the emission probabilities. After fixing this, I get the same result as the implemented Viterbi algorithm.
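For reference, the corrected line simply passes standard deviations instead of variances to scipy.stats.norm:
B = np.concatenate((st.norm(mean_1, np.sqrt(var_1)).pdf(X),
                    st.norm(mean_2, np.sqrt(var_2)).pdf(X)), axis=1)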

PYMC3: NUTS has difficulty sampling from a hierarchical zero inflated gamma model

I'm trying to replicate the data analysis from a paper by Richard McElreath, in which he fitted the data with a hierarchical zero-inflated Gamma model. The data describe the hunting returns of around 15,000 hunting trips made by about 150 hunters over twenty years. Because a good many trips have zero returns, the model assumes each trip has probability pi of a zero return and probability 1 - pi of a positive return that follows a Gamma distribution with parameters alpha and beta.
The predictor variable is age; the model uses an age polynomial (up to order 3) to model pi and alpha. Since the 15,000 trips belong to 150 individual hunters, each hunter has his own coefficients, and all the coefficients follow a common multivariate normal distribution. For details of the model please refer to the following code. The model specification seems alright, but NUTS has trouble getting started: it produced only about 10 samples after roughly 20 minutes, the sampler essentially stalled there, and it estimated that finishing would take hundreds of hours. I want to know what is causing the problem.
The usual imports
import pymc3 as pm
import numpy as np
from pymc3.distributions import Continuous, Gamma
import theano.tensor as tt
The data can be obtained from github
n_trip = len(d)
n_hunter = len(d['hunter.id'].unique())
idx_hunter = d['hunter.id'].values
y = d['kg.meat'].values
age = d['age.s'].values
age2 = (d['age.s'].values)**2
age3 = (d['age.s'].values)**3
The log probability density function for the zero-inflated Gamma:
class ZeroInflatedGamma(Continuous):
    def __init__(self, alpha, beta, pi, *args, **kwargs):
        super(ZeroInflatedGamma, self).__init__(*args, **kwargs)
        self.alpha = alpha
        self.beta = beta
        self.pi = pi = tt.as_tensor_variable(pi)
        self.gamma = Gamma.dist(alpha, beta)

    def logp(self, value):
        return tt.switch(value > 0,
                         tt.log(1 - self.pi) + self.gamma.logp(value),
                         tt.log(self.pi))
This is a matrix of indices used to expand the LKJ prior, which pymc3 returns as a one-dimensional vector of upper-triangular elements, into a full 9x9 correlation matrix:
dim = 9
n_elem = dim * (dim - 1) / 2
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
And here is the model
with pm.Model() as Vary9_model:
    # hyper-priors
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sigma_a = pm.HalfCauchy('sigma_a', 5, shape=9)

    # build the covariance matrix
    C_triu = pm.LKJCorr('C_triu', n=2, p=9)
    C = tt.fill_diagonal(C_triu[tri_index], 1)
    sigma_diag = tt.nlinalg.diag(sigma_a)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)

    # priors for each hunter and all the linear components, 9-dimensional Gaussian
    a = pm.MvNormal('a', mu=mu_a, cov=cov, shape=(n_hunter, 9))

    # linear functions
    mupi = a[:,0][idx_hunter] + a[:,1][idx_hunter] * age + a[:,2][idx_hunter] * age2 + a[:,3][idx_hunter] * age3
    mualpha = a[:,4][idx_hunter] + a[:,5][idx_hunter] * age + a[:,6][idx_hunter] * age2 + a[:,7][idx_hunter] * age3

    pi = pm.Deterministic('pi', pm.math.sigmoid(mupi))
    alpha = pm.Deterministic('alpha', pm.math.exp(mualpha))
    beta = pm.Deterministic('beta', pm.math.exp(a[:,8][idx_hunter]))

    y_obs = ZeroInflatedGamma('y_obs', alpha, beta, pi, observed=y)

    Vary9_trace = pm.sample(6000, njobs=2)
And this is the status of the model:
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -28,366: 100%|██████████| 200000/200000 [15:36<00:00, 213.57it/s]
Finished [100%]: Average ELBO = -28,365
0%| | 22/6000 [15:51<63:49:25, 38.44s/it]
I have some thoughts on the problem, but I am not sure which of them is the actual cause.
Is the nine-dimensional Gaussian too difficult to sample from? I previously modelled only the intercepts of mualpha and mupi as a bivariate Gaussian; it was slow but worked (the model fitting took about 20 minutes).
Is it the probability density that's causing the problem? I wrote the density function myself and am not sure whether it is working correctly. I think the density is not differentiable at zero; will this cause trouble for the NUTS sampler?
Is it because the predictor variables are highly correlated? The linear components in this model are polynomials of age up to the third degree, so the predictors are naturally highly correlated.
Or maybe it's something else?
As a side note, I tried the Metropolis sampler; my computer ran out of memory and the chains still hadn't converged.
The ZeroInflatedGamma looks fine. The density function is differentiable with respect to pi, alpha and beta. That is all you need for an observed variable. You only need the derivatives with respect to the value if you are trying to estimate the values.
There was an issue in the implementation of LKJCorr:
https://github.com/pymc-devs/pymc3/pull/1863
You could try again on master. Sadly, pymc3 does not have support for using MvNormal and LKJCorr in a Cholesky-decomposed parametrization, which might also help. There is a work-in-progress pull request for this on GitHub:
https://github.com/pymc-devs/pymc3/pull/1875
To improve convergence you could try a non-centered parameterization for a. Something along the lines of
a_raw = pm.Normal('a_raw', shape=(n_hunter, 9))
chol = tt.slinalg.cholesky(cov)
# gives `a` the same (n_hunter, 9) shape used in the model above
a = pm.Deterministic('a', mu_a + tt.dot(a_raw, chol.T))
Of course this would be faster if we had that cholesky LKJCorr...
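For what it is worth, I believe later PyMC3 releases added pm.LKJCholeskyCov for exactly this. A sketch of how the hyper-priors and a could then be written (assuming a PyMC3 version that includes it; the rest of the model stays as above):
with pm.Model() as Vary9_chol:
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sd_dist = pm.HalfCauchy.dist(5)
    packed_chol = pm.LKJCholeskyCov('packed_chol', n=9, eta=1., sd_dist=sd_dist)
    chol = pm.expand_packed_triangular(9, packed_chol, lower=True)
    a_raw = pm.Normal('a_raw', shape=(n_hunter, 9))
    a = pm.Deterministic('a', mu_a + tt.dot(a_raw, chol.T))
    # ...rest of the model as above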
