Posterior predictive check in PyMC3, discrete distributions - python

I want to implement a basic capture-recapture model in PyMC3 (you capture 100 animals and mark them, then you release them and, after they have mixed with the rest of the population, you recapture another 100 and note how many are marked). This is my code:
import numpy as np
import pymc3 as pm
import arviz as az

# Data:
K = 100    # marked animals in the first round
n = 100    # captured animals in the second round
obs = 10   # marked animals observed in the second round

with pm.Model() as my_model:
    N = pm.DiscreteUniform("N", lower=K, upper=10000)
    likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)
    trace = pm.sample(10000)
    print(pm.summary(trace))
    print(trace['N'])
    ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"])
    data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc)  # create InferenceData
    az.plot_ppc(data_ppc, figsize=(12, 6))
But in plot_ppc I obtain the error 'var names: "['likelihood'] are not present" in dataset', and also the warning 'posterior predictive variable N's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.'
What is happening and what can I do to obtain a posterior predictive check plot?

The root of all the problems is in this line: ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"]).
By using var_names=["N"] you are telling PyMC to "sample" only the variable N, which is actually a latent variable that was already sampled while sampling the posterior in the pm.sample call. Doing this in the pm.sample_posterior_predictive call tells PyMC not to sample the observed variable (likelihood in this case) and to just copy the samples of N to the posterior predictive group too. You will see that data_ppc is an InferenceData object with multiple groups; N is already in the posterior group (as it is in the trace object).
By using 100 (i.e. samples=100 as a positional argument) you are telling PyMC to draw posterior predictive samples only for the first 100 draws of the first chain. This is a bad idea, so ArviZ prints a warning when converting to InferenceData. You should generate one posterior predictive sample per posterior sample, and only generate samples for a subset of the posterior if posterior predictive sampling were very slow.
My recommendation, which also applies as a general rule, is to trust PyMC's defaults unless you have a reason not to or need results to be reproducible across multiple versions. We update the defaults from time to time to keep them in line with best practices, which are themselves updated and improved over time. You should therefore do ppc = pm.sample_posterior_predictive(trace). PyMC will default to sampling only the likelihood variable and to generating one sample per posterior draw.
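Concretely, a minimal sketch of the corrected workflow, assuming the same model and imports as in the question:

with my_model:
    # Defaults: sample only the observed variable ("likelihood"),
    # one posterior predictive draw per posterior draw.
    ppc = pm.sample_posterior_predictive(trace)
    data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc)
    az.plot_ppc(data_ppc, figsize=(12, 6))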

Related

trouble getting started with simple pymc3 example

I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end gauge of actual (unknown) length at room temperature, and a measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient, dT is the deviation from room temperature, and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty).
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt

basic_model = pm.Model()

# Dummy measurement data (1000 points instead of the 25 in the problem)
xdata = np.random.normal(50.000215, 5.8e-6*np.sqrt(1000), 1000)

with basic_model:
    # Prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1 + alpha*theta)

with basic_model:
    # Likelihood
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis(), start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)

length_samples = trace['length']
fig, ax = plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn't working. I've been trying for a while and it never converges to the expected solution given by the R code. I tried the default sampler (NUTS, I think) as well as Metropolis, but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)

# Data
jags_data <- list(xbar=50.000215)

jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)

post_samples <- coda.samples(model = jags_code,
                             variable.names = c("l", "mu", "alpha", "theta"), #,"ypred"),
                             n.iter = 30000)

summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
    l ~ dnorm(50, 1.0)
    alpha ~ dnorm(0.0000115, 694444444444)
    theta ~ dnorm(-0.1, 625)
    mu <- l*(1 + alpha*theta)
    xbar ~ dnorm(mu, 29726516052)
}
(note: I think the dnorm distribution takes the precision 1/sigma^2 as its second argument, hence the weird-looking values)
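A quick sanity check (my own arithmetic, not part of the question) confirms that the JAGS precisions are indeed 1/sigma^2 of the priors stated above:

print(1 / 0.04**2)      # 625.0       -> theta
print(1 / 1.2e-6**2)    # ~6.944e11   -> alpha
print(1 / 5.8e-6**2)    # ~2.973e10   -> xbar likelihood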
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model, which can lead to highly correlated posterior parameters. Add to that the extreme scale imbalances, and it makes sense that this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of the data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1 + alpha*dT))

    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])

    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think it is noteworthy here that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound onto the length posterior. Ultimately, this is the posterior of interest, so this seems consistent with the intent of the exercise. Also, note that mu comes out in the posterior as exactly the distribution described, namely N(50.000215, 5.8e-6).
[Trace plots, forest plot, and pair plot images omitted]
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue; that is, somehow reparameterize so that all the tau values are closer to 1, then run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
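For what it's worth, here is a minimal sketch of one such reparameterization (my own untested illustration, not part of the original answer): sample unit-scale offsets and transform back deterministically. The tiny likelihood sd would likely still need similar treatment.

with pm.Model() as m1:
    # Unit-scale offsets (non-centered parameterization).
    dT_z = pm.Normal('dT_z', mu=0.0, sd=1.0)
    alpha_z = pm.Normal('alpha_z', mu=0.0, sd=1.0)
    length_z = pm.Normal('length_z', mu=0.0, sd=1.0)

    # Transform back to original units (prior means/sds from the question).
    dT = pm.Deterministic('dT', -0.1 + 0.04 * dT_z)
    alpha = pm.Deterministic('alpha', 1.15e-5 + 1.2e-6 * alpha_z)
    length = pm.Deterministic('length', 50.0 + 1.0 * length_z)

    mu = pm.Deterministic('mu', length * (1 + alpha * dT))
    obs = pm.Normal('obs', mu=mu, sd=5.8e-6, observed=[50.000215])

    trace = pm.sample(2000, tune=2000, target_accept=0.95)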

Get the distribution when a point is sampled from a mixture in PyMC3

I have a model with a pm.NormalMixture(), and when I sample from the normal mixture, I also want to know which of the mixed distributions each point is being sampled from.
import numpy as np
import pymc3 as pm

obs = np.concatenate([np.random.normal(5, 1, 100),
                      np.random.normal(10, 2, 200)])

with pm.Model() as model:
    mu = pm.Normal('mu', 10, 10, shape=2)
    sd = pm.Normal('sd', 10, 10, shape=2)
    x = pm.NormalMixture('x', mu=mu, sd=sd, observed=obs)
I sample from that model, then use that trace to sample from the posterior predictive distribution. What I want to know is, for each x in the posterior predictive trace, which of the two normal distributions it was sampled from. Is that possible in PyMC3 without doing it manually?
This example demonstrates how posterior predictive checks (PPCs) work. The gist of a PPC is that you first draw random samples from the trace. The trace is essentially always multivariate, and in your model a single sample is defined by the vector (mu[i,0], mu[i,1], sd[i,0], sd[i,1]). Then, for each trace sample, you generate random numbers from the distribution specified for the likelihood, with its parameter values set equal to those of the trace sample. In your case, this would be NormalMixture(mu[i,:], sd[i,:]). In your model, x is the likelihood function, not an individual point of the trace.
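As an illustration, here is my own sketch (assuming a trace from pm.sample on the model above, and equal component weights) of one manual posterior predictive draw that also records the component label:

i = np.random.randint(len(trace['mu']))       # pick one posterior sample
comp = np.random.randint(2, size=len(obs))    # component label per point
x_ppc = np.random.normal(loc=trace['mu'][i, comp],
                         scale=np.abs(trace['sd'][i, comp]))  # sd prior can go negative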
Some practical notes:
You haven't specified a weighting variable, so I'm assuming it defaults to weighting the normal distributions equally (I haven't tested this).
The odds of a given point coming from one distribution or the other are just the ratio of the probability densities at that point (see the sketch after these notes).
Check out this for recommendations on how to choose priors. For example, your SD prior places a lot of weight on very large SDs, which would bias your results, especially for smaller datasets.
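A minimal sketch of that density-ratio computation (my own illustration, assuming equal weights and SciPy available):

import numpy as np
from scipy import stats

def component_probs(x, mus, sds, w=(0.5, 0.5)):
    # Posterior probability that each point in x came from each component:
    # w_k * N(x | mu_k, sd_k) / sum_j w_j * N(x | mu_j, sd_j)
    dens = np.array([wk * stats.norm.pdf(x, mu, sd)
                     for wk, mu, sd in zip(w, mus, sds)])
    return dens / dens.sum(axis=0)

# e.g. with posterior-mean parameters:
# probs = component_probs(x_ppc, trace['mu'].mean(axis=0), trace['sd'].mean(axis=0))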
Good luck!

Posterior Predictive Check on PyMC3 Deterministic Variable

TL;DR
What's the right way to do posterior predictive checks on pm.Deterministic variables that take stochastics (rendering the deterministic also stochastic) as input?
Too Short; Didn't Understand
Say we have a pymc3 model like this:
import pymc3 as pm

with pm.Model() as model:
    # Arbitrary, trainable distributions.
    dist1 = pm.Normal("dist1", 0, 1)
    dist2 = pm.Normal("dist2", dist1, 1)

    # Arbitrary, deterministic theano math.
    val1 = pm.Deterministic("val1", arb1(dist2))

    # Arbitrary custom likelihood.
    cdist = pm.DensityDist("cdist", logp(val1), observed=get_data())

    # Arbitrary, deterministic theano math.
    val2 = pm.Deterministic("val2", arb2(val1))
I may be misunderstanding, but my intention is for the posteriors of dist1 and dist2 to be sampled, and for those samples to be fed into the deterministic variables. Is the posterior predictive check only possible on observed random variables?
It's straightforward to get posterior predictive samples from dist2 and other random variables using pymc3.sampling.sample_ppc, but the majority of my model's value is derived from the state of val1 and val2, given those samples.
The problem arises in that pm.Deterministic(.) seems to return a th.TensorVariable. So, when this is called:
ppc = pm.sample_ppc(_trace, vars=[val1, val2])["val1", "val2"]
...and pymc3 attempts this block of code in pymc3.sampling:
410 for var in vars:
--> 411 ppc[var.name].append(var.distribution.random(point=param,
412 size=size))
...it complains because a th.TensorVariable obviously doesn't have a .distribution.
So, what is the right way to carry the posterior samples of stochastics through deterministics? Do I need to explicitly create a th.function that takes stochastic posterior samples and calculates the deterministic values? That seems silly given the fact that pymc3 already has the graph in place.
Yes, I was misunderstanding the purpose of .sample_ppc. You don't need it for unobserved variables, because those already have samples in the trace. Observed variables aren't sampled from during inference, because their data is observed; that's why you need sample_ppc to generate samples for them.
In short, I can gather samples of the pm.Deterministic variables directly from the trace.
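In other words (a minimal sketch, assuming the trace from sampling the model above):

# Deterministic variables are recorded in the trace alongside the stochastics,
# so their posterior samples can be read off directly.
val1_samples = _trace['val1']
val2_samples = _trace['val2']

# sample_ppc is only needed for observed variables such as cdist.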

How to infer the parameters of a 1D gaussian distribution using PyMC?

I'm pretty new to PyMC and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits a distribution of observed data that I have, not with a pre-built normal distribution, but with a more general method using histograms of the simulated data to build pdfs. But so far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5, sigma=2). I want to retrieve these values (mean, sigma) with Bayesian inference (using MCMC).
I have a data simulator that generates, for each iteration of the MCMC process, a normal distribution of 5000 points with a random mean and sigma (uniform priors).
From the simulated distribution of points I compute a numpy histogram normed to 1, representing the pdf of the distribution (Nbins = int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x) = Σ_i ln f(x_i|θ) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I linearly interpolate the histogram values at the bin centers, which gives me a continuous pdf for the simulated distribution. So here f is the interpolated function I build from the histogram of the simulation.
I sum the log(f(x_i)) contributions for every (real) data point and return the log-likelihood value at the end.
But some (real) data points are so far off the mean of the simulated distribution that there f(x_i) = 0. For these points the code raises a math domain error (reminder: log(0) = -inf). So I artificially set the pdf to a small epsilon at the points where it would otherwise be 0.
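In code, that clamp might look like the following (my own sketch, using the names from the full listing below; the epsilon value is an arbitrary choice):

eps = 1e-12  # arbitrary floor, well below any realistic pdf value
pdf_vals = np.clip(interp_sim(value), eps, None)
loglike = np.sum(np.log(pdf_vals))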
But here's the thing: the log-likelihood is not computed at every iteration. In fact, in the present architecture of my code, it is not computed at all, and that's why the MCMC process is not converging. But I don't know why.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline

# Generate the data
np.random.seed(0)
N = 5000
true_mean = 5.
true_sigma = 2.
data = np.random.normal(true_mean, true_sigma, N)

# Priors
m = pm.Uniform('m', lower=4, upper=6)
s = pm.Uniform('s', lower=1, upper=3)

@pm.deterministic
def data_simulator(mean_input=m, sig_input=s):
    out = np.empty(4, dtype=object)
    datasim = np.random.normal(mean_input, sig_input, N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    m_sim = np.mean(datasim)
    s_sim = np.std(datasim)
    out[0] = m_sim
    out[1] = s_sim
    out[2] = bin_centers
    out[3] = hist
    return out

@pm.stochastic(observed=True)
def loglikelihood(value=data,
                  mean_output=data_simulator.value[0],
                  sigma_output=data_simulator.value[1],
                  bin_centers_sim=data_simulator.value[2],
                  hist_sim=data_simulator.value[3]):
    # ext=0 returns extrapolated values outside the bin range
    interp_sim = InterpolatedUnivariateSpline(bin_centers_sim, hist_sim, k=1, ext=0)
    logp = np.sum(np.log(interp_sim(value)))
    print('logp =', logp)
    return logp

model = pm.Model({"mean": m, "sigma": s,
                  "data_simulator": data_simulator,
                  "loglikelihood": loglikelihood})

# Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)

# Plot the marginals
pm.Matplot.plot(mcmc)

How does pymc represent the prior distribution and likelihood function?

If pymc implements the Metropolis-Hastings algorithm to draw samples from the posterior density over the parameters of interest, then in order to decide whether to move to the next state in the Markov chain it must be able to evaluate something proportional to the posterior density for any given parameter values.
The posterior density is proportional to the likelihood function of the observed data times the prior density.
How are each of these represented within pymc? How does it calculate each of these quantities from the model object?
I wonder if anyone can give me a high level description of the approach or point me to where I can find it.
To represent the prior, you need an instance of the Stochastic class, which has two primary attributes:
value : the variable's current value
logp : the log probability of the variable's current value given the values of its parents
You can initialize a prior with the name of the distribution you are using.
To represent the likelihood, you need a so-called Data Stochastic. That is, an instance of class Stochastic whose observed flag is set to True. The value of this variable cannot be changed and it will not be sampled. Again, you can initialize the likelihood with the name of the distribution you are using (but don't forget to set the observed flag to True).
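For instance, a minimal PyMC 2-style sketch (my own illustration, with made-up data) of a prior Stochastic and a data Stochastic:

import numpy as np
import pymc as pm

# Prior: an ordinary Stochastic.
mu = pm.Normal('mu', mu=0.0, tau=1.0)

# Likelihood: a "data Stochastic" -- value fixed, observed=True.
y = pm.Normal('y', mu=mu, tau=1.0,
              value=np.array([0.2, -0.1, 0.4]), observed=True)

print(mu.value, mu.logp)  # current value and its log-probability
print(y.logp)             # log-likelihood of the data given mu's current value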
Say we have the following setup:
import pymc as pm
import numpy as np
import theano.tensor as t
x = np.array([1,2,3,4,5,6])
y = np.array([0,1,0,1,1,1])
We can run a simple logistic regression with the following:
with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)

    # Likelihood
    z = b0 + b1 * x
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-z)), observed=y)

    # Sample from the posterior
    trace = pm.sample(10000, pm.Metropolis())
Most of the above came from Chris Fonnesbeck's iPython notebook here.
