Posterior Predictive Check on PyMC3 Deterministic Variable

Posterior Predictive Check on PyMC3 Deterministic Variable - python

TL; DR
What's the right way to do posterior predictive checks on pm.Deterministic variables that take stochastics (rendering the deterministic also stochastic) as input?
Too Short; Didn't Understand
Say we have a pymc3 model like this:
import pymc3 as pm
with pm.Model() as model:
# Arbitrary, trainable distributions.
dist1 = pm.Normal("dist1", 0, 1)
dist2 = pm.Normal("dist2", dist1, 1)
# Arbitrary, deterministic theano math.
val1 = pm.Deterministic("val1", arb1(dist2))
# Arbitrary custom likelihood.
cdist = pm.DensityDistribution("cdist", logp(val1), observed=get_data())
# Arbitrary, deterministic theano math.
val2 = pm.Deterministic("val2", arb2(val1))
I may be misunderstanding, but my intention is for the posteriors of dist1 and dist2 to be sampled, and for those samples to fed into the deterministic variables. Is the posterior predictive check only possible on observed random variables?
It's straightforward to get posterior predictive samples from dist2 and other random variables using pymc3.sampling.sample_ppc, but the majority of my model's value is derived from the state of val1 and val2, given those samples.
The problem arises in that pm.Deterministic(.) seems to return a th.TensorVariable. So, when this is called:
ppc = pm.sample_ppc(_trace, vars=[val1, val2])["val1", "val2"]
...and pymc3 attempts this block of code in pymc3.sampling:
410 for var in vars:
--> 411 ppc[var.name].append(var.distribution.random(point=param,
412 size=size))
...it complains because a th.TensorVariable obviously doesn't have a .distribution.
So, what is the right way to carry the posterior samples of stochastics through deterministics? Do I need to explicitly create a th.function that takes stochastic posterior samples and calculates the deterministic values? That seems silly given the fact that pymc3 already has the graph in place.

Yes, I was misunderstanding the purpose of .sample_ppc. You don't need it for unobserved variables because those have samples in the trace. Observed variables aren't sampled from, because their data is observed, thus you need sample_ppc to generate samples.
In short, I can gather samples of the pm.Deterministic variables from the trace.

Related

Posterior predictive check in PyMC3, discrete distributions

I want to implement a basic model of capture and recapture in PyMC3 (you capture 100 animals and mark them, then you liberate them and recapture 100 of them after they have mixed and annotate how many are marked). This is my code:
import numpy as np
import pymc3 as pm
import arviz as az
# Datos:
K = 100 #marked animals in first round
n = 100 #captured animals in second round
obs = 10 #observed marked in second round
with pm.Model() as my_model:
N = pm.DiscreteUniform("N", lower=K, upper=10000)
likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)
trace = pm.sample(10000)
print(pm.summary(trace))
print(trace['N'])
ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"])
data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc) #create inference data
az.plot_ppc(data_ppc, figsize=(12, 6))
But I obtain the error in plot_ppc that 'var names: "[\'likelihood\'] are not present" in dataset'. Also the warning posterior predictive variable N's shape not compatible with number of chains and draws. This can mean that some draws or even whole chains are not represented.
What is happening and what can I do to obtain a posterior predictive check plot?

The root of all the problems is in this line ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N"]).
By using var_names=["N"] you are indicating PyMC to "sample" only the variable N which is actually a latent variable that was sampled while sampling the posterior in the pm.sample call. Doing this in the pm.sample_posterior_predictive call is indicating PyMC to not sample the observed variable (likelihood in this case) and to just copy the samples for N to the posterior predictive too. You will see that data_ppc is an InferenceData object with multiple groups, N is already in the posterior group (like it is in the trace object).
By using 100 (aka samples=100 as a positional argument) you are indicating PyMC to draw posterior predictive samples only for the first 100 draws of the first chain. This is a bad idea, so ArviZ prints a warning when converting to InferenceData. You should generate one posterior predictive sample per posterior sample, only generating samples for a subset of the posterior if posterior predictive sampling were very slow.
My recommendation, which also applies as a general rule, is to trust PyMC defaults unless you have a reason not to or want things to give the same result with multiple versions. We update the defaults from time to time to try and keep them coherent with best practices which are updated and improve over the time. You should therefore do: ppc = pm.sample_posterior_predictive(trace). PyMC will default to sampling the likelihood variable only and to generate one sample per posterior draw.

What is the difference between sample() and rsample()?

When I sample from a distribution in PyTorch, both sample and rsample appear to give similar results:
import torch, seaborn as sns
x = torch.distributions.Normal(torch.tensor([0.0]), torch.tensor([1.0]))
sns.distplot(x.sample((100000,)))
sns.distplot(x.rsample((100000,)))
When should I use sample(), and when should I use rsample()?

Using rsample allows for pathwise derivatives:
The other way to implement these stochastic/policy gradients would be to use the reparameterization trick from the rsample() method, where the parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. The reparameterized sample therefore becomes differentiable.

sample(): random sampling from the probability distribution. So, we cannot backpropagate, because it is random! (the computation graph is cut off).
See the source code of sample in torch.distributions.normal.Normal:
def sample(self, sample_shape=torch.Size()):
shape = self._extended_shape(sample_shape)
with torch.no_grad():
return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
torch.normal returns a tensor of random numbers. Also, torch.no_grad() context prevents the computation graph from growing any further.
You see, we cannot backprop. The returned tensor of sample() contains just some numbers, not the whole computational graph.
So, what is rsample()?
By using rsample, we can backpropagate, because it keeps the computation graph alive.
How? By putting the randomness aside in a separate parameter. This is called the "reparameterization trick".
rsample: sampling using reparameterization trick.
There is eps in the source code:
def rsample(self, sample_shape=torch.Size()):
shape = self._extended_shape(sample_shape)
eps = _standard_normal(shape, dtype=self.loc.dtype, device=self.loc.device)
return self.loc + eps * self.scale
# `self.loc` is the mean and `self.scale` is the standard deviation.
eps is the separate parameter responsible for the randomness of the sampling.
Look at the return: mean + eps * standard deviation
eps does not depend on the parameters you want to differentiate with respect to.
So, now you can freely backpropagate(=differentiate) because eps does not change when the parameters change.
(If we change the parameters, the distribution of the reparameterized samples does change because self.loc and self.scale change, but the distribution of the eps does not change.)
Note that the randomness of the sampling comes from the random sampling of the eps. There is no randomness in the computation graph itself. Once eps is chosen, it is fixed. (the distribution of the elements of the eps is fixed, after they are sampled.)
For example, in an implementation of the SAC(Soft Actor-Critic) algorithm in reinforcement learning, eps may consist of elements corresponding to a single minibatch of actions (and one action may consist of many elements).

Get the distribution when a point is sampled from a mixture in PyMC3

I have a model with a pm.NormalMixture(), and when I sample from the normal mixture, I also want to know which of the mixed distributions that point is being sampled from.
import numpy as np
import pymc3 as pm
obs = np.concatenate([np.random.normal(5,1,100),
np.random.normal(10,2,200)])
with pm.Model() as model:
mu = pm.Normal('mu', 10, 10, shape=2)
sd = pm.Normal('sd', 10, 10, shape=2)
x = pm.NormalMixture('x', mu=mu, sd=sd, observed=obs)
I sample from that model, then use that trace to sample from the posterior predictive distribution, and what I want to know is for each x in the posterior predictive trace, which of the two normal distributions being sampled from it belongs to. Is that possible in PyMC3 without doing it manually?

This example demonstrates how posterior predictive checks (PPCs) work. The gist of a PPC is that you first draw random samples from the trace. The trace is essentially always multivariate, and in your model a single sample would be defined by the vector (mu[i,0], mu[i,1], sd[i,0], sd[i,1]). Then, for each trace sample, generate random numbers from the distribution specified for the likelihood with its parameter values equal to those from the trace samples. In your case, this would be NormalMixture(mu[i,:], sd[i,:]). In your model, x is the likelihood function, not an individual point of the trace.
Some practical notes:
You haven't specified a weighting variable, so I'm assuming by default it forces the normal distributions to be weighted equally (I haven't tested this).
The odds of a given point coming from one distribution or the other is just the ratio between the probability densities at that point.
Check out this for recommendations on how to choose priors. For example, your SD prior is placing a lot of weight on very large SDs, which would bias your results, especially for smaller datasets.
Good luck!

PyMC observed data for a sum of random variables

I'm trying to infer models parameters with PyMC. In particular the observed data is modeled as a sum of two different random variables: a negative binomial and a poisson.
In PyMC, an algebraic composition of random variables is described by a "deterministic" object. Is it possible to assign the observed data to this deterministic object?
If not possible, we still know that the PDF of the sum is the convolution the PDF of the components. Is there any trick to compute this convolution efficiently?

It is not possible to make a deterministic node observed in PyMC2, but you can achieve an equivalent model by making one part of your convolution a latent variable. Here is a small example:
def model(values):
# priors for model parameters
mu_A = pm.Exponential('mu_A', beta=1, value=1)
alpha_A = pm.Exponential('alpha_A', beta=1, value=1)
mu_B_minus_A = pm.Uninformative('mu_B_minus_A', value=1)
# latent variable for negative binomial
A = pm.NegativeBinomial('A', mu=mu_A, alpha=alpha_A, value=0)
# observed variable for conditional poisson
B = pm.Poisson('B', mu=mu_B_minus_A+A, value=values, observed=True)
return locals()
Here is a notebook that tests it out. It seems like it will be tough to fit without some additional information on the model parameters. Perhaps there is a clever way to calculate or approximate the convolution of a NB and a Poisson that you could use as a custom observed stochastic instead.

How does pymc represent the prior distribution and likelihood function?

If pymc implements the Metropolis-Hastings algorithm to come up with samples from the posterior density over the parameters of interest, then in order to decide whether to move to the next state in the markov chain it must be able to evaluate something proportional to the posterior density for all given parameter values.
The posterior density is proportion to the likelihood function based on the observed data times the prior density.
How are each of these represented within pymc? How does it calculate each of these quantities from the model object?
I wonder if anyone can give me a high level description of the approach or point me to where I can find it.

To represent the prior, you need an instance of the Stochastic class, which has two primary attributes:
value : the variable's current value
logp : the log probability of the variable's current value given the values of its parents
You can initialize a prior with the name of the distribution you are using.
To represent the likelihood, you need a so-called Data Stochastic. That is, an instance of class Stochastic whose observed flag is set to True. The value of this variable cannot be changed and it will not be sampled. Again, you can initialize the likelihood with the name of the distribution you are using (but don't forget to set the observed flag to True).
Say we have the following setup:
import pymc as pm
import numpy as np
import theano.tensor as t
x = np.array([1,2,3,4,5,6])
y = np.array([0,1,0,1,1,1])
We can run a simple logistic regression with the following:
with pm.Model() as model:
#Priors
b0 = pm.Normal("b0", mu=0, tau=1e-6)
b1 = pm.Normal("b1", mu=0, tau=1e-6)
#Likelihood
z = b0 + b1 * x
yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-z)), observed=y)
# Sample from the posterior
trace = pm.sample(10000, pm.Metropolis())
Most of the above came from Chris Fonnesbeck's iPython notebook here.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.