I'm trying to infer model parameters with PyMC. In particular, the observed data is modeled as the sum of two different random variables: a negative binomial and a Poisson.
In PyMC, an algebraic composition of random variables is described by a "deterministic" object. Is it possible to assign the observed data to this deterministic object?
If that is not possible, we still know that the PDF of the sum is the convolution of the PDFs of the components. Is there any trick to compute this convolution efficiently?
It is not possible to make a deterministic node observed in PyMC2, but you can achieve an equivalent model by making one part of your convolution a latent variable. Here is a small example:
import pymc as pm

def model(values):
    # priors for model parameters
    mu_A = pm.Exponential('mu_A', beta=1, value=1)
    alpha_A = pm.Exponential('alpha_A', beta=1, value=1)
    mu_B_minus_A = pm.Uninformative('mu_B_minus_A', value=1)

    # latent variable for the negative binomial component
    A = pm.NegativeBinomial('A', mu=mu_A, alpha=alpha_A, value=0)

    # observed variable for the conditional Poisson component
    B = pm.Poisson('B', mu=mu_B_minus_A + A, value=values, observed=True)

    return locals()
Here is a notebook that tests it out. It seems like it will be tough to fit without some additional information on the model parameters. Perhaps there is a clever way to calculate or approximate the convolution of a NB and a Poisson that you could use as a custom observed stochastic instead.
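If you want to try that route, here is a hedged sketch (my own illustration, not from the answer above) of such a custom observed stochastic in PyMC2. The exact PMF of the sum is a truncated convolution, P(C=c) = sum_k P(A=k) P(B=c-k); the helper name make_convolved_likelihood is hypothetical, and it assumes scipy's nbinom parameterization n=alpha, p=alpha/(alpha+mu), which matches PyMC2's NegativeBinomial:

import numpy as np
import pymc as pm
from scipy import stats

def make_convolved_likelihood(values, mu_A, alpha_A, mu_B):
    # Observed stochastic: log P(C = c) where C = A + B,
    # A ~ NegativeBinomial(mu_A, alpha_A) and B ~ Poisson(mu_B).
    @pm.observed
    def C(value=values, mu_A=mu_A, alpha_A=alpha_A, mu_B=mu_B):
        logp = 0.0
        for c in np.atleast_1d(value):
            k = np.arange(c + 1)
            # P(A = k) * P(B = c - k), summed over k
            pmf = (stats.nbinom.pmf(k, alpha_A, alpha_A / (alpha_A + mu_A))
                   * stats.poisson.pmf(c - k, mu_B))
            logp += np.log(pmf.sum())
        return logp
    return C

The inner sum has at most c+1 terms per observation, so this stays cheap for modest counts.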
When I sample from a distribution in PyTorch, both sample and rsample appear to give similar results:
import torch, seaborn as sns
x = torch.distributions.Normal(torch.tensor([0.0]), torch.tensor([1.0]))
sns.distplot(x.sample((100000,)))
sns.distplot(x.rsample((100000,)))
When should I use sample(), and when should I use rsample()?
Using rsample allows for pathwise derivatives:
The other way to implement these stochastic/policy gradients would be to use the reparameterization trick from the rsample() method, where the parameterized random variable can be constructed via a parameterized deterministic function of a parameter-free random variable. The reparameterized sample therefore becomes differentiable.
sample(): random sampling from the probability distribution. So, we cannot backpropagate, because it is random! (the computation graph is cut off).
See the source code of sample in torch.distributions.normal.Normal:
def sample(self, sample_shape=torch.Size()):
    shape = self._extended_shape(sample_shape)
    with torch.no_grad():
        return torch.normal(self.loc.expand(shape), self.scale.expand(shape))
torch.normal returns a plain tensor of random numbers. Also, the torch.no_grad() context prevents the computation graph from growing any further.
You see, we cannot backprop. The returned tensor of sample() contains just some numbers, not the whole computational graph.
So, what is rsample()?
By using rsample, we can backpropagate, because it keeps the computation graph alive.
How? By putting the randomness aside in a separate parameter. This is called the "reparameterization trick".
rsample: sampling using reparameterization trick.
There is eps in the source code:
def rsample(self, sample_shape=torch.Size()):
    shape = self._extended_shape(sample_shape)
    eps = _standard_normal(shape, dtype=self.loc.dtype, device=self.loc.device)
    # `self.loc` is the mean and `self.scale` is the standard deviation.
    return self.loc + eps * self.scale
eps is the separate parameter responsible for the randomness of the sampling.
Look at the return: mean + eps * standard deviation
eps does not depend on the parameters you want to differentiate with respect to.
So now you can freely backpropagate (that is, differentiate), because eps does not change when the parameters change.
(If we change the parameters, the distribution of the reparameterized samples does change, because self.loc and self.scale change; but the distribution of eps does not.)
Note that the randomness of the sampling comes entirely from the random draw of eps. There is no randomness in the computation graph itself: once eps is drawn, it is fixed.
For example, in an implementation of the SAC (Soft Actor-Critic) algorithm in reinforcement learning, eps may consist of elements corresponding to a single minibatch of actions (and one action may consist of many elements).
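To make the difference concrete, here is a minimal sketch (my own, using only the standard torch API): a gradient flows back to the distribution's parameters through rsample(), while calling backward() on a sample() result would fail.

import torch

mu = torch.tensor([0.0], requires_grad=True)
dist = torch.distributions.Normal(mu, torch.tensor([1.0]))

# rsample() keeps the graph: each draw is computed as mu + eps * scale.
loss = dist.rsample((1000,)).pow(2).mean()
loss.backward()
print(mu.grad)  # a real gradient with respect to mu

# sample() runs under torch.no_grad(), so its result has no grad_fn:
# dist.sample((1000,)).pow(2).mean().backward()  # raises RuntimeError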
I've set up the following binomial switchpoint model in PyMC3:
with pm.Model() as switchpoint_model:
    switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(),
                                     upper=df['covariate'].max())

    # Priors for the pre- and post-switch success probabilities
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)

    # Assign the early rate to observations at or below the switchpoint,
    # and the late rate to observations above it
    p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
    p = pm.Deterministic('p', p)

    y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that the posterior centers entirely on a single value for the switchpoint (999).
Upon further investigation, it seems that the results for this model are highly dependent on the starting value (in PyMC3, the "testval"). Below is what happens when I set testval=750:

switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(),
                                 upper=df['covariate'].max(), testval=750)
I get similarly divergent results with other starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare and select among results generated by different testvals? The only idea I've had so far is to use WAIC to evaluate out-of-sample performance...
Models with discrete variables can be problematic: the nice gradient-based sampling techniques no longer work, and the posterior can behave much like a multimodal distribution. I don't really see why that would be especially bad in this case, but you could try using a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?).
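For what it's worth, here is a sketch of that suggestion (mine, not the answerer's, and assuming the questioner's DataFrame df): replacing the hard switch with a sigmoid weight makes the likelihood differentiable in the switchpoint, so gradient-based samplers such as NUTS apply.

import pymc3 as pm

with pm.Model() as smooth_switchpoint_model:
    switchpoint = pm.Uniform('switchpoint', lower=df['covariate'].min(),
                             upper=df['covariate'].max())
    early_rate = pm.Beta('early_rate', 1, 1)
    late_rate = pm.Beta('late_rate', 1, 1)

    # Sigmoid weight: ~1 below the switchpoint, ~0 above it.
    w = pm.math.sigmoid(switchpoint - df['covariate'].values)
    p = pm.Deterministic('p', w * early_rate + (1 - w) * late_rate)

    y = pm.Binomial('y', p=p, n=df['trials'].values,
                    observed=df['successes'].values)
    trace = pm.sample()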
TL;DR
What's the right way to do posterior predictive checks on pm.Deterministic variables that take stochastics (rendering the deterministic also stochastic) as input?
Too Short; Didn't Understand
Say we have a pymc3 model like this:
import pymc3 as pm

with pm.Model() as model:
    # Arbitrary, trainable distributions.
    dist1 = pm.Normal("dist1", 0, 1)
    dist2 = pm.Normal("dist2", dist1, 1)

    # Arbitrary, deterministic theano math.
    val1 = pm.Deterministic("val1", arb1(dist2))

    # Arbitrary custom likelihood.
    cdist = pm.DensityDist("cdist", logp(val1), observed=get_data())

    # Arbitrary, deterministic theano math.
    val2 = pm.Deterministic("val2", arb2(val1))
I may be misunderstanding, but my intention is for the posteriors of dist1 and dist2 to be sampled, and for those samples to be fed into the deterministic variables. Is the posterior predictive check only possible on observed random variables?
It's straightforward to get posterior predictive samples from dist2 and other random variables using pymc3.sampling.sample_ppc, but the majority of my model's value is derived from the state of val1 and val2, given those samples.
The problem arises in that pm.Deterministic(.) seems to return a th.TensorVariable. So, when this is called:
ppc = pm.sample_ppc(_trace, vars=[val1, val2])["val1", "val2"]
...and pymc3 attempts this block of code in pymc3.sampling:
410 for var in vars:
--> 411 ppc[var.name].append(var.distribution.random(point=param,
412 size=size))
...it complains because a th.TensorVariable obviously doesn't have a .distribution.
So, what is the right way to carry the posterior samples of stochastics through deterministics? Do I need to explicitly create a th.function that takes stochastic posterior samples and calculates the deterministic values? That seems silly given the fact that pymc3 already has the graph in place.
Yes, I was misunderstanding the purpose of .sample_ppc. You don't need it for unobserved variables, because those already have samples in the trace. Observed variables are not sampled during inference, since their values are given, so you need sample_ppc to generate posterior predictive samples for them.
In short, I can gather samples of the pm.Deterministic variables from the trace.
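For instance (a sketch against the pseudocode model above, assuming it compiles), the Deterministic variables are recorded under their names, so their posterior samples can be read straight off the trace:

with model:
    trace = pm.sample(1000)

val1_samples = trace["val1"]  # one row per posterior draw
val2_samples = trace["val2"]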
I have a variable A which is Bernoulli distributed, A = pymc.Bernoulli('A', p_A), but I don't have a hard value for p_A and want to sample for it. I do know that it should be small, so I want to use an exponential distribution p_A = pymc.Exponential('p_A', 10).
However, the exponential distribution can return values higher than 1, which would throw off A. Is there a way of bounding the output of p_A without having to re-implement either the Bernoulli or the Exponential distribution in my own @pymc.stochastic-decorated function?
You can use a deterministic function to truncate the Exponential distribution. Personally, I believe it would be better to use a distribution that is bounded between 0 and 1, but to solve your problem exactly you can do the following:
import pymc as pm

p_A = pm.Exponential('p_A', 10)

@pm.deterministic
def p_B(p=p_A):
    # Clip the exponential draw so it is a valid probability.
    return min(1, p)

A = pm.Bernoulli('A', p_B)

model = dict(p_A=p_A, p_B=p_B, A=A)
S = pm.MCMC(model)
S.sample(1000)
p_B_trace = S.trace('p_B')[:]
PyMC provides bounds. The following should also work:
p_A = pymc.Bound(pymc.Exponential, upper=1)('p_A', lam=10)
For any other lost souls who come across this:
I think the best solution for my purposes (that is, I was only using the exponential distribution because the probabilities I was looking to generate were probably small, rather than out of mathematical convenience) was to use a Beta distribution instead.
For certain parameter values it approximates the shape of an exponential distribution (and can do the same for binomials and normals), but it is bounded to [0, 1]. Probably only useful for doing things numerically, though, as I imagine it's a pain to do any analysis with.
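For example (a sketch with illustrative parameter values): Beta(1, 10) puts most of its mass near zero, roughly exponential in shape, while its support is exactly [0, 1].

import pymc as pm

p_A = pm.Beta('p_A', alpha=1, beta=10)  # mass concentrated near 0
A = pm.Bernoulli('A', p_A)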
If pymc implements the Metropolis-Hastings algorithm to draw samples from the posterior density over the parameters of interest, then in order to decide whether to move to the next state in the Markov chain it must be able to evaluate something proportional to the posterior density for any given parameter values.
The posterior density is proportional to the likelihood function based on the observed data times the prior density.
How are each of these represented within pymc? How does it calculate each of these quantities from the model object?
I wonder if anyone can give me a high level description of the approach or point me to where I can find it.
To represent the prior, you need an instance of the Stochastic class, which has two primary attributes:
value : the variable's current value
logp : the log probability of the variable's current value given the values of its parents
You can initialize a prior with the name of the distribution you are using.
To represent the likelihood, you need a so-called Data Stochastic: an instance of the Stochastic class whose observed flag is set to True. The value of this variable cannot be changed, and it will not be sampled. Again, you can initialize the likelihood with the name of the distribution you are using (but don't forget to set the observed flag to True).
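A minimal PyMC2 sketch of both pieces (my own, with made-up data):

import numpy as np
import pymc as pm

mu = pm.Normal('mu', mu=0, tau=1e-2)                # prior Stochastic
y = pm.Normal('y', mu=mu, tau=1.0,
              value=np.array([1.2, 0.7, 1.9]),      # illustrative data
              observed=True)                        # Data Stochastic

print(mu.value)  # the variable's current value
print(mu.logp)   # log prior density at the current value
print(y.logp)    # log likelihood of the data given mu's current value

Metropolis-Hastings only ever needs the sum of these logp terms at the current and proposed values, which is proportional to the log posterior.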
Say we have the following setup:
import pymc as pm
import numpy as np
import theano.tensor as t
x = np.array([1,2,3,4,5,6])
y = np.array([0,1,0,1,1,1])
We can run a simple logistic regression with the following:
with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)

    # Likelihood
    z = b0 + b1 * x
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-z)), observed=y)

    # Sample from the posterior
    trace = pm.sample(10000, pm.Metropolis())
Most of the above came from Chris Fonnesbeck's iPython notebook here.
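To connect this back to the question of how these quantities are evaluated, here is a short sketch (using the names above; the exact API may vary across PyMC3 versions) of the compiled log-density the samplers call:

point = {'b0': 0.1, 'b1': -0.2}
print(model.logp(point))  # log prior + log likelihood at `point`
print(yhat.logp(point))   # the likelihood factor alone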