Calculate maximum likelihood using PyMC3 - python

There are cases when I'm not actually interested in the full posterior of a Bayesian inference, but simply the maximum likelihood (or maximum a posterior for suitably chosen priors), and possibly it's Hessian. PyMC3 has functions to do that, but find_MAP seems to return the model parameters in transformed form depending on the prior distribution on them. Is there an easy way to get the untransformed values from these? The output of find_hessian is even less clear to me, but it's most likely in the transformed space too.

May be the simpler solution will be to pass the argument transform=None, to avoid PyMC3 doing the transformation and then using find_MAP
I let you and example for a simple model.
data = np.repeat((0, 1), (3, 6))
with pm.Model() as normal_aproximation:
p = pm.Uniform('p', 0, 1, transform=None)
w = pm.Binomial('w', n=len(data), p=p, observed=data.sum())
mean_q = pm.find_MAP()
std_q = ((1/pm.find_hessian(mean_q))**0.5)[0]
print(mean_q['p'], std_q)
Have you considered using ADVI?

I came across this once more and found a way to get the untransformed values from the transformed ones. Just in case somebody else needs this as-well. The gist of it is that the untransformed values are essentially theano expressions that can be evaluated given the transformed values. PyMC3 helps here a little by providing the Model.fn() function which creates such an evaluation function accepting values by name. Now you only need to supply the untransformed variables of interest to the outs argument. A complete example:
data = np.repeat((0, 1), (3, 6))
with pm.Model() as normal_aproximation:
p = pm.Uniform('p', 0, 1)
w = pm.Binomial('w', n=len(data), p=p, observed=data.sum())
map_estimate = pm.find_MAP()
# create a function that evaluates p, given the transformed values
evalfun = normal_aproximation.fn(outs=p)
# create name:value mappings for the free variables (e.g. transformed values)
inp = {v:map_estimate[v.name] for v in model.free_RVs}
# now use that input mapping to evaluate p
p_estimate = evalfun(inp)
outs can also receive a list of variables, evalfun will then output the values of the corresponding variables in the same order.

Related

How to pass independent variable as parameter to lmfit minimize

I am new to this statistical thing.
So I have a data set y = f(x) for some values of x. I want to fit this data to a func so that for every point in y I can calculate the value of x.
suppose the model I want to fit is something like
def func(x,a,b,c):
return a+b*x/c
now to use minimize function, i've to define parameters:
params = Parameters()
params.add('a' , value = 10)
params.add('b' , value = 1)
params.add('c' , value = 2)
result = minimize(func, param, arg=(x, y))
My question is, what if I want to make my x variable as parameter and pass it as parameter.
Basically when I pass x as variable i am passing an array which corresponds to specific points in my data set. However I want to find use x as a parameter because I want to find value of x for certain points of data y.
Parameters and fitting variables are scalar floating-point numbers. That is to say, they have one value that can take a continuous range of values.
Do you mean that you want every element of x to be independently varied in the fit?
It is common to use minimization methods to fit data: find the set of values for the variables (say, your a, b, and c) so that y - func(x, a, b, c) is as small as possible.
Your incomplete code snippet (to be clear, it is always better to include a complete example) doesn't do that -- it doesn't pass y into func.
More importantly, you seem to be looking to "find the value of x for certain points of data y". That doesn't quite make sense to me.... Maybe clean up and clarify the question?
It's not a one way step.
I achieved it by using Model class in lmfit
Consider this.
I have model function which I name say: MF through which I calculate exact samples as of raw data.
I have raw data : Raw_Data (say for example coming from sensors)
then I have certain parameters. (x, y, z, samples)
Now I consider that samples is an independent variable.
My goal is to estimate one of the parameters.
First I have to create a Model class
mod = Model (MF, parameters, independant_vars = ['samples']
then you set the initial parameters using Parameters() class.
and add init values
fit_params = Parameters()
fit_params.add('x', value = 170)
fit_params.add('y', value = 120)
fit_params.add('z', value = 110)
Then you fit the model
Result - mod.fit(Raw_data, params = fit_params, samples = your_samples)
The output is a dict.

How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the parameters of the Gaussian found by using the fit of the first column to be the initial parameters of the starting point for the next column to use. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
x = np.linspace(-50,50,100)
Gauss_Model = models.Gaussian1D(amplitude = 1000., mean = 0, stddev = 1.)
Fitting_Model = fitting.LevMarLSQFitter()
Fit_Data = []
for i in range(0, Data_Array.shape[0]):
Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!
I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like fit_data[-1].amplitude or fit_data[-1].mean, then you could modify your loop to use something like:
for i in range(0, data_array.shape[0]):
if fit_data: # true if not an empty list
Gauss_Model = models.Gaussian1D(amplitude=fit_data[-1].amplitude,
mean=fit_data[-1].mean,
stddev=fit_data[-1].stddev)
fit_data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
basically checking whether you have already fit a model, and if you have, use the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably the best fit model to the data it was presented. If you want to estimate the error in the parameters of your model, you can use the fact that, for two normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the distribution A + B will have mean m_a + m_b and variance is v_a + v_b. Thus, the distribution of your means will be N(sum(means)/n, sum(variances)/n). Basically you can say that your true mean is centered at the mean of your means with standard deviation (sum(stddev)/sqrt(n)).
I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit import GaussianModel
x = np.linspace(-50,50,100)
# get Data_Array from somewhere....
# create a model for a Gaussian
Gauss_Model = GaussianModel()
# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)
Fit_Results = []
for i in range(Data_Array.shape[1]):
result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
Fit_Results.append(result)
# update `params` with the current best fit params for the next column
params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.

PyMC3 binomial switchpoint model highly dependent on testval

I've set up the following binomial switchpoint model in PyMC3:
with pm.Model() as switchpoint_model:
switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(), upper=df['covariate'].max())
# Priors for pre- and post-switch parameters
early_rate = pm.Beta('early_rate', 1, 1)
late_rate = pm.Beta('late_rate', 1, 1)
# Allocate appropriate binomial probabilities to years before and after current
p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
p = pm.Deterministic('p', p)
y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that it entirely centers in on one value for the switchpoint (999), as shown below.
Upon further investigation it seems that the results for this model are highly dependent on the starting value (in PyMC3, "testval"). The below shows what happens when I set the testval = 750.
switchpoint = pm.DiscreteUniform('switchpoint', lower=gp['covariate'].min(),
upper=gp['covariate'].max(), testval=750)
I get similarly different results with additional different starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare / select results generated by different testvals? The only idea I've had has been using WAIC to evaluate out of sample performance...
Models with discrete values can be problematic, all the nice sampling techniques using the derivatives don't work anymore, and they can behave a lot like multi modal distributions. I don't really see why this would be that problematic in this case, but you could try to use a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?).

PyMC throwing Error or Crashing When Trying to Sample

I'm trying to model a process where the number of trials n used in a binomial process is generated by a non-homogeneous Poisson process. I'm currently using PyMC to fit the Poisson part of this (example here, without the whole capping part), but I can't figure out how to integrate this with the Binomial. How can I take what I've generated with the Poisson process and use it in fitting a Binomial process? Or is there a better way to do this using similar methods?
Here's what I've tried:
import pymc as pm
import numpy as np
t = np.arange(5)
a = pm.Uniform(name='a', value=1., lower=0, upper=10)
b = pm.Uniform(name='b', value=1., lower=0, upper=10)
#pm.deterministic
def linear(a=a, b=b):
return a * t + b
N_A = pm.Poisson(mu=linear, name='N_A')
C = pm.Beta('C', 1, 1)
obs_A = pm.Binomial('obs_A', N_A, C, observed=True, value=np.array([0,1,4,3,7])
mcmc = pm.MCMC([obs_A, C, N_A, a, b])
mcmc.sample(10000,5000)
When I try to pull a sample, it throws the error
pymc.Node.ZeroProbability: Stochastic obs_A's value is outside its support, or it forbids its parents' current values.
I'm sure I'm formulating this incorrectly, but I'm not sure how.
The error appears because the automatically generated initial value of N_A (the number of trials) is less than some of the observed values specified in pm.Binomial, i.e. the probability of observing such a number is zero, and the initial value has to be rejected. One solution would be to supply an acceptable initial value for N_A explicitly:
x = np.array([0,1,4,3,7])
N_A = pm.Poisson(mu=linear, name='N_A', value=x)
For reference: https://github.com/pymc-devs/pymc/issues/12:
This is a model mis-specification issue, not a bug. It means that the initial parameter value for the likelhood are incompatible with the data -- you need to specify compatible initial values.

Debugging pymc probability calculations

I've tried to model a mixture of exponentials by copying the mixture-of-Gaussians example given here. The code is below. I know there are some funky aspects to the inference here, but my question is more about how to debug the calculations in models like this.
The idea is that it's a mixture of three exponentials, with scale parameters taken from the Gamma assigned to scales. However, all observations get assigned to the zeroth exponential during the ElemwiseCategoricalStep. You can see that the assignments of the observations to the exponential components are initially diverse by looking at initial_assignments, and you can see that all observations are assigned to the zeroth component on all interations from the fact that set(tr['exp'].flatten()) contains only 0.
I assume this is because all of the values assigned to p in the expression array([logp(v * self.sh) for v in self.values]) in ElemwiseCategoricalStep.astep are minus infinity. I would like to know why that is and how to correct it, but even more, I would like to know what tools are available to debug this kind of thing. Is there any way for me to step through the calculation of logp(v * self.sh) to see how the result is determined? If I try to do it using pdb, I think I get stymied at outputs = self.fn() in theano.compile.function_module.Function.__call__, which I guess I can't step into because it's a native function.
Even knowing how to compute the pdf for a given set of model parameters would be a useful start.
import numpy as np
import pymc as pm
from pymc import Model, Gamma, Normal, Dirichlet, Exponential
from pymc import Categorical
from pymc import sample, Metropolis, ElemwiseCategoricalStep
durations = np.concatenate(
[np.random.exponential(1/lam, 10)
for lam in [1e-3,7e-5,2e-6]])
initial_assignments = np.random.randint(0, 3, len(durations))
print 'initial_assignments', initial_assignments
with Model() as model:
scales = Gamma('hp', 1, 1, shape=3)
props = Dirichlet('props', a=np.array([1., 1., 1.]), shape=3)
category = Categorical('exp', p=props, shape=len(durations))
points = Exponential('obs', lam=scales[category], observed=durations)
step1 = pm.Metropolis(vars=[props,scales])
step2 = ElemwiseCategoricalStep(var=category, values=[0,1,2])
start = {'exp': initial_assignments,
'hp': np.ones(3),
'props': np.ones(3),}
tr = sample(3000, step=[step1, step2], start=start)
print set(tr['exp'].flatten())
Excellent question. One thing you can do is look at the pdf for each of the components.
The Model and each of the variables should have both a .logp and a .elemwise_logp property and them which returns a function that can take a point or parameter values.
Thus you can say something like print scales.logp(start) or print model.logp(start) or print scales.dlogp()(start).
For now, I think you unfortunately have to specify all the parameter values (even ones that don't affect the result for a particular variable).
Model, FreeRV and ObservedRV all inherit from Factor which provides this functionality and has a few other methods. You'll probably want the non fast versions since those are more forgiving in the kinds of arguments they accept.
Does that help? Please let me know if you have other ideas for things that might help you in debugging. This is one area where we know pymc3 and theano needs some work.

Categories