Debugging pymc probability calculations

Debugging pymc probability calculations - python

I've tried to model a mixture of exponentials by copying the mixture-of-Gaussians example given here. The code is below. I know there are some funky aspects to the inference here, but my question is more about how to debug the calculations in models like this.
The idea is that it's a mixture of three exponentials, with scale parameters taken from the Gamma assigned to scales. However, all observations get assigned to the zeroth exponential during the ElemwiseCategoricalStep. You can see that the assignments of the observations to the exponential components are initially diverse by looking at initial_assignments, and you can see that all observations are assigned to the zeroth component on all interations from the fact that set(tr['exp'].flatten()) contains only 0.
I assume this is because all of the values assigned to p in the expression array([logp(v * self.sh) for v in self.values]) in ElemwiseCategoricalStep.astep are minus infinity. I would like to know why that is and how to correct it, but even more, I would like to know what tools are available to debug this kind of thing. Is there any way for me to step through the calculation of logp(v * self.sh) to see how the result is determined? If I try to do it using pdb, I think I get stymied at outputs = self.fn() in theano.compile.function_module.Function.__call__, which I guess I can't step into because it's a native function.
Even knowing how to compute the pdf for a given set of model parameters would be a useful start.
import numpy as np
import pymc as pm
from pymc import Model, Gamma, Normal, Dirichlet, Exponential
from pymc import Categorical
from pymc import sample, Metropolis, ElemwiseCategoricalStep
durations = np.concatenate(
[np.random.exponential(1/lam, 10)
for lam in [1e-3,7e-5,2e-6]])
initial_assignments = np.random.randint(0, 3, len(durations))
print 'initial_assignments', initial_assignments
with Model() as model:
scales = Gamma('hp', 1, 1, shape=3)
props = Dirichlet('props', a=np.array([1., 1., 1.]), shape=3)
category = Categorical('exp', p=props, shape=len(durations))
points = Exponential('obs', lam=scales[category], observed=durations)
step1 = pm.Metropolis(vars=[props,scales])
step2 = ElemwiseCategoricalStep(var=category, values=[0,1,2])
start = {'exp': initial_assignments,
'hp': np.ones(3),
'props': np.ones(3),}
tr = sample(3000, step=[step1, step2], start=start)
print set(tr['exp'].flatten())

Excellent question. One thing you can do is look at the pdf for each of the components.
The Model and each of the variables should have both a .logp and a .elemwise_logp property and them which returns a function that can take a point or parameter values.
Thus you can say something like print scales.logp(start) or print model.logp(start) or print scales.dlogp()(start).
For now, I think you unfortunately have to specify all the parameter values (even ones that don't affect the result for a particular variable).
Model, FreeRV and ObservedRV all inherit from Factor which provides this functionality and has a few other methods. You'll probably want the non fast versions since those are more forgiving in the kinds of arguments they accept.
Does that help? Please let me know if you have other ideas for things that might help you in debugging. This is one area where we know pymc3 and theano needs some work.

Related

Exclude/Ignore data region in polynomial fit (zfit)

I wanted to know if there's a way to exclude one or more data regions in a polynomial fit. Currently this doesn't seem to work as I would expect. Here a small example:
import numpy as np
import pandas as pd
import zfit
# Create test data
left_data = np.random.uniform(0, 3, size=1000).tolist()
mid_data = np.random.uniform(3, 6, size=5000).tolist()
right_data = np.random.uniform(6, 9, size=1000).tolist()
testsample = pd.DataFrame(left_data + mid_data + right_data, columns=["x"])
# Define fit parameter
coeff1 = zfit.Parameter('coeff1', 0.1, -3, 3)
coeff2 = zfit.Parameter('coeff2', 0.1, -3, 3)
# Define Space for the fit
obs_all = zfit.Space("x", limits=(0, 9))
# Perform the fit
bkg_fit = zfit.pdf.Chebyshev(obs=obs_all, coeffs=[coeff1, coeff2], coeff0=1)
new_testsample = zfit.Data.from_pandas(obs=obs_all, df=testsample.query("x<3 or x>6"), weights=None)
nll = zfit.loss.UnbinnedNLL(model=bkg_fit, data=new_testsample)
minimizer = zfit.minimize.Minuit()
result = minimizer.minimize(nll)
TestSample.png
Here I've created a small testsample with 3 uniformly distributed data. I only want to use the data in x < 3 OR x > 6 and ignore the 'peak' in between. Because of their equal shape and height, I'd expect that coeff1 and coeff2 would be at (nearly) zero and the fitted curve would be a straight, horizontal line. Obviously this doesn't happen because zfit assumes that there're just no entries between 3 and 6.
I also tried using MultiSpaces to ignore that region via
limit1 = zfit.Space("x", limits=(0, 3))
limit2 = zfit.Space("x", limits=(6, 9))
obs_data = limit1 + limit2
But this leads to a
ValueError: obs need to be a Space with exactly one limit if rescaling is requested.
Anyone has an idea how to solve this?
Thanks in advance ^^

Indeed, this is a bit of a tricky problem, but that may just needs a small update in zfit.
What you are doing is correct: simply use only the data in the desired region. However, this is not the whole story because there is a "normalization range": probabilistically speaking, it's like a conditioning on a certain region as we know the data can only be in a specific region. Hence the normalization of the PDF should only integrate over the included (LOW and HIGH) regions.
This can normally be done in two ways:
Using multispace
using the multispace property as you do. This should work (it is though most probably not the way to go in the future), except for a quirk in the polynomial function: the polynomials are defined from -1 to 1. Currently, the data is simply rescaled therefore to be within -1 and 1 (and for that it should use the "space" property of the PDF). This, currently, requires to be a simple space (which could also be allowed in principle, using the minimum and maximum of the limits).
Simultaneous fit
As mentioned in the comments by #jtlz2, you can do a simultaneous fit. That is nothing to worry about, it is simply splitting the likelihood into two parts. As it is a product of probabilities, we can just conceptually split it into two products and multiply (or add their log).
So you can have the pdf fit the lower region and the upper at the same time. However, this does not solve the problem of the normalization: what should the PDF be normalized to? We will run into the same problem.
Solution 1: different space and norm
Space and the normalization range are however not the same. By default, the space (usually called 'obs') is also used as the default normalization range but not required. So you could use one space going from the lowest to the largest point as the obs and then set the norm range with your multispace (set_norm should do it or set_norm_range if you're using not the newest version). This, I think, should do the trick.
Solution 2: manual re-scaling
The actual problem is that it complains about the re-scaling to -1 and 1 that can't be done. Every polynomial which does that can also be told not to do that by using the apply_scaling=False argument. With that, you're responsible to scale the data within -1 and 1 (as the polynomials are not defined outside) and there should not be any error.

Dynamic Factor Model Estimation

I'm looking for a python or matlab based package which can estimate parameters for the following model:
In the original paper they refer to this code by Koop. The problem I have is that this program as well the standard packages from Python's statsmodel estimate a DFM of the form:
The difference to the model in the paper is that if we have two factors, then A_1 is two-dimensional, but in the model I want to estimate, we only want to estimate a_11 and assume a_12 = 0. Is there a package which can estimate such models?

There are two ways to do this in Statsmodels, although there are trade-offs to each approach:
(1) If you are okay with 1 lag for the error terms (i.e. if it is okay to have e(i,t) = \phi(i,1) e(i,t-1) + u(i,t), from your linked "Model" equations), then you can use the DynamicFactorMQ class. For two factors that evolve independently, you can use the following:
mod = sm.tsa.DynamicFactorMQ(y, factors=['f1', 'f2'],
factor_orders={'f1':1, 'f2':1},
idiosyncratic_ar1=True)
res = mod.fit()
See here for more details on how the factors and factor_orders arguments work. Basically, by specifying factor_orders={'f1':1, 'f2':1} instead of factor_orders={('f1', 'f2'):1} (which is the default if you don't specify anything), the factors evolve separately (which is the same as having diagonal A matrices).
(2) Otherwise, if you do not have too many left-hand-side variables, you could use the DynamicFactor class with fixed parameters:
mod = sm.tsa.DynamicFactor(std, factor_order=1, k_factors=2,
error_order=1,
enforce_stationarity=False)
with mod.fix_params({'L1.f2.f1': 0, 'L1.f1.f2': 0}):
res = mod.fit()
In this case, when you do mod.fix_params({'L1.f2.f1': 0, 'L1.f1.f2': 0}), you are specifying that a_12 = a_21 = 0. See here for some more details about using fix_params.
But in general, the DynamicFactorMQ class from option (1) above is more robust, and is likely the better option.

Fitting data with function with several domains whose boundaries are variable

As the title mentions, I am having trouble fitting data points to a function with 3 domains whose boundaries are a parameter of my function. Here is the function I am dealing with:
global sigma_m
sigma_m=2*10**(-12)
global sigma_f
sigma_f=10**3
def Conductivity (phi,phi_c,t,s):
sigma=[0]*(len(phi))
for i in range (0,len(phi)):
if phi[i]<phi_c:
sigma[i]=sigma_m*(phi_c-phi[i])**(-s)
elif phi[i]==phi_c:
sigma[i]=sigma_f*(sigma_m/sigma_f)**(t/(t+s))
else:
sigma[i]=sigma_f*(phi[i]-phi_c)**t
return sigma
And my data points are:
phi_data=[0,0.005,0.007,0.008,0.017,0.05,0.085,0.10]
sigma_data=[2.00E-12,2.50E-12,3.00E-12,9.00E-04,1.00E-01,1.00E+00,2.00E+00,3.00E+00]
My constraints are that phi_c, s, and t must be strictly greater than zero (in practice, phi_c is rarely higher than 0.1 but higher than 0.001, s is usually between 0.5 and 1.5, and t is usually anywhere between 1.5 and 6).
My goal is to fit my data points and have my fit give me values of phi_c, s, and t. s and t can be estimated to help the code (in the specific set of data points that I showed, t should be around 2, and s should be around 0.5). phi_c is completely unknown, except for the range of values that I mentioned just above.
I have used both curve_fit from scipy and Model from lmfit but both provide ridiculously small phi_c values (like 10**(-16) or similarly small values that make me believe the programme wants phi_c to be negative).
Here is my code for when I used curve_fit:
popt, pcov = curve_fit(Conductivity, phi_data, sigma_data, p0=[0.01,2,0.5], bounds=(0,[0.5,10,3]))
Here is my code for when I used Model from lmfit:
t_estimate=0.5
s_estimate=2
phi_c_estimate=0.005
condmodel = Model(Conductivity)
params = condmodel.make_params(phi_c=phi_c_estimate,t=t_estimate,s=s_estimate)
result = condmodel.fit(sigma_data, params, phi=phi_data)
params['phi_c'].min = 0
params['phi_c'].max = 0.1
Both options give an okay fit when plotted, but the estimated value of phi_c is nowhere near plausible.
If you have any idea what I could do to have a better fit, please let me know!
PS: I have a read a promising post about using the package symfit to fit the data on the different regions separately, unfortunately the package symfit does not work for me. It keeps uninstalling my version of scipy then reinstalling an older version, and then it tells me it needs a newer version of scipy to function.
EDIT: I managed to make the symfit package work. Here is my entire code:
from symfit import parameters, variables, Fit, Piecewise, exp, Eq
import numpy as np
import matplotlib.pyplot as plt
global sigma_m
sigma_m=2*10**(-12)
global sigma_f
sigma_f=10**3
phi, sigma = variables ('phi, sigma')
t, s, phi_c = parameters('t, s, phi_c')
phi_c.min = 0.001
phi_c.max = 0.1
sigma1 = sigma_m*(phi_c-phi)**(-s)
sigma2 = sigma_f*(phi-phi_c)**t
model = {sigma: Piecewise ((sigma1, phi <= phi_c), (sigma2, phi > phi_c))}
constraints = [Eq(sigma1.subs({phi: phi_c}), sigma2.subs({phi: phi_c}))]
phi_data=np.array([0,0.005,0.007,0.008,0.017,0.05,0.085,0.10])
sigma_data=np.array([2.00E-12,2.50E-12,3.00E-12,9.00E-04,1.00E-01,1.00E+00,2.00E+00,3.00E+00])
fit = Fit(model, phi=phi_data, sigma=sigma_data, constraints=constraints)
fit_result = fit.execute()
print(fit_result)
Unfortunately I get the following error:
File "D:\Programs\Anaconda\lib\site-packages\sympy\printing\pycode.py", line 236, in _print_ComplexInfinity
return self._print_NaN(expr)
File "D:\Programs\Anaconda\lib\site-packages\sympy\printing\pycode.py", line 74, in _print_known_const
known = self.known_constants[expr.__class__.__name__]
KeyError: 'ComplexInfinity'
My knowledge of coding is very limited, I have no idea what this means and what I should do to not have this error anymore. Please let me know if you have an idea.

I'm not certain that I have a single answer for you, but this will be too long to fit into a comment.
First, a model that switches functional form is especially challenging. But, what's more is that your form has
elif phi[i]==phi_c:
For floating point numbers that are variables, this is going to basically never be true. You might not mean "exactly equal" but "pretty close", which might be
elif abs(phi[i] - phi_c) < 1.0e-5:
or something...
But also, converting that from a for loop to using numpy.where() is probably worth looking into.
Second, it is not at all clear that your different forms actually evaluate to the same values at the boundaries to ensure a continuous function. You might want to check that.
Third, models with powers and exponentials are especially challenging to fit as a small change in power can have a huge impact on the resulting value. It's also very easy to get "negative value raised to non-integer value", which is of course, complex.
Fourth, those sigma_m and sigma_f constants look like they could easily cause trouble. You should definitely evaluate your model with your starting parameter values and see if you can sort of reproduce your data with your model and reasonable starting values. I suspect that you'll need to change your starting values.

PyMC3 binomial switchpoint model highly dependent on testval

I've set up the following binomial switchpoint model in PyMC3:
with pm.Model() as switchpoint_model:
switchpoint = pm.DiscreteUniform('switchpoint', lower=df['covariate'].min(), upper=df['covariate'].max())
# Priors for pre- and post-switch parameters
early_rate = pm.Beta('early_rate', 1, 1)
late_rate = pm.Beta('late_rate', 1, 1)
# Allocate appropriate binomial probabilities to years before and after current
p = pm.math.switch(switchpoint >= df['covariate'].values, early_rate, late_rate)
p = pm.Deterministic('p', p)
y = pm.Binomial('y', p=p, n=df['trials'].values, observed=df['successes'].values)
It seems to run fine, except that it entirely centers in on one value for the switchpoint (999), as shown below.
Upon further investigation it seems that the results for this model are highly dependent on the starting value (in PyMC3, "testval"). The below shows what happens when I set the testval = 750.
switchpoint = pm.DiscreteUniform('switchpoint', lower=gp['covariate'].min(),
upper=gp['covariate'].max(), testval=750)
I get similarly different results with additional different starting values.
For context, this is what my dataset looks like:
My questions are:
Is my model somehow incorrectly specified?
If it's correctly specified, how should I interpret these results? In particular, how do I compare / select results generated by different testvals? The only idea I've had has been using WAIC to evaluate out of sample performance...

Models with discrete values can be problematic, all the nice sampling techniques using the derivatives don't work anymore, and they can behave a lot like multi modal distributions. I don't really see why this would be that problematic in this case, but you could try to use a continuous variable for the switchpoint instead (wouldn't that also make more sense conceptually?).

(Python) Estimating regression parameter confidence intervals with scikits bootstrap

I've just started to try out a nice bootstrapping package available through scikits:
https://github.com/cgevans/scikits-bootstrap
but I've encountered a problem when trying to estimate confidence intervals for the correlation coefficient from linear regression. The confidence intervals returned lie completely outside the range of the original statistic.
Here is the code:
import numpy as np
from scipy import stats
import bootstrap as boot
np.random.seed(0)
x = np.arange(10)
y = 10 + 1.5*x + 2*np.random.randn(10)
r0 = stats.linregress(x, y)[2]
def my_function(y):
return stats.linregress(x, y)[2]
ci = boot.ci(y, statfunction=my_function, alpha=0.05, n_samples=1000, method='pi')
This yields a result of ci = [-0.605, 0.644], but the original statistic is r0=0.894.
I've tried this in R and it seems to work fine there: the ci straddles r0 as expected.
Please help!

Could you provide your R code? I'd be interested in knowing how this is dealt with in R.
The problem here is that you're only passing y to boot.ci, but every time it runs my_function, it uses the entire, original x (note the lack of x input to my_function). Bootstrapping applies the statistic function to resampled data, so if you're applying your statistic function using the original x and a sample of y, you're going to have a nonsensical result. This is why the BCA method doesn't work at all, actually: it can't apply your statistic function to jackknife samples, which don't have the same number of elements.
Samples are taken along axis 0 (rows), so if you want to pass multiple 1D arrays to your statistic function, you can use multiple columns: xy = vstack((x,y)).T would work, and then use a statfunction that takes data from those columns:
def my_function(xysample):
return stats.linregress(xysample[:,0], xysample[:,1])[2]
Alternatively, if you wanted to avoid messing with your data at all, you could define a function that operates on indexes, and then just pass indexes to boot.ci:
def my_function2(i):
return stats.linregress(x[i], y[i])[2]
boot.ci(np.arange(len(x)), statfunction=my_function2, alpha=0.05, n_samples=1000, method='pi')
Note that in either of these cases, BCA works, so you may as well use method='bca' unless you really do want to use percentage intervals; BCA is pretty much always better.
I do realize that both of these methods are less than ideal. Honestly, I've never had a need to pass multiple arrays like this to my statfunction, and the majority of people are likely using mean as their statfunction. I think the best idea here may be to allow lists of equally-size[0] arrays to be passed, eg, boot.ci([x,y],...), and then sample all of those at the same time and pass them all to the statfunction as separate arguments. In that case, you could just have a my_function(x,y). I'll see if I can do this, but if you can show me your R code, that would be great, as I'd like to see if there is a better way of dealing with this.
Update:
In the most recent version of scikits.bootstrap (v0.3.1), a tuple of arrays can be provided, and samples from them will be passed as separate arguments to statfunction. Additionally, statfunction can provide array output, and confidence intervals will be calculated for each point in the output. Thus, this is now very easy to do. The following will give confidence intervals for every output of linregress:
cis = boot.ci( (x,y), statfunction=stats.linregress )
cis[:,2] in this case will be the desired confidence interval.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.