I'm trying to use PyMC 2.3 to obtain an estimate of the parameter of a compound model.
By "compound" I mean a random variable that follows a distribution whose whose parameter is another random variable. ("nested" or "hierarchical" are somtimes used to refer to this situation, but I think they are less specific and generate more confusion in this context).
Let make a specific example. The "real" data is generated from a compound distribution that is a Poisson with a parameter that is distributed as an Exponential. In plain scipy the data is generated as follows:
import numpy as np
from scipy.stats import distributions
np.random.seed(3) # for repeatability
nsamples = 1000
tau_true = 50
orig_lambda_sample = distributions.expon(scale=tau_true).rvs(nsamples)
data = distributions.poisson(orig_lambda_sample).rvs(nsamples)
I want to obtain an estimate of the model parameter tau_true.
My approach so far in modelling this problem in PyMC is the following:
tau = pm.Uniform('tau', 0, 100)
count_rates = pm.Exponential('count_rates', beta=1/tau, size=nsamples)
counts = pm.Poisson('counts', mu=count_rates, value=data, observed=True)
Note that I use size=nsamples to have a new stochastic variable for each sample.
Finally I run the MCMC as:
model = pm.Model([count_rates, counts, tau])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)
The model converges (although slowly, > 10^5 iterations) to a distribution centred around 50 (tau_true). However, it seems like overkill to define 1000 stochastic variables to fit a single distribution with a single parameter.
Is there a better way to describe a compound model in PyMC?
PS I also tried with a more informative prior (tau = pm.Normal('tau', mu=51, tau=1/2**2)) but the results are similar and the convergence is still slow.
It looks like what you are trying to model is data that is over-dispersed. In fact, a negative binomial distribution is just a Poisson with a mean that is distributed according to a gamma distribution (the general form of the Exponential). So, one way to get around defining 1000 variables is to use the negative binomial directly. Keep in mind, though, that despite there being nominally 1000 variables, the effective number of variables is somewhere between 1 and 1000, depending on how constrained the distribution of means is. You are essentially defining a random effect here.
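For completeness, here is a minimal sketch of that idea in PyMC 2, assuming PyMC 2's NegativeBinomial parameterisation (mu, alpha). With an Exponential mixing distribution of mean tau, the marginal count distribution is a negative binomial with mean tau and alpha = 1 (a geometric), so the per-sample rates can be integrated out:
import pymc as pm
tau = pm.Uniform('tau', 0, 100)
# Poisson mixed over Exponential(mean tau) rates, marginalised:
# negative binomial with mean tau and dispersion alpha = 1
counts = pm.NegativeBinomial('counts', mu=tau, alpha=1, value=data, observed=True)
model = pm.Model([tau, counts])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)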
I have trouble using WAIC (widely applicable information criterion) in PyMC3. Namely, I have data which I know to be distributed according to a multivariate Dirichlet distribution. I try to fit the data by assuming, in one case, that the marginal distributions are beta distributions and, in the other, that they are log-normal distributions. Obviously in the first case I get a lower (better) WAIC value than in the second case.
The problem arises in the third case, when I assume that the data is distributed according to a Dirichlet distribution. The third WAIC is significantly larger (worse) than in the first two cases. I would expect this WAIC to be lower (better) than the one I get in the second (log-normal) case.
Basically I want to show that the log-normal fit is bad. This is easily seen by the naked eye, but I would like to have a formal result to show.
The minimal code to replicate what I get:
import pandas as pd
import numpy as np
import pymc3 as pm
# generate the data
df=pd.DataFrame(np.random.dirichlet([10,10,10],size=2000))
# fit the first case (assuming beta marginal distributions)
betaModel=pm.Model()
with betaModel:
    alpha=pm.Uniform("alpha",lower=0,upper=20,shape=3)
    beta=pm.Uniform("beta",lower=0,upper=40,shape=3)
    observed=pm.Beta("obs",alpha=alpha,beta=beta,observed=df.values,shape=df.shape)
    betaTrace=pm.sample()
# fit the second case (assuming log-normal marginal distributions)
lognormalModel=pm.Model()
with lognormalModel:
    mu=pm.Normal("mu",mu=0,sd=3,shape=3)
    sd=pm.HalfNormal("sd",sd=3,shape=3)
    observed=pm.Lognormal("obs",mu=mu,sd=sd,observed=df.values,shape=df.shape)
    lognormalTrace=pm.sample()
# fit the third case (assuming Dirichlet multivariate distribution)
dirichletModel=pm.Model()
with dirichletModel:
    alpha=pm.HalfNormal("alpha",sd=3,shape=3)
    observed=pm.Dirichlet("obs",a=alpha,observed=df.values,shape=df.shape)
    dirichletTrace=pm.sample()
# compare WAIC
print(pm.waic(betaTrace,betaModel))
print(pm.waic(lognormalTrace,lognormalModel))
print(pm.waic(dirichletTrace,dirichletModel))
The output is:
WAIC_r(WAIC=-12801.95319823564, WAIC_se=105.07502476563575, p_WAIC=5.941977774190434)
WAIC_r(WAIC=-12534.643059697866, WAIC_se=115.43257184238044, p_WAIC=6.68850211472046)
WAIC_r(WAIC=-9156.050975326929, WAIC_se=81.45146980652675, p_WAIC=2.7977039523187996)
I guess the problem might be related to an error:
ValueError: operands could not be broadcast together with shapes (6000,) (2000,)
which I get when I try to run:
pm.compare((betaTrace,lognormalTrace,dirichletTrace),(betaModel,lognormalModel,dirichletModel))
Any suggestions how to obtain a reasonable comparison?
Edit
After thinking about the problem, I have come to believe that it is somewhat "improper". I tend to think so because WAIC is a relative measure, so it is likely that only similar statistical models can be reasonably compared. If the models are too dissimilar, then you get what I got.
The error I get from pm.compare seems to be related to how random vectors are treated. In the first two cases each component of a random vector is treated as a separate random variate (3 components per 2000 vectors = 6000 points). In the third case the random vector as a whole is treated as a single random variate (2000 vectors = 2000 points).
Initially I thought that this problem could be resolved by reducing the number of points in the first two cases. But as the first two statistical models (wrongly) assume that the components are independent, adding the log-probabilities does not change anything; the WAIC values remain the same.
Currently I think that a small cheat is possible: fit the data to the Dirichlet distribution, but calculate WAIC as if I had fitted the beta distribution. This gives the expected result: WAIC for the Dirichlet fit is slightly larger than WAIC for the beta fit, but smaller than WAIC for the log-normal fit.
The code for this "cheat":
from collections import namedtuple
from scipy.special import logsumexp
def cheat_logp(tracePoint,model):
    # pointwise log-probabilities of the data under the beta marginals implied
    # by the Dirichlet fit: component i ~ Beta(alpha_i, alpha_0 - alpha_i)
    values=model.obs.eval()
    _,components=values.shape
    cb=[None]*components
    beta=np.sum(tracePoint["alpha"])  # alpha_0, the total Dirichlet concentration
    for i in range(components):
        cheatBeta=pm.Beta.dist(alpha=tracePoint["alpha"][i],beta=beta-tracePoint["alpha"][i])
        cb[i]=cheatBeta.logp(values[:,i]).eval()
    return np.array(cb).T

def _log_post_trace(trace, model):
    # copy the contents of the _log_post_trace function from pymc3/stats.py,
    # but replace "var.logp_elemwise(pt)" with "cheat_logp(pt,model)"
    # <...>

def mywaic(trace, model=None, pointwise=False):
    # copy the contents of the waic function from pymc3/stats.py
    # <...>
Obviously this cheat is not very "nice", and I am still very much interested in how to achieve similar results in a proper manner, if that is possible at all.
I'm attempting to use the GaussianProcessRegressor from scikit-learn 0.18.1.
I'm training on 200 data points and using 13 input features for my kernel: one constant multiplied by a radial basis function with twelve elements. The model runs without complaints, but if I run the same script several times I notice that I sometimes get different solutions. It may be worth noting that several of the optimized parameters are running into the bounds I've provided (I'm currently working out which features matter).
I've tried increasing the parameter n_restarts_optimizer to 50, and while this takes considerably longer to run, it doesn't eliminate the apparent randomness. It seems possible to change the optimizer itself, though I've had no luck. From a quick scan, the most syntactically similar options are scipy's fmin_tnc and fmin_slsqp (other optimizers do not support bounds). However, using either of those causes other issues: for example, fmin_tnc does not return the value of the objective function at its minimum.
Are there any suggestions for how to have a more deterministic script? Ideally I'd like it to print the same values regardless of iteration, because as it stands it feels a bit like a lottery (and therefore drawing any conclusions is questionable).
A snippet of the code I'm using:
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
lbound = 1e-2
rbound = 1e1
n_restarts = 50
n_features = 12 # Actually determined elsewhere in the code
kernel = C(1.0, (lbound,rbound)) * RBF(n_features*[10], (lbound,rbound))
gp = GPR(kernel=kernel, n_restarts_optimizer=n_restarts)
gp.fit(train_input, train_outputs)
test_model, sigma2_pred = gp.predict(test_input, return_std=True)
print(gp.kernel_)
This uses random values to initialize optimization:
As the LML may have multiple local optima, the optimizer can be
started repeatedly by specifying n_restarts_optimizer.
As far as I understand, there will always be a random factor: the optimizer starts from randomly drawn initial hyperparameter values, and it can land in different local optima, which may lie at the bounds you've mentioned.
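If run-to-run repeatability is the main goal, one option (a suggestion on top of your snippet, not something from it) is to fix the regressor's random_state so the restarts draw the same initial hyperparameters every time; this pins the initialization but does not make a multi-modal likelihood unimodal:
from sklearn.gaussian_process import GaussianProcessRegressor as GPR
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
kernel = C(1.0, (lbound, rbound)) * RBF(n_features * [10], (lbound, rbound))
# random_state seeds the sampling of the restart points, so repeated runs
# of the same script explore the same set of initial values
gp = GPR(kernel=kernel, n_restarts_optimizer=n_restarts, random_state=0)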
If your data allows it (an invertible X^T X matrix), you can use the normal equation if it suits your needs; there is no random factor there.
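For reference, a minimal sketch of that normal-equation fit for an ordinary linear model y ≈ X w; X and y here are placeholder names for your own design matrix and targets, and this is plain least squares rather than a Gaussian process:
import numpy as np
# closed-form least-squares solution; requires X.T @ X to be invertible
w = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))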
You can also do ensemble-style sampling on top of this (something like a random forest), where you run the algorithm several times and choose the best fit or the most common value; you will have to weigh consistency against accuracy.
Hope I understood your question correctly.
I want to fit some data to a Pareto distribution using the scipy.stats library. I am not sure if the issue might be numerical, so just to be safe: I have values measured for the dependent variable (let's call them 'pushes') at values of the independent variable ('minutes'), starting at a few thousand minutes and every ten minutes thereafter (with the exception of a few points that were removed during data cleaning).
e.g.
2780.0 362.0
2800.0 376.0
2810.0 393.0
...
The best info I can find says something like
from scipy.stats import pareto
result = pareto.fit(data)
and I have no idea how the data should be formatted in this case. I've tried the following, but both result in errors.
result = pareto.fit(zip(minutes, pushes))
result = pareto.fit(pushes)
The error is usually
Warning: invalid value encountered in double_scalars
I would greatly appreciate some guidance. Thank you.
As I mentioned in the comments above, pareto.fit() is not what you're looking for.
The .fit() methods of the continuous distributions in scipy.stats obtain an estimate of the parameters of the distribution that maximise the probability of observing some particular set of sample values. Therefore, pareto.fit() wants only a single array argument containing the samples you want to fit the distribution to. The other keyword arguments control various aspects of the fitting process, for example by specifying initial values for the distribution parameters.
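As an illustration of the kind of call pareto.fit expects (a sketch with made-up sample values, not your data):
import numpy as np
from scipy.stats import pareto
# a 1-D array of samples drawn from the distribution you want to fit
samples = pareto.rvs(b=2.5, size=1000)
# returns maximum-likelihood estimates of the shape, location and scale
b_hat, loc_hat, scale_hat = pareto.fit(samples)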
What you're actually trying to do is to fit the relationship between some independent variable x and some dependent variable y, i.e.
y_fit = f(x, params)
What you need to do is:
Choose some functional form for f. From your description, the plot of y vs x resembles the probability density function for a Pareto distribution, so perhaps either this or a decaying exponential might be appropriate.
Find the set of params that minimize some measure of the difference between y and y_fit (usually the sum of squared differences). You could use scipy.optimize.curve_fit or scipy.optimize.minimize to do this.
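Here is a minimal sketch of the second step with scipy.optimize.curve_fit, assuming a simple power-law decay as the functional form; the function f and the starting values p0 are illustrative choices, not taken from the question:
import numpy as np
from scipy.optimize import curve_fit
# illustrative functional form: a Pareto-tail-like power-law decay
def f(x, a, b):
    return a * np.power(x, -b)
# minutes and pushes are the x and y arrays from the question
popt, pcov = curve_fit(f, minutes, pushes, p0=(1.0, 1.0))
pushes_fit = f(minutes, *popt)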
I'm using statsmodels to analyze some data. I determined six coefficients using OLS, and I get confidence intervals that I suppose are being calculated by the library estimating the variance as ssr / df_resid.
Let's suppose I now want to provide OLS with a custom variance I estimated, and make it use this to calculate the coefficients' confidence intervals. How do I do it?
It is a bit tricky to change attributes in the Results instances in statsmodels, because many attributes are calculated lazily and then cached. Any use of backdoors or direct assignment to attributes might result in inconsistent and incorrect results. However, since it's Python, it would be possible to change the internally stored cov_params_default.
One possibility to get the inference for the parameters based on a covariance matrix given by the user, is to use t_test which has a cov_p argument for providing a cov_params.
Something like
tt = results.t_test(np.eye(len(results.params)), cov_p = myscale * results.normalized_cov_params)
then tt contains the same information as the params table of summary(); see dir(tt) and print(tt).
However, since fixing the scale (the error variance estimate) is not an uncommon use case, it is now supported directly in current statsmodels master and will be released with 0.7. It is currently only supported for the linear regression models OLS, WLS and GLS.
https://github.com/statsmodels/statsmodels/pull/2137
for example for WLS:
res = WLS(ydata, xdata, weights=weights).fit(cov_type='fixed scale', cov_kwds={'scale':2})
I have a linear model that I'm trying to fit to data with a good number of outliers in the endogenous variable, but not in the exogenous space. I've read that RLMs based on M-estimators are good in this situation.
When I fit an RLM to my data in the following way:
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.api as sm  # .api is needed so that sm.robust.norms is available
modelspec = ('cost ~ np.log(units) + np.log(units):item + item') #where item is a categorical variable
results = smf.rlm(modelspec, data = dataset, M = sm.robust.norms.TukeyBiweight()).fit()
print(results.summary())
the summary shows a z statistic, and seemingly the coefficient tests of significance are based on this rather than a t statistic. However, the following R manual (http://www.dst.unive.it/rsr/BelVenTutorial.pdf) shows the use of t statistics on pp. 19-21.
Two questions:
Can someone explain to me conceptually why statsmodels uses a z-test rather than a t-test?
All terms and interactions are highly significant in the results (|z| > 4). In most cases each item has 40 or more observations; some items have only 21-25. Is there reason to believe that RLM is not effective in a small-sample setting? The line it produces must be the best-fit line after down-weighting outliers, but is the z-test effective for samples of this size? That is, is there a reason to believe the confidence intervals produced by smf.rlm() do not give 95% coverage? (I know this can potentially be an issue for t-tests.)
Thanks!
I have mostly only a general answer; I have never read any small-sample Monte Carlo studies for M-estimators.
To 1.
In many models, such as M-estimators (RLM) or generalized linear models (GLM), we have only asymptotic results, except maybe for a few special cases. The asymptotic results give conditions under which the estimator is asymptotically normally distributed. Given this, statsmodels defaults to the normal distribution for all models outside of the linear regression model (OLS) and similar, and to the chi-square distribution instead of the F distribution for Wald tests of joint hypotheses.
There is some evidence that in many cases using the t or F distribution with appropriate choice of degrees of freedom provides a better small sample approximation to the distribution of the test statistic. This relies on Monte Carlo results and is not directly justified by the theory, as far as I know.
In the next release of statsmodels, and in the current development version, users can choose to use the t and F distributions for the results instead of the normal and chi-square distributions. The defaults stay the same as they are now.
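As a concrete illustration, assuming a statsmodels version whose t_test method accepts the use_t argument (an assumption about your installed version, not something stated above), you can request t-based inference for all coefficients like this:
import numpy as np
# use_t=True asks for t-distribution based p-values and confidence intervals
# instead of the normal-distribution defaults
tt = results.t_test(np.eye(len(results.params)), use_t=True)
print(tt)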
There are other cases where it is not clear whether the t-distribution should be used at all, and which small-sample degrees of freedom would be appropriate. In many cases, statsmodels tries to follow the lead of Stata, for example for cluster-robust standard errors after OLS.
Another consequence is that sometimes equivalent models that are special cases of different models use different default assumptions on the distribution, both in Stata and in statsmodels.
I recently read the SAS documentation for M-estimators, and SAS is using the chisquare distribution, i.e. also the normal assumption, for the significance of the parameter estimates and for the confidence intervals.
To 2.
(see first sentence)
I think the same as for linear models also applies here. If the data is highly non-normal, then the test statistics could have incorrect coverage in small samples. This can also be the case with some robust sandwich covariance estimators. On the other hand, if we don't use heteroscedasticity or correlation robust covariance estimators, then the tests can also be strongly biased.
For robust estimation methods like M-estimators, RLM, the effective sample size also depends on the number of inliers, or the weights assigned to the observations, not just the total number of observations.
For your case, I think the z-values and the sample size are large enough that, for example, using the t-distribution would not make them much less significant.
Comparing M-estimators with different norms and scale estimates would provide an additional check on the robustness of the assumptions about the outliers and on the choice of robust estimator. Another cross-check: does OLS with the outliers dropped (the observations that get small weights in the RLM estimate) give a similar answer?
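A minimal sketch of that cross-check, reusing the names from the question and assuming no rows are dropped by the formula handling; the 0.5 weight threshold is an arbitrary illustrative choice:
import statsmodels.api as sm
import statsmodels.formula.api as smf
rlm_results = smf.rlm(modelspec, data=dataset, M=sm.robust.norms.TukeyBiweight()).fit()
# keep only the observations the robust fit did not strongly down-weight
keep = rlm_results.weights > 0.5
ols_results = smf.ols(modelspec, data=dataset[keep]).fit()
print(ols_results.summary())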
Finally, as a general caution:
The references on robust methods often warn that we should not use (outlier-)robust methods blindly. Using robust methods estimates a relationship based on "inliers". But is our discarding or down-weighting of outliers justified? Or, do we have missing non-linearities, missing variables, a mixture distribution or different regimes?