PyMC3: PositiveDefiniteError when sampling a Categorical variable - python

I am trying to sample a simple model of a categorical distribution with a Dirichlet prior. Here is my code:
import numpy as np
from scipy import optimize
from pymc3 import *
k = 6
alpha = 0.1 * np.ones(k)
with Model() as model:
p = Dirichlet('p', a=alpha, shape=k)
categ = Categorical('categ', p=p, shape=1)
tr = sample(10000)
And I get this error:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [0 1 2 3 4]

The problem is that NUTS is failing to initialize properly. One solution is to use another sampler like this:
with pm.Model() as model:
p = pm.Dirichlet('p', a=alpha)
categ = pm.Categorical('categ', p=p)
step = pm.Metropolis(vars=p)
tr = pm.sample(1000, step=step)
Here I am manually assigning p to Metropolis, and letting PyMC3 assign categ to a proper sampler.

Related

Custom F-test in linearmodels (Python)

I have fit a linearmodels.PanelOLS model and stored it in m. I now want to test if certain coefficients are simultaneously equal to zero.
Does a fitted linearmodels.PanelOLS object have an F-test function where I can pass my own restriction matrix?
I am looking for something like statsmodels' f_test method.
Here's a minimum reproducible example.
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
# Is there an f_test method for m???
m.f_test(r_mat=some_matrix_here) # Something along these lines?
You can use wald_test (a standard F-test is numerically identical to a Walkd test under some assumptions on the covariance).
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
Then the test
import numpy as np
# Use matrix notation RB - q = 0 where R is restr and q is value
# Restrictions: expersq = 0.001 & expersq+married = 0.2
restr = np.array([[0,1,0],[0,1,1]])
value = np.array([0.01, 0.2])
m.wald_test(restr, value)
This returns
Linear Equality Hypothesis Test
H0: Linear equality constraint is valid
Statistic: 0.2608
P-value: 0.8778
Distributed: chi2(2)
WaldTestStatistic, id: 0x2271cc6fdf0
You can also use formula syntax if you used formulas to define your model, which can be easier to code up.
fm = PanelOLS.from_formula("lwage~ 1 + expersq + married", data=df)
fm = fm.fit(cov_type='clustered', cluster_entity=True)
fm.wald_test(formula="expersq = 0.001,expersq+married = 0.2")
The result is the same as above.

Why does it work when columns are larger than rows in Python Sklearn (Linear Regression) [duplicate]

it's known that when the number of variables (p) is larger than the number of samples (n) the least square estimator is not defined.
In sklearn I receive this values:
In [30]: lm = LinearRegression().fit(xx,y_train)
In [31]: lm.coef_
Out[31]:
array([[ 0.20092363, -0.14378298, -0.33504391, ..., -0.40695124,
0.08619906, -0.08108713]])
In [32]: xx.shape
Out[32]: (1097, 3419)
Call [30] should return an error. How does sklearn work when p>n like in this case?
EDIT:
It seems that the matrix is filled with some values
if n > m:
# need to extend b matrix as it will be filled with
# a larger solution matrix
if len(b1.shape) == 2:
b2 = np.zeros((n, nrhs), dtype=gelss.dtype)
b2[:m,:] = b1
else:
b2 = np.zeros(n, dtype=gelss.dtype)
b2[:m] = b1
b1 = b2
When the linear system is underdetermined, then the sklearn.linear_model.LinearRegression finds the minimum L2 norm solution, i.e.
argmin_w l2_norm(w) subject to Xw = y
This is always well defined and obtainable by applying the pseudoinverse of X to y, i.e.
w = np.linalg.pinv(X).dot(y)
The specific implementation of scipy.linalg.lstsq, which is used by LinearRegression uses get_lapack_funcs(('gelss',), ... which is precisely a solver that finds the minimum norm solution via singular value decomposition (provided by LAPACK).
Check out this example
import numpy as np
rng = np.random.RandomState(42)
X = rng.randn(5, 10)
y = rng.randn(5)
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
coef1 = lr.fit(X, y).coef_
coef2 = np.linalg.pinv(X).dot(y)
print(coef1)
print(coef2)
And you will see that coef1 == coef2. (Note that fit_intercept=False is specified in the constructor of the sklearn estimator, because otherwise it would subtract the mean of each feature before fitting the model, yielding different coefficients)

Non-Linearity Test NaN error

I wanted to try statsmodel's linear_harvey_collier test with an easy example. However, I get nan as a result. Can you see, where my error lies?
import numpy as np
from statsmodels.regression.linear_model import OLS
np.random.seed(44)
n_samples, n_features = 50, 4
X = np.random.randn(n_samples, n_features)
coef=np.random.uniform(-12,12,4)
y = np.dot(X, coef)
var = 400
y += var**(1/2) * np.random.normal(size=n_samples)
regr=OLS(y, X).fit()
print(regr.params)
print(regr.summary())
sms.linear_harvey_collier(regr)
I get the result Ttest_1sampResult(statistic=nan, pvalue=nan).
If I perform the test while exluding one variable I get a result:
X3=X[:,:3]
regr3=OLS(y, X3).fit()
In [1]: sms.linear_harvey_collier(regr3)
Out[2]: Ttest_1sampResult(statistic=0.2447803429683807, pvalue=0.806727747845282)
Is there a problem with not adding a constant and intercept? This is just a feeling and if there is indeed a problem, I don't understand why.
There is a bug in linear_harvey_collier, that hard codes the number of initial observations to 3.
https://github.com/statsmodels/statsmodels/pull/6727
linear_harvey_collier has only two lines of code.
A workaround is to compute the test directly
res = regr
from scipy import stats
skip = len(res.params) # bug in linear_harvey_collier
rr = sms.recursive_olsresiduals(res, skip=skip, alpha=0.95, order_by=None)
stats.ttest_1samp(rr[3][skip:], 0)
Ttest_1sampResult(statistic=0.03092937323130299, pvalue=0.9754626388210277)

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds to my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying gaussian components, so the number of parameters I need will vary. I found some hints here, but still can't figure it out...
Creating a python lmfit Model with arbitrary number of parameters
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model
def add_peaks(x_range, *pars):
y = np.zeros(len(x_range))
for i in np.arange(0, len(pars), 3):
curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
y = y + curve
return(y)
# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c
param_dict = OrderedDict()
for i in range(0, len(peaks)):
param_dict['pk' + str(i)] = peaks[i]
param_dict['wid' + str(i)] = 1.
param_dict['mult' + str(i)] = 1.
# In case, you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()
# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
params.add(i, value=param_dict[i])
result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfits ability to build composite model.
That is, with a single peak defined with
from scipy.stats import norm
def peak(x, amp, center, sigma):
return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
model = model + Model(peak, prefix='p%d_' % (i+1))
params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., which you can add sensible initial values and/or bounds and/or constraints.
Given your example data, you could pass in initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
p2_amp=0.2, p2_center=40., p2_sigma=2,
p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
from lmfit.models import GaussianModel
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']
mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
pars[prefix + 'center'].set(peaks[i])
init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()

How to specify size for bernoulli distribution with pymc3?

In trying to make my way through Bayesian Methods for Hackers, which is in pymc, I came across this code:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
I've tried to translate this to pymc3 with the following, but it just returns a numpy array, rather than a tensor (?):
first_coin_flips = pm.Bernoulli("first_flips", 0.5).random(size=50)
The reason the size matters is that it's used later on in a deterministic variable. Here's the entirety of the code that I have so far:
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import mpld3
import theano.tensor as tt
model = pm.Model()
with model:
N = 100
p = pm.Uniform("cheating_freq", 0, 1)
true_answers = pm.Bernoulli("truths", p)
print(true_answers)
first_coin_flips = pm.Bernoulli("first_flips", 0.5)
second_coin_flips = pm.Bernoulli("second_flips", 0.5)
# print(first_coin_flips.value)
# Create model variables
def calc_p(true_answers, first_coin_flips, second_coin_flips):
observed = first_coin_flips * true_answers + (1-first_coin_flips) * second_coin_flips
# NOTE: Where I think the size param matters, since we're dividing by it
return observed.sum() / float(N)
calced_p = pm.Deterministic("observed", calc_p(true_answers, first_coin_flips, second_coin_flips))
step = pm.Metropolis(model.free_RVs)
trace = pm.sample(1000, tune=500, step=step)
pm.traceplot(trace)
html = mpld3.fig_to_html(plt.gcf())
with open("output.html", 'w') as f:
f.write(html)
f.close()
And the output:
The coin flips and uniform cheating_freq output look correct, but the observed doesn't look like anything to me, and I think it's because I'm not translating that size param correctly.
The pymc3 way to specify the size of a Bernoulli distribution is by using the shape parameter, like:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)

Categories