Using PyMC3 to fit a stretched exponential: bad initial energy

I am trying to adapt the simplest getting-started example of PyMC3 (https://docs.pymc.io/notebooks/getting_started.html), the motivating linear-regression example, to fit a stretched exponential instead.
The simplest version of the model I tried is y = exp(-x**beta)
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# Initialize random number generator
np.random.seed(1234)
# True parameter values
sigma = .1
beta = 1
# Size of dataset
size = 1000
# Predictor variable
X1 = np.random.randn(size)
# Simulate outcome variable
Y = np.exp(-X1**beta) + np.random.randn(size)*sigma
# specify the model
import pymc3 as pm
import theano.tensor as tt
print('Running on PyMC3 v{}'.format(pm.__version__))
basic_model = pm.Model()
with basic_model:
    # Priors for unknown model parameters
    beta = pm.HalfNormal('beta', sigma=1)
    sigma = pm.HalfNormal('sigma', sigma=1)
    # Expected value of outcome
    mu = pm.math.exp(-X1**beta)
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=Y)
with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500)
which yields the output
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, beta]
Sampling 4 chains: 0%| | 0/4000 [00:00<?, ?draws/s]/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
Bad initial energy, check any log probabilities that are inf or -inf, nan or very small:
Y_obs NaN
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
point, stats = self._compute_point()
File "/opt/conda/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File "/opt/conda/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/opt/conda/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
raise SamplingError("Bad initial energy")
pymc3.exceptions.SamplingError: Bad initial energy
"""
The above exception was the direct cause of the following exception:
SamplingError Traceback (most recent call last)
SamplingError: Bad initial energy
The above exception was the direct cause of the following exception:
ParallelSamplingError Traceback (most recent call last)
<ipython-input-310-782c941fbda8> in <module>
1 with basic_model:
2 # draw 500 posterior samples
----> 3 trace = pm.sample(500)
/opt/conda/lib/python3.7/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
435 _print_step_hierarchy(step)
436 try:
--> 437 trace = _mp_sample(**sample_args)
438 except pickle.PickleError:
439 _log.warning("Could not pickle model, sampling singlethreaded.")
/opt/conda/lib/python3.7/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, **kwargs)
967 try:
968 with sampler:
--> 969 for draw in sampler:
970 trace = traces[draw.chain - chain]
971 if (trace.supports_sampler_stats
/opt/conda/lib/python3.7/site-packages/pymc3/parallel_sampling.py in __iter__(self)
391
392 while self._active:
--> 393 draw = ProcessAdapter.recv_draw(self._active)
394 proc, is_last, draw, tuning, stats, warns = draw
395 if self._progress is not None:
/opt/conda/lib/python3.7/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
295 else:
296 error = RuntimeError("Chain %s failed." % proc.chain)
--> 297 raise error from old_error
298 elif msg[0] == "writing_done":
299 proc._readable = True
ParallelSamplingError: Bad initial energy
INFO (theano.gof.compilelock): Waiting for existing lock by process '30255' (I am process '30252')
INFO (theano.gof.compilelock): To manually release the lock, delete /home/jovyan/.theano/compiledir_Linux-4.4--generic-x86_64-with-debian-buster-sid-x86_64-3.7.3-64/lock_dir
/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/opt/conda/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
Instead of the stretched exponential, I have also tried power laws and sine functions. It seems to me that the problem arises as soon as my model is not injective. Could this be the issue (as is probably apparent, I am a newbie in this field)? Can I restrict sampling to only positive x values? Are there any tricks for this?

So the problem here is that
X1**beta
is only defined when X1 >= 0 or when beta is an integer. Since beta here is a continuous random variable, it will be a non-integer float at essentially every point the sampler evaluates, so for the negative entries of X1 many values of
mu = pm.math.exp(-X1**beta)
will be nan.
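A quick NumPy check makes this concrete (an illustration of mine, not from the thread):
import numpy as np
# Negative base with a non-integer exponent -> nan under floating-point semantics
print(np.array([-2.0, 0.5, 3.0]) ** 0.7)   # [nan 0.6156... 2.1577...]
Every nan in mu then makes the observed likelihood nan, which is exactly the "Bad initial energy" failure.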
I found this out with
>>> basic_model.check_test_point()
beta_log__ -0.77
sigma_log__ -0.77
Y_obs NaN
Name: Log-probability of test_point, dtype: float64
I am not sure what model you are trying to specify! There are ways to require beta to be an integer, and ways to require that X1 be positive, but I would need more details to help you describe the model.
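For illustration, here is a minimal sketch of the "keep X1 positive" route (the positive predictor is my assumption about the intended model, since the original data generation uses np.random.randn):
import numpy as np
import pymc3 as pm

np.random.seed(1234)
size = 1000
sigma_true = 0.1
beta_true = 1.0
# Positive predictor (assumption), so X1**beta is defined for any beta > 0
X1 = np.abs(np.random.randn(size))
Y = np.exp(-X1**beta_true) + np.random.randn(size) * sigma_true

with pm.Model() as basic_model:
    beta = pm.HalfNormal('beta', sigma=1)
    sigma = pm.HalfNormal('sigma', sigma=1)
    mu = pm.math.exp(-X1**beta)   # no nan now: the base is non-negative
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=Y)
    trace = pm.sample(500)
With a non-negative base, check_test_point should return a finite log-probability for Y_obs and NUTS initializes normally.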

Related

PyMC3 %-th leading minor of the array is not positive definite

I am trying to configure a PyMC3 Polynomial kernel with the following hyperpriors:
with pm.Model() as self.model:
    EPSILON = 0.1
    l = pm.Gamma("l", alpha=2, beta=1)
    offset = pm.Gamma("offset", alpha=2, beta=1)
    nu = pm.HalfCauchy("nu", beta=1)
    d = pm.HalfNormal("d", sd=5)
    cov = nu ** 2 * pm.gp.cov.Polynomial(X.shape[1], l, d, offset)
    self.gp = pm.gp.Marginal(cov_func=cov)
    sigma = pm.HalfCauchy("sigma", beta=1)
    y_ = self.gp.marginal_likelihood("y", X=X, y=Y, noise=sigma)
    self.map_trace = [pm.find_MAP()]
However, I'm getting a "Cholesky decomposition failed" error, as follows:
LinAlgError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/theano/compile/function_module.py in __call__(self, *args, **kwargs)
902 outputs =\
--> 903 self.fn() if output_subset is None else\
904 self.fn(output_subset=output_subset)
24 frames
LinAlgError: 7-th leading minor of the array is not positive definite
During handling of the above exception, another exception occurred:
LinAlgError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/scipy/linalg/decomp_cholesky.py in _cholesky(a, lower, overwrite_a, clean, check_finite)
38 if info > 0:
39 raise LinAlgError("%d-th leading minor of the array is not positive "
---> 40 "definite" % info)
41 if info < 0:
42 raise ValueError('LAPACK reported an illegal value in {}-th argument'
LinAlgError: 7-th leading minor of the array is not positive definite
Apply node that caused the error: Cholesky{lower=True, destructive=False, on_error='raise'}(Elemwise{Composite{((sqr(i0) * i1) + i2 + i3)}}[(0, 0)].0)
Toposort index: 11
Inputs types: [TensorType(float64, matrix)]
Inputs shapes: [(40, 40)]
Inputs strides: [(320, 8)]
Inputs values: ['not shown']
Outputs clients: [[Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}(Cholesky{lower=True, destructive=False, on_error='raise'}.0, TensorConstant{[ 69.79 .. 472.83]}), Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}(Cholesky{lower=True, destructive=False, on_error='raise'}.0, Elemwise{Composite{(sqr(i0) * i1)}}[(0, 0)].0)]]
Changing the hyperpriors changes the error slightly: instead of the 7th leading minor it reports some other x-th leading minor. But I'm not sure whether this is caused by the hyperpriors or by something else.
Any thoughts are welcome :)
Thanks
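One hedged guess, since this kind of failure is often numerical rather than statistical: the Cholesky decomposition can fail when the sampled hyperparameters make the covariance matrix nearly singular, and bounding the noise term away from zero sometimes stabilizes it. A minimal sketch (the 1e-4 floor is my assumption, not something from this thread):
sigma = pm.HalfCauchy("sigma", beta=1)
# hypothetical jitter floor; the exact value 1e-4 is an assumption
y_ = self.gp.marginal_likelihood("y", X=X, y=Y, noise=sigma + 1e-4)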

Calling Gekko solve gives TypeError: object of type 'int' has no len()

I'm trying to solve an optimal control problem using Gekko. When I try to call m.solve(), it gives me TypeError: object of type 'int' has no len(); details below. I get this error regardless of my choice of objective function; however, the only similar report I've found involved non-differentiable constraints, and I'm pretty sure my constraints are differentiable. Is there another reason I might get this type of error with Gekko?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-9f7b73717b27> in <module>
1 from gekko import GEKKO
----> 2 solve_system()
<ipython-input-24-f224d4cff3fc> in solve_system(theta, alpha, rho, chi, L_bar, n, w, delta_inc, xi, phi, tau, kappa, GAMMA, T, SIGMA, BETA, s_init, i_init, r_init)
257 ##### solve model #####
258 m.options.IMODE = 6
--> 259 m.solve()
~\Anaconda3\lib\site-packages\gekko\gekko.py in solve(self, disp, debug, GUI, **kwargs)
1955 # Build the model
1956 if self._model != 'provided': #no model was provided
-> 1957 self._build_model()
1958 if timing == True:
1959 print('build model', time.time() - t)
~\Anaconda3\lib\site-packages\gekko\gk_write_files.py in _build_model(self)
54 model += '\t%s' % variable
55 if not isinstance(variable.VALUE.value, (list,np.ndarray)):
---> 56 if not (variable.VALUE==None):
57 i = 1
58 model += ' = %s' % variable.VALUE
~\Anaconda3\lib\site-packages\gekko\gk_operators.py in __len__(self)
23 return self.name
24 def __len__(self):
---> 25 return len(self.value)
26 def __getitem__(self,key):
27 return self.value[key]
~\Anaconda3\lib\site-packages\gekko\gk_operators.py in __len__(self)
142
143 def __len__(self):
--> 144 return len(self.value)
145
146 def __getitem__(self,key):
TypeError: object of type 'int' has no len()
I do call an external (but differentiable) function in my constraints. However, removing it and just doing the work without the function didn't solve the issue. I'd really appreciate any input y'all might be able to offer. Thank you!
This error may be because you are using a Numpy array or Python list inside a Gekko expression.
import numpy as np
x = np.array([0,1,2,3]) # numpy array
y = [2,3,4,5] # python list
from gekko import GEKKO
m = GEKKO()
m.Minimize(x) # error, use Gekko Param or Var
m.Equation(m.sum(x)==5) # error, use Gekko Param or Var
You can avoid this error by switching to a Gekko parameter or variable. Gekko can initialize with a Python list or Numpy array.
xg = m.Param(x)
yg = m.Var(y)
m.Minimize(xg)
m.Equation(m.sum(yg)==5)
m.solve()
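If you need an array of decision variables rather than fixed initial values, m.Array keeps everything symbolic as well (a small sketch of mine; the dimension 4 is arbitrary):
z = m.Array(m.Var, 4)              # 4 Gekko decision variables
m.Equation(m.sum(list(z)) == 5)    # Gekko sum over Gekko variables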

TypeError: flip() missing 1 required positional argument: 'axis'

I am trying to plot the KMeans sum of squares using KElbowVisualizer from the yellowbrick library. The code was working fine before, but strangely a TypeError started popping up saying "flip() missing 1 required positional argument: 'axis'". I have some idea that it might be related to the numpy version but cannot figure it out. The code I want to run is below, along with its error.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer
# Generate synthetic dataset with 8 random clusters
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-6e34e2651568> in <module>
11 visualizer = KElbowVisualizer(model, k=(4,12))
12
---> 13 visualizer.fit(X) # Fit the data to the visualizer
14 visualizer.show()
/anaconda3/lib/python3.7/site-packages/yellowbrick/cluster/elbow.py in fit(self, X, y, **kwargs)
332 }.get(self.metric, {})
333 elbow_locator = KneeLocator(
--> 334 self.k_values_, self.k_scores_, **locator_kwargs
335 )
336 if elbow_locator.knee is None:
/anaconda3/lib/python3.7/site-packages/yellowbrick/utils/kneed.py in __init__(self, x, y, S, curve_nature, curve_direction)
108 self.y_normalized,
109 self.curve_direction,
--> 110 self.curve_nature,
111 )
112 # normalized difference curve
/anaconda3/lib/python3.7/site-packages/yellowbrick/utils/kneed.py in transform_xy(x, y, direction, curve)
164 # flip decreasing functions to increasing
165 if direction == "decreasing":
--> 166 y = np.flip(y)
167
168 if curve == "convex":
TypeError: flip() missing 1 required positional argument: 'axis'
This error points at an older NumPy: the axis argument of np.flip only became optional in NumPy 1.15, and yellowbrick's kneed helper calls np.flip(y) without it. Upgrading NumPy should fix it; alternatively, pass the axis explicitly:
np.flip(y, axis=0)
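A quick way to confirm the diagnosis (a sketch of mine; the 1-D y mirrors what kneed passes internally):
import numpy as np
print(np.__version__)          # np.flip's axis argument became optional in 1.15
y = np.array([3.0, 2.0, 1.0])
print(np.flip(y, axis=0))      # explicit axis also works on older NumPy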

Converting TfidfVectorizer sparse matrix to dataframe or dense array results in memory error

My input is a pandas dataframe ("vector") with one column and 178885 rows holding strings with up to 600 words each.
0 this is an example text...
1 more examples...
...
178885 last example
Name: vectortext, Length: 178886, dtype: object
I'm doing feature extraction (unigrams) using the TfidfVectorizer:
vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector).toarray()
X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
k = len(X.columns) #number of features
Unfortunately I'm receiving a MemoryError, as below. I'm using the 64-bit version of Python 3.6 with 16GB RAM on my Windows 10 machine. I've read a lot about Python generators etc., but I can't figure out how to solve this problem without limiting the number of features (which is not really an option). Any ideas how to solve this? Could I somehow split my dataframe beforehand?
Traceback:
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-88-15b6091ceec7> in <module>()
1 vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
----> 2 X = vectorizer_uni.fit_transform(vector).toarray()
3 X = pd.DataFrame(X, columns=vectorizer_uni.get_feature_names()) #map grams
4 k = len(X.columns) # number of features
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
962 def toarray(self, order=None, out=None):
963 """See the docstring for `spmatrix.toarray`."""
--> 964 return self.tocoo(copy=False).toarray(order=order, out=out)
965
966 ##############################################################
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\coo.py in toarray(self, order, out)
250 def toarray(self, order=None, out=None):
251 """See the docstring for `spmatrix.toarray`."""
--> 252 B = self._process_toarray_args(order, out)
253 fortran = int(B.flags.f_contiguous)
254 if not fortran and not B.flags.c_contiguous:
C:\Programme\Anaconda3\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
1037 return out
1038 else:
-> 1039 return np.zeros(self.shape, dtype=self.dtype, order=order)
1040
1041 def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):
MemoryError:
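For scale: a dense array of 178886 rows times a vocabulary of, say, 100000 terms (an assumed size) at 8 bytes per float64 is roughly 143 GB, which is why .toarray() must fail on 16GB of RAM. A minimal sketch of the usual workaround, reusing the names from the question (vector, stop), is to keep the matrix sparse and never densify:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_uni = TfidfVectorizer(ngram_range=(1,1), use_idf=True, analyzer="word", stop_words=stop)
X = vectorizer_uni.fit_transform(vector)        # scipy sparse matrix; no .toarray()
k = len(vectorizer_uni.get_feature_names())     # feature count without densifying
Most scikit-learn estimators accept the sparse matrix directly, so the pandas DataFrame step can usually be dropped.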

cov_type='HAC' error statsmodels 0.7 IPython 3 notebook python 2.7 anaconda mac os x

I was trying to fit an OLS model. It works correctly without robust estimation, but I want to improve my regression, so I tried the robust version below and ran into this problem; the commented lines show other attempts to solve it.
I don't know if I am applying the keyword correctly, so I would appreciate any help.
Code:
# Fit and summarize OLS model
sumrz = dict()
for i, ca in enumerate(ccaa):
    x = sm.add_constant(data.dy[ca])
    mod = sm.OLS(endog=data.du[ca], exog=x, hasconst=True, missing='drop')
    res = mod.fit(cov_type='HAC', cov_kwds={'maxlags':1})
    # res = res.get_robustcov_results(cov_type='HAC', maxlags=1, use_correction=True)
    # res = res.get_robustcov_results(cov_type='HC0')
    sumrz[ca] = res.summary(xname=['const','dy'], yname='du', title=ca)
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-114-87912e59a35d> in <module>()
9 # res = res.get_robustcov_results(cov_type='HAC', maxlags=1, use_correction=True)
10 # res = res.get_robustcov_results(cov_type='HC0')
---> 11 sumrz[ca] = res.summary(xname=['const','dy'], yname='du', title=ca)
/Users/mmngreco/anaconda/lib/python2.7/site-packages/statsmodels/regression/linear_model.pyc in summary(self, yname, xname, title, alpha)
1950 top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
1951 ('Adj. R-squared:', ["%#8.3f" % self.rsquared_adj]),
-> 1952 ('F-statistic:', ["%#8.4g" % self.fvalue] ),
1953 ('Prob (F-statistic):', ["%#6.3g" % self.f_pvalue]),
1954 ('Log-Likelihood:', None), #["%#6.4g" % self.llf]),
/Users/mmngreco/anaconda/lib/python2.7/site-packages/statsmodels/tools/decorators.pyc in __get__(self, obj, type)
92 if _cachedval is None:
93 # Call the "fget" function
---> 94 _cachedval = self.fget(obj)
95 # Set the attribute in obj
96 # print("Setting %s in cache to %s" % (name, _cachedval))
/Users/mmngreco/anaconda/lib/python2.7/site-packages/statsmodels/regression/linear_model.pyc in fvalue(self)
1214 # assume const_idx exists
1215 idx = lrange(k_params)
-> 1216 idx.pop(const_idx)
1217 mat = mat[idx] # remove constant
1218 ft = self.f_test(mat)
TypeError: an integer is required
(It's good to see a full traceback in a question.)
The following is my guess based on the traceback.
I guess there is a bug in the constant detection when hasconst=True is specified.
Try leaving out the hasconst=True argument.
Background
If we don't allow for misspecified heteroscedasticity or correlation, and we don't use a robust covariance matrix, then the F statistic can be calculated from the residual sum of squares.
If a robust cov_type is specified, then we use the Wald test for the null hypothesis that all slope coefficients are zero. This is valid with a robust covariance of the parameters even if heteroscedasticity or correlation are misspecified.
In this case the index for the column with the constant, const_idx, is not correctly set and we get the TypeError.
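A minimal sketch of that suggestion, reusing the names from the question (data, ca, and the loop are assumed to exist as above):
x = sm.add_constant(data.dy[ca])
mod = sm.OLS(endog=data.du[ca], exog=x, missing='drop')   # hasconst left out
res = mod.fit(cov_type='HAC', cov_kwds={'maxlags': 1})
print(res.summary(xname=['const', 'dy'], yname='du', title=ca))
Here add_constant makes the constant column explicit, so statsmodels can detect const_idx itself and the Wald-based F-statistic should no longer hit the TypeError.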
