I have been trying to use the pymc3 package but have constantly been receiving errors. First off, when I import the pymc3 package, here is what happens:
import pymc3 as pm
WARNING (theano.configdefaults): g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations. Performance will be severely degraded. To remove this warning, set Theano flags cxx to an empty string.
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Afterwards, here is my code:
import numpy as np

x = np.linspace(-5, 5, 50) # Wavelength data. Here we have fifty points between -5 and 5
sigma = 1
mu = 0
A = 20
B = 100
# define underlying model -- Gaussian
y = A * np.exp( - (x - mu)**2 / (2 * sigma**2)) + B
y_noise = np.random.normal(0, 1, 50) # Let's add some noise
data = y+y_noise
# Set model
basic_model = pm.Model()
with basic_model:
    # Priors for unknown model parameters
    A = pm.Uniform("A", lower=0, upper=50)
    B = pm.Uniform("B", lower=0, upper=200)
    sigma = 1
    # Expected value of outcome
    y_m = A * np.exp(-(x)**2 / 2) + B
    # Likelihood of observations
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=np.sqrt(data), observed=data)
# Now sample
with basic_model:
    # draw posterior samples
    trace = pm.sample_smc(100, parallel=True)
And here is the error output:
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/sample_smc.py", line 267, in sample_smc_int
    smc.setup_kernel()
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/smc.py", line 135, in setup_kernel
    self.likelihood_logp_func = logp_forw([self.model.datalogpt], self.variables, shared)
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/smc.py", line 288, in logp_forw
    f = theano_function([inarray0], out_list[0])
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/__init__.py", line 337, in function
    fn = pfunc(
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/pfunc.py", line 524, in pfunc
    return orig_function(
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py", line 1970, in orig_function
    m = Maker(
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py", line 1573, in __init__
    self._check_unused_inputs(inputs, outputs, on_unused_input)
  File "/home/osgrinds/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/theano/compile/function/types.py", line 1745, in _check_unused_inputs
    raise UnusedInputError(msg % (inputs.index(i), i.variable, err_msg))
theano.compile.function.types.UnusedInputError: theano.function was asked to create a function computing outputs given certain inputs, but the provided input variable at index 0 is not part of the computational graph needed to compute the outputs: inarray. To make this error into a warning, you can pass the parameter on_unused_input='warn' to theano.function. To disable it completely, use on_unused_input='ignore'.
"""
The above exception was the direct cause of the following exception:
UnusedInputError                          Traceback (most recent call last)
Input In [7], in <cell line: 2>()
      1 # Now sample
      2 with basic_model:
      3     # draw posterior samples
----> 4     trace = pm.sample_smc(100, parallel=True)

File ~/anaconda3/envs/pymc3Env/lib/python3.10/site-packages/pymc3/smc/sample_smc.py:196, in sample_smc(draws, kernel, n_steps, start, tune_steps, p_acc_rate, threshold, save_sim_data, save_log_pseudolikelihood, model, random_seed, parallel, chains, cores)
    194 loggers = [_log] + [None] * (chains - 1)
    195 pool = mp.Pool(cores)
--> 196 results = pool.starmap(
    197     sample_smc_int, [(*params, random_seed[i], i, loggers[i]) for i in range(chains)]
    198 )
    200 pool.close()
    201 pool.join()

File ~/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py:372, in Pool.starmap(self, func, iterable, chunksize)
    366 def starmap(self, func, iterable, chunksize=None):
    367     '''
    368     Like map() method but the elements of the iterable are expected to
    369     be iterables as well and will be unpacked as arguments. Hence
    370     func and (a, b) becomes func(a, b).
    371     '''
--> 372     return self._map_async(func, iterable, starmapstar, chunksize).get()

File ~/anaconda3/envs/pymc3Env/lib/python3.10/multiprocessing/pool.py:771, in ApplyResult.get(self, timeout)
    769     return self._value
    770 else:
--> 771     raise self._value

UnusedInputError: theano.function was asked to create a function computing outputs given certain inputs, but the provided input variable at index 0 is not part of the computational graph needed to compute the outputs: inarray. To make this error into a warning, you can pass the parameter on_unused_input='warn' to theano.function. To disable it completely, use on_unused_input='ignore'.
I am simply following a tutorial on Medium, so I don't think there is a problem with the code. I have a strong feeling that the problem arises from the way I installed the packages. I installed pymc3 in a conda environment using these three commands:
conda install numpy scipy mkl
conda install theano pygpu
conda install pymc3
I have also tried installing pymc3 by following the developers' guide on GitHub:
conda create -c conda-forge -n pymc3_env pymc3 theano-pymc mkl mkl-service
conda activate pymc3_env
I was able to replicate the issue in PyMC3. It appears to be a problem with the SMC sampler specifically when combined with multiprocessing: setting parallel=False gets the SMC sampler working, and switching to NUTS or Metropolis sampling also works fine.
Please file a bug report on the PyMC repository. However, you may also want to try upgrading to PyMC v4 first.
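For reference, the workaround is a one-line change to the sampling call from the question (parallel=False disables the multiprocessing pool inside sample_smc):
with basic_model:
    # draw posterior samples without the multiprocessing pool
    trace = pm.sample_smc(100, parallel=False)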
Related
I am working on .wav signals using Python 3.5, trying to extract MFCC, MFCC delta, MFCC delta-deltas, and other signal features, but an error is raised only for the MFCC delta:
Traceback (most recent call last):
mfcc_delta = librosa.feature.delta(mfcc)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\librosa\feature\utils.py", line 116, in delta
**kwargs)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 337, in savgol_filter
coeffs = savgol_coeffs(window_length, polyorder, deriv=deriv, delta=delta)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\signal\_savitzky_golay.py", line 139, in savgol_coeffs
coeffs, _, _, _ = lstsq(A, y)
File "C:\Users\hp\AppData\Local\Programs\Python\Python35\lib\site-packages\scipy\linalg\basic.py", line 1226, in lstsq
% (-info, lapack_driver))
ValueError: illegal value in 4-th argument of internal None
I am working on the following code:
import numpy as np
import librosa
from scipy import signal
from scipy.signal import butter, filtfilt
import scipy.stats

def preprocess_cough(x, fs, cutoff=6000, normalize=True, filter_=True, downsample=True):
    # Preprocess data
    fs_downsample = cutoff * 2  # target sampling rate; not defined in the original snippet, assumed here as twice the cutoff
    if len(x.shape) > 1:
        x = np.mean(x, axis=1)  # Convert to mono
    if normalize:
        x = x / (np.max(np.abs(x)) + 1e-17)  # Normalize to the range -1 to 1
    if filter_:
        b, a = butter(4, fs_downsample / fs, btype='lowpass')  # 4th-order Butterworth lowpass filter
        x = filtfilt(b, a, x)
    if downsample:
        x = signal.decimate(x, int(fs / fs_downsample))  # Downsample for anti-aliasing
    fs_new = fs_downsample
    return np.float32(x), fs_new

audio_data = 'F:/test/'
files = librosa.util.find_files(audio_data, ext=['wav'])
x, fs = librosa.load(files[0], sr=48000)  # the original used an undefined `myFile`; one file from `files` is assumed here
arr, f = preprocess_cough(x, fs)
mfcc = librosa.feature.mfcc(y=arr, sr=f, n_mfcc=13)
mfcc_delta = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
When I remove the MFCC calculations and compute the other wav signal features, the error does not appear. I have also tried removing the n_mfcc=13 parameter, but the error is still raised.
Here is a sample of the output and the shape of the mfcc variable:
[-3.86701782e+02 -4.14421021e+02 -4.67373749e+02 -4.76989105e+02
-4.23713501e+02 -3.71329285e+02 -3.47003693e+02 -3.19309082e+02
-3.29547089e+02 -3.32584625e+02 -2.78399109e+02 -2.43284348e+02
-2.47878128e+02 -2.59308533e+02 -2.71102844e+02 -2.87314514e+02
-2.58869965e+02 -6.01125565e+01 1.66160011e+01 -8.58060551e+00
-8.49179382e+01 -9.29880371e+01 -9.96001358e+01 -1.04499428e+02
-3.65511665e+01 -3.82106819e+01 -8.69802475e+01 -1.22267052e+02
-1.70187592e+02 -2.35996841e+02 -2.96493286e+02 -3.39086365e+02
-3.59514771e+02]
and the shape is (13,33)
Can anyone help me, please?
Thanks in advance
Somewhat similarly to the issue raised in this question, the problem lies in the intricacies of the underlying numerical operations that librosa defers to SciPy. SciPy depends on the LAPACK library being installed, so I would first check that it is present.
You may also want to debug the script step by step, stepping into SciPy to examine the actual values that percolate from librosa.feature.delta down to scipy.signal.savgol_filter; cross-checking them against the documentation may tell you the reason.
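Two quick checks along those lines (a sketch; the savgol_filter parameters below are illustrative, not the exact values librosa passes):
import scipy
import scipy.signal

# 1. Inspect which BLAS/LAPACK libraries SciPy was built against
scipy.show_config()

# 2. Call the filter that librosa.feature.delta defers to, directly on the
#    mfcc array from the question, to see where the values blow up
smoothed = scipy.signal.savgol_filter(mfcc, window_length=9, polyorder=1, deriv=1, axis=-1)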
I have been wrestling with a known and documented SVD convergence issue. Having read up on similar issues raised by others, I have double-checked my data and reduced it to a tiny DataFrame of only 10 rows and 2 columns, both float64. There are definitely no NaNs or infinities.
On the first run I pause at the offending line via a breakpoint. The first time I manually execute that line I get a console error (see below), but on subsequent runs it resolves without errors! I am using numpy 1.19.1.
I would greatly appreciate thoughts or ideas on how to resolve this. It is driving me nuts and it's shaken my confidence.
Thanks in advance for any suggestions. I really want to get to the bottom of this.
Luthor
The code:
# Simplifying the df
df = df.head(10)
df = df[['dti','close']]
print(df)
ltt2_poly = np.polyfit(df['dti'] - df['dti'][0], df['close'], 2)
At runtime:
pydev debugger: process 36368 is connecting
Connected to pydev debugger (build 202.6948.78)
Importing local settings
dti close
0 0 11.28
1 3 11.35
2 4 11.10
3 5 10.95
4 6 11.07
5 7 11.45
6 10 11.46
7 11 11.46
8 12 11.74
9 13 11.96
ltt2_poly = np.polyfit(df['dti'] - df['dti'][0], df['close'], 2)
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm 2020.1.1\plugins\python\helpers\pydev\_pydevd_bundle\pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "<__array_function__ internals>", line 5, in polyfit
File "C:\Users\luthor\PycharmProjects\MC\venv\lib\site-packages\numpy\lib\polynomial.py", line 629, in polyfit
c, resids, rank, s = lstsq(lhs, rhs, rcond)
File "<__array_function__ internals>", line 5, in lstsq
File "C:\Users\luthor\PycharmProjects\MC\venv\lib\site-packages\numpy\linalg\linalg.py", line 2306, in lstsq
x, resids, rank, s = gufunc(a, b, rcond, signature=signature, extobj=extobj)
File "C:\Users\luthor\PycharmProjects\MC\venv\lib\site-packages\numpy\linalg\linalg.py", line 100, in _raise_linalgerror_lstsq
raise LinAlgError("SVD did not converge in Linear Least Squares")
numpy.linalg.LinAlgError: SVD did not converge in Linear Least Squares
In the SAME debug session, the same line now works:
ltt2_poly = np.polyfit(df['dti'] - df['dti'][0], df['close'], 2)
print(ltt2_poly)
[ 1.00902938e-02 -8.70161869e-02  1.13247743e+01]
print(np.__version__)
1.19.1
To add insult to injury, when I reduce the df to between 5 and 9 rows, it works without failing. What am I missing?
I don't have a solution to that, but I can tell you that you're not alone. I have the same bug.
I "fixed" it by simply wrapping the NumPy function in a while-try statement.
while True:
try:
NumPy-function
break
except:
continue
Try using a conda environment with numpy 1.18. That version seems to predate these kinds of errors, and it worked for me.
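Something like the following should set up such an environment (the environment name is arbitrary):
conda create -n np118 python=3.8 numpy=1.18
conda activate np118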
In my case, the following code may lead to exactly this non-convergence error:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(trainingDf, ySer)
This happens on rare occasions, even when the data contains no NaN or infinite data points. However, as soon as I removed the normalization, the code ran fine:
pipe = make_pipeline(StandardScaler(with_std=False), LinearRegression())
pipe.fit(trainingDf, ySer)
I saw similar behavior in the past, before pipelines, with:
regr = LinearRegression(normalize=True)
regr.fit(XDf.to_numpy(),ySer.to_numpy())
and:
regr = LinearRegression(normalize=False)
regr.fit(XDf.to_numpy(),ySer.to_numpy())
Can someone point me to the docs that will explain what I'm seeing?
Pink stuff in a Jupyter notebook makes me think something is wrong.
Using PyMC3 (btw, it's an exercise for a class and I have no idea what I'm doing).
I plugged in the numbers and initially got an error about zeros on the diagonal; after swapping alpha_est and rate_est for 1/alpha_est and 1/rate_est the error stopped, but I still get the pink stuff.
This code came with the exercise:
# An initial guess for the gamma distribution's alpha and beta
# parameters can be made as described here:
# https://wiki.analytica.com/index.php?title=Gamma_distribution
alpha_est = np.mean(no_insurance)**2 / np.var(no_insurance)
beta_est = np.var(no_insurance) / np.mean(no_insurance)
# PyMC3 Gamma seems to use rate = 1/beta
rate_est = 1/beta_est
# Initial parameter estimates we'll use below
alpha_est, rate_est
And then the code I'm supposed to add:
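(The code itself didn't make it into the question; judging from the traceback below, it was along these lines. This is a reconstruction only, not the asker's exact code, using the 1/estimate values mentioned above; pm.Exponential is parameterized by the rate lam, so lam = 1/estimate gives a prior mean equal to the estimate.)
with pm.Model() as model:
    # Exponential priors for the Gamma parameters
    alpha_ = pm.Exponential('alpha_', lam=1/alpha_est)
    rate_ = pm.Exponential('rate_', lam=1/rate_est)
    g = pm.Gamma('g', alpha=alpha_, beta=rate_, observed=no_insurance)
    trace = pm.sample(10000)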
Should the pink stuff make me nervous or do I just say "No errors, move on"?
=======
The "zero problem"
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 110, in run
self._start_loop()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 160, in _start_loop
point, stats = self._compute_point()
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 191, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 130, in astep
self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
File "/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 231, in raise_ok
raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `alpha__log__`.ravel()[0] is zero.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-14-36f8e5cebbe5> in <module>
13 g = pm.Gamma('g', alpha=alpha_, beta=rate_, observed=no_insurance)
14
---> 15 trace = pm.sample(10000)
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, **kwargs)
435 _print_step_hierarchy(step)
436 try:
--> 437 trace = _mp_sample(**sample_args)
438 except pickle.PickleError:
439 _log.warning("Could not pickle model, sampling singlethreaded.")
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, **kwargs)
967 try:
968 with sampler:
--> 969 for draw in sampler:
970 trace = traces[draw.chain - chain]
971 if (trace.supports_sampler_stats
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in __iter__(self)
391
392 while self._active:
--> 393 draw = ProcessAdapter.recv_draw(self._active)
394 proc, is_last, draw, tuning, stats, warns = draw
395 if self._progress is not None:
/Local/Users/vlb/anaconda3/lib/python3.7/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
295 else:
296 error = RuntimeError("Chain %s failed." % proc.chain)
--> 297 raise error from old_error
298 elif msg[0] == "writing_done":
299 proc._readable = True
RuntimeError: Chain 0 failed.
is the "hint" in the instructions here telling me I should use 1/rate_est?
You are now going to create your own PyMC3 model!
Use an exponential prior for alpha. Call this stochastic variable alpha_.
Similarly, use an exponential prior for the rate ( 1/𝛽 ) parameter in PyMC3's Gamma.
Call this stochastic variable rate_ (but it will be supplied as pm.Gamma's beta parameter). Hint: to set up a prior with an exponential distribution for 𝑥 where you have an initial estimate for 𝑥 of 𝑥0 , use a scale parameter of 1/𝑥0 .
Create your Gamma distribution with your alpha_ and rate_ stochastic variables and the observed data.
Perform 10000 draws.
The zero problem could be because you are sampling zeros from the exponential distribution.
Ah:
rate_est is 0.00021265346963636103
rate_ci = np.percentile(trace['rate_'], [2.5, 97.5])
rate_ci = [0.00022031, 0.00028109]
1/rate_est is 4702.486170152818
I can believe I am sampling zeros if I use rate_est.
I have doubts about your 1/alpha step. See this discussion: https://discourse.pymc.io/t/help-with-fitting-gamma-distribution/2630
You could look here: https://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html cell[6]
I think you are okay with the sampler output. You can check your distributions by using traceplot.
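For example (assuming the trace from pm.sample above):
pm.traceplot(trace)  # draws the sampled posterior distributions for alpha_ and rate_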
I am trying to use the optuna library in Python to optimize parameters for recommender-system models. The models are custom and look like standard fit-predict sklearn models (with get/set params methods).
What I do: a simple objective function that samples two parameters from a uniform distribution, sets them on the model, runs predict (there is no fit stage inside the objective, since this simple model uses the parameters only at predict time) and computes a metric.
What I get: the first trial runs normally; it samples params and prints results to the log. But on the second and subsequent trials I get strange errors (see the traceback below) that I can't solve or google. When I run the study with just one trial, everything is okay.
What I tried: rearranging parts of the objective function, putting a fit stage inside, computing simpler metrics - nothing helps.
Here is my objective function:
# getting train, test
# fitting model
self.model = SomeRecommender()
self.model.fit(train, some_other_params)

def objective(trial: optuna.Trial):
    # save study
    if path is not None:
        joblib.dump(study, some_path)
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # setting params on the model
    params = {'alpha': alpha,
              'beta': beta}
    self.model.set_params(**params)
    # getting predictions
    recs = self.model.predict(some_other_params)
    # metric computation
    metric_result = Metrics.hit_rate_at_k(recs, test, k=k)
    return metric_result

# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
That's what I get on three trials:
[I 2019-10-01 12:53:59,019] Finished trial#0 resulted in value: 0.1. Current best value is 0.1 with parameters: {'alpha': 59.6135986324444, 'beta': 40.714559720597585}.
[W 2019-10-01 13:39:58,140] Setting status of trial#1 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
[W 2019-10-01 13:39:58,206] Setting status of trial#2 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
I can't understand where the problem is or why only the first trial works. Please help.
Thank you!
Your code seems to have no problems.
I ran a simplified version of your code (see below), and it worked well in my environment:
import optuna

def objective(trial: optuna.Trial):
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # evaluating params
    return alpha + beta

# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
Could you tell me about your environment in order to investigate the problem? (e.g., OS, Python version, Python interpreter (CPython, PyPy, IronPython or Jython), Optuna version)
As for "why the first trial is working": the error is raised by optuna/samplers/tpe/sampler.py#558, and that line is only executed when the study already contains at least one completed trial, so the first trial never reaches it.
BTW, you might be able to avoid this problem by using RandomSampler as follows:
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(direction='maximize', sampler=sampler)
Note that the optimization performance of RandomSampler tends to be worse than that of TPESampler, Optuna's default sampler.
My regression model using statsmodels in Python works with 48,065 lines of data, but after adding new data I have tracked the failure down to one row of input that produces a singular matrix error. Answers to similar questions suggest missing data, but I have checked and there is nothing visibly irregular in the offending row. Does anyone know whether this is an error in my code, or a way to fix it? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error with Problematic line within the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal, because all observations where it is 1 also have the dependent variable is_success equal to 1. That means there is no variation in the outcome: is_own_goal already implies success.
As a consequence, we cannot estimate a coefficient for is_own_goal; the coefficient is not identified by the data. Its variance would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating-point precision and some computational noise, the Hessian might sometimes be invertible anyway, in which case the singular-matrix exception does not show up. That, I guess, is why it works with some but not all observations.
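A quick way to confirm this kind of perfect prediction (a sketch, using the Data2.csv from the question):
import pandas as pd

data = pd.read_csv("Data2.csv")
# if every row with is_own_goal == 1 also has is_success == 1, the crosstab
# row for is_own_goal=1 will show a zero count in the is_success=0 column
print(pd.crosstab(data['is_own_goal'], data['is_success']))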
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data; it would just be an artifact of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic-net penalization for GLM, which has Binomial (i.e. Logit) as a family.
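For reference, a sketch of that elastic-net route (the alpha and L1_wt values here are illustrative, not tuned):
import statsmodels.api as sm
import statsmodels.formula.api as smf

# GLM with a Binomial family is equivalent to Logit; fit_regularized with
# method='elastic_net' applies the penalty (statsmodels >= 0.8)
glm_model = smf.glm(formula, data=data, family=sm.families.Binomial())
glm_results = glm_model.fit_regularized(method='elastic_net', alpha=0.1, L1_wt=0.5)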