My regression model using statsmodels in Python works with 48,065 lines of data, but while adding new data I tracked down one line of data that produces a singular matrix error. Answers to similar questions suggest missing data, but I have checked and there is nothing visibly irregular in the error-prone row. Does anyone know whether this is an error in my code, or know a way to fix it? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error with Problematic line within the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal: every observation where it is 1 also has the dependent variable is_success equal to 1. That means there is no variation in the outcome for those rows, because is_own_goal already determines that it is a success.
As a consequence, we cannot estimate a coefficient for is_own_goal; the coefficient is not identified by the data. The variance of the coefficient would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating point precision, with some computational noise the Hessian might be invertible and the Singular Matrix exception would not show up, which, I guess, is the reason that it works with some but not all observations.
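A quick way to look for this kind of perfect prediction before fitting is to cross-tabulate each binary predictor against the outcome. The sketch below is only an illustration: the toy DataFrame and the helper name separating_values are made up here, and only the column names come from the question.

```python
import pandas as pd

# Toy stand-in for the real data (hypothetical values):
# every row with is_own_goal == 1 also has is_success == 1.
data = pd.DataFrame({
    "is_success":  [0, 1, 0, 1, 1, 0],
    "is_own_goal": [0, 0, 0, 1, 1, 0],
})

def separating_values(df, outcome, predictor):
    """Return the predictor values for which the outcome never varies."""
    n_outcomes = df.groupby(predictor)[outcome].nunique()
    return n_outcomes[n_outcomes == 1].index.tolist()

# A zero cell in this table is the warning sign.
print(pd.crosstab(data["is_own_goal"], data["is_success"]))
print(separating_values(data, "is_success", "is_own_goal"))
```

If a predictor value shows up in the result, the coefficient of that predictor is likely not identified, as described above.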
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and be just a consequence of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM which has Binomial (i.e. Logit) as a family.
I am trying to use a GPT2 architecture for musical applications and consequently need to train it from scratch. After a bit of googling I found that issue #1714 on huggingface's GitHub had already "solved" the question. When I try to run the proposed solution:
import numpy as np
from transformers import GPT2Config, GPT2Model
NUMLAYER = 4
NUMHEAD = 4
SIZEREDUCTION = 10 #the factor by which we reduce the size of the velocity argument.
VELSIZE = int(np.floor(127/SIZEREDUCTION)) + 1
SEQLEN=40 #size of data sequences.
EMBEDSIZE = 5
config = GPT2Config(vocab_size = VELSIZE, n_positions = SEQLEN, n_embd = EMBEDSIZE, n_layer = NUMLAYER, n_ctx = SEQLEN, n_head = NUMHEAD)
model = GPT2Model(config)
I get the following error :
Traceback (most recent call last):
File "<ipython-input-7-b043a7a2425f>", line 1, in <module>
runfile('C:/Users/cnelias/Desktop/PHD/Swing project/code/script/GPT2.py', wdir='C:/Users/cnelias/Desktop/PHD/Swing project/code/script')
File "C:\Users\cnelias\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/cnelias/Desktop/PHD/Swing project/code/script/GPT2.py", line 191, in <module>
model = GPT2Model(config)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 355, in __init__
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 355, in <listcomp>
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 223, in __init__
self.attn = Attention(nx, n_ctx, config, scale)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 109, in __init__
assert n_state % config.n_head == 0
What does it mean and how can I solve it?
Also, more generally, is there documentation on how to do a forward call with GPT2? Can I define my own train() function or do I have to use the model's built-in function? Am I forced to use a Dataset for training, or can I feed it individual tensors?
I looked for it but couldn't find answers to these in the docs, but maybe I missed something.
PS: I already read the blog post from huggingface.co, but it omits too much information and detail to be useful for my application.
I think the error message is pretty clear:
assert n_state % config.n_head == 0
Tracing it back through the code, we can see
n_state = nx # in Attention: n_state=768
which indicates that n_state represents the embedding dimension (which is generally 768 by default in BERT-like models). When we then look at the GPT-2 documentation, it seems the parameter specifying this is n_embd, which you are setting to 5. As the error indicates, the embedding dimension has to be evenly divisible by the number of attention heads, which you set to 4. So choosing an embedding dimension that is a multiple of 4 should solve the problem. Of course, you can also change the number of heads instead, but embedding dimensions that are not a multiple of the number of heads are not supported.
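The divisibility rule can be checked without building a model at all. This is a minimal sketch in plain Python; the helper names head_dim and next_valid_n_embd are made up for illustration and are not part of transformers.

```python
def head_dim(n_embd, n_head):
    """Per-head size, mirroring the check `n_state % config.n_head == 0`."""
    if n_embd % n_head != 0:
        raise ValueError(
            "n_embd=%d is not divisible by n_head=%d" % (n_embd, n_head)
        )
    return n_embd // n_head

def next_valid_n_embd(n_embd, n_head):
    """Smallest multiple of n_head that is >= n_embd (ceiling division)."""
    return -(-n_embd // n_head) * n_head

# Values from the question: EMBEDSIZE = 5, NUMHEAD = 4.
n_embd = next_valid_n_embd(5, 4)
print(n_embd, head_dim(n_embd, 4))
```

With the question's settings this suggests n_embd = 8, which gives each of the 4 heads a dimension of 2. Whether such a small embedding is useful for the task is a separate question.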
I am trying to use the optuna library in Python to optimise parameters for recommender system models. Those models are custom and look like standard fit-predict sklearn models (with get/set params methods).
What I do: a simple objective function that samples two parameters from a uniform distribution, sets these params on the model, runs predict (there is no fit stage, as it is a simple model that uses the params only in the predict stage) and calculates some metric.
What I get: the first trial runs normally; it samples params and prints results to the log. But on the second and subsequent trials I get some strange errors (see the log below) that I can't solve or google. When I run the study with just 1 trial, everything is okay.
What I tried: rearranging parts of the objective function, putting the fit stage inside it, calculating simpler metrics. Nothing helps.
Here is my objective function:
# getting train, test
# fitting model
self.model = SomeRecommender()
self.model.fit(train, some_other_params)
def objective(trial: optuna.Trial):
    # save study
    if path is not None:
        joblib.dump(study, some_path)
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # setting params to model
    params = {'alpha': alpha,
              'beta': beta}
    self.model.set_params(**params)
    # getting predict
    recs = self.model.predict(some_other_params)
    # metric computing
    metric_result = Metrics.hit_rate_at_k(recs, test, k=k)
    return metric_result
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
That's what I get on three trials:
[I 2019-10-01 12:53:59,019] Finished trial#0 resulted in value: 0.1. Current best value is 0.1 with parameters: {'alpha': 59.6135986324444, 'beta': 40.714559720597585}.
[W 2019-10-01 13:39:58,140] Setting status of trial#1 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
[W 2019-10-01 13:39:58,206] Setting status of trial#2 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
I can't understand where the problem is or why the first trial works. Please help.
Thank you!
Your code seems to have no problems.
I ran a simplified version of your code (see below), and it worked well in my environment:
import optuna
def objective(trial: optuna.Trial):
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # evaluating params
    return alpha + beta
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
Could you tell me about your environment in order to investigate the problem? (e.g., OS, Python version, Python interpreter (CPython, PyPy, IronPython or Jython), Optuna version)
Regarding "why the first trial is working": this error is raised at optuna/samplers/tpe/sampler.py#558, and that line is only executed when the number of completed trials in the study is greater than zero.
BTW, you might be able to avoid this problem by using RandomSampler as follows:
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(direction='maximize', sampler=sampler)
Note that the optimization performance of RandomSampler tends to be worse than that of TPESampler, which is the default sampler of Optuna.
I am trying to use an OLS regression to predict missing (NaN) values of ustar from known data: wind speed (WS), the variation of WS by month, and radiation (Rn). All variables in the formula have some missing data at some point in the dataframe, but the regression showed strong correlations with all the variables in the formula and an R-squared value of 0.80, so I know this gap-filling method using predicted regression data is feasible. Here is my code:
import pandas as pd
import statsmodels.api as sm
regression_data = pd.DataFrame([])
regression_data['ustar'] = data['ustar']
regression_data['WS'] = data['WS']
regression_data['Rn'] = data['Rn']
regression_data['month'] = data.index.month
formula = "ustar ~ WS + (WS:C(month)) + (WS:Rn) + 1"
regression_model = sm.regression.linear_model.OLS.from_formula(formula,regression_data)
results = regression_model.fit()
predicted_values = results.predict(regression_data)
Traceback (most recent call last):
File "<ipython-input-61-073df0b2ae63>", line 1, in <module>
predicted_values = results.predict(regression_data)
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/statsmodels/base/model.py", line 739, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
I understand that there have been similar past issues with the same error, but I do not know whether the complexity of my formula is what the predict method is failing to handle. I was wondering if anyone has a perspective on how I should approach this problem.
When you write a program with lots of code, it's difficult to find out which values have a big influence on your final result.
In my case I have a few differential equations which I solve with odeint.
It would take a lot of time to find out which values have a big influence on my result (velocity).
Is there any tool in Python to analyze your values, or does someone have an idea?
Thanks for your help.
[Edit]
MathBio: "In general a sensitivity analysis is what you would do. "
@MathBio I have now read a few blog posts about SALib (SALib Guide) and tried to write a simpler test program that solves differential equations.
With the program below, I get this error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\Python27\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/Tim_s/Desktop/Workspace/test/TrySALib.py", line 44, in <module>
Y = Odefunk(param_values)
File "C:/Users/Tim_s/Desktop/Workspace/test/TrySALib.py", line 24, in Odefunk
dT=odeint(dTdt,T0,t,args=(P,))
File "C:\Python27\lib\site-packages\scipy\integrate\odepack.py", line 148, in odeint
ixpr, mxstep, mxhnil, mxordn, mxords)
File "C:/Users/Tim_s/Desktop/Workspace/test/TrySALib.py", line 17, in dTdt
dT[0]=P[0]*(T[1]-T[0])+P[2]
IndexError: tuple index out of range
Here is the code:
from SALib.sample import saltelli
from SALib.analyze import sobol
import numpy as np
from pylab import *
from scipy.integrate import odeint
Tu=20.
t=linspace(0,180,90)
def dTdt(T,t,P):# ODE right-hand side
    dT=zeros(2)
    dT[0]=P[0]*(T[1]-T[0])+P[2]
    dT[1]=P[1]*(Tu-T[1])+P[0]*(T[0]-T[1])
    return dT
T0=[Tu,Tu]
def Odefunk(values):
    for P in enumerate(values):
        dT=odeint(dTdt,T0,t,args=(P,))
    return dT
# Define the model inputs
problem = {
'num_vars': 3,
'names': ['P0', 'P1', 'P2'],
'bounds': [[ 0.1, 0.2],
[ 0.01, 0.02],
[ 0.5, 1]]
}
# Generate samples
param_values = saltelli.sample(problem, 1000, calc_second_order=True)
# Run model (example)
Y = Odefunk(param_values)
# Perform analysis
Si = sobol.analyze(problem, Y, print_to_console=False)
# Print the first-order sensitivity indices
print(Si['S1'])
You really should include your odes, so we can see the parameters and the initial condition. Code would be nice also.
In general, a sensitivity analysis is what you would do. Performing a nondimensionalization is also standard. Look up these techniques and try to implement them to see how varying a parameter by a small amount will affect your solution.
I suggest you look up these concepts, and include your code. You should try the first steps yourself, and then I'd be happy to help with anything technical once you've clearly made an effort. Best wishes.
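On the IndexError in the edited question: the traceback is consistent with `for P in enumerate(values)`, because enumerate yields (index, row) pairs, so P is a 2-tuple and P[2] is out of range. The sketch below iterates over the rows directly and returns one scalar per parameter set. Two things in it are assumptions, not part of the original program: the name run_model, and the choice of the final value of T[0] as the model output. The two parameter rows are hand-picked from the bounds in the question instead of being sampled with SALib.

```python
import numpy as np
from scipy.integrate import odeint

Tu = 20.0
t = np.linspace(0, 180, 90)
T0 = [Tu, Tu]

def dTdt(T, t, P):
    # Same right-hand side as in the question.
    return [P[0] * (T[1] - T[0]) + P[2],
            P[1] * (Tu - T[1]) + P[0] * (T[0] - T[1])]

def run_model(values):
    """Solve the ODE once per parameter row; return one scalar per row."""
    Y = np.empty(len(values))
    for i, P in enumerate(values):  # unpack the index and the row
        sol = odeint(dTdt, T0, t, args=(P,))
        Y[i] = sol[-1, 0]  # assumed output of interest: final T[0]
    return Y

# Two hand-picked parameter sets within the bounds from the question.
param_values = np.array([[0.1, 0.01, 0.5],
                         [0.2, 0.02, 1.0]])
Y = run_model(param_values)
print(Y)
```

With the full sample from saltelli.sample, the resulting one-dimensional Y (one value per row of param_values) is the shape that sobol.analyze(problem, Y) expects.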
I use PyMC to implement a multinomial-Dirichlet pair. I want to compute the MAP estimate of the model for every instance that we have.
The issue I face is that once MAP.fit() has run, the prior distribution is changed. Thus, for every new instance, I need a new prior distribution, which should be fine. However, I keep seeing this error:
Traceback (most recent call last):
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/DiricheletMultinomialStarRating.py", line 41, in <module>
prediction = predict.predict(input,prior)
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/predict.py", line 12, in predict
likelihood = pm.Categorical('rating',prior,value = exp_data,observed = True)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/distributions.py", line 3170, in __init__
verbose=verbose, **kwds)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 772, in __init__
if not isinstance(self.logp, float):
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 929, in get_logp
raise ZeroProbability(self.errmsg)
pymc.Node.ZeroProbability: Stochastic rating's value is outside its support,
or it forbids its parents' current values.
Here is the code:
alpha= np.array([0.1,0.1,0.1,0.1,0.1])
prior = pm.Dirichlet('prior',alpha)
exp_data = np.array(input)
likelihood = pm.Categorical('rating',prior,value = exp_data,observed = True)
MaximumPosterior = inf.inference(prior, likelihood, exp_data)
def inference(prior,likelihood,observation):
    model = Model({'likelihood':likelihood,'prior':prior})
    M = MAP(model)
    M.fit()
    result = M.prior.value
    result = np.append(result,1- np.sum(M.prior.value))
    return result
I think it is a bug in the pymc package. Is there any way to do MAP without changing the prior distribution?
Thanks
The answer in the link below solved my issue:
https://groups.google.com/forum/#!topic/pymc/uYQSGW4acf8
Basically, the Dirichlet distribution generates probabilities that are close to 0.
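To see how small those probabilities get, here is a rough sketch that uses NumPy's Dirichlet sampler rather than PyMC itself. The alpha of 0.1 per component comes from the question; the comparison value of 1.0, the seed, and the helper name median_smallest_component are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def median_smallest_component(alpha, n_draws=1000):
    """Median, over many draws, of the smallest probability in each draw."""
    draws = rng.dirichlet(alpha, size=n_draws)
    return float(np.median(draws.min(axis=1)))

# alpha = 0.1 per component, as in the question.
sparse = median_smallest_component(np.full(5, 0.1))
# alpha = 1.0 per component, for comparison (an assumption).
flat = median_smallest_component(np.full(5, 1.0))
print(sparse, flat)
```

With alpha well below 1, most draws put almost all of the mass on one or two categories, so an observed category can easily end up with a probability that is numerically zero. That would be consistent with the ZeroProbability error above.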