I am trying to use an OLS regression to predict missing (NaN) values of ustar from known data on wind speed (WS), the variation of WS by month, and radiation (Rn). All variables in the formula have some missing data at some point in the dataframe, but the regression gave strong correlations for all variables and an R-squared value of 0.80, so I know this gap-filling method using predicted regression values is feasible. Here is my code:
import pandas as pd
import statsmodels.api as sm

regression_data = pd.DataFrame()
regression_data['ustar'] = data['ustar']
regression_data['WS'] = data['WS']
regression_data['Rn'] = data['Rn']
regression_data['month'] = data.index.month
formula = "ustar ~ WS + (WS:C(month)) + (WS:Rn) + 1"
regression_model = sm.OLS.from_formula(formula, regression_data)
results = regression_model.fit()
predicted_values = results.predict(regression_data)
Traceback (most recent call last):
File "<ipython-input-61-073df0b2ae63>", line 1, in <module>
predicted_values = results.predict(regression_data)
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/statsmodels/base/model.py", line 739, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
I understand that there have been similar issues with the same error in the past, but I do not know whether the complexity of my formula is being handled well inside the predict() code. I was wondering if anyone has a perspective on how I should approach this problem.
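For reference, the gap-filling step I am aiming for once predict() works is roughly the following (a minimal sketch; it assumes the fitted results object from above):

mask = regression_data['ustar'].isnull()
# predict only the rows where ustar is missing, then write them back
regression_data.loc[mask, 'ustar'] = results.predict(regression_data[mask])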
I am trying to use the optuna library in Python to optimize parameters for recommender system models. The models are custom and look like standard sklearn fit-predict models (with get/set params methods).
What I do: a simple objective function that samples two parameters from a uniform distribution, sets them on the model, runs the model's predict (there is no fit stage, as this simple model uses the params only at the predict stage), and calculates some metric.
What I get: the first trial runs normally; it samples params and prints results to the log. But on the second and subsequent trials I get strange errors (see the code below) that I can't solve or find by googling. When I run the study with just 1 trial, everything is fine.
What I tried: rearranging parts of the objective function, putting a fit stage inside it, and computing simpler metrics; nothing helps.
Here is my objective function:
# getting train, test
# fitting model
self.model = SomeRecommender()
self.model.fit(train, some_other_params)

def objective(trial: optuna.Trial):
    # save study
    if path is not None:
        joblib.dump(study, some_path)

    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)

    # setting params to model
    params = {'alpha': alpha,
              'beta': beta}
    self.model.set_params(**params)

    # getting predict
    recs = self.model.predict(some_other_params)

    # metric computing
    metric_result = Metrics.hit_rate_at_k(recs, test, k=k)
    return metric_result

# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
That's what I get on three trials:
[I 2019-10-01 12:53:59,019] Finished trial#0 resulted in value: 0.1. Current best value is 0.1 with parameters: {'alpha': 59.6135986324444, 'beta': 40.714559720597585}.
[W 2019-10-01 13:39:58,140] Setting status of trial#1 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
[W 2019-10-01 13:39:58,206] Setting status of trial#2 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
I can't understand where the problem is or why the first trial works. Please help.
Thank you!
Your code seems to have no problems.
I ran a simplified version of your code (see below), and it worked well in my environment:
import optuna

def objective(trial: optuna.Trial):
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)

    # evaluating params
    return alpha + beta

# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
Could you tell me about your environment in order to investigate the problem? (e.g., OS, Python version, Python interpreter (CPython, PyPy, IronPython or Jython), Optuna version)
As for "why the first trial is working":
This error is raised by optuna/samplers/tpe/sampler.py#558, and this line is only executed when the number of completed trials in the study is greater than zero.
BTW, you might be able to avoid this problem by using RandomSampler as follows:
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(direction='maximize', sampler=sampler)
Note that the optimization performance of RandomSampler tends to be worse than that of TPESampler, which is Optuna's default sampler.
My regression model using statsmodels in Python works with 48,065 lines of data, but while adding new data I have tracked down one line of the CSV that produces a singular matrix error. Answers to similar questions suggest missing data, but I have checked the error-prone row and there is nothing visibly irregular about it. Does anyone know whether this is an error in my code, or a way to fix it? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error produced when the problematic line is included in the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal, because every observation where it is 1 also has the dependent variable is_success equal to 1. That means there is no variation in the outcome; is_own_goal already implies a success.
As a consequence, we cannot estimate a coefficient for is_own_goal: the coefficient is not identified by the data. Its variance would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating point precision, with some computational noise the Hessian might still be invertible and the singular matrix exception would not show up, which, I guess, is why it works with some but not all observations.
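One quick way to check for this kind of perfect prediction is to cross-tabulate the suspect regressor against the outcome. A minimal sketch, assuming the data are loaded as in the question:

import pandas as pd
data = pd.read_csv("Data2.csv")
# if every row with is_own_goal == 1 also has is_success == 1, one of
# the cells in the is_own_goal == 1 row of this table will be zero
print(pd.crosstab(data['is_own_goal'], data['is_success']))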
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and would just be a consequence of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM, which has Binomial (i.e., Logit) as a family.
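For instance, a hedged sketch of the elastic net variant, reusing formula and data from the question (the alpha and L1_wt values are illustrative, not tuned):

import statsmodels.api as sm
import statsmodels.formula.api as smf

# alpha is the overall penalty weight; L1_wt blends between
# ridge (0) and lasso (1)
glm_results = smf.glm(formula, data=data, family=sm.families.Binomial(),
                      missing='drop').fit_regularized(method='elastic_net',
                                                      alpha=0.1, L1_wt=0.5)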
I want to fit an ARMA(p,q) model to simulated data, y, and check the effect of different estimation methods on the results. However, fitting a model to the same object like so
model = tsa.ARMA(y, (1, 1))
result_mle = model.fit(trend='c', method='mle', disp=False)
result_css = model.fit(trend='c', method='css', disp=False)
and printing the results
print result_mle.summary()
print result_css.summary()
generates the following error
File "C:\Anaconda\lib\site-packages\statsmodels\tsa\arima_model.py", line 1572, in summary
smry.add_table_params(self, alpha=alpha, use_t=False)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 885, in add_table_params
use_t=use_t)
File "C:\Anaconda\lib\site-packages\statsmodels\iolib\summary.py", line 475, in summary_params
exog_idx]
IndexError: index 3 is out of bounds for axis 0 with size 3
If, instead, I do this
model1 = tsa.ARMA(y, (1, 1))
model2 = tsa.ARMA(y, (1, 1))
result_mle = model1.fit(trend='c', method='css-mle', disp=False)
print result_mle.summary()
result_css = model2.fit(trend='c', method='css', disp=False)
print result_css.summary()
no error occurs. Is that how it is supposed to be, or is this a bug that should be fixed?
BTW, I generated the ARMA process as follows:
from __future__ import division
import statsmodels.tsa.api as tsa
import numpy as np

# generate arma
a = -0.7
b = -0.7
c = 2
s = 10
y1 = np.random.normal(c/(1-a), s*(1+(a+b)**2/(1-a**2)))
e = np.random.normal(0, s, (100,))
y = [y1]
for t in xrange(e.size-1):
    arma = c + a*y[-1] + e[t+1] + b*e[t]
    y.append(arma)
y = np.array(y)
You could report this as a bug, even though it looks like a consequence of the current design.
Some attributes of the model change when the estimation method is changed, which should in general be avoided. Since both results instances access the same model, the older one is inconsistent with it in this case.
http://www.statsmodels.org/dev/pitfalls.html#repeated-calls-to-fit-with-different-parameters
In general, statsmodels tries to keep all parameters that need to change the model in the model.__init__ and not as arguments in fit, and attach the outcome of fit and results to the Results instance.
However, this is not followed everywhere, especially not in older models that gained new options along the way.
trend is an example of an option that is supposed to go into ARMA.__init__, because it is now handled together with exog (which makes it an ARMAX model), but wasn't in pure ARMA. The estimation method belongs in fit and should not cause problems like these.
Aside: There is a helper function to simulate an ARMA process that uses scipy.signal.lfilter and should be much faster than an iteration loop in Python.
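For example, a minimal sketch using statsmodels' arma_generate_sample (note it simulates a zero-mean process, so the constant c has to be added as the unconditional mean afterwards):

from statsmodels.tsa.arima_process import arma_generate_sample

# y_t = a*y_{t-1} + e_t + b*e_{t-1} in lag-polynomial form:
# ar = [1, -a], ma = [1, b]; the fourth argument is the noise std dev
a, b, c, s = -0.7, -0.7, 2, 10
y = arma_generate_sample([1, -a], [1, b], 100, s)
y = y + c / (1 - a)  # shift by the unconditional mean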
I use PyMC to implement a multinomial-Dirichlet pair. I want to compute the MAP of the model for all the instances that we have.
The issue I face is that once MAP.fit() is called, the prior distribution is changed. Thus, for every new instance, I need a new prior distribution, which should be fine. However, I keep seeing this error:
Traceback (most recent call last):
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/DiricheletMultinomialStarRating.py", line 41, in <module>
prediction = predict.predict(input,prior)
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/predict.py", line 12, in predict
likelihood = pm.Categorical('rating',prior,value = exp_data,observed = True)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/distributions.py", line 3170, in __init__
verbose=verbose, **kwds)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 772, in __init__
if not isinstance(self.logp, float):
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 929, in get_logp
raise ZeroProbability(self.errmsg)
pymc.Node.ZeroProbability: Stochastic rating's value is outside its support,
or it forbids its parents' current values.
Here is the code:
alpha = np.array([0.1, 0.1, 0.1, 0.1, 0.1])
prior = pm.Dirichlet('prior', alpha)
exp_data = np.array(input)
likelihood = pm.Categorical('rating', prior, value=exp_data, observed=True)
MaximumPosterior = inf.inference(prior, likelihood, exp_data)

def inference(prior, likelihood, observation):
    model = Model({'likelihood': likelihood, 'prior': prior})
    M = MAP(model)
    M.fit()
    result = M.prior.value
    result = np.append(result, 1 - np.sum(M.prior.value))
    return result
I think this is a bug in the pymc package. Is there any way to do MAP without changing the prior distribution?
Thanks
The answer in the link below solved my issue:
https://groups.google.com/forum/#!topic/pymc/uYQSGW4acf8
Basically, the Dirichlet distribution can generate probabilities that are close to 0.
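A minimal sketch of one way to get a fresh prior per instance, as the question suggests, so that MAP.fit() never mutates a shared Stochastic (hypothetical helper; PyMC 2.x API):

import numpy as np
import pymc as pm

def map_estimate(observations, alpha):
    # fresh Stochastics per call; MAP.fit() modifies them in place
    prior = pm.Dirichlet('prior', theta=alpha)
    rating = pm.Categorical('rating', prior, value=observations,
                            observed=True)
    M = pm.MAP(pm.Model({'prior': prior, 'rating': rating}))
    M.fit()
    # the PyMC 2 Dirichlet stores only the first k-1 probabilities
    return np.append(prior.value, 1.0 - np.sum(prior.value))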
I am trying to use the predict() function of the statsmodels.formula.api OLS implementation. When I pass a new data frame to the function to get predicted values for an out-of-sample dataset, result.predict(newdf) returns the following error: 'DataFrame' object has no attribute 'design_info'. What does this mean, and how do I fix it? The full traceback is:
p = result.predict(newdf)
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 878, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2088, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
EDIT: Here is a reproducible example. The error appears to occur when I pickle and then unpickle the result object (which I need to do in my actual project):
import cPickle
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
df = pd.DataFrame({"A": [10,20,30,324,2353], "B": [20, 30, 10, 1, 2332], "C": [0, -30, 120, 11, 2]})
result = sm.ols(formula="A ~ B + C", data=df).fit()
print result.summary()
test1 = result.predict(df) #works
f_myfile = open('resultobject', "wb")
cPickle.dump(result, f_myfile, 2)
f_myfile.close()
print("Result Object Saved")
f_myfile = open('resultobject', "rb")
model = cPickle.load(f_myfile)
test2 = model.predict(df) #produces error
Pickling and unpickling of a pandas DataFrame doesn't save and restore attributes that have been attached by a user, as far as I know.
Since the formula information is currently stored together with the DataFrame of the original design matrix, this information is lost after unpickling a Results and Model instance.
If you don't use categorical variables and transformations, then the correct design matrix can be built with patsy.dmatrix. I think the following should work:
import patsy
x = patsy.dmatrix("B + C", data=df)  # df is the data for prediction
test2 = model.predict(x, transform=False)
Alternatively, constructing the design matrix for the prediction directly should also work. Note that we need to explicitly add the constant that the formula includes by default:
from statsmodels.api import add_constant
test2 = model.predict(add_constant(df[["B", "C"]]), transform=False)
If the formula and design matrix contain (stateful) transformations or categorical variables, then it is not possible to conveniently construct the design matrix without the original formula information. Constructing it by hand and doing all the calculations explicitly is difficult in this case, and loses all the advantages of using formulas.
The only real solution is to pickle the formula information design_info independently of the dataframe orig_exog.
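As a stopgap, the formula's right-hand side can be saved next to the results object and the design matrix rebuilt after unpickling. A minimal sketch, assuming no stateful transformations in the formula and continuing the example above:

import cPickle
import patsy

# save the results together with the formula RHS needed to rebuild exog
with open('resultobject', 'wb') as f:
    cPickle.dump({'results': result, 'rhs': 'B + C'}, f, 2)

with open('resultobject', 'rb') as f:
    saved = cPickle.load(f)

x = patsy.dmatrix(saved['rhs'], data=df)
test2 = saved['results'].predict(x, transform=False)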