PYMC MAP Fit problems - python

I use PyMC to implement a multinomial-Dirichlet pair. I want to compute the MAP estimate of the model for every instance that we have.
The issue I face is that once MAP.fit() has run, the prior distribution is changed. Thus, for every new instance I need a new prior distribution, which should be fine. However, I keep seeing this error:
Traceback (most recent call last):
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/DiricheletMultinomialStarRating.py", line 41, in <module>
prediction = predict.predict(input,prior)
File "/Users/xingweiy/Project/StarRating/TimePlot/BayesianPrediction/predict.py", line 12, in predict
likelihood = pm.Categorical('rating',prior,value = exp_data,observed = True)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/distributions.py", line 3170, in __init__
verbose=verbose, **kwds)
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 772, in __init__
if not isinstance(self.logp, float):
File "/Library/Python/2.7/site-packages/pymc-2.3.4-py2.7-macosx-10.9-intel.egg/pymc/PyMCObjects.py", line 929, in get_logp
raise ZeroProbability(self.errmsg)
pymc.Node.ZeroProbability: Stochastic rating's value is outside its support,
or it forbids its parents' current values.
Here is the code:
alpha = np.array([0.1, 0.1, 0.1, 0.1, 0.1])
prior = pm.Dirichlet('prior', alpha)
exp_data = np.array(input)
likelihood = pm.Categorical('rating', prior, value=exp_data, observed=True)
MaximumPosterior = inf.inference(prior, likelihood, exp_data)
def inference(prior, likelihood, observation):
    model = Model({'likelihood': likelihood, 'prior': prior})
    M = MAP(model)
    M.fit()
    result = M.prior.value
    result = np.append(result, 1 - np.sum(M.prior.value))
    return result
I think it is a bug in the pymc package. Is there any way to compute the MAP estimate without changing the prior distribution?
Thanks
The answer in the link below solved my issue:
https://groups.google.com/forum/#!topic/pymc/uYQSGW4acf8

Basically, the Dirichlet distribution generates probabilities that are close to 0.
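To illustrate the point about near-zero probabilities, here is a minimal sketch that uses only the standard library rather than PyMC. The helper names (sample_dirichlet, fraction_with_tiny_component), the 1e-6 threshold and the comparison against a flat alpha of 1.0 are assumptions made for the illustration, not part of the original code. It draws Dirichlet samples by normalizing Gamma draws and counts how often at least one component is tiny:

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def fraction_with_tiny_component(alpha, n_draws=2000, threshold=1e-6, seed=0):
    """Fraction of draws in which at least one probability falls below threshold."""
    rng = random.Random(seed)
    hits = sum(min(sample_dirichlet(alpha, rng)) < threshold for _ in range(n_draws))
    return hits / n_draws

# alpha = 0.1 is the concentration used in the question; 1.0 is a flat prior for contrast.
sparse = fraction_with_tiny_component([0.1] * 5)
flat = fraction_with_tiny_component([1.0] * 5)
print("alpha=0.1:", sparse, "alpha=1.0:", flat)
```

With alpha = 0.1 most draws put almost no mass on at least one category, so an observed rating in such a category gets a probability that is practically zero. That is presumably what makes the Categorical node complain; a larger alpha should make this much less likely.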


Train huggingface's GPT2 from scratch : assert n_state % config.n_head == 0 error

I am trying to use a GPT2 architecture for musical applications and consequently need to train it from scratch. After a bit of googling I found that issue #1714 on huggingface's GitHub had already "solved" the question. When I try to run the proposed solution:
import numpy as np
from transformers import GPT2Config, GPT2Model
NUMLAYER = 4
NUMHEAD = 4
SIZEREDUCTION = 10 #the factor by which we reduce the size of the velocity argument.
VELSIZE = int(np.floor(127/SIZEREDUCTION)) + 1
SEQLEN=40 #size of data sequences.
EMBEDSIZE = 5
config = GPT2Config(vocab_size = VELSIZE, n_positions = SEQLEN, n_embd = EMBEDSIZE, n_layer = NUMLAYER, n_ctx = SEQLEN, n_head = NUMHEAD)
model = GPT2Model(config)
I get the following error :
Traceback (most recent call last):
File "<ipython-input-7-b043a7a2425f>", line 1, in <module>
runfile('C:/Users/cnelias/Desktop/PHD/Swing project/code/script/GPT2.py', wdir='C:/Users/cnelias/Desktop/PHD/Swing project/code/script')
File "C:\Users\cnelias\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 786, in runfile
execfile(filename, namespace)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/cnelias/Desktop/PHD/Swing project/code/script/GPT2.py", line 191, in <module>
model = GPT2Model(config)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 355, in __init__
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 355, in <listcomp>
self.h = nn.ModuleList([Block(config.n_ctx, config, scale=True) for _ in range(config.n_layer)])
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 223, in __init__
self.attn = Attention(nx, n_ctx, config, scale)
File "C:\Users\cnelias\Anaconda3\lib\site-packages\transformers\modeling_gpt2.py", line 109, in __init__
assert n_state % config.n_head == 0
What does it mean and how can I solve it?
Also, more generally, is there documentation on how to do a forward call with GPT2? Can I define my own train() function or do I have to use the model's built-in function? Am I forced to use a Dataset for training or can I feed it individual tensors?
I looked for it but couldn't find answers to these in the docs, but maybe I missed something.
PS: I already read the blog post from huggingface.co, but it omits too much information and detail to be useful for my application.
I think the error message is pretty clear:
assert n_state % config.n_head == 0
Tracing it back through the code, we can see
n_state = nx # in Attention: n_state=768
which indicates that n_state represents the embedding dimension (which is generally 768 by default in BERT-like models). Looking at the GPT-2 documentation, the parameter that specifies this is n_embd, which you are setting to 5. As the error indicates, the embedding dimension has to be evenly divisible by the number of attention heads, which you specified as 4. So choosing an embedding dimension that is a multiple of 4 should solve the problem. You can also change the number of heads instead, but it must divide the embedding dimension, and 5 is only divisible by 1 and 5.
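As a quick sanity check before building the config, the divisibility rule from the assert can be tested in plain Python. This is only a sketch: valid_head_counts and nearest_valid_embedding are hypothetical helpers written for this answer, not part of transformers.

```python
def valid_head_counts(n_embd):
    """Return every n_head value that divides n_embd evenly."""
    return [h for h in range(1, n_embd + 1) if n_embd % h == 0]

def nearest_valid_embedding(n_embd, n_head):
    """Round n_embd up to the next multiple of n_head."""
    return -(-n_embd // n_head) * n_head

EMBEDSIZE = 5  # values from the question
NUMHEAD = 4

print(EMBEDSIZE % NUMHEAD == 0)                      # the condition the assert checks
print(valid_head_counts(EMBEDSIZE))                  # head counts usable with n_embd = 5
print(nearest_valid_embedding(EMBEDSIZE, NUMHEAD))   # smallest n_embd >= 5 usable with 4 heads
```

Either of the last two results can then be passed to GPT2Config as n_head or n_embd respectively.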

How to fix error "'_BaseUniformDistribution' object has no attribute 'to_internal_repr'" - strange behaviour in optuna

I am trying to use the optuna library in Python to optimise parameters for recommender-system models. The models are custom and look like standard sklearn fit-predict models (with get/set params methods).
What I do: a simple objective function that samples two parameters from a uniform distribution, sets them on the model, runs predict (there is no fit stage, as this simple model uses the params only at the predict stage) and calculates a metric.
What I get: the first trial runs normally: it samples params and prints results to the log. But on the second and subsequent trials I get strange errors (see the output below) that I can't solve or google. When I run the study with just 1 trial, everything is okay.
What I tried: rearranging parts of the objective function, putting the fit stage inside, computing simpler metrics. Nothing helps.
Here is my objective function:
# getting train, test
# fitting model
self.model = SomeRecommender()
self.model.fit(train, some_other_params)
def objective(trial: optuna.Trial):
    # save study
    if path is not None:
        joblib.dump(study, some_path)
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # setting params to model
    params = {'alpha': alpha,
              'beta': beta}
    self.model.set_params(**params)
    # getting predict
    recs = self.model.predict(some_other_params)
    # metric computing
    metric_result = Metrics.hit_rate_at_k(recs, test, k=k)
    return metric_result
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
That's what I get on three trials:
[I 2019-10-01 12:53:59,019] Finished trial#0 resulted in value: 0.1. Current best value is 0.1 with parameters: {'alpha': 59.6135986324444, 'beta': 40.714559720597585}.
[W 2019-10-01 13:39:58,140] Setting status of trial#1 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
[W 2019-10-01 13:39:58,206] Setting status of trial#2 as TrialState.FAIL because of the following error: AttributeError("'_BaseUniformDistribution' object has no attribute 'to_internal_repr'")
Traceback (most recent call last):
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/study.py", line 448, in _run_trial
result = func(trial)
File "/Users/roseaysina/code/project/model.py", line 100, in objective
'alpha', 0, 100)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 180, in suggest_uniform
return self._suggest(name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/trial.py", line 453, in _suggest
self.study, trial, name, distribution)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 127, in sample_independent
values, scores = _get_observation_pairs(study, param_name)
File "/Users/roseaysina/anaconda3/envs/sauvage/lib/python3.7/site-packages/optuna/samplers/tpe/sampler.py", line 558, in _get_observation_pairs
param_value = distribution.to_internal_repr(trial.params[param_name])
AttributeError: '_BaseUniformDistribution' object has no attribute 'to_internal_repr'
I can't understand where the problem is and why the first trial works. Please help.
Thank you!
Your code seems to have no problems.
I ran a simplified version of your code (see below), and it worked well in my environment:
import optuna
def objective(trial: optuna.Trial):
    # sampling params
    alpha = trial.suggest_uniform('alpha', 0, 100)
    beta = trial.suggest_uniform('beta', 0, 100)
    # evaluating params
    return alpha + beta
# starting study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=3, n_jobs=1)
Could you tell me about your environment in order to investigate the problem? (e.g., OS, Python version, Python interpreter (CPython, PyPy, IronPython or Jython), Optuna version)
Regarding "why the first trial is working": this error is raised at optuna/samplers/tpe/sampler.py#558, and that line is only executed when the number of completed trials in the study is greater than zero.
BTW, you might be able to avoid this problem by using RandomSampler as follows:
sampler = optuna.samplers.RandomSampler()
study = optuna.create_study(direction='maximize', sampler=sampler)
Note that the optimization performance of RandomSampler tends to be worse than that of TPESampler, which is Optuna's default sampler.

mnlogit regression, singular matrix error

My regression model using statsmodels in Python works with 48,065 lines of data, but while adding new data I have tracked down one row of data that produces a singular matrix error. Answers to similar questions seem to suggest missing data, but I have checked and there is nothing visibly irregular in the error-prone row. Does anyone know whether this is an error in my code, or know a solution? I'm out of ideas.
Data2.csv - http://www.sharecsv.com/s/8ff31545056b8864f2ad26ef2fe38a09/Data2.csv
import pandas as pd
import statsmodels.formula.api as smf
data = pd.read_csv("Data2.csv")
formula = 'is_success ~ goal_angle + goal_distance + np_distance + fp_distance + is_fast_attack + is_header + prev_tb + is_rebound + is_penalty + prev_cross + is_tb2 + is_own_goal + is_cutback + asst_dist'
model = smf.mnlogit(formula, data=data, missing='drop').fit()
CSV Line producing error: 0,0,0,0,0,0,0,1,22.94476,16.877204,13.484806,20.924627,0,0,11.765203
Error with Problematic line within the model:
runfile('C:/Users/User1/Desktop/Model Check.py', wdir='C:/Users/User1/Desktop')
Optimization terminated successfully.
Current function value: 0.264334
Iterations 20
Traceback (most recent call last):
File "<ipython-input-76-eace3b458e24>", line 1, in <module>
runfile('C:/Users/User1/Desktop/xG_xA Model Check.py', wdir='C:/Users/User1/Desktop')
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Users\User1\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/User1/Desktop/xG_xA Model Check.py", line 6, in <module>
model = smf.mnlogit(formula, data=data, missing='drop').fit()
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\discrete\discrete_model.py", line 587, in fit
disp=disp, callback=callback, **kwargs)
File "C:\Users\User1\Anaconda2\lib\site-packages\statsmodels\base\model.py", line 434, in fit
Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 526, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "C:\Users\User1\Anaconda2\lib\site-packages\numpy\linalg\linalg.py", line 90, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
LinAlgError: Singular matrix
As far as I can see:
The problem is the variable is_own_goal, because all observations where it is 1 also have the dependent variable is_success equal to 1. That means there is no variation in the outcome because is_own_goal already specifies that it is a success.
As a consequence, we cannot estimate a coefficient for is_own_goal; the coefficient is not identified by the data. The variance of the coefficient would be infinite, and inverting the Hessian to get the covariance of the parameter estimates fails because the Hessian is singular.
Given floating point precision, with some computational noise the Hessian might be invertible and the Singular Matrix exception would not show up, which, I guess, is the reason that it works with some but not all observations.
BTW: If the dependent variable, endog, is binary, then Logit is more appropriate, even though MNLogit has it as a special case.
BTW: Penalized estimation would be another way to force an estimate even in singular cases, although the coefficient would still not be identified by the data and be just a consequence of the penalization.
In this example,
mod = smf.logit(formula, data=data, missing='drop').fit_regularized()
works for me. This is L1 penalization. In statsmodels 0.8, there is also elastic net penalization for GLM which has Binomial (i.e. Logit) as a family.
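To find this kind of perfect prediction before fitting, one can check each dummy variable for rows where it equals 1 and see whether the outcome ever varies. The following is a rough sketch in plain Python; perfectly_predicting_dummies is a hypothetical helper and toy_rows is made-up data that only reuses the column names from the question, not rows from Data2.csv.

```python
def perfectly_predicting_dummies(rows, outcome, dummies):
    """Map each dummy that always coincides with one outcome value to that value."""
    flagged = {}
    for name in dummies:
        # Collect the distinct outcomes among rows where the dummy is switched on.
        outcomes = {row[outcome] for row in rows if row[name] == 1}
        if len(outcomes) == 1:
            flagged[name] = outcomes.pop()
    return flagged

# Made-up rows for illustration only.
toy_rows = [
    {"is_success": 1, "is_own_goal": 1, "is_header": 0},
    {"is_success": 1, "is_own_goal": 0, "is_header": 1},
    {"is_success": 0, "is_own_goal": 0, "is_header": 1},
    {"is_success": 0, "is_own_goal": 0, "is_header": 0},
]

flagged = perfectly_predicting_dummies(
    toy_rows, "is_success", ["is_own_goal", "is_header"]
)
print(flagged)
```

Any variable that shows up in the result is a candidate for being dropped from the formula or handled with penalized estimation as described above. With the real data, the rows could come from csv.DictReader after converting the fields to numbers.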

Error in statsmodels.api OLS predict attribute using complex formula

I am trying to use an OLS regression to predict missing (NaN) values of ustar from known data of wind speed (WS), the variation of WS by month, and radiation (Rn). All variables in the formula have some missing data at some point within the dataframe, but my regression formula gave me strong correlations with all the variables in the formula and an R-squared value of 0.80, so I know this gap-filling method of predicted regression data is feasible. Here is my code:
import pandas as pd
import statsmodels.api as sm
regression_data = pd.DataFrame([])
regression_data['ustar'] = data['ustar']
regression_data['WS'] = data['WS']
regression_data['Rn'] = data['Rn']
regression_data['month'] = data.index.month
formula = "ustar ~ WS + (WS:C(month)) + (WS:Rn) + 1"
regression_model = sm.regression.linear_model.OLS.from_formula(formula,regression_data)
results = regression_model.fit()
predicted_values = results.predict(regression_data)
Traceback (most recent call last):
File "<ipython-input-61-073df0b2ae63>", line 1, in <module>
predicted_values = results.predict(regression_data)
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/statsmodels/base/model.py", line 739, in predict
exog = dmatrix(self.model.data.orig_exog.design_info.builder,
File "/Users/JasonDucker/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2360, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'design_info'
I understand that there have been similar issues with the same error in the past, but I do not know whether the complexity of my formula is what the predict method is handling poorly. I was wondering if anyone has a perspective on how I should approach this problem.

ZeroDivisionError when using scipy.interpolate.griddata

I'm getting a ZeroDivisionError from the following code:
# Stacking the arrays into a complex array allows np.unique to choose
# truly unique points. We also keep a handle on the unique indices
# to allow us to index `self` in the same order.
unique_points, index = np.unique(xdata[mask] + 1j*ydata[mask],
                                 return_index=True)
#Now we break it into the data structure we need.
points = np.column_stack((unique_points.real,unique_points.imag))
xx1,xx2 = self.meta['rcm_xx1'],self.meta['rcm_xx2']
yy1 = self.meta['rcm_yy2']
gx = np.arange(xx1,xx2+dx,dx)
gy = np.arange(-yy1,yy1+dy,dy)
GX,GY = np.meshgrid(gx,gy)
xi = np.column_stack((GX.ravel(),GY.ravel()))
gdata = griddata(points, self[mask][index], xi, method='linear',
                 fill_value=np.nan)
Here, xdata,ydata and self are all 2D numpy.ndarrays (or subclasses thereof) with the same shape and dtype=np.float32. mask is a 2d ndarray with the same shape and dtype=bool. Here's a link for those wanting to peruse the scipy.interpolate.griddata documentation.
Originally, xdata and ydata are derived from a non-uniform cylindrical grid that has a 4 point stencil -- I thought that the error might be coming from the fact that the same point was defined multiple times, so I made the set of input points unique as suggested in this question. Unfortunately, that hasn't seemed to help. The full traceback is:
Traceback (most recent call last):
File "/xxxxxxx/rcm.py", line 428, in <module>
x[...,1].to_pz0()
File "/xxxxxxx/rcm.py", line 285, in to_pz0
fill_value=fill_value)
File "/usr/local/lib/python2.7/site-packages/scipy/interpolate/ndgriddata.py", line 183, in griddata
ip = LinearNDInterpolator(points, values, fill_value=fill_value)
File "interpnd.pyx", line 192, in scipy.interpolate.interpnd.LinearNDInterpolator.__init__ (scipy/interpolate/interpnd.c:2935)
File "qhull.pyx", line 996, in scipy.spatial.qhull.Delaunay.__init__ (scipy/spatial/qhull.c:6607)
File "qhull.pyx", line 183, in scipy.spatial.qhull._construct_delaunay (scipy/spatial/qhull.c:1919)
ZeroDivisionError: float division
For what it's worth, the code "works" (No exception) if I use the "nearest" method.
