FloatingPointError from PyMC in sampling from a Dirichlet distribution

FloatingPointError from PyMC in sampling from a Dirichlet distribution - python

After being unsuccessful in using decorators to define the stochastic object of the "logarithm of an exponential random variable", I decided to manually write the code for this new distribution using pymc.stochastic_from_dist. The model that I am trying to implement is available here(the first model):
Now when I try to sample the log(alpha) using MCMC Metropolis and with a Normal distribution as proposal(as it has been stated in the following picture as the sampling method), I am getting the following error:
File "/Library/Python/2.7/site-packages/pymc/distributions.py", line 980, in rdirichlet
return (gammas[0]/gammas[0].sum())[:-1]
FloatingPointError: invalid value encountered in divide
Although the times that the sampling doesn't run into error the sampling histograms are matching with the ones in this paper. My hierarchical model is:
"""
A Hierarchical Bayesian Model for Bags of Marbles
logalpha ~ logarithm of an exponential distribution with parameter lambd
beta ~ Dirichlet([black and white ball proportions]:vector of 1's)
theta ~ Dirichlet(alpha*beta(vector))
"""
import numpy as np
import pymc
from scipy.stats import expon
lambd=1.
__all__=['alpha','beta','theta','logalpha']
#------------------------------------------------------------
# Set up pyMC model: logExponential
# 1 parameter: (alpha)
def logExp_like(x,explambda):
"""log-likelihood for logExponential"""
return -lambd*np.exp(x)+x
def rlogexp(explambda, size=None):
"""random variable from logExponential"""
sample=np.random.exponential(explambda,size)
logSample=np.log(sample)
return logSample
logExponential=pymc.stochastic_from_dist('logExponential',logp=logExp_like,
random=rlogexp,
dtype=np.float,
mv=False)
#------------------------------------------------------------
#Defining model parameteres alpha and beta.
beta=pymc.Dirichlet('beta',theta=[1,1])
logalpha=logExponential('logalpha',lambd)
#pymc.deterministic(plot=False)
def multipar(a=logalpha,b=beta):
out=np.empty(2)
out[0]=(np.exp(a)*b)
out[1]=(np.exp(a)*(1-b))
return out
theta=pymc.Dirichlet('theta',theta=multipar)
And my test sampling code is:
from pymc import Metropolis
from pymc import MCMC
from matplotlib import pyplot as plt
import HBM
import numpy as np
import pymc
import scipy
M=MCMC(HBM)
M.use_step_method(Metropolis,HBM.logalpha, proposal_sd=1.,proposal_distribution='Normal')
M.sample(iter=1000,burn=200)
When I check the values of theta passed to gamma distribution in line 978 of distributions.py I see that there are not zero but small values! So I don't know how to prevent this floating point error?

I found this nugget in their documentation:
The stochastic variable cutoff cannot be smaller than the largest element of D, otherwise D’s density would be zero. The standard Metropolis step method can handle this case without problems; it will propose illegal values occasionally, but these will be rejected.
This would lead me to believe that the dtype=np.float (which is essential has the same range as float), may not be the method you want to go about. The documentation says it needs to be a numpy dtype, but it just needs to be a dtype that converts to a numpy dtype object and in Python2 (correct me if I'm wrong) number dtypes were fixed size types meaning they're the same. Maybe utilizing the Decimal module would be an option. This way you can set the level of precision to encapsulate expected value ranges, and pass it to your extended stochastic method where it would be converted.
from decimal import Decimal, getcontext
getcontext().prec = 15
dtype=Decimal
I don't know this wouldn't still be truncated once the numpy library got a hold of it, or if it would respect the inherited level of precision. I have no accurate method of testing this, but give it a try and let me know how that works for you.
Edit: I tested the notion of precision inheritance and it would seem to hold:
>>> from decimal import Decimal, getcontext
>>> getcontext().prec = 10
>>> Decimal(1) / Decimal(7)
Decimal('0.1428571429')
>>> np.float(Decimal(1) / Decimal(7))
0.1428571429
>>> getcontext().prec = 15
>>> np.float(Decimal(1) / Decimal(7))
0.142857142857143
>>>

If you do get small numbers, it might simply be too small for a float. This is typically also what the logarithms are there for to avoid. What if you use dtype=np.float64?

As you have suggested at the end of your question, the issue is with too small numbers that are float-casted to 0. One solution could be to tweak a little the source code and replace the division with for example np.divide and in the "where" condition to add some explicit casting for to small values to a given threshold.

Related

Numerical Solver in Python is not able to find a solution

I broke my problem down as follows. I am not able to solve the following equation with Python 3.9 in a meaningful way, instead it always stops with the initial_guess for small lambda_ < 1. Is there an alternative algorithm that can handle the error function better? Or can I force fsolve to search until a solution is found?
import numpy as np
from scipy.special import erfcinv, erfc
from scipy.optimize import root, fsolve
def Q(x):
return 0.5*erfc(x/np.sqrt(2))
def Qinvers(x):
return np.sqrt(2)*erfcinv(2*x)
def epseqn(epsilon2):
lambda_ = 0.1
return Q(lambda_*Qinvers(epsilon2))
eps1 = fsolve(epseqn, 1e-2)
print(eps1)
I tried root and fsolve to get a solution. Especially for the gaussian error function I do not find a solution that converges.

root and fsolve can be used to find the roots of a function defined by f(x)=0. Since your outer function, which is basically erfc(x), has no root (it only it approaches the x-axis asymptotically from positive values) the solvers are not able to find one. Real function arguments are assumed like you did.

Before blindly starting with numerical calculations, I would recommend to think about any constraints of your function.
You will find out, that your function is only defined for values between zero and one. If you assume that there is only a single root in this interval, I would recommend to use an interval search method like brentq, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brentq.html#scipy.optimize.brentq and https://en.wikipedia.org/wiki/Brent%27s_method.
However, you could instead think further and/or just plot your function, e.g. using matplotlib
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 1000)
y = epseqn(x)
plt.plot(x, y)
plt.show()
There you will see that the root is at zero, which makes sense when looking at your functions, because the inverse cumulative error function is minus infinity at zero and the regular error function gives you zero at minus infinity (mathematically in the limit sense, but numerically those functions are also defined for such input values). So without any numeric calculation, you can get the root value.

scipy's fsolve gives answers that aren't roots

I am trying to find roots of a function in python using fsolve:
import math
import scipy
def f(a):
eq=-2*a**2 - 2*a**2*(math.sin(25*a**(1/4)))**2 - 2*a**2*(math.cos(25*a**(1/4)))**2 - 2*math.exp(-25*a**(1/4))*a**2*math.cos(25*a**(1/4)) - 2*math.exp(25*a**(1/4))*a**2*math.cos(25*a**(1/4))
return eq
print(f(scipy.optimize.fsolve(f,10)))
and it returns the following value:
[1234839.75468454]
That doesn't seem very close to 0 to me... Does it simply lack the computational power to calculate more decimals for the root? If so, what would be a good alternative for fsolve that could also calculate roots, just more accurately?

To better understand what happens, a first step would be to look at the infos of the run, which you can get by setting the full_output argument to True (see https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.fsolve.html for more details).
When starting with an initial point of 10 as you do, the algorithm claims to converge and do so in less evaluations than the maximum allocated, so it is not a problem of computing power.

Curve fit with an list of point

One of my python script gives me some results depending on processing duration, which I display like that:
Now I would like to trace the function's curve which approximate the best the results evolution.
After few researches, the best tool I found is the curve_fit of scipy.optimize.
There is just one problem, the function curve_fit requires at first parameter a function (if I have well understand the documentation's example) but my points on the graph are not the results of a function, so I don't know what to put here.
Can someone help me to fix this problem or proposing me another way t do that?
Thanks.

When you say "now I would like to trace the function's curve which approximate the best the results evolution", you must have some sort of curve in mind that is the ideal form for the data. So, what is that function? In curve-fitting, that function is called "the model function" -- the function that models your data.
Think of it this way: you have 50 or so measurement points. You might believe that they are each perfectly accurate and free-of-error. But since you asked about curve-fitting, this is probably not the case. That is, you probably believe there is some noise or errors in the data and that the data can be represented by an idealized function with many fewer than 50 or so parameters (I'd guess 4 or so).
That idealized function that explains your model (and would allow predicting "optimum" values at "duration" points that you did not measure) is the "model function". If you have that, curve-fitting can help: you write that function (which probably depends on a few Parameters) to model the data in python and find the best values for the Parameters so that the model matches your data. If you don't have that, what do you mean by "curve-fitting"?
You could draw a spline through the data or otherwise smooth the data, but that gives little power about predicting new values that would be different from "interpolate/extrapolate the data without worrying about the effect of noise".

It looks like an "exponential approach" type of curve like you get for charging a capacitor - see here.
So, I'd start with this formula:
y = a * ( 1 - n * np.exp(-b*x))
If I plot that with Matplotlib:
#!/usr/bin/env python3
import numpy as np
import matplotlib.pyplot as plt
# Make 100 samples along x-axis, from 0..10
x = np.linspace(0,10,100)
# Make an exponential approach type of curve
a = 17000
n = 1
b = 3
y = a * ( 1 - n * np.exp(-b*x))
# Plot it
plt.title(f'Plot for a={a}, n={n}, b={b}')
plt.plot(x,y)
plt.show()

Scipy fisher_exact test taking a very long time

I'm using scipy version 1.0.0.
import scipy as sp
x = [[5829225, 5692693], [5760959, 5760959]]
sp.stats.fisher_exact(x)
For the values above scipy does not return anything but waits.
What can be the reason for that?
How can I fix it?
However in R it returns a p-value almost immediately.
a = matrix(c(5829225,5692693,5760959,5760959), nrow=2)
fisher.test(a)

From the notes in the documentation:
The calculated odds ratio is different from the one R uses. This scipy implementation returns the (more common) “unconditional Maximum Likelihood Estimate”, while R uses the “conditional Maximum Likelihood Estimate”.
For tables with large numbers, the (inexact) chi-square test implemented in the function chi2_contingency can also be used.
(Emphasis mine)
Like DSM's comment mentioned, it's probably just very slow for your large values. And since the notes call out large values, you might try the alternative they suggest:
>>> chi2, p, dof, expected = sp.stats.chi2_contingency(x)
>>> p
6.140729432506709e-178

Fitting a distribution to data: how to penalize "bad" parameter estimates?

I'm using using scipy's least-squares optimization to fit an exponentially-modified gaussian distribution to a set of reaction time measurements. In general, it works well, but sometimes, the optimization goes off the rails and chooses a crazy value for a parameter -- the resulting plot clearly doesn't fit the data very well. In general, it looks like the problems arise from floating-point precision errors -- we head off into 0 or inf or nan-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the error will even be generally the same. But getting the initial parameter fits in there: I'm not quite sure if it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: If X contains at least one nan, np.sum(X) == nan when X is a numpy array but not when X is a pandas Series. So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!

This should have been a comment but I run out of space.
I think a Maximum Likelihood fit is probably the most appropriate approach here. ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of normal distribution by calling scipy.stats.norm.fit and find the MLE of exponential distribution in a similar way. Combining these two resulting MLE parameters should give you a pretty good starting parameter for Ex-Gaussian ML fit. In fact I would imaging most of your data is quite nicely Normally distributed. If that is the case, the ML parameter estimates for Normal distribution alone should give you a pretty good starting parameter.
Since Ex-Gaussian only has 3 parameters, I don't think a ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it will be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x): #defines the PDF
m=p[0]
s=p[1]
l=p[2]
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
>>> xo=array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
>>> sss.norm.fit(xo) #get the starting parameter vector form the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x): #defines the negative log-likelihood function
return -sum(log(f(p,x)))
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([ 6.14003407e+02, 1.31843250e+02, 9.79425845e-02])
>>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so, we increase the number of function call uplimit
Optimization terminated successfully.
Current function value: 170.195924
Iterations: 376
Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
170.19592431051217
>>> przt
array([ 6.14007039e+02, 1.31844654e+02, 9.78934519e-02])
The optimizer used here (fmin, or Nelder-Mead simplex algorithm) does not use any information from gradient and usually works much slower than the optimizer that does. It appears that the derivative of the negative log-likelihood function of Exponential Gaussian may be written in a close form easily. If so, optimizers that utilize gradient/derivative will be better and more efficient choice (such as fmin_bfgs).
The other thing to consider is parameter constrains. By definition, sigma and lambda has to be positive for Exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize for:
>>> def eg_pdf2(p, x): #defines the PDF
m=p[0]
s=exp(p[1])
l=exp(p[2])
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
Due to the functional invariance property of MLE, the MLE of this function should be the same as same as the original eg_pdf. There are other transformation that you can use, besides exp(), to project (-inf, +inf) to (0, +inf).
And you can also consider http://en.wikipedia.org/wiki/Lagrange_multiplier.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.