Best way to write a Python function that integrates a gaussian?

Best way to write a Python function that integrates a gaussian? - python

In attempting to use scipy's quad method to integrate a gaussian (lets say there's a gaussian method named gauss), I was having problems passing needed parameters to gauss and leaving quad to do the integration over the correct variable. Does anyone have a good example of how to use quad w/ a multidimensional function?
But this led me to a more grand question about the best way to integrate a gaussian in general. I didn't find a gaussian integrate in scipy (to my surprise). My plan was to write a simple gaussian function and pass it to quad (or maybe now a fixed width integrator). What would you do?
Edit: Fixed-width meaning something like trapz that uses a fixed dx to calculate areas under a curve.
What I've come to so far is a method make___gauss that returns a lambda function that can then go into quad. This way I can make a normal function with the average and variance I need before integrating.
def make_gauss(N, sigma, mu):
return (lambda x: N/(sigma * (2*numpy.pi)**.5) *
numpy.e ** (-(x-mu)**2/(2 * sigma**2)))
quad(make_gauss(N=10, sigma=2, mu=0), -inf, inf)
When I tried passing a general gaussian function (that needs to be called with x, N, mu, and sigma) and filling in some of the values using quad like
quad(gen_gauss, -inf, inf, (10,2,0))
the parameters 10, 2, and 0 did NOT necessarily match N=10, sigma=2, mu=0, which prompted the more extended definition.
The erf(z) in scipy.special would require me to define exactly what t is initially, but it nice to know it is there.

Okay, you appear to be pretty confused about several things. Let's start at the beginning: you mentioned a "multidimensional function", but then go on to discuss the usual one-variable Gaussian curve. This is not a multidimensional function: when you integrate it, you only integrate one variable (x). The distinction is important to make, because there is a monster called a "multivariate Gaussian distribution" which is a true multidimensional function and, if integrated, requires integrating over two or more variables (which uses the expensive Monte Carlo technique I mentioned before). But you seem to just be talking about the regular one-variable Gaussian, which is much easier to work with, integrate, and all that.
The one-variable Gaussian distribution has two parameters, sigma and mu, and is a function of a single variable we'll denote x. You also appear to be carrying around a normalization parameter n (which is useful in a couple of applications). Normalization parameters are usually not included in calculations, since you can just tack them back on at the end (remember, integration is a linear operator: int(n*f(x), x) = n*int(f(x), x) ). But we can carry it around if you like; the notation I like for a normal distribution is then
N(x | mu, sigma, n) := (n/(sigma*sqrt(2*pi))) * exp((-(x-mu)^2)/(2*sigma^2))
(read that as "the normal distribution of x given sigma, mu, and n is given by...") So far, so good; this matches the function you've got. Notice that the only true variable here is x: the other three parameters are fixed for any particular Gaussian.
Now for a mathematical fact: it is provably true that all Gaussian curves have the same shape, they're just shifted around a little bit. So we can work with N(x|0,1,1), called the "standard normal distribution", and just translate our results back to the general Gaussian curve. So if you have the integral of N(x|0,1,1), you can trivially calculate the integral of any Gaussian. This integral appears so frequently that it has a special name: the error function erf. Because of some old conventions, it's not exactly erf; there are a couple additive and multiplicative factors also being carried around.
If Phi(z) = integral(N(x|0,1,1), -inf, z); that is, Phi(z) is the integral of the standard normal distribution from minus infinity up to z, then it's true by the definition of the error function that
Phi(z) = 0.5 + 0.5 * erf(z / sqrt(2)).
Likewise, if Phi(z | mu, sigma, n) = integral( N(x|sigma, mu, n), -inf, z); that is, Phi(z | mu, sigma, n) is the integral of the normal distribution given parameters mu, sigma, and n from minus infinity up to z, then it's true by the definition of the error function that
Phi(z | mu, sigma, n) = (n/2) * (1 + erf((x - mu) / (sigma * sqrt(2)))).
Take a look at the Wikipedia article on the normal CDF if you want more detail or a proof of this fact.
Okay, that should be enough background explanation. Back to your (edited) post. You say "The erf(z) in scipy.special would require me to define exactly what t is initially". I have no idea what you mean by this; where does t (time?) enter into this at all? Hopefully the explanation above has demystified the error function a bit and it's clearer now as to why the error function is the right function for the job.
Your Python code is OK, but I would prefer a closure over a lambda:
def make_gauss(N, sigma, mu):
k = N / (sigma * math.sqrt(2*math.pi))
s = -1.0 / (2 * sigma * sigma)
def f(x):
return k * math.exp(s * (x - mu)*(x - mu))
return f
Using a closure enables precomputation of constants k and s, so the returned function will need to do less work each time it's called (which can be important if you're integrating it, which means it'll be called many times). Also, I have avoided any use of the exponentiation operator **, which is slower than just writing the squaring out, and hoisted the divide out of the inner loop and replaced it with a multiply. I haven't looked at all at their implementation in Python, but from my last time tuning an inner loop for pure speed using raw x87 assembly, I seem to remember that adds, subtracts, or multiplies take about 4 CPU cycles each, divides about 36, and exponentiation about 200. That was a couple years ago, so take those numbers with a grain of salt; still, it illustrates their relative complexity. As well, calculating exp(x) the brute-force way is a very bad idea; there are tricks you can take when writing a good implementation of exp(x) that make it significantly faster and more accurate than a general a**b style exponentiation.
I've never used the numpy version of the constants pi and e; I've always stuck with the plain old math module's versions. I don't know why you might prefer either one.
I'm not sure what you're going for with the quad() call. quad(gen_gauss, -inf, inf, (10,2,0)) ought to integrate a renormalized Gaussian from minus infinity to plus infinity, and should always spit out 10 (your normalization factor), since the Gaussian integrates to 1 over the real line. Any answer far from 10 (I wouldn't expect exactly 10 since quad() is only an approximation, after all) means something is screwed up somewhere... hard to say what's screwed up without knowing the actual return value and possibly the inner workings of quad().
Hopefully that has demystified some of the confusion, and explained why the error function is the right answer to your problem, as well as how to do it all yourself if you're curious. If any of my explanation wasn't clear, I suggest taking a quick look at Wikipedia first; if you still have questions, don't hesitate to ask.

scipy ships with the "error function", aka Gaussian integral:
import scipy.special
help(scipy.special.erf)

The gaussian distribution is also called a normal distribution. The cdf function in the scipy norm module does what you want.
from scipy.stats import norm
print norm.cdf(0.0)
>>>0.5
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html#scipy.stats.norm

Why not just always do your integration from -infinity to +infinity, so that you always know the answer? (joking!)
My guess is that the only reason that there's not already a canned Gaussian function in SciPy is that it's a trivial function to write. Your suggestion about writing your own function and passing it to quad to integrate sounds excellent. It uses the accepted SciPy tool for doing this, it's minimal code effort for you, and it's very readable for other people even if they've never seen SciPy.
What exactly do you mean by a fixed-width integrator? Do you mean using a different algorithm than whatever QUADPACK is using?
Edit: For completeness, here's something like what I'd try for a Gaussian with the mean of 0 and standard deviation of 1 from 0 to +infinity:
from scipy.integrate import quad
from math import pi, exp
mean = 0
sd = 1
quad(lambda x: 1 / ( sd * ( 2 * pi ) ** 0.5 ) * exp( x ** 2 / (-2 * sd ** 2) ), 0, inf )
That's a little ugly because the Gaussian function is a little long, but still pretty trivial to write.

I assume you're handling multivariate Gaussians; if so, SciPy already has the function you're looking for: it's called MVNDIST ("MultiVariate Normal DISTribution). The SciPy documentation is, as ever, terrible, so I can't even find where the function is buried, but it's in there somewhere. The documentation is easily the worst part of SciPy, and has frustrated me to no end in the past.
Single-variable Gaussians just use the good old error function, of which many implementations are available.
As for attacking the problem in general, yes, as James Thompson mentions, you just want to write your own gaussian distribution function and feed it to quad(). If you can avoid the generalized integration, though, it's a good idea to do so -- specialized integration techniques for a particular function (like MVNDIST uses) are going to be much faster than a standard Monte Carlo multidimensional integration, which can be extremely slow for high accuracy.

Related

`np.linalg.solve` get solution matrix?

np.linalg.solve solves for x in a problem of the form Ax = b.
For my application, this is done to avoid calculating the inverse explicitly (i.e inverse(A)b = x)
I'd like to access what the effective inverse is that was used to solve this problem but looking at the documentation it doesn't appear to be an option... Is there a reasonable alternative approach I can follow to recover the inverse of A?
(np.linalg.inv(A) is not accurate enough for my use case)

Following the docs and source code, it seems NumPy is calling LAPACK's _gesv to compute the solution, the documentation of which reads:
The routine solves for X the system of linear equations A*X = B, where
A is an n-by-n matrix, the columns of matrix B are individual
right-hand sides, and the columns of X are the corresponding
solutions.
The LU decomposition with partial pivoting and row interchanges is
used to factor A as A = P * L * U, where P is a permutation matrix, L is
unit lower triangular, and U is upper triangular. The factored form of
A is then used to solve the system of equations A * X = B.
The NumPy implementation for solve doesn't return the inverted matrix back to the caller, and just frees the memory for the inverted matrix, so there's no hope there. SciPy provides low-level access to LAPACK so you should be able to access the result from there. You can follow the actual implementation in LAPACK's Fortran source code dgesv.f, dgetrf.f and dgetrs.f. Alternatively, you could note that NumPy's inv still calls the same underlying code, so it might be enough for your use case... You didn't specify why is it that you need the approximate inverse matrix.

Is there a way to integrate the product of two functions numerically in Python?

I have two functions which take multiple arguments:
import numpy as np
from scipy.integrate import quad
gamma_s=0.1 #eV
gamma_d=0.1 #eV
T=298 #K
homo=-5.5 #eV
Ef=-5 #eV
mu=0 #eV just displaces the function
#Fermi-Dirac distribution
k=8.617333262e-5 #eV/K
def fermi (E:float, mu:float, T:float) -> float:
return 1/(1+np.exp((E-mu)/(k*T)))
#Lorentzian density of states
gamma=gamma_d+gamma_s
def DoS (E:float, gamma:float, homo:float, Ef:float) -> float:
epsilon=homo-Ef
v=E-epsilon
u=gamma/2
return gamma/(np.pi*((v*v)+(u*u)))
I know that if I want to integrate just one of them, say fermi, then I would use
quad(fermi, -np.inf, np.inf, args=(mu,T))
But I need the integral of their product fermi*DoS with respect to their common variable E, and I can't imagine how to do it with quad, since there is no mention of it in the documentation.
I guess I could define another function integrand as their product and compute its integral, however that sounds somewhat messy and I would prefer a cleaner way of doing it.

You don't have to define a new, standalone function if something inline is more appealing to you:
quad(lambda e: fermi(e, mu, T) * DoS(e, gamma, homo, T), -np.inf, np.inf)
That is, we use partial application to turn the product of fermi and DoS into a new Python lambda.
Just to give some mathematical justification for the need to do something like this...
Mathematically speaking, one can only integrate (integrable) functions (or elements of function spaces derived from integrable functions). To integrate the product of two functions, we have to say which function is meant by their product. After a while, this may feel obvious, but here I think it's worth noting that humans defined
(fg)(x) := f(x)g(x).
In the same way one must give mathematical meaning to the product of functions, one must give meaning to the product of two Python functions. Especially because Python functions can return all sorts of things, many of which make no sense to multiply, so there couldn't be a general definition.

Limits involving the cumulative distribution function of a normal variable

I'm working through some exercises on improper integrals and I've stumbled across an issue I can't resolve. I'm attempting to use the limit() function on the following problem:
Here N(x) is the cumulative distribution function of the standard normal variable.
The limit() function so far hasn't caused any problems, including problems which require L'Hôpital's rule be applied. However, I'm struggling to get compute the correct answer for this particular problem and can't work out why. The following code yields an incorrect answer
from sympy import *
x, y = symbols('x y')
init_printing(use_unicode=False) #Print the answers in unicode characters
cum_distribution = (1/sqrt(2*pi)*(integrate(exp(-y**2/2), (y, -oo, x))))
func = (cum_distribution -(1/2)-(x/sqrt(2*pi)))/(x**3)
limit(func, x, 0)
If I apply L'Hôpital's rule, i get the correct
l_hopital = diff((cum_distribution -(1/2)-(x/sqrt(2*pi))), x)/diff(x**3, x)
limit(l_hopital, x, 0)
I looked through the limit() function source code and my understanding is that L'Hôpital's rule isn't applied? In this case, can this problem be solved using the limit() function without applying this rule?

At present, a limit involving the function erf (known as the error function, related to normal CDF) can only be evaluated when the argument of erf tends to positive infinity. Limits at other places are either not evaluated, or evaluated incorrectly. (Related PR). This includes the limit
limit(-(sqrt(2)*x - sqrt(pi)*erf(sqrt(2)*x/2))/(2*sqrt(pi)*x**3), x, 0)
which returns unevaluated (though I would not call this incorrect). As a workaround, you can compute the Taylor series of this function with one term (the constant term), which gives the correct value of the limit:
series(func, x, 0, 1).removeO()
returns -sqrt(2)/(12*sqrt(pi)).
As in calculus practice, L'Hopital's rule is inferior to power series techniques when it comes to algorithmic computations, and SymPy relies primarily on the latter. The algorithm it uses is devised and explained in On Computing Limits in a Symbolic Manipulation System by Dominik Gruntz.

Fitting a distribution to data: how to penalize "bad" parameter estimates?

I'm using using scipy's least-squares optimization to fit an exponentially-modified gaussian distribution to a set of reaction time measurements. In general, it works well, but sometimes, the optimization goes off the rails and chooses a crazy value for a parameter -- the resulting plot clearly doesn't fit the data very well. In general, it looks like the problems arise from floating-point precision errors -- we head off into 0 or inf or nan-land.
I'm thinking of doing two things:
Using the parameters to simultaneously fit a CDF and PDF to the data; I have formulas for both. (I'm using a kernel density estimate to approximate the PDF from the data.)
Somehow taking into account the distance from the initial parameter estimates (generated by the method of moments approach on the wikipedia page). Those estimates are far from perfect, but are pretty good and seem to steer clear of "exploding floating point" problems.
Combining the PDF and CDF fits sounds pretty straightforward; the scales of the error will even be generally the same. But getting the initial parameter fits in there: I'm not quite sure if it's even a good idea -- but if it is:
What would I do about the difference in scale? Should I normalize the parameter "error" to a percent error?
Is there a reasonable way to decide on a relative weight between the data estimation error and parameter "error"?
Are these even the right questions to be asking? Are there generally-regarded "correct" answers, or is "try some stuff until you find something that seems to work" a good approach?
One example dataset
As requested, here's a dataset for which this process isn't working very well. I know there are only a few samples and that the data don't fit the distribution well; I'm still hoping against hope that I can get a "reasonable-looking" result from optimization.
array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
MLE Update
I had a bunch of problems getting my MLE estimate to converge to a "reasonable" value, until I discovered this: If X contains at least one nan, np.sum(X) == nan when X is a numpy array but not when X is a pandas Series. So the sum of the log-likelihood was doing stupid things when the parameters started to go out of bounds.
Added a np.asarray() call and everything is great!

This should have been a comment but I run out of space.
I think a Maximum Likelihood fit is probably the most appropriate approach here. ML method is already implemented for many distributions in scipy.stats. For example, you can find the MLE of normal distribution by calling scipy.stats.norm.fit and find the MLE of exponential distribution in a similar way. Combining these two resulting MLE parameters should give you a pretty good starting parameter for Ex-Gaussian ML fit. In fact I would imaging most of your data is quite nicely Normally distributed. If that is the case, the ML parameter estimates for Normal distribution alone should give you a pretty good starting parameter.
Since Ex-Gaussian only has 3 parameters, I don't think a ML fit will be hard at all. If you could provide a dataset for which your current method doesn't work well, it will be easier to show a real example.
Alright, here you go:
>>> import scipy.special as sse
>>> import scipy.stats as sss
>>> import scipy.optimize as so
>>> from numpy import *
>>> def eg_pdf(p, x): #defines the PDF
m=p[0]
s=p[1]
l=p[2]
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
>>> xo=array([ 450., 560., 692., 730., 758., 723., 486., 596., 716.,
695., 757., 522., 535., 419., 478., 666., 637., 569.,
859., 883., 551., 652., 378., 801., 718., 479., 544.])
>>> sss.norm.fit(xo) #get the starting parameter vector form the normal MLE
(624.22222222222217, 132.23977474531389)
>>> def llh(p, f, x): #defines the negative log-likelihood function
return -sum(log(f(p,x)))
>>> so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo)) #yeah, the data is not good
Warning: Maximum number of function evaluations has been exceeded.
array([ 6.14003407e+02, 1.31843250e+02, 9.79425845e-02])
>>> przt=so.fmin(llh, array([624.22222222222217, 132.23977474531389, 1e-6]), (eg_pdf, xo), maxfun=1000) #so, we increase the number of function call uplimit
Optimization terminated successfully.
Current function value: 170.195924
Iterations: 376
Function evaluations: 681
>>> llh(array([624.22222222222217, 132.23977474531389, 1e-6]), eg_pdf, xo)
400.02921290185645
>>> llh(przt, eg_pdf, xo) #quite an improvement over the initial guess
170.19592431051217
>>> przt
array([ 6.14007039e+02, 1.31844654e+02, 9.78934519e-02])
The optimizer used here (fmin, or Nelder-Mead simplex algorithm) does not use any information from gradient and usually works much slower than the optimizer that does. It appears that the derivative of the negative log-likelihood function of Exponential Gaussian may be written in a close form easily. If so, optimizers that utilize gradient/derivative will be better and more efficient choice (such as fmin_bfgs).
The other thing to consider is parameter constrains. By definition, sigma and lambda has to be positive for Exponential Gaussian. You can use a constrained optimizer (such as fmin_l_bfgs_b). Alternatively, you can optimize for:
>>> def eg_pdf2(p, x): #defines the PDF
m=p[0]
s=exp(p[1])
l=exp(p[2])
return 0.5*l*exp(0.5*l*(2*m+l*s*s-2*x))*sse.erfc((m+l*s*s-x)/(sqrt(2)*s))
Due to the functional invariance property of MLE, the MLE of this function should be the same as same as the original eg_pdf. There are other transformation that you can use, besides exp(), to project (-inf, +inf) to (0, +inf).
And you can also consider http://en.wikipedia.org/wiki/Lagrange_multiplier.

complex ODE systems in scipy

I am having trouble sovling the optical bloch equation, which is a first order ODE system with complex values. I have found scipy may solve such system, but their webpage offers too little information and I can hardly understand it.
I have 8 coupled first order ODEs, and I should generate a function like:
def derv(y):
compute the time dervative of elements in y
return answers as an array
then do complex_ode(derv）
My questions are:
my y is not a list but a matrix, how can i give a corrent output
fits into complex_ode()?
complex_ode() needs a jacobian, I have no idea how to start constructing one
and what type it should be?
Where should I put the initial conditions like in the normal ode and
time linspace?
this is scipy's complex_ode link:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.complex_ode.html
Could anyone provide me with more infomation so that I can learn a bit more.

I think we can at least point you in the right direction. The optical
bloch equation is a problem which is well understood in the scientific
community, although not by me :-), so there are already solutions on the internet
to this particular problem.
http://massey.dur.ac.uk/jdp/code.html
However, to address your needs, you spoke of using complex_ode, which I suppose
is fine, but I think just plain scipy.integrate.ode will work just fine as well
according to their documentation:
from scipy import eye
from scipy.integrate import ode
y0, t0 = [1.0j, 2.0], 0
def f(t, y, arg1):
return [1j*arg1*y[0] + y[1], -arg1*y[1]**2]
def jac(t, y, arg1):
return [[1j*arg1, 1], [0, -arg1*2*y[1]]]
r = ode(f, jac).set_integrator('zvode', method='bdf', with_jacobian=True)
r.set_initial_value(y0, t0).set_f_params(2.0).set_jac_params(2.0)
t1 = 10
dt = 1
while r.successful() and r.t < t1:
r.integrate(r.t+dt)
print r.t, r.y
You also have the added benefit of an older more established and better
documented function. I am surprised you have 8 and not 9 coupled ODE's, but I'm
sure you understand this better than I. Yes, you are correct, your function
should be of the form ydot = f(t,y), which you call def derv() but you're
going to need to make sure your function takes at least two parameters
like derv(t,y). If your y is in matrix, no problem! Just "reshape" it in
the derv(t,y) function like so:
Y = numpy.reshape(y,(num_rows,num_cols));
As long as num_rows*num_cols = 8, your number of ODE's you should be fine. Then
use the matrix in your computations. When you're all done, just be sure to return
a vector and not a matrix like:
out = numpy.reshape(Y,(8,1));
The Jacobian is not required, but it will likely allow the computation to proceed
much more quickly. If you do not know how to compute this you may want to consult
wikipedia or a calculus text book. It's pretty simple, but can be time consuming.
As far as initial conditions, you should probably already know what those should
be, whether it's complex or real valued. As long as you select values that are
within reason, it shouldn't matter much.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.