Most pythonic way to fit multiple gaussians using scipy.optimize

Most pythonic way to fit multiple gaussians using scipy.optimize - python

from scipy.optimize import curve_fit
def func(x, a, b):
return a * np.exp(-b * x)
xdata = np.linspace(0, 4, 50)
ydata = np.linspace(0, 4, 50)
popt, pcov = curve_fit(func, xdata, ydata)
It is quite easy to fit an arbitrary Gaussian in python with something like the above method. However, I would like to prepare a function that always the user to select an arbitrary number of Gaussians and still attempt to find a best fit.
I'm trying to figure out how to modify the function func so that I can pass it an additional parameter n=2 for instance and it would return a function that would try and fit 2 Gaussians, akin to:
from scipy.optimize import curve_fit
def func2(x, a, b, d, e):
return (a * np.exp(-b * x) + c * np.exp(-d * x))
xdata = np.linspace(0, 4, 50)
ydata = np.linspace(0, 4, 50)
popt, pcov = curve_fit(func2, xdata, ydata)
Without having to hard code this additional cases, so that we could instead pass a single function like func(...,n=2) and get the same result as above. I'm having trouble finding an elegant solution to this. My guess is that the best solution will be something using a lambda function.

You can define the function to take a variable number of arguments using def func(x, *args). The *args variable then contains something like (a1, b1, a2, b2, a3, ...). It would be possible to loop over those and sum the Gaussians but I'm showing a vectorized solution instead.
Since curve_fit can no longer determine the number of parameters from this function, you can provide an initial guess that determines the amount of Gaussians you want to fit. Each Gaussian requires two parameters, so [1, 1]*n produces a parameter vector of the correct length.
from scipy.optimize import curve_fit
import numpy as np
def func(x, *args):
x = x.reshape(-1, 1)
a = np.array(args[0::2]).reshape(1, -1)
b = np.array(args[1::2]).reshape(1, -1)
return np.sum(a * np.exp(-b * x), axis=1)
n = 3
xdata = np.linspace(0, 4, 50)
ydata = np.linspace(0, 4, 50)
popt, pcov = curve_fit(func, xdata, ydata, p0=[1, 1] * n)

Related

How to determine unknown parameters of a differential equation based on the best fit to a data set in Python?

I am trying to fit different differential equations to a given data set with python. For this reason, I use the scipy package, respectively the solve_ivp function.
This works fine for me, as long as I have a rough estimate of the parameters (b= 0.005) included in the differential equations, e.g:
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp
import numpy as np
def f(x, y, b):
dydx= [-b[0] * y[0]]
return dydx
xspan= np.linspace(1, 500, 25)
yinit= [5]
b= [0.005]
sol= solve_ivp(lambda x, y: f(x, y, b),
[xspan[0], xspan[-1]], yinit, t_eval= xspan)
print(sol)
print("\n")
print(sol.t)
print(sol.y)
plt.plot(sol.t, sol.y[0], "b--")
However, what I like to achieve is, that the parameter b (or more parameters) is/are determined "automatically" based on the best fit of the solved differential equation to a given data set (x and y). Is there a way this can be done, for example by combining this example with the curve_fit function of scipy and how would this look?
Thank you in advance!

Yes, what you think about should work, it should be easy to plug together. You want to call
popt, pcov = scipy.optimize.curve_fit(curve, xdata, ydata, p0=[b0])
b = popt[0]
where you now have to define a function curve(x,*p) that transforms any list of point into a list of values according to the only parameter b.
def curve(x,b):
res = solve_ivp(odefun, [1,500], [5], t_eval=x, args = [b])
return res.y[0]
Add optional arguments for error tolerances as necessary.
To make this more realistic, make also the initial point a parameter. Then it also becomes more obvious where a list is expected and where single arguments. To get a proper fitting task add some random noise to the test data. Also make the fall to zero not so fast, so that the final plot still looks somewhat interesting.
from scipy.integrate import solve_ivp
from scipy.optimize import curve_fit
xmin,xmax = 1,500
def f(t, y, b):
dydt= -b * y
return dydt
def curve(t, b, y0):
sol= solve_ivp(lambda t, y: f(t, y, b),
[xmin, xmax], [y0], t_eval= t)
return sol.y[0]
xdata = np.linspace(xmin, xmax, 25)
ydata = np.exp(-0.02*xdata)+0.02*np.random.randn(*xdata.shape)
y0 = 5
b= 0.005
p0 = [b,y0]
popt, pcov = curve_fit(curve, xdata, ydata, p0=p0)
b, y0 = popt
print(f"b={b}, y0 = {y0}")
This returns
b=0.019975693539459473, y0 = 0.9757709108115179
Now plot the test data against the fitted curve

How to fit the following function using curve_fit

I have the following function I need to solve:
np.exp((1-Y)/Y) = np.exp(c) -b*x
I defined the function as:
def function(x, b, c):
np.exp((1-Y)/Y) = np.exp(c) -b*x
return y
def function_solve(y, b, c):
x = (np.exp(c)-np.exp((1-Y)/Y))/b
return x
then I used:
x_data = [4, 6, 8, 10]
y_data = [0.86, 0.73, 0.53, 0.3]
popt, pcov = curve_fit(function, x_data, y_data,(28.14,-0.25))
answer = function_solve(0.5, popt[0], popt[1])
I tried running the code and the error was:
can't assign to function call
The function I'm trying to solve is y = 1/ c*exp(-b*x) in the linear form. I have bunch of y_data and x_data, I want to get optimal values for c and b.

There are two problems that jump at me:
ln((1-Y)/Y) = ln(c) -b*x this is not valid Python code. On the left side you must have a name, whereas here you have a function call ln(..), hence the error.
ln() is not a Python function in the standard library. There is a math.log() function. Unless you defined ln() somewhere else, it will not work.

Some problems with your code have already been pointed out. Here is a solution:
First, you need to get the correct logarithmic expression of your original function:
y = 1 / (c * exp(-b * x))
y = exp(b * x) / c
ln(y) = b * x + ln(1/c)
ln(y) = b * x - ln(c)
If you want to use that in curve_fit, you need to define your function as follows:
def f_log(x, b, c_ln):
return b * x - c_ln
I now show you the outcome for some randomly generated data (using b = 0.08 and c = 100.5) using the original function and then also the output for the data you provided:
[ 8.17260899e-02 1.17566291e+02]
As you can see the fitted values are close to the original ones and the fit describes the data very well.
For your data it looks as follows:
[-0.094 -1.263]
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def f(x, b, c):
return 1. / (c * np.exp(-b * x))
def f_log(x, b, c_ln):
return b * x - c_ln
# some random data
b_org = 0.08
c_org = 100.5
x_data = np.linspace(0.01, 100., 50)
y_data = f(x_data, b_org, c_org) + np.random.normal(0, 0.5, len(x_data))
# fit the data
popt, pcov = curve_fit(f, x_data, y_data, p0=(0.1, 50))
print popt
# plot the data
xnew = np.linspace(0.01, 100., 5000)
plt.plot(x_data, y_data, 'bo')
plt.plot(xnew, f(xnew, *popt), 'r')
plt.show()
# your data
x_data = np.array([4, 6, 8, 10])
y_data = np.array([0.86, 0.73, 0.53, 0.3])
# fit the data
popt_log, pcov_log = curve_fit(f_log, x_data, y_data)
print popt_log
# plot the data
xnew = np.linspace(4, 10., 500)
plt.plot(x_data, y_data, 'bo')
plt.plot(xnew, f_log(xnew, *popt_log), 'r')
plt.show()

Your problem is in defining function():
def function(x, b, c):
ln((1-Y)/Y) = ln(c) -b*x
return y
You're trying to assign
ln(c) - b*x
to the call of another function, ln(), rather than a variable. Instead, solve the function for a variable (of the function) so it can be stored in a python variable.

Using scipy.optimize.curve_fit with weights

According to the documentation, the argument sigma can be used to set the weights of the data points in the fit. These "describe" 1-sigma errors when the argument absolute_sigma=True.
I have some data with artificial normally-distributed noise which varies:
n = 200
x = np.linspace(1, 20, n)
x0, A, alpha = 12, 3, 3
def f(x, x0, A, alpha):
return A * np.exp(-((x-x0)/alpha)**2)
noise_sigma = x/20
noise = np.random.randn(n) * noise_sigma
yexact = f(x, x0, A, alpha)
y = yexact + noise
If I want to fit the noisy y to f using curve_fit to what should I set sigma? The documentation isn't very specific here, but I would usually use 1/noise_sigma**2 as the weight:
p0 = 10, 4, 2
popt, pcov = curve_fit(f, x, y, p0)
popt2, pcov2 = curve_fit(f, x, y, p0, sigma=1/noise_sigma**2, absolute_sigma=True)
It doesn't seem to improve the fit much, though.
Is this option only used to better interpret the fit uncertainties through the covariance matrix? What is the difference between these two telling me?
In [249]: pcov
Out[249]:
array([[ 1.10205238e-02, -3.91494024e-08, 8.81822412e-08],
[ -3.91494024e-08, 1.52660426e-02, -1.05907265e-02],
[ 8.81822412e-08, -1.05907265e-02, 2.20414887e-02]])
In [250]: pcov2
Out[250]:
array([[ 0.26584674, -0.01836064, -0.17867193],
[-0.01836064, 0.27833 , -0.1459469 ],
[-0.17867193, -0.1459469 , 0.38659059]])

At least with scipy version 1.1.0 the parameter sigma should be equal to the error on each parameter. Specifically the documentation says:
A 1-d sigma should contain values of standard deviations of errors in
ydata. In this case, the optimized function is chisq = sum((r / sigma)
** 2).
In your case that would be:
curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=True)
I looked through the source code and verified that when you specify sigma this way it minimizes ((f-data)/sigma)**2.
As a side note, this is in general what you want to be minimizing when you know the errors. The likelihood of observing points data given a model f is given by:
L(data|x0,A,alpha) = product over i Gaus(data_i, mean=f(x_i,x0,A,alpha), sigma=sigma_i)
which if you take the negative log becomes (up to constant factors that don't depend on the parameters):
-log(L) = sum over i (f(x_i,x0,A,alpha)-data_i)**2/(sigma_i**2)
which is just the chisquare.
I wrote a test program to verify that curve_fit was indeed returning the correct values with the sigma specified correctly:
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit, fmin
np.random.seed(0)
def make_chi2(x, data, sigma):
def chi2(args):
x0, A, alpha = args
return np.sum(((f(x,x0,A,alpha)-data)/sigma)**2)
return chi2
n = 200
x = np.linspace(1, 20, n)
x0, A, alpha = 12, 3, 3
def f(x, x0, A, alpha):
return A * np.exp(-((x-x0)/alpha)**2)
noise_sigma = x/20
noise = np.random.randn(n) * noise_sigma
yexact = f(x, x0, A, alpha)
y = yexact + noise
p0 = 10, 4, 2
# curve_fit without parameters (sigma is implicitly equal to one)
popt, pcov = curve_fit(f, x, y, p0)
# curve_fit with wrong sigma specified
popt2, pcov2 = curve_fit(f, x, y, p0, sigma=1/noise_sigma**2, absolute_sigma=True)
# curve_fit with correct sigma
popt3, pcov3 = curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=True)
chi2 = make_chi2(x,y,noise_sigma)
# double checking that we get the correct answer
xopt = fmin(chi2,p0,xtol=1e-10,ftol=1e-10)
print("popt = %s, chi2 = %.2f" % (popt,chi2(popt)))
print("popt2 = %s, chi2 = %.2f" % (popt2, chi2(popt2)))
print("popt3 = %s, chi2 = %.2f" % (popt3, chi2(popt3)))
print("xopt = %s, chi2 = %.2f" % (xopt, chi2(xopt)))
which outputs:
popt = [ 11.93617403 3.30528488 2.86314641], chi2 = 200.66
popt2 = [ 11.94169083 3.30372955 2.86207253], chi2 = 200.64
popt3 = [ 11.93128545 3.333727 2.81403324], chi2 = 200.44
xopt = [ 11.93128603 3.33373094 2.81402741], chi2 = 200.44
As you can see the chi2 is indeed minimized correctly when you specify sigma=sigma as an argument to curve_fit.
As to why the improvement isn't "better", I'm not really sure. My only guess is that without specifying a sigma value you implicitly assume they are equal and over the part of the data where the fit matters (the peak), the errors are "approximately" equal.
To answer your second question, no the sigma option is not only used to change the output of the covariance matrix, it actually changes what is being minimized.

Pass tuple as input argument for scipy.optimize.curve_fit

I have the following code:
import numpy as np
from scipy.optimize import curve_fit
def func(x, p): return p[0] + p[1] + x
popt, pcov = curve_fit(func, np.arange(10), np.arange(10), p0=(0, 0))
It will raise TypeError: func() takes exactly 2 arguments (3 given). Well, that sounds fair - curve_fit unpact the (0, 0) to be two scalar inputs. So I tried this:
popt, pcov = curve_fit(func, np.arange(10), np.arange(10), p0=((0, 0),))
Again, it said: ValueError: object too deep for desired array
If I left it as default (not specifying p0):
popt, pcov = curve_fit(func, np.arange(10), np.arange(10))
It will raise IndexError: invalid index to scalar variable. Obviously, it only gave the function a scalar for p.
I can make def func(x, p1, p2): return p1 + p2 + x to get it working, but with more complicated situations the code is going to look verbose and messy. I'd really love it if there's a cleaner solution to this problem.
Thanks!

Not sure if this is cleaner, but at least it is easier now to add more parameters to the fitting function. Maybe one could even make an even better solution out of this.
import numpy as np
from scipy.optimize import curve_fit
def func(x, p): return p[0] + p[1] * x
def func2(*args):
return func(args[0],args[1:])
popt, pcov = curve_fit(func2, np.arange(10), np.arange(10), p0=(0, 0))
print popt,pcov
EDIT: This works for me
import numpy as np
from scipy.optimize import curve_fit
def func(x, *p): return p[0] + p[1] * x
popt, pcov = curve_fit(func, np.arange(10), np.arange(10), p0=(0, 0))
print popt,pcov

Problem
When using curve_fit you must explicitly say the number of fit parameters. Doing something like:
def f(x, *p):
return sum( [p[i]*x**i for i in range(len(p))] )
would be great, since it would be a general nth-order polynomial fitting function, but unfortunately, in my SciPy 0.12.0, it raises:
ValueError: Unable to determine number of fit parameters.
Solution
So you should do:
def f_1(x, p0, p1):
return p0 + p1*x
def f_2(x, p0, p1, p2):
return p0 + p1*x + p2*x**2
and so forth...
Then you can call using the p0 argument:
curve_fit(f_1, xdata, ydata, p0=(0,0))

scipy.optimize.curve_fit
scipy.optimize.curve_fit(f, xdata, ydata, p0=None, sigma=None, **kw)
Use non-linear least squares to fit a function, f, to data.
Assumes ydata = f(xdata, *params) + eps
Explaining the idea
The function to be fitted should take only scalars (not: *p0).
Remember that the result of the fit depends on the initialization parameters.
Working example
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
def func(x, a0, a1):
return a0 + a1 * x
x, y = np.arange(10), np.arange(10) + np.random.randn(10)/10
popt, pcov = curve_fit(func, x, y, p0=(1, 1))
# Plot the results
plt.title('Fit parameters:\n a0=%.2e a1=%.2e' % (popt[0], popt[1]))
# Data
plt.plot(x, y, 'rx')
# Fitted function
x_fine = np.linspace(x[0], x[-1], 100)
plt.plot(x_fine, func(x_fine, popt[0], popt[1]), 'b-')
plt.savefig('Linear_fit.png')
plt.show()

You can define functions that return other functions (see Passing additional arguments using scipy.optimize.curve_fit? )
Working example :
import numpy as np
import random
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
import math
def funToFit(x):
return 0.5+2*x-3*x*x+0.2*x*x*x+0.1*x*x*x*x
xx=[random.uniform(1,5) for i in range(30)]
yy=[funToFit(xx[i])+random.uniform(-1,1) for i in range(len(xx))]
a=np.zeros(5)
def make_func(numarg):
def func(x,*a):
ng=numarg
v=0
for i in range(ng):
v+=a[i]*np.power(x,i)
return v
return func
leastsq, covar = curve_fit(make_func(len(a)),xx,yy,tuple(a))
print leastsq
def fFited(x):
v=0
for i in range(len(leastsq)):
v+=leastsq[i]*np.power(x,i)
return v
xfine=np.linspace(1,5,200)
plt.plot(xx,yy,".")
plt.plot(xfine,fFited(xfine))
plt.show()

This is an old thread now, but I also just ran into this issue. Building on Emile Maras' solution, but expanding the function to to return either the nth order polynomial fitting function for curve_fit, or the y values based on fit results. This facilitates plotting and residual calculations. Here is an example that fits data to progressively higher order polynomials and plots the results and residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def funToFit(x):
return 0.5 + 2*x -3*x**2 + 0.2*x**3 + 0.1*x**4
x=np.arange(30)
y=funToFit(x)+np.random.normal(0,5000,x.size)
def polyfun(order=0,x=np.arange(0),P=np.arange(0)):
if P.size>0 and x.size>0:
y=0
for i in range(P.size):
y+=P[i]*np.power(x,i)
return y
elif order>0:
def fitfun(x,*a):
y=0
for i in range(order+1):
y+=a[i]*np.power(x,i)
return y
return fitfun
else:
raise Exception("Either order or x and P must be provided")
plt.figure("fits")
plt.plot(x,y,color="black")
for i in range(4):
order = i+1
[fit,covar] = curve_fit(polyfun(order=order),x,y,p0=(1,)*(order+1))
yfit = polyfun(x=x,P=fit)
res = yfit-y
plt.figure("fits")
plt.plot(x,yfit)
plt.figure("res")
plt.plot(x,res)

Python - calculating trendlines with errors

So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety)
trendx = []
trendy = []
for a in range(datasetx[0], (datasetx[-1]+1)):
trendx.append(a)
trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the errorbars, but what I don't know is using this, how to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?

I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata
y_sigma = 0.2 * y * np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
Then something else, about using numpy arrays. One of the main advantages of using numpy is that you can avoid for loops because operations on arrays apply elementwise. So the for-loop in your example can also be done as following:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Where I use arange instead of range as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)

I have not been able to find any way of getting the errors in the coefficients that is built in to numpy or python. I have a simple tool that I wrote based on Section 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
'''
Purpose
-------
Perform least squares using the procedure outlined in 8.5 and 8.6 of Taylor, solving
matrix equation X a = Y
Examples
--------
>>> from scipy import matrix
>>> xMat = matrix([[ 1, 5, 25],
[ 1, 7, 49],
[ 1, 9, 81],
[ 1, 11, 121]])
>>> # matrix has rows of format [constant, x, x^2]
>>> yMat = matrix([[142],
[168],
[211],
[251]])
>>> a, varCoef, yRes = leastSquares(xMat, yMat)
>>> # a is a column matrix, holding the three coefficients a, b, c, corresponding to
>>> # the equation a + b*x + c*x^2
Returns
-------
a: matrix
best fit coefficients
varCoef: matrix
variance of derived coefficents
yRes: matrix
y-residuals of fit
'''
xMatSize = xMat.shape
numMeas = xMatSize[0]
numVars = xMatSize[1]
xxMat = xMat.T * xMat
xyMat = xMat.T * yMat
xxMatI = xxMat.I
aMat = xxMatI * xyMat
yAvgMat = xMat * aMat
yRes = yMat - yAvgMat
var = (yRes.T * yRes) / (numMeas - numVars)
varCoef = xxMatI.diagonal() * var[0, 0]
return aMat, varCoef, yRes

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Most pythonic way to fit multiple gaussians using scipy.optimize - python

Related

How to determine unknown parameters of a differential equation based on the best fit to a data set in Python?

How to fit the following function using curve_fit

Using scipy.optimize.curve_fit with weights

Pass tuple as input argument for scipy.optimize.curve_fit

Python - calculating trendlines with errors

Categories

Resources