According to the documentation, the argument sigma can be used to set the weights of the data points in the fit. These "describe" 1-sigma errors when the argument absolute_sigma=True.
I have some data with artificial normally-distributed noise which varies:
import numpy as np
from scipy.optimize import curve_fit

n = 200
x = np.linspace(1, 20, n)
x0, A, alpha = 12, 3, 3
def f(x, x0, A, alpha):
    return A * np.exp(-((x-x0)/alpha)**2)
noise_sigma = x/20
noise = np.random.randn(n) * noise_sigma
yexact = f(x, x0, A, alpha)
y = yexact + noise
If I want to fit the noisy y to f using curve_fit, what should I set sigma to? The documentation isn't very specific here, but I would usually use 1/noise_sigma**2 as the weight:
p0 = 10, 4, 2
popt, pcov = curve_fit(f, x, y, p0)
popt2, pcov2 = curve_fit(f, x, y, p0, sigma=1/noise_sigma**2, absolute_sigma=True)
It doesn't seem to improve the fit much, though.
Is this option only used to better interpret the fit uncertainties through the covariance matrix? What is the difference between these two covariance matrices telling me?
In [249]: pcov
Out[249]:
array([[ 1.10205238e-02, -3.91494024e-08, 8.81822412e-08],
[ -3.91494024e-08, 1.52660426e-02, -1.05907265e-02],
[ 8.81822412e-08, -1.05907265e-02, 2.20414887e-02]])
In [250]: pcov2
Out[250]:
array([[ 0.26584674, -0.01836064, -0.17867193],
[-0.01836064, 0.27833 , -0.1459469 ],
[-0.17867193, -0.1459469 , 0.38659059]])
At least with scipy version 1.1.0, the parameter sigma should be equal to the error on each data point. Specifically the documentation says:
A 1-d sigma should contain values of standard deviations of errors in ydata. In this case, the optimized function is chisq = sum((r / sigma) ** 2).
In your case that would be:
curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=True)
I looked through the source code and verified that when you specify sigma this way it minimizes ((f-data)/sigma)**2.
As a side note, this is in general what you want to be minimizing when you know the errors. The likelihood of observing the data points given a model f is given by:
L(data|x0,A,alpha) = product over i Gaus(data_i, mean=f(x_i,x0,A,alpha), sigma=sigma_i)
which if you take the negative log becomes (up to constant factors that don't depend on the parameters):
-log(L) = sum over i (f(x_i,x0,A,alpha)-data_i)**2/(sigma_i**2)
which is just the chisquare.
I wrote a test program to verify that curve_fit was indeed returning the correct values with the sigma specified correctly:
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit, fmin
np.random.seed(0)
def make_chi2(x, data, sigma):
    def chi2(args):
        x0, A, alpha = args
        return np.sum(((f(x,x0,A,alpha)-data)/sigma)**2)
    return chi2
n = 200
x = np.linspace(1, 20, n)
x0, A, alpha = 12, 3, 3
def f(x, x0, A, alpha):
    return A * np.exp(-((x-x0)/alpha)**2)
noise_sigma = x/20
noise = np.random.randn(n) * noise_sigma
yexact = f(x, x0, A, alpha)
y = yexact + noise
p0 = 10, 4, 2
# curve_fit without parameters (sigma is implicitly equal to one)
popt, pcov = curve_fit(f, x, y, p0)
# curve_fit with wrong sigma specified
popt2, pcov2 = curve_fit(f, x, y, p0, sigma=1/noise_sigma**2, absolute_sigma=True)
# curve_fit with correct sigma
popt3, pcov3 = curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=True)
chi2 = make_chi2(x,y,noise_sigma)
# double checking that we get the correct answer
xopt = fmin(chi2,p0,xtol=1e-10,ftol=1e-10)
print("popt = %s, chi2 = %.2f" % (popt,chi2(popt)))
print("popt2 = %s, chi2 = %.2f" % (popt2, chi2(popt2)))
print("popt3 = %s, chi2 = %.2f" % (popt3, chi2(popt3)))
print("xopt = %s, chi2 = %.2f" % (xopt, chi2(xopt)))
which outputs:
popt = [ 11.93617403 3.30528488 2.86314641], chi2 = 200.66
popt2 = [ 11.94169083 3.30372955 2.86207253], chi2 = 200.64
popt3 = [ 11.93128545 3.333727 2.81403324], chi2 = 200.44
xopt = [ 11.93128603 3.33373094 2.81402741], chi2 = 200.44
As you can see, the chi2 is indeed minimized correctly when you pass sigma=noise_sigma to curve_fit.
As to why the improvement isn't "better", I'm not really sure. My only guess is that without specifying a sigma value you implicitly assume the errors are all equal, and over the part of the data where the fit matters (the peak) the errors are "approximately" equal anyway.
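As a quick check of that guess (my own addition, reusing the arrays defined above), you can look at how much the noise level actually varies within one alpha of the peak:
peak_region = (x > x0 - alpha) & (x < x0 + alpha)
print(noise_sigma[peak_region].min(), noise_sigma[peak_region].max())
With x0 = 12 and alpha = 3 this prints roughly 0.45 and 0.75, so the assumed-equal weights are off by less than a factor of two where the Gaussian is significantly non-zero.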
To answer your second question: no, the sigma option is not only used to change the output of the covariance matrix, it actually changes what is being minimized.
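If you only care about how the covariance matrix is reported, the relevant switch is absolute_sigma. A minimal sketch (my own illustration, reusing the arrays above): with absolute_sigma=True the sigma values are treated as absolute 1-sigma errors, while with absolute_sigma=False only the relative weights matter and pcov is rescaled by the reduced chi-square; the best-fit parameters are identical either way.
popt_abs, pcov_abs = curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=True)
popt_rel, pcov_rel = curve_fit(f, x, y, p0, sigma=noise_sigma, absolute_sigma=False)
print(np.allclose(popt_abs, popt_rel))        # True: same minimizer either way
print(np.diag(pcov_rel) / np.diag(pcov_abs))  # ~ chi2/(n-3), the reduced chi-square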
How can I fit the differential function of the following scipy tutorial, the Scipy Differential Equation Tutorial?
In the end I want to fit some data points that follow a set of two differential equations with six parameters in total, but I'd like to start with an easy example. So far I have tried the functions scipy.optimize.curve_fit and scipy.optimize.leastsq, but I did not get anywhere.
This is how far I've come:
import numpy as np
import scipy.optimize as scopt
import scipy.integrate as scint
def pend(y, t, b, c):
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt

def test_pend(y, t, b, c):
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt
b = 0.25
c = 5.0
y0 = [np.pi - 0.1, 0.0]
guess = [0.5, 4]
t = np.linspace(0, 1, 11)
sol = scint.odeint(pend, y0, t, args=(b, c))
popt, pcov = scopt.curve_fit(test_pend, guess, t, sol)
with the following error message:
ValueError: too many values to unpack (expected 2)
I'm sorry, as this is presumably a pretty simple question, but I can't get it to work. Thanks in advance.
You need to provide a function f(t,b,c) that given an argument or a list of arguments in t returns the value of the function at the argument(s). This requires some work, either by determining the type of t or by using a construct that works either way:
def f(t, b, c):
    tspan = np.hstack([[0], np.hstack([t])])
    return scint.odeint(pend, y0, tspan, args=(b,c))[1:,0]
popt, pcov = scopt.curve_fit(f, t, sol[:,0], p0=guess)
which returns popt = array([ 0.25, 5. ]).
This can be extended to fit even more parameters,
def f(t, a0, a1, b, c):
    tspan = np.hstack([[0], np.hstack([t])])
    return scint.odeint(pend, [a0,a1], tspan, args=(b,c))[1:,0]

popt, pcov = scopt.curve_fit(f, t, sol[:,0], p0=y0+guess)  # the extended fit needs a four-element initial guess
which results in popt = [ 3.04159267e+00, -2.38543640e-07, 2.49993362e-01, 4.99998795e+00].
Another possibility is to explicitly compute the square norm of the differences to the target solution and apply minimization to the so-defined scalar function.
def f(param):
    b, c = param
    t_sol = scint.odeint(pend, y0, t, args=(b,c))
    return np.linalg.norm(t_sol[:,0]-sol[:,0])
res = scopt.minimize(f, np.array(guess))
which returns the following in res:
fun: 1.572327981969186e-08
hess_inv: array([[ 0.00031325, 0.00033478],
[ 0.00033478, 0.00035841]])
jac: array([ 0.06129361, -0.04859557])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 518
nit: 27
njev: 127
status: 2
success: False
x: array([ 0.24999905, 4.99999884])
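Despite success: False and the precision-loss message, the returned x is essentially the exact (b, c); the warning comes from the gradient-based default method. If it bothers you, one option (my suggestion, not part of the original answer) is a derivative-free method:
res = scopt.minimize(f, np.array(guess), method='Nelder-Mead')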
As an example, here's my code for fitting multiexponential decays with a monoexponential decay (this doesn't produce a good fit, but it still works as an example):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit
lifetimes = [1e-9, 2e-9, 4e-9, 8e-9]
amplitudes = [1000, 1000, 1000, 1000]
background = 10
t = np.arange(1024)*1e-10
y = np.zeros(len(t))
for i in range(len(lifetimes)):
    y += amplitudes[i] * np.exp(-t/lifetimes[i])
y += np.random.poisson(background, len(y))

def fit_fun(t, amplitude, lifetime, background):
    return amplitude * np.exp(-t/lifetime) + background

def loss_fun(params, x, y, fit_fun, c=5):
    fit_y = fit_fun(x, *params)
    residuals = y - fit_y
    loss = np.sum(residuals**2)
    return loss
p0 = [1000, 6e-9, 10]
result = minimize(loss_fun, p0, args=(t, y, fit_fun))
params_minimize = result.x
minimize_y = fit_fun(t, *params_minimize)
params_fit, _ = curve_fit(fit_fun, t, y, p0)
fit_y = fit_fun(t, *params_fit)
plt.semilogy(t, y)
plt.semilogy(t, minimize_y)
plt.semilogy(t, fit_y)
plt.ylim([1, plt.ylim()[1]])
plt.show()
Here are the resulting fits (Green was fitted with curve_fit and orange with minimize).
So, why doesn't using minimize work properly?
Also, the reason I'm doing this is that I want to implement a loss function other than least squares. If it isn't possible this way, how could I do that?
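For what it's worth, a loss other than least squares is just a different body for the function you hand to minimize. As a rough sketch (my own illustration, not an answer to why minimize struggles here; it assumes the counts-like y from the snippet above), a Poisson negative log-likelihood could look like this:
def poisson_nll(params, x, y, fit_fun):
    fit_y = fit_fun(x, *params)
    fit_y = np.clip(fit_y, 1e-12, None)  # keep the log well-defined
    # Poisson negative log-likelihood, dropping the params-independent log(y!) term
    return np.sum(fit_y - y * np.log(fit_y))

result = minimize(poisson_nll, p0, args=(t, y, fit_fun))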
I use this code to smooth data by fitting an exponential with scipy.optimize.curve_fit:
import numpy as np
from scipy.optimize import curve_fit

def smooth_data_v1(x_orig, y_orig):
    def func(x, a, b, c):
        return a*np.exp(-b*x)+c
    # Scale data
    y = y_orig / 10000.0
    x = 500.0 * x_orig
    popt, pcov = curve_fit(func, x, y, p0=(1, 0.01, 1))
    y_smooth = func(x, *popt)  # Calculate smoothed values for the same points
    # Undo scaling
    y_final = y_smooth * 10000.0
    return y_final
However, I want the estimated exponential curve to go through the first point.
Bad case:
Good case:
I have tried to remove the last parameter by using the first point (x0, y0):
def smooth_data_v2(x_orig, y_orig):
    x0 = x_orig[0]
    y0 = y_orig[0]
    def func(x, a, b):
        return a*np.exp(-b*x)+y0-a*np.exp(-b*x0)
    # Scale data
    y = y_orig / 10000.0
    x = 500.0 * x_orig
    popt, pcov = curve_fit(func, x, y, p0=(1, 0.01))
    y_smooth = func(x, *popt)  # Calculate smoothed values for the same points
    # Undo scaling
    y_final = y_smooth * 10000.0
    return y_final
However something goes wrong and I get a really large a parameter:
popt [ 4.45028144e+05 2.74698863e+01]
Any ideas?
Update:
Example of data
x_orig [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.]
y_orig [ 445057. 447635. 450213. 425089. 391746. 350725. 285433. 269027.
243835. 230587. 216757. 202927. 189097. 175267. 161437.]
Scipy curve_fit allows for passing the parameter sigma, which is designed to be the standard deviation for weighting the fit. But this array can be filled with arbitrary data:
import numpy as np
from scipy.optimize import curve_fit

def smooth_data_v1(x_arr, y_arr):
    def func(x, a, b, c):
        return a*np.exp(-b*x)+c
    # create the weighting array
    y_weight = np.empty(len(y_arr))
    # high pseudo-sd values, meaning less weighting in the fit
    y_weight.fill(10)
    # low values for point 0 and the last points, meaning more weighting during the fit procedure
    y_weight[0] = y_weight[-5:-1] = 0.1
    popt, pcov = curve_fit(func, x_arr, y_arr, p0=(y_arr[0], 1, 1), sigma=y_weight, absolute_sigma=True)
    print("a, b, c:", *popt)
    y_smooth = func(x_arr, *popt)
    return y_smooth
x_orig = np.asarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
y_orig = np.asarray([ 445057, 447635, 450213, 425089, 391746, 350725, 285433, 269027,
243835, 230587, 216757, 202927, 189097, 175267, 161437])
print(smooth_data_v1(x_orig, y_orig))
As you can see, the first and last points are now close to the original values, but this "clamping" of values sometimes comes at a price for the rest of the data points.
You have probably also noticed that I removed your rescaling part. In my opinion, one shouldn't do this before curve fitting; it is usually better to use the raw data. Additionally, your data are not really well represented by an exponential function, hence the tiny b value.
How about changing the definition of the function you try to fit?
def func(x, b, c):
    return (y_arr[0] - c)*np.exp(-b*(x - x_arr[0])) + c
This function always fits the first point perfectly by definition, but otherwise can do anything your current function does.
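For example, a minimal sketch of how this could be wired into your smoothing routine (smooth_data_v3 and the starting values p0=(0.1, y_arr[-1]) are my own, hypothetical choices):
def smooth_data_v3(x_arr, y_arr):
    def func(x, b, c):
        # passes exactly through (x_arr[0], y_arr[0]) for any b, c
        return (y_arr[0] - c)*np.exp(-b*(x - x_arr[0])) + c
    popt, pcov = curve_fit(func, x_arr, y_arr, p0=(0.1, y_arr[-1]))
    return func(x_arr, *popt)

print(smooth_data_v3(x_orig, y_orig))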
I need to interpolate data coming from an instrument using a gaussian fit. To this end I thought about using the curve_fit function from scipy.
Since I'd like to test this functionality on fake data before trying it on the instrument I wrote the following code to generate noisy gaussian data and to fit it:
from scipy.optimize import curve_fit
import numpy
import pylab
# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val
# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06
x = numpy.arange(zMinEntry, zMaxEntry, zStepEntry, dtype=numpy.float64)
n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))
# Fit
popt, pcov = curve_fit(gaussian, x, y)
# Print results
print("Scale = %.3f +/- %.3f" % (popt[0], numpy.sqrt(pcov[0, 0])))
print("Offset = %.3f +/- %.3f" % (popt[1], numpy.sqrt(pcov[1, 1])))
print("Sigma = %.3f +/- %.3f" % (popt[2], numpy.sqrt(pcov[2, 2])))
pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]))
pylab.grid(True)
pylab.show()
Unfortunately this does not work properly; the output of the code is the following:
Scale = 6174.816 +/- 7114424813.672
Offset = 429.319 +/- 3919751917.830
Sigma = 1602.869 +/- 17923909301.176
And the plotted result is (blue is the fit function, the red dots are the noisy input data):
I also tried to look at this answer, but couldn't figure out where my problem is.
Am I missing something here? Or am I using the curve_fit function in the wrong way? Thanks in advance!
I agree with Olaf in so far as it is a question of scale. The optimal parameters differ by many orders of magnitude. However, scaling the parameters with which you generated your toy data does not seem to solve the problem for your actual application. curve_fit uses leastsq, which numerically approximates the Jacobian, and numerical problems arise because of the differences in scale (try the full_output keyword in curve_fit).
In my experience it is often best to use fmin, which does not rely on approximated derivatives but uses only function values. You now have to write your own least-squares function that is to be optimized.
Starting values are still important. In your case you can make sufficiently good guesses by taking the maximum amplitude for a and the corresponding x-values for b and c.
In code, it looks like this:
from scipy.optimize import curve_fit,fmin
import numpy
import pylab
# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val
# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06
x = numpy.arange(zMinEntry, zMaxEntry, zStepEntry, dtype=numpy.float64)
n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))
print(a, meanY, sigmaY)
# estimate starting values from the data
a = y.max()
b = x[numpy.argmax(y)]
c = b
# define a least squares function to optimize
def minfunc(params):
    return sum((y-gaussian(x,params[0],params[1],params[2]))**2)
# fit
popt = fmin(minfunc,[a,b,c])
# Print results
print("Scale = %.3f" % (popt[0]))
print("Offset = %.3f" % (popt[1]))
print("Sigma = %.3f" % (popt[2]))
pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]),lw = 2)
pylab.xlim(x.min(),x.max())
pylab.grid(True)
pylab.show()
Looks like some numerical instabilities are creeping into the optimizer. Try scaling the data. With the following data:
zMinEntry = 80.0*1E-06 * 1000
zMaxEntry = 180.0*1E-06 * 1000
zStepEntry = 0.2*1E-06 * 1000
sigmaY = 10.0E-06 * 1000
I get estimates of
Scale = 39.697 +/- 0.526
Offset = 0.130 +/- 0.000
Sigma = -0.010 +/- 0.000
Compare that to the true values:
Scale = 39.894228
Offset = 0.13
Sigma = 0.01
The minus sign of sigma can of course be ignored.
This gives the following plot
As I said in a comment, if you provide a reasonable initial guess, the fit succeeds, i.e. call curve_fit like this:
popt, pcov = curve_fit(gaussian, x, y, [50000,0.00012,0.00002])
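If you'd rather not hard-code those numbers, a rough initial guess can also be derived from the data itself (a sketch of my own, assuming the x and y arrays from the question; the moment-based width estimate is only approximate for noisy data):
a_guess = y.max()
b_guess = x[numpy.argmax(y)]
# crude width estimate from the second moment, using the clipped (non-negative) data as weights
w = numpy.clip(y, 0, None)
c_guess = numpy.sqrt(numpy.sum(w * (x - b_guess)**2) / numpy.sum(w))
popt, pcov = curve_fit(gaussian, x, y, p0=[a_guess, b_guess, c_guess])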
So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety, 2)
trendx = []
trendy = []
for a in range(datasetx[0], (datasetx[-1]+1)):
    trendx.append(a)
    trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the error bars, but what I don't know is how to use this to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?
I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a*x**2 + b*x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
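In code that is simply (a one-line sketch; perr is just a name I chose):
perr = np.sqrt(np.diag(pcov))  # one standard deviation per coefficient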
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a*x**2 + b*x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata
y_sigma = 0.2 * y * np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
One more thing, about using numpy arrays: one of the main advantages of using numpy is that you can avoid for loops, because operations on arrays apply elementwise. So the for-loop in your example can also be done as follows:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Here I use arange instead of range, as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)
I have not been able to find any way of getting the errors in the coefficients that is built into numpy or Python. I have a simple tool that I wrote based on Sections 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
    '''
    Purpose
    -------
    Perform least squares using the procedure outlined in 8.5 and 8.6 of Taylor, solving
    matrix equation X a = Y

    Examples
    --------
    >>> from scipy import matrix
    >>> xMat = matrix([[ 1,  5,  25],
                       [ 1,  7,  49],
                       [ 1,  9,  81],
                       [ 1, 11, 121]])
    >>> # matrix has rows of format [constant, x, x^2]
    >>> yMat = matrix([[142],
                       [168],
                       [211],
                       [251]])
    >>> a, varCoef, yRes = leastSquares(xMat, yMat)
    >>> # a is a column matrix, holding the three coefficients a, b, c, corresponding to
    >>> # the equation a + b*x + c*x^2

    Returns
    -------
    a: matrix
        best fit coefficients
    varCoef: matrix
        variance of derived coefficients
    yRes: matrix
        y-residuals of fit
    '''
    xMatSize = xMat.shape
    numMeas = xMatSize[0]
    numVars = xMatSize[1]

    xxMat = xMat.T * xMat
    xyMat = xMat.T * yMat
    xxMatI = xxMat.I

    aMat = xxMatI * xyMat
    yAvgMat = xMat * aMat
    yRes = yMat - yAvgMat

    var = (yRes.T * yRes) / (numMeas - numVars)
    varCoef = xxMatI.diagonal() * var[0, 0]

    return aMat, varCoef, yRes
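To connect this back to the quadratic trendline from the question, a usage sketch could look like this (the column-building step is my own illustration; it assumes datasetx and datasety are array-like):
import numpy as np

xMat = np.matrix(np.column_stack([np.ones(len(datasetx)), datasetx, np.square(datasetx)]))
yMat = np.matrix(datasety).T
a, varCoef, yRes = leastSquares(xMat, yMat)
coefErrs = np.sqrt(varCoef)  # standard deviations of the fitted coefficients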