Why don't these curve fit results match? - python

I'm attempting to estimate a decay rate using an exponential fit, but I'm puzzled by why the two methods don't give the same result.
In the first case, taking the log of the data to linearize the problem matches Excel's exponential trendline fit. I had expected that fitting the exponential directly would be the same.
import numpy as np
from scipy.optimize import curve_fit
def exp_func(x, a, b):
    return a * np.exp(-b * x)
def lin_func(x, m, b):
    return m*x + b
xdata = [1065.0, 1080.0, 1095.0, 1110.0, 1125.0, 1140.0, 1155.0, 1170.0, 1185.0, 1200.0, 1215.0, 1230.0, 1245.0, 1260.0, 1275.0, 1290.0, 1305.0, 1320.0, 1335.0, 1350.0, 1365.0, 1380.0, 1395.0, 1410.0, 1425.0, 1440.0, 1455.0, 1470.0, 1485.0, 1500.0]
ydata = [21.3934, 17.14985, 11.2703, 13.284, 12.28465, 12.46925, 12.6315, 12.1292, 10.32762, 8.509195, 14.5393, 12.02665, 10.9383, 11.23325, 6.03988, 9.34904, 8.08941, 6.847, 5.938535, 6.792715, 5.520765, 6.16601, 5.71889, 4.949725, 7.62808, 5.5079, 3.049625, 4.8566, 3.26551, 3.50161]
xdata = np.array(xdata)
xdata = xdata - xdata.min() + 1
ydata = np.array(ydata)
lydata = np.log(ydata)
lopt, lcov = curve_fit(lin_func, xdata, lydata)
elopt = [np.exp(lopt[1]),-lopt[0]]
eopt, ecov = curve_fit(exp_func, xdata, ydata, p0=elopt)
print('elopt: {},{}'.format(*elopt))
print('eopt: {},{}'.format(*eopt))
elopt: 17.2526204283,0.00343624199064
eopt: 17.1516384575,0.00330590568338

You're solving two different optimization problems. curve_fit() assumes that the noise eps_i is additive (and roughly Gaussian); otherwise it won't deliver optimal results.
Assuming that you want to minimize Sum (y_i - f(x_i))**2 with:
f(x) = a * Exp(-b * x) + eps_i
where eps_i is the unknown error of the i-th data item, which you want to eliminate. Taking the logarithm results in
Log(f(x)) = Log(a*Exp(-b*x) + eps_i) != Log(Exp(Log(a) - b*x)) + eps_i
You can interpret the exponential equation as having additive noise. Your linear version has multiplicative noise mu_i, because:
g(x) = a * mu_i * Exp(-b*x)
results in
Log(g(x)) = Log(a) - b * x + Log(mu_i)
In conclusion, you will only get identical results when the magnitude of the errors eps_i is very small.
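The point about noise magnitude can be checked numerically. The following is a minimal sketch on synthetic data, not the data from the question: the true parameters (17.0 and 0.0034), the noise levels, the random seed and the helper name fit_both are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_func(x, a, b):
    return a * np.exp(-b * x)

def fit_both(x, y):
    """Return (a, b) from the log-linear fit and from the direct fit."""
    # Straight line through log(y): log(y) = log(a) - b * x
    slope, intercept = np.polyfit(x, np.log(y), 1)
    log_fit = np.array([np.exp(intercept), -slope])
    # Direct nonlinear fit, started from the log-linear estimate
    direct_fit, _ = curve_fit(exp_func, x, y, p0=log_fit)
    return log_fit, direct_fit

rng = np.random.default_rng(0)          # fixed seed (assumption)
x = np.linspace(1.0, 436.0, 30)
clean = exp_func(x, 17.0, 0.0034)       # hypothetical true parameters

# Same model, additive Gaussian noise of two very different sizes
small = fit_both(x, clean + rng.normal(0.0, 1e-4, x.size))
large = fit_both(x, clean + rng.normal(0.0, 0.5, x.size))

gap_small = abs(small[0][1] - small[1][1])
gap_large = abs(large[0][1] - large[1][1])
print("difference in b, small noise:", gap_small)
print("difference in b, large noise:", gap_large)
```

With nearly noise-free data the two estimates of b should agree closely, and the gap should grow once the additive noise becomes comparable to the smaller y values.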


How to fit this data in python and scipy?

I have some function which behaves as shown below i.e. some tapered/decaying oscillations
I want to fit the data using scipy's curve_fit. I have previously asked a question related to fitting functions with scipy, which was well answered here, and highlighted the importance of the initial guess for the values of the fitting parameters.
However, I am struggling to fit this data in a way which captures both the oscillations and the decay. My approach is as follows:
from scipy.optimize import curve_fit
import numpy as np
def Fit(x,y):
    #Define the function fit
    func = ansatz
    #Define the initial guess of parameters
    mag = (y.max() + y.min()) / 2
    y_shifted = y - mag
    omega_guess = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (x.max() - x.min())
    lam = np.log(2) / 1e7 #Rough guess based on approximate half life
    p0 = (mag,mag, omega_guess,mag,lam)
    #Do the fit
    popt, pcov = curve_fit(func, x,y,p0=p0)
    # return
    return func(x, *popt)
def ansatz(x,A,B,omega,offset,lam):
    osc = A*np.sin(omega*x) + B*np.cos(omega*x)
    linear = offset
    decay = np.exp(-x*lam)
    return decay*osc + linear
data = np.load('example.npy')
x = data[:,0]
y = data[:,1]
yFit = Fit(x,y)
This approach captures the decay, but not the oscillations. What is erroneous with my approach? Guesses for fit parameters? Function ansatz? Code implementation?

curve_fit fails when using endpoint=False in independent variable

I'm fitting some data (I have hard-coded it here) using a technique described in this question and it seemed to work fine. However, I realized my xdata was not quite what I wanted it to be so I used 'endpoint=False' so that my xdata increased from 17 to 27.5 in steps of 0.5. Upon doing this, scipy warned me that:
minpack.py:794: OptimizeWarning: Covariance of the parameters could not be estimated category=OptimizeWarning)
Perhaps this is working as intended and I'm missing some part of how curve_fit, or the Fourier function works, but I would really like to be able to fit this with the correct (albeit only slightly different) x values. My y values do have an offset that the fit removes when it runs successfully, which is fine by me.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
ydata = [48.97266579, 54.97148132, 65.33787537, 69.55269623, 56.5559082, 41.52973366,
28.06554699, 19.01652718, 16.74026489, 19.38094521, 25.63856506, 24.39780998,
18.99308014, 30.67970657, 31.52746582, 45.38796043, 45.3911972, 42.38343811,
41.90969849, 38.00998878, 49.11366463, 70.14483643]
xdata = np.linspace(17, 28, 22, endpoint=False) #, endpoint=False
def make_fourier(na, nb):
    def fourier(x, *a):
        ret = 0.0
        for deg in range(0, na):
            ret += a[deg] * np.cos((deg+1) * 2 * np.pi * x)
        for deg in range(na, na+nb):
            ret += a[deg] * np.sin((deg+1) * 2 * np.pi * x)
        return ret
    return fourier
def obtain_fourier_coef(ydata, harms):
    popt, pcov = curve_fit(make_fourier(harms, harms), xdata, ydata, [0.0]*harms*2)
    plt.plot(xdata, (make_fourier(harms,harms))(xdata, *popt))
    plt.plot(xdata, ydata)
obtain_fourier_coef(ydata, 10)
With endpoint=False: [plot of the curve fit results]
Without endpoint=False: [plot of the curve fit results]
The problem is caused by the combination of
[...] xdata increased from 17 to 27.5 in steps of 0.5.
np.cos((deg+1) * 2 * np.pi * x)
If x contains values in steps of 0.5, the values passed to the trigonometric functions are multiples of pi. This makes sin always return 0 and cos return either +1 or -1. Because of this degeneracy the resulting function cannot be fitted.
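The degeneracy can be made visible by building the matrix of basis functions directly. This is a small sketch, assuming harmonics 1 to 10 for both the cosine and the sine terms (the question's code numbers its sine harmonics differently, but any integer harmonic behaves the same way here); the tolerance passed to matrix_rank is likewise an assumption.

```python
import numpy as np

# Same grid as in the question: 17.0, 17.5, ..., 27.5
x = np.linspace(17, 28, 22, endpoint=False)
harmonics = np.arange(1, 11)

# One column per basis function, one row per x value
cos_cols = np.cos(2 * np.pi * np.outer(x, harmonics))
sin_cols = np.sin(2 * np.pi * np.outer(x, harmonics))
design = np.hstack([cos_cols, sin_cols])

# Every sine column is zero up to rounding error
print("largest |sin| value:", np.abs(sin_cols).max())

# The cosine columns only take the values +1 and -1, so the 20 columns
# span a tiny subspace; an explicit tolerance ignores rounding noise.
rank = np.linalg.matrix_rank(design, tol=1e-8)
print("rank of the 22 x 20 design matrix:", rank)
```

A rank far below the number of parameters means that many different coefficient vectors give exactly the same fitted values, which is why the covariance cannot be estimated.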

Python gaussian fit on simulated gaussian noisy data

I need to interpolate data coming from an instrument using a gaussian fit. To this end I thought about using the curve_fit function from scipy.
Since I'd like to test this functionality on fake data before trying it on the instrument I wrote the following code to generate noisy gaussian data and to fit it:
from scipy.optimize import curve_fit
import numpy
import pylab
# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val
# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06
x = numpy.arange(zMinEntry, zMaxEntry, zStepEntry,
                 dtype = numpy.float64)
n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))
# Fit
popt, pcov = curve_fit(gaussian, x, y)
# Print results
print("Scale = %.3f +/- %.3f" % (popt[0], numpy.sqrt(pcov[0, 0])))
print("Offset = %.3f +/- %.3f" % (popt[1], numpy.sqrt(pcov[1, 1])))
print("Sigma = %.3f +/- %.3f" % (popt[2], numpy.sqrt(pcov[2, 2])))
pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]))
Unfortunately this does not work properly, the output of the code is the following:
Scale = 6174.816 +/- 7114424813.672
Offset = 429.319 +/- 3919751917.830
Sigma = 1602.869 +/- 17923909301.176
And the plotted result is (blue is the fit function, red dots is the noisy input data):
I also tried to look at this answer, but couldn't figure out where my problem is.
Am I missing something here? Or am I using the curve_fit function in the wrong way? Thanks in advance!
I agree with Olaf insofar as it is a question of scale. The optimal parameters differ by many orders of magnitude. However, scaling the parameters with which you generated your toy data does not seem to solve the problem for your actual application. curve_fit uses leastsq, which numerically approximates the Jacobian, and numerical problems arise there because of the differences in scale (try using the full_output keyword in curve_fit).
In my experience it is often best to use fmin which does not rely on approximated derivatives but uses only function values. You now have to write your own least-squares function that is to be optimized.
Starting values are still important. In your case you can make sufficiently good guesses by taking the maximum amplitude for a and the corresponding x-value for b and c.
In code, it looks like this:
from scipy.optimize import curve_fit,fmin
import numpy
import pylab
# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val
# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06
x = numpy.arange(zMinEntry, zMaxEntry, zStepEntry,
                 dtype = numpy.float64)
n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))
print(a, meanY, sigmaY)
# estimate starting values from the data
a = y.max()
b = x[numpy.argmax(y)]  # x position of the largest y value
c = b
# define a least squares function to optimize
def minfunc(params):
    return sum((y-gaussian(x,params[0],params[1],params[2]))**2)
# fit
popt = fmin(minfunc,[a,b,c])
# Print results
print("Scale = %.3f" % (popt[0]))
print("Offset = %.3f" % (popt[1]))
print("Sigma = %.3f" % (popt[2]))
pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]),lw = 2)
Looks like some numerical instabilities are creeping into the optimizer. Try scaling the data. With the following data:
zMinEntry = 80.0*1E-06 * 1000
zMaxEntry = 180.0*1E-06 * 1000
zStepEntry = 0.2*1E-06 * 1000
sigmaY = 10.0E-06 * 1000
I get estimates of
Scale = 39.697 +/- 0.526
Offset = 0.130 +/- 0.000
Sigma = -0.010 +/- 0.000
Compare that to the true values:
Scale = 39.894228
Offset = 0.13
Sigma = 0.01
The minus sign of sigma can of course be ignored.
This gives the following plot
As I said in a comment, if you provide a reasonable initial guess, the fit succeeds, i.e. call curve_fit like this:
popt, pcov = curve_fit(gaussian, x, y, [50000,0.00012,0.00002])
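Instead of hard-coding the initial guess, it can be derived from the data. The sketch below is one possible way to do that, not the method of any of the answers above: the helper guess_gaussian_p0, the moment-based width estimate and the random seed are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    return a * np.exp(-(x - b) ** 2 / (2 * c ** 2))

def guess_gaussian_p0(x, y):
    """Rough starting values (hypothetical helper): peak height,
    peak position, and a width from the second moment of the data."""
    a0 = y.max()
    b0 = x[np.argmax(y)]
    w = np.clip(y, 0, None)             # ignore negative noisy samples
    c0 = np.sqrt(np.sum(w * (x - b0) ** 2) / np.sum(w))
    return a0, b0, c0

# Synthetic data on the same scale as in the question
rng = np.random.default_rng(1)          # fixed seed (assumption)
x = np.arange(80e-6, 180e-6, 0.2e-6)
true = (1.0 / (10e-6 * np.sqrt(2 * np.pi)), 130e-6, 10e-6)
y = gaussian(x, *true) + 0.1 * true[0] * rng.normal(size=x.size)

p0 = guess_gaussian_p0(x, y)
popt, pcov = curve_fit(gaussian, x, y, p0=p0)
print("initial guess:", p0)
print("fitted       :", popt)
```

The width guess is deliberately crude; it only has to put the optimizer on the right order of magnitude. As noted above, the sign of the fitted sigma carries no meaning.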

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
This does not answer the question in the strict sense, since it does not use numpy's polyfit function to force the fit through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias the estimates of your solution coefficients (p), because lstsq will be trying to compensate for the fact that there is an offset in your data. Sort of a 'square peg, round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(XX[:, :1], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
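Neither answer above uses the weight vector from the question. Here is a minimal sketch of how weights could be folded into the lstsq approach, assuming the same convention as np.polyfit, where w multiplies the unsquared residuals. The function name polyfit_through_origin and the test data are hypothetical.

```python
import numpy as np

def polyfit_through_origin(x, y, deg=3, w=None):
    """Least-squares polynomial fit with the constant term fixed at 0.

    Returns coefficients in np.polyfit order (highest power first),
    with a trailing 0.0 so the result works with np.polyval.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns x**deg, ..., x**1; the column of ones is left out
    A = np.vstack([x ** k for k in range(deg, 0, -1)]).T
    if w is not None:
        w = np.asarray(w, dtype=float)
        A = A * w[:, None]              # scale each row by its weight
        y = y * w
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.append(coefs, 0.0)

# Noise-free check on a cubic that really passes through the origin
x = np.linspace(-2.0, 2.0, 21)
y = 0.5 * x ** 3 - 1.0 * x ** 2 + 2.0 * x
wgt = np.ones_like(x)
coefs = polyfit_through_origin(x, y, deg=3, w=wgt)
print(coefs)
```

The warning above still applies: if the data has a real offset, forcing d = 0 biases the remaining coefficients.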

trying to get reasonable values from scipy powerlaw fit

I'm trying to fit some data from a simulation code I've been running in order to figure out a power law dependence. When I plot a linear fit, the data does not fit very well.
Here's the python script I'm using to fit the data:
#!/usr/bin/env python
from scipy import optimize
import numpy
xdata=[ 0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222]
ydata=[ 29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312]
fitfunc = lambda p, x: p[0] + p[1] * x ** (p[2])
errfunc = lambda p, x, y: (y - fitfunc(p, x))
out,success = optimize.leastsq(errfunc, [1,-1,-0.5],args=(xdata, ydata),maxfev=3000)
print("%g + %g*x^%g"%(out[0],out[1],out[2]))
the output I get is:
-71205.3 + 71174.5*x^-9.79038e-05
While on the plot the fit looks about as good as you'd expect from a leastsquares fit, the form of the output bothers me. I was hoping the constant would be close to where you'd expect the zero to be (around 30). And I was expecting to find a power dependence of a larger fraction than 10^-5.
I've tried rescaling my data and playing with the parameters to optimize.leastsq with no luck. Is what I'm trying to accomplish possible or does my data just not allow it? The calculation is expensive, so getting more data points is non-trivial.
It is much better to first take the logarithm and then use least squares to fit the resulting linear equation, which will give you a much better fit. There is a great example in the scipy cookbook, which I've adapted below to your code.
The best-fit values obtained this way are: amplitude = 0.8955 and index = -0.40943265484.
As we can see from the graph (and your data), if it's a power law we would not expect the amplitude to be near 30. In the power law equation f(x) == Amp * x ** index with a negative index, f(1) == Amp and f(0) == infinity.
from pylab import *
from scipy import optimize
xdata=[ 0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222]
ydata=[ 29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312]
logx = log10(xdata)
logy = log10(ydata)
# define our (line) fitting function
fitfunc = lambda p, x: p[0] + p[1] * x
errfunc = lambda p, x, y: (y - fitfunc(p, x))
pinit = [1.0, -1.0]
out = optimize.leastsq(errfunc, pinit,
                       args=(logx, logy), full_output=1)
pfinal = out[0]
covar = out[1]
index = pfinal[1]
amp = 10.0**pfinal[0]
print('amp:', amp, 'index', index)
powerlaw = lambda x, amp, index: amp * (x**index)
# Plotting data
subplot(2, 1, 1)
plot(xdata, powerlaw(xdata, amp, index)) # Fit
plot(xdata, ydata)#, yerr=yerr, fmt='k.') # Data
text(0.0020, 30, 'Ampli = %5.2f' % amp)
text(0.0020, 25, 'Index = %5.2f' % index)
subplot(2, 1, 2)
loglog(xdata, powerlaw(xdata, amp, index))
plot(xdata, ydata)#, yerr=yerr, fmt='k.') # Data
xlabel('X (log scale)')
ylabel('Y (log scale)')
It helps to rescale xdata so the numbers are not all so small.
You could work in a new variable xprime = 1000*x.
Then fit xprime versus y.
Least squares will find parameters q fitting
y = q[0] + q[1] * (xprime ** q[2])
= q[0] + q[1] * ((1000*x) ** q[2])
So let
p[0] = q[0]
p[1] = q[1] * (1000**q[2])
p[2] = q[2]
Then y = p[0] + p[1] * (x ** p[2])
It also helps to change the initial guess to something closer to your desired result, such as
[max(ydata), -1, -0.5].
from scipy import optimize
import numpy as np
def fitfunc(p, x):
    return p[0] + p[1] * (x ** p[2])
def errfunc(p, x, y):
    return y - fitfunc(p, x)
xdata=np.array([ 0.00010851, 0.00021701, 0.00043403, 0.00086806,
                 0.00173611, 0.00347222])
ydata=np.array([ 29.56241016, 29.82245508, 25.33930469, 19.97075977,
                 12.61276074, 7.12695312])
N = 5000
xprime = xdata * N
qout,success = optimize.leastsq(errfunc, [max(ydata),-1,-0.5],
                                args=(xprime, ydata),maxfev=3000)
out = qout.copy()  # copy, so that qout itself is left untouched
out[0] = qout[0]
out[1] = qout[1] * (N**qout[2])
out[2] = qout[2]
print("%g + %g*x^%g"%(out[0],out[1],out[2]))
40.1253 + -282.949*x^0.375555
The standard way to use linear least squares to obtain an exponential fit is to do what fraxel suggests in his/her answer: fit a straight line to log(y_i).
However, this method has known numerical disadvantages, particularly sensitivity (a small change in the data yields a large change in the estimate). The preferred alternative is to use a nonlinear least squares approach -- it is less sensitive. But if you are satisfied with the linear LS method for non-critical purposes, just use that.
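The two approaches can also be combined: the straight-line fit in log space supplies the starting values for a nonlinear fit in the original space. The sketch below uses the data from the question and assumes a pure power law amp * x**index with no additive constant, which is a simplification of the three-parameter model the question tried to fit.

```python
import numpy as np
from scipy.optimize import curve_fit

xdata = np.array([0.00010851, 0.00021701, 0.00043403, 0.00086806,
                  0.00173611, 0.00347222])
ydata = np.array([29.56241016, 29.82245508, 25.33930469, 19.97075977,
                  12.61276074, 7.12695312])

def powerlaw(x, amp, index):
    return amp * x ** index

def sse(params):
    """Sum of squared residuals in the original (non-log) space."""
    return np.sum((ydata - powerlaw(xdata, *params)) ** 2)

# Step 1: straight line through (log10 x, log10 y)
slope, intercept = np.polyfit(np.log10(xdata), np.log10(ydata), 1)
p0 = (10.0 ** intercept, slope)

# Step 2: nonlinear least squares, started from the log-space estimate
popt, pcov = curve_fit(powerlaw, xdata, ydata, p0=p0)

print("log-log fit  : amp=%g index=%g sse=%g" % (p0[0], p0[1], sse(p0)))
print("nonlinear fit: amp=%g index=%g sse=%g" % (popt[0], popt[1], sse(popt)))
```

The two parameter sets generally differ because they minimize different error measures: relative errors in log space versus absolute errors in the original space.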
