curve_fit fails when using endpoint=False in independent variable - python

I'm fitting some data (hard-coded here) using a technique described in this question, and it seemed to work fine. However, I realized my xdata was not quite what I wanted it to be, so I used endpoint=False so that my xdata increases from 17 to 27.5 in steps of 0.5. Upon doing this, scipy warned me:
minpack.py:794: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)
Perhaps this is working as intended and I'm missing some part of how curve_fit or the Fourier function works, but I would really like to be able to fit this with the correct (albeit only slightly different) x values. My y values do have an offset that the fit removes when it runs successfully, which is fine by me.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

ydata = [48.97266579, 54.97148132, 65.33787537, 69.55269623, 56.5559082, 41.52973366,
         28.06554699, 19.01652718, 16.74026489, 19.38094521, 25.63856506, 24.39780998,
         18.99308014, 30.67970657, 31.52746582, 45.38796043, 45.3911972, 42.38343811,
         41.90969849, 38.00998878, 49.11366463, 70.14483643]
xdata = np.linspace(17, 28, 22, endpoint=False)  # the fit works without endpoint=False

def make_fourier(na, nb):
    def fourier(x, *a):
        ret = 0.0
        for deg in range(0, na):
            ret += a[deg] * np.cos((deg+1) * 2 * np.pi * x)
        for deg in range(na, na+nb):
            ret += a[deg] * np.sin((deg+1) * 2 * np.pi * x)
        return ret
    return fourier

def obtain_fourier_coef(ydata, harms):
    popt, pcov = curve_fit(make_fourier(harms, harms), xdata, ydata, [0.0]*harms*2)
    plt.plot(xdata, (make_fourier(harms, harms))(xdata, *popt))
    plt.show()

plt.plot(xdata, ydata)
obtain_fourier_coef(ydata, 10)
With endpoint=False:
[plot: curve fit results]
Without endpoint=False:
[plot: curve fit results]

The problem is caused by a combination of
[...] xdata increased from 17 to 27.5 in steps of 0.5.
and
np.cos((deg+1) * 2 * np.pi * x)
If x contains values in steps of 0.5, the arguments passed to the trigonometric functions are integer multiples of pi. This makes sin always return 0 and cos return either +1 or -1, so every basis function is constant (or zero) across the sample points. Because of this degeneracy the model cannot be fitted, hence the covariance warning.
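A quick numerical check (a sketch, using the same grid as in the question) makes the degeneracy visible:

import numpy as np

x = np.linspace(17, 28, 22, endpoint=False)  # steps of 0.5
# 2*pi*x is an integer multiple of pi at every sample point
print(np.abs(np.sin(2 * np.pi * x)).max())              # ~1e-13: sin is numerically zero everywhere
print(np.unique(np.round(np.cos(2 * np.pi * x), 12)))   # [-1.  1.]: cos only alternates sign

A common remedy (not part of the original answer) is to rescale x to the fundamental period of the data, e.g. np.cos((deg+1) * 2 * np.pi * (x - x.min()) / period) with period set to the length of the window, so that the basis functions actually vary across the sample points.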

Related

How to write a function to fit data to a sum of N Gaussian-like peaks without explicitly defining the expression for every possible N?

I am trying to fit a progression of Gaussian peaks to a spectral lineshape.
The progression is a summation of N evenly spaced Gaussian peaks. When coded as a function, the formula for N=1 looks like this:
A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
where A, e0, hf, S and fwhm are to be determined from the fit with some good initial guesses.
Importantly, the parameter i starts at 0 and is incremented by 1 for every additional component.
So, for N = 3 the expression would take the form:
A * ((e0-0*hf)/e0)**3 * ((S**0)/np.math.factorial(0)) * np.exp(-4*np.log(2)*((x-e0+0*hf)/fwhm)**2) +
A * ((e0-1*hf)/e0)**3 * ((S**1)/np.math.factorial(1)) * np.exp(-4*np.log(2)*((x-e0+1*hf)/fwhm)**2) +
A * ((e0-2*hf)/e0)**3 * ((S**2)/np.math.factorial(2)) * np.exp(-4*np.log(2)*((x-e0+2*hf)/fwhm)**2)
All the parameters except i are constant for every component in the summation, and this is intended; i changes in a controlled way with the number of components.
I am using curve_fit. One way to code the fitting routine would be to explicitly define the expression for any reasonable N and just use the appropriate one. Here it would be 5 or 6, depending on the spacing, which is determined by hf. I could just define a long function with N components, writing the appropriate i value into each component. I understand how to do that (and did). But I would like to code this more intelligently. My goal is to write a function that accepts any value of N, adds the appropriate number of components as described above, computes the expression while incrementing i properly, and returns the result.
I have attempted a variety of things. My main hurdle is that I don't know how to tell the program to use a particular N and the corresponding values of i. Finally, after some searching I thought I found a good way to code it with a lambda function.
from scipy.optimize import curve_fit
import numpy as np

def fullfunc(x, p, n):
    def func(x, A, e0, hf, S, fwhm, i):
        return A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
    y_fit = np.zeros_like(x)
    for i in range(n):
        y_fit += func(x, p[0], p[1], p[2], p[3], p[4], i)
    return y_fit

p = [1, 26000, 1400, 1, 1000]
x = [27027, 25062, 23364, 21881, 20576, 19417, 18382, 17452, 16611, 15847, 15151]
y = [0.01, 0.42, 0.93, 0.97, 0.65, 0.33, 0.14, 0.06, 0.02, 0.01, 0.004]
n = 7
fittedParameters, pcov = curve_fit(lambda x, p: fullfunc(x, p, n), x, y, p)
A, e0, hf, S, fwhm = fittedParameters
This gives:
TypeError: <lambda>() takes 2 positional arguments but 7 were given
and I don't understand why. I have a feeling the lambda function can't deal with a list of initial parameters.
I would greatly appreciate any advice on how to make this work without explicitly writing all the equations out, as I find that a bit too rigid.
The x and y ranges provided are samples of real data which give a general idea of what the shape is.
The TypeError occurs because curve_fit unpacks the initial guess and calls your model as f(x, p1, ..., p5), not as f(x, [p1, ..., p5]). But since you only sum over a range i = 0, 1, ..., n-1, there is no need for complicated lambda constructs at all. Just define your fit function as the summation of n components:
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
import numpy as np

def func(x, A, e0, hf, S, fwhm):
    return sum(A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2) for i in range(n))

p = [1, 26000, 1400, 1, 1000]
x = [27027, 25062, 23364, 21881, 20576, 19417, 18382, 17452, 16611, 15847, 15151]
y = [0.01, 0.42, 0.93, 0.97, 0.65, 0.33, 0.14, 0.06, 0.02, 0.01, 0.004]
n = 7
fittedParameters, pcov = curve_fit(func, x, y, p0=p)
# A, e0, hf, S, fwhm = fittedParameters
print(fittedParameters)

plt.plot(x, y, "ro", label="data")
x_fit = np.linspace(min(x), max(x), 100)
y_fit = func(x_fit, *fittedParameters)
plt.plot(x_fit, y_fit, label="fit")
plt.legend()
plt.show()
Sample output:
[plot: data points with the fitted curve]
P.S.: By the look of it, these data points are already well fitted with n=1.
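As a side note (a sketch of my own, not part of the original answer): the module-level n can be avoided with a factory function, in the same spirit as make_fourier in the first question above, and the original lambda also works if it accepts varargs, since curve_fit passes each parameter separately:

def make_progression(n):
    # return a model function with n fixed via the closure
    def model(x, A, e0, hf, S, fwhm):
        return sum(A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i))
                   * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2) for i in range(n))
    return model

fittedParameters, pcov = curve_fit(make_progression(7), x, y, p0=p)

# equivalently, wrapping fullfunc from the question with a varargs lambda;
# with p0 supplied, curve_fit infers the number of parameters from len(p0)
fittedParameters, pcov = curve_fit(lambda x, *p: fullfunc(x, p, 7), x, y, p0=p)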

Fitting a Gaussian to an absorption line in Python

I am trying to fit a Gaussian to my data, which were taken in a pretty narrow spectral window. We got about 2 points of continuum and then about 10-11 that are part of the line. I think it should still be possible to fit it, but curve_fit fails each time, and I am not sure why.
When running I get RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 800.
Code and data:
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import numpy as np

x = np.arange(13)
xx = np.arange(130)/13.
y = np.array([19699.959 , 21679.445 , 21143.195 , 20602.875 , 16246.769 ,
              11635.25  ,  8602.465 ,  7035.493 ,  6697.0337,  6510.092 ,
               7717.772 , 12270.446 , 16807.81  ])

# intensity-weighted mean and standard deviation as starting estimates
mean = sum(x * y) / sum(y)
sigma = np.sqrt(sum(y * (x - mean)**2) / sum(y))

def Gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

popt, pcov = curve_fit(Gauss, x, y, p0=[max(y), mean, sigma])

plt.plot(x, y, 'b+:', label='data')
plt.plot(xx, Gauss(xx, *popt), 'r-', label='fit')
plt.legend()
plt.show()
As the error says, the procedure to find the optimal values doesn't converge. If you really think that what you have can be fitted with a Gaussian curve, this generally means you have a bad starting point.
The way you provide the starting point could be an issue, particularly for sigma, given that at positions 11, 12 and 13 you have what could be the onset of another signal. That is not the biggest issue this time, though: you forgot to add an offset to the Gaussian function.
def Gauss(x, y0, a, x0, sigma):
    #         |
    #         ----> new parameter y0: an additive offset
    return y0 + a * np.exp(-(x - x0)**2 / (2 * sigma**2))
Then you can decide how to provide a starting point for the offset; by eye, I set it to 5000:
popt, pcov = curve_fit(Gauss, x, y, p0=[5000, max(y), mean, sigma])
Doing that, I get a fit. But, due to the last three data points, it's not a very nice one.
If you avoid those values, the fit improves significantly.
Edit:
As indicated in the comments, the Gaussian is centered at about 8 looking downwards (silly me, it was an absorption line).
In such a case, the offset should be located at about the maximum ~22000 and then the parameter for the amplitude should be negative ~ -(max(y)-min(y)) ~ -16000.
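A minimal sketch of that fit (my own starting values, using the offset version of Gauss defined above):

# absorption line: offset near the continuum level, negative amplitude,
# center near the minimum of the data; sigma is a rough guess
p0 = [max(y), -(max(y) - min(y)), x[np.argmin(y)], 2.0]
popt, pcov = curve_fit(Gauss, x, y, p0=p0)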
As an addition, it is better to change xx as follows:
xx = np.linspace(0, 13, 100)
or
xx = np.arange(0, 13, 0.05)
This will give:
[plot: absorption-line data with the fitted offset Gaussian]
Checking popt, you get essentially the values I estimated by just looking at the plot, ~(22000, -16000, 8), with a sigma of 2.7, which was the only parameter I didn't have an immediate feel for how to estimate.
My guess is that you should actually be fitting a mixture of a Gaussian and a Cauchy/Lorentz lineshape, or better yet a Voigt lineshape, to account for experimental broadening.

SciPy Curve Fit Fails Power Law

So, I'm trying to fit a set of data with a power law of the following kind:
def f(x, N, a):  # power-law fit
    if a > 0:
        return N * x**(-a)
    else:
        return 10.**300

par, cov = scipy.optimize.curve_fit(f, data, time, array([10**(-7), 1.2]))
where the else condition just forces a to be positive. Using scipy.optimize.curve_fit yields an awful fit (green line), returning values of 1.2e+04 and 1.9e-07 for N and a, respectively, with absolutely no intersection with the data. From fits I've done manually, the values should land around 1e-07 and 1.2 for N and a, respectively, though putting those into curve_fit as initial parameters doesn't change the result. Removing the condition for a to be positive results in a worse fit, as it chooses a negative value, which leads to a fit with the wrong sign of slope.
I can't figure out how to get a believable, let alone reliable, fit out of this routine, but I can't find any other good Python curve fitting routines. Do I need to write my own least-squares algorithm or is there something I'm doing wrong here?
UPDATE
In the original post, I showed a solution that uses lmfit, which allows you to assign bounds to your parameters. Starting with version 0.17, scipy also allows you to assign bounds to parameters directly (see the documentation). You can find this solution below, after the EDIT; it can hopefully serve as a minimal example of how to use scipy's curve_fit with parameter bounds.
Original post
As suggested by @Warren Weckesser, you could use lmfit to get this task done, which allows you to assign bounds to your parameters and avoids the 'ugly' if-clause.
Since you do not provide any data, I created some, shown here:
[plot: synthetic power-law data]
They follow the law f(x) = 10.5 * x ** (-0.08)
I fit them, as suggested by @roadrunner66, by transforming the power law into a linear function:
y = N * x ** a
ln(y) = ln(N * x ** a)
ln(y) = a * ln(x) + ln(N)
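As a quick illustration of this transform (a sketch, assuming xData and yData as generated in the code below), the linearized fit can even be done with a plain polynomial fit:

# ordinary linear least squares on the log-log data: slope = a, intercept = ln(N)
a_fit, lnN_fit = np.polyfit(np.log(xData), np.log(yData), 1)
print(np.exp(lnN_fit), a_fit)  # should come out close to 10.5 and -0.08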
So I first use np.log on the original data and then do the fit. When I now use lmfit, I get the following output:
[[Variables]]
lN: 2.35450302 +/- 0.019531 (0.83%) (init= 1.704748)
a: -0.08035342 +/- 0.005158 (6.42%) (init=-0.5)
So a is pretty close to the original value and np.exp(2.35450302) gives 10.53 which is also very close to the original value.
The plot then looks as follows; as you can see, the fit describes the data very well:
[plot: data with the fitted power law]
Here is the entire code with a couple of inline comments:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, report_fit

# generate some data with noise
xData = np.linspace(0.01, 100., 50)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))
plt.plot(xData, yData, 'bo')
plt.show()

# transform data so that we can use a linear fit
lx = np.log(xData)
ly = np.log(yData)
plt.plot(lx, ly, 'bo')
plt.show()

def decay(params, x, data):
    lN = params['lN'].value
    a = params['a'].value
    # our linear model
    model = a * x + lN
    return model - data  # that's what you want to minimize

# create a set of Parameters
params = Parameters()
params.add('lN', value=np.log(5.5), min=0.01, max=100)  # value is the initial value
params.add('a', value=-0.5, min=-1, max=-0.001)  # min, max define parameter bounds

# do the fit, here with the default leastsq method
result = minimize(decay, params, args=(lx, ly))

# write the error report
report_fit(result.params)

# plot the data and the fit
xnew = np.linspace(0.01, 100., 5000)
plt.plot(xData, yData, 'bo')
plt.plot(xnew, np.exp(result.params['lN'].value) * xnew ** (result.params['a'].value), 'r')
plt.show()
EDIT
Assuming that you have scipy 0.17 installed, you can also do the following using curve_fit. I show it for your original definition of the power law (red line in the plot below) as well as for the logarithmic data (black line in the plot below). The data are generated in the same way as above. The plot looks as follows:
[plot: data with the direct fit (red) and the log-space fit (black)]
As you can see, the data are described very well. If you print popt and popt_log, you obtain array([ 10.47463426, 0.07914812]) and array([ 2.35158653, -0.08045776]), respectively (note: for the latter you will have to take the exponential of the first argument: np.exp(popt_log[0]) = 10.502, which is close to the original value).
Here is the entire code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# generate some data with noise
xData = np.linspace(0.01, 100., 50)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))

# get logarithmic data
lx = np.log(xData)
ly = np.log(yData)

def f(x, N, a):
    return N * x ** (-a)

def f_log(x, lN, a):
    return a * x + lN

# optimize using the appropriate bounds
popt, pcov = curve_fit(f, xData, yData, bounds=(0, [30., 20.]))
popt_log, pcov_log = curve_fit(f_log, lx, ly, bounds=([0, -10], [30., 20.]))

xnew = np.linspace(0.01, 100., 5000)

# plot the data
plt.plot(xData, yData, 'bo')
plt.plot(xnew, f(xnew, *popt), 'r')
plt.plot(xnew, f(xnew, np.exp(popt_log[0]), -popt_log[1]), 'k')
plt.show()

python - scipy.integrate.odeint returning wrong results

I was trying to integrate a square wave using python 3.5 and the scipy.integrate.odeint function but the results don't make any sense and vary wildly with the array of time points selected.
The square wave has a period of 10sec and the simulation runs for 100sec. Since the array of time points has size 500, there will be 50 time points on each period of the square wave, but that doesn't seem to be happening.
Using the optional parameter hmax=0.2 fixes it, but shouldn't it be inferred automatically?
Here's the code:
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as integrate

# dx/dt = f(t), where f(t) is a square wave
def f(x, t):
    return float(t % 10.0 < 5.0) * 0.3

T = 100
tt = np.linspace(0, T, 500)
xx = integrate.odeint(f, 0, tt, hmax=0.2)

plt.figure()
plt.subplot(2, 1, 1)
plt.plot(tt, xx)
plt.axis([0, T, 0, 16])
plt.subplot(2, 1, 2)
plt.plot(tt, [f(None, t) for t in tt])
plt.axis([0, T, 0, 1])
plt.show()
I'm hoping someone can shed some light on what is happening here. Try changing T (the simulation time) between 80 and 100.
I think your problem is that odeint expects the right-hand side to be a (reasonably) continuous function of t, which a square wave is not; the adaptive step-size control can step right over entire pulses.
I'd start by redefining your square-wave function as:
def g(t):
    return float(t % 10.0 < 5.0) * 0.3

Then define a function to calculate the integral step by step:

def get_integral(tt):
    # simple rectangle-rule accumulation over the uniform time grid
    intarray = np.zeros_like(tt)
    step_size = tt[1] - tt[0]
    for i, t in enumerate(tt):
        # on the first pass intarray[i-1] is intarray[-1], which is still 0
        intarray[i] = intarray[i-1] + g(t) * step_size
    return intarray
Then:
xx = get_integral(tt)
should give you the result you're looking for.
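Alternatively (a sketch restating the workaround already mentioned in the question), you can keep odeint and simply cap its internal step size so the solver cannot step over a half-period of the square wave:

# hmax well below the 5 s half-period forces odeint to see every pulse
xx = integrate.odeint(f, 0, tt, hmax=0.2)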

Why don't these curve fit results match?

I'm attempting to estimate a decay rate using an exponential fit, but I'm puzzled by why the two methods don't give the same result.
In the first case, taking the log of the data to linearize the problem matches Excel's exponential trendline fit. I had expected that fitting the exponential directly would be the same.
import numpy as np
from scipy.optimize import curve_fit

def exp_func(x, a, b):
    return a * np.exp(-b * x)

def lin_func(x, m, b):
    return m*x + b

xdata = [1065.0, 1080.0, 1095.0, 1110.0, 1125.0, 1140.0, 1155.0, 1170.0, 1185.0, 1200.0, 1215.0, 1230.0, 1245.0, 1260.0, 1275.0, 1290.0, 1305.0, 1320.0, 1335.0, 1350.0, 1365.0, 1380.0, 1395.0, 1410.0, 1425.0, 1440.0, 1455.0, 1470.0, 1485.0, 1500.0]
ydata = [21.3934, 17.14985, 11.2703, 13.284, 12.28465, 12.46925, 12.6315, 12.1292, 10.32762, 8.509195, 14.5393, 12.02665, 10.9383, 11.23325, 6.03988, 9.34904, 8.08941, 6.847, 5.938535, 6.792715, 5.520765, 6.16601, 5.71889, 4.949725, 7.62808, 5.5079, 3.049625, 4.8566, 3.26551, 3.50161]
xdata = np.array(xdata)
xdata = xdata - xdata.min() + 1
ydata = np.array(ydata)
lydata = np.log(ydata)

lopt, lcov = curve_fit(lin_func, xdata, lydata)
elopt = [np.exp(lopt[1]), -lopt[0]]
eopt, ecov = curve_fit(exp_func, xdata, ydata, p0=elopt)

print('elopt: {},{}'.format(*elopt))
print('eopt: {},{}'.format(*eopt))
results:
elopt: 17.2526204283,0.00343624199064
eopt: 17.1516384575,0.00330590568338
You're solving two different optimization problems. curve_fit() assumes that the noise eps_i is additive (and roughly Gaussian); otherwise it won't deliver optimal results.
Fitting the exponential directly minimizes Sum_i (y_i - f(x_i))**2 for a model with additive noise:
y_i = a * Exp(-b * x_i) + eps_i
where eps_i is the unknown error on the i-th data point, which you want to eliminate. Taking the logarithm does not linearize this model, because
Log(a * Exp(-b * x_i) + eps_i) != Log(a) - b * x_i + eps_i
Your linearized version instead assumes multiplicative noise mu_i:
y_i = a * mu_i * Exp(-b * x_i)
which results in
Log(y_i) = Log(a) - b * x_i + Log(mu_i)
In conclusion, you will only get identical results from the two approaches when the magnitude of the errors eps_i is very small.
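A quick sketch (my own illustration, not from the original answer) showing the effect: generate data once with multiplicative (log-normal) noise and once with additive noise, then compare the two fitting routes on each.

import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
x = np.linspace(1, 400, 200)
a_true, b_true = 17.0, 0.0035

def exp_func(x, a, b):
    return a * np.exp(-b * x)

def lin_func(x, m, b):
    return m * x + b

y_mult = a_true * np.exp(-b_true * x) * rng.lognormal(0, 0.1, x.size)  # multiplicative noise
y_add = a_true * np.exp(-b_true * x) + rng.normal(0, 0.5, x.size)      # additive noise

for label, y in [("multiplicative", y_mult), ("additive", y_add)]:
    lopt, _ = curve_fit(lin_func, x, np.log(y))                            # fit in log space
    eopt, _ = curve_fit(exp_func, x, y, p0=[np.exp(lopt[1]), -lopt[0]])    # fit directly
    print(label, "log-space:", np.exp(lopt[1]), -lopt[0], "direct:", *eopt)

The log-space fit recovers the true parameters best on the multiplicative data, while the direct fit wins on the additive data, which matches the discrepancy observed above.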
