How to fit a model using least squares minimisation in scipy (Python)

I am trying to fit a model to a spectrum (see below) using least squares minimisation in scipy. The spectrum contains features characterised by the following curves:
Since the spectrum contains features of an exponential and a Gaussian function, I combined the two (the Gaussian and exponential functions) and fitted the model using the least squares minimisation method in Python. See the code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import leastsq

#The model to fit to the data
def f(a, x):
    return a[0]*np.exp(-((x - a[1])**2)/(2*(a[2])**2)) + a[0]*np.exp(-a[1]*(x - a[2])) + a[0]

#Residuals
def err(a, x, y):
    return f(a, x) - y

xdata, ydata = np.loadtxt('data2.csv', delimiter=',', unpack=True)

#Performing the fit
a0 = [0.1, 1, 1] #initial guess
a, success = leastsq(err, a0, args=(xdata, ydata))

x = np.linspace(xdata.min(), xdata.max(), 1000)
plt.plot(x, f(a, x), 'r-')
plt.step(xdata, ydata)
#plt.title('Spectrum')
Output:
The curve I am trying to fit to the spectrum is shown in red, but it does not capture the peak. I tried increasing and decreasing the initial guesses, but that does not help. Is my model wrong, or am I missing something in my code? Any help would be much appreciated. Thank you in advance.
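Note that the model above reuses a[0], a[1] and a[2] for the Gaussian term, the exponential term and the constant offset, which couples all three components together. A sketch of the same model with independent parameters for each component (this six-parameter layout is an assumption for illustration, not the original code):

import numpy as np
from scipy.optimize import leastsq

#Independent parameters: a[0:3] Gaussian amplitude/centre/width,
#a[3:5] exponential amplitude/decay rate, a[5] constant offset
def f(a, x):
    gauss = a[0]*np.exp(-((x - a[1])**2)/(2*(a[2])**2))
    expon = a[3]*np.exp(-a[4]*x)
    return gauss + expon + a[5]

def err(a, x, y):
    return f(a, x) - y

#Initial guesses should be read off the spectrum, e.g. a[1] near the
#peak position and a[2] near the peak half-width
a0 = [1.0, 1.0, 1.0, 1.0, 0.1, 0.0]
#a, success = leastsq(err, a0, args=(xdata, ydata))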
Link to the data

Related

Issues fitting gaussian to scatter plot

I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

def gaussian_model(x, a, b, c, d): # Gaussian plus a constant offset d
    return a*np.exp(-(x-b)**2/(2*c**2)) + d

# xdata, ydata: the scatter data being fitted (loaded elsewhere)
x = np.linspace(0, 20, 100) # grid for plotting the fitted curve
mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A, fit_B, fit_C, fit_D = mu
fit_y = gaussian_model(x, fit_A, fit_B, fit_C, fit_D)
print(mu)
plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
plt.show()
Here's the plot
When I printed the parameters, I got an amplitude of -17, a mean of 2.6, a standard deviation of -2.5, and a baseline of 110. This is very far from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! I just needed to supply some initial guesses.
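For reference, supplying initial guesses means passing p0 to curve_fit; the numbers below are placeholders read off a typical scatter plot, not values from this data:

p0 = [50, 10, 2, 0] # rough guesses: amplitude, mean, width, offset
mu, cov = curve_fit(gaussian_model, xdata, ydata, p0=p0)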
This is not an answer as expected.
It is an alternative method of fitting a Gaussian.
The process is not iterative and does not require initial "guessed" values of the parameters to start, as the usual methods do.
The result is:
The method of calculation is shown below:
The general principle is explained with examples in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . It is a linear regression with respect to an integral equation whose solution is the Gaussian function.
If one wants a more accurate and/or more specific result according to some specified criteria of fitting, one has to use software with a non-linear regression process. One can then use the above result as initial parameter values for a more robust iterative process.
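The integral-equation method itself is in the linked document. As a simpler illustration of the same idea, namely a non-iterative estimate that can seed a later non-linear fit, Gaussian parameters can also be estimated from weighted moments of the data. This is a sketch of that simpler alternative, not the linked method:

import numpy as np

def gaussian_moments(x, y):
    #Non-iterative moment estimates for y = a*exp(-(x-b)^2/(2c^2)) + d;
    #a rough sketch that assumes the peak dominates the data
    d = y.min()                      # crude baseline estimate
    w = y - d                        # weights above the baseline
    b = np.sum(x * w) / np.sum(w)    # weighted mean -> centre
    c = np.sqrt(np.sum(w * (x - b)**2) / np.sum(w))  # weighted std -> width
    a = w.max()                      # peak height -> amplitude
    return a, b, c, d

#These estimates can then be passed as p0 to curve_fit for iterative refinement.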

Python fitting model to curve

I am experimenting with Python to fit curves to a series of data points, summarised below:
From the plots below, it would seem that polynomials of order greater than 2 are the best fit, followed by linear, and finally exponential, which has the worst outcome overall.
While I appreciate this might not be exponential growth, I just wanted to know whether you would expect the exponential function to perform so badly (essentially the coefficient of x, b, has been set to 0 and an arbitrary point on the curve has been picked to intersect), or whether I have somehow done something wrong in my fitting code.
The code I'm using to fit is as follows:
# Fitting
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def exponential_func(x, a, b, c):
    return a*np.exp(-b*x) + c

def linear(x, m, c):
    return m*x + c

def quadratic(x, a, b, c):
    return a*x**2 + b*x + c

def cubic(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

# cancerSizeMean and start are defined earlier in the script
x = np.array(x)
yZero = np.array(cancerSizeMean['levelZero'].values)[start:]
print(len(x))
print(len(yZero))

popt, pcov = curve_fit(exponential_func, x, yZero, p0=(1, 1, 1))
expZeroFit = exponential_func(x, *popt)
plt.plot(x, expZeroFit, label='Control, Exponential Fit')

popt, pcov = curve_fit(linear, x, yZero, p0=(1, 1))
linearZeroFit = linear(x, *popt)
plt.plot(x, linearZeroFit, label='Control, Linear')

popt, pcov = curve_fit(quadratic, x, yZero, p0=(1, 1, 1))
quadraticZeroFit = quadratic(x, *popt)
plt.plot(x, quadraticZeroFit, label='Control, Quadratic')

popt, pcov = curve_fit(cubic, x, yZero, p0=(1, 1, 1, 1))
cubicZeroFit = cubic(x, *popt)
plt.plot(x, cubicZeroFit, label='Control, Cubic')
*Edit: curve_fit is imported from the scipy.optimize package
from scipy.optimize import curve_fit
curve_fit tends to perform poorly if you give it a poor initial guess with functions like the exponential, which can blow up to very large numbers. You could try increasing the maxfev input so that it runs more iterations. Otherwise, I would suggest trying something like:
p0=(1000, -.005, 0)
The -.005 is because y roughly doubles from x=300 to x=500 and you have -b in your equation; the 1000 is because y is roughly 3000 at x=300 (about 1.5 doublings from 0). See how that turns out.
As for why the initial exponential fit doesn't work at all: your initial guess is b=1, and x is in the range of (300, 1000) or so. This means Python is calculating exp(-300), which is either rounded to 0 or so tiny it is negligible. At that point, whether b is increased or decreased slightly, the exponential term remains effectively 0 for any value in the general vicinity of the initial estimate, so the optimizer cannot improve.
Basically, Python uses a numerical method with limited precision, and the exponential estimate went outside the range of values it can handle.
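A quick way to see this numerically (a minimal sketch):

import numpy as np
print(np.exp(-300))        # ~5e-131, utterly negligible next to data of order 1000
# With b = 1 and x >= 300, a*np.exp(-b*x) is numerically flat at zero, so small
# changes in a or b leave the residuals unchanged and least squares sees no
# direction of improvement; a guess like b = -0.005 keeps the exponent in range
print(np.exp(0.005*300))   # ~4.5, a workable starting scale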
I'm not sure how you're fitting the curves - are you using polynomial least squares? In that case, you'd expect the fit to improve with each additional degree of flexibility, and you choose the degree based on diminishing marginal improvement or outside theory.
The improving fit should look something like this.
I actually wrote some code to do polynomial least squares in Python for a class a while back, which you can find here on GitHub. It's a bit hacky and loosely commented, though, since I was just using it to solve exercises. Hope it's helpful.
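As an illustration of that diminishing-returns check, numpy's built-in polynomial fitting can be used directly (a sketch, separate from the GitHub code mentioned above):

import numpy as np

# x, y: the data being fitted
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    residual = np.sum((np.polyval(coeffs, x) - y)**2)
    print(degree, residual)   # pick the degree where the drop in residual levels off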

using undetermined number of parameters in scipy function curve_fit

First question:
I'm trying to fit experimental data with a function of the following form:
f(x) = m_0*(1 - exp(-t_0*x)) + ... + m_j*(1 - exp(-t_j*x))
Currently, I don't see a way to have an undetermined number of parameters m_j, t_j; I'm forced to do something like this:
def fitting_function(x, m_1, t_1, m_2, t_2):
    return m_1*(1. - numpy.exp(-t_1*x)) + m_2*(1. - numpy.exp(-t_2*x))

parameters, covariance = curve_fit(fitting_function, xExp, yExp, maxfev=100000)
(xExp and yExp are my experimental points)
Is there a way to write my fitting function like this:
def fitting_function(x, li):
    res = 0.
    for idx in range(len(li) // 2):
        res += li[2*idx]*(1 - numpy.exp(-li[2*idx+1]*x))
    return res

where li is the list of fitting parameters, and then do the curve fitting? I don't know how to tell curve_fit what the number of fitting parameters is.
When I try this kind of form for fitting_function, I get errors like "ValueError: Unable to determine number of fit parameters."
Second question:
Is there any way to force my fitting parameters to be positive?
Any help appreciated :)
See my question and answer here. I've also made a minimal working example demonstrating how it could be done for your application. I make no claims that this is the best way - I am muddling through all this myself, so any critiques or simplifications are appreciated.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl

def wrapper(x, *args):
    # Take a flat argument list and break it into two lists for the fit function
    N = len(args) // 2
    amplitudes = list(args[0:N])
    timeconstants = list(args[N:2*N])
    return fit_func(x, amplitudes, timeconstants)

def fit_func(x, amplitudes, timeconstants):
    # The actual fit function: a sum of saturating exponentials
    fit = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        fit += m*(1.0 - np.exp(-t*x))
    return fit

def gen_data(x, amplitudes, timeconstants, noise=0.1):
    # Generate some fake data
    y = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        y += m*(1.0 - np.exp(-t*x))
    if noise:
        y += np.random.normal(0, noise, size=len(x))
    return y

def main():
    x = np.arange(0, 100)
    amplitudes = [1, 2, 3]
    timeconstants = [0.5, 0.2, 0.1]
    y = gen_data(x, amplitudes, timeconstants, noise=0.01)
    p0 = [1, 2, 3, 0.5, 0.2, 0.1]
    popt, pcov = curve_fit(lambda x, *p0: wrapper(x, *p0), x, y, p0=p0) # call with lambda function
    yfit = gen_data(x, popt[0:3], popt[3:6], noise=0)
    pl.plot(x, y, x, yfit)
    pl.show()
    print(popt)
    print(pcov)

if __name__ == "__main__":
    main()
A word of warning, though. A linear sum of exponentials makes the fit EXTREMELY sensitive to noise, particularly for a large number of parameters. You can test that by adding even a small amount of noise to the data generated in the script: small deviations cause it to get the wrong answer entirely, while the fit still looks perfectly valid by eye (test with noise=0, 0.01, and 0.1). Be very careful interpreting your results even if the fit looks good. It is also a form that allows for variable swapping: the best-fit solution is the same if you swap any pair (m_i, t_i) with (m_j, t_j), meaning your chi-square has multiple identical local minima, and your variables may get swapped around during fitting depending on your initial conditions. This is unlikely to be a numerically robust way to extract these parameters.
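One small mitigation for the variable-swapping degeneracy (a sketch building on popt from the script above): sort the fitted (time constant, amplitude) pairs into a canonical order before comparing results across runs:

N = len(popt) // 2
pairs = sorted(zip(popt[N:2*N], popt[0:N]))   # (t, m) pairs ordered by t
timeconstants = [t for t, m in pairs]
amplitudes = [m for t, m in pairs]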
To your second question: yes, you can, by defining your exponentials like so:
m_0**2*(1.0 - np.exp(-t_0**2*x)) + ...
Basically, square all the parameters in your actual fit function, fit, and then square the results (which could be negative or positive) to get your actual parameters. You can also restrict variables to a certain range by using other proxy forms.
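Note that newer versions of scipy (0.17 and up) also let you enforce positivity directly: curve_fit accepts a bounds argument, which avoids the squaring trick entirely. A minimal sketch:

import numpy as np
from scipy.optimize import curve_fit

# Constrain all parameters to be non-negative via box bounds
popt, pcov = curve_fit(fitting_function, xExp, yExp,
                       p0=[1.0, 1.0, 1.0, 1.0], bounds=(0, np.inf))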

Creating Non Linear Regression with Python

I have some simple data:
x = numpy.array([1,2,3,
4,5,6,
7,8,9,
10,11,12,
13,14,15,
16,17,18,
19,20,21,
22,23,24])
y = numpy.array([2149,2731,3397,
3088,2928,2108,
1200,659,289,
1141,1726,2910,
4410,5213,5851,
5817,5307,4314,
3656,3081,3103,
3535,4512,5584])
I can create a linear regression and make predictions with this code:
z = numpy.polyfit(x, y, 1)
p = numpy.poly1d(z)
But I want to create a non-linear regression of this data and draw a graph with code like this:
import matplotlib.pyplot as plt
xp1 = numpy.linspace(1,24,100)
plt.plot(x, y, 'r--', xp1, p(xp1))
plt.show()
I saw code like this, but it couldn't help me:
def func(x, a, b, c):
    return a*np.exp(-b*x) + c
...
popt, pcov = curve_fit(func, x, y)
...
So what is the code for making a non-linear regression, and how can I make predictions with the non-linear equation?
What you are referring to is the scipy module. You are right in that this is probably the module you want to be using.
Then, what you are interested in knowing is how curve_fit(func, x, y) works. The idea is that you want to minimize the difference between some model function (like y = m*x + b for a line) and the points in your data. The func argument represents this model: you write a function that takes as its first argument the independent variable of the model (x in my example) and, as all subsequent arguments, the parameters of the model (m and b in the case of the linear model). The x and y you have already figured out.
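To make that concrete, here is a minimal sketch of curve_fit with the linear model from that example (the data points are made up for illustration):

import numpy as np
from scipy.optimize import curve_fit

def line(x, m, b):     # first argument: independent variable; rest: parameters
    return m*x + b

xd = np.array([0., 1., 2., 3.])
yd = np.array([1.1, 2.9, 5.2, 6.8])
popt, pcov = curve_fit(line, xd, yd)
print(popt)   # approximately [m, b]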
The real problem, though - and yes, I realize I'm not answering your question - is that you need to figure out manually some sort of model for your data (at least the type of model: exponential, linear, polynomial, etc.). There is no easy way out of that. Judging from your data, though, I would go for a model of the form
y = a*sin(b*x + c) + d*x + e
or a 5th-degree polynomial.
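A sketch of fitting that sinusoidal-plus-trend model to the data above; the initial guesses here are eyeballed from the data (amplitude ~2000, period ~12 samples, baseline ~3000) and are assumptions, not fitted values:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def func(x, a, b, c, d, e):
    return a*np.sin(b*x + c) + d*x + e

p0 = [2000, 2*np.pi/12, 0, 0, 3000]   # eyeballed starting point
popt, pcov = curve_fit(func, x, y, p0=p0)

xp1 = np.linspace(1, 24, 100)
plt.plot(x, y, 'r--', xp1, func(xp1, *popt))
plt.show()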

I know scipy curve_fit can do better

I'm using python/numpy/scipy to implement this algorithm for aligning two digital elevation models (DEMs) based on terrain aspect and slope:
"Co-registration and bias corrections of satellite elevation data sets for quantifying glacier thickness change", C. Nuth and A. Kääb, doi:10.5194/tc-5-271-2011
I have a framework set up, but the quality of the fit provided by scipy.optimize.curve_fit is poor.
import sys
import random
import numpy
import matplotlib.pyplot as plt
from osgeo import gdal
# getperc_new, gdaldem_slope and gdaldem_aspect are helpers defined elsewhere

def f(x, a, b, c):
    y = a * numpy.cos(numpy.deg2rad(b - x)) + c
    return y

def compute_offset(dh, slope, aspect):
    import scipy.optimize as optimization
    idx = random.sample(range(dh.compressed().size), 10000)
    xdata = numpy.array(aspect.compressed()[idx], float)
    ydata = numpy.array((dh/numpy.tan(numpy.deg2rad(slope))).compressed()[idx], float)
    #Generate synthetic data to test curve_fit
    #xdata = numpy.arange(0,360,0.01)
    #ydata = f(xdata, 20.0, 130.0, -3.0) + 20*numpy.random.normal(size=len(xdata))
    print(xdata)
    print(ydata)
    x0 = numpy.array([0.0, 0.0, 0.0])
    fit = optimization.curve_fit(f, xdata, ydata, x0)[0]
    #optimization.leastsq(f, x0[:], args=(xdata, ydata))
    genplot(xdata, ydata, fit)
    return fit

def genplot(x, y, fit):
    a = numpy.arange(0, 360)
    f_a = f(a, fit[0], fit[1], fit[2])
    idx = random.sample(range(x.size), 10000)
    plt.figure()
    plt.xlabel('Aspect (deg)')
    plt.ylabel('dh/tan(slope) (m)')
    plt.plot(x[idx], y[idx], 'r.')
    plt.axhline(color='k')
    plt.plot(a, f_a, 'b')
    plt.ylim(-80, 80)
    plt.show()
#Input DEMs
dem1_fn = sys.argv[1]
dem2_fn = sys.argv[2]
dem1_ds = gdal.Open(dem1_fn, gdal.GA_ReadOnly)
dem2_ds = gdal.Open(dem2_fn, gdal.GA_ReadOnly)
#Extract band 1 from each dataset as masked array using internal nodata value
dem1 = getperc_new.gdal_getma(dem1_ds, 1)
dem2 = getperc_new.gdal_getma(dem2_ds, 1)
#Produce slope and aspect maps using gdaldem and load into masked arrays
dem1_slope = gdaldem_slope(dem1_fn)
dem1_aspect = gdaldem_aspect(dem1_fn)
#Compute common mask and apply to all products
common_mask = dem1.mask + dem2.mask + dem1_slope.mask + dem1_aspect.mask
diff_euler = numpy.ma.array(dem2-dem1, mask=common_mask)
dem1_slope.__setmask__(common_mask)
dem1_aspect.__setmask__(common_mask)
#Compute relationship between elevation difference, slope and aspect
fit = compute_offset(diff_euler, dem1_slope, dem1_aspect)
print(fit)
Here is the fit for my data, which initially consists of ~2 million points but which I've randomly sampled for testing/plotting purposes:
[ -14.9639559 216.01093596 -41.96806735]
There is plenty of data there for a good fit, but the result from curve_fit is poor. When I run with synthetic data, I get a nice fit:
original input parameters [20.0, 130.0, -3.0]
result from curve_fit [-19.66719631 -49.6673076 -3.12198723]
(These are equivalent parameterisations: since a*cos(b - x) = -a*cos(b + 180 - x), the pair (-19.67, -49.67) describes the same curve as (19.67, 130.33).)
Not sure if this has something to do with using masked arrays, a limitation of curve_fit, or if I'm just overlooking something simple. Thanks for any suggestions.
==========================
Edit 9/4/13 16:30 PDT
As suggested by @Evert and others, the problem was definitely related to outliers. I was able to obtain a much better fit after removing them. Looking at my old code, it seems I computed the median absolute deviation (MAD) for each aspect bin, then removed anything outside of 2*mad before fitting.
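That filtering step might look something like the sketch below (a reconstruction of the idea, not the original code; the bin width and threshold here are assumptions):

import numpy as np

def mad_filter(xdata, ydata, bin_width=10, nmad=2):
    #Keep only points within nmad*MAD of the per-bin median
    keep = np.zeros(len(xdata), dtype=bool)
    for lo in np.arange(0, 360, bin_width):
        in_bin = (xdata >= lo) & (xdata < lo + bin_width)
        if not in_bin.any():
            continue
        med = np.median(ydata[in_bin])
        mad = np.median(np.abs(ydata[in_bin] - med))
        keep |= in_bin & (np.abs(ydata - med) <= nmad * mad)
    return xdata[keep], ydata[keep]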
I generated a few additional plots back in November 2012:
But looking at these again, I'm almost positive they were generated for different input data. It's all that I can find right now, so I'm including them here as an example of a case with biased sampling. This method for DEM alignment is bound to fail for cases like these - and it has nothing to do with scipy's curve fitting abilities.
I ended up developing a different approach for alignment involving normalized cross-correlation, sub-pixel refinement, and vertical offset removal for two masked 2D numpy arrays. It is faster and consistently provides better results. Although even that approach has been superseded by an Iterative Closest Point (ICP) tool (pc_align) developed by Oleg Alexandrov as part of the NASA Ames Stereo Pipeline.
Thanks for all of your responses and I apologize for abandoning this question.
If you're just trying to fit a sine wave with a phase offset, you don't need a non-linear fit.
You can replace a * sin(x - b) + c by p * sin(x) + q * cos(x) + c, because any sine with a phase offset can be written as an appropriate combination of a sine and a cosine ("phasor addition", as in a Fourier transform). The model is then linear in p, q and c, so ordinary linear least squares solves it directly.
If that gives the same result, then the non-linearity of the fit is not the problem.
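A sketch of that linear approach, matched to the f(x, a, b, c) model above (aspect x in degrees):

import numpy as np

#Solve y ~ p*cos(x) + q*sin(x) + c with ordinary linear least squares
xr = np.deg2rad(xdata)
A = np.column_stack([np.cos(xr), np.sin(xr), np.ones_like(xr)])
(p, q, c), *_ = np.linalg.lstsq(A, ydata, rcond=None)

#Recover the a*cos(deg2rad(b - x)) + c form: since
#a*cos(b - x) = a*cos(b)*cos(x) + a*sin(b)*sin(x),
a = np.hypot(p, q)
b = np.degrees(np.arctan2(q, p))
print(a, b, c)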
