Python fitting model to curve

I am experimenting with Python to fit curves to a series of data-points, summary below:
From the below, it would seem that polynomials of order greater than 2 are the best fit, followed by linear and finally exponential which has the overall worst outcome.
While I appreciate this might not be exponential growth, I just wanted to know whether you would expect the exponential function to perform so badly (essentially the coefficient of x, b, has been set to 0 and an arbitrary point on the curve has been picked to intersect), or whether I have somehow done something wrong in my code to fit.
The code I'm using to fit is as follows:
# Fitting
def exponenial_func(x, a, b, c):
    return a * np.exp(-b * x) + c

def linear(x, m, c):
    return m * x + c

def quadratic(x, a, b, c):
    return a * x**2 + b * x + c

def cubic(x, a, b, c, d):
    return a * x**3 + b * x**2 + c * x + d

x = np.array(x)
yZero = np.array(cancerSizeMean['levelZero'].values)[start:]
print(len(x))
print(len(yZero))

popt, pcov = curve_fit(exponenial_func, x, yZero, p0=(1, 1, 1))
expZeroFit = exponenial_func(x, *popt)
plt.plot(x, expZeroFit, label='Control, Exponential Fit')

popt, pcov = curve_fit(linear, x, yZero, p0=(1, 1))
linearZeroFit = linear(x, *popt)
plt.plot(x, linearZeroFit, label='Control, Linear')

popt, pcov = curve_fit(quadratic, x, yZero, p0=(1, 1, 1))
quadraticZeroFit = quadratic(x, *popt)
plt.plot(x, quadraticZeroFit, label='Control, Quadratic')

popt, pcov = curve_fit(cubic, x, yZero, p0=(1, 1, 1, 1))
cubicZeroFit = cubic(x, *popt)
plt.plot(x, cubicZeroFit, label='Control, Cubic')
Edit: curve_fit is imported from the scipy.optimize package:
from scipy.optimize import curve_fit

curve_fit tends to perform poorly if you give it a poor initial guess with functions like the exponential that can end up with very large numbers. You could try altering the maxfev input so that it runs more iterations. Otherwise, I would suggest trying something like:
p0=(1000, -0.005, 0)
The b guess is negative because the data roughly doubles from x=300 to x=500 and your equation uses -b; the a guess of 1000 comes from the value being roughly 3000 at x=300, about 1.5 doublings up from its value at x=0. See how that turns out.
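In code, the suggestion would look roughly like this (a sketch reusing the question's exponenial_func, x and yZero; the exact p0 and maxfev values are rough guesses, not fitted results):

popt, pcov = curve_fit(exponenial_func, x, yZero,
                       p0=(1000, -0.005, 0), maxfev=5000)
expZeroFit = exponenial_func(x, *popt)
plt.plot(x, expZeroFit, label='Control, Exponential Fit')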
As for why the initial exponential fit doesn't work at all: your initial guess is b=1, and x is in the range of roughly (300, 1000). This means Python is calculating exp(-300), which either throws an exception or is effectively set to 0. At that point, whether b is nudged up or down, the exponential is still going to be 0 for any value in the general vicinity of the initial estimate.
Basically, Python uses a numerical method with limited precision, and the exponential estimate went outside the range of values it can handle.
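A quick illustration of the underflow (my own sketch, not from the answer): with b=1 and x between roughly 300 and 1000, the exponential term is numerically negligible, and so is its derivative with respect to b, so the optimizer sees a flat surface and cannot move b.

import numpy as np

x = np.array([300.0, 500.0, 1000.0])
print(np.exp(-1.0 * x))   # roughly [5e-131, 7e-218, 0.0] -- effectively zero
# the gradient with respect to b, -a*x*exp(-b*x), is just as tiny, so changing b
# barely changes the model output and the fit stalls at the initial guess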

I'm not sure how you're fitting the curves -- are you using polynomial least squares? In that case, you'd expect the fit to improve with each additional degree of flexibility, and you choose the power based on diminishing marginal improvement / outside theory.
The improving fit should look something like this.
I actually wrote some code to do Polynomial Least Squares in python for a class a while back, which you can find here on Github. It's a bit hacky though and loosely commented since I was just using it to solve exercises. Hope it's helpful.
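As a minimal illustration of the diminishing-returns idea (a sketch with synthetic data using numpy.polyfit, not the linked Github code):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x**2 - 2 * x + 1 + rng.normal(scale=5, size=x.size)   # quadratic plus noise

for degree in range(1, 5):
    coeffs = np.polyfit(x, y, degree)
    rmse = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(degree, rmse)   # the error drops sharply at degree 2, then barely improves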

Related

Exponential curve fit in python--parameters do not make sense

I'm doing a curve fit in python using scipy.curve_fit, and the fit itself looks great, however the parameters that are generated don't make sense.
The equation is (ax)^b + cx, but with the params python finds a = -c and b = 1, so the whole equation just equals 0 for every value of x.
Here is the plot and my code:
https://i.stack.imgur.com/fBfg7.png
# experimental data
xdata = cfu_u
ydata = OD_u
# x-values to plot for curve fit
min_cfu = 0.1
max_cfu = 9.1
x_vec = pow(10,np.arange(min_cfu,max_cfu,0.1))
# exponential function
def func(x, a, b, c):
    return (a*x)**b + c*x
# curve fit
popt, pcov = curve_fit(func, xdata, ydata)
# plot experimental data and fitted curve
plt.plot(x_vec, func(x_vec, *popt), label = 'curve fit',color='slateblue',linewidth = 2.2)
plt.plot(cfu_u,OD_u,'-',label = 'experimental data',marker='.',markersize=8,color='deepskyblue',linewidth = 1.4)
plt.legend(loc='upper left',fontsize=12)
plt.ylabel("Y",fontsize=12)
plt.xlabel("X",fontsize=12)
plt.xscale("log")
plt.gcf().set_size_inches(7, 5)
plt.show()
print(popt)
[ 1.44930871e+03 1.00000000e+00 -1.44930871e+03]
How can I find the actual parameters?
edit: here is the actual experimental raw data I used: https://pastebin.com/CR2BCJji
The chosen function model is:
y(x) = (ax)^b + cx
In order to understand the difficulty encountered, one first has to compare the behaviour of the function to the data over the range of the lowest values of X.
We see that y(x)=0 is an acceptable fit for the points over a large range (at least 6 decades), considering the scatter. These are the majority of the experimental points (18 points among 27). The function y(x)=0 is obtained from the function model only if b=1, leading to y(x)=(a+c)x, together with a+c=0. At first sight Python seems to give b=1 and c=-a. But we have to look more carefully.
Of course the function y(x)=0 is not suitable for the 9 points at larger X.
This leads one to think that the fit of the whole set of points is an extension of the above fit, with parameter values different from b=1 and a+c=0, but not far from them, so that the fit remains good over the above 18 points.
Conclusion: the actual values of the parameters found by Python are certainly very close to b=1, with a close to 1.44930871e+03 and c close to -1.44930871e+03.
The computation inside Python is certainly carried out with 16 or 18 digits, but only 9 digits are displayed. This is not sufficient to see that b might differ from 1 and that c might differ from -a. This suggests that the clue might be only a matter of displaying enough digits.
Yes, the fit produced by Python looks great. This is a fine performance from the mathematical viewpoint. But the physical significance is doubtful with so many digits essential to the fit over the whole range.
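A quick way to test that hypothesis (a sketch, reusing the popt returned by curve_fit above): print the parameters with more digits than numpy's default 8.

import numpy as np

np.set_printoptions(precision=17)
print(popt)                     # is b exactly 1.0, or only to 9 digits?
print(repr(popt[0] + popt[2]))  # is a + c exactly zero, or just very small?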

Issues fitting gaussian to scatter plot

I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

def gaussian_model(x, a, b, c, d):  # add constant d
    return a*np.exp(-(x-b)**2/(2*c**2)) + d

x = np.linspace(0, 20, 100)
mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A = mu[0]
fit_B = mu[1]
fit_C = mu[2]
fit_D = mu[3]
fit_y = gaussian_model(xdata, fit_A, fit_B, fit_C, fit_D)
print(mu)
plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
plt.show()
Here's the plot
When I printed the parameters, I got values of -17 for amplitude, 2.6 for mean, -2.5 for standard deviation, and 110 for the base. This is very far off from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! Just needed to add some guesses.
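For reference, a sketch of what "adding some guesses" can look like: derive rough p0 values from the data itself (this assumes xdata and ydata hold a single upward peak, as in the question).

import numpy as np

d0 = np.min(ydata)                          # baseline
a0 = np.max(ydata) - d0                     # peak height above the baseline
b0 = xdata[np.argmax(ydata)]                # peak location
c0 = (np.max(xdata) - np.min(xdata)) / 4    # rough width
mu, cov = curve_fit(gaussian_model, xdata, ydata, p0=[a0, b0, c0, d0])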
This is not an answer of the expected kind.
It is an alternative method of fitting a gaussian.
The process is not iterative and does not require initial "guessed" values of the parameters to start, as the usual methods do.
The result is:
The method of calculation is shown below:
The general principle is explained, with examples, in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . It is a linear regression with respect to an integral equation whose solution is the gaussian function.
If one wants a more accurate and/or more specific result according to some specified fitting criterion, one has to use software with a non-linear regression process. One can then use the above result as initial parameter values for a more robust iterative process.
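For instance, a minimal sketch of that second step with scipy (a_est, b_est, c_est and d_est are placeholders for the values produced by the non-iterative method above, which is not reproduced here; gaussian_model, xdata and ydata are as in the question):

popt, pcov = curve_fit(gaussian_model, xdata, ydata,
                       p0=[a_est, b_est, c_est, d_est])  # a_est..d_est: placeholders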

How to best implement curve fitting using the curve_fit() function in python and properly compare different curve equations?

I am trying to understand whether I am carrying out curve fitting correctly and appropriately using the curve_fit() function from scipy in Python. I have a script that iterates through each column of my dataframe, plots the data, fits a curve to it, and then judges the fit of the equation across the entire data set by calculating the average root mean square error across all of the rows.
First, here is my dataframe:
I would have tried to actually provide the data here, but I am not sure how. As an example description of what this data means: I am looking at how pollution levels decay as distance from a factory site increases, with a distance column and each subsequent column representing a different factory (factory "0", factory "1", factory "2", factory "3", etc.). So we see pollution levels recorded at each distance moving away from each factory in 30-meter increments, with 0 meters representing the factory site itself. I am working from the hypothesis that as one moves away from a factory site, the pollution levels decay. The research question I am trying to answer is: from my specific data set, overall, across all of the studied factories, how does pollution decay with increasing distance? Is it linear decay? Exponential decay? Logarithmic decay? Polynomial decay? Inverse-logistic decay? That is what I am trying to find out.
And so I am trying to fit different curve equations to my data to see which type of curve best represents the decay I am looking at. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Un-comment each of these individual curve functions to try them out individually
# def func(x, a, b):
#     return a * x + b
# def func(x, a, b):
#     return -a * np.log(x) + b
# def func(x, a, b, c):
#     return a * np.exp(-b * x) + c
# def func(x, a, b, c):
#     return a / (1 + b*c**(-x))
# def func(x, a, b, c):
#     return a*x**2 + b*x + c

RMSE_list = []
xdata = data['Distance'].to_numpy()
c = range(8)
for i in c:
    # plt.figure(figsize=(3, 3))
    greenspace = str(i)
    ydata = data[greenspace].to_numpy()
    plt.plot(xdata, ydata, 'b-', label='data')
    popt, pcov = curve_fit(func, xdata, ydata)
    plt.plot(xdata, func(xdata, *popt), 'g--')
    # Calculate RMSE
    params, _ = curve_fit(func, xdata, ydata)
    a, b = params[0], params[1]
    yfit1 = a * xdata + b
    rmse = np.sqrt(np.mean((yfit1 - ydata) ** 2))
    RMSE_list.append(rmse)

RMSE_Average = np.mean(RMSE_list)
print("RMSE Average:")
print(RMSE_Average)
So here I am trying out curves for linear decay, logarithmic decay, exponential decay, inverse-logistic decay, and second-degree polynomial decay. I try out each of these curve-fitting equations individually as its own function and fit it to the data from each factory. I then find the RMSE of each fit for each factory and average them to get one average RMSE describing all factories for each curve type. That's all my code does. What I am seeking help with here is whether I am actually doing this right, or whether my approach at least sounds like it is on the right track. This whole field of curve fitting is very new to me, and I am really only going as far as I know how to at my beginner level. The code works, I suppose, but I get some strange outcomes.
When I try this all out I get the following results for each curve fitting attempt:
Overall the idea makes sense to me here, but there are specific issues I encountered that I do not understand. I specifically had issues with the logarithmic, exponential and inverse-logistic curve fittings. For the logarithmic curve attempt I received the following error for each factory:
C:\Users\MyName\AppData\Local\Temp/ipykernel_7788/37186127.py:8: RuntimeWarning: divide by zero encountered in log return -a *np.log(x) + b
Also, just looking at the curve, the fitted logarithmic curve is fitted way below the actual data, it doesn't even touch!
And for the exponential curve attempt I received the following error for each factory:
C:\Users\MyName\AppData\Roaming\Python\Python39\site-packages\scipy\optimize\minpack.py:833: OptimizeWarning: Covariance of the parameters could not be estimated
I am thinking that perhaps a lot of this might be resolved by setting bounds. When I set the parameter bounds for the exponential attempt to bounds=(0, [3., 1., 0.5]), following the curve_fit() documentation from scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), I get the same error message for each factory, but I receive this exponential curve-fitting result:
So the visual fitting looks better, but the average RMSE looks the same, which is confusing me. And for the inverse-logistic fitting, it visually looks like the fitting is good, but the average RMSE is so high, which tells me the fit is really bad.
And so my question here is: how can I improve my approach and code here to most accurately and effectively fit these curves to my data and compare them using "curve_fit()" in python? I am thinking maybe this is a matter of methodically setting the bounds for each curve fitting, but I am not sure.
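One thing worth noting (my observation, not part of the original post): the RMSE in the loop above is computed from yfit1 = a * xdata + b, i.e. the first two fitted parameters plugged into a straight line regardless of which func is active, so it does not measure the error of the plotted curve. A minimal model-agnostic sketch would be:

popt, pcov = curve_fit(func, xdata, ydata)
yfit = func(xdata, *popt)                     # predictions from the fitted curve itself
rmse = np.sqrt(np.mean((yfit - ydata) ** 2))
RMSE_list.append(rmse)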

Scipy.optimize.curve_fit won't fit cosine power law

For several hours now, I have been trying to fit a model to a (generated) dataset as a test case for a problem I've been struggling with. I generated datapoints for the function f(x) = A*cos^n(x)+b, and added some noise. When I try to fit the dataset with this function and curve_fit, I get the error
./tester.py:10: RuntimeWarning: invalid value encountered in power
return Amp*(np.cos(x))**n + b
/usr/lib/python2.7/dist-packages/scipy/optimize/minpack.py:690: OptimizeWarning: Covariance of the parameters could not be estimated category=OptimizeWarning)
The code I'm using to generate the datapoints and fit the model is the following:
#!/usr/bin/env python
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit
from matplotlib.pyplot import figure, show, rc, plot
def f(x, Amp, n, b):
    return np.real(Amp*(np.cos(x))**n + b)
x = np.arange(0, 6.28, 0.01)
randomPart = np.random.rand(len(x))-0.5
fig = figure()
sample = f(x, 5, 2, 5)+randomPart
frame = fig.add_subplot(1,1,1)
frame.plot(x, sample, label="Sample measurements")
popt, pcov = curve_fit(f, x, sample, p0=(1,1,1))
modeldata = f(x, popt[0], popt[1], popt[2])
print(modeldata)
frame.plot(x, modeldata, label="Best fit")
frame.legend()
frame.set_xlabel("x")
frame.set_ylabel("y")
show()
The noisy data is shown - see the image below.
Does any of you have a clue of what's going on? I suspect it has something to do with the power law going into the complex domain, as the real part of the function is nowhere divergent. I have tried returning only the real part of the function, setting realistic bounds in curve_fit and using a numpy array instead of a python list for p0 already as well. I'm running the latest version of scipy available to me, scipy 0.17.0-1.
The problem is the following:
>>> (-2)**1.1
(-2.0386342710747223-0.6623924280875919j)
>>> np.array(-2)**1.1
__main__:1: RuntimeWarning: invalid value encountered in power
nan
Unlike native python floats, numpy doubles usually refuse to take part in operations leading to complex results:
>>> np.sqrt(-1)
__main__:1: RuntimeWarning: invalid value encountered in sqrt
nan
As a quick workaround I suggest adding an np.abs call to your function, and using appropriate bounds for fitting to make sure this doesn't give a spurious fit. If your model is near the truth and your sample (I mean the cosine in your sample) is positive, then adding an absolute value around it should be a no-op (update: I realize this is never the case, see the proper approach below).
def f(x, Amp, n, b):
    return Amp*(np.abs(np.cos(x)))**n + b  # only change here
With this small change I get this:
For reference, the parameters from the fit are (4.96482314, 2.03690954, 5.03709923), compared to the generating values (5, 2, 5).
After giving it a bit more thought I realized that the cosine will always be negative for half your domain (duh). So the workaround I suggested might be a bit problematic, or at least its correctness is non-trivial. On the other hand, thinking of your original formula containing cos(x)^n, with negative values for cos(x) this only makes sense as a model if n is an integer, otherwise you get a complex result. Since we can't solve Diophantine fitting problems, we need to handle this properly.
The most proper way (by which I mean the way that is least likely to bias your data) is this: first do the fitting with a model that converts your data to complex numbers then takes the complex magnitude on output:
def f(x, Amp, n, b):
    return Amp*np.abs(np.cos(x.astype(np.complex128))**n) + b
This is obviously much less efficient than my workaround, since in each fitting step we create a new mesh, and do some extra work both in the form of complex arithmetic and an extra magnitude calculation. This gives me the following fit even with no bounds set:
The parameters are (5.02849409, 1.97655728, 4.96529108). These are close too. However, if we put these values back into the actual model (without np.abs), we get imaginary parts as large as -0.37, which is not overwhelming but significant.
So the second step should be redoing the fit with a proper model---one that has an integer exponent. Take the exponent 2 which is obvious from your fit, and do a new fit with this model. I don't believe any other approach gives you a mathematically sound result. You can also start from the original popt, hoping that it's indeed close to the truth. Of course we could use the original function with some currying, but it's much faster to use a dedicated double-specific version of your model.
from __future__ import print_function
import numpy as np
from scipy.optimize import curve_fit
from matplotlib.pyplot import subplots, show

def f_aux(x, Amp, n, b):
    return Amp*np.abs(np.cos(x.astype(np.complex128))**n) + b

def f_real(x, Amp, n, b):
    return Amp*np.cos(x)**n + b

x = np.arange(0, 2*np.pi, 0.01)  # 2*pi instead of 6.28
randomPart = np.random.rand(len(x)) - 0.5
sample = f(x, 5, 2, 5) + randomPart  # f is the original model function from the question
fig, (frame_aux, frame) = subplots(ncols=2)
for fr in frame_aux, frame:
    fr.plot(x, sample, label="Sample measurements")
    fr.legend()
    fr.set_xlabel("x")
    fr.set_ylabel("y")

# auxiliary fit for n value
popt_aux, pcov_aux = curve_fit(f_aux, x, sample, p0=(1, 1, 1))
modeldata = f(x, *popt_aux)
#print(modeldata)
print('Auxiliary fit parameters: {}'.format(popt_aux))
frame_aux.plot(x, modeldata, label="Auxiliary fit")

# check visually, test if it's close to an integer, but otherwise
n = np.round(popt_aux[1])

# actual fit with integral exponent
popt, pcov = curve_fit(lambda x, Amp, b, n=n: f_real(x, Amp, n, b),
                       x, sample, p0=(popt_aux[0], popt_aux[2]))
modeldata = f(x, popt[0], n, popt[1])
#print(modeldata)
print('Final fit parameters: {}'.format([popt[0], n, popt[1]]))
frame.plot(x, modeldata, label="Best fit")
frame_aux.legend()
frame.legend()
show()
Note that I changed a few things in your code, none of which really affects my point. The figure produced by the code above, i.e. the one that shows both the auxiliary fit and the proper one:
The output:
Auxiliary fit parameters: [ 5.02628994 2.00886409 5.00652371]
Final fit parameters: [5.0288141074549699, 2.0, 5.0009730316739462]
Just to reiterate: while there might be no visual difference between the auxiliary fit and the proper one, only the latter gives a meaningful answer to your problem.

Fitting gaussian to a curve in Python II

I have two lists:
import numpy
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
They make a U shaped curve.
Now I want to fit a gaussian to that curve.
from scipy.optimize import curve_fit
n = len(x)
mean = sum(y)/n
sigma = sum(y - mean)**2/n
def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c
popt, pcov = curve_fit(gaus,x,y,p0=[-1,mean,sigma,-5])
pylab.plot(x,y,'r-')
pylab.plot(x,gaus(x,*popt),'k-')
pylab.show()
I just end up with the noisy original U-shaped curve and a straight horizontal line running through the curve.
I am not sure what the -1 and the -5 represent in the above code but I am sure that I need to adjust them or something else to get the gaussian curve. I have been playing around with possible values but to no avail.
Any ideas?
First of all, your variable sigma is actually variance, i.e. sigma squared --- http://en.wikipedia.org/wiki/Variance#Definition.
This confuses the curve_fit by giving it a suboptimal starting estimate.
Then, your fitting ansatz, gaus, includes an amplitude a and an offset c --- is this what you actually need? And the starting values are a=-1 (a negated bell shape) and offset c=-5. Where do they come from?
Here's what I'd do:
Fix your fitting model. Do you want just a gaussian? Does it need to be normalized? If it does, then the amplitude a is fixed by sigma etc.
Have a look at the actual data: what is the tail (offset), and what is the sign of the amplitude?
If you actually want just a gaussian without any bells and whistles, you might not need curve_fit at all: a gaussian is fully defined by its first two moments, mean and sigma. Calculate them as you do, plot the resulting gaussian over the data, and see whether you're not already all set.
p0 in your call to curve_fit gives the initial guesses for the parameters of your function other than x. In the code above you are telling curve_fit to use -1 as the initial guess for a, -5 as the initial guess for c, mean as the initial guess for x0, and sigma as the guess for sigma. The curve_fit function will then adjust these parameters to try to get a better fit. The problem is that your initial guesses at the function parameters are really bad given the scale of your (x, y) values.
Think a little about the order of magnitude of the different parameters of the Gaussian. a should be around the size of your y values (10**-16), since at the peak of the Gaussian the exponential part is never larger than 1. x0 gives the position within your x values at which the exponential part of the Gaussian equals 1, so x0 should be around 7500, probably somewhere near the centre of your data. sigma indicates the width, or spread, of the Gaussian, so perhaps something in the hundreds, just as a guess. Finally, c is just an offset to shift the whole Gaussian up and down.
What I would recommend doing is, before fitting the curve, pick some values for a, x0, sigma, and c that seem reasonable and just plot the data together with the Gaussian, then play with a, x0, sigma, and c until you get something that looks at least somewhat like the way you want the Gaussian to fit. Then use those as the starting points for the p0 values in curve_fit. The values I gave should get you started, but may not do exactly what you want. For instance, a probably needs to be negative if you want to flip the Gaussian to get a "U" shape.
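A rough sketch of that eyeballing step (my own illustration, not from the original answer), deriving order-of-magnitude guesses from the data and checking them visually before fitting:

import numpy

a0 = numpy.min(y) - numpy.max(y)   # negative amplitude, since the data is a "U" (a dip)
x0_0 = x[numpy.argmin(y)]          # centre of the dip
sigma0 = 150.0                     # width guess: "somewhere in the hundreds"
c0 = numpy.max(y)                  # offset near the top of the "U"

pylab.plot(x, y, 'r-')
pylab.plot(x, gaus(x, a0, x0_0, sigma0, c0), 'b--')   # eyeball check before fitting
pylab.show()
# once the dashed curve looks roughly right, reuse the values as p0:
# popt, pcov = curve_fit(gaus, x, y, p0=[a0, x0_0, sigma0, c0])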
Also, printing out the values that curve_fit thinks are good for your a, x0, sigma, and c might help you see what it is doing and whether that function is on the right track to minimizing the residual of the fit.
I have had similar problems doing curve fitting with gnuplot: if the initial values are too far from what you want to fit, it goes in completely the wrong direction with the parameters while minimizing the residuals, and you could probably do better by eye. Think of these functions as a way to fine-tune your by-eye estimates of the parameters.
hope that helps
I don't think you are estimating your initial guesses for mean and sigma correctly.
Take a look at the SciPy Cookbook here
I think it should look like this.
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x*y)/sum(y)
sigma = numpy.sqrt(abs(sum((x-mean)**2*y)/sum(y)))

def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[-max(y), mean, sigma, min(x)+((max(x)-min(x)))/2])
pylab.plot(x, gaus(x, *popt))
If anyone has a link to a simple explanation of why these are the correct moments, I would appreciate it. I am going on faith that the SciPy Cookbook got it right.
Here is the solution, thanks to everyone.
x = numpy.array([7250, ... list of 600 ints ... ,7849])
y = numpy.array([2.4*10**-16, ... list of 600 floats ... , 4.3*10**-16])
n = len(x)
mean = sum(x)/n
sigma = math.sqrt(sum((x-mean)**2)/n)

def gaus(x, a, x0, sigma, c):
    return a*numpy.exp(-(x-x0)**2/(2*sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[-max(y), mean, sigma, min(x)+((max(x)-min(x)))/2])
pylab.plot(x, gaus(x, *popt))
Maybe it is because I use Matlab with fminsearch, or because my fits have to work on far fewer data points (~5-10), but I get much better results with the following starting values (simple as they are):
a = max(y)-min(y);
imax= find(y==max(y),1);
mean = x(imax);
avg = sum(x.*y)./sum(y);
sigma = sqrt(abs(sum((x-avg).^2.*y) ./ sum(y)));
c = min(y);
The sigma works fine.
