How to improve 4-parameter logistic regression curve_fit?

How to improve 4-parameter logistic regression curve_fit? - python

I am trying to fit a 4 parameter logistic regression to a set of data points in python with scipy.curve_fit. However, the fit is quite bad, see below:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.optimize import curve_fit
def fit_4pl(x, a, b, c, d):
return a + (d-a)/(1+np.exp(-b*(x-c)))
x=np.array([2000. , 1000. , 500. , 250. , 125. , 62.5, 2000. , 1000. ,
500. , 250. , 125. , 62.5])
y=np.array([1.2935, 0.9735, 0.7274, 0.3613, 0.1906, 0.104 , 1.3964, 0.9751,
0.6589, 0.353 , 0.1568, 0.0909])
#initial guess
p0 = [y.min(), 0.5, np.median(x), y.max()]
#fit 4pl
p_opt, cov_p = curve_fit(fit_4pl, x, y, p0=p0, method='dogbox')
#get optimized model
a_opt, b_opt, c_opt, d_opt = p_opt
x_model = np.linspace(min(x), max(x), len(x))
y_model = fit_4pl(x_model, a_opt, b_opt, c_opt, d_opt)
#plot
plt.figure()
sns.scatterplot(x=x, y=y, label="measured data")
plt.title(f"Calibration curve of {i}")
sns.lineplot(x=x_model, y=y_model, label='fit')
plt.legend(loc='best')
This gives me the following parameters and graph:
[2.46783333e-01 5.00000000e-01 3.75000000e+02 1.14953333e+00]
Graphh of 4PL fit overlaid with measured data
This fit is clearly terrible, but I do not know how to improve this. Please help.
I've tried different initial guesses. This has resulted in either the fit shown above or no fit at all (either warnings.warn('Covariance of the parameters could not be estimated' error or Optimal parameters not found: Number of calls to function has reached maxfev = 1000)
I looked at this similar question and the graph produced in the accepted solution is what I'm aiming for. I have attempted to manually assign some bounds but I do not know what I'm doing and there was no discernible improvement.

In the usual nonlinear regresion softwares an iterative calculus is involved which requires initial values of the parameters to start. The initial values have to be not too far from the unknown correct values.
Possibly the trouble comes the "guessing" process of the initial values which might be not good enough.
In order to test this hypothesis one will try an unusal method of regression which is not iterative thus which doesn't require initial values. This is shown below.
The calculus is straightforward (with MathCad) and the results are shown below.
One have to take care about the notations which are not the same in your equation than in the above equation. In order to avoid confusion capital letters will be used in your equation while lowcase letters are used above. The relationship between them is :
Note that the sign in front of the exponential is changed to be compatible with negative c found above.
Try your nonlinear regression in starting with the above values A, B, C, D.
You should find values of parameters close to the above values but not exactly the same. This is due to the criteria of fitting (LMSE or LMSRE or LMSAE or other) implemented in your software which is different from the criteria of fitting used in the above method.
For general explanation about the method used above : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrale

Related

Exponential decay curve fitting with scipy.optimize

I am trying to fit a curve with the curve_fit function in SciPy. By changing the inital values of the model the quality of the fit is changing but I am not able to find the best fit through my data. Here is how my fit looks like
My question is how can I improve this fit and what is the best way of selecting the initial values of the model.
I have attached the raw data which I want to fit an exponential curve to it.
This is the data which I am using
y = [ 338.52656636 337.43934446 348.25434126 308.42768639 279.24436171
269.85992004 279.24436171 249.25992615 239.53215125 219.96215705
220.41993469 220.30549028 220.30549028 195.07049776 180.364391
171.20883816 180.24994659 180.13550218 180.47883541 209.89104892
220.19104587 180.02105777 595.45426801 324.50712607 150.60884426
170.97994934 171.20883816 170.75106052 170.75106052 159.76439711
140.88106937 150.37995544 140.88106937 1620.70451979 140.42329173
150.37995544 140.53773614 284.68047121 1146.84743797 170.97994934
150.60884426 145.74495682 141.10995819 121.53996399 121.19663076
131.38218329 170.40772729 140.42329173 140.82384716 145.5732902
140.30884732 121.53996399 700.39979247 2783.74584185 131.26773888
140.76662496 140.53773614 121.76885281 126.23218482 130.69551683]
and here is my code:
from numpy import arange
from pandas import read_csv
from scipy.optimize import curve_fit
from matplotlib import pyplot
def expDecay(t, Amax, tau):
return Amax/tau*np.exp(-t/tau)
Amax = []
Tau = []
ydata = y
x = array(range(len(y)))
xdata = x
popt, pcov = curve_fit(expDecay, x, y,
p0=(10000, 5),
bounds=([0., 2.], [10000., 30]),)
Amax.append(popt[0])
Tau.append(popt[1])
plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.');
plt.plot(ydata);
plt.ylim([0, 500])
plt.show()

The deviation is due to the outliers. After eliminating them :
Note about eliminating the outliers.
Since the definition of outlier is subjective a software able to do this will probably be more or less interactive. I built my own very rudimentary software. The principle is :
A first nonlinear regression is done with all the points. With the function and parameters obtained the values of y are computed for each point. The absolute difference between the "y computed" and the "y values" from the given data file are compared. This allows to eliminate the point the further away.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on until a specified criteria be reached to stop. That is the subjective part.
With your data (60 points) the point n.54 was eliminated first. Then the point n.34, then n.39 and so on. The process stops after eliminating 6 points. Eliminating more points doesn't improve much the LMSE.
The curve above is the result of the last nonlinear regression with the 54 remaining points.

Issues fitting gaussian to scatter plot

I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
def gaussian_model(x, a, b, c, d): # add constant d
return a*np.exp(-(x-b)**2/(2*c**2))+d
x = np.linspace(0, 20, 100)
mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A = mu[0]
fit_B = mu[1]
fit_C = mu[2]
fit_D = mu[3]
fit_y = gaussian_model(xdata, fit_A, fit_B, fit_C, fit_D)
print(mu)
plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
plt.show()
Here's the plot
When I printed the parameters, I got values of -17 for amplitude, 2.6 for mean, -2.5 for standard deviation, and 110 for the base. This is very far off from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! Just needed to add some guesses.

This is not an answer as expected.
This is an alternative method of fitting gaussian.
The process is not iteratif and doesn't requier initial "guessed" values of the parameters to start as in the usual methods.
The result is :
The method of calculus is shown below :
The general principle is explained with examples in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . This is a linear regression wrt an integral equation which solution is the gaussian function.
If one want more accurate and/or more specific result according to some specified criteria of fitting, one have to use a software with non-linear regression process. Then one can use the above result as initial values of parameters for a more robust iterative process.

how to optimise fitting of gauss-hermite function in python?

What I have done so far:
I am trying to fit noisy data (which I generated myself by adding random noise to my function) to Gauss-Hermite function that I have defined. It works well in some cases for lower values of h3 and h4 but every once in a while it will produce a really bad fit even for lower h3, h4 values, and for higher h3, h4 values, it always gives a bad fit.
My code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as mpl
import matplotlib.pyplot as plt
# Let's define the Gauss-Hermite function
#a=amplitude, x0=location of peak, sig=std dev, h3, h4
def gh_func(x, a, x0, sig, h3, h4):
return a*np.exp(-.5*((x-x0)/sig)**2)*(1+h3*((-np.sqrt(3))*((x-x0)/sig)+(2/np.sqrt(3))*((x-x0)/sig)**3)+ h4*((np.sqrt(6)/4)+(-np.sqrt(6))*((x-x0)/sig)**2+(np.sqrt(6)/3)*(((x-x0)/sig)**4)))
#generate clean data
x = np.linspace(-10, 20, 100)
y = gh_func(x, 10, 5, np.sqrt(3), -0.10,-0.03) #it gives okay fit for h3=-0.10, h4=-0.03 but bad fits for higher values like h3=-0.4 and h4=-0.3.
#add noise to data
noise=np.random.normal(0,np.sqrt(0.5),size=len(x))
yn = y + noise
fig = mpl.figure(1)
ax = fig.add_subplot(111)
ax.plot(x, y, c='k', label='analytic function')
ax.scatter(x, yn, s=5, label='fake noisy data')
fig.savefig('model_and_noise_h3h4.png')
# Executing curve_fit on noisy data
popt, pcov = curve_fit(gh_func, x, yn)
#popt returns the best fit values for parameters of the given model (func)
print('Fitted Parameters (Gaus_Hermite):\na = %.10f , x0 = %.10f , sig = %.10f\nh3 = %.10f , h4 = %.10f' \
%(popt[0],popt[1],popt[2],popt[3],popt[4]))
ym = gh_func(x, popt[0], popt[1], popt[2], popt[3], popt[4])
ax.plot(x, ym, c='r', label='Best fit')
ax.legend()
fig.savefig('model_fit_h3h4.png')
plt.legend(loc='upper left')
plt.xlabel("v")
plt.ylabel("f(v)")
What I want to do:
I want to find a better fitting method than just curve_fit from scipy.optimize but I am not sure what I can use. Even if we end up using curve_fit, I need a way to produce better fits by providing initial guesses for the parameters which are generated automatically, e.g. one approach for only single peak gaussian is described in the accepted answer of this post (Jean Jacquelin's method):gaussian fitting inaccurate for lower peak width using Python. But this is just for mu,sigma and amplitude not h3,h4.
Besides curve_fit from scipy.optimize, I think there's one called lmfit: https://lmfit.github.io/lmfit-py/ but I am not sure how I will implement it in my code. I do not want to use manual initial guesses for the parameters. I want to be able to find the fitting automatically.
Thanks a lot!

Using lmfit for this fitting problem would be straightforward, by creating a lmfit.Model from your gh_func, with something like
from lmfit import Model
gh_model = Model(gh_func)
params = gh_model.make_params(x0=?, a=?, sig=?, h3=?, h4=?)
where those question marks would have to be initial values for the parameters in question.
Your use of scipy.optimize.curve_fit does not provide initial values for the variables. Unfortunately, this does not raise an error or give an indication of a problem because the authors of scipy.optimize.curve_fit have lied to you by making initial values optional. Initial values for all parameters are always required for all non-linear least-squares analyses. It is not a choice of the implementation, it is a feature of the mathematical algorithm. What curve_fit hides from you is that leaving p0=None makes all initial values 1.0. Whether that is remotely appropriate depends on the model and data being fit - it cannot be always reasonable. As an example, if the x values for your data extended from 9000 to 9500, and the peak function was centered around 9200, starting with x0 of 1 would almost certainly not find a suitable fit. It is always deceptive to imply in any way that initial values are optional. They just are not. Not sometimes, not for well-defined problems. Initial values are never, ever optional.
Initial values don't have to be perfect, but they need to be of the right scale. Many people will warn you about "false minima" - getting stuck at a solution that is not bad but not "best". My experience is that the more common problem people run into is initial values that are so far off that the model is just not sensitive to small changes in their values, and so can never be optimized (with x0=1,sig=1 for data with x on [9000, 9500] and centered at 9200 being exactly in that category).
The answer to your question for your essential question of "providing initial guesses for the parameters which are generated automatically" is hard to answer without knowing the properties of your model function. You might be able to use some heuristics to guess values from a data set. Lmfit has such heuristic "guess parameter values from data" functions for many peak-like functions. You probably have some sense for the physical (or other domain) for h3 and h4 and know what kinds of values are reasonable, and can probably give better initial values than h3=h4=1. It might be that you want to start with guessing parameters as if the data was simple Gaussian (say, using lmfit.Model.guess_from_peak()) and then use the difference of that and your data to get the scale for h3 and h4.

Problem while fitting a set of data points to arbitrary curve with Scipy

I have a set of data points which, according to the model I want to implement, could be modelled with a certain curve (in this case, a product between an exponential and a complementary error function).
For fitting these data into such a curve, I tried:
import numpy as np
from scipy.optimize import curve_fit
from scipy import special
x_fit = np.linspace(0,1,1000)
def fitted_function(x_fit, c, d, S):
return c*np.exp(((S*d/2)**2)-x_fit*d)*special.erfc(S*d/2-x_fit/S)
FitParameters, FitCovariance = curve_fit(fitted_function, x_data, y_data, maxfev = 100000)
It does not give me any particular error, but the result of the fitting is evidently wrong. I strongly suspect that it has to do with the the part x_fit/S, where the fitting parameter S appears as a denominator.
For example, I encounter the same problem while fitting a simple exponential: if I define the fitting curve with
return a*np.exp(-x_fit/b)
with a, b fitting parameters; since the fitting parameter b appears as a denominator, I find the same problem (i.e. the resulting fitted curve is a horizontal line for some reason).
For the case of a simple exponential I can simple bypass this by doing
return a*np.exp(-b*x_fit)
so that b is not a denominator anymore and the fitted curve is really an exponential curve. For my current case, instead, I cannot do this since S appears ad a numerator and a denominator in different part of the expression.
Any ideas? Thank you in advance!

Gaussian fit in Python - parameters estimation

I want to fit an array of data (in the program called "data", of size "n") with a Gaussian function and I want to get the estimations for the parameters of the curve, namely the mean and the sigma. Is the following code, which I found on the Web, a fast way to do that? If so, how can I actually get the estimated values of the parameters?
import pylab as plb
from scipy.optimize import curve_fit
from scipy import asarray as ar,exp
x = ar(range(n))
y = data
n = len(x) #the number of data
mean = sum(x*y)/n #note this correction
sigma = sum(y*(x-mean)**2)/n #note this correction
def gaus(x,a,x0,sigma,c):
return a*exp(-(x-x0)**2/(sigma**2))+c
popt,pcov = curve_fit(gaus,x,y,p0=[1,mean,sigma,0.0])
print popt
print pcov
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit')
plt.xlabel('q')
plt.ylabel('data')
plt.show()

To answer your first question, "Is the following code, which I found on the Web, a fast way to do that?"
The code that you have is in fact the right way to proceed with fitting your data, when you believe is Gaussian and know the fitting function (except change the return function to
a*exp(-(x-x0)**2/(sigma**2)).
I believe for a Gaussian function you don't need the constant c parameter.
A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit.
To answer your second question, "If so, how can I actually get the estimated values of the parameters?"
You can go to the link provided for scipy.optimize.curve_fit and find that the best fit parameters reside in your popt variable. In your example, popt will contain the mean and sigma of your data. In addition to the best fit parameters, pcov will contain the covariance matrix, which will have the errors of your mean and sigma. To obtain 1sigma standard deviations, you can simply use np.sqrt(pcov) and obtain the same.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.