How to optimise fitting of a Gauss-Hermite function in Python?

What I have done so far:
I am trying to fit noisy data (which I generated myself by adding random noise to my function) to a Gauss-Hermite function that I have defined. It works well in some cases for lower values of h3 and h4, but every once in a while it produces a really bad fit even for lower h3, h4 values, and for higher h3, h4 values it always gives a bad fit.
My code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as mpl
import matplotlib.pyplot as plt
# Let's define the Gauss-Hermite function
# a=amplitude, x0=location of peak, sig=std dev, h3, h4=Gauss-Hermite coefficients
def gh_func(x, a, x0, sig, h3, h4):
    w = (x - x0) / sig  # standardised coordinate
    H3 = -np.sqrt(3)*w + (2/np.sqrt(3))*w**3                   # third-order Hermite term
    H4 = np.sqrt(6)/4 - np.sqrt(6)*w**2 + (np.sqrt(6)/3)*w**4  # fourth-order Hermite term
    return a*np.exp(-.5*w**2)*(1 + h3*H3 + h4*H4)
#generate clean data
x = np.linspace(-10, 20, 100)
y = gh_func(x, 10, 5, np.sqrt(3), -0.10,-0.03) #it gives okay fit for h3=-0.10, h4=-0.03 but bad fits for higher values like h3=-0.4 and h4=-0.3.
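#add noise to data
noise = np.random.normal(0, 0.5, size=x.shape)  # assumed noise model; the post does not show how the noise was generated
yn = y + noise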
fig = mpl.figure(1)
ax = fig.add_subplot(111)
ax.plot(x, y, c='k', label='analytic function')
ax.scatter(x, yn, s=5, label='fake noisy data')
# Executing curve_fit on noisy data
popt, pcov = curve_fit(gh_func, x, yn)
#popt returns the best fit values for parameters of the given model (func)
print('Fitted Parameters (Gauss-Hermite):\na = %.10f , x0 = %.10f , sig = %.10f\nh3 = %.10f , h4 = %.10f'
      % tuple(popt))
ym = gh_func(x, popt[0], popt[1], popt[2], popt[3], popt[4])
ax.plot(x, ym, c='r', label='Best fit')
plt.legend(loc='upper left')
What I want to do:
I want to find a better fitting method than just curve_fit from scipy.optimize, but I am not sure what I can use. Even if we end up using curve_fit, I need a way to produce better fits by providing automatically generated initial guesses for the parameters. For example, one approach for a single-peak Gaussian is described in the accepted answer of this post (Jean Jacquelin's method): "gaussian fitting inaccurate for lower peak width using Python". But this covers only mu, sigma and amplitude, not h3 and h4.
Besides curve_fit from scipy.optimize, I think there is a package called lmfit, but I am not sure how I would use it in my code. I do not want to supply manual initial guesses for the parameters. I want the fitting to work automatically.
Thanks a lot!

Using lmfit for this fitting problem would be straightforward, by creating a lmfit.Model from your gh_func, with something like
from lmfit import Model
gh_model = Model(gh_func)
params = gh_model.make_params(x0=?, a=?, sig=?, h3=?, h4=?)
where those question marks would have to be initial values for the parameters in question.
Your use of scipy.optimize.curve_fit does not provide initial values for the variables. Unfortunately, this does not raise an error or give an indication of a problem because the authors of scipy.optimize.curve_fit have lied to you by making initial values optional. Initial values for all parameters are always required for all non-linear least-squares analyses. It is not a choice of the implementation, it is a feature of the mathematical algorithm. What curve_fit hides from you is that leaving p0=None makes all initial values 1.0. Whether that is remotely appropriate depends on the model and data being fit - it cannot be always reasonable. As an example, if the x values for your data extended from 9000 to 9500, and the peak function was centered around 9200, starting with x0 of 1 would almost certainly not find a suitable fit. It is always deceptive to imply in any way that initial values are optional. They just are not. Not sometimes, not for well-defined problems. Initial values are never, ever optional.
Initial values don't have to be perfect, but they need to be of the right scale. Many people will warn you about "false minima" - getting stuck at a solution that is not bad but not "best". My experience is that the more common problem people run into is initial values that are so far off that the model is just not sensitive to small changes in their values, and so can never be optimized (with x0=1,sig=1 for data with x on [9000, 9500] and centered at 9200 being exactly in that category).
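The scale problem described above can be reproduced with a short sketch. The data below are synthetic and the plain Gaussian `gauss` is a stand-in chosen for illustration (neither comes from the question); the point is only to compare curve_fit started from its default of all ones with the same fit started from values read off the data.

```python
import warnings
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, a, x0, sig):
    return a * np.exp(-0.5 * ((x - x0) / sig) ** 2)

# Synthetic data (assumed for illustration): a peak at 9200 on x in [9000, 9500]
rng = np.random.default_rng(1)
x = np.linspace(9000, 9500, 251)
y = gauss(x, 5.0, 9200.0, 30.0) + rng.normal(0, 0.1, x.size)

# 1) No p0: curve_fit starts from a = x0 = sig = 1, where the model is
#    numerically zero over the whole data range
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the covariance warning in this demo
    try:
        popt_default, _ = curve_fit(gauss, x, y)
    except RuntimeError:
        popt_default = None  # the optimizer gave up

# 2) p0 of the right scale, taken from the data itself
p0 = [y.max(), x[np.argmax(y)], (x.max() - x.min()) / 10]
popt_scaled, _ = curve_fit(gauss, x, y, p0=p0)

print("default start:", popt_default)
print("scaled start: ", popt_scaled)
```

With the default start the model does not respond to small parameter changes, so the optimizer has nothing to follow; the second call only needs guesses of roughly the right magnitude.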
Your essential question of "providing initial guesses for the parameters which are generated automatically" is hard to answer without knowing the properties of your model function. You might be able to use some heuristics to guess values from a data set. Lmfit has such heuristic "guess parameter values from data" functions for many peak-like functions. You probably have some sense of the physical (or other domain) meaning of h3 and h4, know what kinds of values are reasonable, and can probably give better initial values than h3=h4=1. It might be that you want to start by guessing parameters as if the data were a simple Gaussian (say, using lmfit.models.guess_from_peak()) and then use the difference between that and your data to get the scale for h3 and h4.
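One way to generate starting values automatically, sketched here with NumPy and SciPy only, is to treat the data as a plain Gaussian, estimate amplitude, centre and width from weighted moments, and start h3 and h4 at zero. The helper name `guess_gh_params`, the 20% threshold and the noise level are assumptions made for this illustration, not part of the question or of lmfit.

```python
import numpy as np
from scipy.optimize import curve_fit

def gh_func(x, a, x0, sig, h3, h4):
    w = (x - x0) / sig
    H3 = -np.sqrt(3)*w + (2/np.sqrt(3))*w**3
    H4 = np.sqrt(6)/4 - np.sqrt(6)*w**2 + (np.sqrt(6)/3)*w**4
    return a*np.exp(-0.5*w**2)*(1 + h3*H3 + h4*H4)

def guess_gh_params(x, y):
    """Hypothetical helper: moment-based guesses, as if the data were a Gaussian."""
    # Use only points well above the noise floor as weights (assumed 20% threshold)
    w = np.where(y > 0.2 * y.max(), y, 0.0)
    x0 = np.sum(x * w) / np.sum(w)                        # weighted mean
    sig = np.sqrt(np.sum(w * (x - x0) ** 2) / np.sum(w))  # weighted std dev
    return [y.max(), x0, sig, 0.0, 0.0]                   # h3 = h4 = 0 to start

# Synthetic noisy data with the parameters used in the question
rng = np.random.default_rng(0)
x = np.linspace(-10, 20, 100)
yn = gh_func(x, 10, 5, np.sqrt(3), -0.10, -0.03) + rng.normal(0, 0.3, x.size)

p0 = guess_gh_params(x, yn)
popt, pcov = curve_fit(gh_func, x, yn, p0=p0)
print("initial guess:", np.round(p0, 3))
print("fitted:       ", np.round(popt, 3))
```

Starting h3 and h4 at zero keeps the initial model a pure Gaussian; for strongly non-Gaussian profiles the moment estimates will be biased, so this is a starting point rather than a guarantee of convergence.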


How to improve 4-parameter logistic regression curve_fit?

I am trying to fit a 4-parameter logistic regression to a set of data points in Python with scipy.optimize.curve_fit. However, the fit is quite bad, see below:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.optimize import curve_fit
def fit_4pl(x, a, b, c, d):
    return a + (d-a)/(1+np.exp(-b*(x-c)))
x=np.array([2000. , 1000. , 500. , 250. , 125. , 62.5, 2000. , 1000. ,
            500. , 250. , 125. , 62.5])
y=np.array([1.2935, 0.9735, 0.7274, 0.3613, 0.1906, 0.104 , 1.3964, 0.9751,
            0.6589, 0.353 , 0.1568, 0.0909])
#initial guess
p0 = [y.min(), 0.5, np.median(x), y.max()]
#fit 4pl
p_opt, cov_p = curve_fit(fit_4pl, x, y, p0=p0, method='dogbox')
#get optimized model
a_opt, b_opt, c_opt, d_opt = p_opt
x_model = np.linspace(min(x), max(x), len(x))
y_model = fit_4pl(x_model, a_opt, b_opt, c_opt, d_opt)
sns.scatterplot(x=x, y=y, label="measured data")
plt.title("Calibration curve")  # the original title interpolated a loop variable i that is not defined in this snippet
sns.lineplot(x=x_model, y=y_model, label='fit')
This gives me the following parameters and graph:
[2.46783333e-01 5.00000000e-01 3.75000000e+02 1.14953333e+00]
Graph of 4PL fit overlaid with measured data
This fit is clearly terrible, but I do not know how to improve this. Please help.
I've tried different initial guesses. This has resulted in either the fit shown above or no fit at all (either the warning "Covariance of the parameters could not be estimated" or the error "Optimal parameters not found: Number of calls to function has reached maxfev = 1000").
I looked at this similar question and the graph produced in the accepted solution is what I'm aiming for. I have attempted to manually assign some bounds but I do not know what I'm doing and there was no discernible improvement.
In the usual nonlinear regression software, an iterative calculation is involved, which requires initial values of the parameters to start. The initial values must not be too far from the unknown correct values.
Possibly the trouble comes from the "guessing" of the initial values, which might not be good enough.
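A quick way to check whether the initial values are the problem is to measure how much the model responds to a small change in c at the starting point. The sketch below uses the question's data and model; the helper `c_sensitivity` and the rule of thumb b ≈ 4/(range of x) are assumptions introduced here. It calls scipy.optimize.least_squares, the routine behind curve_fit's 'dogbox' method, directly because it returns a result object instead of raising when the evaluation limit is hit.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_4pl(x, a, b, c, d):
    return a + (d - a) / (1 + np.exp(-b * (x - c)))

x = np.array([2000., 1000., 500., 250., 125., 62.5] * 2)
y = np.array([1.2935, 0.9735, 0.7274, 0.3613, 0.1906, 0.104,
              1.3964, 0.9751, 0.6589, 0.353, 0.1568, 0.0909])

def c_sensitivity(p, eps=1.0):
    """Hypothetical helper: largest change in the model when c is shifted by eps."""
    a, b, c, d = p
    return np.max(np.abs(fit_4pl(x, a, b, c + eps, d) - fit_4pl(x, a, b, c, d)))

p0_original = [y.min(), 0.5, np.median(x), y.max()]
# Assumed heuristic: choose b so the transition spans roughly the range of x
p0_scaled = [y.min(), 4 / (x.max() - x.min()), np.median(x), y.max()]

print("sensitivity to c, original p0:", c_sensitivity(p0_original))
print("sensitivity to c, scaled p0:  ", c_sensitivity(p0_scaled))

def residuals(p):
    return fit_4pl(x, *p) - y

res = least_squares(residuals, p0_scaled, x_scale="jac")
sse_start = np.sum(residuals(p0_scaled) ** 2)
sse_fit = np.sum(residuals(res.x) ** 2)
print("parameters:", res.x, "SSE:", sse_start, "->", sse_fit)
```

With b = 0.5 and x values hundreds of units apart, the exponential saturates at every data point, so the model is a step function and the optimizer sees no gradient in b or c. Whether a logistic in linear x is the right shape for these data is a separate question that this sketch does not settle.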
To test this hypothesis, one can try an unusual regression method that is not iterative and thus does not require initial values. This is shown below.
The calculation is straightforward (with MathCad) and the results are shown below.
Take care with the notation, which is not the same in your equation as in the above equation. To avoid confusion, capital letters are used in your equation while lowercase letters are used above. The relationship between them is:
Note that the sign in front of the exponential is changed to be compatible with the negative c found above.
Try your nonlinear regression starting with the above values A, B, C, D.
You should find parameter values close to the above values but not exactly the same. This is because the fitting criterion (LMSE or LMSRE or LMSAE or other) implemented in your software differs from the criterion used in the above method.
For a general explanation of the method used above:

Issues fitting gaussian to scatter plot

I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt
def gaussian_model(x, a, b, c, d): # add constant d
    return a*np.exp(-(x-b)**2/(2*c**2))+d
x = np.linspace(0, 20, 100)
# xdata, ydata hold the measured points (their definition is not shown in the post)
mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A = mu[0]
fit_B = mu[1]
fit_C = mu[2]
fit_D = mu[3]
fit_y = gaussian_model(x, fit_A, fit_B, fit_C, fit_D)  # evaluate on the plotting grid x
plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
Here's the plot
When I printed the parameters, I got values of -17 for amplitude, 2.6 for mean, -2.5 for standard deviation, and 110 for the base. This is very far off from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! Just needed to add some guesses.
This is not the kind of answer you expected.
This is an alternative method of fitting a Gaussian.
The process is not iterative and doesn't require initial "guessed" values of the parameters to start, as the usual methods do.
The result is :
The method of calculation is shown below :
The general principle is explained with examples in . This is a linear regression with respect to an integral equation whose solution is the Gaussian function.
If one wants a more accurate and/or more specific result according to some specified fitting criterion, one has to use software with a non-linear regression process. One can then use the above result as initial values of the parameters for a more robust iterative process.
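The idea of a linear regression on an integral equation can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact procedure of the figures above: it assumes a zero baseline (d = 0), and the helper names `cumtrapz0` and `gaussian_start_values` are made up here. A Gaussian y = a·exp(-(x-b)²/(2c²)) satisfies y' = -(x-b)·y/c², so integrating from the first point gives y - y₁ = A·∫x·y dx + B·∫y dx with A = -1/c² and B = b/c², which is linear in A and B.

```python
import numpy as np

def cumtrapz0(f, x):
    """Cumulative trapezoidal integral of f over x, starting at 0."""
    return np.concatenate(([0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * np.diff(x))))

def gaussian_start_values(x, y):
    """Non-iterative estimate of (a, b, c) for a*exp(-(x-b)**2/(2*c**2)); zero baseline assumed."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    S = cumtrapz0(y, x)       # running integral of y
    T = cumtrapz0(x * y, x)   # running integral of x*y
    # Linear regression: y - y[0] = A*T + B*S, with A = -1/c**2 and B = b/c**2
    M = np.column_stack([T, S])
    (A, B), *_ = np.linalg.lstsq(M, y - y[0], rcond=None)
    c = np.sqrt(-1.0 / A)     # requires A < 0, i.e. a genuine peak
    b = -B / A
    g = np.exp(-(x - b) ** 2 / (2 * c ** 2))
    a = np.sum(y * g) / np.sum(g * g)  # amplitude by linear least squares
    return a, b, c

# Synthetic test data (assumed values)
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 200)
y = 3.0 * np.exp(-(x - 8.0) ** 2 / (2 * 2.0 ** 2)) + rng.normal(0, 0.01, x.size)

a0, b0, c0 = gaussian_start_values(x, y)
print(a0, b0, c0)
```

The returned values can be passed as p0 to curve_fit; a model with a constant offset d would need an extra term in the regression, which is left out of this sketch.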

How to estimate confidence-intervals beyond the current simulated step, based on existing data for 1,000,000 monte-carlo simulations?

I have a program which generates 600 random numbers per "step".
These 600 numbers are fed into a complicated algorithm, which then outputs a single value (which can be positive or negative) for that "step"; let's call this Value-X for that step.
This Value-X is then added to a Global-Value-Y, making the latter a running sum of each step in the series.
I have essentially run this simulation 1,000,000 times, recording the values of Global-Value-Y at each step in those simulations.
I have "calculated confidence intervals" from those one-million simulations, by sorting the simulations by (the absolute value of) their Global-Value-Y at each column, and finding the 90th percentile, the 99th percentile, etc.
What I want to do:
Using the pool of simulation results, "extrapolate" from that to find some equation that will estimate the confidence intervals for results from the used algorithm, many "steps" into the future, without having to extend the runs of those one-million simulations further. (it would take too long to keep running those one-million simulations indefinitely)
Note that the results do not have to be terribly precise at this point; the results are mainly used atm as a visual indicator on the graph, for the user to get an idea of how "normal" the current simulation's results are relative to the confidence-intervals extrapolated from the historical data of the one-million simulations.
Anyway, I've already made some attempts at finding an "estimated curve-fit" of the confidence-intervals from the historical data (ie. those based on the one-million simulations), but the results are not quite precise enough.
Here are the key parts from the curve-fitting Python code I've tried: (link to full code here)
# curve fit functions
def func_linear(t, a, b):
    return a*t +b
def func_quadratic(t, a, b, c):
    return a*pow(t,2) + b*t +c
def func_cubic(t, a, b, c, d):
    return a*pow(t,3) + b*pow(t,2) + c*t + d
def func_biquadratic(t, a, b, c, d, e):
    return a*pow(t,4) + b*pow(t,3) + c*pow(t,2) + d*t + e
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# calling the read function on the recorded percentile/confidence-intervals data
xVals,yVals = read_file()
# using inbuilt function for linear fit
popt, pcov = curve_fit(func_linear, xVals, yVals)
fit_linear = func_linear(np.array(xVals), *popt)
# [same here for the other curve-fit functions]
plt.rcParams["figure.figsize"] = (40,20)
# plotting the respective curve fits
plt.plot(xVals, yVals, color="blue", linewidth=3)
plt.plot(xVals, fit_linear, color="red", linewidth=2)
plt.plot(xVals, fit_quadratic, color="green", linewidth=2)
plt.plot(xVals, fit_cubic, color="orange", linewidth=2)
#plt.plot(xVals, fit_biquadratic, color="black", linewidth=2) # extremely off
plt.legend(['Actual data','linear','quadratic','cubic','biquadratic'])
plt.xlabel('Session Column')
plt.ylabel('y-value for CI')
plt.title('Curve fitting')
And here are the results: (with the purple "actual data" being the 99.9th percentile of the Global-Value-Ys at each step, from the one-million recorded simulations)
While it seems to be attempting to estimate a curve-fit (over the graphed ~630 steps), the results are not quite accurate enough for my purposes. The imprecision is particularly noticeable at the first ~5% of the graph, where the estimate-curve is far too high. (though practically, my issues are more on the other end, as the CI curve-fit keeps getting less accurate the farther out from the historical data it goes)
EDIT: As requested by a commenter, here is a GitHub gist of the Python code I'm using to attempt the curve-fit (and it includes the data used):
So my questions:
Is there something wrong with my usage of scipy's curve_fit function?
Or is the curve_fit function too basic of a tool to get meaningful estimates/extrapolations of confidence-intervals for random-walk data like this?
If so, is there some alternative that works better for estimating/extrapolating confidence-intervals from random-walk data of this sort?
If I understand your question correctly, you have a lot of Monte Carlo simulations gathered as (x,y) points and you want to find a model that reasonably fits them.
Importing your data to create a MCVE:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt
data = np.array([
Plotting them shows a curve that seems to have square-root behavior at the beginning and linear behavior at the end of the dataset. So let's try this first simple model:
def model(x, a, b, c):
    return a*np.sqrt(x) + b*x + c
Notice that, formulated this way, it is a linear least-squares (LLS) problem, which makes it easy to solve. The optimization with curve_fit works as expected:
popt, pcov = optimize.curve_fit(model, data[:,0], data[:,1])
# [61.20233162 0.08897784 27.76102519]
# [[ 1.51146696e-02 -4.81216428e-04 -1.01108383e-01]
# [-4.81216428e-04 1.59734405e-05 3.01250722e-03]
# [-1.01108383e-01 3.01250722e-03 7.63590271e-01]]
It returns a pretty decent fit. Graphically it looks like:
Of course, this is just a fit to an arbitrary model chosen from personal experience (to me it looks like a specific heterogeneous reaction kinetics).
If you have a theoretical reason to accept or reject this model, then you should use it to discriminate. I would also investigate the units of the parameters to check whether they have a meaningful interpretation.
But in any case, this is out of scope for the Stack Overflow community, which is oriented toward solving programming issues, not the scientific validation of models (see Cross Validated or Math Overflow if you want to dig deeper). If you do so, draw my attention to it; I would be glad to follow this question in detail.
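Because the model a*sqrt(x) + b*x + c is linear in its coefficients, it can also be solved directly, without starting values, using a design matrix and np.linalg.lstsq. The sketch below uses synthetic data with assumed coefficients (not the asker's percentile data) and checks the result against curve_fit before extrapolating.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a*np.sqrt(x) + b*x + c

# Synthetic stand-in for the percentile curve (assumed values for illustration)
rng = np.random.default_rng(0)
x = np.arange(1.0, 631.0)
y = model(x, 60.0, 0.1, 30.0) + rng.normal(0, 0.5, x.size)

# Design matrix: one column per coefficient (sqrt(x), x, constant)
A = np.column_stack([np.sqrt(x), x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The iterative fit should land on the same coefficients
popt, _ = curve_fit(model, x, y)

# Extrapolate beyond the simulated steps
x_future = np.array([1000.0, 5000.0])
print("lstsq:    ", coef)
print("curve_fit:", popt)
print("extrapolated:", model(x_future, *coef))
```

How far such an extrapolation can be trusted depends on whether the square-root-plus-linear form really describes the process, which the fit alone cannot establish.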

Reducing difference between two graphs by optimizing more than one variable in MATLAB/Python?

Suppose 'h' is a function of x, y, z and t and it gives us a simulated graph line (t,h). At the same time we also have an observed graph (observed values of h against t). How can I reduce the difference between the observed (t,h) and simulated (t,h) graphs by optimizing the values of x, y and z? I want to change the simulated graph so that it gets closer and closer to the observed graph in MATLAB/Python. In the literature I have read that people have done the same thing with the Levenberg-Marquardt algorithm, but I don't know how to do it.
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points (some implementations can choose them randomly); in the case of lsqcurvefit, this is what x0 is for. If you have
h = @(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> @(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(@(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
In Python, you'd use the scipy.optimize module, and scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt,pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, h_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
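A self-contained version of the Python call might look like the sketch below. The function h (an exponential decay with offset), the "measured" data and the starting values are all invented for illustration. Without bounds, curve_fit defaults to method='lm', which is the Levenberg-Marquardt algorithm mentioned in the question.

```python
import numpy as np
import scipy.optimize as opt

def h(x, y, z, t):
    """Hypothetical simulated model: exponential decay with an offset."""
    return x * np.exp(-y * t) + z

# Synthetic "observed" data (assumed true parameters 2.5, 0.8, 0.5)
rng = np.random.default_rng(42)
t_meas = np.linspace(0, 10, 50)
h_meas = h(2.5, 0.8, 0.5, t_meas) + rng.normal(0, 0.02, t_meas.size)

# Starting guess for [x, y, z]
x0, y0, z0 = 1.0, 1.0, 0.0

# curve_fit expects the independent variable first, hence the lambda wrapper
popt, pcov = opt.curve_fit(lambda t, x, y, z: h(x, y, z, t),
                           t_meas, h_meas, p0=[x0, y0, z0])
x, y, z = popt
h_fit = h(x, y, z, t_meas)  # simulated curve with the optimized parameters
print(popt)
```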

Gaussian fit in Python - parameters estimation

I want to fit an array of data (in the program called "data", of size "n") with a Gaussian function and I want to get the estimations for the parameters of the curve, namely the mean and the sigma. Is the following code, which I found on the Web, a fast way to do that? If so, how can I actually get the estimated values of the parameters?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x = np.arange(n)
y = data
n = len(x) #the number of data
mean = sum(x*y)/n #note this correction
sigma = sum(y*(x-mean)**2)/n #note this correction
def gaus(x,a,x0,sigma,c):
    return a*np.exp(-(x-x0)**2/(sigma**2))+c
popt,pcov = curve_fit(gaus,x,y,p0=[1,mean,sigma,0.0])
print(popt)
print(pcov)
plt.title('Fig. 3 - Fit')
To answer your first question, "Is the following code, which I found on the Web, a fast way to do that?"
The code that you have is in fact the right way to proceed with fitting your data, when you believe it is Gaussian and know the fitting function (except change the return function to
I believe for a Gaussian function you don't need the constant c parameter.
A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit.
To answer your second question, "If so, how can I actually get the estimated values of the parameters?"
You can go to the link provided for scipy.optimize.curve_fit and find that the best-fit parameters reside in your popt variable. In your example, popt will contain the fitted values in the order of the function arguments (a, x0, sigma, c), so the mean is popt[1] and sigma is popt[2]. In addition to the best-fit parameters, pcov will contain the covariance matrix, from which the errors of your mean and sigma follow. To obtain 1-sigma standard deviations, take the square root of its diagonal, np.sqrt(np.diag(pcov)).
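Reading the estimates and their uncertainties out of popt and pcov can be sketched as follows. The data are synthetic, the starting values come from a simple weighted-moment heuristic assumed here, and the Gaussian uses the standard 2*sigma**2 denominator so that sigma is the standard deviation.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaus(x, a, x0, sigma, c):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2)) + c

# Synthetic data (assumed parameters: a=4, x0=45, sigma=6, c=1)
rng = np.random.default_rng(3)
x = np.arange(100, dtype=float)
data = gaus(x, 4.0, 45.0, 6.0, 1.0) + rng.normal(0, 0.1, x.size)

# Starting values from weighted moments, after removing a rough baseline
baseline = np.median(data)
w = np.clip(data - baseline, 0, None)
mean = np.sum(x * w) / np.sum(w)
sigma = np.sqrt(np.sum(w * (x - mean) ** 2) / np.sum(w))

popt, pcov = curve_fit(gaus, x, data, p0=[w.max(), mean, sigma, baseline])
perr = np.sqrt(np.diag(pcov))  # 1-sigma uncertainty of each parameter

for name, value, err in zip(["a", "x0", "sigma", "c"], popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

The off-diagonal entries of pcov describe correlations between parameters, so taking the square root of the whole matrix would not give meaningful error bars.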
