I need to fit a special distribution of data with any available function. The distribution does not really follow a specific theoretical prediction, so I just want to fit any given function without great meaning. I attached an image with a sample distribution and a fifth order polynomial fit to show that this simple approach does not really work.
I know the distribution closely resembles an error function, but I did not manage to fit such a function with scipy...
I hope someone either knows a way to fit an error function to such a distribution, or can suggest a different type of function that would describe it.
You can fit any function you want:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(func, xdata, ydata)
In case you want some function similar to erf, you can use for example:
import scipy.special
def func(z, a, b):
    return a * scipy.special.erf(z) + b
This will find the parameters a,b.
Further fitting parameters might be helpful:
def func(x, a, b, z, f):
    return a * scipy.special.erf((x - z) * f) + b
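To prevent a runtime error when the fit does not converge in time, the maximum number of function evaluations can be raised with maxfev (for the default method the default is 200*(n+1), where n is the number of parameters, so 1000 for the four-parameter function above):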
popt, pcov = curve_fit(func, x, y, maxfev=8000)
(See Scipy curvefit RuntimeError:Optimal parameters not found: Number of calls to function has reached maxfev = 1000)
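As a rough illustration of the four-parameter error-function fit described above, here is a minimal sketch. The synthetic data, the name erf_model and the way the starting values are read off the data are all assumptions for the example; they are not taken from the original post.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def erf_model(x, a, b, z, f):
    # a: half-height of the step, b: vertical offset,
    # z: centre of the step, f: steepness
    return a * erf((x - z) * f) + b

# Synthetic data (assumption: stands in for the measured distribution)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = erf_model(x, 2.0, 1.0, 4.0, 0.8) + rng.normal(0.0, 0.05, x.size)

# Rough starting values estimated from the data itself
mid = (y.max() + y.min()) / 2
p0 = [
    (y.max() - y.min()) / 2,          # a: half of the total step height
    mid,                              # b: midpoint between the plateaus
    x[np.argmin(np.abs(y - mid))],    # z: where the data crosses the midpoint
    1.0,                              # f: generic steepness
]

popt, pcov = curve_fit(erf_model, x, y, p0=p0, maxfev=8000)
print(popt)
```

Passing p0 usually matters more than raising maxfev: with sensible starting values the fit tends to converge in far fewer evaluations.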
I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

def gaussian_model(x, a, b, c, d):  # add constant d
    return a*np.exp(-(x-b)**2/(2*c**2))+d

# xdata and ydata hold the measured points
x = np.linspace(0, 20, 100)
mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A, fit_B, fit_C, fit_D = mu
# evaluate the fit on the same grid that is plotted
fit_y = gaussian_model(x, fit_A, fit_B, fit_C, fit_D)
print(mu)
plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
plt.show()
Here's the plot
When I printed the parameters, I got values of -17 for amplitude, 2.6 for mean, -2.5 for standard deviation, and 110 for the base. This is very far off from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! Just needed to add some guesses.
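For readers who land here with the same problem, a sketch of what "adding some guesses" can look like. The data below is synthetic and the heuristics for the starting values (median as baseline, highest point as peak, a tenth of the x range as width) are assumptions chosen for illustration, not the values used in the original post.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_model(x, a, b, c, d):
    return a * np.exp(-(x - b) ** 2 / (2 * c ** 2)) + d

# Synthetic stand-in for the measured data (assumption)
rng = np.random.default_rng(1)
xdata = np.linspace(0, 20, 100)
ydata = gaussian_model(xdata, 50.0, 12.0, 1.5, 100.0) + rng.normal(0, 1.0, xdata.size)

# Initial guesses read off the data
d0 = np.median(ydata)                     # baseline: most points sit on it
a0 = ydata.max() - d0                     # peak height above the baseline
b0 = xdata[np.argmax(ydata)]              # peak position
c0 = (xdata.max() - xdata.min()) / 10     # crude width guess

popt, pcov = curve_fit(gaussian_model, xdata, ydata, p0=[a0, b0, c0, d0])
print(popt)
```

Note that the model depends on c only through c**2, so the fit may return a negative c; abs(c) is the width in that case.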
This is not the kind of answer you were expecting.
It is an alternative method of fitting a Gaussian.
The process is not iterative and doesn't require initial "guessed" parameter values to start from, as the usual methods do.
The result is :
The method of calculus is shown below :
The general principle is explained with examples in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . It is a linear regression with respect to an integral equation whose solution is the Gaussian function.
If one wants a more accurate and/or more specific result according to some specified fitting criteria, one has to use software with a non-linear regression process. The result above can then serve as the initial parameter values for a more robust iterative process.
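The exact formulas of the method were posted as an image and are not reproduced here, so the following is only a sketch of the general principle under stated assumptions: the model is a Gaussian without a constant offset, the x values are sorted, and the data is not too noisy. A Gaussian satisfies y' = (mu/sigma^2) y - (1/sigma^2) x y; integrating from the first point gives y - y1 = A*S + B*T with S the running integral of y and T the running integral of x*y, which is linear in A = mu/sigma^2 and B = -1/sigma^2. The function name gaussian_integral_fit is hypothetical.

```python
import numpy as np

def gaussian_integral_fit(x, y):
    """Non-iterative Gaussian fit via linear regression on an integral equation.

    Assumes y ~ amp * exp(-(x - mu)**2 / (2 * sigma**2)) with no offset
    and x sorted in increasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def running_integral(f):
        # cumulative trapezoidal integral, starting at 0 for the first point
        steps = 0.5 * (f[1:] + f[:-1]) * np.diff(x)
        return np.concatenate(([0.0], np.cumsum(steps)))

    S = running_integral(y)        # integral of y
    T = running_integral(x * y)    # integral of x*y

    # Linear least squares for y - y[0] = A*S + B*T
    M = np.column_stack((S, T))
    (A, B), *_ = np.linalg.lstsq(M, y - y[0], rcond=None)

    sigma = np.sqrt(-1.0 / B)      # B = -1/sigma**2 (requires B < 0)
    mu = -A / B                    # A = mu/sigma**2

    # Amplitude from a one-parameter linear regression through the origin
    g = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
    amp = np.dot(g, y) / np.dot(g, g)
    return amp, mu, sigma

# Noise-free synthetic check (assumption: illustrative values only)
x = np.linspace(0.0, 20.0, 201)
y = 3.0 * np.exp(-(x - 8.0) ** 2 / (2 * 2.0 ** 2))
amp, mu, sigma = gaussian_integral_fit(x, y)
print(amp, mu, sigma)
```

The returned values can be passed as p0 to curve_fit when a refined non-linear fit is wanted. A model with a constant offset, as in the question, would need an extra term in the regression.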
I am trying to understand whether I am carrying out curve fitting correctly and appropriately using the "curve_fit()" module from scipy within python. I have a script that iterates through each column of my dataframe, plots the data, and then fits a curve to it, and then judges the fit of the equation across the entire data set by calculating the average root mean square error across all of the rows.
First, here is my dataframe:
I would have provided the data itself here, but I am not sure how. As a description of what the data means: I am looking at how pollution levels decay as the distance from a factory site increases, with a distance column and each subsequent column representing a different factory (factory "0", factory "1", factory "2", factory "3", etc.). Pollution levels are recorded at each distance moving away from each factory in 30-meter increments, with 0 meters representing the location of the factory site itself. My hypothesis is that as we move away from a factory site, the pollution levels start to decay. The research question is: from my specific data set, overall, across all of the studied factories, how does pollution decay with increasing distance? Is this linear decay? Exponential decay? Logarithmic decay? Polynomial decay? Inverse-logistic decay? That is what I am trying to find out.
And so I am trying to fit different curve equations to my data to see which type of curve best represents the decay I am looking at. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

#Un-comment each of these individual curve functions to try them out individually
# def func(x, a, b):
#     return a * x + b

# def func(x, a, b):
#     return -a *np.log(x) + b

# def func(x, a, b, c):
#     return a * np.exp(-b * x) + c

# def func(x, a, b, c):
#     return a / (1 + b*c**(-x))

# def func(x, a, b, c):
#     return a*x**2+b*x+c

RMSE_list = []
# data is the dataframe shown above
xdata = data['Distance'].to_numpy()
c = range(8)
for i in c:
    # plt.figure(figsize=(3, 3))
    greenspace = str(i)
    ydata = data[greenspace].to_numpy()
    plt.plot(xdata, ydata, 'b-', label='data')
    popt, pcov = curve_fit(func, xdata, ydata)
    plt.plot(xdata, func(xdata, *popt), 'g--')

    #Calculate RMSE
    params, _ = curve_fit(func, xdata, ydata)
    a, b = params[0], params[1]
    yfit1 = a * xdata + b
    rmse = np.sqrt(np.mean((yfit1 - ydata) ** 2))
    RMSE_list.append(rmse)

RMSE_Average = np.mean(RMSE_list)
print("RMSE Average:")
print(RMSE_Average)
And so here I am trying out curves of linear decay, logarithmic decay, exponential decay, inverse-logistic decay, and second-degree polynomial decay. And so I try out each of these curve fitting equations individually as a unique function and fit this to the data from each factory. I then find the RMSE of each of these fits for each factory, and then average them to get an average RMSE to describe all factories for each curve type. That's all my code does. What I am trying to seek some help here with is to ask whether I am actually doing this right or if it even sounds like my approach here is on the right track. This whole field of curve fitting is very new to me, and I am really just going as far as I know how to at my beginner level. The code works I suppose, but I get some strange outcomes.
When I try this all out I get the following results for each curve fitting attempt:
Overall the idea makes sense to me here, but there are specific issues I encountered that I do not understand. I specifically had issues with the logarithmic, exponential and inverse-logistic curve fittings. For the logarithmic curve attempt I received the following warning for each factory:
C:\Users\MyName\AppData\Local\Temp/ipykernel_7788/37186127.py:8: RuntimeWarning: divide by zero encountered in log
return -a *np.log(x) + b
Also, just looking at the curve, the fitted logarithmic curve lies way below the actual data; it doesn't even touch it!
And for the exponential curve attempt I received the following warning for each factory:
C:\Users\MyName\AppData\Roaming\Python\Python39\site-packages\scipy\optimize\minpack.py:833: OptimizeWarning: Covariance of the parameters could not be estimated
I am thinking that perhaps a lot of this might be resolved by setting bounds. When I set the parameter bounds for the exponential attempt to "bounds=(0, [3., 1., 0.5])", based on the curve_fit() documentation from scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), I get the same warning message for each factory, but I receive this exponential curve fitting result:
So the visual fitting looks better, but the average RMSE looks the same, which is confusing me. And for the inverse-logistic fitting, it visually looks like the fitting is good, but the average RMSE is so high, which tells me the fit is really bad.
And so my question here is: how can I improve my approach and code here to most accurately and effectively fit these curves to my data and compare them using "curve_fit()" in python? I am thinking maybe this is a matter of methodically setting the bounds for each curve fitting, but I am not sure.
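One thing worth checking in the script above: the RMSE step always evaluates a * xdata + b, whichever func is active, so the reported RMSE does not describe the curve that was plotted. A minimal sketch of computing the RMSE from the fitted model itself is shown below. The helper fit_rmse, the synthetic data and the starting values are assumptions for illustration, not part of the original script.

```python
import numpy as np
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a * x + b

def exponential(x, a, b, c):
    return a * np.exp(-b * x) + c

def fit_rmse(model, xdata, ydata, p0=None):
    """Fit model to the data and return (parameters, RMSE of that same fit)."""
    popt, _ = curve_fit(model, xdata, ydata, p0=p0, maxfev=10000)
    residuals = model(xdata, *popt) - ydata   # evaluate the fitted model itself
    return popt, np.sqrt(np.mean(residuals ** 2))

# Synthetic stand-in for one factory column (assumption)
rng = np.random.default_rng(0)
xdata = np.arange(0, 300, 30, dtype=float)    # 0 m to 270 m in 30 m steps
ydata = 40.0 * np.exp(-0.01 * xdata) + 10.0 + rng.normal(0.0, 0.5, xdata.size)

# Starting values on the scale of the data; b is in 1/metres, so 1 is far off
p0_exp = [ydata[0] - ydata[-1], 1.0 / xdata.mean(), ydata[-1]]

_, rmse_lin = fit_rmse(linear, xdata, ydata)
_, rmse_exp = fit_rmse(exponential, xdata, ydata, p0=p0_exp)
print("linear RMSE:", rmse_lin)
print("exponential RMSE:", rmse_exp)
```

For the logarithmic model, np.log(0) is -inf, so the row at distance 0 would have to be excluded or the model shifted (for example np.log(x + 1)); which of the two is appropriate is a modelling choice. Models with more parameters will tend to show a lower RMSE on the data they were fitted to, so the comparison is only a first indication.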
I have a set of data points which, according to the model I want to implement, could be modelled with a certain curve (in this case, a product between an exponential and a complementary error function).
For fitting these data into such a curve, I tried:
import numpy as np
from scipy.optimize import curve_fit
from scipy import special
x_fit = np.linspace(0,1,1000)
def fitted_function(x_fit, c, d, S):
    return c*np.exp(((S*d/2)**2)-x_fit*d)*special.erfc(S*d/2-x_fit/S)
FitParameters, FitCovariance = curve_fit(fitted_function, x_data, y_data, maxfev = 100000)
It does not give me any particular error, but the result of the fitting is evidently wrong. I strongly suspect that it has to do with the part x_fit/S, where the fitting parameter S appears as a denominator.
For example, I encounter the same problem while fitting a simple exponential: if I define the fitting curve with
return a*np.exp(-x_fit/b)
with a, b fitting parameters; since the fitting parameter b appears as a denominator, I find the same problem (i.e. the resulting fitted curve is a horizontal line for some reason).
For the case of a simple exponential I can simply bypass this by doing
return a*np.exp(-b*x_fit)
so that b is not a denominator anymore and the fitted curve is really an exponential curve. For my current case, instead, I cannot do this since S appears as a numerator and as a denominator in different parts of the expression.
Any ideas? Thank you in advance!
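One possible cause, which the post does not confirm, is that curve_fit starts every parameter at 1 when no p0 is given; if the true decay length is much smaller than 1, a model with the parameter in the denominator starts far from the data and the fit may stall on a nearly flat curve. The sketch below keeps the a*exp(-x/b) form and instead supplies data-based starting values and a positivity bound. The synthetic data and the name decay are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b):
    # b stays in the denominator, as in the question
    return a * np.exp(-x / b)

# Synthetic stand-in for the measured data (assumption)
rng = np.random.default_rng(3)
x_data = np.linspace(0.0, 1.0, 100)
y_data = decay(x_data, 2.0, 0.2) + rng.normal(0.0, 0.01, x_data.size)

# Starting values read off the data:
# a0 is the first value, b0 is the x at which the data first drops below a0/e
# (assumes the data does fall below a0/e within the measured range)
a0 = y_data[0]
b0 = x_data[np.argmax(y_data < a0 / np.e)]

# Keep both parameters positive so b can never cross zero during the fit
popt, pcov = curve_fit(decay, x_data, y_data, p0=[a0, b0],
                       bounds=([0.0, 1e-6], [np.inf, np.inf]))
print(popt)
```

The same idea should carry over to the exponential times erfc model: give S a starting value on the scale of the x data and bound it away from zero.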
I am experimenting with Python to fit curves to a series of data-points, summary below:
From the below, it would seem that polynomials of order greater than 2 are the best fit, followed by linear and finally exponential which has the overall worst outcome.
While I appreciate this might not be exponential growth, I just wanted to know whether you would expect the exponential function to perform so badly (basically the coefficient of x, b, has been set to 0 and an arbitrary point has been picked on the curve to intersect), or whether I have done something wrong in my fitting code.
The code I'm using to fit is as follows:
# Fitting
def exponenial_func(x,a,b,c):
    return a*np.exp(-b*x)+c

def linear(x,m,c):
    return m*x+c

def quadratic(x,a,b,c):
    return a*x**2 + b*x+c

def cubic(x,a,b,c,d):
    return a*x**3 + b*x**2 + c*x + d

x = np.array(x)
yZero = np.array(cancerSizeMean['levelZero'].values)[start:]
print(len(x))
print(len(yZero))
popt, pcov = curve_fit(exponenial_func,x, yZero, p0=(1,1,1))
expZeroFit = exponenial_func(x, *popt)
plt.plot(x, expZeroFit, label='Control, Exponential Fit')
popt, pcov = curve_fit(linear, x, yZero, p0=(1,1))
linearZeroFit = linear(x, *popt)
plt.plot(x, linearZeroFit, label = 'Control, Linear')
popt, pcov = curve_fit(quadratic, x, yZero, p0=(1,1,1))
quadraticZeroFit = quadratic(x, *popt)
plt.plot(x, quadraticZeroFit, label = 'Control, Quadratic')
popt, pcov = curve_fit(cubic, x, yZero, p0=(1,1,1,1))
cubicZeroFit = cubic(x, *popt)
plt.plot(x, cubicZeroFit, label = 'Control, Cubic')
*Edit: curve_fit is imported from the scipy.optimize package
from scipy.optimize import curve_fit
curve_fit tends to perform poorly if you give it a poor initial guess with functions like the exponential, which can end up with very large numbers. You could try altering the maxfev input so that it runs more function evaluations. Otherwise, I would suggest trying something like:
p0=(1000,-.005,0)
The reasoning: b should be small and negative (on the order of -.005 to -.01), since the data roughly doubles from 300 to 500 and you have -b in your equation; a is about 1000, since the value is ~3000 at x = 300 (1.5 doublings from 0). See how that turns out.
As for why the initial exponential fit doesn't work at all: your initial guess is b=1, and x is in the range (300, 1000). This means Python is calculating exp(-300), which is about 5e-131 and so is effectively 0 next to the other terms (exp(-1000) underflows to exactly 0). At this point, whether b is increased or decreased slightly, the exponential term still evaluates to effectively 0 for any value in the general vicinity of the initial estimate, so the optimizer sees no improvement in any direction.
Basically, Python uses a numerical method with limited precision, and the exponential estimate went outside the range of values it can handle.
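The effect can be checked directly. The short sketch below evaluates the model from the question at the starting point p0=(1,1,1) on an x grid in the 300 to 1000 range (the grid itself is an assumption, since the original data is not shown) and shows that nudging b does not change the output at all.

```python
import numpy as np

def exponenial_func(x, a, b, c):
    return a * np.exp(-b * x) + c

# Assumed x grid in the range mentioned in the answer
x = np.linspace(300.0, 1000.0, 8)

tiny = np.exp(-300.0)     # still representable, but negligible
zero = np.exp(-1000.0)    # underflows to exactly 0.0

# Model output at the initial guess and with b nudged slightly
flat = exponenial_func(x, 1.0, 1.0, 1.0)
nudged = exponenial_func(x, 1.0, 1.01, 1.0)

print(tiny, zero)
print(np.array_equal(flat, nudged))   # the optimizer sees no change in the output
```

Because the model output is identical for nearby values of b, the numerically estimated gradient with respect to b is zero, and the fit collapses to a constant.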
I'm not sure how you're fitting the curves -- are you using polynomial least squares? In that case, you'd expect the fit to improve with each additional degree of flexibility, and you choose the power based on diminishing marginal improvement / outside theory.
The improving fit should look something like this.
I actually wrote some code to do Polynomial Least Squares in python for a class a while back, which you can find here on Github. It's a bit hacky though and loosely commented since I was just using it to solve exercises. Hope it's helpful.
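The linked code is not reproduced here. As a stand-in, the sketch below uses numpy.polyfit (not the code from the GitHub link) on made-up data to show how the sum of squared errors shrinks, or at least never grows, as the polynomial degree increases.

```python
import numpy as np

# Made-up data (assumption): a noisy cubic on a scaled x axis
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 4.0 * x ** 3 + rng.normal(0.0, 0.1, x.size)

# Sum of squared errors of the least-squares polynomial for each degree
sse = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    residuals = np.polyval(coeffs, x) - y
    sse[degree] = float(np.sum(residuals ** 2))

for degree, value in sse.items():
    print(degree, value)
```

The improvement typically flattens out once the degree matches the structure in the data; beyond that point extra degrees mostly fit noise, which is where the diminishing-returns judgement comes in.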
I want to fit an array of data (in the program called "data", of size "n") with a Gaussian function and I want to get the estimations for the parameters of the curve, namely the mean and the sigma. Is the following code, which I found on the Web, a fast way to do that? If so, how can I actually get the estimated values of the parameters?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

y = np.asarray(data, dtype=float)
n = len(y)                                        #the number of data
x = np.arange(n)
mean = np.sum(x*y)/np.sum(y)                      #weighted mean, normalised by sum(y)
sigma = np.sqrt(np.sum(y*(x-mean)**2)/np.sum(y))  #weighted standard deviation

def gaus(x,a,x0,sigma,c):
    return a*np.exp(-(x-x0)**2/(sigma**2))+c

popt,pcov = curve_fit(gaus,x,y,p0=[1,mean,sigma,0.0])
print(popt)
print(pcov)
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit')
plt.xlabel('q')
plt.ylabel('data')
plt.show()
To answer your first question, "Is the following code, which I found on the Web, a fast way to do that?"
The code that you have is in fact the right way to proceed with fitting your data, provided you believe the data is Gaussian and you know the fitting function. The one change I would make is to the return expression:
a*exp(-(x-x0)**2/(sigma**2))
I believe that for a Gaussian function you don't need the constant c parameter.
A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit.
To answer your second question, "If so, how can I actually get the estimated values of the parameters?"
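You can go to the link provided for scipy.optimize.curve_fit and find that the best-fit parameters reside in your popt variable, in the order in which they appear in the function signature. In your example, popt will therefore contain the amplitude a, the mean x0 and sigma (plus c, if you keep it). Note that because the exponent is divided by sigma**2 rather than 2*sigma**2, the fitted sigma is sqrt(2) times the standard deviation. In addition to the best-fit parameters, pcov will contain the covariance matrix, whose diagonal holds the variances of the fitted parameters. To obtain the one-sigma standard errors of the mean and sigma, take the square root of the diagonal: np.sqrt(np.diag(pcov)).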
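A short sketch of reading the parameters and their one-sigma errors out of popt and pcov. The data here is synthetic and the three-parameter form of gaus (without c) follows the suggestion in this answer; the starting values are an assumption for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaus(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (sigma ** 2))

# Synthetic stand-in for the "data" array (assumption)
rng = np.random.default_rng(4)
x = np.arange(100, dtype=float)
y = gaus(x, 5.0, 40.0, 8.0) + rng.normal(0.0, 0.1, x.size)

# Starting values: peak height, peak position, a rough width
p0 = [y.max(), x[np.argmax(y)], 10.0]
popt, pcov = curve_fit(gaus, x, y, p0=p0)

# popt follows the order of the function signature after x
a_fit, x0_fit, sigma_fit = popt

# One-sigma standard errors: square root of the diagonal of pcov
perr = np.sqrt(np.diag(pcov))

for name, value, error in zip(("a", "x0", "sigma"), popt, perr):
    print(f"{name} = {value:.3f} +/- {error:.3f}")
```

Taking np.sqrt(pcov) on the full matrix would also take roots of the off-diagonal covariances, which can be negative, so only the diagonal is used.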