I am trying to fit a curve with the curve_fit function in SciPy. Changing the initial values of the model changes the quality of the fit, but I am not able to find the best fit through my data. Here is what my fit looks like:
My question is: how can I improve this fit, and what is the best way of selecting the initial values of the model?
I have attached the raw data to which I want to fit an exponential curve.
This is the data I am using:
y = [338.52656636, 337.43934446, 348.25434126, 308.42768639, 279.24436171,
     269.85992004, 279.24436171, 249.25992615, 239.53215125, 219.96215705,
     220.41993469, 220.30549028, 220.30549028, 195.07049776, 180.364391,
     171.20883816, 180.24994659, 180.13550218, 180.47883541, 209.89104892,
     220.19104587, 180.02105777, 595.45426801, 324.50712607, 150.60884426,
     170.97994934, 171.20883816, 170.75106052, 170.75106052, 159.76439711,
     140.88106937, 150.37995544, 140.88106937, 1620.70451979, 140.42329173,
     150.37995544, 140.53773614, 284.68047121, 1146.84743797, 170.97994934,
     150.60884426, 145.74495682, 141.10995819, 121.53996399, 121.19663076,
     131.38218329, 170.40772729, 140.42329173, 140.82384716, 145.5732902,
     140.30884732, 121.53996399, 700.39979247, 2783.74584185, 131.26773888,
     140.76662496, 140.53773614, 121.76885281, 126.23218482, 130.69551683]
and here is my code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def expDecay(t, Amax, tau):
    # exponential decay: y = Amax/tau * exp(-t/tau)
    return Amax / tau * np.exp(-t / tau)

Amax = []
Tau = []
ydata = y
xdata = np.arange(len(y))  # the x values are just the sample indices

popt, pcov = curve_fit(expDecay, xdata, ydata,
                       p0=(10000, 5),
                       bounds=([0., 2.], [10000., 30.]))
Amax.append(popt[0])
Tau.append(popt[1])

plt.plot(xdata, expDecay(xdata, *popt), 'k-', label='Pred.')
plt.plot(xdata, ydata)
plt.ylim([0, 500])
plt.show()
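One way to choose the initial values is to read them off the data via the linearized model: taking logs of y = Amax/tau * exp(-t/tau) gives log(y) = log(Amax/tau) - t/tau, a straight line in t. A minimal sketch, assuming the xdata/ydata arrays above (the outliers will skew this rough fit, but it only needs to seed curve_fit):

slope, intercept = np.polyfit(xdata, np.log(ydata), 1)  # straight-line fit in log space
tau0 = -1.0 / slope                # slope = -1/tau
Amax0 = np.exp(intercept) * tau0   # intercept = log(Amax/tau)
popt, pcov = curve_fit(expDecay, xdata, ydata, p0=(Amax0, tau0))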
The deviation is due to the outliers. After eliminating them:
A note about eliminating the outliers.
Since the definition of an outlier is subjective, software able to do this will probably be more or less interactive. I built my own, very rudimentary, software. The principle is:
A first nonlinear regression is done with all the points. With the function and parameters obtained, the value of y is computed for each point. The absolute differences between the "y computed" and the "y values" from the given data file are compared. This identifies the point furthest away, which is eliminated.
Another nonlinear regression is done with the remaining points. The same procedure eliminates a second point.
And so on, until a specified stopping criterion is reached. That is the subjective part.
With your data (60 points), point no. 54 was eliminated first, then point no. 34, then no. 39, and so on. The process stops after eliminating 6 points; eliminating more doesn't improve the least-mean-square error much.
The curve above is the result of the last nonlinear regression with the 54 remaining points.
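For reference, a minimal sketch of that elimination loop, assuming the expDecay model and the xdata/ydata arrays from the question (the fixed count of 6 eliminations stands in for the subjective stopping criterion):

xk = np.asarray(xdata, dtype=float)
yk = np.asarray(ydata, dtype=float)
for _ in range(6):  # stop after eliminating 6 points (the subjective part)
    popt, _ = curve_fit(expDecay, xk, yk, p0=(10000, 5))
    resid = np.abs(yk - expDecay(xk, *popt))  # |y computed - y value|
    worst = np.argmax(resid)                  # the point furthest away
    xk = np.delete(xk, worst)
    yk = np.delete(yk, worst)
# popt now holds the parameters of the last regression on the remaining points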
I'm having a lot of trouble fitting this data, particularly getting the fit parameters to match the expected parameters.
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

def gaussian_model(x, a, b, c, d):  # add constant offset d
    return a * np.exp(-(x - b)**2 / (2 * c**2)) + d

x = np.linspace(0, 20, 100)

mu, cov = curve_fit(gaussian_model, xdata, ydata)
fit_A = mu[0]
fit_B = mu[1]
fit_C = mu[2]
fit_D = mu[3]
fit_y = gaussian_model(xdata, fit_A, fit_B, fit_C, fit_D)
print(mu)

plt.plot(x, fit_y)
plt.scatter(xdata, ydata)
plt.show()
Here's the plot
When I printed the parameters, I got values of -17 for amplitude, 2.6 for mean, -2.5 for standard deviation, and 110 for the base. This is very far off from what I would expect from the scatter plot. Any ideas why?
Also, I'm pretty new to coding, so any advice is helpful! Thanks everyone :)
Edit: figured out what was wrong! Just needed to add some guesses.
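For anyone who lands here, a sketch of what those guesses can look like when read off the data (assuming xdata and ydata are NumPy arrays; the heuristics are illustrative, not the poster's actual values):

a0 = ydata.max() - ydata.min()            # peak height above the baseline
b0 = xdata[np.argmax(ydata)]              # location of the maximum
c0 = (xdata.max() - xdata.min()) / 10.0   # rough guess at the width
d0 = ydata.min()                          # baseline offset
mu, cov = curve_fit(gaussian_model, xdata, ydata, p0=[a0, b0, c0, d0])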
This is not an answer in the usual sense.
It presents an alternative method of fitting a Gaussian.
The process is not iterative and doesn't require initial "guessed" parameter values to start, as the usual methods do.
The result is:
The method of calculation is shown below:
The general principle is explained with examples in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . This is a linear regression with respect to an integral equation whose solution is the Gaussian function.
If one wants a more accurate and/or more specific result according to some specified fitting criterion, one has to use software with a nonlinear regression process. One can then use the above result as initial parameter values for a more robust iterative process.
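As a simpler non-iterative alternative (the method of moments, not the integral-equation method above), one can estimate the Gaussian parameters from the weighted moments of the curve and then hand them to curve_fit as starting values. A sketch, assuming xdata/ydata sample a single peak on a negligible baseline:

import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c):
    return a * np.exp(-(x - b)**2 / (2 * c**2))

w = ydata / ydata.sum()                    # treat the curve as a weight function
b0 = np.sum(w * xdata)                     # first moment: the center
c0 = np.sqrt(np.sum(w * (xdata - b0)**2))  # second moment: the width
a0 = ydata.max()                           # height of the peak
popt, pcov = curve_fit(gaussian, xdata, ydata, p0=[a0, b0, c0])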
I am trying to fit an exponential function and 5 Gaussians to my data. What I am aiming for is something along these lines (where gDNA Fit is the exponential, 1-5Nuc Fit are the 5 Gaussians, and Total Fit is the sum of all the fits):
My approach was to fit the exponential first and then, based on that, introduce a cut-off that would allow me to fit the Gaussians without taking the already fitted data into account. (I have already cut the data at 100, as this is where it dips down to 0.)
The problem is that I don't seem to be able to fit the exponential properly, and the Gaussians are off the scale:
from scipy.optimize import curve_fit
import numpy as np
import matplotlib.pyplot as plt

# Exponential
x = np.array([1.010000000000000000e+02,1.100000000000000000e+02,1.190000000000000000e+02,1.280000000000000000e+02,1.370000000000000000e+02,1.460000000000000000e+02,1.550000000000000000e+02,1.640000000000000000e+02,1.730000000000000000e+02,1.820000000000000000e+02,1.910000000000000000e+02,2.000000000000000000e+02,2.090000000000000000e+02,2.180000000000000000e+02,2.270000000000000000e+02,2.360000000000000000e+02,2.450000000000000000e+02,2.540000000000000000e+02,2.630000000000000000e+02,2.720000000000000000e+02,2.810000000000000000e+02,2.900000000000000000e+02,2.990000000000000000e+02,3.080000000000000000e+02,3.170000000000000000e+02,3.260000000000000000e+02,3.350000000000000000e+02,3.440000000000000000e+02,3.530000000000000000e+02,3.620000000000000000e+02,3.710000000000000000e+02,3.800000000000000000e+02,3.890000000000000000e+02,3.980000000000000000e+02,4.070000000000000000e+02,4.160000000000000000e+02,4.250000000000000000e+02,4.340000000000000000e+02,4.430000000000000000e+02,4.520000000000000000e+02,4.610000000000000000e+02,4.700000000000000000e+02,4.790000000000000000e+02,4.880000000000000000e+02,4.970000000000000000e+02,5.060000000000000000e+02,5.150000000000000000e+02,5.240000000000000000e+02,5.330000000000000000e+02,5.420000000000000000e+02,5.510000000000000000e+02,5.600000000000000000e+02,5.690000000000000000e+02,5.780000000000000000e+02,5.870000000000000000e+02,5.960000000000000000e+02,6.050000000000000000e+02,6.140000000000000000e+02,6.230000000000000000e+02,6.320000000000000000e+02,6.410000000000000000e+02,6.500000000000000000e+02,6.590000000000000000e+02,6.680000000000000000e+02,6.770000000000000000e+02,6.860000000000000000e+02,6.950000000000000000e+02,7.040000000000000000e+02,7.130000000000000000e+02,7.220000000000000000e+02,7.310000000000000000e+02,7.400000000000000000e+02,7.490000000000000000e+02,7.580000000000000000e+02,7.670000000000000000e+02,7.760000000000000000e+02,7.850000000000000000e+02,7.940000000000000000e+02,8.030000000000000000e+02,8.120000000000000000e+02,8.210000000000000000e+02,8.300000000000000000e+02,8.390000000000000000e+02,8.480000000000000000e+02,8.570000000000000000e+02,8.660000000000000000e+02,8.750000000000000000e+02,8.840000000000000000e+02,8.930000000000000000e+02,9.020000000000000000e+02,9.110000000000000000e+02,9.200000000000000000e+02,9.290000000000000000e+02,9.380000000000000000e+02,9.470000000000000000e+02,9.560000000000000000e+02,9.650000000000000000e+02,9.740000000000000000e+02,9.830000000000000000e+02,9.920000000000000000e+02])
y = np.array([3.579280000000000000e+05,3.172290000000000000e+05,1.759610000000000000e+05,1.352610000000000000e+05,1.069130000000000000e+05,9.721000000000000000e+04,9.908200000000000000e+04,1.168480000000000000e+05,1.266880000000000000e+05,1.264760000000000000e+05,1.279850000000000000e+05,1.198880000000000000e+05,1.117730000000000000e+05,1.005850000000000000e+05,9.038500000000000000e+04,7.532400000000000000e+04,6.235500000000000000e+04,5.249600000000000000e+04,4.445600000000000000e+04,3.808000000000000000e+04,3.612100000000000000e+04,3.460600000000000000e+04,3.209700000000000000e+04,3.008200000000000000e+04,3.090700000000000000e+04,3.208600000000000000e+04,2.949700000000000000e+04,3.111600000000000000e+04,3.125700000000000000e+04,3.152700000000000000e+04,3.198700000000000000e+04,3.373800000000000000e+04,3.171200000000000000e+04,3.124900000000000000e+04,3.109700000000000000e+04,3.002200000000000000e+04,2.720100000000000000e+04,2.413600000000000000e+04,1.873100000000000000e+04,1.768900000000000000e+04,1.510600000000000000e+04,1.358800000000000000e+04,1.354400000000000000e+04,1.198900000000000000e+04,1.182800000000000000e+04,6.926000000000000000e+03,1.230000000000000000e+04,3.734000000000000000e+03,6.631000000000000000e+03,7.085000000000000000e+03,7.151000000000000000e+03,7.195000000000000000e+03,7.265000000000000000e+03,6.966000000000000000e+03,6.823000000000000000e+03,6.357000000000000000e+03,5.977000000000000000e+03,5.464000000000000000e+03,4.941000000000000000e+03,4.543000000000000000e+03,3.992000000000000000e+03,3.593000000000000000e+03,3.156000000000000000e+03,2.955000000000000000e+03,2.740000000000000000e+03,2.701000000000000000e+03,2.528000000000000000e+03,2.481000000000000000e+03,2.527000000000000000e+03,2.476000000000000000e+03,2.456000000000000000e+03,2.461000000000000000e+03,2.420000000000000000e+03,2.346000000000000000e+03,2.326000000000000000e+03,2.278000000000000000e+03,2.108000000000000000e+03,1.893000000000000000e+03,1.771000000000000000e+03,1.654000000000000000e+03,1.547000000000000000e+03,1.389000000000000000e+03,1.325000000000000000e+03,1.130000000000000000e+03,1.057000000000000000e+03,9.460000000000000000e+02,9.790000000000000000e+02,8.990000000000000000e+02,8.460000000000000000e+02,8.360000000000000000e+02,8.040000000000000000e+02,8.330000000000000000e+02,7.690000000000000000e+02,7.020000000000000000e+02,7.360000000000000000e+02,6.390000000000000000e+02,6.690000000000000000e+02,6.770000000000000000e+02,6.100000000000000000e+02,5.700000000000000000e+02])
def func(x, a, c, d):
    return a * np.exp(-c * x) + d

popt, pcov = curve_fit(func, x, y, p0=(1, 0.01, 1))
yy = func(x, *popt)

plt.plot(x, y, 'ko')
plt.plot(x, yy)
# Gaussians
from sklearn import mixture
import scipy.stats

gmm = mixture.GaussianMixture(n_components=5, covariance_type='full')  # GMM in older sklearn
gmm.fit(y.reshape(-1, 1))  # fit() expects a 2-D array of samples
# The standard deviation is the square root of the covariance, not its square.
pdfs = [p * scipy.stats.norm.pdf(x, m, sd)
        for m, sd, p in zip(gmm.means_.ravel(),
                            np.sqrt(gmm.covariances_.ravel()),
                            gmm.weights_)]
density = np.sum(np.array(pdfs), axis=0)
plt.plot(x, density)
plt.show()
If you do not mind using least squares as opposed to maximum likelihood, I would suggest fitting the whole model at once, including the exponential, with e.g. scipy's curve_fit. You will never get a good fit to the exponential if you ignore the existence of the Gaussian peaks. I recommend peak-o-mat (http://lorentz.sf.net), which is an interactive curve fitting software written in Python. Within seconds you can get a result like this:
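A minimal curve_fit sketch of fitting everything at once, i.e. an exponential background plus five Gaussians (18 parameters in total; the starting centers in mu0 are placeholders that would have to be read off the actual peak positions):

def total_model(x, a, c, d, *params):
    # exponential background plus five Gaussians; params packs
    # (amplitude, center, width) for each Gaussian
    out = a * np.exp(-c * x) + d
    for i in range(5):
        amp, mu, sd = params[3*i:3*i + 3]
        out = out + amp * np.exp(-(x - mu)**2 / (2 * sd**2))
    return out

mu0 = [150, 300, 450, 600, 750]  # placeholder peak centers
p0 = [3e5, 0.01, 0.0] + sum([[1e4, m, 30.0] for m in mu0], [])
popt, pcov = curve_fit(total_model, x, y, p0=p0, maxfev=20000)
plt.plot(x, y, 'ko', x, total_model(x, *popt), 'r-')
plt.show()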
I have two variables x and y which I am trying to fit using curve_fit from scipy.optimize.
The equation that fits the data is a simple power law of the form y = a(x^b). The fit seems good when I set the x and y axes to log scale, i.e. ax.set_xscale('log') and ax.set_yscale('log').
Here is the code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def fitfunc(x, p1, p2):
    y = p1 * (x**p2)
    return y

popt_1, pcov_1 = curve_fit(fitfunc, x, y, p0=(1.0, 1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = y - fitfunc(x, p1_1, p1_2)
xi_sq_1 = sum(residuals1**2)  # the chi-square value
curve_y_1 = fitfunc(x, p1_1, p1_2)  # this is the fit line seen in the graph

fig = plt.figure(figsize=(14, 12))
ax1 = fig.add_subplot(111)
ax1.scatter(x, y, c='r')
ax1.plot(x, curve_y_1, 'y.', linewidth=1)  # plot the fit line against x
ax1.legend(loc='best', shadow=True, scatterpoints=1)
ax1.set_xscale('log')  # scale is set to log
ax1.set_yscale('log')  # scale is set to log
plt.show()
When I use true log-log values for x and y, the power-law fit becomes y = 10^(a + b*log(x)), i.e. raising 10 to the right-hand side since it is log base 10. Now both my x and y values are log(x) and log(y).
The fit for the above does not seem to be good. Here is the code I used:
def fitfunc(x, p1, p2):
    y = 10**(p1 + (p2 * x))
    return y

popt_1, pcov_1 = curve_fit(fitfunc, np.log10(x), np.log10(y), p0=(1.0, 1.0))
p1_1 = popt_1[0]
p1_2 = popt_1[1]
residuals1 = y - fitfunc(x, p1_1, p1_2)
xi_sq_1 = sum(residuals1**2)
curve_y_1 = fitfunc(np.log10(x), p1_1, p1_2)  # the fit line uses log(x) here itself

fig = plt.figure(figsize=(14, 12))
ax1 = fig.add_subplot(111)
ax1.scatter(np.log10(x), np.log10(y), c='r')
ax1.plot(np.log10(x), curve_y_1, 'y.', linewidth=1)
plt.show()
The only difference between the two plots is the fitting equation; for the second plot the x and y values have been logged independently. Am I doing something wrong here? I want a log(x) vs log(y) plot and the corresponding fit parameters (slope and intercept).
Your transformation of the power-law model to log-log is wrong, i.e. your second fit actually fits a different model. Take your original model y = a*(x^b) and apply the logarithm to both sides: you get log(y) = log(a) + b*log(x). Thus, your model in log scale should simply read y' = a' + b*x', where the primes indicate variables in log scale. The model is now a linear function, a well-known result: all power laws become linear functions in log-log.
That said, you can still expect some small differences in the two versions of your fit, since curve_fit will optimise the least-squares problem. Therefore, in log scale, the fit will minimise the relative error between the fit and the data, while in linear scale, the fit will minimise the absolute error. Thus, in order to decide which way is actually the better domain for your fit, you will have to estimate the error in your data. The data you show certainly does not have a constant uncertainty in log-scale, so on linear scale your fit might be more faithful. If details about the error in each data-point are known, then you could consider using the sigma parameter. If that one is used properly, there should not be much difference in the two approaches. In that case, I would prefer the log-scale fitting, as the model is simpler and therefore likely to be more numerically stable.
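A minimal sketch of the corrected log-space fit, i.e. a straight line in the logged variables (np.polyfit on the logged data would do the same job):

def linfunc(xp, ap, b):
    # y' = a' + b*x' in log space, where a' = log10(a)
    return ap + b * xp

popt, pcov = curve_fit(linfunc, np.log10(x), np.log10(y), p0=(1.0, 1.0))
a = 10**popt[0]  # back-transform the intercept to the power-law prefactor
b = popt[1]      # the exponent is just the slope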
Firstly, this is an assignment I've been set, so I'm only after pointers, and I am restricted to the following libraries: NumPy, SciPy and Matplotlib.
We have been given a txt file with x and y data from a resonance experiment, and we have to fit both a Gaussian and a Lorentzian to it. I'm working on the Gaussian fit at the minute, and have tried following the code laid out in a previous question (Gaussian fit for Python) as a basis for my own code.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

energy, intensity = np.loadtxt('resonance_data.txt', unpack=True)

mean = 30.7
sigma = 10
intensity0 = 45

def gaus(energy, intensity0, energy0, sigma):
    return intensity0 * np.exp(-(energy - energy0)**2 / (sigma**2))

popt, pcov = curve_fit(gaus, energy, intensity, p0=[intensity0, mean, sigma])

plt.plot(energy, intensity, 'o')
plt.xlabel('Energy/eV')
plt.ylabel('Intensity')
plt.title('Plot of Intensity against Energy')
plt.plot(energy, gaus(energy, *popt))
plt.show()
Which returns the following graph
If I keep the expressions for mean and sigma as in the linked post, the fitted curve is a horizontal line, so I'm guessing the problem lies in the curve fit not converging, or something like that.
It looks like your data skews heavily to the left; why a Gaussian? Why not a Boltzmann, a log-normal, or something else?
Many of these are already implemented in scipy.stats. See scipy.stats.cauchy for the Lorentzian and scipy.stats.norm for the Gaussian. An example:
import scipy.stats as ss

# generate a random variable of 100 elements with expected mean=0, std=5
A = ss.norm.rvs(0, 5, size=100)
ss.norm.fit_loc_scale(A)  # fit both the mean and std
# (-0.13053732553697531, 5.163322485150271)  <- your numbers will vary
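Since a Lorentzian fit is also required, the same pattern applies with ss.cauchy (a sketch, run here on the random sample above):

loc, scale = ss.cauchy.fit(A)  # maximum-likelihood location and scale of a Lorentzian (Cauchy)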
And I think you don't need the intensity0 parameter: it is just going to be 1/(sigma*sqrt(2*pi)), because the density function has to integrate to 1.