Fitting exponential and 5 Gaussians to data in Python
I am trying to fit an exponential function and 5 Gaussians to my data. What I am aiming for is something along these lines: (where gDNA Fit is the exponential; 1-5Nuc Fit are the 5 Gaussians; Total fit is the sum of all the fits)
The way I approached it was to fit the exponential first and then, based on that, introduce a cut-off so that I could fit the Gaussians without taking the already-fitted data into consideration. (I have already cut the data at 100, as this is where it dips down to 0.)
The problem is that I don't seem to be able to fit the exponential properly, and the Gaussians are off the scale:
from scipy.optimize import curve_fit
from pylab import *
import matplotlib.pyplot
#Exponential
x = np.array([1.010000000000000000e+02,1.100000000000000000e+02,1.190000000000000000e+02,1.280000000000000000e+02,1.370000000000000000e+02,1.460000000000000000e+02,1.550000000000000000e+02,1.640000000000000000e+02,1.730000000000000000e+02,1.820000000000000000e+02,1.910000000000000000e+02,2.000000000000000000e+02,2.090000000000000000e+02,2.180000000000000000e+02,2.270000000000000000e+02,2.360000000000000000e+02,2.450000000000000000e+02,2.540000000000000000e+02,2.630000000000000000e+02,2.720000000000000000e+02,2.810000000000000000e+02,2.900000000000000000e+02,2.990000000000000000e+02,3.080000000000000000e+02,3.170000000000000000e+02,3.260000000000000000e+02,3.350000000000000000e+02,3.440000000000000000e+02,3.530000000000000000e+02,3.620000000000000000e+02,3.710000000000000000e+02,3.800000000000000000e+02,3.890000000000000000e+02,3.980000000000000000e+02,4.070000000000000000e+02,4.160000000000000000e+02,4.250000000000000000e+02,4.340000000000000000e+02,4.430000000000000000e+02,4.520000000000000000e+02,4.610000000000000000e+02,4.700000000000000000e+02,4.790000000000000000e+02,4.880000000000000000e+02,4.970000000000000000e+02,5.060000000000000000e+02,5.150000000000000000e+02,5.240000000000000000e+02,5.330000000000000000e+02,5.420000000000000000e+02,5.510000000000000000e+02,5.600000000000000000e+02,5.690000000000000000e+02,5.780000000000000000e+02,5.870000000000000000e+02,5.960000000000000000e+02,6.050000000000000000e+02,6.140000000000000000e+02,6.230000000000000000e+02,6.320000000000000000e+02,6.410000000000000000e+02,6.500000000000000000e+02,6.590000000000000000e+02,6.680000000000000000e+02,6.770000000000000000e+02,6.860000000000000000e+02,6.950000000000000000e+02,7.040000000000000000e+02,7.130000000000000000e+02,7.220000000000000000e+02,7.310000000000000000e+02,7.400000000000000000e+02,7.490000000000000000e+02,7.580000000000000000e+02,7.670000000000000000e+02,7.760000000000000000e+02,7.850000000000000000e+02,7.940000000000000000e+02,8.030000000000000000e+02,8.120000000000000000e+02,8.210000000000000000e+02,8.300000000000000000e+02,8.390000000000000000e+02,8.480000000000000000e+02,8.570000000000000000e+02,8.660000000000000000e+02,8.750000000000000000e+02,8.840000000000000000e+02,8.930000000000000000e+02,9.020000000000000000e+02,9.110000000000000000e+02,9.200000000000000000e+02,9.290000000000000000e+02,9.380000000000000000e+02,9.470000000000000000e+02,9.560000000000000000e+02,9.650000000000000000e+02,9.740000000000000000e+02,9.830000000000000000e+02,9.920000000000000000e+02])
y = np.array([3.579280000000000000e+05,3.172290000000000000e+05,1.759610000000000000e+05,1.352610000000000000e+05,1.069130000000000000e+05,9.721000000000000000e+04,9.908200000000000000e+04,1.168480000000000000e+05,1.266880000000000000e+05,1.264760000000000000e+05,1.279850000000000000e+05,1.198880000000000000e+05,1.117730000000000000e+05,1.005850000000000000e+05,9.038500000000000000e+04,7.532400000000000000e+04,6.235500000000000000e+04,5.249600000000000000e+04,4.445600000000000000e+04,3.808000000000000000e+04,3.612100000000000000e+04,3.460600000000000000e+04,3.209700000000000000e+04,3.008200000000000000e+04,3.090700000000000000e+04,3.208600000000000000e+04,2.949700000000000000e+04,3.111600000000000000e+04,3.125700000000000000e+04,3.152700000000000000e+04,3.198700000000000000e+04,3.373800000000000000e+04,3.171200000000000000e+04,3.124900000000000000e+04,3.109700000000000000e+04,3.002200000000000000e+04,2.720100000000000000e+04,2.413600000000000000e+04,1.873100000000000000e+04,1.768900000000000000e+04,1.510600000000000000e+04,1.358800000000000000e+04,1.354400000000000000e+04,1.198900000000000000e+04,1.182800000000000000e+04,6.926000000000000000e+03,1.230000000000000000e+04,3.734000000000000000e+03,6.631000000000000000e+03,7.085000000000000000e+03,7.151000000000000000e+03,7.195000000000000000e+03,7.265000000000000000e+03,6.966000000000000000e+03,6.823000000000000000e+03,6.357000000000000000e+03,5.977000000000000000e+03,5.464000000000000000e+03,4.941000000000000000e+03,4.543000000000000000e+03,3.992000000000000000e+03,3.593000000000000000e+03,3.156000000000000000e+03,2.955000000000000000e+03,2.740000000000000000e+03,2.701000000000000000e+03,2.528000000000000000e+03,2.481000000000000000e+03,2.527000000000000000e+03,2.476000000000000000e+03,2.456000000000000000e+03,2.461000000000000000e+03,2.420000000000000000e+03,2.346000000000000000e+03,2.326000000000000000e+03,2.278000000000000000e+03,2.108000000000000000e+03,1.893000000000000000e+03,1.771000000000000000e+03,1.654000000000000000e+03,1.547000000000000000e+03,1.389000000000000000e+03,1.325000000000000000e+03,1.130000000000000000e+03,1.057000000000000000e+03,9.460000000000000000e+02,9.790000000000000000e+02,8.990000000000000000e+02,8.460000000000000000e+02,8.360000000000000000e+02,8.040000000000000000e+02,8.330000000000000000e+02,7.690000000000000000e+02,7.020000000000000000e+02,7.360000000000000000e+02,6.390000000000000000e+02,6.690000000000000000e+02,6.770000000000000000e+02,6.100000000000000000e+02,5.700000000000000000e+02])
def func(x, a, c, d):
    return a*np.exp(-c*x)+d
#print np.exp(-x)
popt, pcov = curve_fit(func, x, y, p0=(1, 0.01, 1))
yy = func(x, *popt)
matplotlib.pyplot.plot(x, y, 'ko')
matplotlib.pyplot.plot(x, yy)
#gaussian
from sklearn import mixture
import scipy
gmm = mixture.GMM(n_components=5, covariance_type='full')
gmm.fit(y)
pdfs = [p * scipy.stats.norm.pdf(x, mu, sd) for mu, sd, p in zip(gmm.means_, (gmm.covars_)**2, gmm.weights_)]
density = np.sum(np.array(pdfs), axis=0)
#print density
matplotlib.pyplot.plot(x, density)
show()
If you do not mind using least squares as opposed to maximum likelihood, I would suggest fitting the whole model at once, including the exponential, with e.g. scipy's curve_fit. You will never get a good fit to the exponential if you ignore the existence of the Gaussian peaks. I recommend peak-o-mat (http://lorentz.sf.net), which is an interactive curve fitting program written in Python. Within seconds you can get a result like this:
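To make the single-model idea concrete, here is a minimal sketch (mine, not from the answer above) of one way to fit the exponential background and five Gaussians together with curve_fit, using the x and y arrays from the question; the starting values in p0 are rough eyeballed guesses and would need tuning to the real peak positions:

import numpy as np
from scipy.optimize import curve_fit

def total_model(x, a, c, d, *gauss_params):
    # exponential background plus one Gaussian per (amplitude, centre, width) triple
    y = a * np.exp(-c * x) + d
    for i in range(0, len(gauss_params), 3):
        amp, mu, sigma = gauss_params[i:i + 3]
        y += amp * np.exp(-(x - mu)**2 / (2 * sigma**2))
    return y

# exponential guess followed by (amplitude, centre, width) for each of the 5 peaks;
# these numbers are illustrative only
p0 = [3e5, 0.02, 1e3,
      1.2e5, 180, 30,
      3.2e4, 330, 40,
      1.2e4, 480, 40,
      7e3, 620, 40,
      2.5e3, 760, 50]
popt, pcov = curve_fit(total_model, x, y, p0=p0, maxfev=20000)

Because p0 is supplied, curve_fit accepts the variable-length parameter list; the total fit and the individual components can then be plotted from popt.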
Related
Python exponential curve fitting
I have added the Excel plot from which I get the exponential equation, and I am trying to curve fit this in Python. My fitted equation is not as close to the empirical data as I would like when I use it to predict the y data: the prediction gives f(-25) = 5.30e-11, while the empirical data give f(-25) = 5.3e-13. How can I improve the code so it predicts values closer to the empirical data, or have I made mistakes in my code?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.optimize as optimize
import scipy.stats as stats

pd.set_option('precision', 14)

def f(x, A, B):
    return A * np.exp((-B) * (x))

y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])
x = np.array([-28.8, -27.4, -26, -25.5, -25])

p, pcov = optimize.curve_fit(f, x, y_data, p0=[10**(-59), 4], maxfev=5000)

plt.figure()
plt.plot(x, y_data, 'ko', label="Empirical BER")
plt.plot(x, f(x, *p), 'g-', label="Fitted BER")
plt.title("BER")
plt.xlabel('Power Rx (dB)')
plt.ylabel('')
plt.legend()
plt.grid()
plt.yscale("log")
plt.show()
Since you are plotting the data with a log-plot, your view of the data and fit is emphasizing the "tiny" compared to the "small". Fitting uses the sum of the squares of the misfit to determine the best fit. A misfit of a few percent of the data with a y-value of ~2e-5 would completely swamp a misfit of a factor of 10 or even 100 for the data with a y-value of 1.e-11. Your plot is consistent with that.

There are two possible routes to a better fit:

a) if you have uncertainties in the y-values, use those. It's quite possible that the uncertainty in the data with y~2e-5 is much larger than the uncertainty in the data with y~1.e-11, and scaling by the uncertainty so that the minimization is of the sum-of-squares of (data-model)/uncertainty will help fit the low-value data better. OTOH, if the errors are constant, plotting those uncertainties might show that the fit you have is actually not that bad -- the misfit where y~1.e-11 is only 1.e-10.

b) realize that you are assessing the fit quality by plotting the log of the data, and embrace that observation so that you fit log(data) to log(model). Conveniently, for a simple exponential function the log of the model is linear, so you could do linear regression on the log of your data.

Bonus round: recognize that options a) and b) are related. Since a fit minimizes Sum[((data-model)/uncertainty)**2], not providing values for uncertainty is effectively saying that the data has the same uncertainty (=1.0, in fact) for all values of x and y. Fitting the log of the model to the log of the data, as with Sum[(log(data) - log(model))**2], is effectively saying that the uncertainty in log(data) is the same for all values of x and y.
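As a concrete sketch of route (b) (my illustration, not part of the answer above): because log(A*exp(-B*x)) = log(A) - B*x, a plain straight-line fit to the log of the data recovers both parameters; the variable names match the question:

import numpy as np

x = np.array([-28.8, -27.4, -26.0, -25.5, -25.0])
y_data = np.array([2.156e-05, 1.85e-07, 1.02e-10, 1.268e-11, 5.352e-13])

# straight-line fit in log space: log(y) = log(A) - B*x
slope, intercept = np.polyfit(x, np.log(y_data), 1)
A = np.exp(intercept)
B = -slope  # matches the question's model y = A * exp(-B * x)

print(A, B)
print(A * np.exp(-B * x))  # predictions now weight all decades of y equally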
In Python, how to perform logistic regression for data set containing very large values of x and very small values of y?
I am trying to fit a logistic function to a data set containing very large x values (1000s) and very small y values (1e-4). As shown in the code below, if I execute the code the interpreter returns RuntimeError: Optimal parameters not found: Number of calls to function has reached maxfev = 1000. If I multiply the ydata by 1000, then curve_fit successfully fits it, but then I would have to divide some of the fitted values by 1000. Is there a way to curve fit these extreme values without changing the original values?

import numpy as np
from scipy.optimize import leastsq, curve_fit
import matplotlib.pyplot as plt

def logistic(x, N, A, b, y0):
    return N / (1 + A*b**-x) + y0

xdata = np.array([100, 250, 500, 750, 1000, 1250, 1500])
ydata = np.array([0, 1e-6, 6.5e-5, 1.5e-4, 4.2e-4, 5.5e-4, 5.8e-4])

popt, pcov = curve_fit(logistic, xdata, ydata)

x = np.linspace(0, 2500, 50)
y = logistic(x, *popt)

plt.plot(xdata, ydata, 'o', label='data')
plt.plot(x, y, label='fit')
plt.legend(loc='best')
plt.show()

Also, would it be possible to fit the curve using the exponential form of the logistic function (https://en.wikipedia.org/wiki/Logistic_function)?

def logistic_e(x, N, b, y0, x0):
    return N / (1 + np.exp(-b*(x-x0))) + y0

If I use logistic_e, then no matter how I modify the dataset I always receive

/usr/lib/python3/dist-packages/scipy/optimize/minpack.py:779: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)

Thank you for your support!
You could try transforming the scales, for example with StandardScaler or Normalizer for X and MinMaxScaler for y. After predicting y, you can call inverse_transform on the MinMaxScaler to rescale the predictions back into your range of interest.
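A minimal sketch of that idea (my illustration, not the answerer's code), reusing logistic, xdata and ydata from the question; the fit on the scaled data may still need starting values or a higher maxfev:

import numpy as np
from scipy.optimize import curve_fit
from sklearn.preprocessing import MinMaxScaler

# scale y into [0, 1] before fitting
scaler = MinMaxScaler()
y_scaled = scaler.fit_transform(ydata.reshape(-1, 1)).ravel()

popt, pcov = curve_fit(logistic, xdata, y_scaled, maxfev=10000)

# map the fitted curve back into the original units
y_fit = scaler.inverse_transform(logistic(xdata, *popt).reshape(-1, 1)).ravel()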
You have to keep adjusting your maxfev value. This is an extremely high number, so adjust it as fits:

popt, pcov = curve_fit(logistic, xdata, ydata, maxfev=1005000)
You can scale your data; for example, you can use sklearn's preprocessing module.
Correct fitting with scipy curve_fit including errors in x?
I'm trying to fit a histogram with some data in it using scipy.optimize.curve_fit. If I want to add an error in y, I can simply do so by applying a weight to the fit. But how to apply the error in x (i.e. the error due to binning in the case of histograms)? My question also applies to errors in x when making a linear regression with curve_fit or polyfit; I know how to add errors in y, but not in x. Here is an example (partly from the matplotlib documentation):

import numpy as np
import pylab as P
from scipy.optimize import curve_fit

# create the data histogram
mu, sigma = 200, 25
x = mu + sigma*P.randn(10000)

# define fit function
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2*sigma**2))

# the histogram of the data
n, bins, patches = P.hist(x, 50, histtype='step')
sigma_n = np.sqrt(n)                         # Poisson errors in y
bin_centres = (bins[:-1] + bins[1:])/2
sigma_x = (bins[1] - bins[0])/np.sqrt(12)    # binning error in x
P.setp(patches, 'facecolor', 'g', 'alpha', 0.75)

# fitting and plotting
p0 = [700, 200, 25]
popt, pcov = curve_fit(gauss, bin_centres, n, p0=p0, sigma=sigma_n, absolute_sigma=True)
x = np.arange(100, 300, 0.5)
fit = gauss(x, *popt)
P.plot(x, fit, 'r--')

Now, this fit (when it doesn't fail) does consider the y-errors sigma_n, but I haven't found a way to make it consider sigma_x. I scanned a couple of threads on the scipy mailing list and found out how to use the absolute_sigma value, and a post on Stack Overflow about asymmetrical errors, but nothing about errors in both directions. Is it possible to achieve?
scipy.optimize.curve_fit uses standard non-linear least squares optimization and therefore only minimizes the deviation in the response variables. If you want an error in the independent variable to be considered, you can try scipy.odr, which uses orthogonal distance regression. As its name suggests, it minimizes in both the independent and dependent variables. Have a look at the sample below. The fit_type parameter determines whether scipy.odr does full ODR (fit_type=0) or least squares optimization (fit_type=2).

EDIT: Although the example worked, it did not make much sense, since the y data was calculated on the noisy x data, which just resulted in an unequally spaced independent variable. I updated the sample, which now also shows how to use RealData, which allows for specifying the standard error of the data instead of the weights.

from scipy.odr import ODR, Model, Data, RealData
import numpy as np
from pylab import *

def func(beta, x):
    y = beta[0]+beta[1]*x+beta[2]*x**3
    return y

# generate data
x = np.linspace(-3, 2, 100)
y = func([-2.3, 7.0, -4.0], x)

# add some noise
x += np.random.normal(scale=0.3, size=100)
y += np.random.normal(scale=0.1, size=100)

data = RealData(x, y, 0.3, 0.1)
model = Model(func)

odr = ODR(data, model, [1, 0, 0])
odr.set_job(fit_type=2)
output = odr.run()

xn = np.linspace(-3, 2, 50)
yn = func(output.beta, xn)
hold(True)
plot(x, y, 'ro')
plot(xn, yn, 'k-', label='leastsq')

odr.set_job(fit_type=0)
output = odr.run()
yn = func(output.beta, xn)
plot(xn, yn, 'g-', label='odr')
legend(loc=0)
Confidence interval for exponential curve fit
I'm trying to obtain a confidence interval on an exponential fit to some x, y data (available here). Here's the MWE I have to find the best exponential fit to the data:

from pylab import *
from scipy.optimize import curve_fit

# Read data.
x, y = np.loadtxt('exponential_data.dat', unpack=True)

def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c

# Find best fit.
popt, pcov = curve_fit(func, x, y)
print popt

# Plot data and best fit curve.
scatter(x, y)
x = linspace(11, 23, 100)
plot(x, func(x, *popt), c='r')
show()

which produces:

How can I obtain the 95% (or some other value) confidence interval on this fit, preferably using either pure Python, NumPy or SciPy (which are the packages I already have installed)?
You can use the uncertainties module to do the uncertainty calculations. uncertainties keeps track of uncertainties and correlation. You can create correlated uncertainties.ufloat values directly from the output of curve_fit. To be able to do those calculations on non-builtin operations such as exp you need to use the functions from uncertainties.unumpy. You should also avoid your from pylab import * import; it even overwrites Python built-ins such as sum.

A complete example:

import numpy as np
from scipy.optimize import curve_fit
import uncertainties as unc
import matplotlib.pyplot as plt
import uncertainties.unumpy as unp

def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c

x, y = np.genfromtxt('data.txt', unpack=True)

popt, pcov = curve_fit(func, x, y)
a, b, c = unc.correlated_values(popt, pcov)

# Plot data and best fit curve.
plt.scatter(x, y, s=3, linewidth=0, alpha=0.3)

px = np.linspace(11, 23, 100)
# use unumpy.exp
py = a * unp.exp(b * px) + c
nom = unp.nominal_values(py)
std = unp.std_devs(py)

# plot the nominal value
plt.plot(px, nom, c='r')

# and the 2-sigma uncertainty lines
plt.plot(px, nom - 2 * std, c='c')
plt.plot(px, nom + 2 * std, c='c')

plt.savefig('fit.png', dpi=300)

And the result:
Gabriel's answer is incorrect. Here, in red, is the 95% confidence band for his data as calculated by GraphPad Prism:

Background: the "confidence interval of a fitted curve" is typically called a confidence band. For a 95% confidence band, one can be 95% confident that it contains the true curve. (This is different from prediction bands, shown above in gray. Prediction bands are about future data points. For more details, see, e.g., this page of the GraphPad Curve Fitting Guide.)

In Python, kmpfit can calculate the confidence band for non-linear least squares. Here for Gabriel's example:

from pylab import *
from kapteyn import kmpfit

x, y = np.loadtxt('_exp_fit.txt', unpack=True)

def model(p, x):
    a, b, c = p
    return a*np.exp(b*x)+c

f = kmpfit.simplefit(model, [.1, .1, .1], x, y)
print f.params

# confidence band
a, b, c = f.params
dfdp = [np.exp(b*x), a*x*np.exp(b*x), 1]
yhat, upper, lower = f.confidence_band(x, dfdp, 0.95, model)

scatter(x, y, marker='.', s=10, color='#0000ba')
ix = np.argsort(x)
for i, l in enumerate((upper, lower, yhat)):
    plot(x[ix], l[ix], c='g' if i == 2 else 'r', lw=2)
show()

The dfdp are the partial derivatives ∂f/∂p of the model f = a*e^(b*x) + c with respect to each parameter p (i.e., a, b, and c). For background, see the kmpfit Tutorial or this page of the GraphPad Curve Fitting Guide. (Unlike my sample code, the kmpfit Tutorial does not use confidence_band() from the library but its own, slightly different, implementation.)

Finally, the Python plot matches the Prism one:
Notice: the actual answer to obtaining the fitted curve's confidence interval is given by Ulrich here.

After some research (see here, here and 1.96) I came up with my own solution. It accepts an arbitrary X% confidence interval and plots upper and lower curves. Here's the MWE:

from pylab import *
from scipy.optimize import curve_fit
from scipy import stats

def func(x, a, b, c):
    '''Exponential 3-param function.'''
    return a * np.exp(b * x) + c

# Read data.
x, y = np.loadtxt('exponential_data.dat', unpack=True)

# Define confidence interval.
ci = 0.95
# Convert to percentile point of the normal distribution.
# See: https://en.wikipedia.org/wiki/Standard_score
pp = (1. + ci) / 2.
# Convert to number of standard deviations.
nstd = stats.norm.ppf(pp)
print nstd

# Find best fit.
popt, pcov = curve_fit(func, x, y)
# Standard deviation errors on the parameters.
perr = np.sqrt(np.diag(pcov))
# Add nstd standard deviations to parameters to obtain the upper confidence
# interval.
popt_up = popt + nstd * perr
popt_dw = popt - nstd * perr

# Plot data and best fit curve.
scatter(x, y)
x = linspace(11, 23, 100)
plot(x, func(x, *popt), c='g', lw=2.)
plot(x, func(x, *popt_up), c='r', lw=2.)
plot(x, func(x, *popt_dw), c='r', lw=2.)
text(12, 0.5, '{}% confidence interval'.format(ci * 100.))
show()
curve_fit() returns the covariance matrix, pcov, whose diagonal holds the estimated variances of the parameters (take the square root to get the 1-sigma uncertainties). This assumes errors are normally distributed, which is sometimes questionable. You might also consider using the lmfit package (pure Python, built on top of scipy), which provides a wrapper around scipy.optimize fitting routines (including leastsq(), which is what curve_fit() uses) and can, among other things, calculate confidence intervals explicitly.
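As a rough sketch of the lmfit route (my example, not the answerer's; the starting values are guesses and the data file name comes from the question):

import numpy as np
from lmfit import Model

def func(x, a, b, c):
    return a * np.exp(b * x) + c

x, y = np.loadtxt('exponential_data.dat', unpack=True)

# wrap the model function and fit with initial guesses for a, b, c
model = Model(func)
result = model.fit(y, x=x, a=1.0, b=0.1, c=0.0)
print(result.fit_report())

# explicit confidence intervals, beyond the 1-sigma covariance estimates
ci = result.conf_interval()
print(ci)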
I've always been a fan of simple bootstrapping to get confidence intervals. If you have n data points, then use the random package to select n points from your data WITH REPLACEMENT (i.e. allow your program to get the same point multiple times if that's what it wants to do - very important). Once you've done that, plot the resampled points and get the best fit. Do this 10,000 times, getting a new fit line each time. Then your 95% confidence interval is the pair of lines that enclose 95% of the best fit lines you made. It's a pretty easy method to program in Python, but it's a bit unclear how this would work out from a statistical point of view. Some more information on why you want to do this would probably lead to more appropriate answers for your task.
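A minimal sketch of that bootstrap idea (my illustration, assuming the func and data file from the question; pointwise percentiles of the resampled fits stand in for the enclosing pair of lines):

import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * np.exp(b * x) + c

x, y = np.loadtxt('exponential_data.dat', unpack=True)
popt, _ = curve_fit(func, x, y)

n_boot = 10000
xs = np.linspace(x.min(), x.max(), 100)
curves = np.empty((n_boot, xs.size))
rng = np.random.default_rng(0)

for i in range(n_boot):
    idx = rng.integers(0, len(x), len(x))      # resample with replacement
    try:
        p, _ = curve_fit(func, x[idx], y[idx], p0=popt, maxfev=5000)
    except RuntimeError:                       # skip resamples that fail to converge
        p = popt
    curves[i] = func(xs, *p)

# 95% band from the 2.5th and 97.5th percentiles of the fitted curves
lower = np.percentile(curves, 2.5, axis=0)
upper = np.percentile(curves, 97.5, axis=0)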
Fitting a Gaussian to a set of x,y data
Firstly, this is an assignment I've been set, so I'm only after pointers, and I am restricted to using the following libraries: NumPy, SciPy and Matplotlib. We have been given a txt file which includes x and y data for a resonance experiment, and we have to fit both a Gaussian and a Lorentzian. I'm working on the Gaussian fit at the minute and have tried following the code laid out in a previous question as a basis for my own code (Gaussian fit for Python).

from numpy import *
import numpy
from matplotlib import *
import matplotlib.pyplot as plt
import pylab
from scipy.optimize import curve_fit

energy, intensity = numpy.loadtxt('resonance_data.txt', unpack=True)

n = size(energy)
mean = 30.7
sigma = 10
intensity0 = 45

def gaus(energy, intensity0, energy0, sigma):
    return intensity0 * exp(-(energy - energy0)**2 / (sigma**2))

popt, pcov = curve_fit(gaus, energy, intensity, p0=[45, mean, sigma])

plt.plot(energy, intensity, 'o')
plt.xlabel('Energy/eV')
plt.ylabel('Intensity')
plt.title('Plot of Intensity against Energy')
plt.plot(energy, gaus(energy, *popt))
plt.show()

Which returns the following graph. If I keep the expressions for mean and sigma as in the URL posted, the curve fit is a horizontal line, so I'm guessing the problem lies in the curve fit not converging or something.
Looks like your data skews heavily to the left, so why a Gaussian? Why not a Boltzmann, log-normal, or anything else? Many of these are already implemented in scipy.stats. See scipy.stats.cauchy for the Lorentzian and scipy.stats.norm for the Gaussian. An example:

import scipy.stats as ss

# Generate a random variable of 100 elements, with expected mean=0, std=5
A = ss.norm.rvs(0, 5, size=(100))

# fit both the mean and std
ss.norm.fit_loc_scale(A)
# (-0.13053732553697531, 5.163322485150271)  -- your numbers will vary

And I think you don't need the intensity0 parameter; it is just going to be 1/sigma/sqrt(2*pi), because the density function has to sum up to 1.