I need to fit a sine curve created from two sine waves and extract the parameters for the fitted curve (such as frequency, amplitude, etc).
Data example:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
x = np.arange(0, 50, 0.01)
x2 = np.arange(0, 100, 0.02)
x3 = np.arange(0, 150, 0.03)
sin1 = np.sin(x)
sin2 = np.sin(x2)
sin3= np.sin(x3/2)
sin4 = sin1 + sin2+sin3
plt.plot(x, sin4)
plt.show()
I used the code provided in this answer.
yy = sin4
tt = x
res = fit_sin(tt, yy)
print(str(i), "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
fit_values=res["fitfunc"](tt)
Frequenc_fit= res['freq']
print(i, Frequenc_fit)
Frequenc_fit=Frequenc_fit
Amp_fit=res['amp']
Omega_fit=res['omega']
Phase_fit=res['phase']
Offset_fit=res['offset']
maxcov_fit=res['maxcov']
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt,fit_values, "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
I got a fitted sine curve with a single frequency and amplitude as follows:
2 Amplitude=1.0149282025860233, Angular freq.=2.01112187048004, phase=-0.2730905030152767, offset=0.003304158823058212, Max. Cov.=0.0015266032307905222
2 0.3200799868471169
Is there a method to obtain a fitted curve that matches the original one?
Supposing that the function to be fitted is
y(x)=a * sin( w * x )+b * sin( W * x )
the principle of the method below is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
The graphical representation of the result is :
Blue curve : From data obtained by scanning the graph given in the question.
Black curve : From the above calculus.
The available data were not accurate because they come from scanning the original figure. The deviation is mainly due to the numerical integrations used to compute the values of SS and SSSS (four successive numerical integrations are not accurate, especially with biased data).
Probably the correct result should be : w=2 , W=1 , a=1 , b=1.
NOTE : The above method is not iterative and thus doesn't require guessed values of the parameters to start an iterative process. The approximate parameter values it yields can serve as good initial values for an iterative non-linear regression.
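As a rough illustration of that last point, the sketch below refines approximate parameter values with scipy.optimize.curve_fit. The data are synthetic (generated from the presumed result w=2, W=1, a=1, b=1 plus a little noise), and the starting values in p0 are made-up stand-ins for the output of the non-iterative step, not numbers taken from the figure.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_sines(x, a, w, b, W):
    # Model from the answer: y(x) = a*sin(w*x) + b*sin(W*x)
    return a * np.sin(w * x) + b * np.sin(W * x)

# Synthetic data (assumption): true parameters a=1, w=2, b=1, W=1
rng = np.random.default_rng(0)
x = np.arange(0, 50, 0.01)
y = two_sines(x, 1.0, 2.0, 1.0, 1.0) + rng.normal(scale=0.05, size=x.size)

# Hypothetical approximate values, standing in for the non-iterative estimate
p0 = [0.9, 2.02, 1.1, 0.99]

# Iterative non-linear least squares started from p0
popt, pcov = curve_fit(two_sines, x, y, p0=p0)
a_fit, w_fit, b_fit, W_fit = popt
print("a=%.4f  w=%.4f  b=%.4f  W=%.4f" % tuple(popt))
```

The frequencies in p0 must already be fairly close to the true ones: with sinusoids, a poor frequency guess tends to leave the optimizer stuck in a local minimum, which is why a good non-iterative first estimate is useful.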
NOTE : If the values of w and W were known a priori, solving by linear regression would be very simple and much more accurate (only the last 2x2 matrix calculation shown above).
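A minimal sketch of that simple case, assuming w=2 and W=1 are known and using synthetic data: the model is then linear in a and b, so the fit reduces to a 2x2 system of normal equations. The variable names are illustrative only.

```python
import numpy as np

# Assumption: the two angular frequencies are known in advance
w, W = 2.0, 1.0

# Synthetic data with a=1, b=1 and a little noise
rng = np.random.default_rng(1)
x = np.arange(0, 50, 0.01)
y = 1.0 * np.sin(w * x) + 1.0 * np.sin(W * x) + rng.normal(scale=0.05, size=x.size)

# Design matrix: one column per basis function
S = np.column_stack([np.sin(w * x), np.sin(W * x)])

# Normal equations (S^T S) [a, b]^T = S^T y, a 2x2 linear system
a_fit, b_fit = np.linalg.solve(S.T @ S, S.T @ y)
print(a_fit, b_fit)
```

No initial guess is needed here, and np.linalg.lstsq(S, y, rcond=None) would give the same coefficients.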
Below (blue dashed line) is what I get when I try to do linear regression on my data. It looks very off (but maybe it's correct?). Here is the image (the site does not allow me to embed it):
and here is the code:
mm, cs, err = get_cols(data)
a = np.asarray(mm, dtype=float)
b = np.asarray(cs, dtype=float)
ax.errorbar(a, b, xerr=None, yerr=err, fmt='o', c='b', label='Detection Rate')
logB = np.log10(b)
m, y0 = np.polyfit(a, logB, 1)
ax.plot(a, np.exp(a*m+y0), '--')
The log scale of matplotlib uses the logarithm of base 10 by default. It therefore makes sense to use np.log10(b) to transform the data for fitting.
However, once fitting is done, the data needs to be backtransformed using the inverse of the transformation function.
In case of y = log10(x) the inverse is x = 10**(y), while
in case of y = log(x) the inverse is x = exp(y).
So you need to decide on one of the two cases and use the same base for both the transformation and its inverse.
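A small sketch of both consistent choices, using made-up data that follow an exact exponential law (the real mm/cs columns are not available here). It also shows the mismatch from the question, where a log10 fit is inverted with np.exp.

```python
import numpy as np

# Made-up data (assumption): log10(b) is exactly linear in a
a = np.linspace(4.0, 8.0, 20)
b = 10 ** (-0.5 * a + 3.0)

# Case 1: fit in log10 space, invert with 10**
m, y0 = np.polyfit(a, np.log10(b), 1)
fit_correct = 10 ** (a * m + y0)

# Mismatch from the question: log10 fit inverted with exp
fit_wrong = np.exp(a * m + y0)

# Case 2: fit in natural-log space, invert with exp
m_ln, y0_ln = np.polyfit(a, np.log(b), 1)
fit_ln = np.exp(a * m_ln + y0_ln)

print(np.allclose(fit_correct, b), np.allclose(fit_ln, b), np.allclose(fit_wrong, b))
```

Either consistent pair reproduces the data; only the mixed pair gives a curve that is off.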
I am trying fit an exponential function and 5 Gaussians to my data. What I am aiming for is something along these lines: (where gDNA Fit is the exponential; 1-5Nuc Fit are the 5 Gaussians; Total fit is the sum of all the fits)
The way I approached it was fitting the exponential and then based on that introduce a cut-off that would allow me to fit the gaussians without taking into consideration the already fitted data. (I have already cut the data at 100 as this is where it dips down to 0)
The problem is I don't seem to be able to fit the exponential properly, and the Gaussians are off the scale:
from scipy.optimize import curve_fit
from pylab import *
import matplotlib.pyplot
#Exponential
x = np.array([101, 110, 119, 128, 137, 146, 155, 164, 173, 182,
    191, 200, 209, 218, 227, 236, 245, 254, 263, 272,
    281, 290, 299, 308, 317, 326, 335, 344, 353, 362,
    371, 380, 389, 398, 407, 416, 425, 434, 443, 452,
    461, 470, 479, 488, 497, 506, 515, 524, 533, 542,
    551, 560, 569, 578, 587, 596, 605, 614, 623, 632,
    641, 650, 659, 668, 677, 686, 695, 704, 713, 722,
    731, 740, 749, 758, 767, 776, 785, 794, 803, 812,
    821, 830, 839, 848, 857, 866, 875, 884, 893, 902,
    911, 920, 929, 938, 947, 956, 965, 974, 983, 992], dtype=float)
y = np.array([357928, 317229, 175961, 135261, 106913, 97210, 99082, 116848, 126688, 126476,
    127985, 119888, 111773, 100585, 90385, 75324, 62355, 52496, 44456, 38080,
    36121, 34606, 32097, 30082, 30907, 32086, 29497, 31116, 31257, 31527,
    31987, 33738, 31712, 31249, 31097, 30022, 27201, 24136, 18731, 17689,
    15106, 13588, 13544, 11989, 11828, 6926, 12300, 3734, 6631, 7085,
    7151, 7195, 7265, 6966, 6823, 6357, 5977, 5464, 4941, 4543,
    3992, 3593, 3156, 2955, 2740, 2701, 2528, 2481, 2527, 2476,
    2456, 2461, 2420, 2346, 2326, 2278, 2108, 1893, 1771, 1654,
    1547, 1389, 1325, 1130, 1057, 946, 979, 899, 846, 836,
    804, 833, 769, 702, 736, 639, 669, 677, 610, 570], dtype=float)
def func(x, a, c, d):
    return a*np.exp(-c*x)+d
#print np.exp(-x)
popt, pcov = curve_fit(func, x, y, p0=(1, 0.01, 1))
yy = func(x, *popt)
matplotlib.pyplot.plot(x, y, 'ko')
matplotlib.pyplot.plot(x, yy)
#gaussian
from sklearn import mixture
import scipy
gmm = mixture.GMM(n_components=5, covariance_type='full')
gmm.fit(y)
pdfs = [p * scipy.stats.norm.pdf(x, mu, sd) for mu, sd, p in zip(gmm.means_, (gmm.covars_)**2, gmm.weights_)]
density = np.sum(np.array(pdfs), axis=0)
#print density
matplotlib.pyplot.plot(x, density)
show()
If you do not mind using least squares as opposed to maximum likelihood, I would suggest fitting the whole model at once, including the exponential, with e.g. scipy's curve_fit. You will never get a good fit to the exponential if you ignore the existence of the Gaussian peaks. I recommend peak-o-mat (http://lorentz.sf.net), which is an interactive curve fitting software written in Python. Within seconds you can get a result like this:
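For the curve_fit route, a minimal sketch of fitting everything at once might look like the following. It uses small synthetic data with two Gaussians rather than the asker's data with five, and the function names, true parameters and starting values are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

def model(x, a, c, d, *peaks):
    # Exponential background plus any number of Gaussians;
    # peaks is a flat sequence (amp1, mu1, sigma1, amp2, mu2, sigma2, ...)
    y = a * np.exp(-c * x) + d
    for k in range(0, len(peaks), 3):
        y = y + gaussian(x, *peaks[k:k + 3])
    return y

# Synthetic data (assumption): exponential + two Gaussians + noise
true_params = [5.0, 0.5, 0.2, 2.0, 3.0, 0.4, 1.0, 6.0, 0.6]
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 400)
y = model(x, *true_params) + rng.normal(scale=0.02, size=x.size)

# curve_fit needs p0 here, because the number of parameters of a
# *args model cannot be inferred from its signature
p0 = [4.0, 0.4, 0.0, 1.5, 3.2, 0.5, 0.8, 5.8, 0.5]
popt, pcov = curve_fit(model, x, y, p0=p0)
print(np.round(popt, 3))
```

For the real data, extending p0 to five (amp, mu, sigma) triples read off the plot would follow the same pattern; the quality of those starting values matters a good deal for convergence.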
I'm trying to fit a histogram with some data in it using scipy.optimize.curve_fit. If I want to add an error in y, I can simply do so by applying a weight to the fit. But how to apply the error in x (i. e. the error due to binning in case of histograms)?
My question also applies to errors in x when making a linear regression with curve_fit or polyfit; I know how to add errors in y, but not in x.
Here is an example (partly from the matplotlib documentation):
import numpy as np
import pylab as P
from scipy.optimize import curve_fit
# create the data histogram
mu, sigma = 200, 25
x = mu + sigma*P.randn(10000)
# define fit function
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2*sigma**2))
# the histogram of the data
n, bins, patches = P.hist(x, 50, histtype='step')
sigma_n = np.sqrt(n) # Adding Poisson errors in y
bin_centres = (bins[:-1] + bins[1:])/2
sigma_x = (bins[1] - bins[0])/np.sqrt(12) # Binning error in x
P.setp(patches, 'facecolor', 'g', 'alpha', 0.75)
# fitting and plotting
p0 = [700, 200, 25]
popt, pcov = curve_fit(gauss, bin_centres, n, p0=p0, sigma=sigma_n, absolute_sigma=True)
x = np.arange(100, 300, 0.5)
fit = gauss(x, *popt)
P.plot(x, fit, 'r--')
Now, this fit (when it doesn't fail) does consider the y-errors sigma_n, but I haven't found a way to make it consider sigma_x. I scanned a couple of threads on the scipy mailing list and found out how to use the absolute_sigma value and a post on Stackoverflow about asymmetrical errors, but nothing about errors in both directions. Is it possible to achieve?
scipy.optimize.curve_fit uses standard non-linear least squares optimization and therefore only minimizes the deviation in the response variable. If you want errors in the independent variable to be considered, you can try scipy.odr, which uses orthogonal distance regression. As its name suggests, it minimizes the deviation in both the independent and the dependent variables.
Have a look at the sample below. The fit_type parameter determines whether scipy.odr does full ODR (fit_type=0) or least squares optimization (fit_type=2).
EDIT
Although the example worked, it did not make much sense, since the y data were calculated from the noisy x data, which just resulted in an unequally spaced independent variable. I updated the sample, which now also shows how to use RealData, which allows specifying the standard error of the data instead of the weights.
from scipy.odr import ODR, Model, Data, RealData
import numpy as np
from pylab import *
def func(beta, x):
    y = beta[0]+beta[1]*x+beta[2]*x**3
    return y
#generate data
x = np.linspace(-3,2,100)
y = func([-2.3,7.0,-4.0], x)
# add some noise
x += np.random.normal(scale=0.3, size=100)
y += np.random.normal(scale=0.1, size=100)
data = RealData(x, y, 0.3, 0.1)
model = Model(func)
odr = ODR(data, model, [1,0,0])
odr.set_job(fit_type=2)
output = odr.run()
xn = np.linspace(-3,2,50)
yn = func(output.beta, xn)
# hold(True) is not needed: it was removed in Matplotlib 3.0, and overplotting is the default
plot(x,y,'ro')
plot(xn,yn,'k-',label='leastsq')
odr.set_job(fit_type=0)
output = odr.run()
yn = func(output.beta, xn)
plot(xn,yn,'g-',label='odr')
legend(loc=0)
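Applied to the histogram from the question, the same approach could be sketched as below. This is an illustration under assumptions, not a tested recipe for real data: the samples are regenerated with a fixed seed, np.histogram replaces P.hist so no plot is needed, and empty bins are dropped because a zero standard error in y would give an infinite weight.

```python
import numpy as np
from scipy.odr import ODR, Model, RealData

# Regenerate data like the question's (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
samples = rng.normal(200, 25, 10000)
n, bins = np.histogram(samples, bins=50)
centres = (bins[:-1] + bins[1:]) / 2

# Keep only non-empty bins so that sqrt(n) is never zero
mask = n > 0
sigma_n = np.sqrt(n[mask])                     # Poisson error in y
sigma_x = (bins[1] - bins[0]) / np.sqrt(12)    # binning error in x

def gauss(beta, x):
    # scipy.odr passes the parameter vector first, then x
    A, mu, sigma = beta
    return A * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

data = RealData(centres[mask], n[mask], sx=sigma_x, sy=sigma_n)
odr = ODR(data, Model(gauss), beta0=[700, 200, 25])
output = odr.run()                             # full ODR is the default fit type
A_fit, mu_fit, sigma_fit = output.beta
print(output.beta, output.sd_beta)
```

output.sd_beta holds the estimated standard errors of the fitted parameters, which now reflect the uncertainties in both x and y.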
Below is an example of using curve_fit from SciPy based on a linear equation. My understanding of curve fitting in general is that it takes a plot of random points and creates a curve to show the "best fit" to a series of data points. My question is about what scipy's curve_fit returns:
"Optimal values for the parameters so that the sum of the squared error of f(xdata, *popt) - ydata is minimized".
What exactly do these two values mean in simple English? Thanks!
import numpy as np
from scipy.optimize import curve_fit
# Creating a function to model and create data
def func(x, a, b):
    return a * x + b
# Generating clean data
x = np.linspace(0, 10, 100)
y = func(x, 1, 2)
# Adding noise to the data
yn = y + 0.9 * np.random.normal(size=len(x))
# Executing curve_fit on noisy data
popt, pcov = curve_fit(func, x, yn)
# popt returns the best fit values for parameters of
# the given model (func).
print(popt)
You're asking SciPy to tell you the "best" line through a set of pairs of points (x, y).
Here's the equation of a straight line:
y = a*x + b
The slope of the line is a; the y-intercept is b.
You have two parameters, a and b, so you only need two equations to solve for two unknowns. Two points define a line, right?
So what happens when you have more than two points? You can't go through all the points. How do you choose the slope and intercept to give you the "best" line?
One way to define "best" is to calculate the slope and intercept that minimize the sum of the squared differences between each y value and the predicted y at that x on the line:
error = sum[(y(i) - (a*x(i) + b))^2]
It's an easy exercise if you know calculus: take the first derivatives of error w.r.t. a and b and set them equal to zero. You'll have two equations with two unknowns, a and b. You solve them to get the coefficients for the "best" line.
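The exercise can be sketched in a few lines. Setting the two derivatives to zero gives two linear equations in a and b, whose solution is written out below and compared against np.polyfit as a reference. The data mimic the question's example; the seed and the variable names are arbitrary choices.

```python
import numpy as np

# Noisy data around y = 1*x + 2, as in the question
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 1.0 * x + 2.0 + 0.9 * rng.normal(size=x.size)

# d(error)/da = 0  ->  a*sum(x^2) + b*sum(x) = sum(x*y)
# d(error)/db = 0  ->  a*sum(x)   + b*n      = sum(y)
n = x.size
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()

# Solving the two equations for the two unknowns
a = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
b = (Sy - a * Sx) / n

# Reference: NumPy's own least-squares line fit
a_ref, b_ref = np.polyfit(x, y, 1)
print(a, b)
print(a_ref, b_ref)
```

The two printed pairs agree to floating-point precision, and for this straight-line model with unweighted data they are the same values curve_fit reports in popt.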