I am trying to show that economies follow a relatively sinusoidal growth pattern. I am building a python simulation to show that even when we let some degree of randomness take hold, we can still produce something relatively sinusoidal.
I am happy with the data I'm producing, but now I'd like to find some way to get a sine graph that pretty closely matches the data. I know you can do polynomial fit, but can you do sine fit?
Here is a parameter-free fitting function fit_sin() that does not require a manual guess of the frequency:
import numpy, scipy.optimize

def fit_sin(tt, yy):
    '''Fit sin to the input time sequence, and return fitting parameters "amp", "omega", "phase", "offset", "freq", "period" and "fitfunc"'''
    tt = numpy.array(tt)
    yy = numpy.array(yy)
    ff = numpy.fft.fftfreq(len(tt), (tt[1]-tt[0]))   # assume uniform spacing
    Fyy = abs(numpy.fft.fft(yy))
    guess_freq = abs(ff[numpy.argmax(Fyy[1:])+1])    # exclude the zero-frequency "peak", which is related to the offset
    guess_amp = numpy.std(yy) * 2.**0.5
    guess_offset = numpy.mean(yy)
    guess = numpy.array([guess_amp, 2.*numpy.pi*guess_freq, 0., guess_offset])

    def sinfunc(t, A, w, p, c):
        return A * numpy.sin(w*t + p) + c

    popt, pcov = scipy.optimize.curve_fit(sinfunc, tt, yy, p0=guess)
    A, w, p, c = popt
    f = w/(2.*numpy.pi)
    fitfunc = lambda t: A * numpy.sin(w*t + p) + c
    return {"amp": A, "omega": w, "phase": p, "offset": c, "freq": f,
            "period": 1./f, "fitfunc": fitfunc, "maxcov": numpy.max(pcov),
            "rawres": (guess, popt, pcov)}
The initial frequency guess is taken from the peak of the FFT of the data. The fit is almost perfect as long as there is only one dominant frequency (other than the zero-frequency peak).
import pylab as plt
N, amp, omega, phase, offset, noise = 500, 1., 2., .5, 4., 3
#N, amp, omega, phase, offset, noise = 50, 1., .4, .5, 4., .2
#N, amp, omega, phase, offset, noise = 200, 1., 20, .5, 4., 1
tt = numpy.linspace(0, 10, N)
tt2 = numpy.linspace(0, 10, 10*N)
yy = amp*numpy.sin(omega*tt + phase) + offset
yynoise = yy + noise*(numpy.random.random(len(tt))-0.5)
res = fit_sin(tt, yynoise)
print( "Amplitude=%(amp)s, Angular freq.=%(omega)s, phase=%(phase)s, offset=%(offset)s, Max. Cov.=%(maxcov)s" % res )
plt.plot(tt, yy, "-k", label="y", linewidth=2)
plt.plot(tt, yynoise, "ok", label="y with noise")
plt.plot(tt2, res["fitfunc"](tt2), "r-", label="y fit curve", linewidth=2)
plt.legend(loc="best")
plt.show()
The result is good even with high noise:
Amplitude=1.00660540618, Angular freq.=2.03370472482, phase=0.360276844224, offset=3.95747467506, Max. Cov.=0.0122923578658
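If you want a quick check of how closely the returned curve tracks the data, the result dictionary already contains everything you need. A minimal sketch (reusing tt, yynoise and res from above; the R^2 computation is my addition, not part of the original function):

residuals = yynoise - res["fitfunc"](tt)
ss_res = numpy.sum(residuals**2)                         # residual sum of squares
ss_tot = numpy.sum((yynoise - numpy.mean(yynoise))**2)   # total sum of squares
print("R^2 = %.4f" % (1. - ss_res/ss_tot))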
You can use the least-squares optimization function in scipy to fit any arbitrary function to another. In the case of fitting a sine function, the 3 parameters to fit are the offset ('a'), amplitude ('b') and the phase ('c').
As long as you provide a reasonable first guess of the parameters, the optimization should converge well. Fortunately, for a sine function, first estimates of 2 of these are easy: the offset can be estimated by taking the mean of the data, and the amplitude via the RMS (3*standard deviation/sqrt(2)).
Note: as a later edit, frequency fitting was also added. This does not work very well (it can lead to extremely poor fits). Use it at your discretion; my advice would be not to use frequency fitting unless the frequency error is smaller than a few percent.
This leads to the following code:
import numpy as np
from scipy.optimize import leastsq
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.15247 # Optional!! Advised not to use
data = 3.0*np.sin(f*t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_mean = np.mean(data)
guess_std = 3*np.std(data)/(2**0.5)/(2**0.5)
guess_phase = 0
guess_freq = 1
guess_amp = 1
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_std*np.sin(t+guess_phase) + guess_mean
# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*np.sin(x[1]*t+x[2]) + x[3] - data
est_amp, est_freq, est_phase, est_mean = leastsq(optimize_func, [guess_amp, guess_freq, guess_phase, guess_mean])[0]
# recreate the fitted curve using the optimized parameters,
# on a finer time grid for plotting
fine_t = np.arange(0, max(t), 0.1)
data_fit = est_amp*np.sin(est_freq*fine_t + est_phase) + est_mean
plt.plot(t, data, '.')
plt.plot(t, data_first_guess, label='first guess')
plt.plot(fine_t, data_fit, label='after fitting')
plt.legend()
plt.show()
Edit: I assumed that you know the number of periods in the sine wave. If you don't, it's somewhat trickier to fit. You can try to guess the number of periods by manual plotting and then optimize it as your 6th parameter; a brute-force frequency scan like the sketch below can also work.
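Here is such a sketch (my own addition, reusing optimize_func and the guesses from the code above; the grid bounds are an illustrative choice): scan a grid of candidate frequencies, run the fit from each one, and keep the result with the smallest residual.

# Sketch: scan candidate frequencies and keep the best local fit
best = None
for freq_try in np.linspace(0.1, 5.0, 50):
    p, ier = leastsq(optimize_func, [guess_amp, freq_try, guess_phase, guess_mean])
    resid = np.sum(optimize_func(p)**2)   # sum of squared residuals for this start
    if best is None or resid < best[0]:
        best = (resid, p)
est_amp, est_freq, est_phase, est_mean = best[1]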
More user-friendly is the function curve_fit. Here is an example:
import numpy as np
from scipy.optimize import curve_fit
import pylab as plt
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
data = 3.0*np.sin(t+0.001) + 0.5 + np.random.randn(N) # create artificial data with noise
guess_freq = 1
guess_amplitude = 3*np.std(data)/(2**0.5)
guess_phase = 0
guess_offset = np.mean(data)
p0 = [guess_freq, guess_amplitude, guess_phase, guess_offset]

# create the function we want to fit
def my_sin(x, freq, amplitude, phase, offset):
    return np.sin(x * freq + phase) * amplitude + offset
# now do the fit
fit = curve_fit(my_sin, t, data, p0=p0)
# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = my_sin(t, *p0)
# recreate the fitted curve using the optimized parameters
data_fit = my_sin(t, *fit[0])
plt.plot(data, '.')
plt.plot(data_fit, label='after fitting')
plt.plot(data_first_guess, label='first guess')
plt.legend()
plt.show()
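curve_fit also returns the covariance matrix of the fitted parameters, so you can report one-standard-deviation uncertainties almost for free. A short sketch using the fit result from above:

# One-sigma parameter uncertainties from the covariance matrix
popt, pcov = fit
perr = np.sqrt(np.diag(pcov))
for name, val, err in zip(['freq', 'amplitude', 'phase', 'offset'], popt, perr):
    print('%s = %.5f +/- %.5f' % (name, val, err))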
The current methods to fit a sine curve to a given data set require a first guess of the parameters, followed by an iterative process. This is a non-linear regression problem.
A different method consists in transforming the non-linear regression into a linear regression by means of a convenient integral equation. Then there is no need for an initial guess and no need for an iterative process: the fit is obtained directly.
In the case of the function y = a + r*sin(w*x + phi) or y = a + b*sin(w*x) + c*cos(w*x), see pages 35-36 of the paper "Régression sinusoidale" published on Scribd.
In the case of the function y = a + p*x + r*sin(w*x + phi): pages 49-51 of the chapter "Mixed linear and sinusoidal regressions".
In the case of more complicated functions, the general process is explained in the chapter "Generalized sinusoidal regression", pages 54-61, followed by a numerical example y = r*sin(w*x + phi) + (b/x) + c*ln(x), pages 62-63.
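For readers without access to the paper, here is a condensed sketch of the idea for the form y = a + b*sin(w*x) + c*cos(w*x) (my own reading of the procedure, not code from the paper): since y'' = -w**2 * (y - a), integrating twice turns the problem into a linear regression of y against the double cumulative integral of y (plus a quadratic in x), whose leading coefficient gives w; a second linear regression with w fixed then gives a, b and c.

import numpy as np

def sine_fit_integral(x, y):
    """Sketch of the integral-equation linearization; assumes x is sorted
    and reasonably dense. No initial guess, no iteration."""
    # First and second cumulative integrals of y (trapezoid rule)
    S = np.concatenate(([0.], np.cumsum(0.5*(y[1:] + y[:-1])*np.diff(x))))
    SS = np.concatenate(([0.], np.cumsum(0.5*(S[1:] + S[:-1])*np.diff(x))))
    # y ~ A*SS + B*x**2 + C*x + D, where A = -w**2
    M = np.column_stack([SS, x**2, x, np.ones_like(x)])
    A, B, C, D = np.linalg.lstsq(M, y, rcond=None)[0]
    w = np.sqrt(-A)
    # With w fixed, the remaining regression is linear in a, b, c
    N = np.column_stack([np.ones_like(x), np.sin(w*x), np.cos(w*x)])
    a, b, c = np.linalg.lstsq(N, y, rcond=None)[0]
    return a, b, c, w

The paper refines this basic scheme and also covers the phase form y = a + r*sin(w*x + phi), with r = sqrt(b**2 + c**2) and phi = arctan2(c, b).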
All the above answers are based on curve fitting, and most use an iterative method. They all work very nicely, but I wanted to add a different approach using an FFT: transform the data, set all but the peak frequency to zero, and then do the inverse transform. Note that you probably want to remove the data mean (and detrend) before doing the FFT; you can then add those back in afterwards, as in the sketch after the code below.
import numpy as np
import pylab as plt
# fake data
N = 1000 # number of data points
t = np.linspace(0, 4*np.pi, N)
f = 1.05
data = 3.0*np.sin(f*t+0.001) + np.random.randn(N) # create artificial data with noise
# FFT: keep only the dominant frequency bin (and its conjugate)
mfft = np.fft.fft(data)
imax = np.argmax(np.absolute(mfft[:N//2]))   # positive-frequency peak
mask = np.zeros_like(mfft)
mask[imax] = 1.
mask[-imax] = 1.     # conjugate bin, so the inverse transform is real (and full amplitude)
mfft *= mask
fdata = np.fft.ifft(mfft).real

plt.plot(t, data, '.')
plt.plot(t, fdata, '.', label='FFT')
plt.legend()
plt.show()
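As noted above, it helps to remove the mean (and any trend) before the FFT; otherwise the zero-frequency bin can dominate the spectrum. A minimal sketch using scipy.signal.detrend (not part of the original snippet):

from scipy.signal import detrend

trend = data - detrend(data)    # the removed linear trend, including the mean
mfft2 = np.fft.fft(data - trend)
imax2 = np.argmax(np.absolute(mfft2[:N//2]))
mask2 = np.zeros_like(mfft2)
mask2[imax2] = 1.
mask2[-imax2] = 1.              # conjugate bin for a real-valued result
fdata2 = np.fft.ifft(mfft2 * mask2).real + trend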
I need to fit a tanh curve like this one:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model

def f(x, a1=0.00010, a2=0.00013, a3=0.00013,
      teta1=1, teta2=0.00555, teta3=0.00555,
      phi1=-50, phi2=600, phi3=-900,
      a=0.000000019, b=0):
    formule = (a1 * np.tanh(teta1 * (x + phi1))
               + a2 * np.tanh(teta2 * (x + phi2))
               + a3 * np.tanh(teta3 * (x + phi3))
               + a * x + b)
    return formule

# generate points used to plot
x_plot = np.linspace(-10000, 10000, 1000)

gmodel = Model(f)
result = gmodel.fit(f(x_plot), x=x_plot,
                    a1=1, a2=1, a3=1,
                    teta1=1, teta2=1, teta3=1,
                    phi1=0, phi2=0, phi3=0)

plt.plot(x_plot, f(x_plot), 'bo')
plt.plot(x_plot, result.best_fit, 'r-')
plt.show()
I tried to do something like that, but I got a poor result.
Is there another way to fit this curve? I don't know what I'm doing wrong.
Basically your fit is fine (although not very nice from the coding point of view). As always, non-linear fits rely strongly on the initial parameters; yours are just chosen badly. You could either work out how to determine them manually, or use a pre-made global optimizer such as differential_evolution from scipy.optimize. I am not using this package myself, but you can find an example here on SE.
I agree with the answers from mikuszefski and F. Win but would like to add another point.
Your model includes a line + 3 tanh functions. It's not entirely clear that the data support that many different tanh functions. If so (and echoing mikuszefski), you will need to tell the fit that these are not identical. Your example starts them off being identical, which will make it very difficult for the fit to find a good solution. Either way, it would probably be helpful to be able to easily test whether there really are 1, 2, 3, or more tanh functions (see the AIC sketch after the full example below).
You may also want to give not only initial values for your parameters, but also realistic boundaries on them so that the tanh functions are clearly separated and don't wander too far off from where they should be.
To clean up your code, and to better allow you to change the number of tanh functions used and place boundary constraints, I would suggest making individual models and adding them together, as with:
from lmfit import Model

def f_tanh(x, eta=1, phi=0):
    "tanh function"
    return np.tanh(eta * (x + phi))

def f_line(x, slope=0, intercept=0):
    "line function"
    return slope*x + intercept

# create model as line + 2 tanh functions
gmodel = Model(f_line) + Model(f_tanh, prefix='t1_') + Model(f_tanh, prefix='t2_')
Now you can easily create parameters, with
params = gmodel.make_params(slope=0.003, intercept=0.001,
                            t1_eta=0.021, t1_phi=-2000,
                            t2_eta=0.013, t2_phi=600)
With the fit parameters defined, you can place bounds with:
params['t1_eta'].min = 0
params['t2_eta'].min = 0
params['t1_phi'].min = -3000
params['t1_phi'].max = -1000
params['t2_phi'].min = 0
params['t2_phi'].max = 1000
I think all of these will help you better explore the data and the fits to it. Putting this all together, you might have:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model

def f_tanh(x, eta=1, phi=0):
    "tanh function"
    return np.tanh(eta * (x + phi))

def f_line(x, slope=0, intercept=0):
    "line function"
    return slope*x + intercept

# line + 2 tanh functions
gmodel = Model(f_line) + Model(f_tanh, prefix='t1_') + Model(f_tanh, prefix='t2_')

# generate "data"
x = np.linspace(-10000, 10000, 1000)
y = gmodel.eval(x=x, slope=0.0001,
                t1_eta=0.010, t1_phi=-2100,
                t2_eta=0.004, t2_phi=740)
y = y + np.random.normal(size=len(x), scale=0.02)

# make parameters with initial values
params = gmodel.make_params(slope=0.003, intercept=0.001,
                            t1_eta=0.021, t1_phi=-2000,
                            t2_eta=0.013, t2_phi=600)

# place realistic but generous constraints to keep the tanhs separate
params['t1_eta'].min = 0
params['t2_eta'].min = 0
params['t1_phi'].min = -3000
params['t1_phi'].max = -1000
params['t2_phi'].min = 0
params['t2_phi'].max = 1000

result = gmodel.fit(y, params, x=x)
print(result.fit_report())

plt.plot(x, y, 'bo')
plt.plot(x, result.best_fit, 'r-')
plt.show()
This will give a good fit and plot and find the expected values, within the noise level. Hope that helps get you pointed in the right direction.
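To act on the earlier point about testing whether the data really support 1, 2, or 3 tanh terms, you can fit the competing models and compare information criteria; lmfit's fit result exposes aic and bic attributes. A hedged sketch reusing x, y and the fit result from above:

# Sketch: compare a 1-tanh and a 2-tanh model via AIC (lower is better)
model1 = Model(f_line) + Model(f_tanh, prefix='t1_')
params1 = model1.make_params(slope=0.003, intercept=0.001,
                             t1_eta=0.021, t1_phi=-2000)
result1 = model1.fit(y, params1, x=x)
print('1 tanh: AIC = %.1f' % result1.aic)
print('2 tanh: AIC = %.1f' % result.aic)    # 'result' from the fit above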
Your function is a bit confusing, and you do not really have function values: you are basically fitting your function to itself. Ideally you would replace f(x_plot) in curve_fit() with real experimental data.
A good way to fit a function is using scipy.optimize.curve_fit
from scipy.optimize import curve_fit

popt, pcov = curve_fit(f, x_plot, f(x_plot),
                       p0=[0.00010, 0.00013, 0.00013, 1, 0.00555, 0.00555,
                           -50, 600, -900, 0.000000019, 0])
plt.plot(x_plot, f(x_plot, *popt))
The resulting fit looks like this
With real data:
test_X = np.array(
[-9.77073e+03, -9.29706e+03, -8.82339e+03, -8.34979e+03, -7.87614e+03, -7.40242e+03, -6.92874e+03, -6.45506e+03,
-5.98143e+03, -5.50771e+03, -5.03404e+03, -4.56012e+03, -4.08674e+03, -3.61304e+03, -3.13937e+03, -2.66578e+03,
-2.19210e+03, -1.71845e+03, -1.24478e+03, -9.78925e+02, -9.29077e+02, -8.79059e+02, -8.29082e+02, -7.79092e+02,
-7.29080e+02, -6.79084e+02, -6.29061e+02, -5.79078e+02, -5.29103e+02, -4.79089e+02, -4.29094e+02, -3.79071e+02,
-3.29074e+02, -2.79062e+02, -2.29079e+02, -1.92907e+02, -1.72931e+02, -1.52930e+02, -1.32937e+02, -1.12946e+02,
-9.29511e+01, -7.29438e+01, -5.29292e+01, -3.29304e+01, -1.29330e+01, 7.04455e+00, 2.70676e+01, 4.70634e+01,
6.70526e+01, 8.70340e+01, 1.07056e+02, 1.27037e+02, 1.47045e+02, 1.67033e+02, 1.87039e+02, 2.20765e+02,
2.70680e+02, 3.20699e+02, 3.70693e+02, 4.20692e+02, 4.70696e+02, 5.20704e+02, 5.70685e+02, 6.20710e+02,
6.70682e+02, 7.20705e+02, 7.70707e+02, 8.20704e+02, 8.70713e+02, 9.20691e+02, 9.70700e+02, 1.23926e+03,
1.73932e+03, 2.23932e+03, 2.73926e+03, 3.23924e+03, 3.73926e+03, 4.23952e+03, 4.73926e+03, 5.23930e+03,
5.71508e+03, 6.21417e+03, 6.71413e+03, 7.21412e+03, 7.71410e+03, 8.21405e+03, 8.71402e+03, 9.21423e+03])
test_Y = np.array(
[-3.17679e-04, -3.27541e-04, -3.51184e-04, -3.60672e-04, -3.75965e-04, -3.86888e-04, -4.03222e-04, -4.23262e-04,
-4.38526e-04, -4.51187e-04, -4.61081e-04, -4.67121e-04, -4.96690e-04, -4.94811e-04, -5.10110e-04, -5.18985e-04,
-5.11754e-04, -4.90964e-04, -4.36904e-04, -3.93638e-04, -3.83336e-04, -3.71110e-04, -3.57207e-04, -3.39643e-04,
-3.24155e-04, -2.97296e-04, -2.74653e-04, -2.43700e-04, -1.95574e-04, -1.60716e-04, -1.43363e-04, -1.33610e-04,
-1.30734e-04, -1.26332e-04, -1.26063e-04, -1.24228e-04, -1.23424e-04, -1.20276e-04, -1.16886e-04, -1.21865e-04,
-1.16605e-04, -1.14148e-04, -1.14728e-04, -1.14660e-04, -1.16927e-04, -1.10380e-04, -1.09836e-04, 4.24232e-05,
8.66095e-05, 8.43905e-05, 9.09867e-05, 8.95580e-05, 9.02585e-05, 8.87033e-05, 8.86536e-05, 8.92236e-05,
9.24438e-05, 9.27929e-05, 9.24961e-05, 9.72166e-05, 1.00432e-04, 1.05457e-04, 1.11278e-04, 1.14716e-04,
1.25818e-04, 1.40721e-04, 1.62968e-04, 1.91776e-04, 2.28125e-04, 2.57918e-04, 2.88941e-04, 3.85003e-04,
4.91916e-04, 5.32483e-04, 5.50929e-04, 5.45350e-04, 5.38903e-04, 5.27765e-04, 5.15592e-04, 4.95717e-04,
4.81722e-04, 4.69538e-04, 4.58643e-04, 4.41407e-04, 4.29820e-04, 4.07784e-04, 3.92236e-04, 3.81761e-04])
I tried this:
import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

def function(x, a1, a2, a3, teta1, teta2, teta3, phi1, phi2, phi3, a, b):
    formule = (a1 * numpy.tanh(teta1 * (x + phi1))
               + a2 * numpy.tanh(teta2 * (x + phi2))
               + a3 * numpy.tanh(teta3 * (x + phi3))
               + a * x + b)
    return formule
# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")   # do not print warnings raised by the genetic algorithm
    val = function(test_X, *parameterTuple)
    return numpy.sum((test_Y - val) ** 2.0)
def generate_Initial_Parameters():
    parameterBounds = []
    parameterBounds.append([1.4e-04, 1.4e-04])     # a1 (fixed)
    parameterBounds.append([2.0e-04, 2.0e-04])     # a2 (fixed)
    parameterBounds.append([2.5e-04, 2.5e-04])     # a3 (fixed)
    parameterBounds.append([0, 2.0e+01])           # teta1
    parameterBounds.append([0, 4.0e-03])           # teta2
    parameterBounds.append([0, 4.0e-03])           # teta3
    parameterBounds.append([-8.0e+01, 0])          # phi1
    parameterBounds.append([0, 9.0e+02])           # phi2
    parameterBounds.append([-2.1e+03, 0])          # phi3
    parameterBounds.append([-3.4e-08, -2.4e-08])   # a
    parameterBounds.append([-2.2e-05*2, 4.2e-05])  # b

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x
# generate initial parameter values
geneticParameters = generate_Initial_Parameters()
# curve fit the test data
fittedParameters, pcov = curve_fit(function, test_X, test_Y, geneticParameters)
print('Parameters', fittedParameters)
modelPredictions = function(test_X, *fittedParameters)
absError = modelPredictions - test_Y
SE = numpy.square(absError) # squared errors
MSE = numpy.mean(SE) # mean squared errors
RMSE = numpy.sqrt(MSE) # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(test_Y))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
# ytry = ftry(test_X)   # 'ftry' was never defined in the post, so this line is disabled

##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth / 100.0, graphHeight / 100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(test_X, test_Y, 'D')

    # create data for the fitted equation plot
    yModel = function(test_X, *fittedParameters)

    # now the model as a line plot
    axes.plot(test_X, yModel)

    axes.set_xlabel('X Data')   # X axis data label
    axes.set_ylabel('Y Data')   # Y axis data label

    plt.show()
    plt.close('all')   # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
R-squared: 0.9978, not perfect but not so bad
I'm very new to statistics. I'm wondering whether there is a way to find a goodness-of-fit score for non-linear functions (Gaussian, power, log-logistic). I used scipy's curve-fit function to fit my data to each model. I know that for linear regression you can compute R-squared, but I really have no idea how to measure how well my data fit those functions. In my reading, some people used pseudo-R-squared and some used chi-square. Here is my code for running the Gaussian fit and trying to score it.
import math
import numpy as np
from numpy import exp
from scipy.optimize import curve_fit
from scipy.stats import chi2

# X, Y -> both are arrays (df is the DataFrame holding the data)
x = np.array(df['r'].values)
yn = np.array(df['D'].values)

# Parameters for Gaussian
n = len(x)
mean = sum(x*yn)/n
sigma = math.sqrt(sum(yn*(x-mean)**2)/n)

def gaussian(x, a, x0, sigma):
    return (a/(sigma*math.sqrt(2*math.pi)))*exp(-(x-x0)**2/(2*sigma**2))

# Curve fit from Scipy
popt, pcov = curve_fit(gaussian, x, yn, p0=[1, mean, sigma])

# First option: goodness of fit from chi-square
p1, p2, p3 = popt
chisqr = sum((yn - gaussian(x, p1, p2, p3))**2/sigma**2)
dof = len(yn) - 2
GOF = 1 - chi2.cdf(chisqr, dof)

# Second option: R^2 = SS_reg/SS_tot
y_fit = [gaussian(xi, *popt) for xi in x]
y_bar = np.sum(yn)/len(yn)
ssreg = np.sum([(yihat - y_bar)**2 for yihat in y_fit])
sstot = np.sum([(yi - y_bar)**2 for yi in yn])
results = ssreg / sstot
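For what it's worth, the usual "pseudo" R-squared for a non-linear fit is computed from the residuals (1 - SS_res/SS_tot) rather than as SS_reg/SS_tot. A sketch (my addition, reusing x, yn, gaussian and popt from the code above):

# Residual-based pseudo R^2 and reduced chi-square for the Gaussian fit
residuals = yn - gaussian(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((yn - np.mean(yn))**2)
r_squared = 1.0 - ss_res/ss_tot
red_chisqr = ss_res / (len(yn) - len(popt))   # unit weights assumed
print('pseudo R^2 = %.4f, reduced chi-square = %.4g' % (r_squared, red_chisqr))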
Another question: are my models correct?
Gaussian:
(a/(sigma*sqrt(2*pi)))*exp(-(x-x0)**2/(2*sigma**2))
Power:
x*exp(-b)
Exponential:
exp(-b*x)
Logarithm:
exp(-x) / (1+exp(-x))**2
Is all this correct? I got a low GOF score even though the curve from curve_fit looks like a good fit. Thank you so much for your time, really appreciate it.
I'm attempting to estimate a decay rate using an exponential fit, but I'm puzzled by why the two methods don't give the same result.
In the first case, taking the log of the data to linearize the problem matches Excel's exponential trendline fit. I had expected that fitting the exponential directly would be the same.
import numpy as np
from scipy.optimize import curve_fit
def exp_func(x, a, b):
    return a * np.exp(-b * x)

def lin_func(x, m, b):
    return m*x + b
xdata = [1065.0, 1080.0, 1095.0, 1110.0, 1125.0, 1140.0, 1155.0, 1170.0, 1185.0, 1200.0, 1215.0, 1230.0, 1245.0, 1260.0, 1275.0, 1290.0, 1305.0, 1320.0, 1335.0, 1350.0, 1365.0, 1380.0, 1395.0, 1410.0, 1425.0, 1440.0, 1455.0, 1470.0, 1485.0, 1500.0]
ydata = [21.3934, 17.14985, 11.2703, 13.284, 12.28465, 12.46925, 12.6315, 12.1292, 10.32762, 8.509195, 14.5393, 12.02665, 10.9383, 11.23325, 6.03988, 9.34904, 8.08941, 6.847, 5.938535, 6.792715, 5.520765, 6.16601, 5.71889, 4.949725, 7.62808, 5.5079, 3.049625, 4.8566, 3.26551, 3.50161]
xdata = np.array(xdata)
xdata = xdata - xdata.min() + 1
ydata = np.array(ydata)
lydata = np.log(ydata)
lopt, lcov = curve_fit(lin_func, xdata, lydata)
elopt = [np.exp(lopt[1]),-lopt[0]]
eopt, ecov = curve_fit(exp_func, xdata, ydata, p0=elopt)
print('elopt: {},{}'.format(*elopt))
print('eopt: {},{}'.format(*eopt))
results:
elopt: 17.2526204283,0.00343624199064
eopt: 17.1516384575,0.00330590568338
You're solving two different optimization problems. curve_fit() assumes that the noise eps_i is additive (and roughly Gaussian); otherwise it won't deliver optimal results.
Assuming that you want to minimize Sum (y_i - f(x_i))**2 with
f(x) = a * exp(-b * x) + eps_i
where eps_i is the unknown error on the i-th data point, taking the logarithm gives
log(f(x)) = log(a * exp(-b*x) + eps_i) != log(a) - b*x + eps_i
so the direct exponential fit treats the noise as additive. Your linearized version instead assumes multiplicative noise mu_i, because
g(x) = a * mu_i * exp(-b*x)
results in
log(g(x)) = log(a) - b*x + log(mu_i)
In conclusion, you will only get identical results when the magnitude of the errors eps_i is very small.
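One way to see this in practice (my own sketch, reusing xdata, ydata, lydata and lin_func from the question): weighting the log-space fit with sigma_i = 1/y_i makes the linearized problem approximate the additive-noise one, since d(log y) is roughly dy/y, and the two parameter sets then agree much more closely.

# Sketch: weighted log-space fit that approximates the direct exponential fit
wlopt, wlcov = curve_fit(lin_func, xdata, lydata, sigma=1.0/ydata)
welopt = [np.exp(wlopt[1]), -wlopt[0]]
print('weighted elopt: {},{}'.format(*welopt))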