Fitting tanh curves with Python

I need to fit a tanh curve like this one:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model

def f(x, a1=0.00010, a2=0.00013, a3=0.00013, teta1=1, teta2=0.00555, teta3=0.00555,
      phi1=-50, phi2=600, phi3=-900, a=0.000000019, b=0):
    formule = (a1 * np.tanh(teta1 * (x + phi1)) + a2 * np.tanh(teta2 * (x + phi2))
               + a3 * np.tanh(teta3 * (x + phi3)) + a * x + b)
    return formule

# generate points used to plot
x_plot = np.linspace(-10000, 10000, 1000)

gmodel = Model(f)
result = gmodel.fit(f(x_plot), x=x_plot, a1=1, a2=1, a3=1,
                    teta1=1, teta2=1, teta3=1, phi1=0, phi2=0, phi3=0)

plt.plot(x_plot, f(x_plot), 'bo')
plt.plot(x_plot, result.best_fit, 'r-')
plt.show()
I tried to do something like that, but I got this result:
Is there another way to fit this curve? What am I doing wrong?

Basically your fit is fine (although not very nice from a coding point of view). As always, non-linear fits rely strongly on initial parameters; yours are just chosen badly. You could either think about how to determine them manually, or use a pre-made package like differential_evolution from scipy.optimize. I am not using this package myself, but you can find an example here on SE.
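For illustration, here is a minimal sketch of that idea (not from the linked example; the toy model and bounds are assumptions chosen for the demonstration): differential_evolution searches within bounds for rough starting values, which curve_fit then polishes.
import numpy as np
from scipy.optimize import curve_fit, differential_evolution

# toy model and noisy data, assumed for the example
def model(x, amp, eta, phi):
    return amp * np.tanh(eta * (x + phi))

x = np.linspace(-100, 100, 500)
y = model(x, 2.0, 0.05, 10.0) + np.random.normal(scale=0.1, size=x.size)

# sum of squared residuals, the quantity the genetic algorithm minimizes
def sse(params):
    return np.sum((y - model(x, *params)) ** 2)

# generous but finite bounds for each parameter (assumed)
bounds = [(0, 10), (0, 1), (-100, 100)]
seed_result = differential_evolution(sse, bounds, seed=3)

# polish the evolutionary estimate with a local least-squares fit
popt, pcov = curve_fit(model, x, y, p0=seed_result.x)
print(popt)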

I agree with the answers from mikuszefski and F. Win, but would like to add another point.
Your model includes a line plus three tanh functions. It's not entirely clear that the data support that many different tanh functions. If they do (and echoing mikuszefski), you will need to tell the fit that these components are not identical. Your example starts them all off identical, which makes it very difficult for the fit to find a good solution. Either way, it would probably be helpful to be able to easily test whether there really are 1, 2, 3, or more tanh functions (a sketch of such a comparison is given at the end of this answer).
You may also want to give not only initial values for your parameters, but also realistic bounds on them, so that the tanh functions are clearly separated and don't wander too far from where they should be.
To clean up your code, and to make it easier to change the number of tanh functions used and to place boundary constraints, I would suggest making individual models and adding them together, as with:
import numpy as np
from lmfit import Model

def f_tanh(x, eta=1, phi=0):
    "tanh function"
    return np.tanh(eta * (x + phi))

def f_line(x, slope=0, intercept=0):
    "line function"
    return slope*x + intercept

# create model as line + 2 tanh functions
gmodel = Model(f_line) + Model(f_tanh, prefix='t1_') + Model(f_tanh, prefix='t2_')
Now you can easily create parameters, with
params = gmodel.make_params(slope=0.003, intercept=0.001,
                            t1_eta=0.021, t1_phi=-2000,
                            t2_eta=0.013, t2_phi=600)
With the fit parameters defined, you can place bounds with:
params['t1_eta'].min = 0
params['t2_eta'].min = 0
params['t1_phi'].min = -3000
params['t1_phi'].max = -1000
params['t2_phi'].min = 0
params['t2_phi'].max = 1000
I think all of these will help you better explore the data and the fits to it. Putting this all together, you might have:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model

def f_tanh(x, eta=1, phi=0):
    "tanh function"
    return np.tanh(eta * (x + phi))

def f_line(x, slope=0, intercept=0):
    "line function"
    return slope*x + intercept

# line + 2 tanh functions
gmodel = Model(f_line) + Model(f_tanh, prefix='t1_') + Model(f_tanh, prefix='t2_')

# generate "data"
x = np.linspace(-10000, 10000, 1000)
y = gmodel.eval(x=x, slope=0.0001,
                t1_eta=0.010, t1_phi=-2100,
                t2_eta=0.004, t2_phi=740)
y = y + np.random.normal(size=len(x), scale=0.02)

# make parameters with initial values
params = gmodel.make_params(slope=0.003, intercept=0.001,
                            t1_eta=0.021, t1_phi=-2000,
                            t2_eta=0.013, t2_phi=600)

# place realistic but generous constraints to keep the tanhs separate
params['t1_eta'].min = 0
params['t2_eta'].min = 0
params['t1_phi'].min = -3000
params['t1_phi'].max = -1000
params['t2_phi'].min = 0
params['t2_phi'].max = 1000

result = gmodel.fit(y, params, x=x)
print(result.fit_report())

plt.plot(x, y, 'bo')
plt.plot(x, result.best_fit, 'r-')
plt.show()
This will give a good fit and plot, and will recover the expected values within the noise level.
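For completeness, here is a minimal sketch (an addition, reusing f_line, f_tanh, x, and y from the script above) of how you might compare models with 1, 2, or 3 tanh components using the AIC that lmfit reports; the component count with the lowest AIC is preferred:
# compare 1, 2, and 3 tanh components by AIC (lower is better)
for n in (1, 2, 3):
    model_n = Model(f_line)
    for i in range(n):
        model_n += Model(f_tanh, prefix='t%d_' % (i + 1))
    pars = model_n.make_params()  # defaults; in real use, set initial values as above
    res = model_n.fit(y, pars, x=x)
    print(n, 'tanh component(s): AIC =', res.aic)
Hope that helps get you pointed in the right direction.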

Your function is a bit confusing, and you do not really have function values. You basically want to fit your function to itself; ideally you would replace f(x_plot) in curve_fit() with real experimental data.
A good way to fit a function is to use scipy.optimize.curve_fit:
from scipy.optimize import curve_fit

popt, pcov = curve_fit(f, x_plot, f(x_plot),
                       p0=[0.00010, 0.00013, 0.00013, 1, 0.00555, 0.00555,
                           -50, 600, -900, 0.000000019, 0])
plt.plot(x_plot, f(x_plot, *popt))
The resulting fit looks like this

With real data:
test_X = np.array(
[-9.77073e+03, -9.29706e+03, -8.82339e+03, -8.34979e+03, -7.87614e+03, -7.40242e+03, -6.92874e+03, -6.45506e+03,
-5.98143e+03, -5.50771e+03, -5.03404e+03, -4.56012e+03, -4.08674e+03, -3.61304e+03, -3.13937e+03, -2.66578e+03,
-2.19210e+03, -1.71845e+03, -1.24478e+03, -9.78925e+02, -9.29077e+02, -8.79059e+02, -8.29082e+02, -7.79092e+02,
-7.29080e+02, -6.79084e+02, -6.29061e+02, -5.79078e+02, -5.29103e+02, -4.79089e+02, -4.29094e+02, -3.79071e+02,
-3.29074e+02, -2.79062e+02, -2.29079e+02, -1.92907e+02, -1.72931e+02, -1.52930e+02, -1.32937e+02, -1.12946e+02,
-9.29511e+01, -7.29438e+01, -5.29292e+01, -3.29304e+01, -1.29330e+01, 7.04455e+00, 2.70676e+01, 4.70634e+01,
6.70526e+01, 8.70340e+01, 1.07056e+02, 1.27037e+02, 1.47045e+02, 1.67033e+02, 1.87039e+02, 2.20765e+02,
2.70680e+02, 3.20699e+02, 3.70693e+02, 4.20692e+02, 4.70696e+02, 5.20704e+02, 5.70685e+02, 6.20710e+02,
6.70682e+02, 7.20705e+02, 7.70707e+02, 8.20704e+02, 8.70713e+02, 9.20691e+02, 9.70700e+02, 1.23926e+03,
1.73932e+03, 2.23932e+03, 2.73926e+03, 3.23924e+03, 3.73926e+03, 4.23952e+03, 4.73926e+03, 5.23930e+03,
5.71508e+03, 6.21417e+03, 6.71413e+03, 7.21412e+03, 7.71410e+03, 8.21405e+03, 8.71402e+03, 9.21423e+03])
test_Y = np.array(
[-3.17679e-04, -3.27541e-04, -3.51184e-04, -3.60672e-04, -3.75965e-04, -3.86888e-04, -4.03222e-04, -4.23262e-04,
-4.38526e-04, -4.51187e-04, -4.61081e-04, -4.67121e-04, -4.96690e-04, -4.94811e-04, -5.10110e-04, -5.18985e-04,
-5.11754e-04, -4.90964e-04, -4.36904e-04, -3.93638e-04, -3.83336e-04, -3.71110e-04, -3.57207e-04, -3.39643e-04,
-3.24155e-04, -2.97296e-04, -2.74653e-04, -2.43700e-04, -1.95574e-04, -1.60716e-04, -1.43363e-04, -1.33610e-04,
-1.30734e-04, -1.26332e-04, -1.26063e-04, -1.24228e-04, -1.23424e-04, -1.20276e-04, -1.16886e-04, -1.21865e-04,
-1.16605e-04, -1.14148e-04, -1.14728e-04, -1.14660e-04, -1.16927e-04, -1.10380e-04, -1.09836e-04, 4.24232e-05,
8.66095e-05, 8.43905e-05, 9.09867e-05, 8.95580e-05, 9.02585e-05, 8.87033e-05, 8.86536e-05, 8.92236e-05,
9.24438e-05, 9.27929e-05, 9.24961e-05, 9.72166e-05, 1.00432e-04, 1.05457e-04, 1.11278e-04, 1.14716e-04,
1.25818e-04, 1.40721e-04, 1.62968e-04, 1.91776e-04, 2.28125e-04, 2.57918e-04, 2.88941e-04, 3.85003e-04,
4.91916e-04, 5.32483e-04, 5.50929e-04, 5.45350e-04, 5.38903e-04, 5.27765e-04, 5.15592e-04, 4.95717e-04,
4.81722e-04, 4.69538e-04, 4.58643e-04, 4.41407e-04, 4.29820e-04, 4.07784e-04, 3.92236e-04, 3.81761e-04])
I tried this:
import numpy
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.optimize import differential_evolution
import warnings

def function(x, a1, a2, a3, teta1, teta2, teta3, phi1, phi2, phi3, a, b):
    import numpy as np
    formule = (a1 * np.tanh(teta1 * (x + phi1)) + a2 * np.tanh(teta2 * (x + phi2))
               + a3 * np.tanh(teta3 * (x + phi3)) + a * x + b)
    return formule

# function for genetic algorithm to minimize (sum of squared error)
def sumOfSquaredError(parameterTuple):
    warnings.filterwarnings("ignore")  # do not print warnings by genetic algorithm
    val = function(test_X, *parameterTuple)
    return numpy.sum((test_Y - val) ** 2.0)

def generate_Initial_Parameters():
    parameterBounds = []
    parameterBounds.append([1.4e-04, 1.4e-04])
    parameterBounds.append([2.0e-04, 2.0e-04])
    parameterBounds.append([2.5e-04, 2.5e-04])
    parameterBounds.append([0, 2.0e+01])
    parameterBounds.append([0, 4.0e-03])
    parameterBounds.append([0, 4.0e-03])
    parameterBounds.append([-8.0e+01, 0])
    parameterBounds.append([0, 9.0e+02])
    parameterBounds.append([-2.1e+03, 0])
    parameterBounds.append([-3.4e-08, -2.4e-08])
    parameterBounds.append([-2.2e-05*2, 4.2e-05])

    # "seed" the numpy random number generator for repeatable results
    result = differential_evolution(sumOfSquaredError, parameterBounds, seed=3)
    return result.x

# generate initial parameter values
geneticParameters = generate_Initial_Parameters()

# curve fit the test data
fittedParameters, pcov = curve_fit(function, test_X, test_Y, geneticParameters)
print('Parameters', fittedParameters)

modelPredictions = function(test_X, *fittedParameters)
absError = modelPredictions - test_Y

SE = numpy.square(absError)  # squared errors
MSE = numpy.mean(SE)  # mean squared errors
RMSE = numpy.sqrt(MSE)  # Root Mean Squared Error, RMSE
Rsquared = 1.0 - (numpy.var(absError) / numpy.var(test_Y))
print('RMSE:', RMSE)
print('R-squared:', Rsquared)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
    f = plt.figure(figsize=(graphWidth / 100.0, graphHeight / 100.0), dpi=100)
    axes = f.add_subplot(111)

    # first the raw data as a scatter plot
    axes.plot(test_X, test_Y, 'D')

    # create data for the fitted equation plot
    yModel = function(test_X, *fittedParameters)

    # now the model as a line plot
    axes.plot(test_X, yModel)

    axes.set_xlabel('X Data')  # X axis data label
    axes.set_ylabel('Y Data')  # Y axis data label

    plt.show()
    plt.close('all')  # clean up after using pyplot

graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)
R-squared: 0.9978. Not perfect, but not so bad.

Related

How to fit this data in python and scipy?

I have some function which behaves as shown below, i.e. some tapered/decaying oscillations.
I want to fit the data using scipy's curve_fit. I have previously asked a question related to fitting functions with scipy, which was well answered here, and highlighted the importance of the initial guess for the values of the fitting parameters.
However, I am struggling to fit this data in a way which captures both the oscillations and the decay. My approach is as follows:
from scipy.optimize import curve_fit
import numpy as np

def Fit(x, y):
    # Define the function fit
    func = ansatz
    # Define the initial guess of parameters
    mag = (y.max() + y.min()) / 2
    y_shifted = y - mag
    omega_guess = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (x.max() - x.min())
    lam = np.log(2) / 1e7  # Rough guess based on approximate half life
    p0 = (mag, mag, omega_guess, mag, lam)
    # Do the fit
    popt, pcov = curve_fit(func, x, y, p0=p0)
    # return
    return func(x, *popt)

def ansatz(x, A, B, omega, offset, lam):
    osc = A*np.sin(omega*x) + B*np.cos(omega*x)
    linear = offset
    decay = np.exp(-x*lam)
    return decay*osc + linear

data = np.load('example.npy')
x = data[:, 0]
y = data[:, 1]
yFit = Fit(x, y)
This approach captures the decay, but not the oscillations. What is wrong with my approach? The guesses for the fit parameters? The function ansatz? The code implementation?

Linear regression ODR fails

Following the recommendations in this answer, I have used several combinations of values for beta0 and, as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of relative scales of values of X versus Y (X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

for num in range(1, 5):
    plt.subplot(2, 2, num)
    plt.title('X range is %.1f times Y' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)

plt.show()
It seems that no combination of beta0 helps... The only way to get polyfit and the ODR fit to be similar is to swap X and Y, OR, as shown here, to increase the range of values of X with regard to Y. Still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and that it is not a problem of the data.
=== SOLUTION ===
Thanks to @norok2's answer, here is the result when the Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt

seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])

plt.figure(figsize=(12, 12))
for num in range(1, 10):
    plt.subplot(3, 3, num)
    plt.title('Y range is %.1f times X' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0]*X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y,
                          sy=min(1/np.var(Y), 1/np.var(X)))  # here the trick!! :)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a*xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0]*xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)

plt.show()
The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit works under the assumption that the error on x is negligible. If this assumption is violated, as it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change your numerical conditioning. Changing:
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
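To see the conditioning effect in isolation, here is a minimal sketch (an illustration added here, not part of the answer above) comparing an unweighted ODR fit against one with a rescaled sy, on deliberately badly scaled data:
import numpy as np
from scipy import odr

rng = np.random.default_rng(1)
x = rng.random(1000) / 1000                  # tiny x range
y = np.arange(1000) + rng.random(1000)**2    # huge y range

def f(B, x):
    return B[0]*x + B[1]

model = odr.Model(f)
beta0 = np.polyfit(x, y, 1)

# unweighted: sx = sy = 1, so the mismatch in scales dominates the fit
out_raw = odr.ODR(odr.RealData(x, y), model, beta0=beta0).run()

# rescaled: shrinking sy reconditions the problem
out_fix = odr.ODR(odr.RealData(x, y, sy=1/np.var(y)), model, beta0=beta0).run()

print('polyfit slope:     ', beta0[0])
print('ODR slope (raw):   ', out_raw.beta[0])
print('ODR slope (sy fix):', out_fix.beta[0])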
I cannot format source code in a comment, so I place it here. This code uses ODR to calculate fit statistics; note the line commented "parameter order for odr", where I use a wrapper function for the ODR call to my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats

x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])

def f(x, b0, b1):
    return b0 + (b1 * x)

def f_wrapper_for_odr(beta, x):  # parameter order for odr
    return f(x, *beta)

parameters, cov = curve_fit(f, x, y)

model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x, y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()

df_e = len(x) - len(parameters)  # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta  # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta  # parameter variances

ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
for i in range(len(parameters)):
    ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i],
               parameters[i] + t_df * parameterStatistics.sd_beta[i]])

tstat_beta = parameters / parameterStatistics.sd_beta  # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0  # coef. p-values

for i in range(len(parameters)):
    print('parameter:', parameters[i])
    print('   conf interval:', ci[i][0], ci[i][1])
    print('   tstat:', tstat_beta[i])
    print('   pstat:', pstat_beta[i])
    print()

SciPy Curve Fit Fails Power Law

So, I'm trying to fit a set of data with a power law of the following kind:
import numpy as np
import scipy.optimize

def f(x, N, a):  # Power law fit
    if a > 0:
        return N * x**(-a)
    else:
        return 10.**300

par, cov = scipy.optimize.curve_fit(f, data, time, np.array([10**(-7), 1.2]))
where the else condition just forces a to be positive. Using scipy.optimize.curve_fit yields an awful fit (green line), returning values of 1.2e+04 and 1.9e-07 for N and a, respectively, with absolutely no intersection with the data. From fits I've done manually, the values should land around 1e-07 and 1.2 for N and a, respectively, though putting those into curve_fit as initial parameters doesn't change the result. Removing the condition for a to be positive results in a worse fit, as it chooses a negative a, which leads to a fit with a slope of the wrong sign.
I can't figure out how to get a believable, let alone reliable, fit out of this routine, but I can't find any other good Python curve fitting routines. Do I need to write my own least-squares algorithm or is there something I'm doing wrong here?
UPDATE
In the original post, I showed a solution that uses lmfit, which allows you to assign bounds to your parameters. Starting with version 0.17, scipy also allows you to assign bounds to your parameters directly (see the documentation). Please find this solution below, after the EDIT; it can hopefully serve as a minimal example of how to use scipy's curve_fit with parameter bounds.
Original post
As suggested by @Warren Weckesser, you could use lmfit to get this task done; it allows you to assign bounds to your parameters and avoids the 'ugly' if-clause.
Since you do not provide any data, I created some, which are shown here:
They follow the law f(x) = 10.5 * x ** (-0.08)
I fit them, as suggested by @roadrunner66, by transforming the power law into a linear function:
y = N * x ** a
ln(y) = ln(N * x ** a)
ln(y) = a * ln(x) + ln(N)
So I first use np.log on the original data and then do the fit. When I now use lmfit, I get the following output:
[[Variables]]
lN: 2.35450302 +/- 0.019531 (0.83%) (init= 1.704748)
a: -0.08035342 +/- 0.005158 (6.42%) (init=-0.5)
So a is pretty close to the original value and np.exp(2.35450302) gives 10.53 which is also very close to the original value.
The plot then looks as follows; as you can see the fit describes the data very well:
Here is the entire code with a couple of inline comments:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, report_fit

# generate some data with noise
xData = np.linspace(0.01, 100., 50)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))
plt.plot(xData, yData, 'bo')
plt.show()

# transform data so that we can use a linear fit
lx = np.log(xData)
ly = np.log(yData)
plt.plot(lx, ly, 'bo')
plt.show()

def decay(params, x, data):
    lN = params['lN'].value
    a = params['a'].value

    # our linear model
    model = a * x + lN
    return model - data  # that's what you want to minimize

# create a set of Parameters
params = Parameters()
params.add('lN', value=np.log(5.5), min=0.01, max=100)  # value is the initial value
params.add('a', value=-0.5, min=-1, max=-0.001)  # min, max define parameter bounds

# do the fit, here with the leastsq algorithm
result = minimize(decay, params, args=(lx, ly))

# write error report
report_fit(result.params)

# plot the data and the fit; xnew starts above 0 because x**a diverges at 0
xnew = np.linspace(0.01, 100., 5000)
plt.plot(xData, yData, 'bo')
plt.plot(xnew, np.exp(result.params['lN'].value) * xnew ** (result.params['a'].value), 'r')
plt.show()
EDIT
Assuming that you have scipy 0.17 installed, you can also do the following using curve_fit. I show it for your original definition of the power law (red line in the plot below) as well as for the logarithmic data (black line in the plot below). The data are generated in the same way as above. The plot then looks as follows:
As you can see, the data are described very well. If you print popt and popt_log, you obtain array([ 10.47463426, 0.07914812]) and array([ 2.35158653, -0.08045776]), respectively (note: for the latter you will have to take the exponential of the first argument: np.exp(popt_log[0]) = 10.502, which is close to the original value).
Here is the entire code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# generate some data with noise
xData = np.linspace(0.01, 100., 50)
aOrg = 0.08
Norg = 10.5
yData = Norg * xData ** (-aOrg) + np.random.normal(0, 0.5, len(xData))

# get logarithmic data
lx = np.log(xData)
ly = np.log(yData)

def f(x, N, a):
    return N * x ** (-a)

def f_log(x, lN, a):
    return a * x + lN

# optimize using the appropriate bounds
popt, pcov = curve_fit(f, xData, yData, bounds=(0, [30., 20.]))
popt_log, pcov_log = curve_fit(f_log, lx, ly, bounds=([0, -10], [30., 20.]))

xnew = np.linspace(0.01, 100., 5000)

# plot the data
plt.plot(xData, yData, 'bo')
plt.plot(xnew, f(xnew, *popt), 'r')
plt.plot(xnew, f(xnew, np.exp(popt_log[0]), -popt_log[1]), 'k')
plt.show()

Python gaussian fit on simulated gaussian noisy data

I need to interpolate data coming from an instrument using a gaussian fit. To this end I thought about using the curve_fit function from scipy.
Since I'd like to test this functionality on fake data before trying it on the instrument, I wrote the following code to generate noisy gaussian data and to fit it:
from scipy.optimize import curve_fit
import numpy
import pylab

# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val

# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06

x = numpy.arange(zMinEntry,
                 zMaxEntry,
                 zStepEntry,
                 dtype=numpy.float64)

n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))

# Fit
popt, pcov = curve_fit(gaussian, x, y)

# Print results
print("Scale = %.3f +/- %.3f" % (popt[0], numpy.sqrt(pcov[0, 0])))
print("Offset = %.3f +/- %.3f" % (popt[1], numpy.sqrt(pcov[1, 1])))
print("Sigma = %.3f +/- %.3f" % (popt[2], numpy.sqrt(pcov[2, 2])))

pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]))
pylab.grid(True)
pylab.show()
Unfortunately this does not work properly; the output of the code is the following:
Scale = 6174.816 +/- 7114424813.672
Offset = 429.319 +/- 3919751917.830
Sigma = 1602.869 +/- 17923909301.176
And the plotted result is (blue is the fit function, red dots are the noisy input data):
I also tried to look at this answer, but couldn't figure out where my problem is.
Am I missing something here? Or am I using the curve_fit function in the wrong way? Thanks in advance!
I agree with Olaf insofar as it is a question of scale. The optimal parameters differ by many orders of magnitude. However, scaling the parameters with which you generated your toy data does not seem to solve the problem for your actual application. curve_fit uses leastsq, which numerically approximates the Jacobian, and this is where numerical problems arise because of the differences in scale (try using the full_output keyword in curve_fit).
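For reference, a minimal sketch of that full_output suggestion, assuming SciPy >= 1.9 (where curve_fit accepts the keyword directly) and reusing gaussian, x, and y from the question:
popt, pcov, infodict, errmsg, ier = curve_fit(gaussian, x, y, full_output=True)
print('function evaluations:', infodict['nfev'])
print('first few residuals:', infodict['fvec'][:5])
print('solver message:', errmsg, '| status code:', ier)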
In my experience it is often best to use fmin, which does not rely on approximated derivatives but uses only function values. You then have to write your own least-squares function to be optimized.
Starting values are still important. In your case you can make sufficiently good guesses by taking the maximum amplitude for a and the corresponding x-value for b and c.
In code, it looks like this:
from scipy.optimize import curve_fit, fmin
import numpy
import pylab

# Create a gaussian function
def gaussian(x, a, b, c):
    val = a * numpy.exp(-(x - b)**2 / (2*c**2))
    return val

# Generate fake data.
zMinEntry = 80.0*1E-06
zMaxEntry = 180.0*1E-06
zStepEntry = 0.2*1E-06

x = numpy.arange(zMinEntry,
                 zMaxEntry,
                 zStepEntry,
                 dtype=numpy.float64)

n = len(x)
meanY = zMinEntry + (zMaxEntry - zMinEntry)/2
sigmaY = 10.0E-06
a = 1.0/(sigmaY*numpy.sqrt(2*numpy.pi))
y = gaussian(x, a, meanY, sigmaY) + a*0.1*numpy.random.normal(0, 1, size=len(x))
print(a, meanY, sigmaY)

# estimate starting values from the data
a = y.max()
b = x[numpy.argmax(y)]
c = b

# define a least squares function to optimize
def minfunc(params):
    return sum((y - gaussian(x, params[0], params[1], params[2]))**2)

# fit
popt = fmin(minfunc, [a, b, c])

# Print results
print("Scale = %.3f" % (popt[0]))
print("Offset = %.3f" % (popt[1]))
print("Sigma = %.3f" % (popt[2]))

pylab.plot(x, y, 'ro')
pylab.plot(x, gaussian(x, popt[0], popt[1], popt[2]), lw=2)
pylab.xlim(x.min(), x.max())
pylab.grid(True)
pylab.show()
Looks like some numerical instabilities are creeping into the optimizer. Try scaling the data. With the following data:
zMinEntry = 80.0*1E-06 * 1000
zMaxEntry = 180.0*1E-06 * 1000
zStepEntry = 0.2*1E-06 * 1000
sigmaY = 10.0E-06 * 1000
I get estimates of
Scale = 39.697 +/- 0.526
Offset = 0.130 +/- 0.000
Sigma = -0.010 +/- 0.000
Compare that to the true values:
Scale = 39.894228
Offset = 0.13
Sigma = 0.01
The minus sign of sigma can of course be ignored.
This gives the following plot
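If you go this route, remember to convert the fitted parameters back to the original units afterwards. A minimal sketch of that bookkeeping (an addition, using the factor of 1000 from the scaling above):
scale_factor = 1000.0  # x was multiplied by this before fitting

# popt comes from curve_fit on the scaled data
a_fit = popt[0]                  # amplitude is unaffected by scaling x
b_fit = popt[1] / scale_factor   # offset back in the original units
c_fit = popt[2] / scale_factor   # sigma back in the original units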
As I said in a comment, if you provide a reasonable initial guess, the fit succeeds; i.e., call curve_fit like this:
popt, pcov = curve_fit(gaussian, x, y, p0=[50000, 0.00012, 0.00002])

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
coefs = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x  # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
This does not answer the question in the sense that it does not use numpy's polyfit to force the curve through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking "that's nice, but you still haven't answered the question". From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0), you will bias your estimates of the output solution coefficients (p), because lstsq will be trying to compensate for the fact that there is an offset in your data. Sort of a 'square peg in a round hole' problem (a quick demonstration of this bias is sketched at the end of this answer).
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(XX[:, :1], y)[0]  # keep only the cubic column
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
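To make the warning above concrete, here is a minimal sketch (an illustration added here, reusing the XX construction from above) of the bias you get when the data have a real offset but the ones column is dropped:
import numpy as np

x = np.arange(100)
# data with a genuine constant offset of 500
y = 0.07 * x**3 + 0.3 * x**2 + 1.1 * x + 500 + 10 * np.random.randn(x.shape[0])

XX = np.vstack((x**3, x**2, x, np.ones_like(x))).T

p_full = np.linalg.lstsq(XX, y, rcond=None)[0]               # offset column included
p_no_offset = np.linalg.lstsq(XX[:, :-1], y, rcond=None)[0]  # offset column dropped

print('with offset column:   ', p_full)        # close to [0.07, 0.3, 1.1, 500]
print('without offset column:', p_no_offset)   # lower-order terms absorb the offset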
Best of luck!
