I have a question about fitting a step function using scipy routines like curve_fit. I have trouble making it vectorized, for example:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
xobs=np.linspace(0,10,100)
yl=np.random.rand(50); yr=np.random.rand(50)+100
yobs=np.concatenate((yl,yr),axis=0)
def model(x,rf,T1,T2):
#1: x=np.vectorize(x)
if x<rf:
ret= T1
else:
ret= T2
return ret
#2: model=np.vectorize(model)
popt, pcov = curve_fit(model, xobs, yobs, [40.,0.,100.])
It says
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
If I add #1 or #2 it runs but doesn't really fit the data:
OptimizeWarning: Covariance of the parameters could not be estimated category=OptimizeWarning)
[ 40. 50.51182064 50.51182064] [[ inf inf inf]
[ inf inf inf]
[ inf inf inf]]
Anybody know how to fix that? THX
Here's what I did. I retained xobs and yobs:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
xobs=np.linspace(0,10,100)
yl=np.random.rand(50); yr=np.random.rand(50)+100
yobs=np.concatenate((yl,yr),axis=0)
Now, Heaviside function must be generated. To give you an overview of this function, consider the half-maximum convention of Heaviside function:
In Python, this is equivalent to: def f(x): return 0.5 * (np.sign(x) + 1)
A sample plot would be:
xval = sorted(np.concatenate([np.linspace(-5,5,100),[0]])) # includes x = 0
yval = f(xval)
plt.plot(xval,yval,'ko-')
plt.ylim(-0.1,1.1)
plt.xlabel('x',size=18)
plt.ylabel('H(x)',size=20)
Now, plotting xobs and yobs gives:
plt.plot(xobs,yobs,'ko-')
plt.ylim(-10,110)
plt.xlabel('xobs',size=18)
plt.ylabel('yobs',size=20)
Notice that comparing the two figures, the second plot is shifted by 5 units and the maximum increases from 1.0 to 100. I infer that the function for the second plot can be represented as follows:
or in Python: (0.5 * (np.sign(x-5) + 1) * 100 = 50 * (np.sign(x-5) + 1)
Combining the plots yields (where Fit represents the above fitting function)
The plot confirms that my guess is correct. Now, assuming that YOU DO NOT KNOW how did this correct fitting function come about, a generalized fitting function is created: def f(x,a,b,c): return a * (np.sign(x-b) + c), where theoretically, a = 50, b = 5, and c = 1.
Proceed to estimation:
popt,pcov=curve_fit(f,xobs,yobs,bounds=([49,4.75,0],[50,5,2])).
Now, bounds = ([lower bound of each parameter (a,b,c)],[upper bound of each parameter]). Technically, this means that 49 < a < 50, 4.75 < b < 5, and 0 < c < 2.
Here are MY results for popt and pcov:
pcov represents the estimated covariance of popt. The diagonals provide the variance of the parameter estimate [Source].
Results show that the parameter estimates pcov are near the theoretical values.
Basically, a generalized Heaviside function can be represented by: a * (np.sign(x-b) + c)
Here is the code that will generate parameter estimates and the corresponding covariances:
import numpy as np
from scipy.optimize import curve_fit
xobs = np.linspace(0,10,100)
yl = np.random.rand(50); yr=np.random.rand(50)+100
yobs = np.concatenate((yl,yr),axis=0)
def f(x,a,b,c): return a * (np.sign(x-b) + c) # Heaviside fitting function
popt, pcov = curve_fit(f,xobs,yobs,bounds=([49,4.75,0],[50,5,2]))
print 'popt = %s' % popt
print 'pcov = \n %s' % pcov
Finally, note that the estimates of popt and pcov vary.
pythonscipy
This question is pretty old, but in case it can be useful to other people: The Heaviside function is not differentiable at the step, and this is causing issues in the minimization. In such cases, I fit a logistic function, as shown below.
Fitting a heaviside function always fails in my case.
x = np.linspace(0,10,101)
y = np.heaviside((x-5), 0.)
def sigmoid(x, x0,b):
return scipy.special.expit((x-x0)*b)
args, cov = optim.curve_fit(sigmoid, x, y)
plt.scatter(x,y)
plt.plot(x, sigmoid(x, *args))
print(args)
>
[ 5.05006427 532.21427701]
Related
I am trying to fit a progression of Gaussian peaks to a spectral lineshape.
The progression is a summation of N evenly spaced Gaussian peaks. When coded as a function, the formula for N=1 looks like this:
A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
where A, e0, hf, S and fwhm are to be determined from the fit with some good initial guesses.
Importantly, the parameter i starts at 0 and is incremented by 1 for every additional component.
So, for N = 3 the expression would take the form:
A * ((e0-0*hf)/e0)**3 * ((S**0)/np.math.factorial(0)) * np.exp(-4*np.log(2)*((x-e0+0*hf)/fwhm)**2) +
A * ((e0-1*hf)/e0)**3 * ((S**1)/np.math.factorial(1)) * np.exp(-4*np.log(2)*((x-e0+1*hf)/fwhm)**2) +
A * ((e0-2*hf)/e0)**3 * ((S**2)/np.math.factorial(2)) * np.exp(-4*np.log(2)*((x-e0+2*hf)/fwhm)**2)
All the parameters except i are constant for every component in the summation, and this is intended. i is changing in a controlled way depending on the number of parameters.
I am using curve_fit. One way to code the fitting routine would be to explicitly define the expression for any reasonable N and just use an appropriate one. Like, here it'would be 5 or 6, depending on the spacing, which is determined by hf. I could just define a long function with N components, writing an appropriate i value into each component. I understand how to do that (and did). But I would like to code this more intelligently. My goal is to write a function that will accept any value of N, add the appropriate amount of components as described above, compute the expression while incrementing the i properly and return the result.
I have attempted a variety of things. My main hurdle is that I don't know how to tell the program to use a particular N and the corresponding values of i. Finally, after some searching I thought I found a good way to code it with a lambda function.
from scipy.optimize import curve_fit
import numpy as np
def fullfunc(x,p,n):
def func(x,A,e0,hf,S,fwhm,i):
return A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)
y_fit = np.zeros_like(x)
for i in range(n):
y_fit += func(x,p[0],p[1],p[2],p[3],p[4],i)
return y_fit
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(lambda x,p: fullfunc(x,p,n), x, y, p)
A,e0,hf,S,fwhm = fittedParameters
This gives:
TypeError: <lambda>() takes 2 positional arguments but 7 were given
and I don't understand why. I have a feeling the lambda function can't deal with a list of initial parameters.
I would greatly appreciate any advice on how to make this work without explicitly writing all the equations out, as I find that a bit too rigid.
The x and y ranges provided are samples of real data which give a general idea of what the shape is.
Since you only use summation over a range i=0, 1, ..., n-1, there is no need to refer to complicated lambda constructs that may or may not work in the context of curve fit. Just define your fit function as the summation of n components:
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit
import numpy as np
def func(x, A, e0, hf, S, fwhm):
return sum((A * ((e0-i*hf)/e0)**3 * ((S**i)/np.math.factorial(i)) * np.exp(-4*np.log(2)*((x-e0+i*hf)/fwhm)**2)) for i in range(n))
p = [1,26000,1400,1,1000]
x = [27027,25062,23364,21881,20576,19417,18382,17452,16611,15847,15151]
y = [0.01,0.42,0.93,0.97,0.65,0.33,0.14,0.06,0.02,0.01,0.004]
n = 7
fittedParameters, pcov = curve_fit(func, x, y, p0=p)
#A,e0,hf,S,fwhm = fittedParameters
print(fittedParameters)
plt.plot(x, y, "ro", label="data")
x_fit = np.linspace(min(x), max(x), 100)
y_fit = func(x_fit, *fittedParameters)
plt.plot(x_fit, y_fit, label="fit")
plt.legend()
plt.show()
Sample output:
P.S.: By the look of it, these data points are already well fitted with n=1.
Consider the following MWE
import numpy as np
from scipy.optimize import curve_fit
X=np.arange(1,10,1)
Y=abs(X+np.random.randn(15,9))
def linear(x, a, b):
return (x/b)**a
coeffs=[]
for ix in range(Y.shape[0]):
print(ix)
c0, pcov = curve_fit(linear, X, Y[ix])
coeffs.append(c0)
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(linear, XX, Y.flatten())
I have a problem where I have to do basically that, but instead of 15 iterations it's thousands and it's pretty slow.
Is there any way to do all of those iterations at once with curve_fit? I know the result from the function is supposed to be a 1D-array, so just passing the args like this
c0, pcov = curve_fit(nlinear, X, Y)
is not going to work. Also I think the answer has to be in flattening Y, so I can get a flattened result, but I just can't get anything to work.
EDIT
I know that if I do something like
XX=np.tile(X, Y.shape[0])
c0, pcov = curve_fit(nlinear, XX, Y.flatten())
then I get a "mean" value of the coefficients, but that's not what I want.
EDIT 2
For the record, I solved with using Jacques Kvam's set-up but implemented using Numpy (because of a limitation)
lX = np.log(X)
lY = np.log(Y)
A = np.vstack([lX, np.ones(len(lX))]).T
m, c=np.linalg.lstsq(A, lY.T)[0]
And then m is a and to get b:
b=np.exp(-c/m)
Least squares won't give the same result because the noise is transformed by log in this case. If the noise is zero, both methods give the same result.
import numpy as np
from numpy import random as rng
from scipy.optimize import curve_fit
rng.seed(0)
X=np.arange(1,7)
Y = np.zeros((4, 6))
for i in range(4):
b = a = i + 1
Y[i] = (X/b)**a + 0.01 * randn(6)
def linear(x, a, b):
return (x/b)**a
coeffs=[]
for ix in range(Y.shape[0]):
print(ix)
c0, pcov = curve_fit(linear, X, Y[ix])
coeffs.append(c0)
coefs is
[array([ 0.99309127, 0.98742861]),
array([ 2.00197613, 2.00082722]),
array([ 2.99130237, 2.99390585]),
array([ 3.99644048, 3.9992937 ])]
I'll use scikit-learn's implementation of linear regression since I believe that scales well.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
Take logs of X and Y
lX = np.log(X)[None, :]
lY = np.log(Y)
Now fit and check that coeffiecients are the same as before.
lr.fit(lX.T, lY.T)
lr.coef_
Which gives similar exponent.
array([ 0.98613517, 1.98643974, 2.96602892, 4.01718514])
Now check the divisor.
np.exp(-lr.intercept_ / lr.coef_.ravel())
Which gives similar coefficient, you can see the methods diverging somewhat though in their answers.
array([ 0.99199406, 1.98234916, 2.90677142, 3.73416501])
It might be useful in some situations to have the best fit parameters as a numpy array for further calculations. One can add the following after the for loop:
bestfit_par = np.asarray(coeffs)
Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
# Curve fitting function
return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense that it uses numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(X_, y)[0]
pp = np.polyfit(x, y, 3)
print np.isclose(pp, p_all).all()
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(X_[:1, :], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety)
trendx = []
trendy = []
for a in range(datasetx[0], (datasetx[-1]+1)):
trendx.append(a)
trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the errorbars, but what I don't know is using this, how to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?
I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata
y_sigma = 0.2 * y * np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
Then something else, about using numpy arrays. One of the main advantages of using numpy is that you can avoid for loops because operations on arrays apply elementwise. So the for-loop in your example can also be done as following:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Where I use arange instead of range as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)
I have not been able to find any way of getting the errors in the coefficients that is built in to numpy or python. I have a simple tool that I wrote based on Section 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
'''
Purpose
-------
Perform least squares using the procedure outlined in 8.5 and 8.6 of Taylor, solving
matrix equation X a = Y
Examples
--------
>>> from scipy import matrix
>>> xMat = matrix([[ 1, 5, 25],
[ 1, 7, 49],
[ 1, 9, 81],
[ 1, 11, 121]])
>>> # matrix has rows of format [constant, x, x^2]
>>> yMat = matrix([[142],
[168],
[211],
[251]])
>>> a, varCoef, yRes = leastSquares(xMat, yMat)
>>> # a is a column matrix, holding the three coefficients a, b, c, corresponding to
>>> # the equation a + b*x + c*x^2
Returns
-------
a: matrix
best fit coefficients
varCoef: matrix
variance of derived coefficents
yRes: matrix
y-residuals of fit
'''
xMatSize = xMat.shape
numMeas = xMatSize[0]
numVars = xMatSize[1]
xxMat = xMat.T * xMat
xyMat = xMat.T * yMat
xxMatI = xxMat.I
aMat = xxMatI * xyMat
yAvgMat = xMat * aMat
yRes = yMat - yAvgMat
var = (yRes.T * yRes) / (numMeas - numVars)
varCoef = xxMatI.diagonal() * var[0, 0]
return aMat, varCoef, yRes
The code below is giving me a flat line for the line of best fit rather than a nice curve along the model of e^(-x) that would fit the data. Can anyone show me how to fix the code below so that it fits my data?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
def _eNegX_residuals(p,x,y):
return y - _eNegX_(p,x)
def Get_eNegX_Coefficients(x,y):
print 'x is: ',x
print 'y is: ',y
# Calculate p_guess for the vectors x,y. Note that p_guess is the
# starting estimate for the minimization.
p_guess=(np.median(x),np.min(y),np.max(y),.01)
# Calls the leastsq() function, which calls the residuals function with an initial
# guess for the parameters and with the x and y vectors. Note that the residuals
# function also calls the _eNegX_ function. This will return the parameters p that
# minimize the least squares error of the _eNegX_ function with respect to the original
# x and y coordinate vectors that are sent to it.
p, cov, infodict, mesg, ier = scipy.optimize.leastsq(
_eNegX_residuals,p_guess,args=(x,y),full_output=1,warning=True)
# Define the optimal values for each element of p that were returned by the leastsq() function.
x0,y0,c,k=p
print('''Reference data:\
x0 = {x0}
y0 = {y0}
c = {c}
k = {k}
'''.format(x0=x0,y0=y0,c=c,k=k))
print 'x.min() is: ',x.min()
print 'x.max() is: ',x.max()
# Create a numpy array of x-values
numPoints = np.floor((x.max()-x.min())*100)
xp = np.linspace(x.min(), x.max(), numPoints)
print 'numPoints is: ',numPoints
print 'xp is: ',xp
print 'p is: ',p
pxp=_eNegX_(p,xp)
print 'pxp is: ',pxp
# Plot the results
plt.plot(x, y, '>', xp, pxp, 'g-')
plt.xlabel('BPM%Rest')
plt.ylabel('LVET/BPM',rotation='vertical')
plt.xlim(0,3)
plt.ylim(0,4)
plt.grid(True)
plt.show()
return p
# Declare raw data for use in creating regression equation
x = np.array([1,1.425,1.736,2.178,2.518],dtype='float')
y = np.array([3.489,2.256,1.640,1.043,0.853],dtype='float')
p=Get_eNegX_Coefficients(x,y)
It looks like it's a problem with your initial guesses; something like (1, 1, 1, 1) works fine:
You have
p_guess=(np.median(x),np.min(y),np.max(y),.01)
for the function
def _eNegX_(p,x):
x0,y0,c,k=p
y = (c * np.exp(-k*(x-x0))) + y0
return y
So that's test_data_maxe^( -.01(x - test_data_median)) + test_data_min
I don't know much about the art of choosing good starting parameters, but I can say a few things. leastsq is finding a local minimum here - the key in choosing these values is to find the right mountain to climb, not to try to cut down on the work that the minimization algorithm has to do. Your initial guess looks like this (green):
(1.736, 0.85299999999999998, 3.4889999999999999, 0.01)
which results in your flat line (blue):
(-59.20295956, 1.8562 , 1.03477144, 0.69483784)
Greater gains were made in adjusting the height of the line than in increasing the k value. If you know you're fitting to this kind of data, use a larger k. If you don't know, I guess you could try to find a decent k value by sampling your data, or working back from the slope between an average of the first half and the second half, but I wouldn't know how to go about that.
Edit: You could also start with several guesses, run the minimization several times, and take the line with the lowest residuals.