I'm trying to fit some data from a simulation code I've been running in order to figure out a power law dependence. When I plot a linear fit, the data does not fit very well.
Here's the python script I'm using to fit the data:
#!/usr/bin/env python
from scipy import optimize
import numpy
xdata=[ 0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222]
ydata=[ 29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312]
fitfunc = lambda p, x: p[0] + p[1] * x ** (p[2])
errfunc = lambda p, x, y: (y - fitfunc(p, x))
out,success = optimize.leastsq(errfunc, [1,-1,-0.5],args=(xdata, ydata),maxfev=3000)
print "%g + %g*x^%g"%(out[0],out[1],out[2])
the output I get is:
-71205.3 + 71174.5*x^-9.79038e-05
While on the plot the fit looks about as good as you'd expect from a leastsquares fit, the form of the output bothers me. I was hoping the constant would be close to where you'd expect the zero to be (around 30). And I was expecting to find a power dependence of a larger fraction than 10^-5.
I've tried rescaling my data and playing with the parameters to optimize.leastsq with no luck. Is what I'm trying to accomplish possible or does my data just not allow it? The calculation is expensive, so getting more data points is non-trivial.
Thanks!
It is much better to first take the logarithm, then use leastsquare to fit to this linear equation, which will give you a much better fit. There is a great example in the scipy cookbook, which I've adapted below to fit your code.
The best fits like this are: amplitude = 0.8955, and index = -0.40943265484
As we can see from the graph (and your data), if its a power law fit we would not expect the amplitude value to be near 30. As in the power law equation f(x) == Amp * x ** index, so with a negative index: f(1) == Amp and f(0) == infinity.
from pylab import *
from scipy import *
from scipy import optimize
xdata=[ 0.00010851, 0.00021701, 0.00043403, 0.00086806, 0.00173611, 0.00347222]
ydata=[ 29.56241016, 29.82245508, 25.33930469, 19.97075977, 12.61276074, 7.12695312]
logx = log10(xdata)
logy = log10(ydata)
# define our (line) fitting function
fitfunc = lambda p, x: p[0] + p[1] * x
errfunc = lambda p, x, y: (y - fitfunc(p, x))
pinit = [1.0, -1.0]
out = optimize.leastsq(errfunc, pinit,
args=(logx, logy), full_output=1)
pfinal = out[0]
covar = out[1]
index = pfinal[1]
amp = 10.0**pfinal[0]
print 'amp:',amp, 'index', index
powerlaw = lambda x, amp, index: amp * (x**index)
##########
# Plotting data
##########
clf()
subplot(2, 1, 1)
plot(xdata, powerlaw(xdata, amp, index)) # Fit
plot(xdata, ydata)#, yerr=yerr, fmt='k.') # Data
text(0.0020, 30, 'Ampli = %5.2f' % amp)
text(0.0020, 25, 'Index = %5.2f' % index)
xlabel('X')
ylabel('Y')
subplot(2, 1, 2)
loglog(xdata, powerlaw(xdata, amp, index))
plot(xdata, ydata)#, yerr=yerr, fmt='k.') # Data
xlabel('X (log scale)')
ylabel('Y (log scale)')
savefig('power_law_fit.png')
show()
It helps to rescale xdata so the numbers are not all so small.
You could work in a new variable xprime = 1000*x.
Then fit xprime versus y.
Least squares will find parameters q fitting
y = q[0] + q[1] * (xprime ** q[2])
= q[0] + q[1] * ((1000*x) ** q[2])
So let
p[0] = q[0]
p[1] = q[1] * (1000**q[2])
p[2] = q[2]
Then y = p[0] + p[1] * (x ** p[2])
It also helps to change the initial guess to something closer to your desired result, such as
[max(ydata), -1, -0.5].
from scipy import optimize
import numpy as np
def fitfunc(p, x):
return p[0] + p[1] * (x ** p[2])
def errfunc(p, x, y):
return y - fitfunc(p, x)
xdata=np.array([ 0.00010851, 0.00021701, 0.00043403, 0.00086806,
0.00173611, 0.00347222])
ydata=np.array([ 29.56241016, 29.82245508, 25.33930469, 19.97075977,
12.61276074, 7.12695312])
N = 5000
xprime = xdata * N
qout,success = optimize.leastsq(errfunc, [max(ydata),-1,-0.5],
args=(xprime, ydata),maxfev=3000)
out = qout[:]
out[0] = qout[0]
out[1] = qout[1] * (N**qout[2])
out[2] = qout[2]
print "%g + %g*x^%g"%(out[0],out[1],out[2])
yields
40.1253 + -282.949*x^0.375555
The standard way to use linear least squares to obtain an exponential fit is to do what fraxel suggests in his/her answer: fit a straight line to log(y_i).
However, this method has known numerical disadvantages, particularly sensitivity (a small change in the data yields a large change in the estimate). The preferred alternative is to use a nonlinear least squares approach -- it is less sensitive. But if you are satisfied with the linear LS method for non-critical purposes, just use that.
Related
I do have a function, for example , but this can be something else as well, like a quadratic or logarithmic function. I am only interested in the domain of . The parameters of the function (a and k in this case) are known as well.
My goal is to fit a continuous piece-wise function to this, which contains alternating segments of linear functions (i.e. sloped straight segments, each with intercept of 0) and constants (i.e. horizontal segments joining the sloped segments together). The first and last segments are both sloped. And the number of segments should be pre-selected between around 9-29 (that is 5-15 linear steps + 4-14 constant plateaus).
Formally
The input function:
The fitted piecewise function:
I am looking for the optimal resulting parameters (c,r,b) (in terms of least squares) if the segment numbers (n) are specified beforehand.
The resulting constants (c) and the breakpoints (r) should be whole natural numbers, and the slopes (b) round two decimal point values.
I have tried to do the fitting numerically using the pwlf package using a segmented constant models, and further processed the resulting constant model with some graphical intuition to "slice" the constant steps with the slopes. It works to some extent, but I am sure this is suboptimal from both fitting perspective and computational efficiency. It takes multiple minutes to generate a fitting with 8 slopes on the range of 1-50000. I am sure there must be a better way to do this.
My idea would be to instead using only numerical methods/ML, the fact that we have the algebraic form of the input function could be exploited in some way to at least to use algebraic transforms (integrals) to get to a simpler optimization problem.
import numpy as np
import matplotlib.pyplot as plt
import pwlf
# The input function
def input_func(x,k,a):
return np.power(x,1/a)*k
x = np.arange(1,5e4)
y = input_func(x, 1.8, 1.3)
plt.plot(x,y);
def pw_fit(func, x_r, no_seg, *fparams):
# working on the specified range
x = np.arange(1,x_r)
y_input = func(x, *fparams)
my_pwlf = pwlf.PiecewiseLinFit(x, y_input, degree=0)
res = my_pwlf.fit(no_seg)
yHat = my_pwlf.predict(x)
# Function values at the breakpoints
y_isec = func(res, *fparams)
# Slope values at the breakpoints
slopes = np.round(y_isec / res, decimals=2)
slopes = slopes[1:]
# For the first slope value, I use the intersection of the first constant plateau and the input function
slopes = np.insert(slopes,0,np.round(y_input[np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0]] / np.argwhere(np.diff(np.sign(y_input - yHat))).flatten()[0], decimals=2))
plateaus = np.unique(np.round(yHat))
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
slopes = np.delete(slopes,to_del + 1)
plateaus = np.delete(plateaus,to_del)
breakpoints = [np.ceil(plateaus[0]/slopes[0])]
for idx, j in enumerate(slopes[1:-1]):
breakpoints.append(np.floor(plateaus[idx]/j))
breakpoints.append(np.ceil(plateaus[idx+1]/j))
breakpoints.append(np.floor(plateaus[-1]/slopes[-1]))
return slopes, plateaus, breakpoints
slo, plat, breaks = pw_fit(input_func, 50000, 8, 1.8, 1.3)
# The piecewise function itself
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
y_output = pw_calc(x, slo, plat, breaks)
plt.plot(x,y,y_output);
(Not important, but I think the fitted piecewise function is not continuous as it is. Intervals should be x<=r1; r1<x<=r2; ....)
As Anatolyg has pointed out, it looks to me that in the optimal solution (for the function posted at least, and probably for any where the derivative is different from zero), the horizantal segments will collapse to a point or the minimum segment length (in this case 1).
EDIT---------------------------------------------
The behavior above could only be valid if the slopes could have an intercept. If the intercepts are zero, as posted in the question, one consideration must be taken into account: Is the initial parabolic function defined in zero or nearby? Imagine the function y=0.001 *sqrt(x-1000), then the segments defined as b*x will have a slope close to zero and will be so similar to the constant segments that the best fit will be just the line that without intercept that fits better all the function.
Provided that the function is defined in zero or nearby, you can start by approximating the curve just by linear segments (with intercepts):
divide the function domain in N intervals(equal intervals or whose size is a function of the average curvature (or second derivative) of the function along the domain).
linear fit/regression in each intervals
for each interval, if a point (or bunch of points) in the extreme of any interval is better fitted by the line of the neighbor interval than the line of its interval, this point is assigned to the neighbor interval.
Repeat from 2) until no extreme points are moved.
Linear regressions might be optimized not to calculate all the covariance matrixes from scratch on each iteration, but just adding the contributions of the moved points to the previous covariance matrixes.
Then each linear segment (LSi) is replaced by a combination of a small constant segment at the beginning (Cbi), a linear segment without intercept (Si), and another constant segment at the end (Cei). This segments are easy to calculate as Si will contain the middle point of LSi, and Cbi and Cei will have respectively the begin and end values of the segment LSi. Then the intervals of each segment has to be calculated as an intersection between lines.
With this, the constant end segment will be collinear with the constant begin segment from the next interval so they will merge, resulting in a series of constant and linear segments interleaved.
But this would be a floating point start solution. Next, you will have to apply all the roundings which will mess up quite a lot all the segments as the conditions integer intervals and linear segments without slope can be very confronting. In fact, b,c,r are not totally independent. If ci and ri+1 are known, then bi+1 is already fixed
If nothing is broken so far, the final task will be to minimize the error/cost function (I assume that it will be the integral of the error between the parabolic function and the segments). My guess is that gradients here will be quite a pain, as if you change for example one ci, all the rest of the bj and cj will have to adapt as well due to the integer intervals restriction. However, if you can generalize the derivatives between parameters ( how much do I have to adapt bi+1 if ci changes a unit), you can propagate the change of one parameter to all other parameters and have kind of a gradient. Then for each interval, you can estimate what would be the ideal parameter and averaging all intervals calculate the best gradient step. Let me illustrate this:
Assuming first that r parameters are fixed, if I change c1 by one unit, b2 changes by 0.1, c2 changes by -0.2 and b3 changes by 0.2. This would be the gradient.
Then I estimate, comparing with the parabolic curve, that c1 should increase 0.5 (to reduce the cost by 10 points), b2 should increase 0.2 (to reduce the cost by 5 points), c2 should increase 0.2 (to reduce the cost by 6 points) and b3 should increase 0.1 (to reduce the cost by 9 points).
Finally, the gradient step would be (0.5/1·10 + 0.2/0.1·5 - 0.2/(-0.2)·6 + 0.1/0.2·9)/(10 + 5 + 6 + 9)~= 0.45. Thus, c1 would increase 0.45 units, b2 would increase 0.45·0.1, and so on.
When you add the r parameters to the pot, as integer intervals do not have an proper derivative, calculation is not straightforward. However, you can consider r parameters as floating points, calculate and apply the gradient step and then apply the roundings.
We can integrate the squared error function for linear and constant pieces and let SciPy optimize it. Python 3:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize
xl = 1
xh = 50000
a = 1.3
p = 1 / a
n = 8
def split_b_and_c(bc):
return bc[::2], bc[1::2]
def solve_for_r(b, c):
r = np.empty(2 * n)
r[0] = xl
r[1:-1:2] = c / b[:-1]
r[2::2] = c / b[1:]
r[-1] = xh
return r
def linear_residual_integral(b, x):
return (
(x ** (2 * p + 1)) / (2 * p + 1)
- 2 * b * x ** (p + 2) / (p + 2)
+ b ** 2 * x ** 3 / 3
)
def constant_residual_integral(c, x):
return x ** (2 * p + 1) / (2 * p + 1) - 2 * c * x ** (p + 1) / (p + 1) + c ** 2 * x
def squared_error(bc):
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
linear = np.sum(
linear_residual_integral(b, r[1::2]) - linear_residual_integral(b, r[::2])
)
constant = np.sum(
constant_residual_integral(c, r[2::2])
- constant_residual_integral(c, r[1:-1:2])
)
return linear + constant
def evaluate(x, b, c, r):
i = 0
while x > r[i + 1]:
i += 1
return b[i // 2] * x if i % 2 == 0 else c[i // 2]
def main():
bc0 = (xl + (xh - xl) * np.arange(1, 4 * n - 2, 2) / (4 * n - 2)) ** (
p - 1 + np.arange(2 * n - 1) % 2
)
bc = scipy.optimize.minimize(
squared_error, bc0, bounds=[(1e-06, None) for i in range(2 * n - 1)]
).x
b, c = split_b_and_c(bc)
r = solve_for_r(b, c)
X = np.linspace(xl, xh, 1000)
Y = [evaluate(x, b, c, r) for x in X]
plt.plot(X, X ** p)
plt.plot(X, Y)
plt.show()
if __name__ == "__main__":
main()
I have tried to come up with a new solution myself, based on the idea of #Amo Robb, where I have partitioned the domain, and curve fitted a dual - constant and linear - piece together (with the help of np.maximum). I have used the 1 / f(x)' as the function to designate the breakpoints, but I know this is arbitrary and does not provide a global optimum. Maybe there is some optimal function for these breakpoints. But this solution is OK for me, as it might be appropriate to have a better fit at the first segments, at the expense of the error for the later segments. (The task itself is actually a cost based retail margin calculation {supply price -> added margin}, as the retail POS software can only work with such piecewise margin function).
The answer from #David Eisenstat is correct optimal solution if the parameters are allowed to be floats. Unfortunately the POS software can not use floats. It is OK to round up c-s and r-s afterwards. But the b-s should be rounded to two decimals, as those are inputted as percents, and this constraint would ruin the optimal solution with long floats. I will try to further improve my solution with both Amo's and David's valuable input. Thank You for that!
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# The input function f(x)
def input_func(x,k,a):
return np.power(x,1/a) * k
# 1 / f(x)'
def one_per_der(x,k,a):
return a / (k * np.power(x, 1/a-1))
# 1 / f(x)' inverted
def one_per_der_inv(x,k,a):
return np.power(a / (x*k), a / (1-a))
def segment_fit(start,end,y,first_val):
b, _ = curve_fit(lambda x,b: np.maximum(first_val, b*x), np.arange(start,end), y[start-1:end-1])
b = float(np.round(b, decimals=2))
bp = np.round(first_val / b)
last_val = np.round(b * end)
return b, bp, last_val
def pw_fit(end_range, no_seg, **fparams):
y_bps = np.linspace(one_per_der(1, **fparams), one_per_der(end_range,**fparams) , no_seg+1)[1:]
x_bps = np.round(one_per_der_inv(y_bps, **fparams))
y = input_func(x, **fparams)
slopes = [np.round(float(curve_fit(lambda x,b: x * b, np.arange(1,x_bps[0]), y[:int(x_bps[0])-1])[0]), decimals = 2)]
plats = [np.round(x_bps[0] * slopes[0])]
bps = []
for i, xbp in enumerate(x_bps[1:]):
b, bp, last_val = segment_fit(int(x_bps[i]+1), int(xbp), y, plats[i])
slopes.append(b); bps.append(bp); plats.append(last_val)
breaks = sorted(list(x_bps) + bps)[:-1]
# If due to rounding slope values (to two decimals), there is no change in a subsequent step, I just remove those segments
to_del = np.argwhere(np.diff(slopes) == 0).flatten()
breaks_to_del = np.concatenate((to_del * 2, to_del * 2 + 1))
slopes = np.delete(slopes,to_del + 1)
plats = np.delete(plats[:-1],to_del)
breaks = np.delete(breaks,breaks_to_del)
return slopes, plats, breaks
def pw_calc(x, slopes, plateaus, breaks):
x = x.astype('float')
cond_list = [x < breaks[0]]
for idx, j in enumerate(breaks[:-1]):
cond_list.append((j <= x) & (x < breaks[idx+1]))
cond_list.append(breaks[-1] <= x)
func_list = [lambda x: x * slopes[0]]
for idx, j in enumerate(slopes[1:]):
func_list.append(plateaus[idx])
func_list.append(lambda x, j=j: x * j)
return np.piecewise(x, cond_list, func_list)
fparams = {'k':1.8, 'a':1.2}
end_range = 5e4
no_steps = 10
x = np.arange(1, end_range)
y = input_func(x, **fparams)
slopes, plats, breaks = pw_fit(end_range, no_steps, **fparams)
y_output = pw_calc(x, slopes, plats, breaks)
plt.plot(x,y_output,y);
I have some function which behaves as shown below i.e. some tapered/decaying oscillations
I want to fit the data using scipy's curve_fit. I have previously asked a question related to fitting functions with scipy, which was well answered here, and highlighted the importance of the initial guess for the values of the fitting parameters.
However, I am struggling to fit this data in a way which captures both the oscillations and the decay. My approach is as follows:
from scipy.optimize import curve_fit
import numpy as np
def Fit(x,y):
#Define the function fit
func = ansatz
#Define the initial guess of parameters
mag = (y.max() + y.min()) / 2
y_shifted = y - mag
omega_guess = np.pi * np.sum(y_shifted[:-1] * y_shifted[1:] < 0) / (x.max() - x.min())
lam = np.log(2) / 1e7 #Rough guess based on approximate half life
p0 = (mag,mag, omega_guess,mag,lam)
#Do the fit
popt, pcov = curve_fit(func, x,y,p0=p0)
# return
return func(x, *popt)
def ansatz(x,A,B,omega,offset,lam):
osc = A*np.sin(omega*x) + B*np.cos(omega*x)
linear = offset
decay = np.exp(-x*lam)
return decay*osc + linear
data = np.load('example.npy')
x = data[:,0]
y = data[:,1]
yFit = Fit(x,y)
This approach captures the decay, but not the oscillations. What is erroneous with my approach? Guesses for fit parameters? Function ansatz? Code implementation?
I'm tring to approximate an empirical cumulative distribution function (ECDF I want to approximate) with a smooth function (with less than 5 parameter) such as the generalized logistic function.
However, using scipy.optimize.curve_fit, the fitting operation gives really bad approximations or it doesn't work at all (depending on the initial values). The variable series represents my data stored as pandas.Series.
from scipy.optimize import curve_fit
def fit_ecdf(x):
x = np.sort(x)
def result(v):
return np.searchsorted(x, v, side='right') / x.size
return result
ecdf = fit_ecdf(series)
def genlogistic(x, B, M, Q, v):
return 1 / (1 + Q * np.exp(-B * (x - M))) ** (1 / v)
params = curve_fit(genlogistic, xdata = series, ydata = ecdf(series), p0 = (0.1, 10.0, 0.1, 0.1))[0]
Should I use another type of function for the fit?
Are there any code mistakes?
UPDATE - 1
As asked, I link to a csv containing the data.
UPDATE - 2
After a lot of search and trial and error I find out this function
f(x; a, b, c) = 1 - 1 / (1 + (x / b) ** a) ** c
with a = 4.61320000, b = 2.94570952, c = 0.5886922
which fits a lot better than the other one. The only problem is the little step that the ECDF shows near x=1. How can I modify f to improve the quality of the fit? I was thinking of adding some sort of function that is "relevant" only in those kind of points. Here are the graphical results of the fit where the solid blue line is the ECDF and the dotted line represents the (x, f(x)) points.
I find out how to deal with that little step near x=1. As expressed in the question, adding some sort of function that is significant only in that interval was the game changer.
The "step" ends at about (1.7, 0.04) so I needed a sort of function that flattens for x > 1.7 and has y = 0.04 as asymptote. The natural choice (just to stay on point) was to take a function like f(x) = 1/exp(x).
Thanks to JamesPhillips, I also picked up the proper data for the regression (no double values = no overweighted points).
Python Code
from scipy.optimize import curve_fit
def fit_ecdf(x):
x = np.sort(x)
def result(v):
return np.searchsorted(x, v, side = 'right') / x.size
return result
ecdf = fit_ecdf(series)
unique_series = series.unique().tolist()
def cdf_interpolation(x, a, b, c, d):
f_1 = 0.95 + (0 - 0.95) / (1 + (x / b) ** a) ** c + 0.05
f_2 = (0 - 0.05)/(np.exp(d * x))
return f_1 + f_2
params = curve_fit(cdf_interpolation,
xdata = unique_series ,
ydata = ecdf(unique_series),
p0 = (6.0, 3.0, 0.4, 1.0))[0]
Parameters
a = 6.03256462
b = 2.89418871
c = 0.42997956
d = 1.06864006
Graphical results
I got an OK fit for a 5-parameter logistic equation (see image and code) using unique values, not sure if the low end curve is sufficient for your needs, please check.
import numpy as np
def Sigmoidal_FiveParameterLogistic_model(x_in): # from zunzun.com
# coefficients
a = 9.9220221252324947E-01
b = -3.1572339989462903E+00
c = 2.2303376075685142E+00
d = 2.6271495036080207E-02
f = 3.4399008905318986E+00
return d + (a - d) / np.power(1.0 + np.power(x_in / c, b), f)
Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
# Curve fitting function
return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense that it uses numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(X_, y)[0]
pp = np.polyfit(x, y, 3)
print np.isclose(pp, p_all).all()
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(X_[:1, :], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
I'm attempting to estimate a decay rate using an exponential fit, but I'm puzzled by why the two methods don't give the same result.
In the first case, taking the log of the data to linearize the problem matches Excel's exponential trendline fit. I had expected that fitting the exponential directly would be the same.
import numpy as np
from scipy.optimize import curve_fit
def exp_func(x, a, b):
return a * np.exp(-b * x)
def lin_func(x, m, b):
return m*x + b
xdata = [1065.0, 1080.0, 1095.0, 1110.0, 1125.0, 1140.0, 1155.0, 1170.0, 1185.0, 1200.0, 1215.0, 1230.0, 1245.0, 1260.0, 1275.0, 1290.0, 1305.0, 1320.0, 1335.0, 1350.0, 1365.0, 1380.0, 1395.0, 1410.0, 1425.0, 1440.0, 1455.0, 1470.0, 1485.0, 1500.0]
ydata = [21.3934, 17.14985, 11.2703, 13.284, 12.28465, 12.46925, 12.6315, 12.1292, 10.32762, 8.509195, 14.5393, 12.02665, 10.9383, 11.23325, 6.03988, 9.34904, 8.08941, 6.847, 5.938535, 6.792715, 5.520765, 6.16601, 5.71889, 4.949725, 7.62808, 5.5079, 3.049625, 4.8566, 3.26551, 3.50161]
xdata = np.array(xdata)
xdata = xdata - xdata.min() + 1
ydata = np.array(ydata)
lydata = np.log(ydata)
lopt, lcov = curve_fit(lin_func, xdata, lydata)
elopt = [np.exp(lopt[1]),-lopt[0]]
eopt, ecov = curve_fit(exp_func, xdata, ydata, p0=elopt)
print 'elopt: {},{}'.format(*elopt)
print 'eopt: {},{}'.format(*eopt)
results:
elopt: 17.2526204283,0.00343624199064
eopt: 17.1516384575,0.00330590568338
You're solving two different optimization problems. The curve_fit() assumes that the noise eps_i is additive (and somewhat Gaussian). Else it wont deliver optimal results.
Assuming that you want to minimize Sum (y_i - f(x_i))**2 with:
f(x) = a * Exp(-b * x) + eps_i
where eps_i the unknown error for the i-th data item you want to eliminate. Taking the logarithm results in
Log(f(x)) = Log(a*Exp(-b*x) + eps_i) != Log(Exp(Log(a) - b*x)) + eps_i
You can interpret the exponential equation as having additive noise. Your linear version has multiplicative noise mu_i, because:
g(x) = a * mu_i * Exp(-b*x)
results in
Log(g(x) = Log(a) - b * x + Log(mu_i)
In conclusion, you will only get identical results when the magnitude of the errors eps_i is very small.