Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
# Curve fitting function
return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense that it uses numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(X_, y)[0]
pp = np.polyfit(x, y, 3)
print np.isclose(pp, p_all).all()
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(X_[:1, :], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
Related
Following the recommendations in this answer I have used several combination of values for beta0, and as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of relative scales of values of X versus Y (X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
for num in range(1, 5):
plt.subplot(2, 2, num)
plt.title('X range is %.1f times Y' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
It seems that no combination of beta0 helps... The only way to get polyfit and ODR fit similar is to swap X and Y, OR as shown here to increase the range of values of X with regard to Y, still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and it is not a problem of the data.
=== SOLUTION ===
thanks to #norok2 answer when Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
plt.subplot(3, 3, num)
plt.title('Y range is %.1f times X' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y,
sy=min(1/np.var(Y), 1/np.var(X))) # here the trick!! :)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit works under the assumption that the error on x is negligible. If this assumption is violated, like it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change your numerical conditioning.
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
(EDIT: note there was a typo on the line above)
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
I cannot format source code in a comment, and so place it here. This code uses ODR to calculate fit statistics, note the line that has "parameter order for odr" such that I use a wrapper function for the ODR call to my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats
x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])
def f(x,b0,b1):
return b0 + (b1 * x)
def f_wrapper_for_odr(beta, x): # parameter order for odr
return f(x, *beta)
parameters, cov= curve_fit(f, x, y)
model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x,y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
ci = []
for i in range(len(parameters)):
ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])
tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0 # coef. p-values
for i in range(len(parameters)):
print('parameter:', parameters[i])
print(' conf interval:', ci[i][0], ci[i][1])
print(' tstat:', tstat_beta[i])
print(' pstat:', pstat_beta[i])
print()
I want to exactly represent my noisy data with a numpy.polynomial polynomial. How can I do that?.
In this example, I chose legendre polynomials. When I use the polynomial legfit function, the coefficients it returns are either very large or very small. So, I think I need some sort of regularization.
Why doesn't my fit get more accurate as I increase the degree of the polynomial? (It can be seen that the 20, 200, and 300 degree polynomials are essentially identical.) Are any regularization options available in the polynomial package?
I tried implementing my own regression function, but it feels like I am re-inventing the wheel. Is making my own fitting function the best path forward?
from scipy.optimize import least_squares as mini
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 5, 1000)
tofit = np.sin(3 * x) + .6 * np.sin(7*x) - .5 * np.cos(3 * np.cos(10 * x))
# This is here to illustrate what I expected the legfit function to do
# I expected it to do least squares regression with some sort of regularization.
def myfitfun(x, y, deg):
def fitness(a):
return ((np.polynomial.legendre.legval(x, a) - y)**2).sum() + np.sum(a ** 2)
return mini(fitness, np.zeros(deg)).x
degrees = [2, 4, 8, 16, 40, 200]
plt.plot(x, tofit, c='k', lw=4, label='Data')
for deg in degrees:
#coeffs = myfitfun(x, tofit, deg)
coeffs = np.polynomial.legendre.legfit(x, tofit, deg)
plt.plot(x, np.polynomial.legendre.legval(x, coeffs), label="Degree=%i"%deg)
plt.legend()
Legendre polynomials are meant to be used over the interval [-1,1]. Try to replace x with 2*x/x[-1] - 1 in your fit and you'll see that all is good:
nx = 2*x/x[-1] - 1
for deg in degrees:
#coeffs = myfitfun(x, tofit, deg)
coeffs = np.polynomial.legendre.legfit(nx, tofit, deg)
plt.plot(x, np.polynomial.legendre.legval(nx, coeffs), label="Degree=%i"%deg)
The easy way to use the proper interval in the fit is to use the Legendre class
from numpy.polynomial import Legendre as L
p = L.fit(x, y, order)
This will scale and shift the data to the interval [-1, 1] and track the scaling factors.
I have a set of coordinates (x, y, z(x, y)) which describe intensities (z) at coordinates x, y. For a set number of these intensities at different coordinates, I need to fit a 2D Gaussian that minimizes the mean squared error.
The data is in numpy matrices and for each fitting session I will have either 4, 9, 16 or 25 coordinates. Ultimately I just need to get the central position of the gaussian (x_0, y_0) that has smallest MSE.
All of the examples that I have found use scipy.optimize.curve_fit but the input data they have is over an entire mesh rather than a few coordinates.
Any help would be appreciated.
Introduction
There are multiple ways to approach this. You can use non-linear methods (e.g. scipy.optimize.curve_fit), but they'll be slow and aren't guaranteed to converge. You can linearize the problem (fast, unique solution), but any noise in the "tails" of the distribution will cause issues. There are actually a few tricks you can apply to this particular case to avoid the latter issue. I'll show some examples, but I don't have time to demonstrate all of the "tricks" right now.
Just as a side note, a general 2D guassian has 6 parameters, so you won't be able to fully fit things with 4 points. However, it sounds like you might be assuming that there's no covariance between x and y and that the variances are the same in each direction (i.e. a perfectly "round" bell curve). If that's the case, then you only need four parameters. If you know the amplitude of the guassian, you'll only need three. However, I'm going to start with the general solution, and you can simplify it later on, if you want to.
For the moment, let's focus on solving this problem using non-linear methods (e.g. scipy.optimize.curve_fit).
The general equation for a 2D guassian is (directly from wikipedia):
where:
is essentially 0.5 over the covariance matrix, A is the amplitude,
and (X₀, Y₀) is the center
Generate simplified sample data
Let's write the equation above out:
import numpy as np
import matplotlib.pyplot as plt
def gauss2d(x, y, amp, x0, y0, a, b, c):
inner = a * (x - x0)**2
inner += 2 * b * (x - x0)**2 * (y - y0)**2
inner += c * (y - y0)**2
return amp * np.exp(-inner)
And then let's generate some example data. To start with, we'll generate some data that will be easy to fit:
np.random.seed(1977) # For consistency
x, y = np.random.random((2, 10))
x0, y0 = 0.3, 0.7
amp, a, b, c = 1, 2, 3, 4
zobs = gauss2d(x, y, amp, x0, y0, a, b, c)
fig, ax = plt.subplots()
scat = ax.scatter(x, y, c=zobs, s=200)
fig.colorbar(scat)
plt.show()
Note that we haven't added any noise, and the center of the distribution is within the range that we have data (i.e. center at 0.3, 0.7 and a scatter of x,y observations between 0 and 1). For the moment, let's stick with this, and then we'll see what happens when we add noise and shift the center.
Non-linear fitting
To start with, let's use scpy.optimize.curve_fit to preform a non-linear least-squares fit to the gaussian function. (On a side note, you can play around with the exact minimization algorithm by using some of the other functions in scipy.optimize.)
The scipy.optimize functions expect a slightly different function signature than the one we originally wrote above. We could write a wrapper to "translate", but let's just re-write the gauss2d function instead:
def gauss2d(xy, amp, x0, y0, a, b, c):
x, y = xy
inner = a * (x - x0)**2
inner += 2 * b * (x - x0)**2 * (y - y0)**2
inner += c * (y - y0)**2
return amp * np.exp(-inner)
All we did was have the function expect the independent variables (x & y) as a single 2xN array.
Now we need to make an initial guess at what the guassian curve's parameters actually are. This is optional (the default is all ones, if I recall correctly), but you're likely to have problems converging if 1, 1 is not particularly close to the "true" center of the gaussian curve. For that reason, we'll use the x and y values of our largest observed z-value as a starting point for the center. I'll leave the rest of the parameters as 1, but if you know that they're likely to consistently be significantly different, change them to something more reasonable.
Here's the full, stand-alone example:
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
def main():
x0, y0 = 0.3, 0.7
amp, a, b, c = 1, 2, 3, 4
true_params = [amp, x0, y0, a, b, c]
xy, zobs = generate_example_data(10, true_params)
x, y = xy
i = zobs.argmax()
guess = [1, x[i], y[i], 1, 1, 1]
pred_params, uncert_cov = opt.curve_fit(gauss2d, xy, zobs, p0=guess)
zpred = gauss2d(xy, *pred_params)
print 'True parameters: ', true_params
print 'Predicted params:', pred_params
print 'Residual, RMS(obs - pred):', np.sqrt(np.mean((zobs - zpred)**2))
plot(xy, zobs, pred_params)
plt.show()
def gauss2d(xy, amp, x0, y0, a, b, c):
x, y = xy
inner = a * (x - x0)**2
inner += 2 * b * (x - x0)**2 * (y - y0)**2
inner += c * (y - y0)**2
return amp * np.exp(-inner)
def generate_example_data(num, params):
np.random.seed(1977) # For consistency
xy = np.random.random((2, num))
zobs = gauss2d(xy, *params)
return xy, zobs
def plot(xy, zobs, pred_params):
x, y = xy
yi, xi = np.mgrid[:1:30j, -.2:1.2:30j]
xyi = np.vstack([xi.ravel(), yi.ravel()])
zpred = gauss2d(xyi, *pred_params)
zpred.shape = xi.shape
fig, ax = plt.subplots()
ax.scatter(x, y, c=zobs, s=200, vmin=zpred.min(), vmax=zpred.max())
im = ax.imshow(zpred, extent=[xi.min(), xi.max(), yi.max(), yi.min()],
aspect='auto')
fig.colorbar(im)
ax.invert_yaxis()
return fig
main()
In this case, we exactly(ish) recover our original "true" parameters.
True parameters: [1, 0.3, 0.7, 2, 3, 4]
Predicted params: [ 1. 0.3 0.7 2. 3. 4. ]
Residual, RMS(obs - pred): 1.01560615193e-16
As we'll see in a second, this won't always be the case...
Adding Noise
Let's add some noise to our observations. All I've done here is change the generate_example_data function:
def generate_example_data(num, params):
np.random.seed(1977) # For consistency
xy = np.random.random((2, num))
noise = np.random.normal(0, 0.3, num)
zobs = gauss2d(xy, *params) + noise
return xy, zobs
However, the result looks quite different:
And as far as the parameters go:
True parameters: [1, 0.3, 0.7, 2, 3, 4]
Predicted params: [ 1.129 0.263 0.750 1.280 32.333 10.103 ]
Residual, RMS(obs - pred): 0.152444640098
The predicted center hasn't changed much, but the b and c parameters have changed quite a bit.
If we change the center of the function to somewhere slightly outside of our scatter of points:
x0, y0 = -0.3, 1.1
We'll wind up with complete nonsense as a result in the presence of noise! (It still works correctly without noise.)
True parameters: [1, -0.3, 1.1, 2, 3, 4]
Predicted params: [ 0.546 -0.939 0.857 -0.488 44.069 -4.136]
Residual, RMS(obs - pred): 0.235664449826
This is a common problem when fitting a function that decays to zero. Any noise in the "tails" can result in a very poor result. There are a number of strategies to deal with this. One of the easiest is to weight the inversion by the observed z-values. Here's an example for the 1D case: (focusing on linearized the problem) How can I perform a least-squares fitting over multiple data sets fast? If I have time later, I'll add an example of this for the 2D case.
Regarding to this: polynomial equation parameters
where I get 3 parameters for a squared function y = a*x² + b*x + c now I want only to get the first parameter for a squared function which describes my function y = a*x². With other words: I want to set b=c=0 and get the adapted parameter for a. In case I understand it right, polyfit isn't able to do this.
This can be done by numpy.linalg.lstsq. To explain how to use it, it is maybe easiest to show how you would do a standard 2nd order polyfit 'by hand'. Assuming you have your measurement vectors x and y, you first construct a so-called design matrix M like so:
M = np.column_stack((x**2, x, np.ones_like(x)))
after which you can obtain the usual coefficients as the least-square solution to the equation M * k = y using lstsq like this:
k, _, _, _ = np.linalg.lstsq(M, y)
where k is the column vector [a, b, c] with the usual coefficients. Note that lstsq returns some other parameters, which you can ignore. This is a very powerful trick, which allows you to fit y to any linear combination of the columns you put into your design matrix. It can be used e.g. for 2D fits of the type z = a * x + b * y (see e.g. this example, where I used the same trick in Matlab), or polyfits with missing coefficients like in your problem.
In your case, the design matrix is simply a single column containing x**2. Quick example:
import numpy as np
import matplotlib.pylab as plt
# generate some noisy data
x = np.arange(1000)
y = 0.0001234 * x**2 + 3*np.random.randn(len(x))
# do fit
M = np.column_stack((x**2,)) # construct design matrix
k, _, _, _ = np.linalg.lstsq(M, y) # least-square fit of M * k = y
# quick plot
plt.plot(x, y, '.', x, k*x**2, 'r', linewidth=3)
plt.legend(('measurement', 'fit'), loc=2)
plt.title('best fit: y = {:.8f} * x**2'.format(k[0]))
plt.show()
Result:
The coefficients are get to minimize the squared error, you don't assign them. However, you can set some of the coefficients to zero if they are too much insignificant. E.g., I have a list of points on curve y = 33*x²:
In [51]: x=np.arange(20)
In [52]: y=33*x**2 #y = 33*x²
In [53]: coeffs=np.polyfit(x, y, 2)
In [54]: coeffs
Out[54]: array([ 3.30000000e+01, 8.99625199e-14, -7.62430619e-13])
In [55]: epsilon=np.finfo(np.float32).eps
In [56]: coeffs[np.abs(coeffs)<epsilon]=0
In [57]: coeffs
Out[57]: array([ 33., 0., 0.])
I want to use numpy.polyfit for physical calculations, therefore I need the magnitude of the error.
If you specify full=True in your call to polyfit, it will include extra information:
>>> x = np.arange(100)
>>> y = x**2 + 3*x + 5 + np.random.rand(100)
>>> np.polyfit(x, y, 2)
array([ 0.99995888, 3.00221219, 5.56776641])
>>> np.polyfit(x, y, 2, full=True)
(array([ 0.99995888, 3.00221219, 5.56776641]), # coefficients
array([ 7.19260721]), # residuals
3, # rank
array([ 11.87708199, 3.5299267 , 0.52876389]), # singular values
2.2204460492503131e-14) # conditioning threshold
The residual value returned is the sum of the squares of the fit errors, not sure if this is what you are after:
>>> np.sum((np.polyval(np.polyfit(x, y, 2), x) - y)**2)
7.1926072073491056
In version 1.7 there is also a cov keyword that will return the covariance matrix for your coefficients, which you could use to calculate the uncertainty of the fit coefficients themselves.
As you can see in the documentation:
Returns
-------
p : ndarray, shape (M,) or (M, K)
Polynomial coefficients, highest power first.
If `y` was 2-D, the coefficients for `k`-th data set are in ``p[:,k]``.
residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.
Which means that if you can do a fit and get the residuals as:
import numpy as np
x = np.arange(10)
y = x**2 -3*x + np.random.random(10)
p, res, _, _, _ = numpy.polyfit(x, y, deg, full=True)
Then, the p are your fit parameters, and the res will be the residuals, as described above. The _'s are because you don't need to save the last three parameters, so you can just save them in the variable _ which you won't use. This is a convention and is not required.
#Jaime's answer explains what the residual means. Another thing you can do is look at those squared deviations as a function (the sum of which is res). This is particularly helpful to see a trend that didn't fit sufficiently. res can be large because of statistical noise, or possibly systematic poor fitting, for example:
x = np.arange(100)
y = 1000*np.sqrt(x) + x**2 - 10*x + 500*np.random.random(100) - 250
p = np.polyfit(x,y,2) # insufficient degree to include sqrt
yfit = np.polyval(p,x)
figure()
plot(x,y, label='data')
plot(x,yfit, label='fit')
plot(x,yfit-y, label='var')
So in the figure, note the bad fit near x = 0: