What's the error of numpy.polyfit?

What's the error of numpy.polyfit? - python

I want to use numpy.polyfit for physical calculations, therefore I need the magnitude of the error.

If you specify full=True in your call to polyfit, it will include extra information:
>>> x = np.arange(100)
>>> y = x**2 + 3*x + 5 + np.random.rand(100)
>>> np.polyfit(x, y, 2)
array([ 0.99995888, 3.00221219, 5.56776641])
>>> np.polyfit(x, y, 2, full=True)
(array([ 0.99995888, 3.00221219, 5.56776641]), # coefficients
array([ 7.19260721]), # residuals
3, # rank
array([ 11.87708199, 3.5299267 , 0.52876389]), # singular values
2.2204460492503131e-14) # conditioning threshold
The residual value returned is the sum of the squares of the fit errors, not sure if this is what you are after:
>>> np.sum((np.polyval(np.polyfit(x, y, 2), x) - y)**2)
7.1926072073491056
In version 1.7 there is also a cov keyword that will return the covariance matrix for your coefficients, which you could use to calculate the uncertainty of the fit coefficients themselves.

As you can see in the documentation:
Returns
-------
p : ndarray, shape (M,) or (M, K)
Polynomial coefficients, highest power first.
If `y` was 2-D, the coefficients for `k`-th data set are in ``p[:,k]``.
residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.
Which means that if you can do a fit and get the residuals as:
import numpy as np
x = np.arange(10)
y = x**2 -3*x + np.random.random(10)
p, res, _, _, _ = numpy.polyfit(x, y, deg, full=True)
Then, the p are your fit parameters, and the res will be the residuals, as described above. The _'s are because you don't need to save the last three parameters, so you can just save them in the variable _ which you won't use. This is a convention and is not required.
#Jaime's answer explains what the residual means. Another thing you can do is look at those squared deviations as a function (the sum of which is res). This is particularly helpful to see a trend that didn't fit sufficiently. res can be large because of statistical noise, or possibly systematic poor fitting, for example:
x = np.arange(100)
y = 1000*np.sqrt(x) + x**2 - 10*x + 500*np.random.random(100) - 250
p = np.polyfit(x,y,2) # insufficient degree to include sqrt
yfit = np.polyval(p,x)
figure()
plot(x,y, label='data')
plot(x,yfit, label='fit')
plot(x,yfit-y, label='var')
So in the figure, note the bad fit near x = 0:

Related

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks

You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
# Curve fitting function
return a * x**3 + b * x**2 + c * x # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense that it uses numpy's polyfit function to pass through the origin, but it solves the problem.
Hope someone finds it useful :)

You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(X_, y)[0]
pp = np.polyfit(x, y, 3)
print np.isclose(pp, p_all).all()
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(X_[:1, :], y)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!

How to find error on slope and intercept using numpy.polyfit

I'm fitting a straight line to some data with numpy.polyfit. The data themselves do not come with any error bars. Here's a simplified version of my code:
from numpy import polyfit
data = loadtxt("data.txt")
x,y = data[:,0],data[:,1]
fit = polyfit(x,y,1)
Of course that gives me the values for the slope and intercept, but how to I find the uncertainty on the best-fit values?

I'm a bit late to answer this, but I think that this question remains unanswered and was the top hit on Google for me. Therefore, I think the following is the correct method
x = np.linspace(0, 1, 100)
y = 10 * x + 2 + np.random.normal(0, 1, 100)
p, V = np.polyfit(x, y, 1, cov=True)
print("x_1: {} +/- {}".format(p[0], np.sqrt(V[0][0])))
print("x_2: {} +/- {}".format(p[1], np.sqrt(V[1][1])))
which outputs
x_1: 10.2069326441 +/- 0.368862837662
x_2: 1.82929420943 +/- 0.213500166807
So you need to return the covariance matrix, V, for which the square root of the diagonals are the estimated standard-deviation for each of the fitted coefficients. This of course generalised to higher dimensions.

An array of R^2 values

I have a function which computes the R^2 value for a given set of x and y data with the degree specified:
import numpy as np
# Polynomial Regression
def polyfit(x,y,degree):
coeffs = np.polyfit(x, y, degree)
# Polynomial Coefficients
results = coeffs.tolist()
# r-squared
p = np.poly1d(coeffs)
# fit values, and mean
yhat = p(x) # or [p(z) for z in x]
ybar = np.sum(y)/len(y) # or sum(y)/len(y)
ssreg = np.sum((yhat-ybar)**2) # or sum([ (yihat - ybar)**2 for yihat in yhat])
sstot = np.sum((y - ybar)**2) # or sum([ (yi - ybar)**2 for yi in y])
r2= ssreg / sstot
results=r2
return results
I tried this function to a sample set of data:
x = np.arange(1,5)
y = np.arange(1,5)
print polyfit(x,y,1)
>>1.0
So far so good. My problem now is I want to vary the degree (the 3rd parameter for the polyfit function) by means of iteration. I'm thinking of using
n=np.linspace(1,9,100)
and for each value of n, I can get an r^2 value then store it to an array.
r2= [] #array for each r^2 value for each value of n in linspace
Can someone help me on this one? I'm new in python and I'm still having a hard time doing iterations.Thank you.

array for each r^2 value for each value of n in linspace
You could write this as a list comprehension:
r2 = [polyfit(x, y, n) for n in np.linspace(1,9,100)]
Converting to a numpy array is not hard:
r2 = np.array(r2)

Second order gradient in numpy

I am trying to calculate the 2nd-order gradient numerically of an array in numpy.
a = np.sin(np.arange(0, 10, .01))
da = np.gradient(a)
dda = np.gradient(da)
This is what I come up. Is the the way it should be done?
I am asking this, because in numpy there isn't an option saying np.gradient(a, order=2). I am concerned about whether this usage is wrong, and that is why numpy does not have this implemented.
PS1: I do realize that there is np.diff(a, 2). But this is only single-sided estimation, so I was curious why np.gradient does not have a similar keyword.
PS2: The np.sin() is a toy data - the real data does not have an analytic form.
Thank you!

I'll second #jrennie's first sentence - it can all depend. The numpy.gradient function requires that the data be evenly spaced (although allows for different distances in each direction if multi-dimensional). If your data does not adhere to this, than numpy.gradient isn't going to be much use. Experimental data may have (OK, will have) noise on it, in addition to not necessarily being all evenly spaced. In this case it might be better to use one of the scipy.interpolate spline functions (or objects). These can take unevenly spaced data, allow for smoothing, and can return derivatives up to k-1 where k is the order of the spline fit requested. The default value for k is 3, so a second derivative is just fine.
Example:
spl = scipy.interpolate.splrep(x,y,k=3) # no smoothing, 3rd order spline
ddy = scipy.interpolate.splev(x,spl,der=2) # use those knots to get second derivative
The object oriented splines like scipy.interpolate.UnivariateSpline have methods for the derivatives. Note that the derivative methods are implemented in Scipy 0.13 and are not present in 0.12.
Note that, as pointed out by #JosephCottham in comments in 2018, this answer (good for Numpy 1.08 at least), is no longer applicable since (at least) Numpy 1.14. Check your version number and the available options for the call.

There's no universal right answer for numerical gradient calculation. Before you can calculate the gradient about sample data, you have to make some assumption about the underlying function that generated that data. You can technically use np.diff for gradient calculation. Using np.gradient is a reasonable approach. I don't see anything fundamentally wrong with what you are doing---it's one particular approximation of the 2nd derivative of a 1-D function.

The double gradient approach fails for discontinuities in the first derivative.
As the gradient function takes one data point to the left and to the right into account, this continues/spreads when applying it multiple times.
On the other hand side, the second derivative can be calculated by the formula
d^2 f(x[i]) / dx^2 = (f(x[i-1]) - 2*f(x[i]) + f(x[i+1])) / h^2
compare here. This has the advantage to just take the two neighboring pixels into account.
In the picture the double np.gradient approach (left) and the above mentioned formula (right), as implemented by np.diff are compared. As f(x) has only one kink at zero, the second derivative (green) should only there have a peak.
As the double gradient solution takes 2 neighboring points in each direction into account, this leads to finite second derivative values at +/- 1.
In some cases, however, you may want to prefer the double gradient solution, as this is more robust to noise.
I am not sure why there is np.gradient and np.diff, but a reason might be, that the second argument of np.gradient defines the pixel distance (for each dimension) and for images it can be applied for both dimensions simultaneously gy, gx = np.gradient(a).
Code
import numpy as np
import matplotlib.pyplot as plt
xs = np.arange(-5,6,1)
f = np.abs(xs)
f_x = np.gradient(f)
f_xx_bad = np.gradient(f_x)
f_xx_good = np.diff(f, 2)
test = f[:-2] - 2* f[1:-1] + f[2:]
# lets plot all this
fig, axs = plt.subplots(1, 2, figsize=(9, 3), sharey=True)
ax = axs[0]
ax.set_title('bad: double gradient')
ax.plot(xs, f, marker='o', label='f(x)')
ax.plot(xs, f_x, marker='o', label='d f(x) / dx')
ax.plot(xs, f_xx_bad, marker='o', label='d^2 f(x) / dx^2')
ax.legend()
ax = axs[1]
ax.set_title('good: diff with n=2')
ax.plot(xs, f, marker='o', label='f(x)')
ax.plot(xs, f_x, marker='o', label='d f(x) / dx')
ax.plot(xs[1:-1], f_xx_good, marker='o', label='d^2 f(x) / dx^2')
ax.plot(xs[1:-1], test, marker='o', label='test', markersize=1)
ax.legend()

As I keep stepping over this problem in one form or the other again and again, I decided to write a function gradient_n, which adds an differentiation oder functionality to np.gradient. Not all functionalities of np.gradient are supported, like differentiation of mutiple axis.
Like np.gradient, gradient_n returns the differentiated result in the same shape as the input. Also a pixel distance argument (d) is supported.
import numpy as np
def gradient_n(arr, n, d=1, axis=0):
"""Differentiate np.ndarray n times.
Similar to np.diff, but additional support of pixel distance d
and padding of the result to the same shape as arr.
If n is even: np.diff is applied and the result is zero-padded
If n is odd:
np.diff is applied n-1 times and zero-padded.
Then gradient is applied. This ensures the right output shape.
"""
n2 = int((n // 2) * 2)
diff = arr
if n2 > 0:
a0 = max(0, axis)
a1 = max(0, arr.ndim-axis-1)
diff = np.diff(arr, n2, axis=axis) / d**n2
diff = np.pad(diff, tuple([(0,0)]*a0 + [(1,1)] +[(0,0)]*a1),
'constant', constant_values=0)
if n > n2:
assert n-n2 == 1, 'n={:f}, n2={:f}'.format(n, n2)
diff = np.gradient(diff, d, axis=axis)
return diff
def test_gradient_n():
import matplotlib.pyplot as plt
x = np.linspace(-4, 4, 17)
y = np.linspace(-2, 2, 9)
X, Y = np.meshgrid(x, y)
arr = np.abs(X)
arr_x = np.gradient(arr, .5, axis=1)
arr_x2 = gradient_n(arr, 1, .5, axis=1)
arr_xx = np.diff(arr, 2, axis=1) / .5**2
arr_xx = np.pad(arr_xx, ((0, 0), (1, 1)), 'constant', constant_values=0)
arr_xx2 = gradient_n(arr, 2, .5, axis=1)
assert np.sum(arr_x - arr_x2) == 0
assert np.sum(arr_xx - arr_xx2) == 0
fig, axs = plt.subplots(2, 2, figsize=(29, 21))
axs = np.array(axs).flatten()
ax = axs[0]
ax.set_title('x-cut')
ax.plot(x, arr[0, :], marker='o', label='arr')
ax.plot(x, arr_x[0, :], marker='o', label='arr_x')
ax.plot(x, arr_x2[0, :], marker='x', label='arr_x2', ls='--')
ax.plot(x, arr_xx[0, :], marker='o', label='arr_xx')
ax.plot(x, arr_xx2[0, :], marker='x', label='arr_xx2', ls='--')
ax.legend()
ax = axs[1]
ax.set_title('arr')
im = ax.imshow(arr, cmap='bwr')
cbar = ax.figure.colorbar(im, ax=ax, pad=.05)
ax = axs[2]
ax.set_title('arr_x')
im = ax.imshow(arr_x, cmap='bwr')
cbar = ax.figure.colorbar(im, ax=ax, pad=.05)
ax = axs[3]
ax.set_title('arr_xx')
im = ax.imshow(arr_xx, cmap='bwr')
cbar = ax.figure.colorbar(im, ax=ax, pad=.05)
test_gradient_n()

This is an excerpt from the original documentation (at the time of writing found at http://docs.scipy.org/doc/numpy/reference/generated/numpy.gradient.html). It states that unless the sampling distance is 1 you need to include a list containing the distances as an argument.
numpy.gradient(f, *varargs, **kwargs)
Return the gradient of an N-dimensional array.
The gradient is computed using second order accurate central differences in the interior and either first differences or second order accurate one-sides (forward or backwards) differences at the boundaries. The returned gradient hence has the same shape as the input array.
Parameters:
f : array_like
An N-dimensional array containing samples of a scalar function.
varargs : list of scalar, optional
N scalars specifying the sample distances for each dimension, i.e. dx, dy, dz, ... Default distance: 1.
edge_order : {1, 2}, optional
Gradient is calculated using Nth order accurate differences at the boundaries. Default: 1.
New in version 1.9.1.
Returns:
gradient : ndarray
N arrays of the same shape as f giving the derivative of f with respect to each dimension.

My solution is to create a function similar to np.gradient that calculates the 2nd derivatives numerically from the array data.
import numpy as np
def gradient2_even(y, h=None, edge_order=1):
"""
Return the 2nd-order gradient i.e.
2nd derivatives of y with n samples and k components.
The 2nd-order gradient is computed using second-order-accurate central differences
in the interior points and either first or second order accurate one-sided
(forward or backwards) differences at the boundaries.
The returned gradient hence has the same shape as the input array.
Parameters
----------
y : 1d or 2d array_like
The array containing the samples. If 2d with shape (n,k),
n is the number of samples at least 2 while k is the number of
y series/components. 1d input is equivalent to 2d input with shape (n,1).
h : constant or 1d, optional
spacing between the y samples. Default unitary spacing for
all y components. Spacing can be specified using:
1. Single scalar spacing value for all y components.
2. 1d array_like of length k specifying the spacing for each y component
edge_order : {1, 2}, optional
Order 1 means 3-point forward/backward finite differences
are used to calculate the 2nd derivatves at the edge points while
order 2 uses 4-point forward/backward finite differences.
Returns
----------
d2y : 1d or 2d array
Array containing the 2nd derivatives. The output shape is the same as y.
"""
if edge_order!=1 and edge_order!=2:
raise ValueError('edge_order must be 1 or 2.')
else:
pass
y = np.asfarray(y)
origshape = y.shape
if y.ndim!=1 and y.ndim!=2:
raise ValueError('y can only be 1d or 2d.')
elif y.ndim==1:
y = np.atleast_2d(y).T
elif y.ndim==2:
if y.shape[0]<2:
raise ValueError('The number of y samples must be atleast 2.')
else:
pass
else:
pass
n,k = y.shape
if h is None:
h = 1.0
else:
h = np.asfarray(h)
if h.ndim!=0 and h.ndim!=1:
raise ValueError('h can only be 0d or 1d.')
elif h.ndim==0:
pass
elif h.ndim==1 and h.size!=n:
raise ValueError('If h is 1d, it must have the same number as the components of y.')
else:
pass
d2y = np.zeros_like(y)
if n==2:
pass
elif n==3:
d2y[:] = ( 1/h**2 * (y[0] - 2*y[1] + y[2]) )
else:
d2y = np.zeros_like(y)
d2y[1:-1]=1/h**2 * ( y[:-2] - 2*y[1:-1] + y[2:] )
if edge_order==1:
d2y[0]=1/h**2 * ( y[0] - 2*y[1] + y[2] )
d2y[-1]=1/h**2 * ( y[-1] - 2*y[-2] + y[-3] )
else:
d2y[0]=1/h**2 * ( 2*y[0] - 5*y[1] + 4*y[2] - y[3] )
d2y[-1]=1/h**2 * ( 2*y[-1] - 5*y[-2] + 4*y[-3] - y[-4] )
return d2y.reshape(origshape)
Using your example,
# After importing the function from the script file or running it
from numpy import *
from matplotlib.pyplot import *
x, h = linspace(0, 10, 17) # use a fairly coarse grid to see the discrepancies better
y = sin(x)
ypp = -sin(x) # analytical 2nd derivatives
# Compute numerically the 2nd derivatives using 2nd-order finite differences at the edge points
d2y = gradient2_even(y, h, 2)
# Compute numerically the 2nd derivatives using nested gradient function
d2y2 = gradient(gradient(y, h, edge_order=2), h, edge_order=2)
# Compute numerically the 2nd derivatives using 1st-order finite differences at the edge points
d2y3 = gradient2_even(y, h, 1)
fig,ax=subplots(1,1)
ax.plot(x, ypp, x, d2y, 'o', x, d2y2, 'o', x, d2y3, 'o'), ax.grid()
ax.legend(['Analytical', 'edge_order=2', 'nested gradient', 'edge_order=1'])
fig.tight_layout()

numpy.polyfit with adapted parameters

Regarding to this: polynomial equation parameters
where I get 3 parameters for a squared function y = a*x² + b*x + c now I want only to get the first parameter for a squared function which describes my function y = a*x². With other words: I want to set b=c=0 and get the adapted parameter for a. In case I understand it right, polyfit isn't able to do this.

This can be done by numpy.linalg.lstsq. To explain how to use it, it is maybe easiest to show how you would do a standard 2nd order polyfit 'by hand'. Assuming you have your measurement vectors x and y, you first construct a so-called design matrix M like so:
M = np.column_stack((x**2, x, np.ones_like(x)))
after which you can obtain the usual coefficients as the least-square solution to the equation M * k = y using lstsq like this:
k, _, _, _ = np.linalg.lstsq(M, y)
where k is the column vector [a, b, c] with the usual coefficients. Note that lstsq returns some other parameters, which you can ignore. This is a very powerful trick, which allows you to fit y to any linear combination of the columns you put into your design matrix. It can be used e.g. for 2D fits of the type z = a * x + b * y (see e.g. this example, where I used the same trick in Matlab), or polyfits with missing coefficients like in your problem.
In your case, the design matrix is simply a single column containing x**2. Quick example:
import numpy as np
import matplotlib.pylab as plt
# generate some noisy data
x = np.arange(1000)
y = 0.0001234 * x**2 + 3*np.random.randn(len(x))
# do fit
M = np.column_stack((x**2,)) # construct design matrix
k, _, _, _ = np.linalg.lstsq(M, y) # least-square fit of M * k = y
# quick plot
plt.plot(x, y, '.', x, k*x**2, 'r', linewidth=3)
plt.legend(('measurement', 'fit'), loc=2)
plt.title('best fit: y = {:.8f} * x**2'.format(k[0]))
plt.show()
Result:

The coefficients are get to minimize the squared error, you don't assign them. However, you can set some of the coefficients to zero if they are too much insignificant. E.g., I have a list of points on curve y = 33*x²:
In [51]: x=np.arange(20)
In [52]: y=33*x**2 #y = 33*x²
In [53]: coeffs=np.polyfit(x, y, 2)
In [54]: coeffs
Out[54]: array([ 3.30000000e+01, 8.99625199e-14, -7.62430619e-13])
In [55]: epsilon=np.finfo(np.float32).eps
In [56]: coeffs[np.abs(coeffs)<epsilon]=0
In [57]: coeffs
Out[57]: array([ 33., 0., 0.])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

What's the error of numpy.polyfit? - python

I want to use numpy.polyfit for physical calculations, therefore I need the magnitude of the error.

Related

numpy polyfit passing through 0

How to find error on slope and intercept using numpy.polyfit

An array of R^2 values

Second order gradient in numpy

numpy.polyfit with adapted parameters

Categories

Resources