How to find error on slope and intercept using numpy.polyfit - python

I'm fitting a straight line to some data with numpy.polyfit. The data themselves do not come with any error bars. Here's a simplified version of my code:
from numpy import polyfit, loadtxt
data = loadtxt("data.txt")
x,y = data[:,0],data[:,1]
fit = polyfit(x,y,1)
Of course that gives me the values for the slope and intercept, but how do I find the uncertainty on the best-fit values?

I'm a bit late to answer this, but I think this question remains unanswered and was the top hit on Google for me. Therefore, I think the following is the correct method:
import numpy as np
x = np.linspace(0, 1, 100)
y = 10 * x + 2 + np.random.normal(0, 1, 100)
p, V = np.polyfit(x, y, 1, cov=True)
print("x_1: {} +/- {}".format(p[0], np.sqrt(V[0][0])))
print("x_2: {} +/- {}".format(p[1], np.sqrt(V[1][1])))
which outputs
x_1: 10.2069326441 +/- 0.368862837662
x_2: 1.82929420943 +/- 0.213500166807
So you need to return the covariance matrix, V, whose diagonal entries, once square-rooted, give the estimated standard deviation of each fitted coefficient. This of course generalises to higher-order fits.
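For example, here is a minimal sketch with made-up quadratic data (so the numbers it prints are illustrative only) showing the same idea for a degree-2 fit, where each coefficient's uncertainty is the square root of the corresponding diagonal entry of the covariance matrix:
import numpy as np
# hypothetical noisy quadratic data
x = np.linspace(0, 1, 200)
y = 3 * x**2 - 2 * x + 0.5 + np.random.normal(0, 0.1, x.size)
coeffs, V = np.polyfit(x, y, 2, cov=True)
errors = np.sqrt(np.diag(V))  # one standard deviation per coefficient
for name, c, e in zip(("a (x^2)", "b (x)", "c (const)"), coeffs, errors):
    print("{}: {} +/- {}".format(name, c, e))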

Related

given percentiles find distribution function python

From https://stackoverflow.com/a/30460089/2202107, we can generate CDF of a normal distribution:
import numpy as np
import matplotlib.pyplot as plt
N = 100
Z = np.random.normal(size = N)
# method 1
H, X1 = np.histogram(Z, bins=10, density=True)  # 'normed' was renamed to 'density' in newer numpy
dx = X1[1] - X1[0]
F1 = np.cumsum(H)*dx
#method 2
X2 = np.sort(Z)
F2 = np.array(range(N))/float(N)
# plt.plot(X1[1:], F1)
plt.plot(X2, F2)
plt.show()
Question: How do we generate the "original" normal distribution, given only x (eg X2) and y (eg F2) coordinates?
My first thought was plt.plot(x, np.gradient(y)), but the gradient of y was all zero (the data points are evenly spaced in y, but not in x). This kind of data often comes up in percentile calculations. The key is to get the data evenly spaced in x rather than in y, using interpolation:
x=X2
y=F2
num_points=10
xinterp = np.linspace(-2,2,num_points)
yinterp = np.interp(xinterp, x, y)
# normalize so that the sum of all bars equals 1.0
tot_val=1.0
normalization_factor = tot_val/np.trapz(np.ones(len(xinterp)),yinterp)
plt.bar(xinterp, normalization_factor * np.gradient(yinterp), width=0.2)
plt.show()
The output looks good to me:
I put my approach here for examination. Let me know if my logic is flawed.
One issue is: when num_points is large, the plot looks bad, but that's a discretization issue and I'm not sure how to avoid it.
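A minimal sketch of one way to sidestep the manual normalization (assuming the X2, F2 and imports from the snippets above): since the PDF is the derivative of the CDF, np.gradient with the x-spacing passed explicitly already gives a density estimate, whose integral over [-2, 2] is approximately the CDF increment on that interval (close to 1 for a standard normal):
xinterp = np.linspace(-2, 2, 50)
yinterp = np.interp(xinterp, X2, F2)      # CDF resampled on an evenly spaced x-grid
pdf_est = np.gradient(yinterp, xinterp)   # derivative of the CDF = density estimate
plt.bar(xinterp, pdf_est, width=xinterp[1] - xinterp[0])
plt.show()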
Related posts:
I failed to understand why the answer was so complicated in https://stats.stackexchange.com/a/6065/131632
I also didn't understand how my approach differs from Generate distribution given percentile ranks

np.polyfit won't plot a characteristic but gives values

I am using the code below to find the 1-norm of my error. Firstly, when I plot the error against step size h, the error values are quite small, in the range of 10^-14 to 10^-16. Secondly, underneath, you can see my attempt to apply np.polyfit to my graph, which, when run, won't fit a characteristic but will output values. The value of p[0] is not perfect, so I believe something is wrong, but it is "close" to the desired output of 3. Is this a matter of just the wrong input or bad data?
def rk3(A, bvector, y0, interval, N):
    x0 = interval[0]
    x_end = interval[1]
    x = np.linspace(x0, x_end, N+1)
    h = (x_end - x0)/N
    y = np.zeros((N+1, len(y0)))
    y[0, :] = y0
    for n in range(N):
        y_1 = y[n,:] + h*(np.dot(A, y[n,:]) + bvector(x[n]))
        y_2 = (3/4)*y[n,:] + (1/4)*y_1 + (1/4)*h*(np.dot(A, y_1) + bvector(x[n]+h))
        y[n+1,:] = (1/3)*y[n,:] + (2/3)*y_2 + (2/3)*h*(np.dot(A, y_2) + bvector(x[n]+(1/2)*h))
    return x, y

err_vals = []
h_vals = []
for k in range(2, 11):  # for the range of N=40k, where k=1,...,10
    N = 40*k
    x, y = rk3(A, bvector, y0, [0, 0.1], N)
    yc = y[-1,:]
    h = (x[-1] - x[0])/N
    h_vals.append(h)
    yvals.append(yc)
    yn = y[:,1]
    abs_err = np.zeros(N)
    print("The value of y at k=", k, " is ", yc)
    for j in range(1, N):
        y_exact = np.array([np.exp(-1000*x[j]), (1000/999)*(np.exp(-x[j]) - np.exp(-1000*x[j]))])
        y_exact_2 = y_exact[1]
        abs_err[j] = np.abs((y[j, 1] - y_exact_2)/y_exact_2)
    Error = h*np.sum(abs_err[j])
    err_vals.append(Error)

p = np.polyfit(np.log(h_vals), np.log(err_vals), 1)
pyplot.loglog(h_vals, err_vals, "kx")
pyplot.xlabel("h")
pyplot.ylabel("Error")
pyplot.loglog(h, np.exp(p[1])*h**(p[0]), 'r--')
print("Best fit line slope ", format(p[0]))
My evolution of your code below gives a completely straight line with slope close to 3 for the integration over the interval [0,0.01].
For the given interval [0,0.1] the slope value is about 1/3 larger. The error profiles, that is, the absolute error divided by the expected global-error power of the step size (h^3), give a converging pattern, confirming the order-3 convergence of the method.
The error bound 2e7*h^3 is rather large, showing why the combination of problem and method can become very problematic for larger step sizes.
The error is computed via the L1 norms of the function difference and exact solution,
Error = sum(abs((y-y_exact(x))[:,1]))/sum(abs(y[:,1]))
giving a mathematically sound quantity. The summation of the local relative errors can lead to distortions of the total error where the exact solution has a root or small values. Still, even using your computation method of integrating the local relative error, leaving out the first data point (which is zero),
Error = sum(abs((y[1:,1]/y_exact(x)[1:,1]-1)))*h
gives a similar linear plot, with the range shifted down to 1e-7..1e-9 and the slope staying at 3.0293.
Note that if you want to use the list h_vals in a computation like the one that plots the fitted line, you have to convert it into a numpy array first.
h=np.asarray(h_vals)
Complete code:
import numpy as np
import matplotlib.pyplot as plt

def rk3(A, bvector, y0, interval, N):
    """Solves an IVP y'=f(x, y(x)) on x in [0, x_end] with y(0) = y0 using N points, using a Runge-Kutta method."""
    x = np.linspace(*interval, N+1)
    h = x[1] - x[0]
    y = np.zeros((N+1, len(y0)))
    y[0, :] = y0
    for n in range(N):
        y_1 = y[n] + h*(np.dot(A, y[n]) + bvector(x[n]))
        y_2 = (3/4)*y[n,:] + (1/4)*y_1 + (1/4)*h*(np.dot(A, y_1) + bvector(x[n]+h))
        y[n+1] = (1/3)*y[n] + (2/3)*y_2 + (2/3)*h*(np.dot(A, y_2) + bvector(x[n]+0.5*h))
    return x, y

A = np.array([[-1000.0, 0.0], [1000.0, -1.0]])
bvector = lambda x: 0
y_exact = lambda x: np.array([np.exp(-1000*x), (1000/999)*(np.exp(-x)-np.exp(-1000*x))]).T
y0 = y_exact(0)

plt.figure(figsize=(6,3))
h_vals, y_vals, err_vals = [], [], []
for k in range(2, 11):  # for the range of N=40k, where k=1,...,10
    N = 40*k
    x, y = rk3(A, bvector, y0, [0, 0.01], N)
    yc = y[-1,:]
    h = x[1] - x[0]
    plt.plot(x, (y - y_exact(x))[:,1]/h**3)  # error profile scaled by the expected h^3
    h_vals.append(h)
    y_vals.append(yc)
    yn = y[:,1]
    print("The value of y at k=", k, " is ", yc)
    Error = sum(abs((y - y_exact(x))[:,1]))/sum(abs(y[:,1]))
    err_vals.append(Error)
plt.grid(); plt.show()

p = np.polyfit(np.log(h_vals), np.log(err_vals), 1)
plt.figure(figsize=(6,4))
plt.loglog(h_vals, err_vals, "kx")
h = np.asarray(h_vals)
plt.plot(h, np.exp(p[1])*h**(p[0]), '--r', lw=0.5)
plt.xlabel("h")
plt.ylabel("Error")
plt.grid(); plt.show()
print("Best fit line slope ", format(p[0]))

numpy polyfit passing through 0

Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x  # d=0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
It does not answer the question in the sense of using numpy's polyfit to force the curve through the origin, but it does solve the problem.
Hope someone finds it useful :)
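The original question also includes a weight vector wgt, which the snippet above drops. A minimal sketch of one way to carry the weights over (this treats 1/wgt as a per-point uncertainty passed to curve_fit's sigma argument, which is only one possible convention; the data here are made up):
import numpy as np
from scipy.optimize import curve_fit

def fit_func(x, a, b, c):
    return a * x**3 + b * x**2 + c * x  # constant term fixed at zero

# hypothetical data and weights, purely for illustration
x = np.linspace(0.1, 5, 50)
y = 0.5 * x**3 - 1.2 * x**2 + 2.0 * x + np.random.normal(0, 0.5, x.size)
wgt = np.linspace(1.0, 2.0, x.size)

params, cov = curve_fit(fit_func, x, y, sigma=1.0/wgt)
a, b, c = params
print(a, b, c)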
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y, rcond=None)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y)[0] # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x,y)=(0,0) you will bias your estimates of your output solution coefficients (p) because lstsq will be trying to compensate for that fact that there is an offset in your data. Sort of a 'square peg round hole' problem.
Furthermore, you could also fit your data to a cubic only by doing:
p_ = np.linalg.lstsq(XX[:, :1], y, rcond=None)[0]  # keep only the x**3 column
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
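Since the original question also has a weight vector wgt, here is a minimal sketch of one way to fold weights into this lstsq approach (scaling both sides by sqrt(wgt) is the standard weighted-least-squares trick; the data and weights below are made up):
import numpy as np

# hypothetical data and weights, purely for illustration
x = np.arange(1, 101, dtype=float)
y = 0.07 * x**3 + 0.3 * x**2 + 1.1 * x + 1000 * np.random.randn(x.size)
wgt = np.ones_like(x)

# design matrix without the constant column, so the fit passes through the origin
XX = np.vstack((x**3, x**2, x)).T

# weighted least squares: scale rows by sqrt(w) so lstsq minimizes sum(w * residual**2)
sw = np.sqrt(wgt)
p_weighted = np.linalg.lstsq(XX * sw[:, None], y * sw, rcond=None)[0]
print(p_weighted)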

What's the error of numpy.polyfit?

I want to use numpy.polyfit for physical calculations, so I need the magnitude of the error.
If you specify full=True in your call to polyfit, it will include extra information:
>>> x = np.arange(100)
>>> y = x**2 + 3*x + 5 + np.random.rand(100)
>>> np.polyfit(x, y, 2)
array([ 0.99995888, 3.00221219, 5.56776641])
>>> np.polyfit(x, y, 2, full=True)
(array([ 0.99995888, 3.00221219, 5.56776641]), # coefficients
array([ 7.19260721]), # residuals
3, # rank
array([ 11.87708199, 3.5299267 , 0.52876389]), # singular values
2.2204460492503131e-14) # conditioning threshold
The residual value returned is the sum of the squares of the fit errors; I'm not sure if that is what you are after:
>>> np.sum((np.polyval(np.polyfit(x, y, 2), x) - y)**2)
7.1926072073491056
In version 1.7 there is also a cov keyword that will return the covariance matrix for your coefficients, which you could use to calculate the uncertainty of the fit coefficients themselves.
As you can see in the documentation:
Returns
-------
p : ndarray, shape (M,) or (M, K)
Polynomial coefficients, highest power first.
If `y` was 2-D, the coefficients for `k`-th data set are in ``p[:,k]``.
residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.
This means that you can do a fit and get the residuals as:
import numpy as np
x = np.arange(10)
y = x**2 -3*x + np.random.random(10)
p, res, _, _, _ = np.polyfit(x, y, 2, full=True)
Then p holds your fit parameters and res the residual, as described above. The _'s are there because you don't need the last three return values, so you can just assign them to the throwaway variable _, which you won't use. This is a convention and is not required.
@Jaime's answer explains what the residual means. Another thing you can do is look at those squared deviations as a function of x (the sum of which is res). This is particularly helpful for spotting a trend that the fit didn't capture. res can be large because of statistical noise, or possibly systematic poor fitting, for example:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
y = 1000*np.sqrt(x) + x**2 - 10*x + 500*np.random.random(100) - 250
p = np.polyfit(x, y, 2)  # insufficient degree to include sqrt
yfit = np.polyval(p, x)

plt.figure()
plt.plot(x, y, label='data')
plt.plot(x, yfit, label='fit')
plt.plot(x, yfit - y, label='var')
plt.legend()
plt.show()
So in the figure, note the bad fit near x = 0:

Nonlinear e^(-x) regression using scipy, python, numpy

The code below is giving me a flat line for the line of best fit rather than a nice curve along the model of e^(-x) that would fit the data. Can anyone show me how to fix the code below so that it fits my data?
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize

def _eNegX_(p, x):
    x0, y0, c, k = p
    y = (c * np.exp(-k*(x-x0))) + y0
    return y

def _eNegX_residuals(p, x, y):
    return y - _eNegX_(p, x)

def Get_eNegX_Coefficients(x, y):
    print('x is: ', x)
    print('y is: ', y)
    # Calculate p_guess for the vectors x,y. Note that p_guess is the
    # starting estimate for the minimization.
    p_guess = (np.median(x), np.min(y), np.max(y), .01)
    # Calls the leastsq() function, which calls the residuals function with an initial
    # guess for the parameters and with the x and y vectors. Note that the residuals
    # function also calls the _eNegX_ function. This will return the parameters p that
    # minimize the least squares error of the _eNegX_ function with respect to the original
    # x and y coordinate vectors that are sent to it.
    p, cov, infodict, mesg, ier = scipy.optimize.leastsq(
        _eNegX_residuals, p_guess, args=(x, y), full_output=1)
    # Define the optimal values for each element of p that were returned by the leastsq() function.
    x0, y0, c, k = p
    print('''Reference data:
    x0 = {x0}
    y0 = {y0}
    c = {c}
    k = {k}
    '''.format(x0=x0, y0=y0, c=c, k=k))
    print('x.min() is: ', x.min())
    print('x.max() is: ', x.max())
    # Create a numpy array of x-values
    numPoints = int(np.floor((x.max()-x.min())*100))
    xp = np.linspace(x.min(), x.max(), numPoints)
    print('numPoints is: ', numPoints)
    print('xp is: ', xp)
    print('p is: ', p)
    pxp = _eNegX_(p, xp)
    print('pxp is: ', pxp)
    # Plot the results
    plt.plot(x, y, '>', xp, pxp, 'g-')
    plt.xlabel('BPM%Rest')
    plt.ylabel('LVET/BPM', rotation='vertical')
    plt.xlim(0, 3)
    plt.ylim(0, 4)
    plt.grid(True)
    plt.show()
    return p

# Declare raw data for use in creating regression equation
x = np.array([1, 1.425, 1.736, 2.178, 2.518], dtype='float')
y = np.array([3.489, 2.256, 1.640, 1.043, 0.853], dtype='float')
p = Get_eNegX_Coefficients(x, y)
It looks like it's a problem with your initial guesses; something like (1, 1, 1, 1) works fine:
You have
p_guess=(np.median(x),np.min(y),np.max(y),.01)
for the function
def _eNegX_(p, x):
    x0, y0, c, k = p
    y = (c * np.exp(-k*(x-x0))) + y0
    return y
So that's test_data_max * e^(-0.01 * (x - test_data_median)) + test_data_min
I don't know much about the art of choosing good starting parameters, but I can say a few things. leastsq is finding a local minimum here - the key in choosing these values is to find the right mountain to climb, not to try to cut down on the work that the minimization algorithm has to do. Your initial guess looks like this (green):
(1.736, 0.853, 3.489, 0.01)
which results in your flat line (blue), whose fitted parameters come out as
(-59.20295956, 1.8562, 1.03477144, 0.69483784)
Greater gains were made in adjusting the height of the line than in increasing the k value. If you know you're fitting to this kind of data, use a larger k. If you don't know, I guess you could try to find a decent k value by sampling your data, or working back from the slope between an average of the first half and the second half, but I wouldn't know how to go about that.
Edit: You could also start with several guesses, run the minimization several times, and take the line with the lowest residuals.
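A minimal sketch of that multi-start idea (the starting guesses are arbitrary and purely illustrative; it reuses _eNegX_ and _eNegX_residuals from the question and keeps the fit with the smallest sum of squared residuals):
import numpy as np
import scipy.optimize

def _eNegX_(p, x):
    x0, y0, c, k = p
    return (c * np.exp(-k*(x - x0))) + y0

def _eNegX_residuals(p, x, y):
    return y - _eNegX_(p, x)

x = np.array([1, 1.425, 1.736, 2.178, 2.518])
y = np.array([3.489, 2.256, 1.640, 1.043, 0.853])

# a handful of arbitrary starting guesses for (x0, y0, c, k)
guesses = [(1, 1, 1, 1), (0, 0, 3, 1), (np.median(x), np.min(y), np.max(y), 2)]

best_p, best_ssr = None, np.inf
for p_guess in guesses:
    p = scipy.optimize.leastsq(_eNegX_residuals, p_guess, args=(x, y))[0]
    ssr = np.sum(_eNegX_residuals(p, x, y)**2)  # sum of squared residuals
    if ssr < best_ssr:
        best_p, best_ssr = p, ssr

print("best parameters:", best_p, "with sum of squared residuals:", best_ssr)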
