What is the easiest way to get uncertainties on linear regression parameters? - python

I used to run the DROITEREG function in a calc sheet. Here is an example:
At the top left are the data, and at the bottom the results of the DROITEREG function, which form a 2 by 5 table. I wrote the labels of several cells. a and b are the parameters of the linear regression, and u(a) and u(b) are the uncertainties on a and b. I would like to compute these uncertainties with a numpy function.
I succeeded with the curve_fit function:
import numpy as np
from scipy.stats import linregress
from scipy.optimize import curve_fit
data_o = """
0.42 2.0
0.97 5.0
1.71 10.0
2.49 20.0
3.53 50.0
3.72 100.0
"""
vo, So = np.loadtxt(data_o.split("\n"), unpack=True)
def f_model(x, a, b):
    return a * x + b
popt, pcov = curve_fit(
    f=f_model,     # model function
    xdata=1 / So,  # x data
    ydata=1 / vo,  # y data
    p0=(1, 1),     # initial values of the parameters
)
# parameters
print(popt)
# uncertainties:
print(np.sqrt(np.diag(pcov)))
The output is the following; the results are consistent with those obtained with DROITEREG:
[ 4.35522612 0.18629772]
[ 0.07564571 0.01699926]
But this is not fully satisfactory, because this should be obtainable easily from a simple least-squares function. So I tried to use polyfit.
(a, b), Mcov = np.polyfit(1 / So, 1 / vo, 1, cov=True)
print("a = ", a, " b = ", b)
print("SSR = ", sum([(y - (a * x + b))**2 for x, y in zip(1 / So, 1 / vo)]))
print("Cov mat\n", Mcov)
print("Cov mat diag ", np.diag(Mcov))
print("sqrt 1/2 cov mat diag ", np.sqrt(0.5 * np.diag(Mcov)))
The output is :
a = 4.35522612104 b = 0.186297716685
SSR = 0.00398117627681
Cov mat
[[ 0.01144455 -0.00167853]
[-0.00167853 0.00057795]]
Cov mat diag [ 0.01144455 0.00057795]
sqrt 1/2 cov mat diag [ 0.07564571 0.01699926]
At the end, I noticed that the Mcov matrix from polyfit is 2 times the pcov matrix from curve_fit. When I tried a fit with a higher-degree polynomial, I saw that the factor is equal to the number of parameters.
I did not succeed with linregress from scipy.stats, because I do not know how to obtain the covariance matrix of the parameter estimates from it. I also succeeded with scipy.odr, but it is again more complicated than the above solutions, and this for a trivial linear regression. Maybe I missed something, because I am not well versed in statistics and I do not really understand the meaning of this covariance matrix.
Thus, what I would like to know is the easiest way to obtain the parameters of the linear regression and the related uncertainties (the correlation coefficient would also be a nice addition, but it is easier to compute). My main objective is, for example, to give a student in chemistry or physics an easy way to do this linear regression and compute the parameters associated with this model.
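For completeness, recent SciPy versions let linregress do this directly: the result object exposes stderr for the slope and, since SciPy 1.6.0, intercept_stderr for the intercept (plus rvalue for the correlation coefficient). A minimal sketch with the same data as above:

```python
import numpy as np
from scipy.stats import linregress

# Same data as above: vo in the first column, So in the second
vo = np.array([0.42, 0.97, 1.71, 2.49, 3.53, 3.72])
So = np.array([2.0, 5.0, 10.0, 20.0, 50.0, 100.0])

res = linregress(1 / So, 1 / vo)

print("a =", res.slope, "+/-", res.stderr)
print("b =", res.intercept, "+/-", res.intercept_stderr)
print("r =", res.rvalue)
```

The printed uncertainties match the square roots of the diagonal of pcov from curve_fit, since both estimate the noise level from the residuals.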

Related

Numpy polyfit: possible error in the scaling of the covariance matrix?

I am having a hard time figuring out the scaling for the covariance matrix in numpy polyfit.
In the documentation I read that the scaling factor to go from an unscaled to a scaled covariance matrix is
chi2 / sqrt(N - DOF).
In the code attached below, it seems that the scaling factor actually is
chi2 / DOF
Here is my code
# Generate synthetically the data
# True parameters
import numpy as np
true_slope = 3
true_intercept = 7
x_data = np.linspace(-5, 5, 30)
# The y-data will have a noise term, to simulate imperfect observations
sigma = 1
y_data = true_slope * np.linspace(-5, 5, 30) + true_intercept
y_obs = y_data + np.random.normal(loc=0.0, scale=sigma, size=x_data.size)
# Here I generate artificially some unequal uncertainties
# (even if there is no reason for them to be so)
y_uncertainties = sigma * np.random.normal(loc=1.0, scale=0.5*sigma, size=x_data.size)
# Make the fit
popt, pcov = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov='unscaled')
popt, pcov_scaled = np.polyfit(x_data, y_obs, 1, w=1/y_uncertainties, cov=True)
my_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 / y_uncertainties**2) \
    / (len(y_obs) - 2)
scale_factor = pcov_scaled[0,0] / pcov[0,0]
If I run the code, I see that the actual scale factor is chi2 / DOF and not the value reported in the documentation. Is this true or am I missing something?
I have a further question. Why is it suggested to use just the inverse of the y-data error instead of the square of the inverse of the y-data errors for the weights in the case that the uncertainties are normally-distributed?
Edit to add the data generated by a run of the code
x_data = array([-5. , -4.65517241, -4.31034483, -3.96551724, -3.62068966,
-3.27586207, -2.93103448, -2.5862069 , -2.24137931, -1.89655172,
-1.55172414, -1.20689655, -0.86206897, -0.51724138, -0.17241379,
0.17241379, 0.51724138, 0.86206897, 1.20689655, 1.55172414,
1.89655172, 2.24137931, 2.5862069 , 2.93103448, 3.27586207,
3.62068966, 3.96551724, 4.31034483, 4.65517241, 5. ])
y_obs = array([-7.27819725, -8.41939411, -3.9089926 , -5.24622589, -3.78747379,
-1.92898727, -1.375255 , -1.84388812, -0.37092441, 0.27572306,
2.57470918, 3.860485 , 4.62580789, 5.34147103, 6.68231985,
7.38242258, 8.28346559, 9.46008873, 10.69300274, 12.46051285,
13.35049975, 13.28279961, 14.31604781, 16.8226239 , 16.81708308,
18.64342284, 19.37375515, 19.6714002 , 20.13700708, 22.72327533])
y_uncertainties = array([ 0.63543112, 1.07608924, 0.83603265, -0.03442888, -0.07049299,
1.30864191, 1.36015322, 1.42125414, 1.04099854, 1.20556608,
0.43749964, 1.635056 , 1.00627014, 0.40512511, 1.19638787,
1.26230966, 0.68253139, 0.98055035, 1.01512232, 1.83910276,
0.96763007, 0.57373151, 1.69358475, 0.62068133, 0.70030971,
0.34648312, 1.85234844, 1.18687269, 1.23841579, 1.19741206])
With this data I obtain that scale_factor = 1.6534129347542432, my_scale_factor = 1.653412934754234 and that the "nominal" scale factor reported in the documentation, i.e.
nominal_scale_factor = np.sum((y_obs - popt[0] * x_data - popt[1])**2 /\
y_uncertainties**2) / np.sqrt(len(y_obs) - len(y_obs) + 2)
has value nominal_scale_factor = 32.73590595145554
PS. my numpy version is
1.18.5 3.7.7 (default, May 6 2020, 11:45:54) [MSC v.1916 64 bit (AMD64)]
Regarding the numpy.polyfit documentation:
By default, the covariance are scaled by chi2/sqrt(N-dof), i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
This looks like a documentation bug. The correct scaling factor for the covariance is chi_square/(N-M) where M is the number of fit parameters and N-M is the number of degrees of freedom. It looks like np.polyfit is implemented correctly, because my_scale_factor and scale_factor are consistent.
Regarding the question on why not "the square of the inverse of the y-data errors": a polynomial fit, or more generally a least-squares fit, involves solving for the p vector in
A @ p = y
where A is an (N, M) matrix for N data points in y and M elements in p and each column in A is the polynomial term evaluated at the corresponding x values.
The solution minimizes

    SUM_i [ (SUM_j A[i, j] p[j] - y[i])^2 / sigma_y[i]^2 ]
Computationally, the cheapest way to calculate this is to multiply each row in A and each y value by the corresponding 1/sigma_y and then take a standard least-squares solution of the A @ p = y equation. By having the user supply the inverse errors, it saves the fit routine from handling division-by-zero issues and slow square-root operations.
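To make that concrete, here is a small sketch (with made-up data) checking that scaling the rows of A and the y values by 1/sigma and calling a plain least-squares solver reproduces np.polyfit with w = 1/sigma:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 30)
y = 3 * x + 7 + rng.normal(size=x.size)
sigma = rng.uniform(0.5, 1.5, size=x.size)  # made-up per-point errors
w = 1 / sigma

# Weighted fit via polyfit
p_polyfit = np.polyfit(x, y, 1, w=w)

# Same fit by hand: multiply each row of A and each y by 1/sigma
A = np.vander(x, 2)  # columns [x, 1], highest power first
p_lstsq = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)[0]

print(p_polyfit)
print(p_lstsq)
```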
Regarding the first part, I opened a Github issue
https://github.com/numpy/numpy/issues/16842
The conclusion on that thread is that the documentation is wrong, but the function behaves correctly.
The documentation should be updated to
By default, the covariance is scaled by chi2/dof, i.e., the weights are presumed to be unreliable except in a relative sense and everything is scaled such that the reduced chi2 is unity.
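That corrected statement is easy to check numerically; a minimal sketch with made-up data, assuming a numpy version that contains the fix:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-5, 5, 30)
y = 3 * x + 7 + rng.normal(size=x.size)
sigma = rng.uniform(0.5, 1.5, size=x.size)

popt, pcov_unscaled = np.polyfit(x, y, 1, w=1/sigma, cov='unscaled')
popt, pcov_scaled = np.polyfit(x, y, 1, w=1/sigma, cov=True)

# Weighted chi-square of the fit and its degrees of freedom
chi2 = np.sum(((y - np.polyval(popt, x)) / sigma) ** 2)
dof = len(x) - 2

print(pcov_scaled[0, 0] / pcov_unscaled[0, 0])  # actual scale factor
print(chi2 / dof)                               # chi2/dof, per the fixed docs
```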

Calculating errors in python curvefit/polyfit by setting it based on chi^2=2

I have three arrays, namely x, y and y_sigma, each containing 3 data points. y_sigma are absolute sigmas obtained from measurements, and I have used Python's curve_fit to fit these three points.
x = np.array([2005.3877, 2005.4616, 2017.3959])
y = np.array([631137.78043004, 631137.88309611, 631138.12697976])
y_sigma = np.array([1.12781053, 1.1152334 , 0.31252557])
def linear(x, m, c):
    return m * x + c
popt, pcov = curve_fit(linear, x, y, sigma=y_sigma, absolute_sigma = True)
print ('Best fit line =', popt[0])
print ('Uncertainty on line =', np.sqrt(pcov[0, 0]))
Best fit line = 0.0246132542120004
Uncertainty on line = 0.07066734926776239
If I perform the chi statistic I get:
chi = (y - linear(x, *popt)) / y_sigma
chi2 = (chi ** 2).sum()
dof = len(x) - len(popt)
chi2: 0.004042667706414678
dof: 1
chi2 is very small but the uncertainty is comparably larger. I did a little bit of reading and understood that the standard approach for estimating errors is to set Delta chi^2 = 1. But if you have a chi^2 of << 1 per degree of freedom (like in my example) and use the standard approach, you end up with the supposed one-sigma errors giving a good fit. So I think what I want to do is to set the errors based on chi^2 = 2, since I have one degree of freedom, and then go to 1 + 1.
Basically, what I want to do is freeze a parameter, let's say c (the intercept), and fit the function with m such that the chi-square statistic gives me a value close to 2 rather than 1!

Can't figure out how to print the least squares error

I wrote some code to find the best fitting line for a couple of data points using the analytical solution to least squares. Now I would like to print the error between the actual data and my estimated line, but I have no idea how to compute it. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
A = np.array(((0,1),
(1,1),
(2,1),
(3,1)))
b = np.array((1,2,0,3), ndmin = 2 ).T
xstar = np.matmul( np.matmul( np.linalg.inv( np.matmul(A.T, A) ), A.T), b)
print(xstar)
plt.scatter(A.T[0], b)
u = np.linspace(0,3,20)
plt.plot(u, u * xstar[0] + xstar[1], 'b-')
You have already plotted the predictions from the linear regression. To compute the error, evaluate the predictions at the original x values (not at the plotting grid u); then you can calculate the "sum of squared errors (SSE)" or the "mean squared error (MSE)" as follows:
y_prediction = A.T[0] * xstar[0] + xstar[1]
SSE = np.sum(np.square(y_prediction - b.ravel()))
MSE = np.mean(np.square(y_prediction - b.ravel()))
print(SSE)
print(MSE)
As an aside, you might want to use np.linalg.pinv, as it is a more numerically stable matrix inverse operator.
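To illustrate that aside, assuming the same A and b as in the question, the pinv-based solution agrees with the explicit normal-equations one:

```python
import numpy as np

A = np.array(((0, 1),
              (1, 1),
              (2, 1),
              (3, 1)))
b = np.array((1, 2, 0, 3), ndmin=2).T

# Normal equations with an explicit inverse (as in the question)
xstar = np.matmul(np.matmul(np.linalg.inv(np.matmul(A.T, A)), A.T), b)

# Pseudo-inverse: same answer, better numerical behaviour
xstar_pinv = np.linalg.pinv(A) @ b

print(xstar.ravel())
print(xstar_pinv.ravel())
```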
Note that numpy has a function for this, called lstsq (i.e. least squares), that returns the residuals as well as the solution, so you don't have to implement it yourself. It returns four values, and the residuals come back as the summed squared residuals:
xstar, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
SSE = residuals[0]
MSE = SSE / len(b)
try it!

numpy fit coefficients to linear combination of polynomials

I have data that I want to fit with polynomials. I have 200,000 data points, so I want an efficient algorithm. I want to use the numpy.polynomial package so that I can try different families and degrees of polynomials. Is there some way I can formulate this as a system of equations like Ax=b? Is there a better way to solve this than with scipy.minimize?
import numpy as np
from scipy.optimize import minimize as mini
x1 = np.random.random(2000)
x2 = np.random.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin(30 * x1 - x2 / 10)
def fitness(x, degree=5):
    poly1 = np.polynomial.polynomial.polyval(x1, x[:degree])
    poly2 = np.polynomial.polynomial.polyval(x2, x[degree:])
    return np.sum((y - (poly1 + poly2)) ** 2)
# It seems like I should be able to solve this as a system of equations
# x = np.linalg.solve(np.concatenate([x1, x2]), y)
# minimize the sum of the squared residuals to find the optimal polynomial coefficients
x = mini(fitness, np.ones(10))
print(fitness(x.x))
Your intuition is right. You can solve this as a system of equations of the form Ax = b.
However:
The system is overdetermined and you want the least-squares solution, so you need to use np.linalg.lstsq instead of np.linalg.solve.
You can't use polyval because you need to separate the coefficients and powers of the independent variable.
This is how to construct the system of equations and solve it:
A = np.stack([x1**0, x1**1, x1**2, x1**3, x1**4, x2**0, x2**1, x2**2, x2**3, x2**4]).T
xx = np.linalg.lstsq(A, y, rcond=None)[0]
print(fitness(xx)) # test the result with original fitness function
Of course you can generalize over the degree:
A = np.stack([x1**p for p in range(degree)] + [x2**p for p in range(degree)]).T
With the example data, the least squares solution runs much faster than the minimize solution (800µs vs 35ms on my laptop). However, A can become quite large, so if memory is an issue minimize might still be an option.
Update:
Without any knowledge about the internals of the polynomial function things become tricky, but it is possible to separate terms and coefficients. Here is a somewhat ugly way to construct the system matrix A from a function like polyval:
def construct_A(valfunc, degree):
    columns1 = []
    columns2 = []
    for p in range(degree):
        c = np.zeros(degree)
        c[p] = 1
        columns1.append(valfunc(x1, c))
        columns2.append(valfunc(x2, c))
    return np.stack(columns1 + columns2).T
A = construct_A(np.polynomial.polynomial.polyval, 5)
xx = np.linalg.lstsq(A, y, rcond=None)[0]
print(fitness(xx)) # test the result with original fitness function
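Alternatively, assuming the chosen family exposes a Vandermonde helper (each module under np.polynomial does, e.g. polyvander), the system matrix can be built without the unit-coefficient trick; this is equivalent to construct_A above:

```python
import numpy as np
from numpy.polynomial.polynomial import polyvander, polyval

rng = np.random.default_rng(0)
x1 = rng.random(2000)
x2 = rng.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin(30 * x1 - x2 / 10)

degree = 5  # number of coefficients per variable (powers 0..4)
# polyvander(x, d) returns the columns [x**0, x**1, ..., x**d]
A = np.hstack([polyvander(x1, degree - 1), polyvander(x2, degree - 1)])
xx = np.linalg.lstsq(A, y, rcond=None)[0]

residual = np.sum((y - (polyval(x1, xx[:degree]) + polyval(x2, xx[degree:]))) ** 2)
print(A.shape, residual)
```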

Scipy Curve_Fit return value explained

Below is an example of using curve_fit from SciPy based on a linear equation. My understanding of curve fitting in general is that it takes a plot of random points and creates a curve to show the "best fit" to a series of data points. My question is about what curve_fit returns:
"Optimal values for the parameters so that the sum of the squared error of f(xdata, *popt) - ydata is minimized".
What exactly do these two values mean in simple English? Thanks!
import numpy as np
from scipy.optimize import curve_fit
# Creating a function to model and create data
def func(x, a, b):
    return a * x + b
# Generating clean data
x = np.linspace(0, 10, 100)
y = func(x, 1, 2)
# Adding noise to the data
yn = y + 0.9 * np.random.normal(size=len(x))
# Executing curve_fit on noisy data
popt, pcov = curve_fit(func, x, yn)
# popt returns the best fit values for parameters of
# the given model (func).
print(popt)
You're asking SciPy to tell you the "best" line through a set of pairs of points (x, y).
Here's the equation of a straight line:
y = a*x + b
The slope of the line is a; the y-intercept is b.
You have two parameters, a and b, so you only need two equations to solve for two unknowns. Two points define a line, right?
So what happens when you have more than two points? You can't go through all the points. How do you choose the slope and intercept to give you the "best" line?
One way to define "best" is to calculate the slope and intercept that minimize the sum of the squares of the differences between each y value and the predicted y at that x on the line:
error = sum[(y(i) - (a*x(i) + b))^2]
It's an easy exercise if you know calculus: take the first derivatives of error with respect to a and b and set them equal to zero. You'll have two equations in two unknowns, a and b. Solve them to get the coefficients of the "best" line.
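Carrying out that exercise gives the familiar closed-form normal-equation solution. Here is a sketch, regenerating noisy data like in the question so the snippet is self-contained, and comparing against curve_fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

x = np.linspace(0, 10, 100)
yn = func(x, 1, 2) + 0.9 * np.random.normal(size=len(x))

# Setting d(error)/da = 0 and d(error)/db = 0 and solving the two equations:
n = len(x)
a = (n * np.sum(x * yn) - np.sum(x) * np.sum(yn)) / (n * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(yn) - a * np.sum(x)) / n

popt, _ = curve_fit(func, x, yn)
print(a, b)   # closed-form solution
print(popt)   # curve_fit finds the same values
```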
