I wrote some code to find the best fitting line for a couple of data points using the analytical solution to least squares. Now I would like to print the error between the actual data and my estimated line, but I have no idea how to compute it. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
A = np.array(((0,1),
(1,1),
(2,1),
(3,1)))
b = np.array((1,2,0,3), ndmin = 2 ).T
xstar = np.matmul( np.matmul( np.linalg.inv( np.matmul(A.T, A) ), A.T), b)
print(xstar)
plt.scatter(A.T[0], b)
u = np.linspace(0,3,20)
plt.plot(u, u * xstar[0] + xstar[1], 'b-')
You have already computed the fitted line. Evaluate it at the x values of your data (not at the plotting grid u, which has a different length than b) to get the predictions, and from those you can calculate the "sum of squared errors" (SSE) or the "mean squared error" (MSE) as follows:
y_prediction = np.matmul(A, xstar)   # predictions at the data points, shape (4, 1) like b
SSE = np.sum(np.square(y_prediction - b))
MSE = np.mean(np.square(y_prediction - b))
print(SSE)
print(MSE)
As an aside, you might want to use np.linalg.pinv instead of np.linalg.inv; the pseudo-inverse is more numerically stable for this kind of least-squares computation.
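For example, the whole normal-equations step can be replaced by the pseudo-inverse (a minimal sketch reusing the A and b defined in your code):
xstar = np.matmul(np.linalg.pinv(A), b)   # same least-squares solution, without forming A.T @ A explicitly
For a full-column-rank A this gives exactly the least-squares solution.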
Note that NumPy has a function for this, called lstsq (i.e. least squares), which returns the residuals as well as the solution, so you don't have to implement it yourself:
xstar, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
SSE = residuals[0]      # lstsq returns the sum of squared residuals
MSE = SSE / len(b)
try it!
I have data that I want to fit with polynomials. I have 200,000 data points, so I want an efficient algorithm. I want to use the numpy.polynomial package so that I can try different families and degrees of polynomials. Is there some way I can formulate this as a system of equations like Ax=b? Is there a better way to solve this than with scipy.minimize?
import numpy as np
from scipy.optimize import minimize as mini
x1 = np.random.random(2000)
x2 = np.random.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin(30 * x1 - x2 / 10)
def fitness(x, degree=5):
    # the first `degree` entries of x are the coefficients of the polynomial in x1,
    # the remaining entries are the coefficients of the polynomial in x2
    poly1 = np.polynomial.polynomial.polyval(x1, x[:degree])
    poly2 = np.polynomial.polynomial.polyval(x2, x[degree:])
    return np.sum((y - (poly1 + poly2)) ** 2)
# It seems like I should be able to solve this as a system of equations
# x = np.linalg.solve(np.concatenate([x1, x2]), y)
# minimize the sum of the squared residuals to find the optimal polynomial coefficients
x = mini(fitness, np.ones(10))
print(fitness(x.x))
Your intuition is right. You can solve this as a system of equations of the form Ax = b.
However:
The system is overdetermined and you want the least-squares solution, so you need to use np.linalg.lstsq instead of np.linalg.solve.
You can't use polyval because you need to separate the coefficients and powers of the independent variable.
This is how to construct the system of equations and solve it:
A = np.stack([x1**0, x1**1, x1**2, x1**3, x1**4, x2**0, x2**1, x2**2, x2**3, x2**4]).T
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
Of course you can generalize over the degree:
A = np.stack([x1**p for p in range(degree)] + [x2**p for p in range(degree)]).T
With the example data, the least squares solution runs much faster than the minimize solution (800µs vs 35ms on my laptop). However, A can become quite large, so if memory is an issue minimize might still be an option.
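A rough way to reproduce that comparison yourself (a sketch reusing x1, x2, y, fitness and mini from the question; the exact timings will of course differ per machine):
import time

A = np.stack([x1**p for p in range(5)] + [x2**p for p in range(5)]).T

t0 = time.perf_counter()
xx = np.linalg.lstsq(A, y, rcond=None)[0]   # least-squares solution
t1 = time.perf_counter()
res = mini(fitness, np.ones(10))            # general-purpose minimizer
t2 = time.perf_counter()

print("lstsq:   ", t1 - t0, "seconds")
print("minimize:", t2 - t1, "seconds")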
Update:
Without any knowledge about the internals of the polynomial function things become tricky, but it is possible to separate terms and coefficients. Here is a somewhat ugly way to construct the system matrix A from a function like polyval:
def construct_A(valfunc, degree):
    columns1 = []
    columns2 = []
    for p in range(degree):
        # coefficient vector that selects only the p-th basis function
        c = np.zeros(degree)
        c[p] = 1
        columns1.append(valfunc(x1, c))
        columns2.append(valfunc(x2, c))
    return np.stack(columns1 + columns2).T
A = construct_A(np.polynomial.polynomial.polyval, 5)
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
I used to run the DROITEREG function (the French name of LINEST) in a Calc spreadsheet. Here is an example:
At the top left are the data, and at the bottom the results of DROITEREG, which form a 2 by 5 table. I added labels next to several cells: a and b are the parameters of the linear regression, and u(a) and u(b) the uncertainties on a and b. I would like to compute these uncertainties with numpy functions.
I succeeded with the curve_fit function:
import numpy as np
from scipy.stats import linregress
from scipy.optimize import curve_fit
data_o = """
0.42 2.0
0.97 5.0
1.71 10.0
2.49 20.0
3.53 50.0
3.72 100.0
"""
vo, So = np.loadtxt(data_o.split("\n"), unpack=True)
def f_model(x, a, b):
    return a * x + b
popt, pcov = curve_fit(
f=f_model, # model function
xdata=1 / So, # x data
ydata=1 / vo, # y data
p0=(1, 1), # initial value of the parameters
)
# parameters
print(popt)
# uncertainties:
print(np.sqrt(np.diag(pcov)))
The output is the following; the results are consistent with those obtained with DROITEREG:
[ 4.35522612 0.18629772]
[ 0.07564571 0.01699926]
But this is not fully satisfactory, because this should be obtainable from a simple least-squares function. So I tried polyfit.
(a, b), Mcov = np.polyfit(1 / So, 1 / vo, 1, cov=True)
print("a = ", a, " b = ", b)
print("SSR = ", sum([(y - (a * x + b))**2 for x, y in zip(1 / So, 1 / vo)]))
print("Cov mat\n", Mcov)
print("Cov mat diag ", np.diag(Mcov))
print("sqrt 1/2 cov mat diag ", np.sqrt(0.5 * np.diag(Mcov)))
The output is:
a = 4.35522612104 b = 0.186297716685
SSR = 0.00398117627681
Cov mat
[[ 0.01144455 -0.00167853]
[-0.00167853 0.00057795]]
Cov mat diag [ 0.01144455 0.00057795]
sqrt 1/2 cov mat diag [ 0.07564571 0.01699926]
In the end, I noticed that the Mcov matrix from polyfit is 2 times the pcov matrix from curve_fit. When I fit with a higher polynomial degree, I saw that the factor is equal to the number of parameters.
I did not succeed with linregress from scipy.stats, because I do not know how to obtain the covariance matrix of the parameter estimates from it. I also managed with scipy.odr, but it is again more complicated than the above solutions, and that for a trivial linear regression. Maybe I missed something, because I am not well versed in statistics and I do not really understand the meaning of this covariance matrix.
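For reference, here is what I do get from linregress (imported above); this is only a sketch: older SciPy versions report just the slope's standard error, recent ones also expose intercept_stderr on the result, but I still do not see how to get the full covariance matrix from it:
res = linregress(1 / So, 1 / vo)
print("a = ", res.slope, " b = ", res.intercept)
print("r = ", res.rvalue)      # correlation coefficient
print("u(a) = ", res.stderr)   # standard error of the slope
# recent SciPy versions also expose the intercept's standard error:
# print("u(b) = ", res.intercept_stderr)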
Thus, what I would like to know is the easiest way to obtain the parameters of the linear regression and the related uncertainties (the correlation coefficient would also be nice, but it is easier to compute). My main objective is, for example, to give a chemistry or physics student an easy way to do this linear regression and compute the parameters associated with the model.
I have code here that draws from two gaussian distributions with an equal number of points.
Ultimately, I want to simulate noise, but first I'm trying to see why, when I have two Gaussians with means that are really far from each other, my curve_fit doesn't return their average mean value, as I expected it to.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import gauss
N_tot = 1000
# Draw from the major gaussian. Note the number N. It is
# the main parameter in obtaining your estimators.
mean = 0; sigma = 1; var = sigma**2; N = 100
A = 1/np.sqrt((2*np.pi*var))
points = gauss.draw_1dGauss(mean,var,N)
# Now draw from a minor gaussian. Note Np
meanp = 10; sigmap = 1; varp = sigmap**2; Np = N_tot-N
pointsp = gauss.draw_1dGauss(meanp,varp,Np)
Ap = 1/np.sqrt((2*np.pi*varp))
# Now implement the sum of the draws by concatenating the two arrays.
points_tot = np.array(points.tolist()+pointsp.tolist())
bins_tot = len(points_tot) // 5   # number of bins must be an integer
hist_tot, bin_edges_tot = np.histogram(points_tot,bins_tot,density=True)
bin_centres_tot = (bin_edges_tot[:-1] + bin_edges_tot[1:])/2.0
# Initial guess
p0 = [A, mean, sigma]
# Result of the fit
coeff, var_matrix = curve_fit(gauss.gaussFun, bin_centres_tot, hist_tot, p0=p0)
# Get the fitted curve
hist_fit = gauss.gaussFun(bin_centres, *coeff)
plt.figure(5); plt.title('Gaussian Estimate')
plt.suptitle('Gaussian Parameters: Mu = '+ str(coeff[1]) +' , Sigma = ' + str(coeff[2]) + ', Amplitude = ' + str(coeff[0]))
plt.plot(bin_centres,hist_fit)
plt.draw()
# Error on the estimates
error_parameters = np.sqrt(np.array([var_matrix[0][0],var_matrix[1][1],var_matrix[2][2]]))
The returned parameters are still centered around 0 and I'm not sure why; they should be centered around 10.
Edit: Changed the integer division portions, but it still does not return a good fit value.
I should get a mean of about 10, since most of my points are drawn from that distribution (i.e. the minor distribution).
You find that the least-squares optimization converges to the larger of the two peaks.
The least-squares optimum does not find the "average mean value" of the two component distributions; the algorithm merely minimizes the squared error, and that error is usually smallest when the biggest peak is fit.
When the distribution is this lopsided (90% of the samples come from the larger of the two peaks), the error terms from the main peak destroy the local minima at the smaller peak and the minimum between the peaks.
You can get the fit to converge to a point in the center only when the peaks are nearly equal in size; otherwise you should expect least squares to find the "strongest" peak, provided it doesn't get stuck in a local minimum.
With the following pieces, I can run your code:
bin_centres = bin_centres_tot
def draw_1dGauss(mean, var, N):
    from scipy.stats import norm
    from numpy import sqrt
    return norm.rvs(loc=mean, scale=sqrt(var), size=N)

def gaussFun(bin_centres, *coeff):
    from numpy import sqrt, exp, pi
    # A is unpacked to match p0 but not used: the histogram is normalized
    # (density=True), so the amplitude is fixed by sigma
    A, mean, sigma = coeff[0], coeff[1], coeff[2]
    return exp(-(bin_centres - mean)**2 / 2. / sigma**2) / sigma / sqrt(2*pi)

plt.hist(points_tot, density=True, bins=40)
Below is an example of using curve_fit from SciPy with a linear equation. My understanding of curve fitting in general is that it takes a set of scattered points and produces a curve showing the "best fit" to those data points. My question is about what scipy's curve_fit returns:
"Optimal values for the parameters so that the sum of the squared error of f(xdata, *popt) - ydata is minimized".
What exactly do these two values mean in simple English? Thanks!
import numpy as np
from scipy.optimize import curve_fit
# Creating a function to model and create data
def func(x, a, b):
    return a * x + b
# Generating clean data
x = np.linspace(0, 10, 100)
y = func(x, 1, 2)
# Adding noise to the data
yn = y + 0.9 * np.random.normal(size=len(x))
# Executing curve_fit on noisy data
popt, pcov = curve_fit(func, x, yn)
# popt returns the best fit values for parameters of
# the given model (func).
print(popt)
You're asking SciPy to tell you the "best" line through a set of pairs of points (x, y).
Here's the equation of a straight line:
y = a*x + b
The slope of the line is a; the y-intercept is b.
You have two parameters, a and b, so you only need two equations to solve for two unknowns. Two points define a line, right?
So what happens when you have more than two points? You can't go through all the points. How do you choose the slope and intercept to give you the "best" line?
One way to define "best" is to calculate the slope and intercept that minimize the square of the difference between each y value and the predicted y at that x on the line:
error = sum[(y(i) - (a*x(i) + b))^2]
It's an easy exercise if you know calculus: take the first derivatives of error w.r.t. a and b and set them equal to zero. You'll have two equations with two unknowns, a and b. You solve them to get the coefficients for the "best" line.
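As a quick illustration (a sketch, reusing numpy and the noisy data x, yn from the question, not part of SciPy itself), the result of that calculus exercise can be written out directly:
# slope and intercept that set the derivatives of the squared error to zero
a = np.sum((x - x.mean()) * (yn - yn.mean())) / np.sum((x - x.mean()) ** 2)
b = yn.mean() - a * x.mean()
print(a, b)   # should land close to the popt values that curve_fit printed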
I am fitting curves using curve_fit. Is there a way to read out the coefficient of determination and the absolute sum of squares?
Thanks, Woodpicker
According to the docs, optimization with curve_fit gives you
Optimal values for the parameters so that the sum of the squared error
of f(xdata, *popt) - ydata is minimized
Then, use optimize.leastsq
from scipy import optimize
p, cov, infodict, mesg, ier = optimize.leastsq(
    residuals, a_guess, args=(x, y), full_output=True)
with this for residuals:
def residuals(a, x, y):
    return y - f(x, a)
residuals is the function that returns the difference between the true output data y and the model output, with f the model, a the parameter(s), and x the input data.
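For instance, for a straight-line model like the one in the questions above, f and a_guess might look like this (a hypothetical sketch; the names are placeholders, not part of SciPy):
def f(x, a):
    # straight line: slope a[0], intercept a[1]
    return a[0] * x + a[1]

a_guess = [1.0, 1.0]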
optimize.leastsq returns a lot of information you can use to compute R-squared and RMSE yourself. For R-squared, you can do:
ssErr = (infodict['fvec']**2).sum()
ssTot = ((y-y.mean())**2).sum()
rsquared = 1 - (ssErr / ssTot)
More details on what infodict['fvec'] is:
In [48]: optimize.leastsq?
...
infodict -- a dictionary of optional outputs with the keys:
'fvec' : the function evaluated at the output
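For completeness, the same quantities can also be computed directly from what curve_fit returns, without calling leastsq yourself (a sketch assuming f, xdata and ydata are the model function and data you passed to curve_fit):
import numpy as np
from scipy.optimize import curve_fit

popt, pcov = curve_fit(f, xdata, ydata)
resid = ydata - f(xdata, *popt)              # residuals at the optimum
ssErr = np.sum(resid ** 2)                   # absolute sum of squares
ssTot = np.sum((ydata - ydata.mean()) ** 2)
rsquared = 1 - ssErr / ssTot                 # coefficient of determination
rmse = np.sqrt(np.mean(resid ** 2))          # root mean squared error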