Calculating slope and intercept error of linear regression - python

I have a very simple case of 3 data points, and I would like to do a linear fit y = a0 + a1*x through those points using np.polyfit or scipy.stats.linregress.
For the subsequent error propagation I need the errors in the slope and the intercept. I am by no means an expert in statistics, but on the scipy side I am only aware of stderr, which does not split into slope and intercept errors.
np.polyfit has the option to estimate the covariance matrix, but this does not work with only 3 data points.
When using qtiplot for example it yields errors for slope and intercept.
B (y-intercept) = 9.291335740072202e-12 +/- 2.391260092282606e-13
A (slope) = 2.527075812274368e-12 +/- 6.878180102259077e-13
What would be the appropriate way to calculate these in Python?
EDIT:
np.polyfit(x, y, 1, cov=True)
results in
ValueError: the number of data points must exceed order + 2 for
Bayesian estimate the covariance matrix

scipy.stats.linregress gives you slope, intercept, correlation coefficient, p value and standard error. The fitted line does not have errors associated with its slope or intercept; the errors are to do with the distances of the points from the line. Have a read through this to clear up the point.
An example...
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
points = np.array([[1, 3], [2, 4], [2, 7]])
slope, intercept, r_value, p_value, std_err = stats.linregress(points)
print("slope = ", slope)
print("intercept = ", intercept)
print("R = ", r_value)
print("p = ", p_value)
print("Standard error = ", std_err)
for xy in points:
    plt.plot(xy[0], xy[1], 'ob')
x = np.linspace(0, 10, 100)
y = slope * x + intercept
plt.plot(x, y, '-r')
plt.grid()
plt.show()
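In recent SciPy versions (1.6 and later) the object returned by linregress also exposes the standard error of the intercept, so both uncertainties are available directly; a short sketch reusing the points array from above:
result = stats.linregress(points)
print("slope     = {} +/- {}".format(result.slope, result.stderr))
print("intercept = {} +/- {}".format(result.intercept, result.intercept_stderr))
Alternatively, the textbook ordinary-least-squares formulas give the same standard errors by hand, which also works for the 3-point case in the question where polyfit refuses to return a covariance matrix (n - 2 = 1 degree of freedom remains). A minimal sketch with made-up x and y values standing in for the real data:
import numpy as np

x = np.array([0.0, 1.0, 2.0])                 # placeholder data, 3 points
y = np.array([9.3e-12, 11.8e-12, 14.4e-12])

n = len(x)
a1, a0 = np.polyfit(x, y, 1)                  # slope, intercept

residuals = y - (a0 + a1 * x)
s2 = np.sum(residuals**2) / (n - 2)           # residual variance, n - 2 degrees of freedom
sxx = np.sum((x - x.mean())**2)

a1_err = np.sqrt(s2 / sxx)                               # standard error of the slope
a0_err = np.sqrt(s2 * (1.0 / n + x.mean()**2 / sxx))     # standard error of the intercept
print("slope     = {} +/- {}".format(a1, a1_err))
print("intercept = {} +/- {}".format(a0, a0_err))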

Related

How to calculate slope with plus minus uncertainty in the value

The following code gives a single value for the slope, but I want to report it with an uncertainty, like (1.95 +/- 0.03). How can I do that?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x= np.arange(10)
y = np.array([2,4,25,8,10,30,14,16,28,20])
z = stats.linregress(x,y)
print (z)
slope = z[0]
intercept = z[1]
line = slope*x + intercept
plt.plot(x,y,'o', label='original data')
plt.plot(x,line,color='green', label='fitted line')
plt.xlabel("independent _variable")
plt.ylabel("dependentt_variable")
plt.savefig("./linear regression")
You can calculate the confidence interval for the slope using the formula described in detail here. From your code (assuming you are using the scipy.stats package), you can find the (1 - α) confidence interval as follows; note that the first argument of stats.t.interval is the confidence level itself, i.e. 1 - α:
alpha = 0.05
CI = [z.slope + z.stderr * t for t in stats.t.interval(1 - alpha, len(x) - 2)]
print(CI)
# approximately [-0.17, 4.09]
To print the confidence interval in the form stated in your question:
halfwidth = z.stderr * stats.t.interval(1 - alpha, len(x) - 2)[1]
print('({} +/- {})'.format(z.slope, halfwidth))
# approximately (1.96 +/- 2.13)
Alternatively, you could use the StatsModels package which has a built-in method to find the confidence interval. This is explained in the question found here.
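If it is useful, a sketch of the StatsModels route (same x and y as in the question):
import numpy as np
import statsmodels.api as sm

x = np.arange(10)
y = np.array([2, 4, 25, 8, 10, 30, 14, 16, 28, 20])

X = sm.add_constant(x)             # add the intercept column
res = sm.OLS(y, X).fit()

print(res.params)                  # [intercept, slope]
print(res.bse)                     # standard errors of the coefficients
print(res.conf_int(alpha=0.05))    # 95% confidence intervals, one row per coefficient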

How to represent 1D vector as sum of Gaussian curves with scipy/numpy? [duplicate]

This question already has answers here:
fit multiple gaussians to the data in python
(3 answers)
Closed 6 years ago.
UPDATE: Thanks, it works.
I have a 1D vector which represents a histogram. It looks like a sum of a few Gaussian functions:
I've found curve_fit sample code on SO, but don't know how to modify it to handle more Gaussian tuples (mu, sigma). I've heard curve_fit optimizes only one function (in this case one Gaussian curve).
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt

def estimate_sigma(hist):
    bin_edges = np.arange(len(hist))
    bin_centres = bin_edges + 0.5

    # Define model function to be used to fit to the data above:
    def gauss(x, *p):
        A, mu, sigma = p
        return A * np.exp(-(x - mu)**2 / (2. * sigma**2))

    # p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
    p0 = [1., 0., 1.]
    coeff, var_matrix = curve_fit(gauss, bin_centres, hist, p0=p0)

    # Get the fitted curve
    hist_fit = gauss(bin_centres, *coeff)
    plt.plot(bin_centres, hist, label='Test data')
    plt.plot(bin_centres, hist_fit, label='Fitted data')
    print('Fitted mean = ', coeff[1])
    print('Fitted standard deviation = ', coeff[2])
    plt.show()
This function fits only one Gaussian curve, while visually there are 3 or 4 of them:
Could you please advise some numpy/scipy functions to achieve a GMM representation of a 1D vector in the form ([m1, sigma1], [m2, sigma2], ..., [mN, sigmaN])?
As tBuLi recommended, I passed the coefficients of the additional Gaussian curves to gauss as well as to curve_fit.
The fitted curve now looks like this:
Updated code:
def estimate_sigma(hist):
    bin_edges = np.arange(len(hist))
    bin_centres = bin_edges + 0.5

    # Model function: a sum of len(gparams) // 3 Gaussians
    def gauss(x, *gparams):
        g_count = len(gparams) // 3
        def gauss_impl(x, A, mu, sigma):
            return A * np.exp(-(x - mu)**2 / (2. * sigma**2))
        res = np.zeros(len(x))
        for gi in range(g_count):
            res += gauss_impl(x, gparams[gi*3], gparams[gi*3 + 1], gparams[gi*3 + 2])
        return res

    # p0 is the initial guess: (A, mu, sigma) repeated once per curve
    curves_count = 4
    p0 = np.tile([1., 0., 1.], curves_count)
    coeff, var_matrix = curve_fit(gauss, bin_centres, hist, p0=p0)

    # Get the fitted curve
    hist_fit = gauss(bin_centres, *coeff)
    plt.plot(bin_centres, hist, label='Test data')
    plt.plot(bin_centres, hist_fit, label='Fitted data')

    # Finally, the fitting parameters, i.e. the means and standard deviations:
    print(coeff)
    plt.show()
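If the raw samples behind the histogram are available (rather than only the binned counts), a mixture model such as sklearn.mixture.GaussianMixture gives the ([m1, sigma1], [m2, sigma2], ...) representation directly. A sketch with made-up samples:
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical raw observations that the histogram would have been built from
samples = np.concatenate([
    np.random.normal(10, 2.0, 500),
    np.random.normal(25, 3.0, 300),
    np.random.normal(40, 1.5, 200),
])

gmm = GaussianMixture(n_components=3).fit(samples.reshape(-1, 1))
means = gmm.means_.ravel()
sigmas = np.sqrt(gmm.covariances_.ravel())
print(list(zip(means, sigmas)))    # [(m1, sigma1), (m2, sigma2), (m3, sigma3)]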

Python: Linear Regression, reshaping numpy arrays for use in model

Sorry for the noob question...here's my code:
from __future__ import division
import sklearn
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
X =np.array([6,8,10,14,18])
Y = np.array([7,9,13,17.5,18])
X = np.reshape(X,(1,5))
Y = np.reshape(Y,(1,5))
print X
print Y
plt.figure()
plt.title('Pizza Price as a function of Pizza Diameter')
plt.xlabel('Pizza Diameter (Inches)')
plt.ylabel('Pizza Price (Dollars)')
axis = plt.axis([0, 25, 0 ,25])
m, b = np.polyfit(X,Y,1)
plt.grid(True)
plt.plot(X,Y, 'k.')
plt.plot(X, m*X + b, '-')
#plt.show()
#training data
#x= [[6],[8],[10],[14],[18]]
#y= [[7],[9],[13],[17.5],[18]]
# create and fit linear regression model
model = LinearRegression()
model.fit(X,Y)
print 'A 12" pizza should cost $% .2f' % model.predict(19)
#work out cost function, which is residual sum of squares
print 'Residual sum of squares: %.2f' % np.mean((model.predict(x)- y) ** 2)
#work out variance (AKA Mean squared error)
xMean = np.mean(x)
print 'Variance is: %.2f' %np.var([x], ddof=1)
#work out covariance (this is whether the x axis data and y axis data correlate with eachother)
#When a and b are 1-dimensional sequences, numpy.cov(x,y)[0][1] calculates covariance
print 'Covariance is: %.2f' %np.cov(X, Y, ddof = 1)[0][1]
#test the model on new test data, printing the r squared coefficient
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
print 'R squared for model on test data is: %.2f' %model.score(X_test,y_test)
Basically, some of these functions work for the variables I have called X and Y and some don't.
For example, as the code is, it throws up this error:
TypeError: expected 1D vector for x
for the line
m, b = np.polyfit(X,Y,1)
However, when I comment out the two lines reshaping the variables like this:
#X = np.reshape(X,(1,5))
#Y = np.reshape(Y,(1,5))
I get the error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 5]
on the line
model.fit(X,Y)
So, how do I get the array to work for all the functions in my script, without having different arrays of the same data with slightly different structures?
Thanks for your help!
Change these lines to
X = np.reshape(X, (5,))
Y = np.reshape(Y, (5,))
or just remove them both: X and Y are already 1-D arrays, which is what np.polyfit expects. Note that scikit-learn's LinearRegression.fit wants its X argument shaped as (n_samples, n_features), so pass X.reshape(-1, 1) to model.fit (see the sketch below).
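A minimal sketch of a version that keeps one copy of the data and satisfies both libraries (assuming a recent scikit-learn, which insists on a 2-D X for fit and predict):
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([6, 8, 10, 14, 18], dtype=float)   # 1-D, as np.polyfit expects
Y = np.array([7, 9, 13, 17.5, 18])

m, b = np.polyfit(X, Y, 1)                       # slope, intercept

model = LinearRegression()
model.fit(X.reshape(-1, 1), Y)                   # scikit-learn wants shape (n_samples, n_features)
print(model.predict(np.array([[12.0]])))         # predict also takes a 2-D array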

Python - Translating best fit line in log plot

I'm trying to fit a best-fit linear regression line to some huge arrays in a log-log plot.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
x = subhalos['SubhaloVmax']
y = subhalos['SubhaloMass'] * 1e10 / 0.704 # in units of M_sol h^-1
slope, intercept, r_value, p_value, slope_std_error = stats.linregress(np.log(x), np.log(y))
predict_y = intercept + slope * x
pred_error = y - predict_y
degrees_of_freedom = len(x) - 2
residual_std_error = np.sqrt(np.sum(pred_error**2) / degrees_of_freedom)
idx = np.argsort(x)
plt.plot(x,y,'k.')
plt.plot(x[idx], predict_y[idx], 'b--')
plt.xscale('log')
plt.yscale('log')
plt.xlabel('$V_{max}$ [km s$^{-1}$]')
plt.ylabel('$M_{sub} $ [$M_\odot h^{-1}$]')
plt.title(' $V_{max} - M_{sub}$ relation ')
giving me this graph
I would've thought that my code would set the y-intercept automatically, but that does not seem to be the case.
How do I translate the line to the correct intercept?
You're computing the regression on log(x) vs log(y), so your prediction should actually be computed as
predict_logy = intercept + slope * logx
Whether you would then compute the residuals as log(y) - predict_logy or y - exp(predict_logy) or something else depends on your application.
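A minimal sketch of how that change plugs into the code from the question (same variable names, np and plt assumed imported as above):
logx, logy = np.log(x), np.log(y)
slope, intercept, r_value, p_value, slope_std_error = stats.linregress(logx, logy)

predict_logy = intercept + slope * logx
predict_y = np.exp(predict_logy)      # back to linear units for the log-log axes

idx = np.argsort(x)
plt.plot(x, y, 'k.')
plt.plot(x[idx], predict_y[idx], 'b--')
plt.xscale('log')
plt.yscale('log')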

How to find error on slope and intercept using numpy.polyfit

I'm fitting a straight line to some data with numpy.polyfit. The data themselves do not come with any error bars. Here's a simplified version of my code:
from numpy import polyfit, loadtxt
data = loadtxt("data.txt")
x,y = data[:,0],data[:,1]
fit = polyfit(x,y,1)
Of course that gives me the values for the slope and intercept, but how do I find the uncertainty on the best-fit values?
I'm a bit late to answer this, but the question remains unanswered and was the top hit on Google for me, so I think the following is the correct method:
import numpy as np

x = np.linspace(0, 1, 100)
y = 10 * x + 2 + np.random.normal(0, 1, 100)
p, V = np.polyfit(x, y, 1, cov=True)
print("x_1: {} +/- {}".format(p[0], np.sqrt(V[0][0])))
print("x_2: {} +/- {}".format(p[1], np.sqrt(V[1][1])))
which outputs
x_1: 10.2069326441 +/- 0.368862837662
x_2: 1.82929420943 +/- 0.213500166807
So you need to request the covariance matrix V, for which the square roots of the diagonal entries are the estimated standard deviations of the fitted coefficients. This of course generalises to higher-order fits.
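If it helps, the same uncertainties can be pulled out of the covariance matrix more compactly with np.diag:
perr = np.sqrt(np.diag(V))    # standard errors of the coefficients, highest order first
print("x_1: {} +/- {}".format(p[0], perr[0]))
print("x_2: {} +/- {}".format(p[1], perr[1]))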
