Related
Following the recommendations in this answer I have used several combination of values for beta0, and as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of relative scales of values of X versus Y (X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
for num in range(1, 5):
plt.subplot(2, 2, num)
plt.title('X range is %.1f times Y' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
It seems that no combination of beta0 helps... The only way to get polyfit and ODR fit similar is to swap X and Y, OR as shown here to increase the range of values of X with regard to Y, still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and it is not a problem of the data.
=== SOLUTION ===
thanks to #norok2 answer when Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
plt.subplot(3, 3, num)
plt.title('Y range is %.1f times X' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y,
sy=min(1/np.var(Y), 1/np.var(X))) # here the trick!! :)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit works under the assumption that the error on x is negligible. If this assumption is violated, like it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change your numerical conditioning.
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
(EDIT: note there was a typo on the line above)
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
I cannot format source code in a comment, and so place it here. This code uses ODR to calculate fit statistics, note the line that has "parameter order for odr" such that I use a wrapper function for the ODR call to my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats
x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])
def f(x,b0,b1):
return b0 + (b1 * x)
def f_wrapper_for_odr(beta, x): # parameter order for odr
return f(x, *beta)
parameters, cov= curve_fit(f, x, y)
model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x,y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
ci = []
for i in range(len(parameters)):
ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])
tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0 # coef. p-values
for i in range(len(parameters)):
print('parameter:', parameters[i])
print(' conf interval:', ci[i][0], ci[i][1])
print(' tstat:', tstat_beta[i])
print(' pstat:', pstat_beta[i])
print()
I'm following Andrew Ng Coursera course on Machine Learning and I tried to implement the Gradient Descent Algorithm in Python. I'm having trouble with the y-intercept parameter because it doesn't look like to go to the best value. Here's my code:
# IMPORTS
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Acquiring Data
# Source: https://github.com/mattnedrich/GradientDescentExample
data = pd.read_csv('data.csv')
def cost_function(a, b, x_values, y_values):
'''
Calculates the square mean error for a given dataset
with (x,y) pairs and the model y' = a + bx
a: y-intercept for the model
b: slope of the curve
x_values, y_values: points (x,y) of the dataset
'''
data_len = len(x_values)
total_error = sum([((a + b * x_values[i]) - y_values[i])**2
for i in range(data_len)])
return total_error / (2 * float(data_len))
def a_gradient(a, b, x_values, y_values):
'''
Partial derivative of the cost_function with respect to 'a'
a, b: values for 'a' and 'b'
x_values, y_values: points (x,y) of the dataset
'''
data_len = len(x_values)
a_gradient = sum([((a + b * x_values[i]) - y_values[i])
for i in range(data_len)])
return a_gradient / float(data_len)
def b_gradient(a, b, x_values, y_values):
'''
Partial derivative of the cost_function with respect to 'b'
a, b: values for 'a' and 'b'
x_values, y_values: points (x,y) of the dataset
'''
data_len = len(x_values)
b_gradient = sum([(((a + b * x_values[i]) - y_values[i]) * x_values[i])
for i in range(data_len)])
return b_gradient / float(data_len)
def gradient_descent_step(a_current, b_current, x_values, y_values, alpha):
'''
Give a step in direction of the minimum of the cost_function using
the 'a' and 'b' gradiants. Return new values for 'a' and 'b'.
a_current, b_current: the current values for 'a' and 'b'
x_values, y_values: points (x,y) of the dataset
'''
new_a = a_current - alpha * a_gradient(a_current, b_current, x_values, y_values)
new_b = b_current - alpha * b_gradient(a_current, b_current, x_values, y_values)
return (new_a, new_b)
def run_gradient_descent(a, b, x_values, y_values, alpha, precision, plot=False, verbose=False):
'''
Runs the gradient_descent_step function and updates (a,b) until
the value of the cost function varies less than 'precision'.
a, b: initial values for the point a and b in the cost_function
x_values, y_values: points (x,y) of the dataset
alpha: learning rate for the algorithm
precision: value for the algorithm to stop calculation
'''
iterations = 0
delta_cost = cost_function(a, b, x_values, y_values)
error_list = [delta_cost]
iteration_list = [0]
# The loop runs until the delta_cost reaches the precision defined
# When the variation in cost_function is small it means that the
# the function is near its minimum and the parameters 'a' and 'b'
# are a good guess for modeling the dataset.
while delta_cost > precision:
iterations += 1
iteration_list.append(iterations)
# Calculates the initial error with current a,b values
prev_cost = cost_function(a, b, x_values, y_values)
# Calculates new values for a and b
a, b = gradient_descent_step(a, b, x_values, y_values, alpha)
# Updates the value of the error
actual_cost = cost_function(a, b, x_values, y_values)
error_list.append(actual_cost)
# Calculates the difference between previous and actual error values.
delta_cost = prev_cost - actual_cost
# Plot the error in each iteration to see how it decreases
# and some information about our final results
if plot:
plt.plot(iteration_list, error_list, '-')
plt.title('Error Minimization')
plt.xlabel('Iteration',fontsize=12)
plt.ylabel('Error',fontsize=12)
plt.show()
if verbose:
print('Iterations = ' + str(iterations))
print('Cost Function Value = '+ str(cost_function(a, b, x_values, y_values)))
print('a = ' + str(a) + ' and b = ' + str(b))
return (actual_cost, a, b)
When I run the algorithm with:
run_gradient_descent(0, 0, data['x'], data['y'], 0.0001, 0.01)
I get (a = 0.0496688656535 and b = 1.47825808018)
But the best value for 'a' is around 7.9 (tried another resources for linear regression).
Also, if I change the initial guess for the parameter 'a' the algorithm simply try to adjust the parameter 'b'.
For example, if I set a = 200 and b = 0
run_gradient_descent(200, 0, data['x'], data['y'], 0.0001, 0.01)
I get (a = 199.933763331 and b = -2.44824996193)
I couldn't find anything wrong with the code and I realized that the problem is the initial guess for a parameter. See my own answer above where I defined a helper function to get a range for search some values for initial a guess.
Gradient descent does not guarantee to find global optimum. Your chances of finding the global optimum depend on your starting value. To get the real values of the parameters, first I solved the least squares problem which guarantees global minimum.
data = pd.read_csv('data.csv',header=-1)
x,y = data[0],data[1]
from scipy.stats import linregress
linregress(x,y)
This results in following statistics:
LinregressResult(slope=1.32243102275536, intercept=7.9910209822703848, rvalue=0.77372849988782377, pvalue=3.855655536990139e-21, stderr=0.109377979589804)
Thus b = 1.32243102275536 and a = 7.9910209822703848. Given this, using your code I solved the problem a couple of times using randomized starting values a and b:
a,b = np.random.rand()*10,np.random.rand()*10
print("Initial values of parameters: ")
print("a=%f\tb=%f" % (a,b))
run_gradient_descent(a, b,x,y,1e-4,1e-2)
Here is the solution that I got:
Initial values of parameters:
a=6.100305 b=2.606448
Iterations = 21
Cost Function Value = 55.2093808263
a = 6.07601889437 and b = 1.36310312751
Therefore, it seems like the reason that you cannot get close to minimum is because of choice of your initial parameter values. You will see it yourself as well, if you put a and b obtained from least squares into your gradient descent algorithm, it will iterate only for one time and stay where it is.
Somehow, at some point delta_cost > precision is True and it stops there considering it a local optimum. If you decrease your precision and if you run it long enough then you might be able to find the global optimum.
The complete code for my Gradient Descent implementation could be found on my Github repository:
Gradient Descent for Linear Regression
Thinking about what #relay said that the Gradient Descent algorithm does not guarantee to find the global minima I tried to come up with an helper function to limit guesses for the parameter a in a certain search range, as follows:
def search_range(x, y, plot=False):
'''
Given a dataset with points (x, y) searches for a best guess for
initial values of 'a'.
'''
data_lenght = len(x) # Total size of of the dataset
q_lenght = int(data_lenght / 4) # Size of a quartile of the dataset
# Finding the max and min value for y in the first quartile
min_Q1 = (x[0], y[0])
max_Q1 = (x[0], y[0])
for i in range(q_lenght):
temp_point = (x[i], y[i])
if temp_point[1] < min_Q1[1]:
min_Q1 = temp_point
if temp_point[1] > max_Q1[1]:
max_Q1 = temp_point
# Finding the max and min value for y in the 4th quartile
min_Q4 = (x[data_lenght - 1], y[data_lenght - 1])
max_Q4 = (x[data_lenght - 1], y[data_lenght - 1])
for i in range(data_lenght - 1, data_lenght - q_lenght, -1):
temp_point = (x[i], y[i])
if temp_point[1] < min_Q4[1]:
min_Q4 = temp_point
if temp_point[1] > max_Q4[1]:
max_Q4 = temp_point
mean_Q4 = (((min_Q4[0] + max_Q4[0]) / 2), ((min_Q4[1] + max_Q4[1]) / 2))
# Finding max_y and min_y given the points found above
# Two lines need to be defined, L1 and L2.
# L1 will pass through min_Q1 and mean_Q4
# L2 will pass through max_Q1 and mean_Q4
# Calculatin slope for L1 and L2 given m = Delta(y) / Delta (x)
slope_L1 = (min_Q1[1] - mean_Q4[1]) / (min_Q1[0] - mean_Q4[0])
slope_L2 = (max_Q1[1] - mean_Q4[1]) / (max_Q1[0] -mean_Q4[0])
# Calculating y-intercepts for L1 and L2 given line equation in the form y = mx + b
# Float numbers are converted to int because they will be used as range for itaration
y_L1 = int(min_Q1[1] - min_Q1[0] * slope_L1)
y_L2 = int(max_Q1[1] - max_Q1[0] * slope_L2)
# Ploting L1 and L2
if plot:
L1 = [(y_L1 + slope_L1 * x) for x in data['x']]
L2 = [(y_L2 + slope_L2 * x) for x in data['x']]
plt.plot(data['x'], data['y'], '.')
plt.plot(data['x'], L1, '-', color='r')
plt.plot(data['x'], L2, '-', color='r')
plt.title('Scatterplot of Sample Data')
plt.xlabel('x',fontsize=12)
plt.ylabel('y',fontsize=12)
plt.show()
return y_L1, y_L2
The idea is to run the gradient descent with guesses for a within the range given by search_range() function and get the minimum possible value for the cost_function(). The new way to run the gradient descente becomes:
def run_search_gradient_descent(x_values, y_values, alpha, precision, verbose=False):
'''
Runs the gradient_descent_step function and updates (a,b) until
the value of the cost function varies less than 'precision'.
x_values, y_values: points (x,y) of the dataset
alpha: learning rate for the algorithm
precision: value for the algorithm to stop calculation
'''
from math import inf
a1, a2 = search_range(x_values, y_values)
best_guess = [inf, 0, 0]
for a in range(a1, a2):
cost, linear_coef, slope = run_gradient_descent(a, 0, x_values, y_values, alpha, precision)
# Saving value for cost_function and parameters (a,b)
if cost < best_guess[0]:
best_guess = [cost, linear_coef, slope]
if verbose:
print('Cost Function = ' + str(best_guess[0]))
print('a = ' + str(best_guess[1]) + ' and b = ' + str(best_guess[2]))
return (best_guess[0], best_guess[1], best_guess[2])
Running the code
run_search_gradient_descent(data['x'], data['y'], 0.0001, 0.001, verbose=True)
I've got:
Cost Function = 55.1294483959
a = 8.02595996606 and b = 1.3209768383
For comparison, using the linear regression from scipy.stats it returned
a = 7.99102098227and b = 1.32243102276
I'm working through my Matlab code for the Andrew NG Coursera course and turning it into python. I am working on non-regularized logistic regression and after writing my gradient and cost functions I needed something similar to fminunc and after some googling, I found a couple options. They are both returning the same results, but they do not match what is in Andrew NG's expected results code. Others seem to be getting this to work correctly, but I'm wondering why my specific code does not seem to return the desired result when using scipy.optimize functions, but does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below;
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 3
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
'''
SIGMOID Compute sigmoid function
g = SIGMOID(z) computes the sigmoid of z.
Instructions: Compute the sigmoid of each value of z (z can be a matrix,
vector or scalar).
'''
g = 1 / (1 + np.exp(-z))
return g
def costFunction(theta, X, y):
'''
COSTFUNCTION Compute cost and gradient for logistic regression
J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
parameter for logistic regression and the gradient of the cost
w.r.t. to the parameters.
'''
m = len(y) #number of training examples
h = sigmoid(X.dot(theta)) #logisitic regression hypothesis
J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
#h is 100x1, y is %100x1, these end up as 2 vector we subtract from each other
#then we sum the values by rows
#cost function for logisitic regression
return J
def gradient(theta, X, y):
m = len(y)
grad = np.zeros((theta.shape))
h = sigmoid(X.dot(theta))
for i in range(len(theta)): #number of rows in theta
XT = X[:,i]
XT.shape = (len(X),1)
grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You neeed to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
I have had similar issues with Scipy dealing with the same problem as you. As senderle points out the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns the grad.ravel() which has shape (3,) again. This is important as doing otherwise caused an error message with various optimization methods in Scipy.optimize.
Note that different methods have different behaviours but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
#Initializing variables
m = len(y)
J = 0
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
J = (1/m) * ( (-y.T # np.log(h)) - (1 - y).T # np.log(1-h));
return J
def Gradient(theta,X,y):
#Initializing variables
m = len(y)
theta = theta[:,np.newaxis]
grad = np.zeros(theta.shape)
#Vectorized computations
z = X # theta
h = sigmoid(z)
grad = (1/m)*(X.T # ( h - y));
return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome, this Scipy interface is a mystery to me, thanks
In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculation of confidence interval directly with Pearson r is complicated due to the fact that it is not normally distributed. The following steps are needed:
Convert r to z',
Calculate the z' confidence interval. The sampling distribution of z' is approximately normally distributed and has standard error of 1/sqrt(n-3).
Convert the confidence interval back to r.
Here are some sample codes:
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
Answer given by bennylp is mostly correct, however, there is a small error in calculating the critical value in the 3rd function.
It should instead be:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
I know bootstrapping has been suggested above, proposing another variation of it below, which may suit some other set ups better.
#1
Sample your data (paired X & Ys and can also add other say weight) , fit original model on it, record r2, append it. Then extract your confidence intervals from your distribution of all R2s recorded.
#2 Additionally can fit on sampled data and using sampled data model predict on non sampled X (could also supply a continuous range to extend your predictions instead of using original X)
to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### creater your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"
I want to use numpy.polyfit for physical calculations, therefore I need the magnitude of the error.
If you specify full=True in your call to polyfit, it will include extra information:
>>> x = np.arange(100)
>>> y = x**2 + 3*x + 5 + np.random.rand(100)
>>> np.polyfit(x, y, 2)
array([ 0.99995888, 3.00221219, 5.56776641])
>>> np.polyfit(x, y, 2, full=True)
(array([ 0.99995888, 3.00221219, 5.56776641]), # coefficients
array([ 7.19260721]), # residuals
3, # rank
array([ 11.87708199, 3.5299267 , 0.52876389]), # singular values
2.2204460492503131e-14) # conditioning threshold
The residual value returned is the sum of the squares of the fit errors, not sure if this is what you are after:
>>> np.sum((np.polyval(np.polyfit(x, y, 2), x) - y)**2)
7.1926072073491056
In version 1.7 there is also a cov keyword that will return the covariance matrix for your coefficients, which you could use to calculate the uncertainty of the fit coefficients themselves.
As you can see in the documentation:
Returns
-------
p : ndarray, shape (M,) or (M, K)
Polynomial coefficients, highest power first.
If `y` was 2-D, the coefficients for `k`-th data set are in ``p[:,k]``.
residuals, rank, singular_values, rcond : present only if `full` = True
Residuals of the least-squares fit, the effective rank of the scaled
Vandermonde coefficient matrix, its singular values, and the specified
value of `rcond`. For more details, see `linalg.lstsq`.
Which means that if you can do a fit and get the residuals as:
import numpy as np
x = np.arange(10)
y = x**2 -3*x + np.random.random(10)
p, res, _, _, _ = numpy.polyfit(x, y, deg, full=True)
Then, the p are your fit parameters, and the res will be the residuals, as described above. The _'s are because you don't need to save the last three parameters, so you can just save them in the variable _ which you won't use. This is a convention and is not required.
#Jaime's answer explains what the residual means. Another thing you can do is look at those squared deviations as a function (the sum of which is res). This is particularly helpful to see a trend that didn't fit sufficiently. res can be large because of statistical noise, or possibly systematic poor fitting, for example:
x = np.arange(100)
y = 1000*np.sqrt(x) + x**2 - 10*x + 500*np.random.random(100) - 250
p = np.polyfit(x,y,2) # insufficient degree to include sqrt
yfit = np.polyval(p,x)
figure()
plot(x,y, label='data')
plot(x,yfit, label='fit')
plot(x,yfit-y, label='var')
So in the figure, note the bad fit near x = 0: