I predicted the principal components of a climate index over the US for multiple years. The principal components and the transformation matrix were obtained with the df_eof() function from https://github.com/zzheng93/pyEOF, and the inverse transform was then performed following https://stackoverflow.com/a/32757849/19713543. As a simple visual check, I plot a map of per-pixel correlation coefficients (computed over the years) between the inverse-transformed components and the original index values.
I have attached a truncated version of the data here https://github.com/Gr1Lo/d73279736.
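For reference, the inverse transform from that Stack Overflow answer boils down to projecting the scores back through the component matrix and adding the mean back; a minimal sketch with made-up shapes (pyEOF also standardizes the input, which appears to be why rev_diff() below calls pca._scaler.inverse_transform):
import numpy as np
from sklearn.decomposition import PCA
# made-up example: rows are years, columns are grid cells
X = np.random.rand(60, 500)
n_comp = 10
pca = PCA(n_components=n_comp)
pcs = pca.fit_transform(X)            # scores: years x components
X_hat = np.dot(pcs, pca.components_)  # back-project onto the grid cells
X_hat += pca.mean_                    # undo the mean-centering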
For standard PCA I get:
rev_diff(pcs_CMIP6_1.to_numpy(), df_data_summer_CMIP6_1.to_numpy(),
eofs_CMIP6_1,
eigvals_CMIP6_1, pca_CMIP6_1, ds_n_summer_CMIP6_1, "Standard pca;",
"corr",
scale_type = 2)
But for PCA with varimax rotation the results are worse ("_v" in a variable name means it was obtained from PCA with varimax rotation):
rev_diff(pcs_CMIP6_1_v.to_numpy(), df_data_summer_CMIP6_1.to_numpy(),
eofs_CMIP6_1_v,
eigvals_CMIP6_1_v,
pca_CMIP6_1_v, ds_n_summer_CMIP6_1, "pca + varimax;",
"corr",
scale_type = 2)
I changed the original pyEOF code a bit so that I can retrieve the rotation matrix and the order of the components before rotation, in order to perform the backward procedure. For PCA with varimax it looks like:
import numpy as np
inv_rotmat = np.linalg.inv(pca_CMIP6_1_v.rotmat_)
unord = pcs_CMIP6_1_v.to_numpy()[:,np.argsort(pca_CMIP6_1_v.order.ravel())]
unord_eofs = eofs_CMIP6_1_v.to_numpy()[np.argsort(pca_CMIP6_1_v.order.ravel())]
unord_eigvals = eigvals_CMIP6_1_v[np.argsort(pca_CMIP6_1_v.order.ravel())]
After that, the result should look like the result from the standard PCA procedure:
rev_diff(np.dot(unord,inv_rotmat), df_data_summer_CMIP6_1.to_numpy(),
np.dot(unord_eofs.T,inv_rotmat).T,
eigvals_CMIP6_1,
pca_CMIP6_1, ds_n_summer_CMIP6_1, "pca + varimax;",
"corr",
scale_type = 2)
My question is: why does the second result differ from the others, and how do I correctly perform the inverse transform for the varimax case? Code for the rev_diff() function:
import numpy as np
from matplotlib import pyplot as plt
import scipy
def rev_diff(y_pred, y_true, eofs, eigvals, pca, ds_n, ttl, p_type='diff', scale_type = 2):
'''
Visual assessment of prediction results:
y_pred - predicted component values,
y_true - test scpdsi values,
eofs - transformation matrix,
eigvals - coefficients applied to eigenvectors that give the vectors their length or magnitude,
pca - returned from eof_an() object,
ds_n - 3d array for retrieving shape,
ttl - name of model,
p_type - verification metrics
'mae' - mean absolute error
'corr' - Spearman's Rank correlation coefficient
'd' - index of agreement
'nse' - Nash-Sutcliffe Efficiency
scale_type - converting loadings to components for reverse operation
'''
eg = np.sqrt(eigvals)
if scale_type == 2:
eofs = eofs / eg[:, np.newaxis]
pcs = y_pred[:, 0:y_pred.shape[1]] / eg
pcs0= y_true[:, 0:y_pred.shape[1]] / eg
if scale_type == 1:
eofs = eofs * eg[:, np.newaxis]
pcs = y_pred[:, 0:y_pred.shape[1]] * eg
pcs0 = y_true[:, 0:y_pred.shape[1]] * eg
if scale_type == 0:
eofs = eofs
pcs = y_pred[:, 0:y_pred.shape[1]]
pcs0 = y_true[:, 0:y_pred.shape[1]]
Yhat = np.dot(pcs, eofs)
Yhat = pca._scaler.inverse_transform(Yhat)
u = Yhat
u0 = y_true
if p_type=='corr':
coor_ar = []
for i in range(u0.shape[1]):
i1 = u[:,i]
i0 = u0[:,i]
if ~np.isnan(i0[0]):
corr2 = scipy.stats.spearmanr(i0,i1)[0]
coor_ar.append(corr2)
else:
coor_ar.append(np.nan)
loss0 = np.array(coor_ar)
ttl_str = " average Spearman's Rank correlation coefficient = "
vmin = -1
vmax = 1
new = np.reshape(loss0, (-1, ds_n.shape[2]))
plt.figure(figsize = (19,10))
im = plt.imshow(new[::-1], interpolation='none',
vmin=vmin, vmax=vmax,cmap='jet')
cbar = plt.colorbar(im,
orientation='vertical')
plt.axis('off')
plt.tight_layout()
loss0 = np.nanmean(loss0)
plt.title(ttl + ttl_str + str(round(loss0,3)),fontsize=20)
plt.show()
return loss0
Related
I don't think I have a good understanding of PCA; can someone help with my confusion below, please?
Take the iris dataset as an example: I have 4 covariates, x1: sepal length; x2: sepal width; x3: petal length; x4: petal width. The formula can be seen below, where a1, a2, a3, a4 are the weights for the covariates, and PCA will try to maximise the variance using different linear transformations, while also following the constraint a1^2 + a2^2 + a3^2 + a4^2 = 1. I'm interested in knowing the values of a1, a2, a3, a4.
a1*x1 + a2*x2 + a3*x3 + a4*x4
I have the code below in Python, which I think is correct?
# load libraries
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import seaborn as sns
import pandas as pd
import numpy as np
iris = load_iris()
X = iris.data
df = pd.DataFrame(X,columns=iris.feature_names)
pca = PCA(n_components = 4)
digits_pca_4 = pca.fit(X)
digits_pca_4.explained_variance_ratio_
And the result is
array([0.92461872, 0.05306648, 0.01710261, 0.00521218])
My question is:
Am I correct to assume that a1=sqrt(0.92), a2=sqrt(0.05), a3=sqrt(0.02), a4=sqrt(0.005)?
Second question:
And if I were to choose the linear combination a1=a2=a3=a4=0.5, what would its variance be compared to the variance from the PCA (I'm assuming it's less than the PCA result, since PCA maximises the variance)? How can I get the variance for a1=a2=a3=a4=0.5 in Python? And is the variance from PCA given by the code below?
pca.explained_variance_.sum()
Many thanks!
To answer your question directly: no, your initial interpretation is not correct.
Explanation
The actual projection done by PCA is a matrix multiplication Y = (X - u) W where u is the mean of X (u = X.mean(axis=0)) and W is the projection matrix found by PCA: an n x p orthonormal matrix, where n is the original data dimension and p the desired output dimension. The expression you give (a1*x1 + a2*x2 + a3*x3 + a4*x4) does not mean anything with all values being scalars. At best, it could mean the calculation of a single component, using one column j of W below as the a_k: Y[i, j] == sum(W[k, j] * (X[i, k] - u[k]) for k in range(n)).
In any case, you can inspect all the variables of the result of pca = PCA.fit(...) with vars(pca). In particular, the projection matrix described above can be found as W = pca.components_.T. The following statements can be verified:
# projection
>>> u = pca.mean_
... W = pca.components_.T
... Y = (X - u).dot(W)
... np.allclose(Y, pca.transform(X))
True
>>> np.allclose(X.mean(axis=0), u)
True
# orthonormality
>>> np.allclose(W.T.dot(W), np.eye(W.shape[1]))
True
# explained variance is the sample variation (not population variance)
# of the projection (i.e. the variance along the proj axes)
>>> np.allclose(Y.var(axis=0, ddof=1), pca.explained_variance_)
True
Graphical demo
The simplest way to understand PCA is that it is purely a rotation in n-D (after mean removal) while retaining only the first p-dimensions. The rotation is such that your data's directions of largest variance become aligned with the natural axes in the projection.
Here is a little demo code to help you visualize what's going on. Please also read the Wikipedia page on PCA.
def pca_plot(V, W, idx, ax):
# plot only first 2 dimensions of W along with axes W
colors = ['k', 'r', 'b', 'g', 'c', 'm', 'y']
u = V.mean(axis=0) # n
axes_lengths = 1.5*(V - u).dot(W).std(axis=0)
axes = W * axes_lengths # n x p
axes = axes[:2].T # p x 2
ax.set_aspect('equal')
ax.scatter(V[:, 0], V[:, 1], alpha=.2)
ax.scatter(V[idx, 0], V[idx, 1], color='r')
hlen = np.max(np.linalg.norm((V - u)[:, :2], axis=1)) / 25
for k in range(axes.shape[0]):
ax.arrow(*u[:2], *axes[k], head_width=hlen/2, head_length=hlen, fc=colors[k], ec=colors[k])
def pca_demo(X, p):
n = X.shape[1] # input dimension
pca = PCA(n_components=p).fit(X)
u = pca.mean_
v = pca.explained_variance_
W = pca.components_.T
Y = pca.transform(X)
assert np.allclose((X - u).dot(W), Y)
# plot first 2D of both input space and output space
# for visual identification: select a point that's as far as possible
# in the direction of the diagonal of the axes cube, after normalization
# Z: variance-1 projection
Z = (X - u).dot(W/np.sqrt(v))
idx = np.argmax(Z.sum(axis=1) / np.sqrt(np.linalg.norm(Z, axis=1)))
fig, ax = plt.subplots(ncols=2, figsize=(12, 6))
# input space
pca_plot(X, W, idx, ax[0])
ax[0].set_title('input data (first 2D)')
# output space
pca_plot(Y, np.eye(p), idx, ax[1])
ax[1].set_title('projection (first 2D)')
return Y, W, u, pca
Examples
Iris data
# to better understand the shape of W, we project onto
# a space of dimension p=3
X = load_iris().data
Y, W, u, pca = pca_demo(X, 3)
Note that the projection is really just (X - u) W:
>>> np.allclose((X - u).dot(W), Y)
True
Synthetic ellipsoid data
A = np.array([
[20, 10, 7],
[-1, 3, 7],
[5, 1, 2],
])
X = np.random.normal(size=(1000, A.shape[0])).dot(A)
Y, W, u, pca = pca_demo(X, 3)
I'm trying to recreate a plot from An Introduction to Statistical Learning and I'm having trouble figuring out how to calculate the confidence interval for a probability prediction. Specifically, I'm trying to recreate the right-hand panel of this figure (figure 7.1) which is predicting the probability that wage>250 based on a degree 4 polynomial of age with associated 95% confidence intervals. The wage data is here if anyone cares.
I can predict and plot the predicted probabilities fine with the following code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
wage = pd.read_csv('../../data/Wage.csv', index_col=0)
wage['wage250'] = 0
wage.loc[wage['wage'] > 250, 'wage250'] = 1
poly = PolynomialFeatures(degree=4)
age = poly.fit_transform(wage['age'].values.reshape(-1, 1))
logit = sm.Logit(wage['wage250'], age).fit()
age_range_poly = poly.fit_transform(np.arange(18, 81).reshape(-1, 1))
y_proba = logit.predict(age_range_poly)
plt.plot(age_range_poly[:, 1], y_proba)
But I'm at a loss as to how the confidence intervals of the predicted probabilities are calculated. I have thought about bootstrapping the data many times to get the distribution of probabilities for each age but I know there is an easier way which is just beyond my grasp.
I have the estimated coefficient covariance matrix and the standard errors associated with each estimated coefficient. How would I go about calculating the confidence intervals as shown in the right-hand panel of the figure above given this information?
Thanks!
You can use the delta method to find an approximate variance for the predicted probability. Namely,
var(proba) = np.dot(np.dot(gradient.T, cov), gradient)
where gradient is the vector of derivatives of the predicted probability with respect to the model coefficients, and cov is the covariance matrix of the coefficients.
The delta method is proven to work asymptotically for all maximum likelihood estimates. However, if you have a small training sample, asymptotic methods may not work well and you should consider bootstrapping.
Here is a toy example of applying delta method to logistic regression:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# generate data
np.random.seed(1)
x = np.arange(100)
y = (x * 0.5 + np.random.normal(size=100,scale=10)>30)
# estimate the model
X = sm.add_constant(x)
model = sm.Logit(y, X).fit()
proba = model.predict(X) # predicted probability
# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (proba * (1 - proba) * X.T).T # matrix of gradients for each observation
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96 # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
plt.plot(x, proba)
plt.plot(x, lower, color='g')
plt.plot(x, upper, color='g')
plt.show()
It draws the following nice picture:
For your example the code would be
proba = logit.predict(age_range_poly)
cov = logit.cov_params()
gradient = (proba * (1 - proba) * age_range_poly.T).T
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient])
c = 1.96
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
plt.plot(age_range_poly[:, 1], proba)
plt.plot(age_range_poly[:, 1], lower, color='g')
plt.plot(age_range_poly[:, 1], upper, color='g')
plt.show()
and it would give the following picture
Looks pretty much like a boa-constrictor with an elephant inside.
You could compare it with the bootstrap estimates:
preds = []
for i in range(1000):
boot_idx = np.random.choice(len(age), replace=True, size=len(age))
model = sm.Logit(wage['wage250'].iloc[boot_idx], age[boot_idx]).fit(disp=0)
preds.append(model.predict(age_range_poly))
p = np.array(preds)
plt.plot(age_range_poly[:, 1], np.percentile(p, 97.5, axis=0))
plt.plot(age_range_poly[:, 1], np.percentile(p, 2.5, axis=0))
plt.show()
Results of delta method and bootstrap look pretty much the same.
The authors of the book, however, go a third way. They use the fact that
proba = np.exp(np.dot(x, params)) / (1 + np.exp(np.dot(x, params)))
and calculate the confidence interval for the linear part, and then transform it with the logistic (inverse-logit) function:
xb = np.dot(age_range_poly, logit.params)
std_errors = np.array([np.sqrt(np.dot(np.dot(g, cov), g)) for g in age_range_poly])
upper_xb = xb + c * std_errors
lower_xb = xb - c * std_errors
upper = np.exp(upper_xb) / (1 + np.exp(upper_xb))
lower = np.exp(lower_xb) / (1 + np.exp(lower_xb))
plt.plot(age_range_poly[:, 1], upper)
plt.plot(age_range_poly[:, 1], lower)
plt.show()
So they get the diverging interval:
These methods produce such different results because they assume different things (the predicted probability and the log-odds, respectively) to be normally distributed. Namely, the delta method assumes the predicted probabilities are normal, while in the book the log-odds are assumed normal. In fact, neither is normal in finite samples; they all converge to normality in infinite samples, but their variances converge to zero at the same time. Maximum likelihood estimates are insensitive to reparametrization, but their estimated distribution is sensitive to it, and that's the problem.
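To make the parametrization point concrete, here is a rough toy illustration with made-up numbers (not from the wage data):
import numpy as np
# a toy linear predictor and its standard error
xb, se_xb = -2.0, 0.8
p = 1 / (1 + np.exp(-xb))
# delta method on the probability scale: se(p) ~ p*(1-p)*se(xb)
se_p = p * (1 - p) * se_xb
ci_prob = (p - 1.96 * se_p, p + 1.96 * se_p)   # symmetric around p, can cross 0
# interval on the log-odds scale, then transformed (the book's way)
lo, hi = xb - 1.96 * se_xb, xb + 1.96 * se_xb
ci_logit = (1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi)))   # asymmetric, stays inside (0, 1)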
Here is an instructive and efficient method to calculate the standard errors ('se') of the fit ('mean_se') and single observations ('obs_se') on top of a statsmodels Logit().fit() object ('fit'), identical to the method in the book ISLR and the last method from the answer by David Dale:
fit_mean = fit.model.exog.dot(fit.params)
fit_mean_se = ((fit.model.exog*fit.model.exog.dot(fit.cov_params())).sum(axis=1))**0.5
fit_obs_se = ( ((fit.model.endog-fit_mean).std(ddof=fit.params.shape[0]))**2 + \
fit_mean_se**2 )**0.5
A figure similar to the one in the book ISLR
The shaded regions represent the 95% confidence intervals for the fit and single observations.
Ideas for improvement are most welcome.
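For completeness, a sketch of how one might turn those arrays into plotted bands (my addition, not part of the method above): build the interval on the linear-predictor scale and push it through the logistic function, as in the book; the column index used for age is an assumption.
import numpy as np
import matplotlib.pyplot as plt
z = 1.96  # ~95% normal quantile
def expit(t):
    return 1 / (1 + np.exp(-t))
x = fit.model.exog[:, 1]                 # assumption: column 1 of exog is age
order = np.argsort(x)
band_lo = expit(fit_mean - z * fit_mean_se)
band_hi = expit(fit_mean + z * fit_mean_se)
plt.plot(x[order], expit(fit_mean)[order])
plt.fill_between(x[order], band_lo[order], band_hi[order], alpha=0.3)
plt.show()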
I'm working through my Matlab code for the Andrew Ng Coursera course and turning it into Python. I am working on non-regularized logistic regression, and after writing my gradient and cost functions I needed something similar to fminunc; after some googling, I found a couple of options. They both return the same results, but they do not match what is in Andrew Ng's expected-results code. Others seem to get this to work correctly, but I'm wondering why my specific code does not return the desired result when using the scipy.optimize functions, even though it does for the cost and gradient pieces earlier in the code.
The data I'm using can be found at the link below:
ex2data1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as op
#Machine Learning Online Class - Exercise 2: Logistic Regression
#Load Data
#The first two columns contains the exam scores and the third column contains the label.
data = pd.read_csv('ex2data1.txt', header = None)
X = np.array(data.iloc[:, 0:2]) #100 x 2
y = np.array(data.iloc[:,2]) #100 x 1
y.shape = (len(y), 1)
#Creating sub-dataframes for plotting
pos_plot = data[data[2] == 1]
neg_plot = data[data[2] == 0]
#==================== Part 1: Plotting ====================
#We start the exercise by first plotting the data to understand the
#the problem we are working with.
print('Plotting data with + indicating (y = 1) examples and o indicating (y = 0) examples.')
plt.plot(pos_plot[0], pos_plot[1], "+", label = "Admitted")
plt.plot(neg_plot[0], neg_plot[1], "o", label = "Not Admitted")
plt.xlabel('Exam 1 score')
plt.ylabel('Exam 2 score')
plt.legend()
plt.show()
def sigmoid(z):
'''
SIGMOID Compute sigmoid function
g = SIGMOID(z) computes the sigmoid of z.
Instructions: Compute the sigmoid of each value of z (z can be a matrix,
vector or scalar).
'''
g = 1 / (1 + np.exp(-z))
return g
def costFunction(theta, X, y):
'''
COSTFUNCTION Compute cost and gradient for logistic regression
J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
parameter for logistic regression and the gradient of the cost
w.r.t. to the parameters.
'''
m = len(y) #number of training examples
h = sigmoid(X.dot(theta)) #logistic regression hypothesis
J = (1/m) * np.sum((-y*np.log(h)) - ((1-y)*np.log(1-h)))
#h is 100x1, y is 100x1, these end up as 2 vectors we subtract from each other
#then we sum the values by rows
#cost function for logistic regression
return J
def gradient(theta, X, y):
m = len(y)
grad = np.zeros((theta.shape))
h = sigmoid(X.dot(theta))
for i in range(len(theta)): #number of rows in theta
XT = X[:,i]
XT.shape = (len(X),1)
grad[i] = (1/m) * np.sum((h-y)*XT) #updating each row of the gradient
return grad
#============ Part 2: Compute Cost and Gradient ============
#In this part of the exercise, you will implement the cost and gradient
#for logistic regression. You need to complete the code in costFunction.m
#Add intercept term to x and X_test
Bias = np.ones((len(X), 1))
X = np.column_stack((Bias, X))
#Initialize fitting parameters
initial_theta = np.zeros((len(X[0]), 1))
#Compute and display initial cost and gradient
(cost, grad) = costFunction(initial_theta, X, y), gradient(initial_theta, X, y)
print('Cost at initial theta (zeros): %f' % cost)
print('Expected cost (approx): 0.693\n')
print('Gradient at initial theta (zeros):')
print(grad)
print('Expected gradients (approx):\n -0.1000\n -12.0092\n -11.2628')
#Compute and display cost and gradient with non-zero theta
test_theta = np.array([[-24], [0.2], [0.2]]);
(cost, grad) = costFunction(test_theta, X, y), gradient(test_theta, X, y)
print('\nCost at test theta: %f' % cost)
print('Expected cost (approx): 0.218\n')
print('Gradient at test theta:')
print(grad)
print('Expected gradients (approx):\n 0.043\n 2.566\n 2.647\n')
result = op.fmin_tnc(func = costFunction, x0 = initial_theta, fprime = gradient, args = (X,y))
result[1]
Result = op.minimize(fun = costFunction,
x0 = initial_theta,
args = (X, y),
method = 'TNC',
jac = gradient, options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000})
theta = Result.x
theta
test = np.array([[1, 45, 85]])
prob = sigmoid(test.dot(theta))
print('For a student with scores 45 and 85, we predict an admission probability of %f,' % prob)
print('Expected value: 0.775 +/- 0.002\n')
This was a very difficult problem to debug, and illustrates a poorly documented aspect of the scipy.optimize interface. The documentation vaguely indicates that theta will be passed around as a vector:
Minimization of scalar function of one or more variables.
In general, the optimization problems are of the form:
minimize f(x) subject to
g_i(x) >= 0, i = 1,...,m
h_j(x) = 0, j = 1,...,p
where x is a vector of one or more variables.
What's important is that they really mean vector in the most primitive sense, a 1-dimensional array. So you have to expect that whenever theta is passed into one of your callbacks, it will be passed in as a 1-d array. But in numpy, 1-d arrays sometimes behave differently from 2-d row arrays (and, obviously, from 2-d column arrays).
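For example (a generic illustration, not necessarily the exact failure mode in this code), subtracting a 1-d array from a 2-d column array broadcasts into a full matrix instead of staying a vector:
import numpy as np
a = np.arange(3)                  # shape (3,)   -- what scipy passes to the callbacks
b = np.arange(3).reshape(-1, 1)   # shape (3, 1) -- a 2-d column array
print((a - b).shape)              # (3, 3): broadcasting silently produces a matrix
print((b - b).shape)              # (3, 1): stays a column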
I don't know exactly why it's causing a problem in your case, but it's easily fixed regardless. You just have to add the following at the top of both your cost function and your gradient function:
theta = theta.reshape(-1, 1)
This guarantees that theta will be a 2-d column array, as expected. Once you've done this, the results are correct.
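As a sketch of what the two callbacks could look like with that line added (vectorizing the gradient and returning a flat array are my own choices here, not part of the original code):
def costFunction(theta, X, y):
    theta = theta.reshape(-1, 1)      # force a 2-d column array
    m = len(y)
    h = sigmoid(X.dot(theta))         # (m, 1)
    return (1/m) * np.sum((-y * np.log(h)) - ((1 - y) * np.log(1 - h)))
def gradient(theta, X, y):
    theta = theta.reshape(-1, 1)
    m = len(y)
    h = sigmoid(X.dot(theta))
    grad = (1/m) * X.T.dot(h - y)     # (n, 1)
    return grad.ravel()               # flat array, which the optimizers handle best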
I have had similar issues with Scipy while dealing with the same problem as you. As senderle points out, the interface is not the easiest to deal with, especially combined with the numpy array interface... Here is my implementation, which works as expected.
Defining the cost and gradient functions
Note that initial_theta is passed as a simple array of shape (3,) and converted to a column vector of shape (3,1) within the function. The gradient function then returns grad.ravel(), which has shape (3,) again. This is important, as doing otherwise caused error messages with various optimization methods in scipy.optimize.
Note that different methods have different behaviours, but returning .ravel() seems to fix most issues...
import pandas as pd
import numpy as np
import scipy.optimize as opt
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def CostFunc(theta,X,y):
#Initializing variables
m = len(y)
J = 0
grad = np.zeros(theta.shape)
#Vectorized computations
z = X @ theta
h = sigmoid(z)
J = (1/m) * ((-y.T @ np.log(h)) - (1 - y).T @ np.log(1-h))
return J
def Gradient(theta,X,y):
#Initializing variables
m = len(y)
theta = theta[:,np.newaxis]
grad = np.zeros(theta.shape)
#Vectorized computations
z = X @ theta
h = sigmoid(z)
grad = (1/m)*(X.T @ (h - y))
return grad.ravel() #<-- This is the trick
Initializing variables and parameters
Note that initial_theta.shape returns (3,)
X = data1.iloc[:,0:2].values
m,n = X.shape
X = np.concatenate((np.ones(m)[:,np.newaxis],X),1)
y = data1.iloc[:,-1].values[:,np.newaxis]
initial_theta = np.zeros((n+1))
Calling Scipy.optimize
model = opt.minimize(fun = CostFunc, x0 = initial_theta, args = (X, y), method = 'TNC', jac = Gradient)
Any comments from more knowledgeable people are welcome; this Scipy interface is a mystery to me. Thanks!
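As a quick sanity check (my addition, reusing the sigmoid defined above), the optimized parameters are available in model.x and can be used to reproduce the prediction expected by the exercise:
theta_hat = model.x                    # fitted parameters, shape (3,)
score = np.array([1, 45, 85])          # intercept term, exam 1 score, exam 2 score
print(sigmoid(score @ theta_hat))      # should come out around 0.775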
In Python, I know how to calculate r and associated p-value using scipy.stats.pearsonr, but I'm unable to find a way to calculate the confidence interval of r. How is this done? Thanks for any help :)
According to [1], calculating a confidence interval directly from Pearson's r is complicated by the fact that r is not normally distributed. The following steps are needed:
Convert r to z',
Calculate the z' confidence interval. The sampling distribution of z' is approximately normally distributed and has standard error of 1/sqrt(n-3).
Convert the confidence interval back to r.
Here is some sample code:
import math
from scipy import stats
def r_to_z(r):
return math.log((1 + r) / (1 - r)) / 2.0
def z_to_r(z):
e = math.exp(2 * z)
return((e - 1) / (e + 1))
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf(1 - alpha/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Reference:
http://onlinestatbook.com/2/estimation/correlation_ci.html
Using rpy2 and the psychometric library (you will need R installed and to run install.packages("psychometric") within R first)
from rpy2.robjects.packages import importr
psychometric=importr('psychometric')
psychometric.CIr(r=.9, n = 100, level = .95)
Where 0.9 is your correlation, n the sample size and 0.95 the confidence level
Here's a solution that uses bootstrapping to compute the confidence interval, rather than the Fisher transformation (which assumes bivariate normality, etc.), borrowing from this answer:
import numpy as np
def pearsonr_ci(x, y, ci=95, n_boots=10000):
x = np.asarray(x)
y = np.asarray(y)
# (n_boots, n_observations) paired arrays
rand_ixs = np.random.randint(0, x.shape[0], size=(n_boots, x.shape[0]))
x_boots = x[rand_ixs]
y_boots = y[rand_ixs]
# differences from mean
x_mdiffs = x_boots - x_boots.mean(axis=1)[:, None]
y_mdiffs = y_boots - y_boots.mean(axis=1)[:, None]
# sums of squares
x_ss = np.einsum('ij, ij -> i', x_mdiffs, x_mdiffs)
y_ss = np.einsum('ij, ij -> i', y_mdiffs, y_mdiffs)
# pearson correlations
r_boots = np.einsum('ij, ij -> i', x_mdiffs, y_mdiffs) / np.sqrt(x_ss * y_ss)
# upper and lower bounds for confidence interval
ci_low = np.percentile(r_boots, (100 - ci) / 2)
ci_high = np.percentile(r_boots, (ci + 100) / 2)
return ci_low, ci_high
The answer given by bennylp is mostly correct; however, there is a small error in calculating the critical value in the 3rd function.
It should instead be:
def r_confidence_interval(r, alpha, n):
z = r_to_z(r)
se = 1.0 / math.sqrt(n - 3)
z_crit = stats.norm.ppf((1 + alpha)/2) # 2-tailed z critical value
lo = z - z_crit * se
hi = z + z_crit * se
# Return a sequence
return (z_to_r(lo), z_to_r(hi))
Here's another post for reference: Scipy - two tail ppf function for a z value?
I know bootstrapping has been suggested above; here is another variation of it, which may suit some other set-ups better.
#1
Sample your data (paired Xs & Ys; you can also add other columns, say weights), fit the original model on it, record the R2 and append it. Then extract your confidence intervals from the resulting distribution of all recorded R2s.
#2 Additionally, you can fit on the sampled data and, using that sampled-data model, predict on the non-sampled X (you could also supply a continuous range to extend your predictions instead of using the original X) to get confidence intervals on your Y hats.
So in sample code:
import numpy as np
from scipy.optimize import curve_fit
import pandas as pd
from sklearn.metrics import r2_score
x = np.array([your numbers here])
y = np.array([your numbers here])
### define list for R2 values
r2s = []
### define dataframe to append your bootstrapped fits for Y hat ranges
ci_df = pd.DataFrame({'x': x})
### define how many samples you want
how_many_straps = 5000
### define your fit function/s
def func_exponential(x,a,b):
return np.exp(b) * np.exp(a * x)
### fit original, using log because fitting exponential
polyfit_original = np.polyfit(x
,np.log(y)
,1
,# w= could supply weight for observations here)
)
for i in range(how_many_straps+1):
### zip into tuples attaching X to Y, can combine more variables as well
zipped_for_boot = pd.Series(tuple(zip(x,y)))
### sample zipped X & Y pairs above with replacement
zipped_resampled = zipped_for_boot.sample(frac=1,
replace=True)
### create your sampled X & Y
boot_x = []
boot_y = []
for sample in zipped_resampled:
boot_x.append(sample[0])
boot_y.append(sample[1])
### predict sampled using original fit
y_hat_boot_via_original_fit = func_exponential(np.asarray(boot_x),
polyfit_original[0],
polyfit_original[1])
### calculate r2 and append
r2s.append(r2_score(boot_y, y_hat_boot_via_original_fit))
### fit sampled
polyfit_boot = np.polyfit(boot_x
,np.log(boot_y)
,1
,# w= could supply weight for observations here)
)
### predict original via sampled fit or on a range of min(x) to Z
y_hat_original_via_sampled_fit = func_exponential(x,
polyfit_boot[0],
polyfit_boot[1])
### insert y hat into dataframe for calculating y hat confidence intervals
ci_df["trial_" + str(i)] = y_hat_original_via_sampled_fit
### R2 conf interval
low = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[0],3)
up = round(pd.Series(r2s).quantile([0.025, 0.975]).tolist()[1],3)
F"r2 confidence interval = {low} - {up}"
I'm trying to fit a 2D Gaussian to an image. Noise is very low, so my attempt was to rotate the image such that the two principal axes do not co-vary, figure out the maximum and just compute the standard deviation in both dimensions. Weapon of choice is python.
However I got stuck at finding the eigenvectors of the image - numpy.linalg.py assumes discrete data points. I thought about taking this image to be a probability distribution, sampling a few thousand points and then computing the eigenvectors from that distribution, but I'm sure there must be a way of finding the eigenvectors (i.e., the semi-major and semi-minor axes of the Gaussian ellipse) directly from that image. Any ideas?
Thanks a lot :)
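For concreteness, here is a rough sketch of the sampling idea I mentioned (img stands in for the image array; the sample size is arbitrary):
import numpy as np
img = np.random.random((50, 50))         # placeholder for the (non-negative) image
nrows, ncols = img.shape
ii, jj = np.mgrid[:nrows, :ncols]
coords = np.column_stack((ii.ravel(), jj.ravel()))
# draw pixel coordinates with probability proportional to intensity
p = img.ravel() / img.sum()
idx = np.random.choice(coords.shape[0], size=20000, p=p)
samples = coords[idx]
cov = np.cov(samples.T)                  # 2x2 covariance of the sampled points
eigvals, eigvecs = np.linalg.eigh(cov)   # principal axes of the blob
sigmas = np.sqrt(eigvals)                # standard deviations along those axes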
Just a quick note: there are several tools to fit a Gaussian to an image. The only one I can think of off the top of my head is scikits.learn, which isn't completely image-oriented, but I know there are others.
Calculating the eigenvectors of the covariance matrix exactly as you had in mind is very computationally expensive. You have to associate each pixel (or a large-ish random sample) of the image with an x,y point.
Basically, you do something like:
import numpy as np
# grid is your image data, here...
grid = np.random.random((10,10))
nrows, ncols = grid.shape
i,j = np.mgrid[:nrows, :ncols]
coords = np.vstack((i.reshape(-1), j.reshape(-1), grid.reshape(-1)))
cov = np.cov(coords)  # 3x3 covariance of (row, col, intensity)
eigvals, eigvecs = np.linalg.eigh(cov)
You can instead make use of the fact that it's a regularly-sampled image and compute its moments (or "inertial axes") instead. This will be considerably faster for large images.
As a quick example (I'm using part of one of my previous answers, in case you find it useful...):
import numpy as np
import matplotlib.pyplot as plt
def main():
data = generate_data()
xbar, ybar, cov = intertial_axis(data)
fig, ax = plt.subplots()
ax.imshow(data)
plot_bars(xbar, ybar, cov, ax)
plt.show()
def generate_data():
data = np.zeros((200, 200), dtype=float)
cov = np.array([[200, 100], [100, 200]])
ij = np.random.multivariate_normal((100,100), cov, int(1e5))
for i,j in ij:
data[int(i), int(j)] += 1
return data
def raw_moment(data, iord, jord):
nrows, ncols = data.shape
y, x = np.mgrid[:nrows, :ncols]
data = data * x**iord * y**jord
return data.sum()
def intertial_axis(data):
"""Calculate the x-mean, y-mean, and cov matrix of an image."""
data_sum = data.sum()
m10 = raw_moment(data, 1, 0)
m01 = raw_moment(data, 0, 1)
x_bar = m10 / data_sum
y_bar = m01 / data_sum
u11 = (raw_moment(data, 1, 1) - x_bar * m01) / data_sum
u20 = (raw_moment(data, 2, 0) - x_bar * m10) / data_sum
u02 = (raw_moment(data, 0, 2) - y_bar * m01) / data_sum
cov = np.array([[u20, u11], [u11, u02]])
return x_bar, y_bar, cov
def plot_bars(x_bar, y_bar, cov, ax):
"""Plot bars with a length of 2 stddev along the principal axes."""
def make_lines(eigvals, eigvecs, mean, i):
"""Make lines a length of 2 stddev."""
std = np.sqrt(eigvals[i])
vec = 2 * std * eigvecs[:,i] / np.hypot(*eigvecs[:,i])
x, y = np.vstack((mean-vec, mean, mean+vec)).T
return x, y
mean = np.array([x_bar, y_bar])
eigvals, eigvecs = np.linalg.eigh(cov)
ax.plot(*make_lines(eigvals, eigvecs, mean, 0), marker='o', color='white')
ax.plot(*make_lines(eigvals, eigvecs, mean, -1), marker='o', color='red')
ax.axis('image')
if __name__ == '__main__':
main()
Fitting a Gaussian robustly can be tricky. There was a fun article on this topic in the IEEE Signal Processing Magazine:
Hongwei Guo, "A Simple Algorithm for Fitting a Gaussian Function" IEEE
Signal Processing Magazine, September 2011, pp. 134--137
I give an implementation of the 1D case here:
http://scipy-central.org/item/28/2/fitting-a-gaussian-to-noisy-data-points
(Scroll down to see the resulting fits)
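In case that link no longer resolves, here is a rough sketch of the core idea for the 1D case (a Caruana/Guo-style weighted least-squares fit of a parabola to the log of the data; the function name and the y**2 weighting are my choices):
import numpy as np
def fit_gaussian_1d(x, y):
    """Fit y ~ A*exp(-(x-mu)**2/(2*sigma**2)) via a parabola fit to log(y),
    weighting by y**2 so that small, noisy values do not dominate."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = y > 0                          # log is only defined for positive data
    x, y = x[keep], y[keep]
    w = y ** 2
    M = np.vstack([np.ones_like(x), x, x ** 2]).T
    a, b, c = np.linalg.solve(M.T @ (w[:, None] * M), M.T @ (w * np.log(y)))
    sigma = np.sqrt(-1.0 / (2.0 * c))     # needs c < 0, i.e. a peaked profile
    mu = -b / (2.0 * c)
    amp = np.exp(a - b ** 2 / (4.0 * c))
    return amp, mu, sigma
For the 2D case in the question, the log of the Gaussian is a quadratic form in x and y, so the same idea carries over, but the moment-based approach above is usually simpler.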
Did you try Principal Component Analysis (PCA)? Maybe the MDP package could do the job with minimal effort.