The equation that I'm working with is as follows:
The description says that x-bar and y-bar are the averages of array 1 and array 2. The minimum coefficient is 0.3.
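Based on that description and the answers below, this appears to be the Pearson correlation coefficient:

r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}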
The reason I'm asking is because I am not too familiar with reading statistical equations, let alone implementing them in Python...
The easiest approach would be to use scipy.stats (see here):
import numpy as np
from scipy.stats import pearsonr
x = np.random.random(20)
y = np.random.random(20)
print(pearsonr(x, y))
This will give you two values: the correlation coefficient and the p-value.
You can implement it yourself like this:
x = np.random.random(20)
y = np.random.random(20)
x_bar = np.mean(x)
y_bar = np.mean(y)
top = np.sum((x - x_bar) * (y - y_bar))
bot = np.sqrt(np.sum(np.power(x - x_bar, 2)) * np.sum(np.power(y - y_bar, 2)))
print(top/bot)
Both give the same result, good luck!
The straightforward implementation using for loops would be:
import math
def correlation(x, y):
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
var_x = sum((x_i - x_bar)**2 for x_i in x)
var_y = sum((y_i - y_bar)**2 for y_i in y)
assert len(x) == len(y)
numerator = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
denominator = math.sqrt(var_x * var_y)
return numerator / denominator
if __name__ == "__main__":
x = [...]
y = [...]
print(correlation(x, y))
When doing a lot of numeric calculations, one usually uses the numpy module, where this function is already provided:
import numpy as np
if __name__ == "__main__":
x = np.array([...])
y = np.array([...])
print(np.corrcoef(x, y)[0, 1])
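As a quick sanity check with some made-up numbers (not from the original answer), the pure-Python correlation() above and np.corrcoef agree to floating-point precision:
import numpy as np
x = [1.0, 2.0, 4.0, 3.5, 7.0]
y = [2.1, 3.9, 8.2, 7.1, 13.8]
print(correlation(x, y))          # hand-written version defined above
print(np.corrcoef(x, y)[0, 1])    # numpy's built-in equivalent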
I'm trying to write a function to evaluate the probability mass function for the bivariate poisson distribution.
This is easy when all of the parameters (x, y, theta1, theta2, theta0) are scalars, but tricky to scale up without loops to allow these parameters to be vectors. I need it to scale such that, for:
theta0 being a scalar - the "correlation parameter" in the equation
theta1 and theta2 having length l
x, y both having length n
the output array would have shape (l, n, n). For example, the slice [j, :, :] of the output is the PMF evaluated over the full (x, y) grid for the j-th pair of theta parameters.
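For reference, the bivariate Poisson PMF being implemented here appears to be (reconstructed from the code below, so read it as my interpretation of the equation):

P(x, y) = e^{-(\theta_1 + \theta_2 + \theta_0)} \, \frac{\theta_1^x}{x!} \, \frac{\theta_2^y}{y!} \sum_{i=0}^{\min(x, y)} \binom{x}{i} \binom{y}{i} \, i! \left( \frac{\theta_0}{\theta_1 \theta_2} \right)^i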
The first part (the constant, before the summation) I think I've figured out:
import numpy as np
from scipy.special import factorial
def constant(theta1, theta2, theta0, x, y):
exponential_part = np.exp(-(theta1 + theta2 + theta0)).reshape(-1, 1, 1)
x = np.tile(x, (len(x), 1)).transpose()
y = np.tile(y, (len(y), 1))
double_factorial = (np.power(np.array(theta1).reshape(-1, 1, 1), x)/factorial(x)) * \
(np.power(np.array(theta2).reshape(-1, 1, 1), y)/factorial(y))
return exponential_part * double_factorial
But I'm struggling with the summation part. How can I vectorize a summation where the limits depend on variable arrays?
I think I have this figured out, based on the approach that @w-m suggests: calculate every possible summation term that could appear, based on the maximum x or y value present, and use a mask to discard the ones you don't want. Assuming your x and y values run from 0 to N in consecutive order, this computes up to three times more terms than are strictly required, but that cost is more than offset by the vectorization.
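To illustrate the masking idea on a toy problem (my own example, not part of the original post): compute s_k = sum_{i=0}^{n_k} i^2, where each element has its own upper limit n_k.
import numpy as np
n = np.array([2, 5, 3])                    # variable upper limits
i = np.arange(n.max() + 1).reshape(-1, 1)  # every candidate summation index
terms = i.astype(float) ** 2               # every term that could appear
terms = terms * (i <= n)                   # mask out terms beyond each element's limit
print(terms.sum(axis=0))                   # [ 5. 55. 14.]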
Reference implementation
I wrote this by first writing a pure-Python reference implementation, which just implements your problem using loops. With 4 nested loops, it's not exactly fast, but it's handy to have while testing the numpy version.
import numpy as np
from scipy.special import factorial, comb
import operator as op
from functools import reduce
def choose(n, r):
# https://stackoverflow.com/a/4941932/530160
r = min(r, n-r)
numer = reduce(op.mul, range(n, n-r, -1), 1)
denom = reduce(op.mul, range(1, r+1), 1)
return numer // denom # or / in Python 2
def reference_impl_constant(s_theta1, s_theta2, s_theta0, s_x, s_y):
# Cast to float to prevent overflow
s_theta1 = float(s_theta1)
s_theta2 = float(s_theta2)
s_theta0 = float(s_theta0)
s_x = float(s_x)
s_y = float(s_y)
term1 = np.exp(-(s_theta1 + s_theta2 + s_theta0))
term2 = (s_theta1 ** s_x / factorial(s_x))
term3 = (s_theta2 ** s_y / factorial(s_y))
assert term1 >= 0
assert term2 >= 0
assert term3 >= 0
return term1 * term2 * term3
def reference_impl_constant_loop(theta1, theta2, theta0, x, y):
theta_len = theta1.shape[0]
xy_len = x.shape[0]
constant_array = np.zeros((theta_len, xy_len, xy_len))
for i in range(theta_len):
for j in range(xy_len):
for k in range(xy_len):
s_theta1 = theta1[i]
s_theta2 = theta2[i]
s_theta0 = theta0
s_x = x[j]
s_y = y[k]
constant_term = reference_impl_constant(s_theta1, s_theta2, s_theta0, s_x, s_y)
assert constant_term >= 0
constant_array[i, j, k] = constant_term
return constant_array
def reference_impl_summation(s_theta1, s_theta2, s_theta0, s_x, s_y):
sum_ = 0
for i in range(min(s_x, s_y) + 1):
sum_ += choose(s_x, i) * choose(s_y, i) * factorial(i) * ((s_theta0/s_theta1/s_theta2) ** i)
assert sum_ >= 0
return sum_
def reference_impl_summation_loop(theta1, theta2, theta0, x, y):
theta_len = theta1.shape[0]
xy_len = x.shape[0]
summation_array = np.zeros((theta_len, xy_len, xy_len))
for i in range(theta_len):
for j in range(xy_len):
for k in range(xy_len):
s_theta1 = theta1[i]
s_theta2 = theta2[i]
s_theta0 = theta0
s_x = x[j]
s_y = y[k]
summation_term = reference_impl_summation(s_theta1, s_theta2, s_theta0, s_x, s_y)
assert summation_term >= 0
summation_array[i, j, k] = summation_term
return summation_array
def reference_impl(theta1, theta2, theta0, x, y):
# all array inputs must be 1D
assert len(theta1.shape) == 1
assert len(theta2.shape) == 1
assert len(x.shape) == 1
assert len(y.shape) == 1
# theta vectors must have same length
theta_len = theta1.shape[0]
assert theta2.shape[0] == theta_len
# x and y must have same length
xy_len = x.shape[0]
assert y.shape[0] == xy_len
# theta0 is scalar
assert isinstance(theta0, (int, float))
constant_array = np.zeros((theta_len, xy_len, xy_len))
output = np.zeros((theta_len, xy_len, xy_len))
constant_array = reference_impl_constant_loop(theta1, theta2, theta0, x, y)
summation_array = reference_impl_summation_loop(theta1, theta2, theta0, x, y)
output = constant_array * summation_array
return output
Numpy implementation
I split the implementation of this across two functions.
The fast_constant() function calculates everything to the left of the summation symbol. The fast_summation() function calculates everything inside the summation symbol.
import numpy as np
from scipy.special import factorial, comb
def fast_summation(theta1, theta2, theta0, x, y):
x = np.tile(x, (len(x), 1)).transpose()
y = np.tile(y, (len(y), 1))
sum_limit = np.minimum(x, y)
max_sum_limit = np.max(sum_limit)
i = np.arange(max_sum_limit + 1).reshape(-1, 1, 1)
summation_mask = (i <= sum_limit)
theta_ratio = (theta0 / (theta1 * theta2)).reshape(-1, 1, 1, 1)
theta_to_power = np.power(theta_ratio, i)
terms = comb(x, i) * comb(y, i) * factorial(i) * theta_to_power
# mask out terms which aren't part of sum
terms *= summation_mask
# axis 0 is theta
# axis 1 is i
# axis 2 & 3 are x and y
# so sum across axis 1
terms = terms.sum(axis=1)
return terms
def fast_constant(theta1, theta2, theta0, x, y):
theta1 = theta1.astype('float64')
theta2 = theta2.astype('float64')
exponential_part = np.exp(-(theta1 + theta2 + theta0)).reshape(-1, 1, 1)
# x and y must be 1D
assert len(x.shape) == 1
assert len(y.shape) == 1
# x and y must have same shape
assert x.shape == y.shape
x_len, y_len = x.shape[0], y.shape[0]
x = x.reshape((x_len, 1))
y = y.reshape((1, y_len))
double_factorial = (np.power(np.array(theta1).reshape(-1, 1, 1), x)/factorial(x)) * \
(np.power(np.array(theta2).reshape(-1, 1, 1), y)/factorial(y))
return exponential_part * double_factorial
def fast_impl(theta1, theta2, theta0, x, y):
return fast_summation(theta1, theta2, theta0, x, y) * fast_constant(theta1, theta2, theta0, x, y)
Benchmarking
Assuming that X and Y range from 0 to 20, and that theta is centered somewhere inside that range, I get the result that the numpy version is roughly 280 times faster than the pure python reference.
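A sketch of how that comparison can be run (the parameter values here are my own assumptions; the exact speed-up depends on the machine and the array sizes):
import timeit
import numpy as np
# x and y over 0..20, thetas centered inside that range, as in the benchmark above
x = np.arange(21)
y = np.arange(21)
theta1 = np.full(5, 10.0)
theta2 = np.full(5, 10.0)
theta0 = 1.0
t_ref = timeit.timeit(lambda: reference_impl(theta1, theta2, theta0, x, y), number=3)
t_fast = timeit.timeit(lambda: fast_impl(theta1, theta2, theta0, x, y), number=3)
print(t_ref / t_fast)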
Numerical stability
I'm unsure how numerically stable this is. For example, when I center theta at 100, I get a floating-point overflow. Typically, when computing an expression which has lots of choose and factorial expressions inside it, you'll use some mathematical equivalent which results in smaller intermediate sums. In this case I have so little understanding of the math that I don't know how you'd do that.
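One common trick, sketched here for scalar inputs only (my own addition, not verified against the reference implementation above), is to work in log space with scipy.special.gammaln and logsumexp, so that no individual factorial or power is ever formed:
import numpy as np
from scipy.special import gammaln, logsumexp
def log_pmf_scalar(theta1, theta2, theta0, x, y):
    # log of the constant part: -(t1 + t2 + t0) + x*log(t1) - log(x!) + y*log(t2) - log(y!)
    log_const = (-(theta1 + theta2 + theta0)
                 + x * np.log(theta1) - gammaln(x + 1)
                 + y * np.log(theta2) - gammaln(y + 1))
    i = np.arange(min(x, y) + 1)
    # log of each summation term: log[C(x, i) * C(y, i) * i! * (t0 / (t1 * t2))**i]
    log_terms = (gammaln(x + 1) - gammaln(i + 1) - gammaln(x - i + 1)
                 + gammaln(y + 1) - gammaln(i + 1) - gammaln(y - i + 1)
                 + gammaln(i + 1)
                 + i * (np.log(theta0) - np.log(theta1) - np.log(theta2)))
    return log_const + logsumexp(log_terms)
# exponentiate only at the very end, when the combined value is back in range
print(np.exp(log_pmf_scalar(100.0, 100.0, 1.0, 90, 95)))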
I was trying to match, in Python, the orthogonal polynomials produced by the following R code:
X <- cbind(1, poly(x = x, degree = 9))
To do this I implemented my own method for generating orthogonal polynomials:
def get_hermite_poly(x,degree):
#scipy.special.hermite()
N, = x.shape
##
X = np.zeros( (N,degree+1) )
for n in range(N):
for deg in range(degree+1):
X[n,deg] = hermite( n=deg, z=float(x[deg]) )
return X
though it does not seem to match. Does anyone know what type of orthogonal polynomial poly uses? I tried searching the documentation, but it doesn't say.
To give some context, I am trying to implement the following R code in Python (https://stats.stackexchange.com/questions/313265/issue-with-convergence-with-sgd-with-function-approximation-using-polynomial-lin/315185#comment602020_315185):
set.seed(1234)
N <- 10
x <- seq(from = 0, to = 1, length = N)
mu <- sin(2 * pi * x * 4)
y <- mu
plot(x,y)
X <- cbind(1, poly(x = x, degree = 9))
# X <- sapply(0:9, function(i) x^i)
w <- rnorm(10)
learning_rate <- function(t) .1 / t^(.6)
n_samp <- 2
for(t in 1:100000) {
mu_hat <- X %*% w
idx <- sample(1:N, n_samp)
X_batch <- X[idx,]
y_batch <- y[idx]
score_vec <- t(X_batch) %*% (y_batch - X_batch %*% w)
change <- score_vec * learning_rate(t)
w <- w + change
}
plot(mu_hat, ylim = c(-1, 1))
lines(mu)
fit_exact <- predict(lm(y ~ X - 1))
lines(fit_exact, col = 'red')
abs(w - coef(lm(y ~ X - 1)))
because it seems to be the only formulation that works for gradient descent on linear regression with polynomial features.
I feel that any orthogonal (or at least orthonormal) polynomial basis should work and give a Hessian with condition number 1, but I can't seem to make it work in Python. Related question: How does one use Hermite polynomials with Stochastic Gradient Descent (SGD)?
poly uses QR factorization, as described in some detail in this answer.
I think that what you really seem to be looking for is how to replicate the output of R's poly using python.
Here I have written a function to do that, based on R's implementation. I have also added some comments so that you can see what the equivalent statements in R look like:
import numpy as np
def poly(x, degree):
xbar = np.mean(x)
x = x - xbar
# R: outer(x, 0L:degree, "^")
X = x[:, None] ** np.arange(0, degree+1)
#R: qr(X)$qr
q, r = np.linalg.qr(X)
#R: r * (row(r) == col(r))
z = np.diag((np.diagonal(r)))
# R: Z = qr.qy(QR, z)
Zq, Zr = np.linalg.qr(q)
Z = np.matmul(Zq, z)
# R: colSums(Z^2)
norm1 = (Z**2).sum(0)
#R: (colSums(x * Z^2)/norm2 + xbar)[1L:degree]
alpha = ((x[:, None] * (Z**2)).sum(0) / norm1 +xbar)[0:degree]
# R: c(1, norm2)
norm2 = np.append(1, norm1)
# R: Z/rep(sqrt(norm1), each = length(x))
Z = Z / np.reshape(np.repeat(norm1**(1/2.0), repeats = x.size), (-1, x.size), order='F')
#R: Z[, -1]
Z = np.delete(Z, 0, axis=1)
return [Z, alpha, norm2];
Checking that this works:
x = np.arange(10) + 1
degree = 9
poly(x, degree)
The first row of the returned matrix is
[-0.49543369, 0.52223297, -0.45342519, 0.33658092, -0.21483446,
0.11677484, -0.05269379, 0.01869894, -0.00453516],
compared to the same operation in R
poly(1:10, 9)
# [1] -0.495433694 0.522232968 -0.453425193 0.336580916 -0.214834462
# [6] 0.116774842 -0.052693786 0.018698940 -0.004535159
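As a usage note (my own addition), to mirror the original R call X <- cbind(1, poly(x = x, degree = 9)) you can prepend a column of ones to the returned Z:
import numpy as np
x = np.linspace(0, 1, 10)
Z, alpha, norm2 = poly(x, degree=9)
X = np.column_stack([np.ones(x.size), Z])   # equivalent of cbind(1, poly(x, 9))
print(X.shape)                              # (10, 10)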
I am trying to compute the Fresnel integral over a grid of coordinates using dblquad, but it's taking very long and in the end it doesn't produce a result.
Below is my code. In this code I integrated only over a 10 x 10 grid but I need to integrate at least over a 500 x 500 grid.
import time
st = time.time()
import pylab
import scipy.integrate as inte
import numpy as np
print('imhere 0')
def sinIntegrand(y,x, X , Y):
a = 0.0001
R = 2e-3
z = 10e-3
Lambda = 0.5e-6
alpha = 0.01
k = np.pi * 2 / Lambda
return np.cos(k * (((x-R)**2)*a + (R-(x**2 + y**2)) * np.tan(np.radians(alpha)) + ((x - X)**2 + (y - Y)**2) / (2 * z)))
print('im here 1')
def cosIntegrand(y,x,X,Y):
a = 0.0001
R = 2e-3
z = 10e-3
Lambda = 0.5e-6
alpha = 0.01
k = np.pi * 2 / Lambda
return np.sin(k * (((x-R)**2)*a + (R-(x**2 + y**2)) * np.tan(np.radians(alpha)) + ((x - X)**2 + (y - Y)**2) / (2 * z)))
def y1(x,R = 2e-3):
return (R**2 - x**2)**0.5
def y2(x, R = 2e-3):
return -1*(R**2 - x**2)**0.5
points = np.linspace(-1e-3,1e-3,10)
points2 = np.linspace(1e-3,-1e-3,10)
yv,xv = np.meshgrid(points , points2)
#def integrate_on_grid(func, lo, hi,y1,y2):
# """Returns a callable that can be evaluated on a grid."""
# return np.vectorize(lambda n,m: dblquad(func, lo, hi,y1,y2,(n,m))[0])
#
#intensity = abs(integrate_on_grid(sinIntegrand,-1e-3 ,1e-3,y1, y2)(yv,xv))**2 + abs(integrate_on_grid(cosIntegrand,-1e-3 ,1e-3,y1, y2)(yv,xv))**2
Intensity = []
print('im here2')
for i in points:
row = []
for j in points2:
        print('im here')
intensity = abs(inte.dblquad(sinIntegrand,-1e-3 ,1e-3,y1, y2,(i,j))[0])**2 + abs(inte.dblquad(cosIntegrand,-1e-3 ,1e-3,y1, y2,(i,j))[0])**2
row.append(intensity)
Intensity.append(row)
Intensity = np.asarray(Intensity)
pylab.imshow(Intensity,cmap = 'gray')
pylab.show()
print(time.time() - st)
I would really appreciate it if you could suggest a better way of doing this.
Using a scipy.integrate.dblquad to calculate every pixel of your image is going to be slow in any case.
You should try rewriting your mathematical problem so you can use some classical function from scipy.special instead. For instance, scipy.special.fresnel might work, although it is 1D and your problem seems to be 2D. Otherwise, note that there is a relationship between the Fresnel integral and the incomplete gamma function (scipy.special.gammainc), if that helps.
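For reference, the 1D Fresnel integrals are available directly; whether they can be mapped onto your 2D geometry is something you would have to work out, so this snippet is only an illustration of the API:
import numpy as np
from scipy.special import fresnel
t = np.linspace(0, 5, 1000)
S, C = fresnel(t)   # S(t) = integral of sin(pi*u**2/2), C(t) = integral of cos(pi*u**2/2), from 0 to t
print(S[-1], C[-1])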
If none of this works, as a last resort you can spend time optimizing your code and porting it to Cython. That will probably give a speed-up of a factor of 10 to 100 (see this answer), though it still may not be enough to go from a 10 x 10 grid to a 500 x 500 grid.
How can I do a maximum likelihood regression using scipy.optimize.minimize? I specifically want to use the minimize function here, because I have a complex model and need to add some constraints. I am currently trying a simple example using the following:
import numpy as np
from scipy.optimize import minimize
def lik(parameters):
m = parameters[0]
b = parameters[1]
sigma = parameters[2]
for i in np.arange(0, len(x)):
y_exp = m * x + b
L = sum(np.log(sigma) + 0.5 * np.log(2 * np.pi) + (y - y_exp) ** 2 / (2 * sigma ** 2))
return L
x = [1,2,3,4,5]
y = [2,3,4,5,6]
lik_model = minimize(lik, np.array([1,1,1]), method='L-BFGS-B', options={'disp': True})
When I run this, convergence fails. Does anyone know what is wrong with my code?
The message I get when running this is 'ABNORMAL_TERMINATION_IN_LNSRCH'. I am using the same algorithm that I have working with optim in R.
Thank you, Aleksander. You were correct that my likelihood function was wrong, not the code. Using a formula I found on Wikipedia, I adjusted the code to:
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
def lik(parameters):
m = parameters[0]
b = parameters[1]
sigma = parameters[2]
for i in np.arange(0, len(x)):
y_exp = m * x + b
L = (len(x)/2 * np.log(2 * np.pi) + len(x)/2 * np.log(sigma ** 2) + 1 /
(2 * sigma ** 2) * sum((y - y_exp) ** 2))
return L
x = np.array([1,2,3,4,5])
y = np.array([2,5,8,11,14])
lik_model = minimize(lik, np.array([1,1,1]), method='L-BFGS-B')
plt.scatter(x,y)
plt.plot(x, lik_model['x'][0] * x + lik_model['x'][1])
plt.show()
Now it seems to be working.
Thanks for the help!
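As a sanity check (my own addition, not part of the original thread), the maximum-likelihood slope and intercept under a Gaussian noise model should match an ordinary least-squares fit:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 5, 8, 11, 14])
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)   # should be close to lik_model['x'][0] and lik_model['x'][1]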
I am performing a least squares regression as below (univariate). I would like to express the significance of the result in terms of R^2. Numpy returns the unscaled residual; what would be a sensible way of normalizing it?
field_clean,back_clean = rid_zeros(backscatter,field_data)
num_vals = len(field_clean)
x = field_clean[:,row:row+1]
y = 10*log10(back_clean)
A = hstack([x, ones((num_vals,1))])
soln = lstsq(A, y)
m, c = soln[0]
residues = soln[1]
print(residues)
See http://en.wikipedia.org/wiki/Coefficient_of_determination
Your R^2 value is
1 - residual / sum((y - y.mean())**2)
which is equivalent to
1 - residual / (n * y.var())
As an example:
import numpy as np
# Make some data...
n = 10
x = np.arange(n)
y = 3 * x + 5 + np.random.random(n)
# Note that polyfit is an easier way to do this...
# It would just be "model, resid = np.polyfit(x,y,1,full=True)[:2]"
A = np.vstack((x, np.ones(n))).T
model, resid = np.linalg.lstsq(A, y, rcond=None)[:2]
r2 = 1 - resid / (y.size * y.var())
print(r2)
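As a cross-check (my own addition), scipy.stats.linregress reports the correlation coefficient directly, and squaring it gives the same R^2 for a univariate fit:
from scipy.stats import linregress
result = linregress(x, y)
print(result.rvalue ** 2)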