Numpy find covariance of two 2-dimensional ndarray - python

I am new to numpy and am stuck at this problem.
I have two 2-dimensional numpy arrays such as
x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))
I want to use numpy's cov function to find the covariance of these two ndarrays row-wise, i.e., for the above example the output array should consist of 10 elements, each denoting the covariance of the corresponding rows of the two ndarrays. I know I can do this by traversing the rows and finding the covariance of two 1D arrays, but it isn't pythonic.
Edit 1: By the covariance of two arrays I mean the element at index (0, 1) of the 2x2 covariance matrix returned by numpy.cov.
Edit2: Currently this is my implementation
s = numpy.empty((x.shape[0], 1))
for i in range(x.shape[0]):
    s[i] = numpy.cov(x[i], y[i])[0][1]

Use the definition of the covariance: E(XY) - E(X)E(Y).
import numpy as np
x = np.random.random((10, 5))
y = np.random.random((10, 5))
n = x.shape[1]
cov_bias = np.mean(x * y, axis=1) - np.mean(x, axis=1) * np.mean(y, axis=1)
cov_bias * n / (n-1)
Note that cov_bias corresponds to the result of numpy.cov(bias=True); multiplying by n / (n-1) applies Bessel's correction so it matches numpy.cov's default (bias=False).
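A quick sanity check of this formula against the row-wise loop from the question (a minimal sketch):
import numpy as np

np.random.seed(0)
x = np.random.random((10, 5))
y = np.random.random((10, 5))
n = x.shape[1]

# biased (population) covariance per row, then Bessel's correction
cov_bias = np.mean(x * y, axis=1) - np.mean(x, axis=1) * np.mean(y, axis=1)
cov_unbiased = cov_bias * n / (n - 1)

# per-row loop from the question
expected = np.array([np.cov(x[i], y[i])[0, 1] for i in range(x.shape[0])])
print(np.allclose(cov_unbiased, expected))   # True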

Here's one using the definition of covariance and inspired by corr2_coeff_rowwise -
def covariance_rowwise(A, B):
    # Row-wise mean of input arrays & subtract from the input arrays themselves
    A_mA = A - A.mean(-1, keepdims=True)
    B_mB = B - B.mean(-1, keepdims=True)

    # Finally get covariance
    N = A.shape[1]
    return np.einsum('ij,ij->i', A_mA, B_mB) / (N - 1)
Sample run -
In [66]: np.random.seed(0)
    ...: x = np.random.random((10, 5))
    ...: y = np.random.random((10, 5))

In [67]: s = np.empty((x.shape[0]))
    ...: for i in range(x.shape[0]):
    ...:     s[i] = np.cov(x[i], y[i])[0][1]

In [68]: np.allclose(covariance_rowwise(x, y), s)
Out[68]: True

This works, but I'm not sure if it is faster for larger matrices x and y: the call numpy.cov(x, y) computes many entries that we then discard with numpy.diag:
import numpy

x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))

# with loop
for (xi, yi) in zip(x, y):
    print(numpy.cov(xi, yi)[0][1])

# vectorized
cov_mat = numpy.cov(x, y)
covariances = numpy.diag(cov_mat, x.shape[0])
print(covariances)
I also did some timing for square matrices of size n x n:
import time

import numpy

def run(n):
    x = numpy.random.random((n, n))
    y = numpy.random.random((n, n))

    started = time.time()
    for (xi, yi) in zip(x, y):
        numpy.cov(xi, yi)[0][1]
    needed_loop = time.time() - started

    started = time.time()
    cov_mat = numpy.cov(x, y)
    covariances = numpy.diag(cov_mat, x.shape[0])
    needed_vectorized = time.time() - started

    print(
        f"n={n:4d} needed_loop={needed_loop:.3f} s "
        f"needed_vectorized={needed_vectorized:.3f} s"
    )

for n in (100, 200, 500, 600, 700, 1000, 2000, 3000):
    run(n)
The output on my slow MacBook Air is:
n= 100 needed_loop=0.006 s needed_vectorized=0.001 s
n= 200 needed_loop=0.011 s needed_vectorized=0.003 s
n= 500 needed_loop=0.033 s needed_vectorized=0.023 s
n= 600 needed_loop=0.041 s needed_vectorized=0.039 s
n= 700 needed_loop=0.043 s needed_vectorized=0.049 s
n=1000 needed_loop=0.061 s needed_vectorized=0.130 s
n=2000 needed_loop=0.137 s needed_vectorized=0.742 s
n=3000 needed_loop=0.224 s needed_vectorized=2.264 s
so the break-even point is around n=600.

Pick the diagonal vector of cov(x,y) and expand dims:
numpy.expand_dims(numpy.diag(numpy.cov(x,y),x.shape[0]),1)
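For context: numpy.cov(x, y) stacks x and y into a single (2*m, n) array of row-variables, so its output is a (2m, 2m) matrix, and the diagonal at offset m holds cov(x row i, y row i). A small shape check (a minimal sketch):
import numpy

x = numpy.random.random((10, 5))
y = numpy.random.random((10, 5))

cov_mat = numpy.cov(x, y)                                        # shape (20, 20)
covariances = numpy.expand_dims(numpy.diag(cov_mat, x.shape[0]), 1)
print(cov_mat.shape, covariances.shape)                          # (20, 20) (10, 1)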

Related

More pythonic way of creating within-class scatter matrix

I am looking for a better way of calculating the following
import numpy as np
np.random.seed(123)
# test code
t = np.random.randint(3, size = 100)
X = np.random.random((100, 3))
m = np.random.random((3, 3))
# current method
res = 0
for k in np.unique(t):
    for row in X[t == k] - m[k]:
        res += np.outer(row, row)
res
"""
Output:
array([[12.45661335, -3.51124346,  3.75900294],
       [-3.51124346, 14.85327689, -3.02281263],
       [ 3.75900294, -3.02281263, 18.30868772]])
"""
I would prefer getting rid of the for loops using numpy.
This is the within-class scatter matrix for Fisher's linear discriminant.
You can write as follows:
Y = X - m[t]
np.matmul(Y.T, Y)
This is because sum_i x_i x_i' = X' X, where X is an (N, 3) matrix and x_i = X[i,:], i.e. the i-th row of X, and ' indicates the transpose.
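A quick check that the vectorised form matches the double loop from the question (a minimal sketch reusing the same t, X and m):
import numpy as np

np.random.seed(123)
t = np.random.randint(3, size=100)
X = np.random.random((100, 3))
m = np.random.random((3, 3))

# loop version from the question
res = 0
for k in np.unique(t):
    for row in X[t == k] - m[k]:
        res += np.outer(row, row)

# vectorised version
Y = X - m[t]
print(np.allclose(res, np.matmul(Y.T, Y)))   # True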

Creating a matrix of random data in Python

I am trying to create a matrix in python that is 30 × 10 and has randomly generated numbers inside of it. But my numbers in the matrix have to follow the condition:
Randomly generate 30 data points from the sine function, where each data point (x,y) has the form
x = [x0, x1, x2,..., x10], x ∈ [0, 2π]
y = sin(x) + ε, ε ∈ N(0,0.3)
How might I be able to go about this?
Right now I only have a 1 × 10 matrix
def generate_sin_data():
    x = np.random.rand()
    y = np.sin(x)
    features = [x**0, x**1, x**2, x**3, x**4, x**5, x**6, x**7, x**8, x**9, x**10]
    return x, y, features
I'm not 100% certain I follow everything, but we can break it down. Here's how you can generate 30 random numbers between 0 and 2π:
import numpy as np
x = np.random.random(30) * 2*np.pi
Here, x is a 1D array of 30 numbers. Check this with x.shape.
Now if you add a dimension, it's easy to generate a matrix of powers up to 10 using NumPy's broadcasting feature. The question seems to ask for 11 powers (0 to 10), not 10, so I'll do that:
X = x.reshape(-1, 1) ** np.arange(0, 11)
That reshape effectively turns x into a column vector. Now check X.shape and it's (30, 11), which is what I think you were after. Notice we use a big X for a matrix — this convention will help you keep track of things. Each column of X is the original function raised to a power from 0 to 10. (Note that each column comes from the same set of random numbers — I'm not sure if that's what you want?)
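A quick shape check of that broadcasting step (a minimal sketch):
import numpy as np

x = np.random.random(30) * 2*np.pi
X = x.reshape(-1, 1) ** np.arange(0, 11)   # (30, 1) broadcast against (11,)
print(x.shape, X.shape)                    # (30,) (30, 11)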
If you want y as a function of x (the vector) then do it like so; the question asks for ε ~ N(0, 0.3), i.e. Gaussian noise, so use np.random.normal rather than uniform noise:
ϵ = np.random.normal(0, 0.3, 30)
y = np.sin(x) + ϵ
import numpy as np
# 30 random uniform values in [0, 2*pi)
_x = np.random.uniform(0, 2*np.pi, 30)
# matrix of 30x10:
x = np.array([
    [v ** i for i in range(10)]
    for v in _x
])
# random 30x10 normal noise:
eps = np.random.normal(0, 0.3, [30, 10])
# final result 30x10 matrix:
y = np.sin(x) + eps

Arrange and Sub-sample 3d Points in Coordinate Grid With Numpy

I have a list of 3d points such as
np.array([
    [220, 114, 2000],
    [125.24, 214, 2519],
    ...
    [54.1, 254, 1249]
])
The points are in no meaningful order. I'd like to sort and reshape the array in a way that better represents a coordinate grid (such that I have a known width and height and can retrieve Z values by index). I would also like to down-sample the points, to say whole integers, to handle collisions, applying min, max, or mean during the down-sampling.
I know I can down-sample a 1D array using np.mean and np.shape.
The approach I'm currently using finds the min and max in X,Y and then puts the Z values into a 2d array while doing the down sampling manually.
This iterates the giant array numerous times and I'm wondering if there is a way to do this with np.meshgrid or some other numpy functionality that I'm overlooking.
Thanks
You can use the binning method from Most efficient way to sort an array into bins specified by an index array?
To get an index array from y, x coordinates you can use np.searchsorted and np.ravel_multi_index.
Here is a sample implementation; the stb module is the code from the linked post.
import numpy as np
from stb import sort_to_bins_sparse as sort_to_bins

def grid1D(u, N):
    mn, mx = u.min(), u.max()
    return np.linspace(mn, mx, N, endpoint=False)

def gridify(yxz, N):
    try:
        Ny, Nx = N
    except TypeError:
        Ny = Nx = N
    y, x, z = yxz.T
    yg, xg = grid1D(y, Ny), grid1D(x, Nx)
    yidx, xidx = yg.searchsorted(y, 'right')-1, xg.searchsorted(x, 'right')-1
    yx = np.ravel_multi_index((yidx, xidx), (Ny, Nx))
    zs = sort_to_bins(yx, z)
    return np.concatenate([[0], np.bincount(yx).cumsum()]), zs, yg, xg

def bin(yxz, N, binning_method='min'):
    boundaries, binned, yg, xg = gridify(yxz, N)
    result = np.full((yg.size, xg.size), np.nan)
    if binning_method == 'min':
        result.reshape(-1)[:len(boundaries)-1] = np.minimum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'max':
        result.reshape(-1)[:len(boundaries)-1] = np.maximum.reduceat(binned, boundaries[:-1])
    elif binning_method == 'mean':
        result.reshape(-1)[:len(boundaries)-1] = np.add.reduceat(binned, boundaries[:-1]) / np.diff(boundaries)
    else:
        raise ValueError
    result.reshape(-1)[np.where(boundaries[1:] == boundaries[:-1])] = np.nan
    return result

def test():
    yxz = np.random.uniform(0, 100, (100000, 3))
    N = 20
    boundaries, binned, yg, xg = gridify(yxz, N)
    binmin = bin(yxz, N)
    binmean = bin(yxz, N, 'mean')
    y, x, z = yxz.T
    for i in range(N-1):
        for j in range(N-1):
            msk = (y>=yg[i]) & (y<yg[i+1]) & (x>=xg[j]) & (x<xg[j+1])
            assert (z[msk].min() == binmin[i, j]) if msk.any() else np.isnan(binmin[i, j])
            assert np.isclose(z[msk].mean(), binmean[i, j]) if msk.any() else np.isnan(binmean[i, j])
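Note that sort_to_bins comes from the stb module in the linked post and is not reproduced above. For a self-contained run, a minimal stand-in (my own sketch, not the linked post's optimized sparse-matrix version) only needs to reorder z so that values sharing a flat bin index end up contiguous and in ascending bin order, which is what the bincount/reduceat logic in gridify and bin expects:
import numpy as np

def sort_to_bins(idx, vals):
    # a stable argsort over the flat bin indices groups the values of each
    # bin together, in ascending bin order, matching np.bincount(idx).cumsum()
    return vals[np.argsort(idx, kind='stable')]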

Fully vectorise numpy polyfit

Overview
I am running into issues with performance using polyfit because it doesn't appear able to accept broadcast arrays. I am aware from this post that the dependent data y can be multidimensional if you use numpy.polynomial.polynomial.polyfit. However, the x dimension cannot be multidimensional. Is there any way around this?
Motivation
I need to compute the rate of change of some data. To match with an experiment I want to use the following method: take the data y and x, fit a polynomial to short sections of the data, then use the fitted coefficient as an estimate of the rate of change.
Illustration
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = np.linspace(0, 10, n)
y = np.sin(x)
window_length = 10
ydot = [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
        for j in range(n - window_length)]
x_mids = [x[j + window_length//2] for j in range(n - window_length)]
plt.plot(x, y)
plt.plot(x_mids, ydot)
plt.show()
The blue line is the original data (a sine curve), while the green is the first differential (a cosine curve).
The problem
To vectorise this I did the following:
window_length = 10
vert_idx_list = np.arange(0, len(x) - window_length, 1)
hori_idx_list = np.arange(window_length)
A, B = np.meshgrid(hori_idx_list, vert_idx_list)
idx_array = A + B
x_array = x[idx_array]
y_array = y[idx_array]
This broadcasts the two 1D vectors to 2D arrays of shape (n-window_length, window_length). Now I was hoping that polyfit would have an axis argument so I could parallelise the calculation, but no such luck.
Does anyone have any suggestions for how to do this? I am open to alternative approaches.
The way polyfit works is by solving a least-squares problem of the form:
y = [X].a
where y are your dependent coordinates, [X] is the Vandermonde matrix of the corresponding independent coordinates, and a is the vector of fitted coefficients.
In your case you are always computing a 1st-degree polynomial approximation, and are actually only interested in the coefficient of the 1st-degree term. This has a well-known closed-form solution you can find in any statistics book, or produce yourself by creating a 2x2 linear system of equations, premultiplying both sides of the above equation by the transpose of [X]. It all adds up to the value you want to calculate being:
>>> n = 10
>>> x = np.random.random(n)
>>> y = np.random.random(n)
>>> np.polyfit(x, y, 1)[0]
-0.29207474654700277
>>> (n*(x*y).sum() - x.sum()*y.sum()) / (n*(x*x).sum() - x.sum()*x.sum())
-0.29207474654700216
On top of that you have a sliding window running over your data, so you can use something akin to a 1D summed area table as follows:
def sliding_fitted_slope(x, y, win):
    x = np.concatenate(([0], x))
    y = np.concatenate(([0], y))
    Sx = np.cumsum(x)
    Sy = np.cumsum(y)
    Sx2 = np.cumsum(x*x)
    Sxy = np.cumsum(x*y)

    Sx = Sx[win:] - Sx[:-win]
    Sy = Sy[win:] - Sy[:-win]
    Sx2 = Sx2[win:] - Sx2[:-win]
    Sxy = Sxy[win:] - Sxy[:-win]

    return (win*Sxy - Sx*Sy) / (win*Sx2 - Sx*Sx)
With this code you can easily check that (notice I extended the range by 1):
>>> np.allclose(sliding_fitted_slope(x, y, window_length),
...             [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
...              for j in range(n - window_length + 1)])
True
And:
%timeit sliding_fitted_slope(x, y, window_length)
10000 loops, best of 3: 34.5 us per loop
%%timeit
[np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
for j in range(n - window_length + 1)]
100 loops, best of 3: 10.1 ms per loop
So it is about 300x faster for your sample data.
Sorry for answering my own question, but after 20 more minutes of trying to get to grips with it I have the following solution:
ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
One confusing part is that np.polyfit returns the coefficients with the highest power first. In np.polynomial.polynomial.polyfit the highest power is last (hence the -1 instead of 0 index).
Another confusion is that we use only the first slice of x (x_array[0]). I think that this is okay because it is not the absolute values of the independent vector x that are used, but the differences between them; since x is evenly spaced here, every window has the same spacing, so using the first window is just like shifting the reference x value.
If there is a better way to do this I am still happy to hear about it!
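A quick sanity check of that claim (a minimal sketch, assuming the evenly spaced x from the illustration above), comparing the vectorised call against the per-window np.polyfit loop:
import numpy as np

n = 100
x = np.linspace(0, 10, n)
y = np.sin(x)
window_length = 10

# build the (n - window_length, window_length) index array as in the question
idx = np.arange(window_length) + np.arange(n - window_length)[:, None]
x_array, y_array = x[idx], y[idx]

ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
loop = [np.polyfit(x[j:j+window_length], y[j:j+window_length], 1)[0]
        for j in range(n - window_length)]
print(np.allclose(ydot, loop))   # True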
Using an alternative method for calculating the rate of change may be the solution, giving both a speed and an accuracy increase.
n = 1000
x = np.linspace(0, 10, n)
y = np.sin(x)

def timingPolyfit(x, y):
    window_length = 10
    vert_idx_list = np.arange(0, len(x) - window_length, 1)
    hori_idx_list = np.arange(window_length)
    A, B = np.meshgrid(hori_idx_list, vert_idx_list)
    idx_array = A + B
    x_array = x[idx_array]
    y_array = y[idx_array]
    ydot = np.polynomial.polynomial.polyfit(x_array[0], y_array.T, 1)[-1]
    x_mids = [x[j + window_length//2] for j in range(n - window_length)]
    return ydot, x_mids

def timingSimple(x, y):
    dy = (y[2:] - y[:-2])/2
    dx = x[1] - x[0]
    dydx = dy/dx
    return dydx, x[1:-1]

y1, x1 = timingPolyfit(x, y)
y2, x2 = timingSimple(x, y)

polyfitError = np.abs(y1 - np.cos(x1))
simpleError = np.abs(y2 - np.cos(x2))
print("polyfit average error: {:.2e}".format(np.average(polyfitError)))
print("simple average error: {:.2e}".format(np.average(simpleError)))

result = %timeit -o timingPolyfit(x,y)
result2 = %timeit -o timingSimple(x,y)
print("simple is {0} times faster".format(result.best / result2.best))
polyfit average error: 3.09e-03
simple average error: 1.09e-05
100 loops, best of 3: 3.2 ms per loop
100000 loops, best of 3: 9.46 µs per loop
simple is 337.995634151131 times faster
(Plots of the relative error and the results are not reproduced here.)

Scipy Fast 1-D interpolation without any loop

I have two 2D arrays, x(ni, nj) and y(ni, nj), that I need to interpolate over one axis. I want to interpolate along the last axis for every ni.
I wrote
import numpy as np
from scipy.interpolate import interp1d

z = np.asarray([200, 300, 400, 500, 600])
out = []
for i in range(ni):
    f = interp1d(x[i,:], y[i,:], kind='linear')
    out.append(f(z))
out = np.asarray(out)
However, I think this method is inefficient and slow due to the loop if the array size is large. What is the fastest way to interpolate a multi-dimensional array like this? Is there any way to perform linear and cubic interpolation without a loop? Thanks.
The method you propose does have a python loop, so for large values of ni it is going to get slow. That said, unless you are going to have large ni you shouldn't worry much.
I have created sample input data with the following code:
def sample_data(n_i, n_j, z_shape):
    x = np.random.rand(n_i, n_j) * 1000
    x.sort()
    x[:, 0] = 0
    x[:, -1] = 1000
    y = np.random.rand(n_i, n_j)
    z = np.random.rand(*z_shape) * 1000
    return x, y, z
And have tested it with these two versions of linear interpolation:
def interp_1(x, y, z):
    rows, cols = x.shape
    out = np.empty((rows,) + z.shape, dtype=y.dtype)
    for j in range(rows):
        out[j] = interp1d(x[j], y[j], kind='linear', copy=False)(z)
    return out

def interp_2(x, y, z):
    rows, cols = x.shape
    row_idx = np.arange(rows).reshape((rows,) + (1,) * z.ndim)
    col_idx = np.argmax(x.reshape(x.shape + (1,) * z.ndim) > z, axis=1) - 1
    ret = y[row_idx, col_idx + 1] - y[row_idx, col_idx]
    ret /= x[row_idx, col_idx + 1] - x[row_idx, col_idx]
    ret *= z - x[row_idx, col_idx]
    ret += y[row_idx, col_idx]
    return ret
interp_1 is an optimized version of your code, following Dave's answer. interp_2 is a vectorized implementation of linear interpolation that avoids any python loop whatsoever. Coding something like this requires a sound understanding of broadcasting and indexing in numpy, and some things are going to be less optimized than what interp1d does. A prime example is finding the bin in which to interpolate a value: interp1d will surely break out of its search early once it finds the bin, whereas the above function compares the value to all bins.
So the result is going to be very dependent on what n_i and n_j are, and even how long your array z of values to interpolate is. If n_j is small and n_i is large, you should expect an advantage from interp_2, and from interp_1 if it is the other way around. Smaller z should be an advantage to interp_2, longer ones to interp_1.
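Before timing, a quick correctness check that the two versions agree (a minimal sketch using the sample_data helper above):
x, y, z = sample_data(10, 50, (5,))
print(np.allclose(interp_1(x, y, z), interp_2(x, y, z)))   # True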
I have actually timed both approaches with a variety of n_i and n_j, for z of shape (5,) and (50,), here are the graphs:
So it seems that for z of shape (5,) you should go with interp_2 whenever n_j < 1000, and with interp_1 elsewhere. Not surprisingly, the threshold is different for z of shape (50,), now being around n_j < 100. It seems tempting to conclude that you should stick with your code if n_j * len(z) > 5000, but change it to something like interp_2 above if not, but there is a great deal of extrapolating in that statement! If you want to further experiment yourself, here's the code I used to produce the graphs.
import timeit
import matplotlib.pyplot as plt

n_s = np.logspace(1, 3.3, 25)
int_1 = np.empty((len(n_s),) * 2)
int_2 = np.empty((len(n_s),) * 2)
z_shape = (5,)
for i, n_i in enumerate(n_s):
    print(int(n_i))
    for j, n_j in enumerate(n_s):
        x, y, z = sample_data(int(n_i), int(n_j), z_shape)
        int_1[i, j] = min(timeit.repeat('interp_1(x, y, z)',
                                        'from __main__ import interp_1, x, y, z',
                                        repeat=10, number=1))
        int_2[i, j] = min(timeit.repeat('interp_2(x, y, z)',
                                        'from __main__ import interp_2, x, y, z',
                                        repeat=10, number=1))

cs = plt.contour(n_s, n_s, np.transpose(int_1 - int_2))
plt.clabel(cs, inline=1, fontsize=10)
plt.xlabel('n_i')
plt.ylabel('n_j')
plt.title('timeit(interp_2) - timeit(interp_1), z.shape=' + str(z_shape))
plt.show()
One optimization is to allocate the result array once like so:
import numpy as np
from scipy.interpolate import interp1d

z = np.asarray([200, 300, 400, 500, 600])
out = np.zeros([ni, len(z)], dtype=np.float32)
for i in range(ni):
    f = interp1d(x[i,:], y[i,:], kind='linear')
    out[i,:] = f(z)
This will save you some of the memory copying that occurs in your implementation, in the calls to out.append(...).
