Memory-efficient mean pairwise distance - Python

I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead, the mean of each row (or column) can be computed separately, and the final mean can then be computed from the row means.
Is there already a function which exploits this property, or is there an easy way to extend/combine existing functions to do so?
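A minimal sketch of the idea I have in mind (not from any library; it trades the O(n²) memory of pdist for a Python-level loop, so it will be slow):
import numpy as np
from scipy.spatial.distance import cdist

x = np.random.rand(10000, 2)
n = len(x)
total = 0.0
for i in range(n - 1):
    # distances from point i to every later point (one row of the upper triangle)
    total += cdist(x[i:i + 1], x[i + 1:], metric='euclidean').sum()
mean_dist = total / (n * (n - 1) / 2)   # n*(n-1)/2 pairs in total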

If you use the squared version of the distance, its mean equals twice the sum of the per-column variances computed with ddof=1 (i.e. with n - 1 in the denominator):
from scipy.spatial.distance import pdist
import numpy as np

x = np.random.rand(10000, 2)
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum() * 2)
which prints, for example:
0.331474285845873
0.33147428584587346

You will have to weight each row by the number of observations that make up its mean. For example, the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
import numpy as np
from scipy.spatial.distance import pdist

arr = np.arange(6).reshape(3, 2)
arr
array([[0, 1],
       [2, 3],
       [4, 5]])
pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
from sklearn.metrics import pairwise_distances
square = pairwise_distances(arr)
square
array([[0.        , 2.82842712, 5.65685425],
       [2.82842712, 0.        , 2.82842712],
       [5.65685425, 2.82842712, 0.        ]])
square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is also the pairwise_distances_chunked function, which can be used to iterate over the distance matrix row by row, but you need to keep track of the row index to make sure you only take the mean of values in the upper/lower triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I imagine it will introduce a significant slowdown.
from sklearn.metrics import pairwise_distances_chunked

tot = (arr.shape[0] ** 2 - arr.shape[0]) / 2      # number of unique pairs, n*(n-1)/2
weighted_means = 0
r = 0                                             # row index into the full distance matrix
for chunk in pairwise_distances_chunked(arr):     # each chunk is a block of full rows
    for row in chunk:
        # slice from the diagonal onwards; the diagonal entry is 0,
        # so it does not change the weighted sum
        sm = row[r:].mean()
        wgt = (row.shape[0] - r) / tot
        weighted_means += sm * wgt
        r += 1
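As a quick sanity check on a small array (small enough that pdist still fits in memory):
from scipy.spatial.distance import pdist
print(weighted_means)      # mean built incrementally from the chunks
print(pdist(arr).mean())   # reference value; the two should agree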

Related

Is there a way to vectorize the calculation of correlation coefficients from this Numpy array?

This code computes the Pearson correlation coefficient for all possible pairs of L=45 element vectors taken from a stack of M=102272. The result is a symmetric MxM matrix occupying about 40 GB of memory. The memory requirement isn't a problem for my computer, but I estimate from test runs that the ~5 billion passes through the inner loop will take a good 2-3 days to complete. My question: Is there a straightforward way to vectorize the inner loop to speed things up significantly?
# L = 45
# M = 102272
# data[M,L] (type 'float32')
cmat = np.zeros((M, M))
for i in range(M):
    v1 = data[i, :]
    z1 = (v1 - np.average(v1)) / np.std(v1)
    for j in range(i + 1):
        v2 = data[j, :]
        z2 = (v2 - np.average(v2)) / np.std(v2)
        cmat[i, j] = cmat[j, i] = z1.dot(z2) / L
NumPy already has a built-in function that computes the correlation matrix, np.corrcoef. Just use it!
>>> import numpy as np
>>> rng = np.random.default_rng(seed=42)
>>> xarr = rng.random((3, 3))
>>> xarr
array([[0.77395605, 0.43887844, 0.85859792],
       [0.69736803, 0.09417735, 0.97562235],
       [0.7611397 , 0.78606431, 0.12811363]])
>>> R1 = np.corrcoef(xarr)
>>> R1
array([[ 1.        ,  0.99256089, -0.68080986],
       [ 0.99256089,  1.        , -0.76492172],
       [-0.68080986, -0.76492172,  1.        ]])
Documentation link
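Applied to an array shaped like the one in the question ((M, L), one 45-sample vector per row), the whole double loop collapses to a single call. A sketch with a smaller stand-in array (note that np.corrcoef still materializes the full M x M result, so for M = 102272 the memory cost stays around the ~40 GB mentioned in the question):
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((1000, 45), dtype=np.float32)   # stand-in for the (M, L) data array
cmat = np.corrcoef(data)                          # (1000, 1000) correlation matrix, one row per variable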

Calculating variance within cells across matrices in Python

I have several matrices with identical dimensions, X and Y. I want to calculate the variance for each cell across the matrices, such that the resulting output matrix also has dimensions X and Y. For example:
matrix1 = [[1,1,1], [2,2,2], [3,3,3]]
matrix2 = [[2,2,2], [3,3,3], [4,4,4]]
matrix3 = [[3,3,3], [4,4,4], [5,5,5]]
Using position (0,0) of each matrix as an example, I need to first calculate the mean, which would be (1+2+3)/3 = 2:
matrix_sum = matrix1 + matrix2 + matrix3
matrix_mean = matrix_sum / 3
Next I'd calculate the population variance, which for that cell would be:
[(1-2)^2 + (2-2)^2 + (3-2)^2] / 3
And I'd like to be able to do this for an indeterminate (but small) number of matrices (say 50), where the matrices themselves are at most 250 x 250 (they will always be square).
for x in range(1, matrix_mean.shape[0]):
    for y in range(1, matrix_mean.shape[1]):
        standard_deviation_matrix.iat[x, y] = (pow(matrix_mean.iat[x, y] - matrix1.iat[x, y], 2)
                                               + pow(matrix_mean.iat[x, y] - matrix2.iat[x, y], 2)
                                               + pow(matrix_mean.iat[x, y] - matrix3.iat[x, y], 2))
standard_deviation_matrix = standard_deviation_matrix / (3 - 1)
Here, matrix_mean is just (matrix1 + matrix2 + ... + matrixN) / N (i.e. the mean of each cell across the matrices).
This seems to work, but it's super slow and super clunky; it's basically how I'd do it in C. Is there an easier/better/more Pythonic way to do this?
Thanks
You can try:
import numpy as np

all_mat = np.stack([matrix1, matrix2, matrix3])
mat_mean = all_mat.mean(axis=0)       # per-cell mean, if you also need it
variance = np.var(all_mat, axis=0)    # per-cell population variance
Which gives you:
array([[0.66666667, 0.66666667, 0.66666667],
       [0.66666667, 0.66666667, 0.66666667],
       [0.66666667, 0.66666667, 0.66666667]])
Or for the std:
np.std(all_mat, axis=0)
And you get:
array([[0.81649658, 0.81649658, 0.81649658],
       [0.81649658, 0.81649658, 0.81649658],
       [0.81649658, 0.81649658, 0.81649658]])
Convert each matrix into a numpy array, stack the arrays (this will add another dimension), and calculate the variance along that dimension:
m1 = np.array(matrix1)
...
m = np.stack([m1, m2, ...])
m.var(axis=0)
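For example, with the three matrices from the question this gives the same 2/3 in every cell as the answer above:
import numpy as np

matrix1 = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
matrix2 = [[2, 2, 2], [3, 3, 3], [4, 4, 4]]
matrix3 = [[3, 3, 3], [4, 4, 4], [5, 5, 5]]

m = np.stack([np.array(matrix1), np.array(matrix2), np.array(matrix3)])
print(m.var(axis=0))   # population variance (ddof=0), 0.66666667 everywhere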

Vectorized calculation of scaled/rotated pairwise squared euclidean distance

Given a set of n vectors of dimension d stored in an (n, d) array, and a second set of m vectors of the same dimension (stored in an (m, d) array), I want to calculate the squared pairwise distance between the vectors, scaled by some matrix A of size (d, d).
The output should be a (n,m) array.
I expect m and n to be somewhere between 1 and 10,000, and d between 1 and 100.
The distance between two points x and y is given by d(x, y) = (x - y)^T A (x - y).
In non-optimized, but working, Python code this looks like this:
import numpy as np

v1 = np.array([[1, 2],
               [3, 4],
               [4, 5]])
v2 = np.array([[1, 1],
               [2, 2],
               [2, 2],
               [0, 0]])
A = np.array([[1, 0], [2, 3]])

d = np.zeros((3, 4))
for i in range(0, 3):
    for j in range(0, 4):
        d[i, j] = (v1[i, :] - v2[j, :]).T @ A @ (v1[i, :] - v2[j, :])
The squared distance between the example points is:
d = [[  3.   1.   1.  17.]
     [ 43.  17.  17.  81.]
     [ 81.  43.  43. 131.]]
Is there a version of this that avoids the nested Python loop, e.g. using broadcasting black magic?
EDIT:
For the case
A = np.array([[1,0], [0, 1]])
this is the ordinary squared Euclidean distance, which can be calculated e.g. with
from scipy.spatial.distance import cdist
cdist(v1,v2,'sqeuclidean')
We can use np.einsum -
V = v1[:,None,:]-v2
d_out = np.einsum('ijk,kl,ijl->ij',V,A,V)
Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
Explanation on the vectorized method
Original code was -
d[i,j] = (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:])
I. We are translating :
v1[i,:] - v2[j,:]
to the outer operation with broadcasting :
v1[:,None,:]-v2
Schematically put :
v1[:,None,:] : n x 1 x d
v2           :     m x d
output, V    : n x m x d
More info on this outer-style operation, and on broadcasting, can be found in the NumPy docs.
II. Next up, (v1[i,:] - v2[j,:]).T @ A @ (v1[i,:] - v2[j,:]) with the new V becomes np.einsum('ijk,kl,ijl->ij', V, A, V) using einsum's string notation. More info can be found in the docs.
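A quick check of the einsum expression against the loop from the question (same v1, v2 and A):
import numpy as np

v1 = np.array([[1, 2], [3, 4], [4, 5]])
v2 = np.array([[1, 1], [2, 2], [2, 2], [0, 0]])
A = np.array([[1, 0], [2, 3]])

V = v1[:, None, :] - v2
d_out = np.einsum('ijk,kl,ijl->ij', V, A, V)
print(d_out)
# [[  3   1   1  17]
#  [ 43  17  17  81]
#  [ 81  43  43 131]]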

How to plot pairwise distances of two-dimensional vectors?

I have a set of data in Python with columns:
x y angle
I want to calculate the distance between every possible pair of points and plot these distances against the difference between the two angles.
import numpy as np
import matplotlib.pyplot as plt

x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)
n = 0
f = ((x[n] - x[n+1:])**2 + (y[n] - y[n+1:])**2)**0.5
d = a[n] - a[n+1:]
plt.scatter(f, d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is: can I set n = [1, 2, 3, ..., 255] and repeat the calculation to get f and d for all possible pairs?
You can obtain the pairwise distances through broadcasting by considering it as an outer operation on the array of 2-dimensional vectors as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
...: x = np.array([1, 2, 3])
...: y = np.array([3, 4, 6])
...: vecs = np.stack((x, y)).T
...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
...:
Out[1]:
array([[ 0.        ,  1.41421356,  3.60555128],
       [ 1.41421356,  0.        ,  2.23606798],
       [ 3.60555128,  2.23606798,  0.        ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
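If scipy is available, the same matrix can also be obtained from the condensed pdist output; this is just an equivalent route, not required for the broadcasting approach:
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([1, 2, 3])
y = np.array([3, 4, 6])
vecs = np.stack((x, y)).T
print(squareform(pdist(vecs)))   # same (i, j) distance matrix as above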
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
...: a[np.newaxis, :] - a[: , np.newaxis]
...:
Out[2]:
array([[ 0,  2,  5],
       [-2,  0,  3],
       [-5, -3,  0]])
Moreover, plt.scatter does not care that the results are given as matrices, so putting everything together using the notation of the question, you can obtain the plot of angle differences against distances by doing something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = angle[np.newaxis, :] - angle[: , np.newaxis]
plt.scatter(f, d)
You can use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 in the last iteration, that index will already be outside of your list.
Also note that x[n] - x[n+1:] only works because x is a NumPy array; with plain Python lists you cannot subtract a list from a single value.
Maybe you will even have to use two nested loops to do what you want. I guess you want to calculate the distance between each pair of points, so a two-dimensional array may be the data structure you want.
If you are interested in all combinations of the points in x and y I suggest to use itertools, which will give you all possible combinations. Then you can do it like follows:
import itertools
f = [((x[i]-x[j])**2 + (y[i]-y[j])**2)**0.5 for i, j in itertools.product(range(255), range(255)) if i != j]
# and similar for the angles
But maybe there is even an easier way...
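A small variant of the same idea uses itertools.combinations, which yields each unordered pair exactly once (sketched here with random stand-in data, since the log file isn't available):
import itertools
import numpy as np

x = np.random.rand(255)
y = np.random.rand(255)
a = np.random.rand(255)

pairs = list(itertools.combinations(range(len(x)), 2))
f = [((x[i] - x[j])**2 + (y[i] - y[j])**2)**0.5 for i, j in pairs]
d = [a[i] - a[j] for i, j in pairs]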

Fastest way to compute entropy of each numpy array row?

I have an array of size MxN and I would like to compute the entropy value of each row. What would be the fastest way to do so?
scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.
Here's an example. First, create an array p of positive values whose rows sum to 1:
In [23]: np.random.seed(123)
In [24]: x = np.random.rand(3, 10)
In [25]: p = x/x.sum(axis=1, keepdims=True)
In [26]: p
Out[26]:
array([[ 0.12798052,  0.05257987,  0.04168536,  0.1013075 ,  0.13220688,
         0.07774843,  0.18022149,  0.1258417 ,  0.08837421,  0.07205402],
       [ 0.08313743,  0.17661773,  0.1062474 ,  0.01445742,  0.09642919,
         0.17878489,  0.04420998,  0.0425045 ,  0.12877228,  0.1288392 ],
       [ 0.11793032,  0.15790292,  0.13467074,  0.11358463,  0.13429674,
         0.06003561,  0.06725376,  0.0424324 ,  0.05459921,  0.11729367]])
In [27]: p.shape
Out[27]: (3, 10)
In [28]: p.sum(axis=1)
Out[28]: array([ 1., 1., 1.])
Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).
In [29]: from scipy.special import entr
In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731, 2.14586635, 2.22486581])
In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434, 3.09583074, 3.20980287])
If you don't want the dependency on scipy, you can use the explicit formula:
In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434, 3.09583074, 3.20980287])
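One small difference to be aware of if any probabilities are exactly zero: entr is defined to return 0 at p = 0, while the explicit formula produces nan there. A tiny illustration:
import numpy as np
from scipy.special import entr

p0 = np.array([0.0, 0.5, 0.5])
print(entr(p0).sum() / np.log(2))   # 1.0
print((-p0 * np.log2(p0)).sum())    # nan (plus warnings), because 0 * log2(0) is nan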
As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.
Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:
import numpy as np
def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1
    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack([np.bincount(row, minlength=nbins) for row in x])
    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)
    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)
Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.
Test correctness for a single row:
vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)
print("theoretical H(x): %.6f, empirical H(x): %.6f" %
(-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532
Test speed:
In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
....: entropy(x)
....:
10 loops, best of 3: 34.6 ms per loop
If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:
y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape
