Euclidean distances between points - Python

I have an array of points in numpy:
points = rand(dim, n_points)
And I want to:
Calculate the L2 norm (Euclidean distance) between a certain point and all other points.
Calculate all pairwise distances.
Preferably all in numpy, with no for loops. How can one do it?

If you're willing to use SciPy, the scipy.spatial.distance module (the functions cdist and/or pdist) does exactly what you want, with all the looping done in C. You can do it with broadcasting too, but there's some extra memory overhead.
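For illustration, a minimal sketch of both tasks with cdist and pdist. It assumes the (dim, n_points) layout from the question, so the array is transposed first because SciPy expects one point per row; the variable names and the seed are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
points = rng.random((3, 5))          # (dim, n_points), as in the question
X = points.T                         # SciPy expects shape (n_points, dim)

# Task 1: distances from point 0 to every point, shape (n_points,)
d0 = cdist(X[:1], X)[0]

# Task 2: all pairwise distances. pdist returns a condensed vector of the
# n*(n-1)/2 unique pairs; squareform expands it to the full symmetric matrix.
D = squareform(pdist(X))
```

If you only need the unique pairs, you can keep the condensed vector from pdist and skip squareform.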

This might help with the second part:
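import numpy as np
p = np.random.rand(3, 4) # points are stored column-wise, so each vector has length 3
# insert a new axis in two different places so that broadcasting forms all pairs of columns
np.sqrt(np.sum((p[:, np.newaxis, :] - p[:, :, np.newaxis])**2, axis=0))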
which gives
array([[ 0. , 0.37355868, 0.64896708, 1.14974483],
[ 0.37355868, 0. , 0.6277216 , 1.19625254],
[ 0.64896708, 0.6277216 , 0. , 0.77465192],
[ 1.14974483, 1.19625254, 0.77465192, 0. ]])
if p was
array([[ 0.46193242, 0.11934744, 0.3836483 , 0.84897951],
[ 0.19102709, 0.33050367, 0.36382587, 0.96880535],
[ 0.84963349, 0.79740414, 0.22901247, 0.09652746]])
and you can check one of the entries via
np.sqrt(np.sum((p[:, 0] - p[:, 2])**2))
0.64896708223796884
The trick is to put newaxis and then do broadcasting.
Good luck!
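The first part of the question (one point against all the others) can be handled with the same broadcasting idea. This is only a sketch under the same column-wise layout; the reference index i and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
p = rng.random((3, 4))           # (dim, n_points): each column is a point
i = 0                            # index of the reference point (arbitrary)

# p[:, [i]] keeps shape (3, 1), so it broadcasts against all 4 columns
d_from_i = np.sqrt(((p - p[:, [i]]) ** 2).sum(axis=0))
```

The same result should come out of np.linalg.norm(p - p[:, [i]], axis=0).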

Related

Quickest way to calculate the Euclidean distance matrix of two lists of points

I have a set of points in 2-dimensional space and need to calculate the distance from each point to each other point.
I have a relatively small number of points, maybe at most 100. But since I need to do it often and rapidly in order to determine the relationships between these moving points, and since I'm aware that iterating through the points could be as bad as O(n^2) complexity, I'm looking for ways to take advantage of numpy's matrix magic (or scipy).
As it stands in my code, the coordinates of each object are stored in its class. However, I could also update them in a numpy array when I update the class coordinate.
class Cell(object):
    """Represents one object in the field."""
    def __init__(self, id, x=0, y=0):
        self.m_id = id
        self.m_x = x
        self.m_y = y
It occurs to me to create a Euclidean distance matrix to prevent duplication, but perhaps you have a cleverer data structure.
I'm open to pointers to nifty algorithms as well.
Also, I note that there are similar questions dealing with Euclidean distance and numpy but didn't find any that directly address this question of efficiently populating a full distance matrix.
You can take advantage of the complex type :
# build a complex array of your cells
z = np.array([complex(c.m_x, c.m_y) for c in cells])
First solution
# mesh this array so that you will have all combinations
m, n = np.meshgrid(z, z)
# get the distance via the norm
out = abs(m-n)
Second solution
Meshing is the main idea. But numpy is clever, so you don't have to generate m & n. Just compute the difference using a transposed version of z, and broadcasting builds the mesh implicitly:
out = abs(z[..., np.newaxis] - z)
Third solution
And if z is directly set as a 2-dimensional array, you can use z.T instead of the weird z[..., np.newaxis]. So finally, your code will look like this :
z = np.array([[complex(c.m_x, c.m_y) for c in cells]]) # notice the [[ ... ]]
out = abs(z.T-z)
Example
>>> z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
>>> abs(z.T-z)
array([[ 0. , 2.23606798, 4.12310563],
[ 2.23606798, 0. , 4.24264069],
[ 4.12310563, 4.24264069, 0. ]])
As a complement, you may want to remove duplicates afterwards, taking the upper triangle :
>>> np.triu(out)
array([[ 0. , 2.23606798, 4.12310563],
[ 0. , 0. , 4.24264069],
[ 0. , 0. , 0. ]])
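If you only want each pair once as a flat vector rather than a zero-padded matrix, np.triu_indices can pull out the strict upper triangle. A small sketch using the same three example points (the names i, j and pair_dist are just for illustration):

```python
import numpy as np

z = np.array([0 + 0j, 2 + 1j, -1 + 4j])   # same points as the example above
out = np.abs(z[:, np.newaxis] - z)        # full 3x3 distance matrix

# k=1 skips the diagonal, so only pairs with i < j are selected
i, j = np.triu_indices(len(z), k=1)
pair_dist = out[i, j]                     # distances for pairs (0,1), (0,2), (1,2)
```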
Some benchmarks
>>> timeit.timeit('abs(z.T-z)', setup='import numpy as np;z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])')
4.645645342274779
>>> timeit.timeit('abs(z[..., np.newaxis] - z)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
5.049334864854522
>>> timeit.timeit('m, n = np.meshgrid(z, z); abs(m-n)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
22.489568296184686
If you don't need the full distance matrix, you will be better off using a kd-tree. Consider scipy.spatial.cKDTree or sklearn.neighbors.KDTree. A kd-tree can find the k nearest neighbors of all n points in roughly O(n log n) time, so you avoid the O(n**2) cost of computing all n by n distances.
Jake Vanderplas gives this example using broadcasting in Python Data Science Handbook, which is very similar to what @shx2 proposed.
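A rough sketch of the kd-tree route with scipy.spatial.cKDTree, assuming 100 random 2-D points and 3 neighbors per point (both numbers are arbitrary). Querying the tree with its own points returns each point as its own nearest neighbor at distance 0, so the sketch asks for one extra neighbor and drops the first column.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)     # arbitrary seed
X = rng.random((100, 2))           # 100 points in 2-D, one per row

tree = cKDTree(X)
# k=4 because the closest hit for each point is the point itself
dist, idx = tree.query(X, k=4)
nn_dist, nn_idx = dist[:, 1:], idx[:, 1:]   # the 3 true nearest neighbors
```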
import numpy as np
rand = np.random.RandomState(42)
X = rand.rand(3, 2)
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis = -1)
dist_sq
array([[0. , 0.18543317, 0.81602495],
[0.18543317, 0. , 0.22819282],
[0.81602495, 0.22819282, 0. ]])
Here is how you can do it using numpy:
import numpy as np
x = np.array([0,1,2])
y = np.array([2,4,6])
# take advantage of broadcasting, to make a 2dim array of diffs
dx = x[..., np.newaxis] - x[np.newaxis, ...]
dy = y[..., np.newaxis] - y[np.newaxis, ...]
dx
=> array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0]])
# stack in one array, to speed up calculations
d = np.array([dx,dy])
d.shape
=> (2, 3, 3)
Now all that is left is computing the L2-norm along the 0-axis (as discussed here):
(d**2).sum(axis=0)**0.5
=> array([[ 0. , 2.23606798, 4.47213595],
[ 2.23606798, 0. , 2.23606798],
[ 4.47213595, 2.23606798, 0. ]])
If you are looking for the most efficient way to compute this, use SciPy's cdist() (or pdist() if you only need the condensed vector of pairwise distances instead of the full distance matrix), as suggested in Tweakimp's comment. As he said, it's a lot faster than the methods based on vectorization and broadcasting proposed by RichPauloo and shx2. The reason is that SciPy's cdist() and pdist() run their loops and metric computations in C, which is even faster than NumPy vectorization.
By the way, if you can use SciPy and still prefer a method based on broadcasting, you don't have to implement it yourself: the distance_matrix() function is a pure Python implementation that leverages broadcasting and vectorization (source code, docs).
It's worth mentioning that cdist()/pdist() is also more memory-efficient than broadcasting, as it computes distances one by one and avoids creating intermediate arrays of n*n*d elements, where n is the number of points and d is their dimensionality.
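To put a number on the memory point, here is a back-of-the-envelope sketch. The two helper functions are hypothetical names made up for this example; they just multiply out the element counts, assuming float64 (8 bytes per element).

```python
import numpy as np

def broadcast_intermediate_bytes(n, m, d, itemsize=8):
    """Bytes held by the (n, m, d) difference array that broadcasting creates."""
    return n * m * d * itemsize

def result_bytes(n, m, itemsize=8):
    """Bytes held by the final (n, m) distance matrix."""
    return n * m * itemsize

# sanity check of the formula on a small real array
a = np.zeros((50, 2))
b = np.zeros((40, 2))
diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]
assert diff.nbytes == broadcast_intermediate_bytes(50, 40, 2)

# 10000 x 10000 points in 2-D
print(broadcast_intermediate_bytes(10_000, 10_000, 2) / 1e9)  # 1.6 (GB)
print(result_bytes(10_000, 10_000) / 1e9)                     # 0.8 (GB)
```

So for 10000 points in 2-D the broadcasting intermediate alone is about twice the size of the result, and the ratio grows with d.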
Experiments
I've conducted some simple experiments to compare performance of SciPy's cdist(), distance_matrix() and broadcasting implementation in NumPy. I used perf_counter_ns() from Python's time module to measure time and all the results are averaged over 10 runs on 10000 points in 2D space using np.float64 datatype (tested on Python 3.8.10, Windows 10 with Ryzen 2700 and 16 GB RAM):
cdist() - 0.6724s
distance_matrix() - 3.0128s
my NumPy implementation - 3.6931s
Code if someone wants to reproduce experiments:
from time import perf_counter_ns
import numpy as np
from scipy.spatial import distance, distance_matrix
def dist_mat_custom(a, b):
    # pairwise Euclidean distances via broadcasting
    return np.sqrt(np.sum(np.square(a[:, np.newaxis, :] - b[np.newaxis, :, :]), axis=-1))
results = []
size = 10000
it_num = 10
for i in range(it_num):
    a = np.random.normal(size=(size, 2))
    b = np.random.normal(size=(size, 2))
    start = perf_counter_ns()
    # uncomment exactly one of the three lines below to time that method
    c = distance_matrix(a, b)
    #c = dist_mat_custom(a, b)
    #c = distance.cdist(a, b)
    results.append(perf_counter_ns() - start)
print(np.mean(results) / 1e9) # mean time per run, in seconds

Avoid using for loop. Python 3

I have an array of shape (3,2):
import numpy as np
arr = np.array([[0.,0.],[0.25,-0.125],[0.5,-0.125]])
I was trying to build a matrix (matrix) of dimensions (6,2), with the results of the outer product of the elements i,i of arr and arr.T. At the moment I am using a for loop such as:
size = np.shape(arr)
matrix = np.zeros((size[0]*size[1], size[1]))
for i in range(np.shape(arr)[0]):
    prod = np.outer(arr[i], arr[i].T)
    matrix[size[1]*i:size[1]+size[1]*i, :] = prod
Resulting:
matrix =array([[ 0. , 0. ],
[ 0. , 0. ],
[ 0.0625 , -0.03125 ],
[-0.03125 , 0.015625],
[ 0.25 , -0.0625 ],
[-0.0625 , 0.015625]])
Is there any way to build this matrix without using a for loop (e.g. broadcasting)?
Extend the arrays to 3D with None/np.newaxis, keeping the first axis aligned while letting the second axis get multiplied pair-wise. Then perform the multiplication with broadcasting and reshape to 2D -
matrix = (arr[:,None,:]*arr[:,:,None]).reshape(-1,arr.shape[1])
We can also use np.einsum -
matrix = np.einsum('ij,ik->ijk',arr,arr).reshape(-1,arr.shape[1])
einsum string representation might be more intuitive as it lets us visualize three things :
Axes that are aligned (axis=0 here).
Axes that are getting summed up (none here).
Axes that are kept i.e. element-wise multiplied (axis=1 here).
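As a quick sanity check, the sketch below compares both one-liners against a loop over the rows of the arr from the question. The names loop, bcast and ein are just for this example.

```python
import numpy as np

arr = np.array([[0., 0.], [0.25, -0.125], [0.5, -0.125]])

# reference: one outer product per row, stacked vertically
loop = np.vstack([np.outer(row, row) for row in arr])

# broadcasting version: (3,1,2) * (3,2,1) -> (3,2,2), then flatten to (6,2)
bcast = (arr[:, None, :] * arr[:, :, None]).reshape(-1, arr.shape[1])

# einsum version: i stays aligned, j and k are both kept (nothing is summed)
ein = np.einsum('ij,ik->ijk', arr, arr).reshape(-1, arr.shape[1])

print(np.allclose(loop, bcast), np.allclose(loop, ein))
```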

How to blur 3D array of points, while maintaining their original values? (Python)

I have a sparse 3D array of values. I am trying to turn each "point" into a fuzzy "sphere", by applying a Gaussian filter to the array.
I would like the original value at the point (x,y,z) to remain the same. I just want to create falloff values around this point... But applying the Gaussian filter changes the original (x,y,z) value as well.
I am currently doing this:
dataCube = scipy.ndimage.filters.gaussian_filter(dataCube, 3, truncate=8)
Is there a way for me to normalize this, or do something so that my original values are still in this new dataCube? I am not necessarily tied to using a Gaussian filter, if that is not the best approach.
You can do this using a convolution with a kernel that has 1 as its central value, and a width smaller than the spacing between your data points.
1-d example:
import numpy as np
import scipy.signal
data = np.array([0,0,0,0,0,5,0,0,0,0,0])
kernel = np.array([0.5,1,0.5])
scipy.signal.convolve(data, kernel, mode="same")
gives
array([ 0. , 0. , 0. , 0. , 2.5, 5. , 2.5, 0. , 0. , 0. , 0. ])
Note that fftconvolve might be much faster for large arrays. You also have to specify what should happen at the boundaries of your array.
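For reference, a small sketch showing that scipy.signal.fftconvolve is a drop-in replacement on the 1-d example (up to floating-point rounding). In mode="same" both functions treat everything outside the array as zero, which is an assumption you may or may not want at the boundaries.

```python
import numpy as np
import scipy.signal

data = np.array([0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0], dtype=float)
kernel = np.array([0.5, 1, 0.5])

direct = scipy.signal.convolve(data, kernel, mode="same", method="direct")
via_fft = scipy.signal.fftconvolve(data, kernel, mode="same")

# the two agree up to rounding error introduced by the FFT
print(np.allclose(direct, via_fft))
```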
Update: 3-d example
import numpy as np
from scipy import signal
# first build the smoothing kernel
sigma = 1.0 # width of kernel
x = np.arange(-3,4,1) # coordinate arrays -- make sure they contain 0!
y = np.arange(-3,4,1)
z = np.arange(-3,4,1)
xx, yy, zz = np.meshgrid(x,y,z)
kernel = np.exp(-(xx**2 + yy**2 + zz**2)/(2*sigma**2))
# apply to sample data
data = np.zeros((11,11,11))
data[5,5,5] = 5.
filtered = signal.convolve(data, kernel, mode="same")
# check output
print(filtered[:, 5, 5])
gives
[ 0. 0. 0.05554498 0.67667642 3.0326533 5. 3.0326533
0.67667642 0.05554498 0. 0. ]
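To illustrate the spacing condition, here is a sketch with two points that are farther apart than the kernel is wide. The positions and values are made up; the kernel is the same 7x7x7 Gaussian as above, so any two points at least 4 cells apart along an axis cannot influence each other's center value.

```python
import numpy as np
from scipy import signal

sigma = 1.0
ax = np.arange(-3, 4)                         # contains 0, so the center weight is 1
xx, yy, zz = np.meshgrid(ax, ax, ax, indexing="ij")
kernel = np.exp(-(xx**2 + yy**2 + zz**2) / (2 * sigma**2))

data = np.zeros((15, 15, 15))
data[3, 7, 7] = 5.0                           # two hypothetical points,
data[11, 7, 7] = 2.0                          # 8 cells apart along the first axis
filtered = signal.convolve(data, kernel, mode="same")

# both original values survive (up to rounding), with falloff around each
print(filtered[3, 7, 7], filtered[11, 7, 7])
```

If the points were closer than the kernel width, the falloff of one would be added onto the center of the other and the original values would no longer be preserved.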

Efficient way of taking Logarithm function in a sparse matrix

I have a big sparse matrix. I want to take the base-4 logarithm of every element in that sparse matrix.
I tried to use numpy.log(), but it doesn't work with sparse matrices.
I can also take the logarithm row by row and then overwrite the old row with the new one.
# Assume A is a sparse matrix (LIL, i.e. list-of-lists, format) with float values as data
# This handles only one row
import numpy as np
c = np.log(A.getrow(0)) / np.log(4)
A[0, :] = c
This was not as quick as I'd expected. Is there a faster way to do this?
You can modify the data attribute directly:
>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> a = np.array([[5,0,0,0,0,0,0],[0,0,0,0,2,0,0]])
>>> coo = coo_matrix(a)
>>> coo.data
array([5, 2])
>>> coo.data = np.log(coo.data)
>>> coo.data
array([ 1.60943791, 0.69314718])
>>> coo.todense()
matrix([[ 1.60943791, 0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.69314718,
0. , 0. ]])
Note that this doesn't work properly if the sparse format has repeated elements (which is valid in the COO format); it'll take the logs individually, and log(a) + log(b) != log(a + b). You probably want to convert to CSR or CSC first (which is fast) to avoid this problem.
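A small sketch of that pitfall and the CSR workaround. The matrix is made up for the example: it stores two entries at position (0, 0) whose sum is 4, so the correct base-4 logarithm there is 1.

```python
import numpy as np
from scipy.sparse import coo_matrix

# two stored entries at (0, 0): 2 + 2 = 4, plus a single 16 at (1, 2)
row = np.array([0, 0, 1])
col = np.array([0, 0, 2])
data = np.array([2.0, 2.0, 16.0])
A = coo_matrix((data, (row, col)), shape=(2, 3))

# converting to CSR sums the duplicates, so .data holds one value per position
B = A.tocsr()
B.data = np.log(B.data) / np.log(4)

print(B.toarray())
```

Taking the log of A.data directly would instead give log4(2) + log4(2) at (0, 0) once the duplicates are summed, which happens to equal 1 here only because 2 * 2 == 2 + 2; in general the two differ.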
You'll also have to add checks if the sparse matrix is in a different format, of course. And if you don't want to modify the matrix in-place, just construct a new sparse matrix as you did in your answer, but without adding 3 because that's completely unnecessary here.
I think I solved it in a very easy way. It is strange that no one could answer right away.
# Let A be a COO_matrix
import numpy as np
from scipy.sparse import coo_matrix
new_data = np.log(A.data+3)/np.log(4) #3 is not so important. It can be 1 too.
A = coo_matrix((new_data, (A.row, A.col)), shape=A.shape)
