Efficient Matrix construction for a weighted Euclidean distance

Efficient Matrix construction for a weighted Euclidean distance - python

I have M points in 2-dimensional Euclidean space, and have stored them in an array X of size M x 2.
I have constructed a cost matrix whereby element ij is the distance d(X[i, :], X[j, :]). The distance function I am using is the standard Euclidean distance weighted by an inverse of the matrix D. i.e d(x,y)= < D^{-1}(x-y) , x-y >. I would like to know if there is a more efficient way of doing this, note I have practically avoided for loops.
import numpy as np
Dinv = np.linalg.inv(D)
def cost(X, Dinv):
Msq = len(X) ** 2
mesh = []
for i in range(2): # separate each coordinate axis
xmesh = np.meshgrid(X[:, i], X[:, i]) # meshgrid each axis
xmesh = xmesh[1] - xmesh[0] # create the difference matrix
xmesh = xmesh.reshape(Msq) # reshape into vector
mesh.append(xmesh) # save/append into list
meshv = np.vstack((mesh[0], mesh[1])).T # recombined coordinate axis
# apply D^{-1}
Dx = np.einsum("ij,kj->ki", Dinv, meshv)
return np.sum(Dx * meshv, axis=1) # dot the elements

I ll try something like this, mostly optimizing your meshv calculation:
meshv = (X[:,None]-X).reshape(-1,2)
((meshv # Dinv.T)*meshv).sum(1)

Related

Efficient way to map 3D function to a meshgrid with NumPy

I have a set of data values for a scalar 3D function, arranged as inputs x,y,z in an array of shape (n,3) and the function values f(x,y,z) in an array of shape (n,).
EDIT: For instance, consider the following simple function
data = np.array([np.arange(n)]*3).T
F = np.linalg.norm(data,axis=1)**2
I would like to convolve this function with a spherical kernel in order to perform a 3D smoothing. The easiest way I found to perform this is to map the function values in a 3D spatial grid and then apply a 3D convolution with the kernel I want.
This works fine, however the part that maps the 3D function to the 3D grid is very slow, as I did not find a way to do it with NumPy only. The code below is my actual implementation, where data is the (n,3) array containing the 3D positions in which the function is evaluated, F is the (n,) array containing the corresponding values of the function and M is the (N,N,N) array that contains the 3D space grid.
step = 0.1
# Create meshgrid
xmin = data[:,0].min()
xmax = data[:,0].max()
ymin = data[:,1].min()
ymax = data[:,1].max()
zmin = data[:,2].min()
zmax = data[:,2].max()
x = np.linspace(xmin,xmax,int((xmax-xmin)/step)+1)
y = np.linspace(ymin,ymax,int((ymax-ymin)/step)+1)
z = np.linspace(zmin,zmax,int((zmax-zmin)/step)+1)
# Build image
M = np.zeros((len(x),len(y),len(z)))
for l in range(len(data)):
for i in range(len(x)-1):
if x[i] < data[l,0] < x[i+1]:
for j in range(len(y)-1):
if y[j] < data[l,1] < y[j+1]:
for k in range(len(z)-1):
if z[k] < data[l,2] < z[k+1]:
M[i,j,k] = F[l]
Is there a more efficient way to fill a 3D spatial grid with the values of a 3D function ?

For each item of data you're scanning pixels of cuboid to check if it's inside. There is an option to skip this scan. You could calculate corresponding indices of these pixels by yourself, for example:
data = np.array([[1, 2, 3], #14 (corner1)
[4, 5, 6], #77 (corner2)
[2.5, 3.5, 4.5], #38.75 (duplicated pixel)
[2.9, 3.9, 4.9], #47.63 (duplicated pixel)
[1.5, 2, 3]]) #15.25 (one step up from [1, 2, 3])
step = 0.5
data_idx = ((data - data.min(axis=0))//step).astype(int)
M = np.zeros(np.max(data_idx, axis=0) + 1)
x, y, z = data_idx.T
M[x, y, z] = F
Note that only one value of duplicated pixels is being mapped to M.

All you need is just reshape F[:, 3] (only f(x, y, z)) into a grid. Hard to be more precise without sample data:
If the data is not sorted, you need to sort it:
F_sorted = F[np.lexsort((F[:,0], F[:,1], F[:,2]))] # sort by x, then y, then z
Choose only f(x, y, z)
F_values = F_sorted[:, 3]
Finally, reshape data into a grid:
M = F_sorted.reshape(N, N, N)

This method is faster than the original (approximatly 20x speed up):
step = 0.1
mins = np.min(data, axis=0)
maxs = np.max(data, axis=0)
ranges = np.floor((maxs - mins) / step + 1).astype(int)
indx = np.zeros(data.shape,dtype=int)
for i in range(3):
x = np.linspace(mins[i], maxs[i], ranges[i])
indx[:,i] = np.argmax(data[:,i,np.newaxis] <= (x[np.newaxis,:]), axis=1) -1
M = np.zeros(ranges)
M[indx[:,0],indx[:,1],indx[:,2]] = F
The first part sets up the required grid variables. The argmax function provides a simple (and fast) way to find the first true value of the broadcasted array. This produces a set of indices for x, y and z directions for each of the function values.
The resulting array M is not the same as that produced by the original code as the original code loses data. The logic of y[j] < data[l,1] < y[j+1] where y is a vector produced using linspace means the minimum and maximum values for each direction will be missed (data[l,1] might be equal to either y[j] or y[j+1]!). Run it with a dataset of two values each with their own coordinates and the M array will be all zeros.

Generating N random unit vectors with their sum equal to 0 (Python)

I'd like to generate N random 3-dimensional vectors (uniformly) on the unit sphere but with the condition, that their sum is equal to 0. My attempt was to generate N/2 random unit vectors, while the other are just the same vectors with a minus sign. The problem is, as I'm trying to achieve as little correlation as possible, and my idea is obviously not ideal, since half of my vectors are perfectly anti-correlated with their corresponding pair.

Your problem does not really have a solution, but you can generate a set of vectors that are going to be slightly less visibly correlated than your original solution of negating them. To be precise, if you generate N / 2 vectors and negate them, then rotate the negated vectors about their sum by any angle, you can guarantee that the sum will be zero and the correlation will be a more complicated rotation than a negative identity matrix.
import numpy as np
from scipy.spatial.transform import Rotation
N = 10
v1 = np.random.normal(size=(N / 2, 3))
v1 /= np.linalg.norm(v1, axis=1, keepdims=True)
axis = v1.sum(0)
rot = Rotation.from_rotvec(np.random.uniform(2.0 * np.pi) * axis / np.linalg.norm(axis))
v2 = rot.apply(-v1)
result = np.concatenate((v1, v2), axis=0)
This assumes that N is even in all cases. The normal distribution is a fairly standard method to generate points uniformly on a sphere: https://mathworld.wolfram.com/SpherePointPicking.html.
If you had some leeway from the sum being exactly zero, you could align two random sets of N / 2 vectors so that their sums point opposite each other.

In this code, I tried to generate vectors selected from a sphere by converting a theta, phi to x, y, z.
import numpy as np
def vectorize(theta, phi):
x = np.cos(phi) * np.cos(theta)
y = np.cos(phi) * np.sin(theta)
z = np.sin(phi)
return np.array([x, y, z])
theta_range = np.arange(0, 2 * np.pi, 0.01)
phi_range = np.arange(-np.pi / 2, np.pi / 2, 0.01)
TH, PI = np.meshgrid(theta_range, phi_range)
whole_map = np.vstack((TH.flatten(), PI.flatten())).T
# Number of vectors:
N = 100
# Selecting N/2 Vectors first at random
v_selected = np.random.choice(range(whole_map.shape[0]), N // 2)
vectors = np.array([vectorize(whole_map[ind][0], whole_map[ind][1]) for ind in v_selected])
# Doubling up the number of vectors by adding the negate of each vector to the vector set
vectors = np.vstack((vectors, -vectors))
print(vectors.sum(axis=0))
# array([1.94289029e-16, 1.17961196e-16, 1.11022302e-16])
# Almost 0, but isn't zero because of floating number precision when converted to binary
Here is the scatter plot of the points generated on the sphere with radius=1:

numpy second derivative of a ndimensional array

I have a set of simulation data where I would like to find the lowest slope in n dimensions. The spacing of the data is constant along each dimension, but not all the same (I could change that for the sake of simplicity).
I can live with some numerical inaccuracy, especially towards the edges. I would heavily prefer not to generate a spline and use that derivative; just on the raw values would be sufficient.
It is possible to calculate the first derivative with numpy using the numpy.gradient() function.
import numpy as np
data = np.random.rand(30,50,40,20)
first_derivative = np.gradient(data)
# second_derivative = ??? <--- there be kudos (:
This is a comment regarding laplace versus the hessian matrix; this is no more a question but is meant to help understanding of future readers.
I use as a testcase a 2D function to determine the 'flattest' area below a threshold. The following pictures show the difference in results between using the minimum of second_derivative_abs = np.abs(laplace(data)) and the minimum of the following:
second_derivative_abs = np.zeros(data.shape)
hess = hessian(data)
# based on the function description; would [-1] be more appropriate?
for i in hess[0]: # calculate a norm
for j in i[0]:
second_derivative_abs += j*j
The color scale depicts the functions values, the arrows depict the first derivative (gradient), the red dot the point closest to zero and the red line the threshold.
The generator function for the data was ( 1-np.exp(-10*xi**2 - yi**2) )/100.0 with xi, yi being generated with np.meshgrid.
Laplace:
Hessian:

The second derivatives are given by the Hessian matrix. Here is a Python implementation for ND arrays, that consists in applying the np.gradient twice and storing the output appropriately,
import numpy as np
def hessian(x):
"""
Calculate the hessian matrix with finite differences
Parameters:
- x : ndarray
Returns:
an array of shape (x.dim, x.ndim) + x.shape
where the array[i, j, ...] corresponds to the second derivative x_ij
"""
x_grad = np.gradient(x)
hessian = np.empty((x.ndim, x.ndim) + x.shape, dtype=x.dtype)
for k, grad_k in enumerate(x_grad):
# iterate over dimensions
# apply gradient again to every component of the first derivative.
tmp_grad = np.gradient(grad_k)
for l, grad_kl in enumerate(tmp_grad):
hessian[k, l, :, :] = grad_kl
return hessian
x = np.random.randn(100, 100, 100)
hessian(x)
Note that if you are only interested in the magnitude of the second derivatives, you could use the Laplace operator implemented by scipy.ndimage.filters.laplace, which is the trace (sum of diagonal elements) of the Hessian matrix.
Taking the smallest element of the the Hessian matrix could be used to estimate the lowest slope in any spatial direction.

Slopes, Hessians and Laplacians are related, but are 3 different things.
Start with 2d: a function( x, y ) of 2 variables, e.g. a height map of a range of hills,
slopes aka gradients are direction vectors, a direction and length at each point x y.
This can be given by 2 numbers dx dy in cartesian coordinates,
or an angle θ and length sqrt( dx^2 + dy^2 ) in polar coordinates.
Over a whole range of hills, we get a
vector field.
Hessians describe curvature near x y, e.g. a paraboloid or a saddle,
with 4 numbers: dxx dxy dyx dyy.
a Laplacian is 1 number, dxx + dyy, at each point x y.
Over a range of hills, we get a
scalar field.
(Functions or hills with Laplacian = 0
are particularly smooth.)
Slopes are linear fits and Hessians quadratic fits, for tiny steps h near a point xy:
f(xy + h) ~ f(xy)
+ slope . h -- dot product, linear in both slope and h
+ h' H h / 2 -- quadratic in h
Here xy, slope and h are vectors of 2 numbers,
and H is a matrix of 4 numbers dxx dxy dyx dyy.
N-d is similar: slopes are direction vectors of N numbers,
Hessians are matrices of N^2 numbers, and Laplacians 1 number, at each point.
(You might find better answers over on
math.stackexchange .)

You can see the Hessian Matrix as a gradient of gradient, where you apply gradient a second time for each component of the first gradient calculated here is a wikipedia link definig Hessian matrix and you can see clearly that is a gradient of gradient, here is a python implementation defining gradient then hessian :
import numpy as np
#Gradient Function
def gradient_f(x, f):
assert (x.shape[0] >= x.shape[1]), "the vector should be a column vector"
x = x.astype(float)
N = x.shape[0]
gradient = []
for i in range(N):
eps = abs(x[i]) * np.finfo(np.float32).eps
xx0 = 1. * x[i]
f0 = f(x)
x[i] = x[i] + eps
f1 = f(x)
gradient.append(np.asscalar(np.array([f1 - f0]))/eps)
x[i] = xx0
return np.array(gradient).reshape(x.shape)
#Hessian Matrix
def hessian (x, the_func):
N = x.shape[0]
hessian = np.zeros((N,N))
gd_0 = gradient_f( x, the_func)
eps = np.linalg.norm(gd_0) * np.finfo(np.float32).eps
for i in range(N):
xx0 = 1.*x[i]
x[i] = xx0 + eps
gd_1 = gradient_f(x, the_func)
hessian[:,i] = ((gd_1 - gd_0)/eps).reshape(x.shape[0])
x[i] =xx0
return hessian
As a test, the Hessian matrix of (x^2 + y^2) is 2 * I_2 where I_2 is the identity matrix of dimension 2

hessians = np.asarray(np.gradient(np.gradient(f(X, Y))))
hessians[1:]
Worked for 3-d function f.

How to do this operation in numPy?

I have an array X of 3D coords of N points (N*3) and want to calculate the eukledian distance between each pair of points.
I can do this by iterating over X and comparing them with the threshold.
coords = array([v.xyz for v in vertices])
for vertice in vertices:
tests = np.sum(array(coords - vertice.xyz) ** 2, 1) < threshold
closest = [v for v, t in zip(vertices, tests) if t]
Is this possible to do in one operation? I recall linear algebra from 10 years ago and can't find a way to do this.
Probably this should be a 3D array (point a, point b, axis) and then summed by axis dimension.
edit: found the solution myself, but it doesn't work on big datasets.
coords = array([v.xyz for v in vertices])
big = np.repeat(array([coords]), len(coords), 0)
big_same = np.swapaxes(big, 0, 1)
tests = np.sum((big - big_same) ** 2, 0) < thr_square
for v, test_vector in zip(vertices, tests):
v.closest = self.filter(vertices, test_vector)

Use scipy.spatial.distance. If X is an n×3 array of points, you can get an n×n distance matrix from
from scipy.spatial import distance
D = distance.squareform(distance.pdist(X))
Then, the closest to point i is the point with index
np.argsort(D[i])[1]
(The [1] skips over the value in the diagonal, which will be returned first.)

I'm not quite sure what you're asking here. If you're computing the Euclidean distance between each pair of points in an N-point space, it would make sense to me to represent the results as a look-up matrix. So for N points, you'd get an NxN symmetric matrix. Element (3, 5) would represent the distance between points 3 and 5, whereas element (2, 2) would be the distance between point 2 and itself (zero). This is how I would do it for random points:
import numpy as np
N = 5
coords = np.array([np.random.rand(3) for _ in range(N)])
dist = np.zeros((N, N))
for i in range(N):
for j in range(i, N):
dist[i, j] = np.linalg.norm(coords[i] - coords[j])
dist[j, i] = dist[i, j]
print dist

If xyz is the array with your coordinates, then the following code will compute the distance matrix (works fast till the moment when you have enough memory to store N^2 distances):
xyz = np.random.uniform(size=(1000,3))
distances = (sum([(xyzs[:,i][:,None]-xyzs[:,i][None,:])**2 for i in range(3)]))**.5

Minimum Euclidean distance between points in two different Numpy arrays, not within

I have two arrays of x-y coordinates, and I would like to find the minimum Euclidean distance between each point in one array with all the points in the other array. The arrays are not necessarily the same size. For example:
xy1=numpy.array(
[[ 243, 3173],
[ 525, 2997]])
xy2=numpy.array(
[[ 682, 2644],
[ 277, 2651],
[ 396, 2640]])
My current method loops through each coordinate xy in xy1 and calculates the distances between that coordinate and the other coordinates.
mindist=numpy.zeros(len(xy1))
minid=numpy.zeros(len(xy1))
for i,xy in enumerate(xy1):
dists=numpy.sqrt(numpy.sum((xy-xy2)**2,axis=1))
mindist[i],minid[i]=dists.min(),dists.argmin()
Is there a way to eliminate the for loop and somehow do element-by-element calculations between the two arrays? I envision generating a distance matrix for which I could find the minimum element in each row or column.
Another way to look at the problem. Say I concatenate xy1 (length m) and xy2 (length p) into xy (length n), and I store the lengths of the original arrays. Theoretically, I should then be able to generate a n x n distance matrix from those coordinates from which I can grab an m x p submatrix. Is there a way to efficiently generate this submatrix?

(Months later)
scipy.spatial.distance.cdist( X, Y )
gives all pairs of distances,
for X and Y 2 dim, 3 dim ...
It also does 22 different norms, detailed
here .
# cdist example: (nx,dim) (ny,dim) -> (nx,ny)
from __future__ import division
import sys
import numpy as np
from scipy.spatial.distance import cdist
#...............................................................................
dim = 10
nx = 1000
ny = 100
metric = "euclidean"
seed = 1
# change these params in sh or ipython: run this.py dim=3 ...
for arg in sys.argv[1:]:
exec( arg )
np.random.seed(seed)
np.set_printoptions( 2, threshold=100, edgeitems=10, suppress=True )
title = "%s dim %d nx %d ny %d metric %s" % (
__file__, dim, nx, ny, metric )
print "\n", title
#...............................................................................
X = np.random.uniform( 0, 1, size=(nx,dim) )
Y = np.random.uniform( 0, 1, size=(ny,dim) )
dist = cdist( X, Y, metric=metric ) # -> (nx, ny) distances
#...............................................................................
print "scipy.spatial.distance.cdist: X %s Y %s -> %s" % (
X.shape, Y.shape, dist.shape )
print "dist average %.3g +- %.2g" % (dist.mean(), dist.std())
print "check: dist[0,3] %.3g == cdist( [X[0]], [Y[3]] ) %.3g" % (
dist[0,3], cdist( [X[0]], [Y[3]] ))
# (trivia: how do pairwise distances between uniform-random points in the unit cube
# depend on the metric ? With the right scaling, not much at all:
# L1 / dim ~ .33 +- .2/sqrt dim
# L2 / sqrt dim ~ .4 +- .2/sqrt dim
# Lmax / 2 ~ .4 +- .2/sqrt dim

To compute the m by p matrix of distances, this should work:
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
the .outer calls make two such matrices (of scalar differences along the two axes), the .hypot calls turns those into a same-shape matrix (of scalar euclidean distances).

The accepted answer does not fully address the question, which requests to find the minimum distance between the two sets of points, not the distance between every point in the two sets.
Although a straightforward solution to the original question indeed consists of computing the distance between every pair and subsequently finding the minimum one, this is not necessary if one is only interested in the minimum distances. A much faster solution exists for the latter problem.
All the proposed solutions have a running time that scales as m*p = len(xy1)*len(xy2). This is OK for small datasets, but an optimal solution can be written that scales as m*log(p), producing huge savings for large xy2 datasets.
This optimal execution time scaling can be achieved using scipy.spatial.KDTree as follows
import numpy as np
from scipy import spatial
xy1 = np.array(
[[243, 3173],
[525, 2997]])
xy2 = np.array(
[[682, 2644],
[277, 2651],
[396, 2640]])
# This solution is optimal when xy2 is very large
tree = spatial.KDTree(xy2)
mindist, minid = tree.query(xy1)
print(mindist)
# This solution by #denis is OK for small xy2
mindist = np.min(spatial.distance.cdist(xy1, xy2), axis=1)
print(mindist)
where mindist is the minimum distance between each point in xy1 and the set of points in xy2

For what you're trying to do:
dists = numpy.sqrt((xy1[:, 0, numpy.newaxis] - xy2[:, 0])**2 + (xy1[:, 1, numpy.newaxis - xy2[:, 1])**2)
mindist = numpy.min(dists, axis=1)
minid = numpy.argmin(dists, axis=1)
Edit: Instead of calling sqrt, doing squares, etc., you can use numpy.hypot:
dists = numpy.hypot(xy1[:, 0, numpy.newaxis]-xy2[:, 0], xy1[:, 1, numpy.newaxis]-xy2[:, 1])

import numpy as np
P = np.add.outer(np.sum(xy1**2, axis=1), np.sum(xy2**2, axis=1))
N = np.dot(xy1, xy2.T)
dists = np.sqrt(P - 2*N)

I think the following function also works.
import numpy as np
from typing import Optional
def pairwise_dist(X: np.ndarray, Y: Optional[np.ndarray] = None) -> np.ndarray:
Y = X if Y is None else Y
xx = (X ** 2).sum(axis = 1)[:, None]
yy = (Y ** 2).sum(axis = 1)[:, None]
return xx + yy.T - 2 * (X # Y.T)
Explanation
Suppose each row of X and Y are coordinates of the two sets of points.
Let their sizes be m X p and p X n respectively.
The result will produce a numpy array of size m X n with the (i, j)-th entry being the distance between the i-th row and the j-th row of X and Y respectively.

I highly recommend using some inbuilt method for calculating squares, and roots for they are customized for optimized way to calculate and very safe against overflows.
#alex answer below is the most safest in terms of overflow and should also be very fast. Also for single points you can use math.hypot which now supports more than 2 dimensions.
>>> def distances(xy1, xy2):
... d0 = numpy.subtract.outer(xy1[:,0], xy2[:,0])
... d1 = numpy.subtract.outer(xy1[:,1], xy2[:,1])
... return numpy.hypot(d0, d1)
Safety concerns
i, j, k = 1e+200, 1e+200, 1e+200
math.hypot(i, j, k)
# np.hypot for 2d points
# 1.7320508075688773e+200
np.sqrt(np.sum((np.array([i, j, k])) ** 2))
# RuntimeWarning: overflow encountered in square
overflow/underflow/speeds

I think that the most straightforward and efficient solution is to do it like this:
distances = np.linalg.norm(xy1, xy2) # calculate the euclidean distances between the test point and the training features.
min_dist = numpy.min(dists, axis=1) # get the minimum distance
min_id = np.argmi(distances) # get the index of the class with the minimum distance, i.e., the minimum difference.

Although many answers here are great, there is another way which has not been mentioned here, using numpy's vectorization / broadcasting properties to compute the distance between each points of two different arrays of different length (and, if wanted, the closest matches). I publish it here because it can be very handy to master broadcasting, and it also solves this problem elengantly while remaining very efficient.
Assuming you have two arrays like so:
# two arrays of different length, but with the same dimension
a = np.random.randn(6,2)
b = np.random.randn(4,2)
You can't do the operation a-b: numpy complains with operands could not be broadcast together with shapes (6,2) (4,2). The trick to allow broadcasting is to manually add a dimension for numpy to broadcast along to. By leaving the dimension 2 in both reshaped arrays, numpy knows that it must perform the operation over this dimension.
deltas = a.reshape(6, 1, 2) - b.reshape(1, 4, 2)
# contains the distance between each points
distance_matrix = (deltas ** 2).sum(axis=2)
The distance_matrix has a shape (6,4): for each point in a, the distances to all points in b are computed. Then, if you want the "minimum Euclidean distance between each point in one array with all the points in the other array", you would do :
distance_matrix.argmin(axis=1)
This returns the index of the point in b that is closest to each point of a.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Efficient Matrix construction for a weighted Euclidean distance - python

I ll try something like this, mostly optimizing your meshv calculation: meshv = (X[:,None]-X).reshape(-1,2) ((meshv # Dinv.T)*meshv).sum(1)

Related

Efficient way to map 3D function to a meshgrid with NumPy

Generating N random unit vectors with their sum equal to 0 (Python)

numpy second derivative of a ndimensional array

How to do this operation in numPy?

Minimum Euclidean distance between points in two different Numpy arrays, not within

Categories

Resources