I have a huge matrix (think 20000 x 1000) called Z that I need to generate pairwise distances from, so I'm currently using sklearn.metrics.pairwise.euclidean_distances(Z, Z) to build the pairwise distance matrix.
However, I now need to search through the result to find the smallest X distances, and I need their indices.
An example would be:
A = 20000 x 1000 numpy.ndarray
B = sklearn.metrics.pairwise.euclidean_distances(A, A)
C = ((2400,100), (800,900), (29,999)) if X = 3
What would be the best way to go about doing this? I saw numpy.unravel_index(a.argmax(), a.shape) but I'm not sure it would work well for this instance.
You can use np.triu_indices to generate the indices that correspond to entries of the condensed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
# Generate points
Z = np.random.normal(0, 1, (1000, 3))
# Compute euclidean distance
distance = pdist(Z)
# Get the smallest distance
min_distance = np.min(distance)
# Get the indices (k = 1 to omit diagonal entries)
idx = np.asarray(np.triu_indices(len(Z), 1))
# Filter the indices (this handles the case where the minimum distance is not unique)
idx = idx[:, distance == min_distance]
If you know that there is exactly one minimum distance, you could also use
idx = idx[:, np.argmin(distance)]
which is slightly more efficient.
EDIT: To get the sorted indices, try the following
idx = idx[:, np.argsort(distance)]
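Not part of the original answer, but if you need the X smallest distances (as in the question) rather than just the minimum, a partial sort with np.argpartition avoids fully sorting the condensed matrix. A sketch reusing the setup above, with X = 3 as an assumption:
import numpy as np
from scipy.spatial.distance import pdist
# Generate points and the condensed distance matrix, as above
Z = np.random.normal(0, 1, (1000, 3))
distance = pdist(Z)
idx = np.asarray(np.triu_indices(len(Z), 1))
X = 3
# Partial sort: indices of the X smallest condensed distances, then order them
part = np.argpartition(distance, X)[:X]
order = part[np.argsort(distance[part])]
pairs = idx[:, order].T        # X x 2 array of (row, col) pairs, closest first
smallest = distance[order]     # the corresponding X smallest distances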
Related
My goal is to find the Top N vectors in a large 3D dask array (~100k rows per side, or more, would be nice) that are most cosine-similar to a target vector. I can get the Top 1, but only for smaller values of n; n=500 takes over 2 hours. I'm doing something incorrectly with dask, but I'm not sure what. Also, is there a vectorized way to get the cosine similarity instead of the for loop? In pure numpy I can get to n ≈ 6000 before I hit a MemoryError. A dtype of float16 gives enough accuracy and is an attempt to save space. If dask isn't the right tool, I'd be open to something else too.
import dask.array as da
import numpy as np
from numpy.linalg import norm
# create a 2d matrix of n rows, each of length n, ideally n is quite large, >100,000
start = 1
step = 1
n = 5
vec_len = 10
shape = [n, vec_len]
end = np.prod(shape) * step + start
arr_2D = da.from_array(np.array(np.arange(start, end, step).reshape(shape), dtype=np.float16))
print(arr_2D.compute())
# sum each row with each other row using broadcasting, resulting in a 3D matrix
# each (i,j) location contains a vector that is the sum of the i-th and j-th original vectors
sums_3D = arr_2D[:, None] + arr_2D[None,:]
# make a target vector
target = np.array(range(vec_len,0,-1))
print('target:', target)
# brute-force way to get the cosine of each vector in the 3D matrix with the target vector
da_cos = da.empty(shape=(n, n), dtype=np.float16)
for i in range(n):  # <----- is there a way to vectorize this for loop??
    print('row:', i)
    for j in range(i+1, n):  # i+1: to get only the upper triangle
        cur = sums_3D[i, j]
        cosine = np.dot(target, cur) / (norm(target) * norm(cur))
        da_cos[i, j] = cosine
print(da_cos.compute(), da_cos.dtype, da_cos.shape)
# Get top match <------ how would I get the Top N matches??
ar_max = da_cos.argmax().compute()
best_1, best_2 = np.unravel_index(ar_max, (n,n))
print(da_cos.max().compute(), best_1, best_2)
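One possible way the double loop could be vectorized, sketched here in plain NumPy with the same small sizes used above (the Top N selection at the end uses N = 3 as an assumption); the same broadcasting approach should carry over to dask arrays:
import numpy as np

n, vec_len = 5, 10
arr_2D = np.arange(1, n * vec_len + 1, dtype=np.float64).reshape(n, vec_len)
sums_3D = arr_2D[:, None] + arr_2D[None, :]       # (n, n, vec_len) pairwise sums
target = np.arange(vec_len, 0, -1, dtype=np.float64)

# Cosine similarity of every (i, j) sum vector with the target, without a loop
dots = sums_3D @ target                           # (n, n) dot products
norms = np.linalg.norm(sums_3D, axis=-1)          # (n, n) vector norms
cos = dots / (norms * np.linalg.norm(target))

# Keep only the upper triangle, then pull out the Top N (i, j) pairs
cos[np.tril_indices(n)] = -np.inf
N = 3
flat = np.argpartition(cos.ravel(), -N)[-N:]
flat = flat[np.argsort(-cos.ravel()[flat])]       # order best-first
top_pairs = np.column_stack(np.unravel_index(flat, cos.shape))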
In Python, I have a vector v of 300 elements and an array arr of 20k 300-dimensional vectors. How do I quickly get the indices of the k elements of arr that are closest to v?
You can do this task with numpy
import numpy as np
v = np.array([[1,1,1,1]])
arr = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3]
])
dist = np.linalg.norm(v - arr, axis=1) # Euclidean distance
min_distance_index = np.argmin(dist) # Find index of minimum distance
closest_vector = arr[min_distance_index] # Get vector having minimum distance
closest_vector
# array([1, 1, 1, 1])
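The snippet above returns only the single closest vector. To get the indices of the k closest (which is what the question asks for), one option, not in the original answer, is np.argpartition on the same distance array; here k = 2 is just for illustration, reusing v and arr from above:
k = 2
dist = np.linalg.norm(v - arr, axis=1)
# Indices of the k smallest distances, then ordered from closest to farthest
k_indices = np.argpartition(dist, k)[:k]
k_indices = k_indices[np.argsort(dist[k_indices])]
closest_k = arr[k_indices]   # the k rows of arr nearest to v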
Since 300 is a very small number, sorting all elements and then just using the first k is not an expensive operation (usually; it depends on how many thousand times per second you need to do this).
So, sorted() is your friend; use the key= keyword argument, e.g. sorted_arr = sorted(arr, key=…), to sort by Euclidean distance to v.
Then use the classic array[:k] slicing syntax to select the first k, as sketched below.
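A concrete sketch of that approach (the toy v, arr, and k here are assumptions, and the row indices are sorted rather than the rows themselves so the indices the question asks for fall out directly):
import numpy as np

v = np.array([1, 1, 1, 1])
arr = np.array([[3, 3, 3, 3], [1, 1, 1, 1], [2, 2, 2, 2]])
k = 2

# Sort the row indices of arr by Euclidean distance to v, then keep the first k
indices = sorted(range(len(arr)), key=lambda i: np.linalg.norm(arr[i] - v))[:k]
closest = arr[indices]   # indices == [1, 2] for this toy data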
I have an image of about 8000x9000 stored as a numpy array. I also have a list of indices in a 2xN numpy matrix. These indices are fractional and may fall outside the image bounds. I need to interpolate the image and find the values at the given indices. If an index falls outside the image, I need to return numpy.nan for it. Currently I'm doing it in a for loop, as below:
def interpolate_image(image: numpy.ndarray, indices: numpy.ndarray) -> numpy.ndarray:
    """
    :param image: 2D image array
    :param indices: 2xN matrix. 1st row is dim1 (rows) indices, 2nd row is dim2 (cols) indices
    :return: 1xN array of values at the given indices (numpy.nan for out-of-bounds indices)
    """
    # Todo: Vectorize this
    M, N = image.shape
    num_indices = indices.shape[1]
    interpolated_image = numpy.zeros((1, num_indices))
    for i in range(num_indices):
        x, y = indices[:, i]
        if (x < 0 or x > M - 1) or (y < 0 or y > N - 1):
            interpolated_image[0, i] = numpy.nan
        else:
            # Todo: Do Bilinear Interpolation. For now nearest neighbor is implemented
            interpolated_image[0, i] = image[int(round(x)), int(round(y))]
    return interpolated_image
But the for loop is taking a huge amount of time (as expected). How can I vectorize this? I found scipy.interpolate.interp2d, but I'm not able to use it. Can someone explain how to use it, or suggest another method? I also found this, but again it doesn't match my requirements: given x and y indices, it generates interpolated matrices. I don't want that. For the given indices, I just want the interpolated values, i.e. I need a vector output, not a matrix.
I tried it like this, but as said above, it gives a matrix output:
f = interpolate.interp2d(numpy.arange(image.shape[0]), numpy.arange(image.shape[1]), image, kind='linear')
interp_image_vect = f(indices[:,0], indices[:,1])
RuntimeError: Cannot produce output of size 73156608x73156608 (size too large)
For now, I've implemented nearest-neighbor interpolation. scipy's interp2d doesn't have nearest neighbor. It would be good if the library function had a nearest-neighbor option (so I can compare); if not, that's also fine.
It looks like scipy.interpolate.RectBivariateSpline will do the trick:
import numpy
from scipy.interpolate import RectBivariateSpline
image = ...    # as given
indices = ...  # as given
M, N = image.shape
spline = RectBivariateSpline(numpy.arange(M), numpy.arange(N), image)
interpolated = spline(indices[0], indices[1], grid=False)
This gets you the interpolated values, but it doesn't give you nan where you need it. You can get that with where:
nans = numpy.zeros(interpolated.shape) + numpy.nan
x_in_bounds = (0 <= indices[0]) & (indices[0] < M)
y_in_bounds = (0 <= indices[1]) & (indices[1] < N)
bounded = numpy.where(x_in_bounds & y_in_bounds, interpolated, nans)
I tested this with a 2624x2624 image and 100,000 points in indices and all told it took under a second.
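As an aside (not part of the answer above), scipy.interpolate.RegularGridInterpolator can also cover the nearest-neighbor comparison the question mentions, and it handles the out-of-bounds nan filling itself. A sketch using the same image, indices, M and N:
from scipy.interpolate import RegularGridInterpolator
import numpy

interp = RegularGridInterpolator(
    (numpy.arange(M), numpy.arange(N)), image,
    method='nearest',        # or 'linear' for bilinear interpolation
    bounds_error=False,      # don't raise on out-of-bounds points...
    fill_value=numpy.nan)    # ...return nan for them instead
interpolated_nn = interp(indices.T)  # the interpolator expects Nx2 points, indices is 2xN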
I have a numpy array that has 10,000 vectors with 3,000 elements each. I want to return the top 10 indices of the closest pairs along with the distance between them. So if row 5 and row 7 have the closest Euclidean distance of 0.005, and row 8 and row 10 have the second closest Euclidean distance of 0.0052, then I want to return [(8,10,.0052),(5,7,.005)]. The traditional for-loop method is very slow. Is there a quicker approach to get the Euclidean neighbors of large feature vectors (stored as a np array)?
I'm doing the following:
l = []
for i in range(0, M.shape[0]):
    for j in range(0, M.shape[0]):
        if i != j and i > j:
            l.append((i, j, euc(M[i], M[j])))
return l
Here euc is a function to calculate euclidean distances between two vectors of a matrix using scipy.
Then I sort l and pull out the top 10 closest distances
def topTen(M):
    i, j = np.triu_indices(M.shape[0], 1)
    dist_sq = np.einsum('ij,ij->i', M[i] - M[j], M[i] - M[j])
    max_i = np.argpartition(dist_sq, 10)[:10]
    max_o = np.argsort(dist_sq[max_i])
    return np.vstack((i[max_i][max_o], j[max_i][max_o], dist_sq[max_i][max_o]**.5)).T
This should be pretty fast, as it only does the sorting and the square root on the top 10, which are the slow steps (outside of the looping).
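For example, with a small random matrix standing in for the real 10,000 x 3,000 data:
M = np.random.random((100, 30))
result = topTen(M)
print(result[:3])   # each row is (i, j, distance), closest pair first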
I'll post this as an answer, but I admit it is not a real solution to the question, because it will only work for smaller arrays. The problem is that if you want to be really fast and avoid loops, you need to compute all the pairwise distances at once, and that implies a memory complexity on the order of the square of the input... Let's say 10,000 rows * 10,000 rows * 3,000 elems/row * 4 bytes/elem (say we're using float32) ≈ 1 TB (!) of memory required (actually maybe twice that, because you probably need a couple of arrays of that size). So while it is possible, it is not practical at this kind of size. The following code shows how you could implement it (with the sizes divided by 100).
import numpy as np
# Row length
n = 30
# Number of rows
m = 100
# Number of top elements
k = 10
# Input data
data = np.random.random((m, n))
# Tile the data in two different dimensions
data1 = np.tile(data[:, :, np.newaxis], (1, 1, m))
data2 = np.tile(data.T[np.newaxis, :, :], (m, 1, 1))
# Compute pairwise squared distances
dist = np.sum(np.square(data1 - data2), axis=1)
# Fill lower half with inf to avoid repeat and self-matching
dist[np.tril_indices(m)] = np.inf
# Find smallest distance for each row
i = np.arange(m)
j = np.argmin(dist, axis=1)
dmin = dist[i, j]
# Pick the top K smallest distances
idx = np.stack((i, j), axis=1)
isort = dmin.argsort()
# Top K indices pairs (K x 2 matrix)
top_idx = idx[isort[:k], :]
# Top K smallest distances
top_dist = np.sqrt(dmin[isort[:k]])
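Not from the original answer, but one way to keep the same vectorized idea workable at the full size is to process the rows in blocks and keep a running top-k, so only a (block x m) slice of the full distance matrix exists at any time. A rough sketch (the block size, the random float32 data, and the reduced sizes are assumptions):
import numpy as np

m, n, k = 1000, 30, 10
data = np.random.random((m, n)).astype(np.float32)

best = []        # running list of (squared_distance, i, j), kept at most k long
block = 100      # rows processed per chunk
for start in range(0, m, block):
    stop = min(start + block, m)
    # Squared distances between this block of rows and all rows: shape (block, m)
    d = ((data[start:stop, None, :] - data[None, :, :]) ** 2).sum(-1)
    # Mask self-pairs and duplicates (keep only column index > global row index)
    rows = np.arange(start, stop)[:, None]
    cols = np.arange(m)[None, :]
    d[cols <= rows] = np.inf
    # Merge this block's k best pairs into the running top-k
    flat = np.argpartition(d.ravel(), min(k, d.size - 1))[:k]
    bi, bj = np.unravel_index(flat, d.shape)
    best.extend(zip(d[bi, bj], bi + start, bj))
    best = sorted(best)[:k]

top = [(int(i), int(j), float(np.sqrt(dsq))) for dsq, i, j in best]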
Given a list of points, how can I get their indices in a KDTree?
from scipy import spatial
import numpy as np
#some data
x, y = np.mgrid[0:3, 0:3]
data = list(zip(x.ravel(), y.ravel()))
points = [[0,1], [2,2]]
#KDTree
tree = spatial.cKDTree(data)
# indices of points in tree should be [1, 8]
I could do something like:
[tree.query_ball_point(i,r=0) for i in points]
>>> [[1], [8]]
Does it make sense to do it that way?
Use cKDTree.query(x, k, ...) to find the k nearest neighbours to a given set of points x:
distances, indices = tree.query(points, k=1)
print(repr(indices))
# array([1, 8])
In a trivial case such as this, where your dataset and your set of query points are both small, and where each query point is identical to a single row within the dataset, it would be faster to use simple boolean operations with broadcasting rather than building and querying a k-D tree:
data, points = np.array(data), np.array(points)
indices = (data[..., None] == points.T).all(1).argmax(0)
data[..., None] == points.T broadcasts out to an (nrows, ndims, npoints) array, which could quickly become expensive in terms of memory for larger datasets. In such cases you might get better performance out of a normal for loop or list comprehension:
indices = [(data == p).all(1).argmax() for p in points]
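Put together as a runnable snippet (building data directly as a numpy array here, just for convenience), both approaches return the same indices:
import numpy as np
from scipy import spatial

x, y = np.mgrid[0:3, 0:3]
data = np.stack((x.ravel(), y.ravel()), axis=1)
points = np.array([[0, 1], [2, 2]])

# k-D tree lookup
tree = spatial.cKDTree(data)
_, kd_indices = tree.query(points, k=1)

# Broadcasting / list-comprehension lookup
bc_indices = [(data == p).all(1).argmax() for p in points]

print(kd_indices, bc_indices)   # both give [1, 8]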