Given a list of points, how can I get their indices in a KDTree?
from scipy import spatial
import numpy as np
#some data
x, y = np.mgrid[0:3, 0:3]
data = list(zip(x.ravel(), y.ravel()))  # list() needed in Python 3, where zip returns an iterator
points = [[0,1], [2,2]]
#KDTree
tree = spatial.cKDTree(data)
# indices of points in tree should be [1, 8]
I could do something like:
[tree.query_ball_point(i,r=0) for i in points]
>>> [[1], [8]]
Does it make sense to do it that way?
Use cKDTree.query(x, k, ...) to find the k nearest neighbours to a given set of points x:
distances, indices = tree.query(points, k=1)
print(repr(indices))
# array([1, 8])
In a trivial case such as this, where your dataset and your set of query points are both small, and where each query point is identical to a single row within the dataset, it would be faster to use simple boolean operations with broadcasting rather than building and querying a k-D tree:
data, points = np.array(data), np.array(points)
indices = (data[..., None] == points.T).all(1).argmax(0)
data[..., None] == points.T broadcasts out to an (nrows, ndims, npoints) array, which could quickly become expensive in terms of memory for larger datasets. In such cases you might get better performance out of a normal for loop or list comprehension:
indices = [(data == p).all(1).argmax() for p in points]
In Python, I have a vector v of 300 elements and an array arr of 20k 300-dimensional vectors. How do I quickly get the indices of the k elements of arr closest to v?
You can do this task with numpy:
import numpy as np
v = np.array([[1,1,1,1]])
arr = np.array([
[1,1,1,1],
[2,2,2,2],
[3,3,3,3]
])
dist = np.linalg.norm(v - arr, axis=1) # Euclidean distance
min_distance_index = np.argmin(dist) # Find index of minimum distance
closest_vector = arr[min_distance_index] # Get vector having minimum distance
closest_vector
# array([1, 1, 1, 1])
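This finds only the single closest vector; since the question asks for the k closest, np.argpartition can extend the same idea (a sketch; k here is chosen arbitrarily, and dist is the distance array from above):
k = 2
k_indices = np.argpartition(dist, k)[:k]            # indices of the k smallest distances, unordered
k_indices = k_indices[np.argsort(dist[k_indices])]  # order them by distance
k_closest = arr[k_indices]                          # the k closest vectors themselves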
Since 20k is a fairly small number, sorting all elements and then just taking the first k is not an expensive operation (usually; it depends on how many thousand times per second you need to do this).
So, sorted() is your friend; use the key= keyword argument, sorted_vector = sorted(arr, key=…), to sort by Euclidean distance.
Then, use the classic array[:k] syntax to select the first k.
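A minimal sketch of that approach, using made-up data (sorting indices rather than rows, since the question asks for indices):
import numpy as np

rng = np.random.default_rng(0)
arr = rng.normal(size=(20000, 300))  # 20k vectors of dimension 300
v = rng.normal(size=300)
k = 10

# sort the row indices by Euclidean distance to v, then take the first k
order = sorted(range(len(arr)), key=lambda i: np.linalg.norm(arr[i] - v))
k_closest_indices = order[:k]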
I'm trying to calculate the summation of each pair of rows in a matrix. Suppose I have an m x n matrix, say one like
[[1,2,3],
[4,5,6],
[7,8,9]]
and I want to create a matrix of the summations of all pairs of rows. So, for the above matrix, we would want
[[5,7,9],
[8,10,12],
[11,13,15]]
In general, I think the new matrix will be (m choose 2) x n. For the above example in pytorch, I ran
import torch
x = torch.tensor([[1,2,3], [4,5,6], [7,8,9]])
y = x[None] + x[:, None]
torch.cat((y[0, 1:3, :], y[1, 2:3, :]))
which manually creates the matrix I am looking for. However, I am struggling to think of a way to create the output without manually specifying indices and without using a for-loop. Is there even a way to create such a matrix for an arbitrary matrix without the use of a for-loop?
You can try using this function:
def sum_rows(x):
    y = x[None] + x[:, None]
    ind = torch.tril_indices(x.shape[0], x.shape[0], offset=-1)
    return y[ind[0], ind[1]]
Because you know you want pairs sum_matrix[i,j] with the constraint i<j (though i>j would also work, since addition is commutative), you can just take the lower/upper-triangle indices of your 3D tensor. This still uses a for loop internally, AFAIK, but it does the job for variable-sized inputs.
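For example, applying it to the tensor from the question reproduces the desired output (a quick check, assuming sum_rows is defined as above):
import torch

x = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(sum_rows(x))
# tensor([[ 5,  7,  9],
#         [ 8, 10, 12],
#         [11, 13, 15]])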
What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a,b, with I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:
I = a[:,None]==b
The following is slow and still memory-inefficient during creation:
I = sparse.csr_matrix((a[:,None]==b), shape=(len(a),len(b)))
The following gives at least the rows,cols for better csr_matrix initialization, but it still creates the full dense matrix and is equally slow:
z = np.argwhere((a[:,None]==b))
Any ideas?
One way to do it would be to first identify, using sets, all the values that a and b have in common. This should work well if there are not very many distinct values in a and b. One then only has to loop over the distinct values (in the variable values below) and use np.argwhere to identify the indices in a and b where these values occur. The 2D indices of the sparse matrix can then be constructed using np.repeat and np.tile:
import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:, None] == b), shape=(len(a), len(b)))

## identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

## here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

## looping over the common values, finding their indices in a and b, and
## generating the 2D indices of the sparse matrix with np.repeat and np.tile
for value in values:
    x = np.argwhere(a == value).ravel()
    y = np.argwhere(b == value).ravel()
    rows.append(np.repeat(x, len(y)))  # each index in a pairs with every matching index in b
    cols.append(np.tile(y, len(x)))

## concatenating the indices for different values and generating a 1D vector
## of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows), dtype=bool)

## generating sparse matrix
I3 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

## checking that the matrix was generated correctly:
print((I1 != I3).nnz == 0)
The syntax for generating the csr matrix is taken from the documentation. The test for sparse matrix equality is taken from this post.
Old Answer:
I don't know about performance, but at least you can avoid constructing the full dense matrix by using a simple generator expression. Here is some code that uses two 1D arrays of random integers to first generate the sparse matrix the way the OP posted, and then uses a generator expression to test all element pairs for equality:
import numpy as np
from scipy import sparse
a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))
## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i, A in enumerate(a) for j, B in enumerate(b) if A == B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))
##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0) ## --> True
I think there is no way around the double loop and ideally this would be pushed into numpy, but at least with the generator the loops are somewhat optimised ...
You could use numpy.isclose with a small tolerance:
np.isclose(a[:, None], b)
Or pandas.DataFrame.eq:
a.eq(b)
Note that these return dense arrays of True/False values, not a sparse matrix.
I have a huge matrix (think 20000 x 1000) called Z that I need to generate the pairwise distance from so I'm currently using sklearn.metrics.pairwise.euclidean_distances(Z,Z) to generate the pairwise distances.
However, now I need to search through the result to find the smallest X distances but I need their indices.
An example would be:
A = 20000 x 1000 numpy.ndarray
B = sklearn.metrics.pairwise.euclidean_distances(A, A)
C = ((2400,100), (800,900), (29,999)) if X = 3
What would be the best way to go about doing this? I saw numpy.unravel_index(a.argmax(), a.shape) but I'm not sure it would work well for this instance.
You can use np.triu_indices to generate the indices that correspond to entries of the compressed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
# Generate points
Z = np.random.normal(0, 1, (1000, 3))
# Compute euclidean distance
distance = pdist(Z)
# Get the smallest distance
min_distance = np.min(distance)
# Get the indices (k = 1 to omit diagonal entries)
idx = np.asarray(np.triu_indices(len(Z), 1))
# Filter the indices (this handles the case where the minimum distance is not unique)
idx = idx[:, distance == min_distance]
If you know that there is exactly one minimum distance, you could also use
idx = idx[:, np.argmin(distance)]
which is slightly more efficient.
EDIT: To get the sorted indices, try the following
idx = idx[:, np.argsort(distance)]
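If you only need the X smallest distances, a full sort is unnecessary; np.argpartition can select them first (a sketch, with X = 3 as in the question and idx as built above):
X = 3
part = np.argpartition(distance, X)[:X]  # indices of the X smallest distances, unordered
part = part[np.argsort(distance[part])]  # order those X by distance
pairs = idx[:, part].T                   # shape (X, 2): the row/column index pairs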
I would like to use Delaunay Triangulation in Python to interpolate the points in 3D.
What I have is
# my array of points
points = [[1,2,3], [2,3,4], ...]
# my array of values
values = [7, 8, ...]
# an object with triangulation
tri = Delaunay(points)
# a set of points at which I want to interpolate
p = [[1.5, 2.5, 3.5], ...]
# this gets simplexes that contain given points
s = tri.find_simplex(p)
# this gets vertices for the simplexes
v = tri.vertices[s]
I was only able to find one answer here that suggests using the transform method for the interpolation, but without being any more specific.
What I need to know is how to use the vertices of the containing simplex to get the weights for the linear interpolation. Let's assume a general n-dim case so that the answer does not depend on the dimension.
EDIT: I do not want to use LinearNDInterpolator or similar approach because I do not have a number at each point as a value but something more complex (array/function).
After some experimenting, the solution looks simple (this post was quite helpful):
import numpy as np
from scipy.spatial import Delaunay

# dimension of the problem (in this example I use a 3D grid,
# but the method works for any dimension n>=2)
n = 3
# my array of grid points (array of n-dimensional coordinates)
points = [[1,2,3], [2,3,4], ...]
# each point has some assigned value that will be interpolated
# (e.g. a float, but it can be a function or anything else)
values = [7, 8, ...]
# a set of points at which I want to interpolate (it must be a NumPy array)
p = np.array([[1.5, 2.5, 3.5], [1.1, 2.2, 3.3], ...])
# create an object with triangulation
tri = Delaunay(points)
# find simplexes that contain interpolated points
s = tri.find_simplex(p)
# get the vertices for each simplex (in recent SciPy versions,
# tri.vertices has been renamed to tri.simplices)
v = tri.vertices[s]
# get transform matrices for each simplex (see explanation below)
m = tri.transform[s]
# for each interpolated point p, multiply the transform matrix by
# vector p-r, where r=m[:,n,:] is one of the simplex vertices to which
# the matrix m is related (again, see below)
b = np.einsum('ijk,ik->ij', m[:,:n,:n], p-m[:,n,:])
# get the weights for the vertices; `b` contains an n-dimensional vector
# with weights for all but the last vertices of the simplex
# (note that for an n-D grid, each simplex consists of n+1 vertices);
# the remaining weight for the last vertex can be computed from
# the condition that the sum of weights must be equal to 1
w = np.c_[b, 1-b.sum(axis=1)]
The key method to understand is transform, which is only briefly documented, but the documentation says all that needs to be said. For each simplex, transform[:,:n,:n] contains the transformation matrix, and transform[:,n,:] contains the vector r to which the matrix is related. It seems that the r vector is chosen as the last vertex of the simplex.
Another tricky point is how to get b, because what I want to do is something like
for i in range(len(p)): b[i] = m[i,:n,:n].dot(p[i]-m[i,n,:])
Essentially, I need an array of dot products, while dot gives the product of two arrays. A loop over the individual simplexes like the one above would work, but it can be done faster in one step, for which there is numpy.einsum:
b = np.einsum('ijk,ik->ij', m[:,:n,:n], p-m[:,n,:])
Now, v contains the indices of the vertex points for each simplex and w holds the corresponding weights. To get the interpolated values p_values at the set of points p, we can do (note: values must be a NumPy array for this):
values = np.array(values)
p_values = np.zeros(len(p))
for i in range(len(p)): p_values[i] = np.inner(values[v[i]], w[i])
Or we may do this in a single step using np.einsum again:
p_values = np.einsum('ij,ij->i', values[v], w)
Some care must be taken in situations when some of the interpolated points lie outside the grid. In such cases, find_simplex(p) returns -1 for those points, and you will then have to mask them out (using masked arrays, perhaps).
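A minimal sketch of that masking, assuming s, v, w and values as above (NaN is used here to mark the outside points; a masked array would work just as well):
inside = s != -1                    # find_simplex returns -1 outside the convex hull
p_values = np.full(len(p), np.nan)  # NaN marks points outside the grid
p_values[inside] = np.einsum('ij,ij->i', values[v[inside]], w[inside])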
You don't need to implement this from scratch; there is already built-in support in scipy for this feature:
scipy.interpolate.LinearNDInterpolator
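A minimal usage sketch (the data here is made up for illustration):
import numpy as np
from scipy.interpolate import LinearNDInterpolator

points = np.random.rand(20, 3)
values = np.random.rand(20)
interp = LinearNDInterpolator(points, values)  # builds a Delaunay triangulation internally
print(interp([[0.5, 0.5, 0.5]]))               # returns nan for points outside the convex hull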
You need an interval and a linear interpolation, i.e. the length of the edge and the distance of the interpolated point from the start vertex.
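A minimal sketch of that idea for a single edge (all names and numbers are made up for illustration):
import numpy as np

r0, r1 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])  # edge endpoints
f0, f1 = 7.0, 8.0                                              # values at the endpoints
p = np.array([0.25, 0.25, 0.25])                               # a point on the edge

t = np.linalg.norm(p - r0) / np.linalg.norm(r1 - r0)  # fractional distance along the edge
value = (1 - t) * f0 + t * f1                          # linear interpolation between f0 and f1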