I have a list of n polar coordinates, and a distance function which takes in two coordinates.
I want to create an n x n matrix which contains the pairwise distances under my function. I realize I probably need to use some form of vectorization with numpy but am not sure exactly how to do so.
A simple code segment is below for your reference
import numpy as np
length = 10
coord_r = np.random.rand(length)*10
coord_alpha = np.random.rand(length)*np.pi
# Repeat vector to matrix form
coord_r_X = np.tile(coord_r, [length,1])
coord_r_Y = coord_r_X.T
coord_alpha_X = np.tile(coord_alpha, [length,1])
coord_alpha_Y = coord_alpha_X.T
matDistance = np.sqrt(coord_r_X**2 + coord_r_Y**2 - 2*coord_r_X*coord_r_Y*np.cos(coord_alpha_X - coord_alpha_Y))
print(matDistance)
You can use scipy.spatial.distance.pdist. However, if the distance you want to calculate is the Euclidean distance, you may be better off just converting your points to rectangular coordinates, since then pdist will do the calculations quite quickly using its builtin Euclidean distance.
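For example, here is a minimal sketch of both options, reusing the arrays from the question (the conversion assumes your polar formula is the ordinary Euclidean distance written via the law of cosines, which matches the expression above):
import numpy as np
from scipy.spatial.distance import pdist, squareform
length = 10
coord_r = np.random.rand(length) * 10
coord_alpha = np.random.rand(length) * np.pi
# Option 1: convert to rectangular coordinates and use the fast built-in Euclidean metric
xy = np.column_stack((coord_r * np.cos(coord_alpha), coord_r * np.sin(coord_alpha)))
matDistance = squareform(pdist(xy))  # n x n matrix of pairwise distances
# Option 2: keep polar coordinates and pass a custom metric (slower: it calls back into Python for every pair)
polar = np.column_stack((coord_r, coord_alpha))
def polar_dist(u, v):
    return np.sqrt(u[0]**2 + v[0]**2 - 2*u[0]*v[0]*np.cos(u[1] - v[1]))
matDistance2 = squareform(pdist(polar, metric=polar_dist))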
I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword; since I didn't specify it, numpy computed the norm of the entire flattened matrix rather than along the rows (axis 1). I've updated it now, and it should return an array with one distance per point. The einsum was also suggested by @Kris for their specific application.
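For reference, since y has shape (10, 1), c[y] has shape (10, 1, 3), and the einsum simply drops the singleton axis. A minimal equivalent (my own variant, not part of the original answer) flattens the labels first:
# same result: index the centroids with a 1-D label array, then take row-wise norms
distances = np.linalg.norm(X - c[y.ravel()], axis=1)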
I have been trying for a while now to calculate the Euclidean and Minkowski distance between all the vectors in a list of lists. I don't have much advanced mathematical knowledge.
I am usually working with 4- or 5-dimensional vectors.
The vector list can range in size from 0 to around 200,000.
When calculating the distance, all the vectors have the same number of dimensions.
I have relied on these two questions during the process:
python numpy euclidean distance calculation between matrices of row vectors
Calculate Euclidean Distance between all the elements in a list of lists python
At first my code looked like this:
import numpy as np
def euclidean_distance_np(vec_list, single_vec):
    dist = (np.array(vec_list) - single_vec) ** 2
    dist = np.sum(dist, axis=1)
    dist = np.sqrt(dist)
    return dist

def minkowski_distance_np(vec_list, single_vec, p_val):
    dist = (np.abs(np.array(vec_list, dtype=np.int64) - single_vec) ** p_val).sum(axis=1) ** (1 / p_val)
    return dist
This worked well when I had a small number of vectors: I would calculate the distance from a single vector to all the vectors in the list, and repeat the process for every vector in the list one by one.
But once the list grew to five or six digits in length, these functions became extremely slow.
I managed to improve the Euclidean distance calculation like so:
x = np.array([v[0] for v in vec_list])
y = np.array([v[1] for v in vec_list])
z = np.array([v[2] for v in vec_list])
w = np.array([v[3] for v in vec_list])
t = np.array([v[4] for v in vec_list])
res = np.sqrt(np.square(x - x.reshape(-1,1)) + np.square(y - y.reshape(-1,1)) + np.square(z - z.reshape(-1,1)) + np.square(w - w.reshape(-1,1)) + np.square(t - t.reshape(-1,1)))
But I cannot figure out how to adapt the approach above to correctly calculate the Minkowski distance.
So, to be precise, my question is: how can I calculate the Minkowski distance in a similar way to the code above?
I would also appreciate any ideas for improvements or better ways to perform the calculations.
SciPy already implements the distance functions minkowski and euclidean, but what you probably need here is cdist.
NumPy is a great tool for matrix manipulation, but it doesn't contain every possible function. You can find most of the additional features and operations in SciPy, which is more oriented toward mathematics, science, and engineering.
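A minimal sketch of what that might look like for your case (vec_list and p_val are the names from your code; note that for n near 200,000 the full n x n result will not fit in memory, so you would have to process the rows in chunks):
import numpy as np
from scipy.spatial.distance import cdist
vecs = np.asarray(vec_list, dtype=float)                         # shape (n, d)
dist_euclidean = cdist(vecs, vecs)                               # all-pairs Euclidean, shape (n, n)
dist_minkowski = cdist(vecs, vecs, metric='minkowski', p=p_val)  # all-pairs Minkowski of order p_val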
I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.ckdtree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
...
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the ID's of all pairs of points that are within a cutoff distance "d" of each other.
scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the "ID's of all pairs of points that are within a cutoff distance "d" of each other", as you desired.
Shortcomings
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
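If the full matrix ever does become the bottleneck, a sketch of the cKDTree approach mentioned in the question (reusing the same pos and d as above) would be:
from scipy.spatial import cKDTree
tree = cKDTree(pos)
pairs = tree.query_pairs(r=d)  # set of (i, j) index pairs with i < j whose points lie within distance d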
I want to compute the distance from a set of N 3D-points to a set of M 3D-centers and store the results in a NxM matrix (where column i is the distance from all points to center i)
Example:
data = np.random.rand(100,3) # 100 toy 3D points
centers = np.random.rand(20,3) # 20 toy 3D points
For computing the distance between all points and a single center we can use "broadcasting" so we avoid looping through all points:
i = 0 # first center
np.sqrt(np.sum(np.power(data - centers[i,:], 2),1)) # Euclidean distance
Now we can put this code in a loop that iterates over all centers:
distances = np.zeros((data.shape[0], centers.shape[0]))
for i in range(centers.shape[0]):
    distances[:,i] = np.sqrt(np.sum(np.power(data - centers[i,:], 2), 1))
However this is clearly an operation that could be parallelized and improved.
I'm wondering if there is a better way of doing this (maybe some multi-dimensional broadcasting or some library).
This is a very common problem in clustering and classification, where you want the distances from your data to a set of classes, so I think there should be an efficient implementation for it.
What's the best way of doing this?
Broadcast all the way:
import numpy as np
data = np.random.rand(100,3)
centers = np.random.rand(20,3)
distances = np.sqrt(np.sum(np.power(data[:,None,:] - centers[None,:,:], 2), axis=-1))
print(distances.shape)
# (100, 20)
If you just want the nearest center, and you have a lot of data points (a lot being more than several hundred thousand samples), you are probably better off building a KD-tree on the centers and querying it with your data points (scipy.spatial.KDTree).
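A minimal sketch of that idea (my own illustration, reusing data and centers from above):
from scipy.spatial import KDTree
tree = KDTree(centers)                           # index the 20 centers
dist_to_nearest, nearest_idx = tree.query(data)  # nearest-center distance and index for each of the 100 points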
What is the fastest way of determining which point q out of n points in 2D space is the closest (smallest Euclidean distance) to a point p? See the attached image.
My current method of doing this in Python is storing all the distances in a list and then running
numpy.argmin(list_of_distances)
This is, however, a bit slow when calculating it for m points p. Or is it?
Instead of calculating the distances, you could calculate the squared distances. That way you don't need to perform n * m square roots.
This falls under closest-point query problems.
How many points are expected? Are your points static or do they change? One naive but powerful approach for static points would be to pre-compute every known distance, which would result in O(1) lookup.
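A minimal sketch of that precomputation idea (array names and sizes are mine, just for illustration):
import numpy as np
from scipy.spatial.distance import cdist
ps = np.random.rand(5, 2)      # m query points p
qs = np.random.rand(1000, 2)   # n static points q
all_dists = cdist(ps, qs)              # precompute every p-to-q distance once, shape (m, n)
closest_q = all_dists.argmin(axis=1)   # for each p, the index of its closest q is then a cheap lookup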
Put everything into numpy as soon as possible and do the calculations there. If you have many points, it is much faster than calculating distances on Python lists:
import numpy as np

# px, py are the coordinates of the query point p (assumed to be given)
x = np.fromiter((point.x for point in points), dtype=float)
y = np.fromiter((point.y for point in points), dtype=float)

# index of the closest point; squared distances suffice for argmin, so no square root is needed
i_closest = np.argmin((x - px) ** 2 + (y - py) ** 2)