Euclidean distance between two Python matrices without a double for-loop? - python

I am working with two NumPy matrices, U (dimensions Nu x 3) and M (dimensions 3 x Nm).
U contains Nu users and 3 features.
M contains Nm movies (and the same 3 features).
For each user in U, I would like to calculate its Euclidean distance to every movie in M (so I need to compute Nu*Nm Euclidean distances).
Is this possible without an explicit double for-loop? I am working with large matrices and the double for-loop will probably take too much time.
Thanks in advance.

Check out scipy.spatial.distance.cdist. Something like this will do:
from scipy.spatial.distance import cdist
dist = cdist(U, M.T)  # M is 3 x Nm, so transpose it to Nm x 3; the result has shape (Nu, Nm)

I'm afraid the number of operations can't be reduced. You need to compute the Euclidean distance for every (user, movie) pair, so you'll perform numOfUsers * numOfMovies distance calculations no matter what, unless you're willing to skip some pairs. What you can eliminate is the explicit Python-level double for-loop, by letting vectorized NumPy/SciPy code do the iteration; beyond that, the best you can do is optimize the individual Euclidean distance calculation, but the number of operations will be quadratic one way or the other.
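For reference, a minimal broadcasting sketch (assuming U is Nu x 3 and M is 3 x Nm, as in the question) that computes all Nu*Nm distances with the double loop pushed into compiled code:
import numpy as np

# diff[i, :, j] = U[i, :] - M[:, j]; broadcasting performs the double loop in C
diff = U[:, :, None] - M[None, :, :]     # shape (Nu, 3, Nm)
dist = np.sqrt((diff ** 2).sum(axis=1))  # shape (Nu, Nm)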

Related

Calculate distances among a set of coordinates

Is there a more efficient way to calculate the Euclidean distance among a given set of points?
This is the code I use:
import math
import numpy as np

# N_circles (the number of points) is assumed to be defined globally, as in the question
def all_distances(position):
    distances = np.zeros((N_circles, N_circles))
    for i in range(N_circles):
        for j in range(i, N_circles):
            distances[i][j] = calculate_distance(position[i], position[j])
    return distances

def calculate_distance(p1, p2):
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)
position is an array containing the coordinates of N_circles points.
You could use pdist and squareform from SciPy:
from scipy.spatial.distance import pdist, squareform
distances = pdist(position, metric="euclidean")  # condensed vector of N*(N-1)/2 pairwise distances
distance_matrix = squareform(distances)          # full symmetric N x N matrix
You can use np.linalg.norm to calculate the norm. You can also define a function that generalizes to higher dimensions; the one below computes the distance from a point x to the hyperplane w . x + b = 0:
import numpy as np

def distance(w, x, b=0):
    # distance from point x to the hyperplane w . x + b = 0
    w_norm = np.linalg.norm(w, 2)
    return abs(np.dot(w, x) + b) / w_norm
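A quick usage check of the helper above: the distance from the point (1, 2) to the line 3x + 4y - 5 = 0, i.e. w = [3, 4] and b = -5:
print(distance(np.array([3.0, 4.0]), np.array([1.0, 2.0]), b=-5.0))  # |3 + 8 - 5| / 5 = 1.2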
**2 may call a generic "power" subroutine. It may be faster to use an explicit multiply.
If there is a hypot() function in your library, use it (Python has math.hypot).
You are keeping the distance from i to j (where i <= j). Maybe you want to store [j][i] as well?
Alternatively, when looking up a distance you can index with [min(i, j)][max(i, j)]. (I can't tell whether this is less overhead.)
The code also computes [i][i]. Won't that always be zero? That is, perhaps you want range(i+1, N_circles). And you may or may not need to store the zeros.
Do all the distances change every time? If not, is there some way to recompute only the ones that changed? (This is a sample of "out of the box" thinking. There may be other tricks that can be used.)
Here's another...
Don't use SQRT at all. Instead, keep the squared distances. It is sufficient for deciding which is "closer" -- if that is all you need it for. (I used this out-of-the-box trick successfully in one project.)
How many times do you look up a distance before recomputing it? If the answer is at most once, don't bother pre-calculating; simply calculate on the fly. (Actually the cutoff is a little above 1.0, because of the overhead of creating and maintaining distances[].)
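A minimal sketch of the no-sqrt trick, assuming position is an N x 2 NumPy array and the goal is only to find the nearest point (nearest_index is a hypothetical helper, not from the question):
import numpy as np

def nearest_index(position, p):
    # squared distances preserve ordering, so argmin is unchanged
    # and no square root is ever computed
    d2 = ((position - p) ** 2).sum(axis=1)
    return np.argmin(d2)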

Fastest way to calculate Euclidean and Minkowski distance between all the vectors in a list of lists python

I have been trying for a while now to calculate the Euclidean and Minkowski distance between all the vectors in a list of lists. I don't have much advanced mathematical knowledge.
I usually work with 4- or 5-dimensional vectors.
The vector list can range in size from 0 to around 200,000.
When calculating the distance, all the vectors have the same number of dimensions.
I have relied on these two questions during the process:
python numpy euclidean distance calculation between matrices of row vectors
Calculate Euclidean Distance between all the elements in a list of lists python
At first my code looked like this:
import numpy as np

def euclidean_distance_np(vec_list, single_vec):
    dist = (np.array(vec_list) - single_vec) ** 2
    dist = np.sum(dist, axis=1)
    dist = np.sqrt(dist)
    return dist

def minkowski_distance_np(vec_list, single_vec, p_val):
    dist = (np.abs(np.array(vec_list, dtype=np.int64) - single_vec) ** p_val).sum(axis=1) ** (1 / p_val)
    return dist
This worked well when I had a small number of vectors: I would calculate the distance from a single vector to all the vectors in the list, and repeat the process for every vector in the list one by one.
But once the list became 5 or 6 digits in length, these functions became extremely slow.
I managed to improve the Euclidean distance calculation like so:
x = np.array([v[0] for v in vec_list])
y = np.array([v[1] for v in vec_list])
z = np.array([v[2] for v in vec_list])
w = np.array([v[3] for v in vec_list])
t = np.array([v[4] for v in vec_list])
res = np.sqrt(np.square(x - x.reshape(-1, 1))
              + np.square(y - y.reshape(-1, 1))
              + np.square(z - z.reshape(-1, 1))
              + np.square(w - w.reshape(-1, 1))
              + np.square(t - t.reshape(-1, 1)))
But I cannot figure out how to adapt the calculation method above to correctly compute the Minkowski distance.
So, to be precise, my question is: how can I calculate the Minkowski distance in a similar way to the code above?
I would also appreciate any ideas for improvement or better ways to perform the calculations.
Scipy has already implemented distance functions: minkowski, euclidean. But probably what you need is cdist.
NumPy is a great tool for matrix manipulation, but it doesn't contain every possible function. You can find most of the additional features and operations in SciPy, which is more oriented toward mathematics, science, and engineering.
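For instance, a small sketch (assuming vec_list and p_val from the question) that computes both metrics for all pairs at once:
from scipy.spatial.distance import cdist
import numpy as np

arr = np.asarray(vec_list, dtype=float)
eucl = cdist(arr, arr)                               # all-pairs Euclidean
mink = cdist(arr, arr, metric="minkowski", p=p_val)  # all-pairs Minkowski

# The same Minkowski result via broadcasting, mirroring the reshape trick above:
diff = np.abs(arr[:, None, :] - arr[None, :, :])     # shape (n, n, d)
mink_np = (diff ** p_val).sum(axis=-1) ** (1.0 / p_val)
Note that for 200,000 vectors the full n x n matrix will not fit in memory, so this form is only practical for the smaller lists.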

memory efficient euclidean distance measurement

I have 40,000 points and I need to find the Euclidean distance between each pair of points. After searching the net, I found that the efficient way of calculating Euclidean distances between pairs of points is scipy.spatial.distance.cdist. But since there are 40,000 points, the distance matrix will take around 12 GB of memory.
Is there a way of reducing the memory required to store the distance matrix without compromising the speed of calculating it? Can the data type be changed to float32 instead of float64 in the calculation of the distance matrix?
A cdist-like approach
The output datatype is the same as the input datatype.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0] - vec_2[j, 0]) ** 2
                                + (vec_1[i, 1] - vec_2[j, 1]) ** 2
                                + (vec_1[i, 2] - vec_2[j, 2]) ** 2)
    return res
Approach without repetitions
@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    # condensed output: n*(n-1)/2 entries, each unordered pair stored once
    res = np.empty((vec.shape[0] ** 2) // 2 - vec.shape[0] // 2, dtype=vec.dtype)
    ii = 0
    for i in range(vec.shape[0]):
        for j in range(i + 1, vec.shape[0]):
            res[ii] = np.sqrt((vec[i, 0] - vec[j, 0]) ** 2
                              + (vec[i, 1] - vec[j, 1]) ** 2
                              + (vec[i, 2] - vec[j, 2]) ** 2)
            ii += 1
    return res
This cuts the amount of memory to less than 1/4 of the scipy cdist approach.
Timings
calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
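On the float32 part of the question: since the Numba functions above allocate the result with the input dtype, casting the points to float32 roughly halves the memory. A small usage sketch (the random 40,000-point array is just a stand-in for the real data):
import numpy as np

points = np.random.rand(40_000, 3).astype(np.float32)
d = calc_distance(points, points)  # float32 output: ~6.4 GB instead of ~12.8 GB for float64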

Fastest way to calculate Euclidean distance in 2D space

What is the fastest way of determining which point q out of n points in 2D space is the closest (smallest Euclidean distance) to a point p? See the attached image.
My current method of doing this in Python is storing all the distances in a list and then running
numpy.argmin(list_of_distances)
This is, however, a bit slow when calculating this for m points p. Or is it?
Instead of calculating the distances, you could calculate the squared distances. That way you don't need to perform n * m square roots.
This falls under closest-point-query problems.
How many points are expected? Are your points static or do they change? One naive but powerful approach for static points would be to pre-compute every known distance, which would result in O(1) lookup.
Put everything as soon as possible into numpy and do calculations there. If you have many points, it is much faster than calculating distances in lists:
import numpy as np

px, py = p.x, p.y  # coordinates of the query point p (assuming the same point objects as below)
x = np.fromiter((point.x for point in points), dtype=float)
y = np.fromiter((point.y for point in points), dtype=float)
i_closest = np.argmin((x - px) ** 2 + (y - py) ** 2)  # squared distances suffice for argmin
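When there are many query points p, a spatial index avoids the O(n) scan per query. A hedged sketch using SciPy's cKDTree (query_points is a hypothetical m x 2 array of p coordinates):
import numpy as np
from scipy.spatial import cKDTree

tree = cKDTree(np.column_stack((x, y)))  # build once over the n points
dists, idxs = tree.query(query_points)   # nearest neighbour for each p, ~O(log n) per query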

Using Numpy to find the average distance in a set of points

I have an array of points in unknown dimensional space, such as:
import numpy

data = numpy.array([[ 115,   241,   314],
                    [ 153,   413,   144],
                    [ 535,  2986, 41445]])
and I would like to find the average Euclidean distance between all pairs of points.
Please note that I have over 20,000 points, so I would like to do this as efficiently as possible.
Thanks.
If you have access to scipy, you could try the following:
scipy.spatial.distance.cdist(data,data)
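Note that cdist(data, data) counts each pair twice and includes the zero diagonal, so for the average it is cleaner to use the condensed form; a small sketch:
from scipy.spatial.distance import pdist

# pdist returns each unordered pair exactly once, so the mean needs no correction
avg = pdist(data, metric="euclidean").mean()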
Well, I don't think that there is a super fast way to do this, but this should do it:
tot = 0.
for i in range(data.shape[0] - 1):
    # sum of distances from point i to every later point
    tot += ((((data[i + 1:] - data[i]) ** 2).sum(1)) ** .5).sum()
avg = tot / ((data.shape[0] - 1) * (data.shape[0]) / 2.)
Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both can be computed in O(nd) time. With those, you should be able to find outliers (e.g. excluding points further from the mean than some multiple of the standard deviation), and that filtering process can also be performed in O(nd) time, for a total of O(nd).
You might be interested in a refresher on Chebyshev's inequality.
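A minimal sketch of that O(nd) filter (the cutoff k is an assumed parameter, not from the answer):
import numpy as np

mean = data.mean(axis=0)                 # sample mean: O(nd)
d = np.linalg.norm(data - mean, axis=1)  # distance of each point to the mean
k = 3.0                                  # assumed threshold in standard deviations
outliers = data[d > d.mean() + k * d.std()]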
Is it ever worthwhile to optimize without a working solution? Also, computation of a distance matrix over the entire data set rarely needs to be fast because you only do it once; when you need the distance between two points, you just look it up, as it's already calculated.
So if you don't have a place to start, here's one. If you want to do this in NumPy without the need to write any inline Fortran or C, that should be no problem, though perhaps you want to include the small vector-based virtual machine called "numexpr" (available on PyPI, trivial to install), which in this case gave a 5x performance boost versus NumPy alone.
Below I've calculated a distance matrix for 10,000 points in 2D space (a 10k x 10k matrix giving the distance between every pair of the 10k points). This took 59 seconds on my MBP.
import numpy as NP
import numexpr as NE
# data are points in 2D space (x, y)--obviously, this code can accept data of any dimension
x = NP.random.randint(0, 10, 10000)
y = NP.random.randint(0, 10, 10000)
fnx = lambda q : q - NP.reshape(q, (len(q), 1))
delX = fnx(x)
delY = fnx(y)
dist_mat = NE.evaluate("(delX**2 + delY**2)**0.5")
There's no getting around the number of evaluations: $\sum_{i=0}^{n}(n-i) = \frac{n(n+1)}{2}$
But you can save yourself the expense of all those square roots if you can get by with an approximate result. It depends on your needs.
If you're going to calculate an average, I would advise you to not try putting all the values into an array before calculating. Just calculate the sum (and sum of squares if you need standard deviation as well) and throw away each value as you calculate it.
Since the distance is symmetric, $d(p_i, p_j) = d(p_j, p_i)$, and the sum above evaluates each pair only once, I don't know if this means you have to multiply by two somewhere.
If you want a fast but inexact solution, you could probably adapt the fast multipole method. Points that are separated by a small distance have a smaller contribution to the final average distance, so it would make sense to group points into clusters and compare the distances between clusters.
