memory efficient euclidean distance measurement - python

I have 40,000 points and I need to find the Euclidean distance between every pair. After searching around, I found that an efficient way of calculating the Euclidean distance between pairs of points is scipy.spatial.distance.cdist. But since the number of points is 40,000, the distance matrix will take around 12 GB of memory.
Is there a way of reducing the memory required to store the distance matrix without compromising the speed of calculating it? Can the data type be changed to float32 instead of float64 in the calculation of the distance matrix?

cdist-like approach
The output datatype is the same as the input datatype.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0]-vec_2[j, 0])**2 + (vec_1[i, 1]-vec_2[j, 1])**2 + (vec_1[i, 2]-vec_2[j, 2])**2)
    return res
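Regarding the float32 question: since the output dtype follows the input dtype, casting the points to float32 before the call should roughly halve the memory of the full matrix. A minimal usage sketch (the 40,000 x 3 shape is taken from the question; the random data is just a placeholder):
points = np.random.rand(40_000, 3).astype(np.float32)  # float32 input
dist = calc_distance(points, points)                   # float32 output, ~6.4 GB instead of ~12.8 GB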
Approach without repetitions
@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    # stores only the n*(n-1)//2 unique pairs (condensed form)
    res = np.empty(((vec.shape[0]**2)//2 - vec.shape[0]//2), dtype=vec.dtype)
    ii = 0
    for i in range(vec.shape[0]):
        for j in range(i+1, vec.shape[0]):
            res[ii] = np.sqrt((vec[i, 0]-vec[j, 0])**2 + (vec[i, 1]-vec[j, 1])**2 + (vec[i, 2]-vec[j, 2])**2)
            ii += 1
    return res
This cuts the amount of memory to less than 1/4 of the scipy cdist approach: storing only the unique pairs halves it, and passing float32 input instead of getting float64 output halves it again.
Timings
calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
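A quick way to sanity-check both variants against scipy on a smaller sample (the sizes below are arbitrary); the condensed output of calc_distance_pairs uses the same pair ordering as scipy, so squareform can expand it back to the dense matrix:
from scipy.spatial.distance import cdist, squareform
pts = np.random.rand(1_000, 3).astype(np.float32)
full = calc_distance(pts, pts)        # dense (n, n) matrix
condensed = calc_distance_pairs(pts)  # n*(n-1)//2 unique distances
assert np.allclose(squareform(condensed), full, atol=1e-4)
assert np.allclose(full, cdist(pts, pts), atol=1e-4)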

Related

Vectorizing calculating the hydrodynamic radius in numpy

I have a polymer with coordinates stored in an Nx3 numpy array, where N is the number of particles in the polymer (the degree of polymerization).
I am trying to calculate the hydrodynamic radius of this polymer. The hydrodynamic radius is given by the first expression found in this link. The hydrodynamic radius, Rh, is essentially a harmonic average over the pairwise distances.
Given that P is the Nx3 array, this is my current numpy-pythonic implementation:
inv_dist = 0
for i in range(N-1):
    for j in range(i+1, N):
        inv_dist += 1/np.linalg.norm(P[i,:] - P[j,:], 2)
Rh = 1/(inv_dist/(N**2))
np is numpy in this case. I am aware that the Wikipedia formula asks for an ensemble average, which would mean looping over every possible configuration of the polymer in my simulation. In any event, the two loops mentioned above would still have to be computed.
This is a nested for loop with N(N-1)/2 iterations. As N gets large, this computation becomes increasingly taxing. How can I vectorize this code to bypass the for loops, at least to an extent?
I would appreciate any advice you have for me.
You can use scipy.spatial.distance.pdist:
from scipy.spatial.distance import pdist
inv_dist = (1/pdist(P)).sum()
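To complete the Rh calculation along the lines of the original loop (P and N are the array and particle count from the question; the N**2 normalization is copied from the question's code):
from scipy.spatial.distance import pdist
inv_dist = (1/pdist(P)).sum()   # sum of 1/r over the N*(N-1)/2 unique pairs
Rh = 1/(inv_dist/(N**2))        # same normalization as the original loop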

Calculate distances among a set of coordinates

Is there a more efficient way to calculate the Euclidean distance among a given set of points?
This is the code I use:
def all_distances(position):
    distances = np.zeros((N_circles, N_circles))
    for i in range(N_circles):
        for j in range(i, N_circles):
            distances[i][j] = calculate_distance(position[i], position[j])
    return distances

def calculate_distance(p1, p2):
    return math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
position is an array containing the coordinates of N_circles points.
You could use pdist and squareform from scipy:
from scipy.spatial.distance import pdist, squareform
distances = pdist(position, metric="euclidean")
distance_matrix = squareform(distances)
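A brief usage sketch, assuming position is an (N_circles, 2) array as in the question; squareform fills in both triangles of the symmetric matrix, so any pair can be looked up directly:
import numpy as np
from scipy.spatial.distance import pdist, squareform

position = np.random.rand(5, 2)   # hypothetical coordinates
distance_matrix = squareform(pdist(position, metric="euclidean"))
print(distance_matrix[1, 3])      # equals distance_matrix[3, 1]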
You can use linalg to calculate the norm. You can also define a function that evaluates a hypersphere equation, which includes the circle as a special case:
import numpy as np

def distance(w, x, b=0):
    w_norm = np.linalg.norm(w, 2)
    return abs(np.dot(w, x) + b) / w_norm
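A quick usage example with made-up values: the function above returns the distance from a point x to the hyperplane w.x + b = 0 (a line in 2D):
w = np.array([3.0, 4.0])      # hypothetical normal vector
x = np.array([1.0, 2.0])      # hypothetical point
print(distance(w, x, b=-5))   # |3*1 + 4*2 - 5| / 5 = 1.2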
**2 may use some "power" subroutine. It may be faster to use a multiply.
If there is a hypot() in the library, use it.
You are keeping only the distance from i to j (where i <= j). Maybe you want to store [j][i] as well?
Alternatively, when looking up a distance you can index with [min(i,j)][max(i,j)]. (I can't tell whether this is less overhead.)
The code also computes [i][i]. Won't that always be zero? That is, perhaps you want range(i+1, N_circles), and you may or may not need to store the 0.
Do all the distances change every time? If not, is there some way to recompute only the ones that changed? (This is a sample of "out of the box" thinking. There may be other tricks that can be used.)
Here's another...
Don't use SQRT at all. Instead, keep the squared distances. It is sufficient for deciding which is "closer" -- if that is all you need it for. (I used this out-of-the-box trick successfully in one project.)
How many times do you look up a 'distance' before recomputing it? If <= 'once', then don't bother pre-calculating. Simply calculate on the fly. (Actually the cutoff is a little more than 1.0, because of the overhead of creating and maintaining distance[])
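A minimal sketch of two of the suggestions above, reusing the question's calculate_distance signature: math.hypot replaces the explicit **2 and sqrt, and a squared-distance variant skips the square root entirely when only ordering matters:
import math

def calculate_distance(p1, p2):
    # hypot computes sqrt(dx*dx + dy*dy) without explicit powers
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def squared_distance(p1, p2):
    # no sqrt: enough to decide which of two points is closer
    dx, dy = p1[0] - p2[0], p1[1] - p2[1]
    return dx*dx + dy*dy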

Euclidean distance between two python matrices without a double for-loop?

I am working with two numpy matrices, U (dimensions Nu x 3) and M (dimensions 3 x Nm).
U contains Nu users and 3 features.
M contains Nm movies (and the same 3 features).
For each user in U, I would like to calculate its Euclidean distance to every movie in M (so I need to compute Nu*Nm Euclidean distances).
Is this possible without an explicit double for-loop? I am working with large matrices and the double for-loop will probably take too much time.
Thanks in advance.
Check out scipy.spatial.distance.cdist. Something like this will do:
from scipy.spatial.distance import cdist
dist = cdist(U, M.T)
I'm afraid not. You need to compute the Euclidean distance for every pair of (user, movie), so you'll have a time complexity of numOfUsers * numOfMovies, which is what a double for-loop gives. You can't do fewer operations than that, unless you're willing to skip some pairs. The best you can do is optimize the Euclidean distance calculation itself, but the number of operations is going to be quadratic one way or the other.
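For completeness, here is a sketch of how the same Nu x Nm matrix could be computed with NumPy broadcasting rather than an explicit Python double loop (the quadratic work still happens, just inside NumPy); the sizes are placeholders:
import numpy as np

U = np.random.rand(100, 3)   # (Nu, 3) users
M = np.random.rand(3, 50)    # (3, Nm) movies

# Broadcast to a (Nu, Nm, 3) difference array, then reduce over the feature axis.
diff = U[:, None, :] - M.T[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))   # shape (Nu, Nm), matches cdist(U, M.T)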

Pandas Matrix to Distance Matrix as fast as possible

I want to calculate an NxN similarity matrix using the cosine distance formula from sklearn. My problem is that my matrix is very, very large: it has about 1000 entries. My current approach is very slow and I need a real speed-up. Can anybody help me speed up the code?
for i in similarity_matrix.columns:
    for j in similarity_matrix.columns:
        if i == j:
            similarity_matrix.ix[i, j] = 0
        else:
            similarity_matrix.ix[i, j] = cosine(documents[int(i)], documents[int(j)])
Bonus task: In addition I would like to use the weighted cosine formula. But it seems not to be implemented in sklearn? Is that true?
Using for-loops is not the ideal solution. I would recommend falling back to the pdist function of scipy. My read is that you don't mean your matrix has 1000 entries but that it is 1000x1000? Either way, scipy can handle this easily.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

res = pdist(documents.T, 'cosine')
distances = 1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns)
I have trouble understanding what your weight vector looks like. Is it a constant value? pdist allows custom functions; for example, you can calculate your cosine distance using numpy (which is also really fast):
pdist(X, lambda u, v: np.dot(np.dot(u, v), weightvec) / (norm(np.multiply(u, weightvec)) * norm(np.multiply(v, weightvec))))
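For a self-contained version, here is one possible reading of "weighted cosine" (weight both vectors elementwise before taking the cosine); weightvec and the data shape are made-up examples:
import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(10, 4)                   # 10 documents, 4 features (hypothetical)
weightvec = np.array([1.0, 0.5, 2.0, 1.0])  # hypothetical feature weights

def weighted_cosine(u, v, w=weightvec):
    uw, vw = u * w, v * w
    return 1 - np.dot(uw, vw) / (norm(uw) * norm(vw))

distances = squareform(pdist(X, metric=weighted_cosine))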

Optimize Hamming Distance Python

I have around 1M binary numpy arrays and I need the Hamming distance between them to find the k nearest neighbours; the fastest method I have found uses cdist, which returns a float matrix of distances.
Since I don't have enough memory to hold a 1M x 1M float matrix, I'm doing it one element at a time, like this:
from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')
The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to run it for different values of k).
Is there any faster way to do it?
I'm thinking about multiprocessing, or writing it in C, but I have some trouble understanding how multiprocessing works in Python and I don't know how to mix C code with Python code.
If you want to compute the k-nearest neighbors, it may not be necessary to compute all n^2 pairs of distances. Instead, you can use a Kd tree or a ball tree (both are data structures for efficiently querying relations between a set of points).
Scipy has a package called scipy.spatial.kdtree; it does not, however, currently support Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn) do have an implementation of a ball tree with Hamming distance supported. Here's a small example using sklearn's ball tree.
from sklearn.neighbors import BallTree
import numpy as np

# Generate random binary data.
data = np.random.randint(0, 2, size=(10, 10))

# Build the BallTree with Hamming distance as the metric.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)
print(neighbors)   # Row n has the nth vector's k closest neighbors.
print(distances)   # Same idea, but the Hamming distances to those neighbors.
Now for the big caveat. For high dimensional vectors, KDTree and BallTree become comparable to the brute force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
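If the tree-based approach ends up no better than brute force in high dimensions, one alternative sketch is to keep cdist but process the query rows in chunks and retain only the k nearest per row, so the full 1M x 1M matrix never exists in memory (the function name, chunk size and k are made up for illustration):
import numpy as np
from scipy.spatial.distance import cdist

def knn_hamming_chunked(data, k=3, chunk=1000):
    n = data.shape[0]
    neighbors = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):
        block = cdist(data[start:start+chunk], data, 'hamming')  # only (chunk, n) in memory
        # argpartition picks the k smallest per row without a full sort
        neighbors[start:start+chunk] = np.argpartition(block, k, axis=1)[:, :k]
    return neighbors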
