I have a polymer with coordinates stored in an Nx3 numpy array, where N is the number of particles in the polymer (the degree of polymerization).
I am trying to calculate the hydrodynamic radius of this polymer. The hydrodynamic radius is given by the first expression found in this link. The hydrodynamic radius Rh is essentially a harmonic average over the pairwise distances.
Given that P is the Nx3 array, this is my current numpy-pythonic implementation:
inv_dist = 0
for i in range(N - 1):
    for j in range(i + 1, N):
        inv_dist += 1 / np.linalg.norm(P[i, :] - P[j, :], 2)
Rh = 1 / (inv_dist / N**2)
np is numpy in this case. I am aware that the Wikipedia formula asks for an ensemble average, which would mean looping over every configuration of the polymer in my simulation. In any event, the two loops above would still have to be computed for each configuration.
This is a nested for loop with N(N-1)/2 iterations. As N gets large, the computation becomes increasingly taxing. How can I vectorize this code to avoid the explicit for loops?
I would appreciate any advice you have for me.
You can use scipy.spatial.distance.pdist:
from scipy.spatial.distance import pdist
inv_dist = (1/pdist(P)).sum()
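pdist(P) returns all N(N-1)/2 pairwise Euclidean distances as a flat condensed array, so the reciprocal sum replaces both loops at once. A minimal sketch of the full computation, reusing P and N from your question:

import numpy as np
from scipy.spatial.distance import pdist

inv_dist = (1 / pdist(P)).sum()  # sum of 1/r_ij over all pairs i < j
Rh = 1 / (inv_dist / N**2)       # same normalization as your loop version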
I have 200 data points; each point is a list of 3 numbers representing its position. I want to sample N=100 points from this 3D space, with the constraint that the minimum distance between any two sampled points must be larger than 0.15. The script below is how I sample the points, but it just keeps running and never stops. Also, if I set N above a certain value, the code cannot find all N points: since I sample each point randomly, it reaches a state where no remaining point is far enough from the ones already chosen, even though in reality N could be much larger if the point distribution is very "dense" (while still satisfying the 0.15 minimum distance). Is there a more efficient way to do this?
import numpy as np
import random
import time

def get_random_points_not_too_close(points, npoints, min_distance):
    random.shuffle(points)
    final_points = [points[0]]
    # loops forever once no remaining point is farther than min_distance
    # from every already-selected point
    while len(final_points) < npoints:
        for point in points:
            if point in final_points:
                continue
            elif min([np.linalg.norm(np.array(p) - np.array(point)) for p in final_points]) > min_distance:
                final_points.append(point)
    return final_points

data = [[random.random() for i in range(3)] for j in range(200)]
t1 = time.time()
sample_points = get_random_points_not_too_close(points=data, npoints=100, min_distance=0.15)
t2 = time.time()
print(t2 - t1)
Your algorithm can work for small point sets, but it does not run in deterministic time.
I did the following to create a random forest (simulated trees): first generate a square grid whose spacing is 3x the minimum distance, then translate each grid point in a random direction by a random distance of at most the minimum distance. Since two grid points start 3d apart and each moves by at most d, the resulting points can never be closer than d, your minimum distance; see the sketch below.
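A minimal 2D sketch of that jittered-grid idea (the function name and parameters are mine; min_distance plays the role of your 0.15):

import numpy as np

def jittered_grid(n_side, min_distance, rng=None):
    # grid spacing is 3x the minimum distance; each point then moves by at
    # most min_distance, so any two points stay >= 3d - 2d = d apart
    rng = np.random.default_rng() if rng is None else rng
    spacing = 3 * min_distance
    gx, gy = np.meshgrid(np.arange(n_side), np.arange(n_side))
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1) * spacing
    theta = rng.uniform(0, 2 * np.pi, len(grid))
    r = rng.uniform(0, min_distance, len(grid))
    return grid + np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)

points = jittered_grid(n_side=10, min_distance=0.15)  # 100 points, at least 0.15 apart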
I have 40,000 points and I need to find the Euclidean distance between each pair of points. After searching the net, I found that an efficient way of calculating pairwise Euclidean distances is scipy.spatial.distance.cdist. But since there are 40,000 points, the distance matrix will take around 12 GB of memory.
Is there a way to reduce the memory required to store the distance matrix without compromising the speed of calculating it? Can the data type be changed to float32 instead of float64 in the calculation of the distance matrix?
A cdist-like approach
The output dtype is the same as the input dtype, so float32 input gives a float32 result.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0] - vec_2[j, 0])**2
                                + (vec_1[i, 1] - vec_2[j, 1])**2
                                + (vec_1[i, 2] - vec_2[j, 2])**2)
    return res
An approach without repetitions
@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    # condensed output: one entry per unordered pair, n*(n-1)/2 values in total
    res = np.empty((vec.shape[0]**2) // 2 - vec.shape[0] // 2, dtype=vec.dtype)
    ii = 0
    for i in range(vec.shape[0]):
        for j in range(i + 1, vec.shape[0]):
            res[ii] = np.sqrt((vec[i, 0] - vec[j, 0])**2
                              + (vec[i, 1] - vec[j, 1])**2
                              + (vec[i, 2] - vec[j, 2])**2)
            ii += 1
    return res
This cuts the memory to less than a quarter of what the scipy cdist approach needs: the condensed output stores only half the entries, and using float32 instead of float64 halves the bytes again.
Timings
calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
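For the 40,000-point case from the question, a sketch of a float32 call (random data for illustration):

import numpy as np

points = np.random.rand(40_000, 3).astype(np.float32)
d = calc_distance_pairs(points)  # ~8e8 float32 values, about 3.2 GB, vs ~12.8 GB for the full float64 matrix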
I want to calculate an NxN similarity matrix using the cosine distance formula of sklearn. My problem is that my matrix is very large; it has about 1000 entries. My current approach is very slow and I need a real speed-up. Can anybody help me speed up this code?
for i in similarity_matrix.columns:
    for j in similarity_matrix.columns:
        if i == j:
            similarity_matrix.ix[i, j] = 0
        else:
            similarity_matrix.ix[i, j] = cosine(documents[int(i)], documents[int(j)])
Bonus task: in addition, I would like to use a weighted cosine formula, but it does not seem to be implemented in sklearn. Is that true?
Using for loops is not the ideal solution. I would recommend falling back to scipy's pdist function. My reading is that you don't mean your matrix has 1000 entries, but rather that it is 1000x1000. Scipy can handle that easily.
import pandas as pd
from scipy.spatial.distance import pdist, squareform

res = pdist(documents.T, 'cosine')
distances = 1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns)
I have trouble understanding what your weight vector looks like. Is it a constant? pdist also lets you pass a custom function, so you can compute a weighted cosine directly with numpy (which is also really fast). Note that the metric must return a scalar; taking the cosine of the element-wise weighted vectors does that:

import numpy as np
from numpy.linalg import norm

pdist(X, lambda u, v: np.dot(np.multiply(u, weightvec), np.multiply(v, weightvec))
                      / (norm(np.multiply(u, weightvec)) * norm(np.multiply(v, weightvec))))
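A self-contained sketch of that custom metric with made-up data (X and weightvec here are just example arrays):

import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 10)       # 5 documents with 10 features (example data)
weightvec = np.random.rand(10)  # hypothetical feature weights

def weighted_cosine(u, v, w=weightvec):
    wu, wv = u * w, v * w
    return np.dot(wu, wv) / (norm(wu) * norm(wv))  # cosine of the weighted vectors

sim = squareform(pdist(X, metric=weighted_cosine))  # NxN, diagonal stays 0 as in the loop version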
I have a list of n polar coordinates, and a distance function which takes in two coordinates.
I want to create an n x n matrix which contains the pairwise distances under my function. I realize I probably need to use some form of vectorization with numpy but am not sure exactly how to do so.
A simple code segment is below for your reference
import numpy as np
length = 10
coord_r = np.random.rand(length)*10
coord_alpha = np.random.rand(length)*np.pi
# Repeat vector to matrix form
coord_r_X = np.tile(coord_r, [length,1])
coord_r_Y = coord_r_X.T
coord_alpha_X = np.tile(coord_alpha, [length,1])
coord_alpha_Y = coord_alpha_X.T
# law of cosines: d^2 = r1^2 + r2^2 - 2*r1*r2*cos(alpha1 - alpha2)
matDistance = np.sqrt(coord_r_X**2 + coord_r_Y**2 - 2*coord_r_X*coord_r_Y*np.cos(coord_alpha_X - coord_alpha_Y))
print(matDistance)
You can use scipy.spatial.distance.pdist. However, if the distance you want to calculate is the Euclidean distance, you may be better off just converting your points to rectangular coordinates, since then pdist will do the calculations quite quickly using its builtin Euclidean distance.
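For the Euclidean case the conversion is a two-liner; a sketch reusing coord_r and coord_alpha from the question:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# polar -> rectangular, then pdist's fast built-in Euclidean metric
xy = np.column_stack([coord_r * np.cos(coord_alpha), coord_r * np.sin(coord_alpha)])
matDistance = squareform(pdist(xy))  # same result as the law-of-cosines formula above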
What is the fastest way of determining which point q out of n points in 2D space is the closest (smallest Euclidean distance) to a point p? See the attached image.
My current method of doing this in Python is storing all the distances in a list and then running
numpy.argmin(list_of_distances)
This is, however, a bit slow when I repeat it for m different points p. Or is it?
Instead of calculating the distances, you could calculate the squared distances. That way you don't need to perform n * m square roots.
This falls under closest-point-query problems.
How many points are expected? Are your points static or do they change? One naive but powerful approach for static points would be to pre-compute every known distance, which would result in O(1) lookup.
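If both the point set and the queries are known up front, a sketch of that precomputation using scipy's cdist (points and queries are hypothetical (n, 2) and (m, 2) arrays):

import numpy as np
from scipy.spatial.distance import cdist

dists = cdist(queries, points)      # m x n distance matrix, computed once
nearest = np.argmin(dists, axis=1)  # index of the closest point for every query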
Put everything into numpy as soon as possible and do the calculations there. If you have many points, this is much faster than computing distances on lists:
import numpy as np

px, py = p.x, p.y  # coordinates of the query point p (assuming p has the same attributes as points)
x = np.fromiter((point.x for point in points), dtype=float)
y = np.fromiter((point.y for point in points), dtype=float)
i_closest = np.argmin((x - px) ** 2 + (y - py) ** 2)  # squared distances suffice for argmin