Is there a faster way to perform this neighbour finding operation - python

I'm trying to calculate Moran's I in Python (this is the underlying equation). My inputs are an Nx3 array coords containing the coordinates of each point and an Nx3 array z which contains the values minus the overall mean. The operation requires each value of z to be multiplied with the value of every point within a set distance (here set to 1.99). My problem is that in my case N ≈ 2 million, so the find_neighbours operation is very slow. Is there a way I could speed this up?
def find_neighbours(coords, idx, k):
    distances = np.sqrt(np.power(coords - coords[idx], 2).sum(axis=1))
    distances[idx] = np.inf
    return np.argwhere(distances <= k)

z = x - np.mean(x)
n = len(coords)
A = 0
B = np.sum([z[idx]**2 for idx, coord in enumerate(coords)])
S_0 = 0
for idx in range(len(coords)):
    neighbours = find_neighbours(coords, idx, 1.99)
    S_0 += len(neighbours)
    A += np.sum([(z[neighbour] * z[idx]) for neighbour in neighbours])
I = (n / S_0) * (A / B)

This is a classical problem with plenty of literature about it. It is called radius neighbor search in three-dimensional point clouds. You need to store your points in a better data structure to make the search faster; I would suggest an octree.
Check the Python code here and adapt it to your case.
For explanations, check this paper.
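If scipy is an option, a spatial index gives the same result without the O(N²) scan over all points. Below is a minimal sketch (not the octree code linked above) using scipy.spatial.cKDTree.query_pairs, which returns every unordered pair of points within the radius; morans_I is a hypothetical helper name, and it assumes binary weights inside the 1.99 cutoff, which is what the original loop computes:

import numpy as np
from scipy.spatial import cKDTree

def morans_I(coords, x, radius=1.99):
    z = x - np.mean(x)
    tree = cKDTree(coords)
    # every unordered pair (i, j) with i < j and distance <= radius
    # (recent scipy versions accept output_type='ndarray')
    pairs = tree.query_pairs(radius, output_type='ndarray')
    S_0 = 2 * len(pairs)                               # each pair appears twice in the double sum
    A = 2 * np.sum(z[pairs[:, 0]] * z[pairs[:, 1]])    # sum of z_i * z_j over ordered neighbour pairs
    B = np.sum(z**2)
    n = len(coords)
    return (n / S_0) * (A / B)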

Related

How to compute the distance between 3d points in a fast way

I have a 3d point cloud (x,y,z) in a txt file. I want to calculate the 3d distance between each point and all the other points in the point cloud, and save the number of points having distance less than a threshold. I have done this in Python with the code shown below, but it takes too much time; I am asking for a faster approach than the one I have.
from math import sqrt
import numpy as np

points_list = []
with open("D:/Point cloud data/projection_test_data3.txt") as chp:
    for line in chp:
        x, y, z = line.split()
        points_list.append((float(x), float(y), float(z)))

j = 0
Final_density = 0
while j < len(points_list) - 1:
    i = 0
    Density = 0
    while i < len(points_list) - 1:
        if sqrt((points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2) < 0.15:
            Density += 1
        i += 1
    Final_density = Density
    with open("D:/Point cloud data/my_density.txt", 'a') as g:
        g.write("{}\n".format(str(Final_density)))
    j += 1
One (quick) option that might speed this up is to change the position of the file writing/opening so that you're not opening/closing the file as many times.
from math import sqrt
import numpy as np

points_list = []
with open("D:/Point cloud data/projection_test_data3.txt") as chp:
    for line in chp:
        x, y, z = line.split()
        points_list.append((float(x), float(y), float(z)))

j = 0
Final_density = 0
with open("D:/Point cloud data/my_density.txt", 'a') as g:
    while j < len(points_list) - 1:
        i = 0
        Density = 0
        while i < len(points_list) - 1:
            if sqrt((points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2) < 0.15:
                Density += 1
            i += 1
        Final_density = Density
        g.write("{}\n".format(str(Final_density)))
        j += 1
Since it looks like you can use numpy, why not use it? You'll have to make sure the arrays are numpy arrays, but that should be simple. Change

if sqrt(...) < 0.15:

to

if np.linalg.norm(points_list[j] - points_list[i]) < 0.15:

This post (Finding 3d distances using an inbuilt function in python) has other ways to use a prebuilt function to get the 3d distance in Python.
Edit thanks to @KellyBundy's comment:
You can also use np.linalg.norm(points_list - points_list[:, None], axis=-1) to generate a matrix of the distances between all pairs of points in the array. The diagonal will be 0 (the distance between a point and itself) and the matrix will be symmetric about the diagonal, so you only need the upper triangle to get the distance between any given pair of points. Again, you'll have to put all the points into a numpy array in the proper format, e.g. np.array([[point1x, point1y, point1z], [point2x, point2y, point2z], ...]) (https://stackoverflow.com/a/46700369/2391458).
The resulting matrix has the form:

[[0,          d(p0, p1),  d(p0, p2), ...],
 [d(p1, p0),  0,          d(p1, p2), ...],
 ...]

where d(pi, pj) is the distance between points i and j.
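Putting that together, a short sketch of the fully vectorized version (assuming the whole N×N distance matrix fits in memory, which is fine for a few thousand points but not for millions; the file paths are the ones from the question) could look like this:

import numpy as np

points = np.loadtxt("D:/Point cloud data/projection_test_data3.txt")   # shape (N, 3)

# (N, 1, 3) - (1, N, 3) broadcasts to (N, N, 3); the norm over the last axis
# gives the full pairwise distance matrix
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# count neighbours within the threshold, excluding each point itself (the zero diagonal)
density = np.count_nonzero(dists < 0.15, axis=1) - 1
np.savetxt("D:/Point cloud data/my_density.txt", density, fmt="%d")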
Quick x2 speed up – replace i = 0 with i = j + 1. That way you would check each pair only once, not twice.
More fundamental change – you can sort the points by coordinate and use a sliding-window algorithm. The idea is that if points are sorted by x coordinate, the j-th point has x=1, and the i-th point has x=1.01, then they might be near each other and you should check them. But if the i-th point has x=2, then it cannot be near the j-th point, and since the points are sorted, all points after the i-th can be skipped (i.e. not checked in a pair with the j-th point).
If the points are sparse, this should significantly speed up the function, and the complexity becomes O(n*log(n)) because of the sorting.
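A rough sketch of that sliding-window idea (counts_within is a hypothetical helper, not code from the question; it assumes the 0.15 radius used above):

import numpy as np

def counts_within(points, r=0.15):
    pts = np.asarray(points, dtype=float)
    order = np.argsort(pts[:, 0])
    pts = pts[order]                      # sorted by x coordinate
    counts = np.zeros(len(pts), dtype=int)
    start = 0
    for j in range(len(pts)):
        # points whose x coordinate is more than r behind can never be neighbours of j
        while pts[j, 0] - pts[start, 0] > r:
            start += 1
        for i in range(start, j):         # each pair is checked only once (i < j)
            if np.linalg.norm(pts[j] - pts[i]) < r:
                counts[j] += 1
                counts[i] += 1
    out = np.empty_like(counts)
    out[order] = counts                   # restore the original point order
    return out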
In the if, instead of taking the sqrt and comparing it with 0.15, compare the squared distance directly with the square of 0.15, which is 0.0225. The result will be the same, and since sqrt is an expensive operation, not using it will save time.
if (points_list[i][0] - points_list[j][0])**2 + (points_list[i][1] - points_list[j][1])**2 + (points_list[i][2] - points_list[j][2])**2 < 0.0225:

How do I find the closest point for each point in a data set while keeping precision?

This is my data set:
https://pastebin.com/SsuKP2eH
I'm trying to find the nearest point for all points in the data set. These points are latitude and longitude on the Earth's surface. Of course, the nearest point cannot be the same point.
I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.
I expected to get an array full of distances, but instead, I got an array full of zeroes with some numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbours, on my data set and got the same result. So, I did some debugging and reduced the range of random numbers he used, making it closer to my own data set.
import numpy as np
import scipy.spatial as spatial
import pandas as pd

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y', 'z']])
    distance, index = tree.query(data[['x', 'y', 'z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index

np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 'Longitude':np.random.uniform(0, 360,size=N)})
data = pd.DataFrame({'Latitude':np.random.uniform(-49.19,49.32,size=N), 'Longitude':np.random.uniform(-123.02, -123.23,size=N)})
result = using_kdtree(data)
I found that the resulting distances array had small values, close to 0. This makes me believe that the result array for my data set is full of zeroes because the differences between points are very small. Somewhere, the KD tree/nearest neighbours loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method can keep precision, but it is far too slow with 7200 points to iterate through.
I think what's happening is that k=2 in

distance, index = tree.query(data[['x', 'y','z']], k=2)

tells KDTree you want the closest two points to each point. The closest one is obviously the point itself, with a distance of zero. Also, if you print index you will see an Nx2 array in which each row starts with its own row number: this is KDTree's way of saying that the closest point to the i-th point is the i-th point itself.
Obviously that is not useful, and you probably want only the 2nd closest point. Fortunately, I found this in the documentation of the k parameter of query:
Either the number of nearest neighbors to return, or a list of the
k-th nearest neighbors to return, starting from 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
So
distance, index = tree.query(data[['x', 'y','z']], k=[2])
gives only the distance and index of the 2nd closest point.
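A tiny illustrative example (made-up points, not the data set from the question) showing the difference:

import numpy as np
from scipy.spatial import KDTree

pts = np.array([[0.0, 0.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 2.5]])
tree = KDTree(pts)

d2, i2 = tree.query(pts, k=2)      # first column is always the point itself, distance 0
d, i = tree.query(pts, k=[2])      # only the 2nd nearest neighbour
print(d.ravel())                   # [1.  1.  1.5] -- nearest *other* point for each row
print(i.ravel())                   # [1 0 1]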

Speed up computation for Distance Transform on Image in Python

I would like to find the distance transform of a binary image in the fastest way possible without using the scipy function distance_transform_edt(). The image is 256 by 256. The reason I don't want to use scipy is that using it is difficult in tensorflow: every time I want to use this package I need to start a new session, and this takes a lot of time. So I would like to make a custom function that only utilizes numpy.
My approach is as follows: find the coordinates of all the ones and all the zeros in the image. Find the Euclidean distance between each of the zero pixels (a) and the one pixels (b); the value at each (a) position is then the minimum distance to a (b) pixel. I do this for each 0 pixel. The resultant image has the same dimensions as the original binary map. My attempt at doing this is shown below.
I tried to do this as fast as possible using no loops and only vectorization, but my function still can't work as fast as the scipy package can. When I timed the code it looks like the assignment to the variable "a" is taking the longest time, but I do not know if there is a way to speed this up.
If anyone has any other suggestions for different algorithms to solve this problem of distance transforms or can direct me to other implementations in python, it would be very appreciated.
import numpy as np
import matplotlib.pyplot as plt

def get_dst_transform_img(og):  # og is a numpy array of the original image
    ones_loc = np.where(og == 1)
    ones = np.asarray(ones_loc).T        # coords of all ones in og
    zeros_loc = np.where(og == 0)
    zeros = np.asarray(zeros_loc).T      # coords of all zeros in og

    a = -2 * np.dot(zeros, ones.T)
    b = np.sum(np.square(ones), axis=1)
    c = np.sum(np.square(zeros), axis=1)[:, np.newaxis]
    dists = a + b + c
    dists = np.sqrt(dists.min(axis=1))   # min dist of each zero pixel to a one pixel

    x = og.shape[0]
    y = og.shape[1]
    dist_transform = np.zeros((x, y))
    dist_transform[zeros[:, 0], zeros[:, 1]] = dists

    plt.figure()
    plt.imshow(dist_transform)
The implementation in the OP is a brute-force approach to the distance transform. This algorithm is O(n²), as it computes the distance from each background pixel to each foreground pixel. Furthermore, because of the way it is vectorized, it requires a lot of memory. On my computer it couldn't compute the distance transform of a 256x256 image without thrashing. Many other algorithms are described in the literature; below I'll discuss two O(n) algorithms.
Note: Typically, the distance transform is computed for object pixels (value 1) to the nearest background pixel (value 0). The code in the OP does the reverse, and so the code I've pasted below follows OP's convention, not the more common convention.
The easiest to implement, IMO, is the chamfer distance algorithm. This is a recursive algorithm that does two passes over the image: one left to right and top to bottom, and one right to left and bottom to top. In each pass, the distance computed for previous pixels is propagated. This algorithm can be implemented using integer distances or floating-point distances between neighbors. The latter yields smaller errors, of course. But in both cases the errors can be reduced significantly by increasing the number of neighbors queried in this propagation. The algorithm is older, but G. Borgefors analyzed it and proposed suitable neighbor distances (G. Borgefors, Distance Transformations in Digital Images, Computer Vision, Graphics, and Image Processing 34:344-371, 1986).
Here is an implementation using 3-4 distance (distance to edge-connected neighbors is 3, distance to vertex-connected neighbors is 4):
def chamfer_distance(img):
    w, h = img.shape
    dt = np.zeros((w,h), np.uint32)
    # Forward pass
    x = 0
    y = 0
    if img[x,y] == 0:
        dt[x,y] = 65535 # some large value
    for x in range(1, w):
        if img[x,y] == 0:
            dt[x,y] = 3 + dt[x-1,y]
    for y in range(1, h):
        x = 0
        if img[x,y] == 0:
            dt[x,y] = min(3 + dt[x,y-1], 4 + dt[x+1,y-1])
        for x in range(1, w-1):
            if img[x,y] == 0:
                dt[x,y] = min(4 + dt[x-1,y-1], 3 + dt[x,y-1], 4 + dt[x+1,y-1], 3 + dt[x-1,y])
        x = w-1
        if img[x,y] == 0:
            dt[x,y] = min(4 + dt[x-1,y-1], 3 + dt[x,y-1], 3 + dt[x-1,y])
    # Backward pass
    for x in range(w-2, -1, -1):
        y = h-1
        if img[x,y] == 0:
            dt[x,y] = min(dt[x,y], 3 + dt[x+1,y])
    for y in range(h-2, -1, -1):
        x = w-1
        if img[x,y] == 0:
            dt[x,y] = min(dt[x,y], 3 + dt[x,y+1], 4 + dt[x-1,y+1])
        for x in range(1, w-1):
            if img[x,y] == 0:
                dt[x,y] = min(dt[x,y], 4 + dt[x+1,y+1], 3 + dt[x,y+1], 4 + dt[x-1,y+1], 3 + dt[x+1,y])
        x = 0
        if img[x,y] == 0:
            dt[x,y] = min(dt[x,y], 4 + dt[x+1,y+1], 3 + dt[x,y+1], 3 + dt[x+1,y])
    return dt
Note that a lot of the complication here is to avoid indexing out of bounds, but still computing distances all the way to the edges of the image. If we simply skip the pixels around the border of the image, the code becomes much simpler.
Because it is a recursive algorithm, it is not possible to vectorize its implementation, so the Python code will not be very efficient. But programmed in C or the like, it yields a very fast algorithm that gives a fairly good approximation to the Euclidean distance.
OpenCV's cv.distanceTransform implements this algorithm.
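For example, a chamfer approximation can be obtained along these lines (a small sketch with a made-up binary image; note that cv.distanceTransform measures the distance from each non-zero pixel to the nearest zero pixel, so the input is inverted here to follow the OP's convention):

import numpy as np
import cv2

og = np.zeros((256, 256), np.uint8)   # made-up binary image for illustration
og[100:130, 80:90] = 1

# 3x3 mask -> chamfer approximation; cv2.DIST_MASK_PRECISE would give the exact EDT
dt = cv2.distanceTransform(1 - og, cv2.DIST_L2, 3)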
Another very efficient algorithm computes the square of the distance transform. The square distance is separable (i.e. can be computed independently for each axis and added). This leads to an algorithm that is easy to parallelize. For each image row, the algorithm does a forward and a backward pass. For each column in the result, the algorithm then does another forward and backward pass. This process leads to an exact Euclidean distance transform.
This algorithm was first proposed by R. van den Boomgaard in his Ph.D. thesis in 1992. Unfortunately this went unnoticed. The algorithm was then again proposed by A. Meijster, J.B.T.M. Roerdink and W.H. Hesselink (A General Algorithm for Computing Distance Transforms in Linear Time, Mathematical Morphology and its Applications to Image and Signal Processing, pp 331-340, 2002), and again by P. Felzenszwalb and D. Huttenlocher (Distance transforms of sampled functions, Technical report, Cornell University, 2004).
This is the most efficient algorithm known, in part because it is the only one that can be easily and efficiently parallelized (computation on each image row, and later on each image column, is independent of other rows/columns).
You can find implementations of this algorithm online; for example, OpenCV's cv.distanceTransform implements it, and DIPlib's dip.EuclideanDistanceTransform does too.
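As a rough, unvectorized NumPy sketch of the idea (an illustration only; dt_1d and edt are hypothetical helper names, following Felzenszwalb and Huttenlocher's 1-D lower-envelope pass and the OP's convention of measuring from each 0 pixel to the nearest 1 pixel):

import numpy as np

INF = 1e20

def dt_1d(f):
    """Exact 1-D squared distance transform: d[p] = min_q (p - q)^2 + f[q]."""
    n = len(f)
    d = np.zeros(n)
    v = np.zeros(n, dtype=int)   # locations of parabolas in the lower envelope
    z = np.zeros(n + 1)          # boundaries between parabolas
    k = 0
    z[0], z[1] = -INF, +INF
    for q in range(1, n):
        s = ((f[q] + q*q) - (f[v[k]] + v[k]*v[k])) / (2*q - 2*v[k])
        while s <= z[k]:
            k -= 1
            s = ((f[q] + q*q) - (f[v[k]] + v[k]*v[k])) / (2*q - 2*v[k])
        k += 1
        v[k] = q
        z[k] = s
        z[k+1] = +INF
    k = 0
    for q in range(n):
        while z[k+1] < q:
            k += 1
        d[q] = (q - v[k])**2 + f[v[k]]
    return d

def edt(img):
    """Exact Euclidean distance transform of the 0 pixels to the nearest 1 pixel."""
    f = np.where(img == 1, 0.0, INF)        # 0 at "object" pixels, a large value elsewhere
    g = np.apply_along_axis(dt_1d, 0, f)    # first pass along columns
    h = np.apply_along_axis(dt_1d, 1, g)    # second pass along rows; squared distances add up
    return np.sqrt(h)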

k-Nearest Neighbors rundown

I'm trying to follow an example on k-Nearest Neighbors and I'm not sure about the numpy command syntax. I'm supposed to be doing a matrix-wise distance calculation and the code given is
import operator
from numpy import array, tile

def classify(inputVector, trainingData, labels, k):
    dataSetSize = trainingData.shape[0]
    diffMat = tile(inputVector, (dataSetSize, 1)) - trainingData   # one copy of the input per training row
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}   # vote tally (missing in the snippet as posted)
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)  # Python 2; use .items() on Python 3
    return sortedClassCount[0][0]

def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels
My question is: how does sqDistances**0.5 amount to the distance equation ((A[0]-B[0])^2 + (A[1]-B[1])^2)^(1/2)? I don't follow how tile influences it, specifically how the matrix is made from tile(inputVector, (dataSetSize, 1)) - trainingData.
I hope the following will explain the working.
Numpy tile: https://docs.scipy.org/doc/numpy/reference/generated/numpy.tile.html
Using this function, you are creating a matrix from the input vector with the same shape as the training data. From this matrix you subtract the training data, which gives the element-wise differences such as test[0] - train[0].
Then each element is squared with diffMat**2 and summed along axis=1 (https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html). This results in expressions like (test[0] - train[0])^2 + (test[1] - train[1])^2.
Finally, taking sqDistances**0.5 gives the Euclidean distance.
To calculate Euclidean distance, this might be helpful
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html#scipy.spatial.distance.euclidean
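A small worked example (a made-up query point, using the createDataSet above) shows the intermediate arrays:

from numpy import array, tile

group, labels = createDataSet()
inputVector = array([0, 0.2])

diffMat = tile(inputVector, (4, 1)) - group
# [[-1.  -0.9]
#  [-1.  -0.8]
#  [ 0.   0.2]
#  [ 0.   0.1]]
distances = ((diffMat**2).sum(axis=1))**0.5
# [1.345  1.281  0.2  0.1]  -- Euclidean distance from the input to each training point
# with k=3 the nearest labels are ['B', 'B', 'A'], so the vote is 'B'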

fastest way to find euclidean distance in python

I have 2 sets of 2D points (A and B); each set has about 540 points. I need to find the points in set B that are farther than a defined distance alpha from all the points in A.
I have a solution, but is not fast enough
# find the closest point of each of the new points to the target set
def find_closest_point(self, A, B):
    outliers = []
    for i in range(len(B)):
        # find all the euclidean distances
        temp = distance.cdist([B[i]], A)
        minimum = numpy.min(temp)
        # if the point is too far away from the rest, it is considered an outlier
        if minimum > self.alpha:
            outliers.append([i, B[i]])
        else:
            continue
    return outliers
I am using Python 2.7 with numpy and scipy. Is there another way to do this so that I may gain a considerable increase in speed?
Thanks in advance for the answers
>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> A = np.random.randn(540, 2)
>>> B = np.random.randn(540, 2)
>>> alpha = 1.
>>> ind = np.all(cdist(A, B) > alpha, axis=0)
>>> outliers = B[ind]
gives you the points you want.
If you have a very large set of points, you could calculate the x & y bounds of A, add & subtract alpha, and then eliminate from specific consideration all the points in B that lie outside of that boundary.
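That pre-filtering idea could look roughly like this (find_outliers is a hypothetical helper, not code from the question; points outside the expanded bounding box of A are guaranteed outliers, so only the remaining points need the full distance check):

import numpy as np
from scipy.spatial.distance import cdist

def find_outliers(A, B, alpha):
    lo = A.min(axis=0) - alpha
    hi = A.max(axis=0) + alpha
    inside = np.all((B >= lo) & (B <= hi), axis=1)   # candidates inside the expanded box
    outliers = ~inside                               # everything outside is already an outlier
    outliers[inside] = np.all(cdist(A, B[inside]) > alpha, axis=0)
    return B[outliers]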
