I need a good algorithm for calculating the point that is closest to a collection of lines in Python, preferably using least squares. I found this post with a Python implementation that doesn't work:
Finding the centre of multiple lines using least squares approach in Python
And I found this resource in Matlab that everyone seems to like... but I'm not sure how to convert it to Python:
https://www.mathworks.com/matlabcentral/fileexchange/37192-intersection-point-of-lines-in-3d-space
I find it hard to believe that someone hasn't already done this... surely this is part of numpy or a standard package, right? I'm probably just not searching for the right terms - but I haven't been able to find it yet. I'd be fine with defining lines by two points each or by a point and a direction. Any help would be greatly appreciated!
Here's an example set of points that I'm working with:
initial XYZ points for the first set of lines
array([[-7.07107037, 7.07106748, 1. ],
[-7.34818339, 6.78264559, 1. ],
[-7.61352972, 6.48335745, 1. ],
[-7.8667115 , 6.17372055, 1. ],
[-8.1072994 , 5.85420065, 1. ]])
the angles that belong to the first set of lines
[-44.504854, -42.029223, -41.278573, -37.145774, -34.097022]
initial XYZ points for the second set of lines
array([[ 0., -20. , 1. ],
[ 7.99789129e-01, -19.9839984, 1. ],
[ 1.59830153e+00, -19.9360366, 1. ],
[ 2.39423914e+00, -19.8561769, 1. ],
[ 3.18637019e+00, -19.7445510, 1. ]])
the angles that belong to the second set of lines
[89.13244, 92.39087, 94.86425, 98.91849, 99.83488]
The solution should be the origin or very near it (the data is just a little noisy, which is why the lines don't perfectly intersect at a single point).
Here's a numpy solution using the method described in the paper linked in the docstring:
def intersect(P0, P1):
    """P0 and P1 are NxD arrays defining N lines.
    D is the dimension of the space. This function
    returns the least squares intersection of the N
    lines from the system given by eq. 13 in
    http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf.
    """
    # generate all line direction vectors
    n = (P1-P0)/np.linalg.norm(P1-P0, axis=1)[:, np.newaxis]  # normalized
    # generate the array of all projectors
    projs = np.eye(n.shape[1]) - n[:, :, np.newaxis]*n[:, np.newaxis]  # I - n*n.T
    # see fig. 1
    # generate R matrix and q vector
    R = projs.sum(axis=0)
    q = (projs @ P0[:, :, np.newaxis]).sum(axis=0)
    # solve the least squares problem for the
    # intersection point p: Rp = q
    p = np.linalg.lstsq(R, q, rcond=None)[0]
    return p
This works.
Edit: here is a generator for noisy test data
n = 6
P0 = np.stack([np.array([5,5])+3*np.random.random(size=2) for i in range(n)])
a = np.linspace(0,2*np.pi,n)+np.random.random(size=n)*np.pi/5.0
P1 = np.array([5+5*np.sin(a),5+5*np.cos(a)]).T
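Feeding this test data straight back in is a quick sanity check (the exact point will vary with the random noise):

p = intersect(P0, P1)
print(p)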
If this wikipedia equation carries any weight (in essence: the point p nearest to a set of lines, each given by a point aᵢ and unit direction nᵢ, solves (Σᵢ (I − nᵢnᵢᵀ)) p = Σᵢ (I − nᵢnᵢᵀ) aᵢ), then you can use:
def nearest_intersection(points, dirs):
    """
    :param points: (N, 3) array of points on the lines
    :param dirs: (N, 3) array of unit direction vectors
    :returns: (3,) array of intersection point
    """
    dirs_mat = dirs[:, :, np.newaxis] @ dirs[:, np.newaxis, :]  # outer products n n^T
    points_mat = points[:, :, np.newaxis]
    I = np.eye(3)
    return np.linalg.lstsq(
        (I - dirs_mat).sum(axis=0),
        ((I - dirs_mat) @ points_mat).sum(axis=0),
        rcond=None
    )[0]
If you want help deriving / checking that equation from first principles, then math.stackexchange.com would be a better place to ask.
As for "surely this is part of numpy": note that numpy already gives you enough tools to express this very concisely.
Here's the final code that I ended up using. Thanks to kevinkayaks and everyone else who responded! Your help is very much appreciated!!!
The first half of this function simply converts the two collections of points and angles to direction vectors. I believe the rest of it is basically the same as what Eric and Eugene proposed. I just happened to have success first with Kevin's and ran with it until it was an end-to-end solution for me.
import numpy as np
def LS_intersect(p0, a0, p1, a1):
    """
    :param p0 : Nx2 (x,y) position coordinates
    :param p1 : Nx2 (x,y) position coordinates
    :param a0 : angles in degrees for each point in p0
    :param a1 : angles in degrees for each point in p1
    :return: least squares intersection point of N lines from eq. 13 in
             http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf
    """
    ang = np.concatenate((a0, a1))  # create list of angles
    # create direction vectors with magnitude = 1
    n = []
    for a in ang:
        n.append([np.cos(np.radians(a)), np.sin(np.radians(a))])
    pos = np.concatenate((p0[:, 0:2], p1[:, 0:2]))  # create list of points
    n = np.array(n)
    # generate the array of all projectors
    nnT = np.array([np.outer(nn, nn) for nn in n])
    ImnnT = np.eye(len(pos[0])) - nnT  # orthocomplement projectors to n
    # now generate R matrix and q vector
    R = np.sum(ImnnT, axis=0)
    q = np.sum(np.array([np.dot(m, x) for m, x in zip(ImnnT, pos)]), axis=0)
    # and solve the least squares problem for the intersection point p
    return np.linalg.lstsq(R, q, rcond=None)[0]
#sample data
pa = np.array([[-7.07106638, 7.07106145, 1. ],
[-7.34817263, 6.78264524, 1. ],
[-7.61354115, 6.48336347, 1. ],
[-7.86671133, 6.17371816, 1. ],
[-8.10730426, 5.85419995, 1. ]])
paa = [-44.504854321138524, -42.02922380123842, -41.27857390748773, -37.145774853341386, -34.097022454778674]
pb = np.array([[-8.98220431e-07, -1.99999962e+01, 1.00000000e+00],
[ 7.99789129e-01, -1.99839984e+01, 1.00000000e+00],
[ 1.59830153e+00, -1.99360366e+01, 1.00000000e+00],
[ 2.39423914e+00, -1.98561769e+01, 1.00000000e+00],
[ 3.18637019e+00, -1.97445510e+01, 1.00000000e+00]])
pba = [88.71923357743934, 92.55801427272372, 95.3038321024299, 96.50212060095349, 100.24177145619092]
print("Should return (-0.03211692, 0.14173216)")
solution = LS_intersect(pa,paa,pb,pba)
print(solution)
I have some code where I record a list of vectors over a trajectory. For each vector I have the N-dimensional point from which it originates and the vector itself.
I have all of these ordered in an array from start to finish, and I would like to place these N-dimensional vectors at their points of origin in the space so that I can calculate the divergence.
Example origin points:
[[-0.03194652 -0.02481244 0.02337171 -0.04088087]
[-0.03244277 0.16996671 0.02255409 -0.32609914]
[-0.02904343 0.3647604 0.01603211 -0.61158502]
[-0.02174823 0.16941809 0.00380041 -0.31389597]
[-0.01835986 -0.02575779 -0.00247751 -0.02001695]
[-0.01887502 0.1693996 -0.00287785 -0.31348053]
[-0.01548703 -0.02568124 -0.00914746 -0.02170657]
[-0.01600065 -0.22067082 -0.00958159 0.26807625]
[-0.02041407 -0.02541345 -0.00422007 -0.02761331]
[-0.02092234 0.16976877 -0.00477233 -0.32162472]]
Example vectors at each point
[[-0.00049625 0.19477914 -0.00081762 -0.28521826]
[ 0.00339933 0.19479369 -0.00652198 -0.28548588]
[ 0.00729521 -0.19534231 -0.0122317 0.29768905]
[ 0.00338836 -0.19517588 -0.00627792 0.29387903]
[-0.00051516 0.19515739 -0.00040034 -0.29346358]
[ 0.00338799 -0.19508084 -0.00626961 0.29177396]
[-0.00051362 -0.19498958 -0.00043413 0.28978282]
[-0.00441342 0.19525737 0.00536152 -0.29568956]
[-0.00050827 0.19518221 -0.00055227 -0.29401141]
[ 0.00339538 -0.19505367 -0.00643249 0.29117411]]
I attempted to use np.meshgrid over the initial points for each dimension, but I had to use sparse=True to save memory.
I am a bit stuck here, can anyone help?
Edit:
For my specific example I have a particle x in 4-dimensional space.
I have a collection of trajectories in a list for it moving through this N-dimensional space, each of a set length of 10 time steps. Each entry in a trajectory is a pair (t_start, t_end).
def collect_states(trajectories, timesteps):
    # (wrapped in a function so the return statement below is valid)
    initial_array = []
    vector_array = []
    for trajectory in trajectories:
        for t in range(timesteps):
            initial_state = trajectory[t][0]
            # displacement from the start to the end of this step
            vector_at_state = trajectory[t][1] - initial_state
            initial_array.append(initial_state)
            vector_array.append(vector_at_state)
    return np.array(initial_array), np.array(vector_array)
I then want to plot these vectors at the points they began from to generate a "sample" vector field from what I can observe, and then use this function:
def divergence(f, h):
    num_dims = len(f)
    return np.ufunc.reduce(np.add, [np.gradient(f[i], h[i], axis=i) for i in range(num_dims)])
To calculate the divergence over the sampled trajectories.
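For concreteness, here is a minimal sketch (a toy 2D example; the same pattern should extend to 4 dimensions) of the gridding step, using scipy.interpolate.griddata to resample the scattered vectors onto a regular grid that the divergence function above can consume:

import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)
origins = rng.uniform(-1, 1, size=(400, 2))                  # scattered origin points
vectors = np.stack([-origins[:, 1], origins[:, 0]], axis=1)  # toy field with divergence 0

axes = [np.linspace(-1, 1, 50)] * 2                 # grid coordinates per dimension
grid = np.meshgrid(*axes, indexing='ij')
# one gridded array per vector component; values outside the convex hull are NaN
f = [griddata(origins, vectors[:, dim], tuple(grid), method='linear') for dim in range(2)]

div = divergence(f, axes)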
Note: I asked this question before but it was closed as a duplicate; however, I, along with several others, believe it was unduly closed (I explain why in an edit in my original post), so I would like to re-ask the question here.
Does anyone know of a Python library that can interpolate between two lines? For example, given the two solid lines below, I would like to produce the dashed line in the middle. In other words, I'd like to get the centreline. The input is just two numpy arrays of coordinates, of size N x 2 and M x 2 respectively.
Furthermore, I'd like to know if someone has written a function for this in some optimized Python library, though optimization isn't strictly necessary.
Here is an example of two lines that I might have. You can assume they do not overlap, and that a given x/y can have multiple y/x coordinates.
array([[ 1233.87375018, 1230.07095987],
[ 1237.63559365, 1253.90749041],
[ 1240.87500801, 1264.43925132],
[ 1245.30875975, 1274.63795396],
[ 1256.1449357 , 1294.48254424],
[ 1264.33600095, 1304.47893299],
[ 1273.38192911, 1313.71468591],
[ 1283.12411536, 1322.35942538],
[ 1293.2559388 , 1330.55873344],
[ 1309.4817002 , 1342.53074698],
[ 1325.7074616 , 1354.50276051],
[ 1341.93322301, 1366.47477405],
[ 1358.15898441, 1378.44678759],
[ 1394.38474581, 1390.41880113]])
array([[ 1152.27115094, 1281.52899302],
[ 1155.53345506, 1295.30515742],
[ 1163.56506781, 1318.41642169],
[ 1168.03497425, 1330.03181319],
[ 1173.26135672, 1341.30559949],
[ 1184.07110925, 1356.54121651],
[ 1194.88086178, 1371.77683353],
[ 1202.58908737, 1381.41765447],
[ 1210.72465255, 1390.65097106],
[ 1227.81309742, 1403.2904646 ],
[ 1244.90154229, 1415.92995815],
[ 1261.98998716, 1428.56945169],
[ 1275.89219696, 1438.21626352],
[ 1289.79440676, 1447.86307535],
[ 1303.69661656, 1457.50988719],
[ 1323.80994319, 1470.41028655],
[ 1343.92326983, 1488.31068591],
[ 1354.31738934, 1499.33260989],
[ 1374.48879779, 1516.93734053],
[ 1394.66020624, 1534.54207116]])
Visualizing this we have:
So my attempt at this has been using the skeletonize function in the skimage.morphology library by first rasterizing the coordinates into a filled in polygon. However, I get branching at the ends like this:
First of all, pardon the overkill; I had fun with your question. If the description is too long, feel free to skip to the bottom, I defined a function that does everything I describe.
Your problem would be relatively straightforward if your arrays were the same length. In that case, all you would have to do is find the average between the corresponding x values in each array, and the corresponding y values in each array.
So what we can do is create arrays of the same length, that are more or less good estimates of your original arrays. We can do this by fitting a polynomial to the arrays you have. As noted in comments and other answers, the midline of your original arrays is not specifically defined, so a good estimate should fulfill your needs.
Note: In all of these examples, I've gone ahead and named the two arrays that you posted a1 and a2.
Step one: Create new arrays that estimate your old lines
Looking at the data you posted:
These aren't particularly complicated functions, it looks like a 3rd degree polynomial would fit them pretty well. We can create those using numpy:
import numpy as np
# Find the range of x values in a1
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
# Create an evenly spaced array that ranges from the minimum to the maximum
# I used 100 elements, but you can use more or fewer.
# This will be used as your new x coordinates
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
# Fit a 3rd degree polynomial to your data
a1_coefs = np.polyfit(a1[:,0],a1[:,1], 3)
# Get your new y coordinates from the coefficients of the above polynomial
new_a1_y = np.polyval(a1_coefs, new_a1_x)
# Repeat for array 2:
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
a2_coefs = np.polyfit(a2[:,0],a2[:,1], 3)
new_a2_y = np.polyval(a2_coefs, new_a2_x)
The result:
That's not so bad! If you have more complicated functions, you'll have to fit a higher degree polynomial, or find some other adequate function to fit to your data.
Now, you've got two sets of arrays of the same length (I chose a length of 100, you can do more or less depending on how smooth you want your midpoint line to be). These sets represent the x and y coordinates of the estimates of your original arrays. In the example above, I named these new_a1_x, new_a1_y, new_a2_x and new_a2_y.
Step two: calculate the average between each x and each y in your new arrays
Then, we want to find the average x and average y value for each of our estimate arrays. Just use np.mean:
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
midx and midy now represent the midpoint between our 2 estimate arrays. Now, just plot your original (not estimate) arrays, alongside your midpoint array:
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
And voilà:
This method still works with more complex, noisy data (but you have to fit the function thoughtfully):
As a function:
I've put the above code in a function, so you can use it easily. It returns an array of your estimated midpoints, in the format you had your original arrays in.
The arguments: a1 and a2 are your 2 input arrays, poly_deg is the degree polynomial you want to fit, n_points is the number of points you want in your midpoint array, and plot is a boolean, whether you want to plot it or not.
import matplotlib.pyplot as plt
import numpy as np
def interpolate(a1, a2, poly_deg=3, n_points=100, plot=True):
    min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
    new_a1_x = np.linspace(min_a1_x, max_a1_x, n_points)
    a1_coefs = np.polyfit(a1[:,0], a1[:,1], poly_deg)
    new_a1_y = np.polyval(a1_coefs, new_a1_x)
    min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
    new_a2_x = np.linspace(min_a2_x, max_a2_x, n_points)
    a2_coefs = np.polyfit(a2[:,0], a2[:,1], poly_deg)
    new_a2_y = np.polyval(a2_coefs, new_a2_x)
    midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(n_points)]
    midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(n_points)]
    if plot:
        plt.plot(a1[:,0], a1[:,1], c='black')
        plt.plot(a2[:,0], a2[:,1], c='black')
        plt.plot(midx, midy, '--', c='black')
        plt.show()
    return np.array([[x, y] for x, y in zip(midx, midy)])
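For example, with the two arrays from the question named a1 and a2:

midline = interpolate(a1, a2, poly_deg=3, n_points=100, plot=True)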
[EDIT]:
I was thinking back on this question, and I overlooked a simpler way to do this, by "densifying" both arrays to the same number of points using np.interp. This method follows the same basic idea as the line-fitting method above, but instead of approximating lines using polyfit / polyval, it just densifies:
min_a1_x, max_a1_x = min(a1[:,0]), max(a1[:,0])
min_a2_x, max_a2_x = min(a2[:,0]), max(a2[:,0])
new_a1_x = np.linspace(min_a1_x, max_a1_x, 100)
new_a2_x = np.linspace(min_a2_x, max_a2_x, 100)
new_a1_y = np.interp(new_a1_x, a1[:,0], a1[:,1])
new_a2_y = np.interp(new_a2_x, a2[:,0], a2[:,1])
midx = [np.mean([new_a1_x[i], new_a2_x[i]]) for i in range(100)]
midy = [np.mean([new_a1_y[i], new_a2_y[i]]) for i in range(100)]
plt.plot(a1[:,0], a1[:,1],c='black')
plt.plot(a2[:,0], a2[:,1],c='black')
plt.plot(midx, midy, '--', c='black')
plt.show()
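One caveat worth noting: np.interp expects the x coordinates in its second argument to be increasing, so this densifying trick, like the polynomial fit above, assumes each curve is a function of x.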
The "line between two lines" is not so well defined. You can obtain a decent though simple solution by triangulating between the two curves (you can triangulate by progressing from vertex to vertex, choosing the diagonals that produce the less skewed triangle).
Then the interpolated curve joins the middles of the sides.
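A rough sketch of that idea (my own code, not a reference implementation): walk along both polylines, advancing on whichever side keeps the connecting diagonal shortest (a cheap stand-in for the "least skewed triangle" rule), and emit the midpoint of each diagonal:

import numpy as np

def midline_by_triangulation(a, b):
    # a, b: (N, 2) and (M, 2) polylines running in the same direction
    i, j = 0, 0
    mids = [(a[0] + b[0]) / 2.0]
    while i < len(a) - 1 or j < len(b) - 1:
        # length of the new diagonal if we advance on a, or on b
        step_a = np.inf if i == len(a) - 1 else np.linalg.norm(a[i + 1] - b[j])
        step_b = np.inf if j == len(b) - 1 else np.linalg.norm(a[i] - b[j + 1])
        if step_a <= step_b:
            i += 1
        else:
            j += 1
        mids.append((a[i] + b[j]) / 2.0)
    return np.array(mids)

midline_by_triangulation(a1, a2) then returns a polyline of len(a1) + len(a2) - 1 midpoints.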
I work with rivers, so this is a common problem. One of my solutions is exactly like the one you showed in your question--i.e. skeletonize the blob. You see that the boundaries have problems, so what I've done that seems to work well is to simply mirror the boundaries. For this approach to work, the blob must not intersect the corners of the image.
You can find my implementation in RivGraph; this particular algorithm is in rivers/river_utils.py called "mask_to_centerline".
Here's an example output showing how the ends of the centerline extend to the desired edge of the object:
sacuL's solution almost worked for me, but I needed to aggregate more than just two curves.
Here is my generalization for sacuL's solution:
import numpy as np
import matplotlib.pyplot as plt

def interp(*axis_list):
    min_max_xs = [(min(axis[:,0]), max(axis[:,0])) for axis in axis_list]
    new_axis_xs = [np.linspace(min_x, max_x, 100) for min_x, max_x in min_max_xs]
    new_axis_ys = [np.interp(new_x_axis, axis[:,0], axis[:,1])
                   for axis, new_x_axis in zip(axis_list, new_axis_xs)]
    midx = [np.mean([new_axis_xs[axis_idx][i] for axis_idx in range(len(axis_list))])
            for i in range(100)]
    midy = [np.mean([new_axis_ys[axis_idx][i] for axis_idx in range(len(axis_list))])
            for i in range(100)]
    for axis in axis_list:
        plt.plot(axis[:,0], axis[:,1], c='black')
    plt.plot(midx, midy, '--', c='black')
    plt.show()
If we now run an example:
a1 = np.array([[x, x**2+5*(x%4)] for x in range(10)])
a2 = np.array([[x-0.5, x**2+6*(x%3)] for x in range(10)])
a3 = np.array([[x+0.2, x**2+7*(x%2)] for x in range(10)])
interp(a1, a2, a3)
we get the plot:
I did find a way to calculate the center coordinate of a cluster of points. However, my method is quite slow when the number of initial coordinates is increased (I have about 100 000 coordinates).
The bottleneck is the for-loop in the code. I tried to remove it by using np.apply_along_axis, but discovered that this is nothing more than a hidden python-loop.
Is it possible to detect and average out variously sized clusters of too-close points in a vectorized way?
import numpy as np
from scipy.spatial import cKDTree
np.random.seed(7)
max_distance=1
#Example points
points = np.array([[1,1],[1,2],[2,1],[3,3],[3,4],[5,5],[8,8],[10,10],[8,6],[6,5]])
#Create trees and detect the points and neighbours which needs to be fused
tree = cKDTree(points)
rows_to_fuse = np.array(list(tree.query_pairs(r=max_distance))).astype('uint64')
#Split the points and neighbours into two groups
points_to_fuse = points[rows_to_fuse[:,0], :2]
neighbours = points[rows_to_fuse[:,1], :2]
#get unique points_to_fuse
nonduplicate_points = np.ascontiguousarray(points_to_fuse)
unique_points = np.unique(
    nonduplicate_points.view([('', nonduplicate_points.dtype)] * nonduplicate_points.shape[1]))
unique_points = unique_points.view(nonduplicate_points.dtype).reshape(
    (unique_points.shape[0], nonduplicate_points.shape[1]))
#Empty array to store fused points
fused_points = np.empty((len(unique_points), 2))
####BOTTLENECK LOOP####
for i, point in enumerate(unique_points):
    #Detect all locations where a unique point occurs
    locs = np.where(np.logical_and(points_to_fuse[:,0] == point[0],
                                   points_to_fuse[:,1] == point[1]))
    #Select all neighbours at these locations and take the average
    fused_points[i,:] = (np.average(np.hstack((point[0], neighbours[locs,0][0]))),
                         np.average(np.hstack((point[1], neighbours[locs,1][0]))))
#Get original points that didn't need to be fused
points_without_fuse = np.delete(points, np.unique(rows_to_fuse.reshape((1, -1))), axis=0)
#Stack result
points = np.row_stack((points_without_fuse, fused_points))
Expected output
>>> points
array([[ 8. , 8. ],
[ 10. , 10. ],
[ 8. , 6. ],
[ 1.33333333, 1.33333333],
[ 3. , 3.5 ],
[ 5.5 , 5. ]])
EDIT 1: Example of 1 loop with desired result
Step 1: Create variables for the loop
#outside loop
points_to_fuse = np.array([[100,100],[101,101],[100,100]])
neighbours = np.array([[103,105],[109,701],[99,100]])
unique_points = np.array([[100,100],[101,101]])
#inside loop
point = np.array([100,100])
i = 0
Step 2: Detect all locations where a unique point occurs in the points_to_fuse array
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
>>> (array([0, 2], dtype=int64),)
Step 3: Create an array of the point and the neighbouring points at these locations and calculate the average
array_of_points = np.column_stack((np.hstack((point[0],neighbours[locs,0][0])),np.hstack((point[1],neighbours[locs,1][0]))))
>>> array([[100, 100],
[103, 105],
[ 99, 100]])
fused_points[i, :] = np.average(array_of_points, 0)
>>> array([ 100.66666667, 101.66666667])
Loop output after a complete run:
>>> print(fused_points)
>>> array([[ 100.66666667, 101.66666667],
[ 105. , 401. ]])
The bottleneck is not the loop itself, which is necessary since the neighbourhoods do not all have the same size. The pitfall is the points_to_fuse[:,0] == point[0] test inside the loop, which triggers quadratic complexity. You can avoid it by sorting the points by index.
An example of that approach, even if it doesn't solve the whole problem (it runs after the generation of rows_to_fuse):
sorter = np.lexsort(rows_to_fuse.T)
sorted_points = rows_to_fuse[sorter]
uniques, counts = np.unique(sorted_points[:,1], return_counts=True)
indices = counts.cumsum()
neighbourhood = np.split(sorted_points, indices)[:-1]
means = [(points[ne[:,0]].sum(axis=0) + points[ne[0,1]]) / (len(ne) + 1)
         for ne in neighbourhood]  # a simple python loop.
# + manage unfused points.
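To complete the sketch, the unfused points can be re-attached just as in the question (my addition, so treat it as a hint rather than tested code):

unfused = np.delete(points, np.unique(rows_to_fuse.reshape(-1)), axis=0)
fused = np.vstack((unfused, np.array(means)))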
Another improvement is to compute the means with numba if you want to speed up the code, but the complexity is now roughly optimal, I think.
I have been playing around for months with how best to write a program that will analyze multiple tables for similarities in geographical coordinates. I have tried everything from nested for-loops to, currently, a KD-Tree, which seems to be working great. However, I am not sure it is functioning properly when reading in my 3rd dimension, which in this case is defined as Z.
import numpy
from scipy import spatial
import math as ma
def d(a, b):
    # central angle (great-circle distance) between two (lon, lat) points, in radians
    return ma.acos(ma.sin(ma.radians(a[1]))*ma.sin(ma.radians(b[1]))
                   + ma.cos(ma.radians(a[1]))*ma.cos(ma.radians(b[1]))*ma.cos(ma.radians(a[0]-b[0])))
filename1 = "A"
pos1 = numpy.genfromtxt(filename1,
skip_header=1,
usecols=(1, 2))
z1 = numpy.genfromtxt(filename1,
skip_header=1,
usecols=(3))
filename2 = "B"
pos2 = numpy.genfromtxt(filename2,
#skip_header=1,
usecols=(0, 1))
z2 = numpy.genfromtxt(filename2,
#skip_header=1,
usecols=(2))
filename1 = "A"
data1 = numpy.genfromtxt(filename1,
skip_header=1)
#usecols=(0, 1))
filename2 = "B"
data2 = numpy.genfromtxt(filename2,
skip_header=1)
#usecols=(0, 1)
tree1 = spatial.KDTree(pos1)
match = tree1.query(pos2)
#print match
indices_pos1, indices_pos2 = [], []
for idx_pos1 in range(len(pos1)):
    # find indices in pos2 that match this position (idx_pos1)
    matching_indices_pos2 = numpy.where(match[1] == idx_pos1)[0]
    for idx_pos2 in matching_indices_pos2:
        # distance in spherical coordinates
        distance = d(pos1[idx_pos1], pos2[idx_pos2])
        if distance < 0.01 and z1[idx_pos1]-z2[idx_pos2] > 0.001:
            print(pos1[idx_pos1], pos2[idx_pos2], z1[idx_pos1], z2[idx_pos2], distance)
As you can see, I am first treating the (x, y) position as a single unit measured in spherical coordinates. Each element in file1 is compared to each element in file2. The problem lies somewhere in the Z dimension, but I can't seem to crack this issue. When the results are printed out, the Z coordinates are often nowhere near each other. It seems as if my program is entirely ignoring the and condition. Below I have posted a sample of results from my data which shows that the z-values are in fact very far apart.
[ 358.98787832 -3.87297365] [ 358.98667162 -3.82408566] 0.694282 0.5310796 0.000853515096105
[ 358.98787832 -3.87297365] [ 359.00303872 -3.8962745 ] 0.694282 0.5132215 0.000484847441066
[ 358.98787832 -3.87297365] [ 358.99624509 -3.84617685] 0.694282 0.5128636 0.000489860962243
[ 359.0065807 -8.81507801] [ 358.99226267 -8.8451829 ] 0.6865379 0.6675241 0.000580562641945
[ 359.0292886 9.31398903] [ 358.99296163 9.28436493] 0.68445694 0.45485374 0.000811677349685
How the output is structured: [position1 (x,y)] [position2 (x,y)] [Z1] [Z2] distance
As you can see, specifically in the last example, the Z-coordinates are separated by about 0.23, which is way over the 0.001 restriction I typed above.
Any insights you could share would be really wonderful!
As for your original problem, you have a simple problem with the sign. You test if z1-z2 > 0.001, but you probably wanted abs(z1-z2) < 0.001 (notice the < instead of a >).
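i.e. the test in the loop becomes:

if distance < 0.01 and abs(z1[idx_pos1] - z2[idx_pos2]) < 0.001: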
You could have the tree also take the z coordinate into account; then you need to give it the data as (x, y, z) and not only (x, y).
If it doesn't know the z value, it cannot use it.
It should be possible (although the sklearn API might not allow this) to query the tree directly for a window, where you bound the coordinate range and the z range independently. Think of a box that has different extensions in x, y, z. But because z has a different value range, combining these different scales is difficult.
Beware that the k-d-tree does not know about spherical coordinates. A point at +180 degree and one at -180 degree - or one at 0 and one at 360 - are very far for the k-d-tree, but very close by spherical distance. So it will miss some points!
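One hedged sketch of folding z into the tree (my own, not the answerer's code): rescale z so that the z tolerance (0.001) costs the same as the x/y tolerance (0.01), then build and query a 3D tree. The tree still measures straight-line rather than spherical distance, so treat the result as a prefilter and re-check candidates exactly:

import numpy as np
from scipy import spatial

z_scale = 0.01 / 0.001   # ratio of the two tolerances
xyz1 = np.column_stack([pos1, z1 * z_scale])   # pos1, z1, pos2, z2 as in the question
xyz2 = np.column_stack([pos2, z2 * z_scale])
tree = spatial.cKDTree(xyz1)
dist, idx = tree.query(xyz2)   # candidate matches for each row of pos2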
I have two 2d numpy arrays: x_array contains positional information in the x-direction, y_array contains positions in the y-direction.
I then have a long list of x,y points.
For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.
I have naively produced some code which works, based on this question:
Find nearest value in numpy array
i.e.
import time
import numpy
def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array-y_point)**2 + (x_array-x_point)**2
    idy, idx = numpy.where(distance == distance.min())
    return idy[0], idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in range(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array, x_array, points[0,i], points[1,i]))
    return store
# Create some dummy data
y_array = numpy.random.random(10000).reshape(100,100)
x_array = numpy.random.random(10000).reshape(100,100)
points = numpy.random.random(10000).reshape(2,5000)
# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print('Completed in: ', end-start)
I'm doing this over a large dataset and would really like to speed it up a bit.
Can anyone optimize this?
Thanks.
UPDATE: SOLUTION following suggestions by #silvado and #justin (below)
import scipy.spatial
# Shoe-horn existing data for entry into KDTree routines
combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())
def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes
start = time.time()
results2 = do_kdtree(combined_x_y_arrays,points_list)
end = time.time()
print('Completed in: ', end-start)
This code above sped up my code (searching for 5000 points in 100x100 matrices) by 100 times. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave comparable timings to my naive solution, so it is definitely worth using the cKDTree version...
Here is a scipy.spatial.KDTree example
In [1]: from scipy import spatial
In [2]: import numpy as np
In [3]: A = np.random.random((10,2))*100
In [4]: A
Out[4]:
array([[ 68.83402637, 38.07632221],
[ 76.84704074, 24.9395109 ],
[ 16.26715795, 98.52763827],
[ 70.99411985, 67.31740151],
[ 71.72452181, 24.13516764],
[ 17.22707611, 20.65425362],
[ 43.85122458, 21.50624882],
[ 76.71987125, 44.95031274],
[ 63.77341073, 78.87417774],
[ 8.45828909, 30.18426696]])
In [5]: pt = [6, 30] # <-- the point to find
In [6]: A[spatial.KDTree(A).query(pt)[1]] # <-- the nearest point
Out[6]: array([ 8.45828909, 30.18426696])
#how it works!
In [7]: distance,index = spatial.KDTree(A).query(pt)
In [8]: distance # <-- The distances to the nearest neighbors
Out[8]: 2.4651855048258393
In [9]: index # <-- The locations of the neighbors
Out[9]: 9
#then
In [10]: A[index]
Out[10]: array([ 8.45828909, 30.18426696])
scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.
The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).
Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
If you can massage your data into the right format, a fast way to go is to use the methods in scipy.spatial.distance:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
In particular pdist and cdist provide fast ways to calculate pairwise distances.
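For instance, a small illustration (mine, not from the docs) of a one-off nearest-neighbour lookup with cdist:

import numpy as np
from scipy.spatial.distance import cdist

data = np.random.random((1000, 2))
queries = np.random.random((5, 2))
d = cdist(queries, data)        # (5, 1000) matrix of pairwise distances
nearest = d.argmin(axis=1)      # index of the nearest data point for each query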
Search methods have two phases:
build a search structure, e.g. a KDTree, from the npt data points (your x y)
look up nq query points.
Different methods have different build times and different query times. Your choice will depend a lot on npt and nq:
scipy cdist has build time 0, but query time ~ npt * nq;
KDTree build times are complicated, but lookups are very fast, ~ ln(npt) * nq.
On a regular (Manhattan) grid you can do much better: see (ahem) find-nearest-value-in-numpy-array.
A little testbench: building a KDTree of 5000 × 5000 2d points takes about 30 seconds, then queries take microseconds; scipy cdist on 25 million × 20 points (all pairs, 4 GB) takes about 5 seconds on my old iMac.
I have been trying to follow along with this. I'm new to Jupyter Notebooks, Python and the various tools discussed here, but I have managed to get some way down the road I'm travelling.
import pandas as pd
import scipy.spatial

BURoute = pd.read_csv('C:/Users/andre/BUKP_1m.csv', header=None)
NGEPRoute = pd.read_csv('c:/Users/andre/N1-06.csv', header=None)
I create a combined XY array from my BURoute dataframe
combined_x_y_arrays = BURoute.iloc[:,[0,1]]
And I create the points with the following command
points = NGEPRoute.iloc[:,[0,1]]
I then do the KDTree magic
def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes
results2 = do_kdtree(combined_x_y_arrays, points)
This gives me an array of the indexes. I'm now trying to figure out how to calculate the distance between the points and the indexed points in the results array.
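Note that cKDTree.query already computes those distances; the do_kdtree function above simply discards them. One small tweak (my suggestion, not from the original) is to return both:

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return dist, indexes

dist then holds, for each query point, the distance to its matched neighbour.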
def find_nearest_vector(arrList, value):
    # Brute-force search: find a coordinate in arrList that lies within
    # `offset` of `value` on both axes (value is given as (y, x)).
    arrList = np.asarray(arrList)
    y, x = value
    offset = 10
    x_Array = np.array([p[1] for p in arrList])
    y_Array = np.array([p[0] for p in arrList])
    difference_array_x = np.absolute(x_Array - x)
    difference_array_y = np.absolute(y_Array - y)
    index_x = np.where(difference_array_x < offset)[0]
    index_y = np.where(difference_array_y < offset)[0]
    # keep only the indices that are close in both x and y
    index = np.intersect1d(index_x, index_y, assume_unique=True)
    nearestCoordinate = (arrList[index][0][0], arrList[index][0][1])
    return nearestCoordinate