Fast fuse of close points in a numpy-2d (vectorized)

Fast fuse of close points in a numpy-2d (vectorized) - python

I have a question similar to the question asked here:
simple way of fusing a few close points. I want to replace points that are located close to each other with the average of their coordinates. The closeness in cells is specified by the user (I am talking about euclidean distance).
In my case I have a lot of points (about 1-million). This method is working, but is time consuming as it uses a double for loop.
Is there a faster way to detect and fuse close points in a numpy 2d array?
To be complete I added an example:
points=array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 820.57417943, 84.27702407],
[ 806.71416007, 108.50307828]])
A scatterplot of the points is visible below. The red circle indicates the points located close to each other (in this case a distance of 27.91 between the last two points in the array). So if the user would specify a minimum distance of 30 these points should be fused.
In the output of the fuse function the last to points are fused. This will look like:
#output
array([[ 382.49056159, 640.1731949 ],
[ 496.44669161, 655.8583119 ],
[ 1255.64762859, 672.99699399],
[ 1070.16520917, 688.33538171],
[ 318.89390168, 718.05989421],
[ 259.7106383 , 822.2 ],
[ 141.52574427, 28.68594436],
[ 1061.13573287, 28.7094536 ],
[ 813.64416975, 96.390051175]])

If you have a large number of points then it may be faster to build a k-D tree using scipy.spatial.KDTree, then query it for pairs of points that are closer than some threshold:
import numpy as np
from scipy.spatial import KDTree
tree = KDTree(points)
rows_to_fuse = tree.query_pairs(r=30)
print(repr(rows_to_fuse))
# {(8, 9)}
print(repr(points[list(rows_to_fuse)]))
# array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
The major advantage of this approach is that you don't need to compute the distance between every pair of points in your dataset.

You can use scipy's distance functions such as pdist in order to quickly find which points should be merged:
import numpy as np
from scipy.spatial.distance import pdist, squareform
d = squareform(pdist(a))
d = np.ma.array(d, mask=np.isclose(d, 0))
a[d.min(axis=1) < 30]
#array([[ 820.57417943, 84.27702407],
# [ 806.71416007, 108.50307828]])
NOTE
For large samples this method can cause memory errors since it is storing a full matrix containing the relative distances.

Related

nearest intersection point to many lines in python

I need a good algorithm for calculating the point that is closest to a collection of lines in python, preferably by using least squares. I found this post on a python implementation that doesn't work:
Finding the centre of multiple lines using least squares approach in Python
And I found this resource in Matlab that everyone seems to like... but I'm not sure how to convert it to python:
https://www.mathworks.com/matlabcentral/fileexchange/37192-intersection-point-of-lines-in-3d-space
I find it hard to believe that someone hasn't already done this... surely this is part of numpy or a standard package, right? I'm probably just not searching for the right terms - but I haven't been able to find it yet. I'd be fine with defining lines by two points each or by a point and a direction. Any help would be greatly appreciated!
Here's an example set of points that I'm working with:
initial XYZ points for the first set of lines
array([[-7.07107037, 7.07106748, 1. ],
[-7.34818339, 6.78264559, 1. ],
[-7.61352972, 6.48335745, 1. ],
[-7.8667115 , 6.17372055, 1. ],
[-8.1072994 , 5.85420065, 1. ]])
the angles that belong to the first set of lines
[-44.504854, -42.029223, -41.278573, -37.145774, -34.097022]
initial XYZ points for the second set of lines
array([[ 0., -20. , 1. ],
[ 7.99789129e-01, -19.9839984, 1. ],
[ 1.59830153e+00, -19.9360366, 1. ],
[ 2.39423914e+00, -19.8561769, 1. ],
[ 3.18637019e+00, -19.7445510, 1. ]])
the angles that belong to the second set of lines
[89.13244, 92.39087, 94.86425, 98.91849, 99.83488]
The solution should be the origin or very near it (the data is just a little noisy, which is why the lines don't perfectly intersect at a single point).

Here's a numpy solution using the method described in this link
def intersect(P0,P1):
"""P0 and P1 are NxD arrays defining N lines.
D is the dimension of the space. This function
returns the least squares intersection of the N
lines from the system given by eq. 13 in
http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf.
"""
# generate all line direction vectors
n = (P1-P0)/np.linalg.norm(P1-P0,axis=1)[:,np.newaxis] # normalized
# generate the array of all projectors
projs = np.eye(n.shape[1]) - n[:,:,np.newaxis]*n[:,np.newaxis] # I - n*n.T
# see fig. 1
# generate R matrix and q vector
R = projs.sum(axis=0)
q = (projs # P0[:,:,np.newaxis]).sum(axis=0)
# solve the least squares problem for the
# intersection point p: Rp = q
p = np.linalg.lstsq(R,q,rcond=None)[0]
return p
Works
Edit: here is a generator for noisy test data
n = 6
P0 = np.stack((np.array([5,5])+3*np.random.random(size=2) for i in range(n)))
a = np.linspace(0,2*np.pi,n)+np.random.random(size=n)*np.pi/5.0
P1 = np.array([5+5*np.sin(a),5+5*np.cos(a)]).T

If this wikipedia equation carries any weight:
then you can use:
def nearest_intersection(points, dirs):
"""
:param points: (N, 3) array of points on the lines
:param dirs: (N, 3) array of unit direction vectors
:returns: (3,) array of intersection point
"""
dirs_mat = dirs[:, :, np.newaxis] # dirs[:, np.newaxis, :]
points_mat = points[:, :, np.newaxis]
I = np.eye(3)
return np.linalg.lstsq(
(I - dirs_mat).sum(axis=0),
((I - dirs_mat) # points_mat).sum(axis=0),
rcond=None
)[0]
If you want help deriving / checking that equation from first principles, then math.stackexchange.com would be a better place to ask.
surely this is part of numpy
Note that numpy gives you enough tools to express this very concisely already

Here's the final code that I ended up using. Thanks to kevinkayaks and everyone else who responded! Your help is very much appreciated!!!
The first half of this function simply converts the two collections of points and angles to direction vectors. I believe the rest of it is basically the same as what Eric and Eugene proposed. I just happened to have success first with Kevin's and ran with it until it was an end-to-end solution for me.
import numpy as np
def LS_intersect(p0,a0,p1,a1):
"""
:param p0 : Nx2 (x,y) position coordinates
:param p1 : Nx2 (x,y) position coordinates
:param a0 : angles in degrees for each point in p0
:param a1 : angles in degrees for each point in p1
:return: least squares intersection point of N lines from eq. 13 in
http://cal.cs.illinois.edu/~johannes/research/LS_line_intersect.pdf
"""
ang = np.concatenate( (a0,a1) ) # create list of angles
# create direction vectors with magnitude = 1
n = []
for a in ang:
n.append([np.cos(np.radians(a)), np.sin(np.radians(a))])
pos = np.concatenate((p0[:,0:2],p1[:,0:2])) # create list of points
n = np.array(n)
# generate the array of all projectors
nnT = np.array([np.outer(nn,nn) for nn in n ])
ImnnT = np.eye(len(pos[0]))-nnT # orthocomplement projectors to n
# now generate R matrix and q vector
R = np.sum(ImnnT,axis=0)
q = np.sum(np.array([np.dot(m,x) for m,x in zip(ImnnT,pos)]),axis=0)
# and solve the least squares problem for the intersection point p
return np.linalg.lstsq(R,q,rcond=None)[0]
#sample data
pa = np.array([[-7.07106638, 7.07106145, 1. ],
[-7.34817263, 6.78264524, 1. ],
[-7.61354115, 6.48336347, 1. ],
[-7.86671133, 6.17371816, 1. ],
[-8.10730426, 5.85419995, 1. ]])
paa = [-44.504854321138524, -42.02922380123842, -41.27857390748773, -37.145774853341386, -34.097022454778674]
pb = np.array([[-8.98220431e-07, -1.99999962e+01, 1.00000000e+00],
[ 7.99789129e-01, -1.99839984e+01, 1.00000000e+00],
[ 1.59830153e+00, -1.99360366e+01, 1.00000000e+00],
[ 2.39423914e+00, -1.98561769e+01, 1.00000000e+00],
[ 3.18637019e+00, -1.97445510e+01, 1.00000000e+00]])
pba = [88.71923357743934, 92.55801427272372, 95.3038321024299, 96.50212060095349, 100.24177145619092]
print("Should return (-0.03211692, 0.14173216)")
solution = LS_intersect(pa,paa,pb,pba)
print(solution)

4D interpolation for irregular (x,y,z) grids by python

I have some data that comes in the form (x, y, z, V) where x,y,z are distances, and V is the moisture. I read a lot on StackOverflow about interpolation by python like this and this valuable posts, but all of them were about regular grids of x, y, z. i.e. every value of x contributes equally with every point of y, and every point of z. On the other hand, my points came from 3D finite element grid (as below), where the grid is not regular.
The two mentioned posts 1 and 2, defined each of x, y, z as a separate numpy array then they used something like cartcoord = zip(x, y) then scipy.interpolate.LinearNDInterpolator(cartcoord, z) (in a 3D example). I can not do the same as my 3D grid is not regular, thus not each point has a contribution to other points, so if when I repeated these approaches I found many null values, and I got many errors.
Here are 10 sample points in the form of [x, y, z, V]
data = [[27.827, 18.530, -30.417, 0.205] , [24.002, 17.759, -24.782, 0.197] ,
[22.145, 13.687, -33.282, 0.204] , [17.627, 18.224, -25.197, 0.197] ,
[29.018, 18.841, -38.761, 0.212] , [24.834, 20.538, -33.012, 0.208] ,
[26.232, 22.327, -27.735, 0.204] , [23.017, 23.037, -29.230, 0.205] ,
[28.761, 21.565, -31.586, 0.211] , [26.263, 23.686, -32.766, 0.215]]
I want to get the interpolated value V of the point (25, 20, -30)
How can I get it?

I found the answer, and posting it for the benefit of StackOverflow readers.
The method is as follows:
1- Imports:
import numpy as np
from scipy.interpolate import griddata
from scipy.interpolate import LinearNDInterpolator
2- prepare the data as follows:
# put the available x,y,z data as a numpy array
points = np.array([
[ 27.827, 18.53 , -30.417], [ 24.002, 17.759, -24.782],
[ 22.145, 13.687, -33.282], [ 17.627, 18.224, -25.197],
[ 29.018, 18.841, -38.761], [ 24.834, 20.538, -33.012],
[ 26.232, 22.327, -27.735], [ 23.017, 23.037, -29.23 ],
[ 28.761, 21.565, -31.586], [ 26.263, 23.686, -32.766]])
# and put the moisture corresponding data values in a separate array:
values = np.array([0.205, 0.197, 0.204, 0.197, 0.212,
0.208, 0.204, 0.205, 0.211, 0.215])
# Finally, put the desired point/points you want to interpolate over
request = np.array([[25, 20, -30], [27, 20, -32]])
3- Write the final line of code to get the interpolated values
Method 1, using griddata
print griddata(points, values, request)
# OUTPUT: array([ 0.20448536, 0.20782028])
Method 2, using LinearNDInterpolator
# First, define an interpolator function
linInter= LinearNDInterpolator(points, values)
# Then, apply the function to one or more points
print linInter(np.array([[25, 20, -30]]))
print linInter(request)
# OUTPUT: [0.20448536 0.20782028]
# I think you may use it with python map or pandas.apply as well
Hope this benefit every one.
Bet regards

How to average all coordinates within a given distance in a vectorized way

I did find a way to calculate the center coordinate of a cluster of points. However, my method is quite slow when the number of initial coordinates is increased (I have about 100 000 coordinates).
The bottleneck is the for-loop in the code. I tried to remove it by using np.apply_along_axis, but discovered that this is nothing more than a hidden python-loop.
Is it possible to detect and average out various sized clusters of too close points in a vectorized way?
import numpy as np
from scipy.spatial import cKDTree
np.random.seed(7)
max_distance=1
#Create random points
points = np.array([[1,1],[1,2],[2,1],[3,3],[3,4],[5,5],[8,8],[10,10],[8,6],[6,5]])
#Create trees and detect the points and neighbours which needs to be fused
tree = cKDTree(points)
rows_to_fuse = np.array(list(tree.query_pairs(r=max_distance))).astype('uint64')
#Split the points and neighbours into two groups
points_to_fuse = points[rows_to_fuse[:,0], :2]
neighbours = points[rows_to_fuse[:,1], :2]
#get unique points_to_fuse
nonduplicate_points = np.ascontiguousarray(points_to_fuse)
unique_points = np.unique(nonduplicate_points.view([('', nonduplicate_points.dtype)]\
*nonduplicate_points.shape[1]))
unique_points = unique_points.view(nonduplicate_points.dtype).reshape(\
(unique_points.shape[0],\
nonduplicate_points.shape[1]))
#Empty array to store fused points
fused_points = np.empty((len(unique_points), 2))
####BOTTLENECK LOOP####
for i, point in enumerate(unique_points):
#Detect all locations where a unique point occurs
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
#Select all neighbours on these locations take the average
fused_points[i,:] = (np.average(np.hstack((point[0],neighbours[locs,0][0]))),np.average(np.hstack((point[1],neighbours[locs,1][0]))))
#Get original points that didn't need to be fused
points_without_fuse = np.delete(points, np.unique(rows_to_fuse.reshape((1, -1))), axis=0)
#Stack result
points = np.row_stack((points_without_fuse, fused_points))
Expected output
>>> points
array([[ 8. , 8. ],
[ 10. , 10. ],
[ 8. , 6. ],
[ 1.33333333, 1.33333333],
[ 3. , 3.5 ],
[ 5.5 , 5. ]])
EDIT 1: Example of 1 loop with desired result
Step 1: Create variables for the loop
#outside loop
points_to_fuse = np.array([[100,100],[101,101],[100,100]])
neighbours = np.array([[103,105],[109,701],[99,100]])
unique_points = np.array([[100,100],[101,101]])
#inside loop
point = np.array([100,100])
i = 0
Step 2: Detect all locations where a unique point occurs in the points_to_fuse array
locs=np.where(np.logical_and((points_to_fuse[:,0] == point[0]), (points_to_fuse[:,1]==point[1])))
>>> (array([0, 2], dtype=int64),)
Step 3: Create an array of the point and the neighbouring points at these locations and calculate the average
array_of_points = np.column_stack((np.hstack((point[0],neighbours[locs,0][0])),np.hstack((point[1],neighbours[locs,1][0]))))
>>> array([[100, 100],
[103, 105],
[ 99, 100]])
fused_points[i, :] = np.average(array_of_points, 0)
>>> array([ 100.66666667, 101.66666667])
Loop output after a complete run:
>>> print(fused_points)
>>> array([[ 100.66666667, 101.66666667],
[ 105. , 401. ]])

The bottleneck is not the loop which is necessary since all the neighborhoods have not the same size.
The pitfall is the points_to_fuse[:,0] == point[0] in the loop which trig a quadratic complexity. you can avoid that by sorting the points, by index.
An example to do that, even it doesn't solve the whole problem (after the generation of rows_to_fuse):
sorter=np.lexsort(rows_to_fuse.T)
sorted_points=rows_to_fuse[sorter]
uniques,counts=np.unique(sorted_points[:,1],return_counts=True)
indices=counts.cumsum()
neighbourhood=np.split(sorted_points,indices)[:-1]
means=[(points[ne[:,0]].sum(axis=0)+points[ne[0,1]])/(len(ne)+1) \
for ne in neighbourhood] # a simple python loop.
# + manage unfused points.
An other improvement is to compute means with numba if you want to speed the code, but the complexity is now ~ optimal I think.

numpy's fast Fourier transform yields unexpected results

I am struggling with numpy's implementation of the fast Fourier transform. My signal is not of periodic nature and therefore certainly not an ideal candidate, the result of the FFT however is far from what I was expecting. It is the same signal, simply stretched by some factor. I plotted a sinus curve, approximating my signal next to it which should illustrate, that I use the FFT function correctly:
import numpy as np
from matplotlib import pyplot as plt
signal = array([[ 0.], [ 0.1667557 ], [ 0.31103874], [ 0.44339886], [ 0.50747922],
[ 0.47848347], [ 0.64544846], [ 0.67861755], [ 0.69268326], [ 0.71581176],
[ 0.726552 ], [ 0.75032795], [ 0.77133769], [ 0.77379966], [ 0.80519187],
[ 0.78756476], [ 0.84179849], [ 0.85406538], [ 0.82852684], [ 0.87172407],
[ 0.9055542 ], [ 0.90563205], [ 0.92073452], [ 0.91178145], [ 0.8795554 ],
[ 0.89155587], [ 0.87965686], [ 0.91819571], [ 0.95774404], [ 0.95432073],
[ 0.96326252], [ 0.99480947], [ 0.94754962], [ 0.9818627 ], [ 0.9804966 ],
[ 1.], [ 0.99919711], [ 0.97202208], [ 0.99065786], [ 0.90567128],
[ 0.94300558], [ 0.89839004], [ 0.87312245], [ 0.86288378], [ 0.87301008],
[ 0.78184963], [ 0.73774451], [ 0.7450479 ], [ 0.67291666], [ 0.63518575],
[ 0.57036157], [ 0.5709147 ], [ 0.63079811], [ 0.61821523], [ 0.49526048],
[ 0.4434457 ], [ 0.29746173], [ 0.13024641], [ 0.17631683], [ 0.08590552]])
sinus = np.sin(np.linspace(0, np.pi, 60))
plt.plot(signal)
plt.plot(sinus)
The blue line is my signal, the green line is the sinus.
transformed_signal = abs(np.fft.fft(signal)[:30] / len(signal))
transformed_sinus = abs(np.fft.fft(sinus)[:30] / len(sinus))
plt.plot(transformed_signal)
plt.plot(transformed_sinus)
The blue line is transformed_signal, the green line is the transformed_sinus.
Plotting only transformed_signal illustrates the behavior described above:
Can someone explain to me what's going on here?
UPDATE
I was indeed a problem of calling the FFT. This is the correct call and the correct result:
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))

Numpy's fft is by default applied over rows. Since your signal variable is a column vector, fft is applied over the rows consisting of one element and returns the one-point FFT of each element.
Use the axis option of fft to specify that you want FFT applied over the columns of signal, i.e.,
transformed_signal = abs(np.fft.fft(signal,axis=0)[:30] / len(signal))

[EDIT] I overlooked the crucial thing stated by Stelios! Nevertheless I leave my answer here, since, while not spotting the root cause of your trouble, it is still true and contains things you have to reckon with for a useable FFT.
As you say you're tranforming a non-periodical signal.
Your signal has some ripples (higher harmonics) which nicely show up in the FFT.
The sine does have far less higher freq's and consists largely of a DC component.
So far so good. What I don't understand is that your signal also has a DC component, which doesn't show up at all. Could be that this is a matter of scale.
Core of the matter is that while the sinus and your signal look quite the same, they have a totally different harmonic content.
Most notable none of both hold a frequency that corresponds to the half sinus. This is because a 'half sinus' isn't built by summing whole sinusses. In other words: the underlying full sinus wave isn't in the spectral content of the sinus over half the period.
BTW having only 60 samples is a bit meager, Shannon states that your sample frequency should be at least twice the highest signal frequency, otherwise aliasing will happen (mapping freqs to the wrong place). In other words: your signal should visually appear smooth after sampling (unless of course it is discontinuous or has a discontinuous derivative, like a block or triangle wave). But in your case it looks like the sharp peaks are an artifact of undersampling.

Expanding "pixels" on matplotlib + numpy array

I have created a random data source that looks like this:
This is the code I use to gennerate and plot the first image.
import pandas as pd
import numpy as np
import numpy.ma as ma
import matplotlib.pyplot as plt
msize=25
rrange=5
jump=3
start=1
dpi=96
h=500
w=500
X,Y=np.meshgrid(range(0,msize),range(0,msize))
dat=np.random.rand(msize,msize)*rrange
msk=np.zeros_like(dat)
msk[start::jump,start::jump].fill(1)
mdat=msk*dat
mdat[mdat==0]=np.nan
mmdat = ma.masked_where(np.isnan(mdat),mdat)
fig = plt.figure(figsize=(w/dpi,h/dpi),dpi=dpi)
cmap = plt.get_cmap('RdYlBu')
cmap.set_bad(color='#cccccc', alpha=1.)
plot = plt.pcolormesh(X,Y,mmdat,cmap=cmap)
plot.axes.set_ylim(0,msize-1)
plot.axes.set_xlim(0,msize-1)
fig.savefig("masked.png",dpi=dpi)
Often this data source isn't so evenly distributed (but this is another subject).
Is there any kind of interpolation that makes the points "spill out" from its position?
Something like we take that light yellow point #(1,1) and turn all region around it (1 radius in taxi driver metric + diagonals) with the same color/value (for every valid point on image, nans will not be expanded)?
As I "gimped" on this image, on the three most lower/left values, the idea is find a way to do the same in all valid points, and not use gimp for that ;-):
After some thinking I arrived on this solution
import numpy as np
import matplotlib.pyplot as plt
t=np.array([
[ 0,0,0,0,0,0,0,0 ],
[ 0,0,0,0,0,0,0,0 ],
[ 0,0,2,0,0,4,0,0 ],
[ 0,0,0,0,0,0,0,0 ],
[ 0,0,0,0,0,0,0,0 ],
[ 0,0,3,0,0,1,0,0 ],
[ 0,0,0,0,0,0,0,0 ],
[ 0,0,0,0,0,0,0,0 ]])
def spill(arr, nval=0, m=1):
narr=np.copy(arr)
for i in range(arr.shape[0]):
for j in range(arr.shape[1]):
if arr[i][j] != nval:
narr[i-m:i+m+1:1,j-m:j+m+1:1]=arr[i][j]
return narr
l=spill(t)
plt.figure()
plt.pcolormesh(t)
plt.savefig("notspilled.png")
plt.figure()
plt.pcolormesh(l)
plt.savefig("spilled.png")
plt.show()
This solution didn't make me very happy because the double for loop inside the spill() function :-/
Here are the output from the last code
This one isn't spilled
This one was sppilled:
How can I enhance the code above to eliminate the double loop.

You could do this with a 2D convolution. For example:
from scipy.signal import convolve2d
def spill2(arr, nval=0, m=1):
return convolve2d(arr, np.ones((2*m+1, 2*m+1)), mode='same')
np.allclose(spill(t), spill2(t))
# True
Be aware that as written, the results will not match if nval != 0 or if the spilled pixels overlap, but you can probably modify this to suit your needs.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Fast fuse of close points in a numpy-2d (vectorized) - python

Related

nearest intersection point to many lines in python

4D interpolation for irregular (x,y,z) grids by python

How to average all coordinates within a given distance in a vectorized way

numpy's fast Fourier transform yields unexpected results

Expanding "pixels" on matplotlib + numpy array

Categories

Resources