Python - Find closest indices from 2 sets

I have 2 sets of indices (i,j).
What I need is the pair of indices, one from each set, that are closest to each other.
It is easier to explain graphically:
Assuming I have all the indices that make the first black shape, and all the indices that make the second black shape, how do I find the closest indices (the red points in the figure) between those 2 shapes, in an efficient way (built in function in Python, not by iterating through all the possibilities)?
Any help will be appreciated!

Since you asked for a built-in function rather than a loop over all combinations: scipy.spatial.distance has a function, cdist, that does just that. It returns a matrix of the distances between every pair of points taken from its two inputs. If A and B are collections of 2D points, then:
from scipy.spatial import distance
dists = distance.cdist(A,B)
The position (row, column) of the minimal value in that matrix then gives the index of the closest point in A and in B respectively.
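As a minimal sketch of that last step, assuming A and B are array-likes of shape (n, 2) and (m, 2): the helper name closest_pair and the sample points are made up for illustration. Note that cdist materialises the full n x m matrix, so memory grows with the product of the two set sizes.

```python
import numpy as np
from scipy.spatial import distance

def closest_pair(A, B):
    """Return (i, j, d): A[i] and B[j] are the closest pair, d is their distance."""
    dists = distance.cdist(A, B)            # shape (len(A), len(B))
    flat_pos = np.argmin(dists)             # position in the flattened matrix
    i, j = np.unravel_index(flat_pos, dists.shape)
    return int(i), int(j), float(dists[i, j])

# Hypothetical sample points, not taken from the question
A = np.array([[0, 0], [0, 1], [1, 1]])
B = np.array([[5, 5], [3, 1], [4, 4]])
i, j, d = closest_pair(A, B)
print(i, j, d)
```

Here A[i] and B[j] are the two "red points" from the question.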

Related

fastest way to find min euclidean distance between two arrays

I have two arrays of x,y,z coordinates (e.g. a = [(x1,y1,z1)...(xN,yN,zN)], b = [(X1,Y1,Z1)...(XN,YN,ZN)]). I need the fastest way to find, for each point in a, the index of the point in b with the minimum Euclidean distance. Here's the catch: I'm using a modified/weighted Euclidean equation. Currently I'm doing two nested for loops, which admittedly is the slowest way to do it.
b typically has around 500 coordinate sets to choose from, but a can have tens to hundreds of thousands.
As an example:
a = (1,1,1), b = [(87,87,87),(2,2,2),(50,50,50)]
would return index 1.
You could build a k-d tree from array b and, for each coordinate in array a, find its nearest neighbor by traversing the tree.
For array a of size n and array b of size m, the complexity would be O(m log(m)) for building the tree and, on average, O(n log(m)) for finding all the nearest neighbors.
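A rough sketch of the k-d tree idea using scipy.spatial.cKDTree. The question does not show its weighted formula, so this assumes the weighting is a per-axis factor, i.e. d² = Σ w_k (a_k − b_k)²; in that case scaling every coordinate by sqrt(w_k) reduces it to a plain Euclidean query. The function name nearest_indices and the weights argument are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_indices(a, b, weights=None):
    """For each point in a, return the index of the nearest point in b.

    weights (assumption): optional per-axis weights w_k for a distance of the
    form sqrt(sum(w_k * (a_k - b_k)**2)).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if weights is not None:
        scale = np.sqrt(np.asarray(weights, dtype=float))
        a = a * scale                  # scaling both sets turns the weighted
        b = b * scale                  # distance into an ordinary Euclidean one
    tree = cKDTree(b)                  # built once from the smaller set
    dist, idx = tree.query(a)          # nearest neighbour for every row of a
    return idx, dist

a = [(1, 1, 1)]
b = [(87, 87, 87), (2, 2, 2), (50, 50, 50)]
idx, dist = nearest_indices(a, b)
print(idx)
```

With only about 500 points in b, a brute-force cdist over chunks of a may be just as fast; it is worth timing both.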

Flatten only part of a dataframe shape for Euclidean calculation?

I have a data frame with shape:
(20,30,1024)
I want to find the Euclidean distance between every entry and every other entry in the dataframe (ideally non-redundantly, i.e. don't find the distance of row 1 and 5....and then row 5 and 1 but not there yet). I have this code:
from scipy.spatial.distance import pdist,squareform
distances = pdist(df_test,metric='euclidean')
dist_matrix = squareform(distances)
print(dist_matrix)
The error says:
A 2-dimensional array must be passed.
So I guess I want to convert my matrix from shape (20,30,1024) to (20,30720), and then calculate the pdist/squareform between the rows (i.e. 20 rows of vectors that are 30720 in length).
I know that I can use df_test[0:20].flatten().tolist()
But that completely flattened my matrix; the output shape was (1,614400).
Can someone show me how to convert a shape from (20,30,1024) to (20,30720), or tell me if I'm not going about this the right way?
The ultimate end goal is to calculate Euclidean distance between all non-redundant pairs in a data set, but the data set is big, so I need to do it as efficiently as possible/not duplicating calculations.
The most straightforward way to reshape that I can think of, according to how you described the problem, is:
df_test.values.reshape(20, -1)
By calling .values, you retrieve your dataframe's data as a NumPy array. From there, .reshape finishes the job. Since you need a 2D array, you provide the size of the first dimension (in your case, 20), and by passing -1 NumPy will calculate the size of the second dimension for you (here it multiplies the remaining dimension sizes of the original 3D array: 30 * 1024 = 30720).
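Putting the reshape together with the distance computation might look like the sketch below. It assumes the data is already available as a NumPy array of shape (20, 30, 1024); the random array is a stand-in for the real data. pdist only computes each unordered pair once, so the non-redundant result asked for is its condensed output, and squareform is only needed if the full symmetric matrix is wanted.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder data with the shape described in the question
rng = np.random.default_rng(0)
data = rng.random((20, 30, 1024))

flat = data.reshape(data.shape[0], -1)        # (20, 30720): one long vector per entry
condensed = pdist(flat, metric="euclidean")   # 20 * 19 / 2 = 190 unique pairs
dist_matrix = squareform(condensed)           # optional (20, 20) symmetric view

print(flat.shape, condensed.shape, dist_matrix.shape)
```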

Find indices of each integer group in a labelled array

I have a labelled array obtained by using measure.label (from scikit-image) on a binary 2-dimensional array. For argument's sake it might look like this:
[
[1,1,0,0,2],
[1,1,1,0,2],
[1,0,0,0,0],
[0,0,0,3,3]
]
I want to get the indices of each group of labels. So in this case:
[
[(0,0),(0,1),(1,0),(1,1),(1,2),(2,0)],
[(0,4),(1,4)],
[(3,3),(3,4)]
]
I can do this using built-in Python like so (n and m are the dimensions of the array):
import itertools
import numpy as np
_dict = {}
for coords in itertools.product(range(n), range(m)):
    # group every (row, col) coordinate under its label value
    _dict.setdefault(labelled_array[coords], []).append(coords)
blobs = [np.array(item) for item in _dict.values()]
This is very slow (about 10 times slower than the initial labelling of the binary array using measure.label!)
Scipy also has a function find_objects:
from scipy import ndimage
objs = ndimage.find_objects(labelled_array)
From what I can gather, though, this returns the bounding box for each group (object). I don't want the bounding box; I want the exact coordinates of each value in the group.
I have also tried using np.where for each integer in the number of labels. This is very slow.
It also seems to me that what I'm trying to do here is something like the minesweeper algorithm. I suspect there must be an efficient solution using numpy or scipy.
Is there an efficient way to obtain these coordinates?
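One possible vectorised approach, offered as a sketch rather than a benchmarked answer: sort the flattened labels once, split the sorted positions wherever the label changes, and convert each chunk back to (row, col) coordinates. The function name label_coordinates is made up, the array is assumed to be non-empty, and label 0 is assumed to be background.

```python
import numpy as np

def label_coordinates(labelled):
    """Map each non-zero label to an array of its (row, col) coordinates."""
    labelled = np.asarray(labelled)
    flat = labelled.ravel()
    order = np.argsort(flat, kind="stable")       # positions grouped by label
    sorted_labels = flat[order]
    # indices in the sorted array where a new label starts
    boundaries = np.flatnonzero(np.diff(sorted_labels)) + 1
    groups = np.split(order, boundaries)
    group_labels = sorted_labels[np.r_[0, boundaries]]
    return {
        int(lab): np.column_stack(np.unravel_index(idx, labelled.shape))
        for lab, idx in zip(group_labels, groups)
        if lab != 0                               # assumption: 0 is background
    }

labelled_array = np.array([
    [1, 1, 0, 0, 2],
    [1, 1, 1, 0, 2],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 3, 3],
])
blobs = label_coordinates(labelled_array)
print(blobs[2])
```

The single sort costs O(N log N) for N pixels, instead of one full pass over the array per label as with repeated np.where calls.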

Minimize sum of distances between mutually disjoint bipartite pair of points

I have a multidimensional array which represents the distances between two groups of points (colored blue and red respectively).
import numpy as np
distance=np.array([[30,18,51,55],
[35,15,50,49],
[36,17,40,32],
[40,29,29,17]])
Each column represents a red dot and each row a blue dot. The values in this matrix are the distances between red and blue dots. Here is a sketch to understand what it looks like:
Question: How to find the minimum of the sum of distances between mutually disjoint (blue, red) pairs?
Attempt
I am expecting to find 1=1, 2=2, 3=3 and 4=4 in the above image. However, if I use a simple NumPy argmin like:
for liste in distance:
    print(np.argmin(liste))
the result is
1
1
1
3
because red point 2 is the nearest to blue points 1, 2 and 3.
Is there a generic way to handle that case better? I mean without using a lot of if statements and a while loop.
The problem is known as the assignment problem in operations management and can be solved efficiently with the Hungarian algorithm. In your case, the distance can be viewed as a "cost" whose total is to be minimized.
Luckily, scipy has linear_sum_assignment() (see official docs and example) implemented, so you don't have to reinvent the wheel. The function returns the matched row and column indices.
import numpy as np
from scipy.optimize import linear_sum_assignment
distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])
row_ind, col_ind = linear_sum_assignment(distance)
# result
print(row_ind)  # [0 1 2 3]
print(col_ind)  # [0 1 2 3]
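To read off the actual pairs and the minimised total, the returned index arrays can be used to index the cost matrix directly. A small sketch using the same matrix as the question; the variable names pairs and total are only for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])

row_ind, col_ind = linear_sum_assignment(distance)

# (blue, red) index pairs of the optimal matching
pairs = list(zip(row_ind.tolist(), col_ind.tolist()))
# total distance of that matching: 30 + 15 + 40 + 17
total = int(distance[row_ind, col_ind].sum())
print(pairs, total)
```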
You can use itertools.permutations to enumerate all possible solutions, then calculate which one minimizes the total pair-wise distance. This brute-force search checks all n! permutations, so it is only practical for small matrices.
import itertools
import numpy as np
distance=np.array([[30,18,51,55],[35,15,50,49],[36,17,40,32],[40,29,29,17]])
permutation=[x for x in itertools.permutations([0,1,2,3],4)]
x_opt=permutation[0]
d_opt=sum([distance[i,x_opt[i]] for i in range(len(distance[0]))])
for x in permutation:
    d=sum([distance[i,x[i]] for i in range(len(distance[0]))])
    if d<d_opt:
        (d_opt,x_opt)=(d,x)
print(x_opt)
The result will be in this case:
(0, 1, 2, 3)

Python: Get median in 3-dimensional numpy array

I have a 3-dimensional numpy array, where the first two dimensions form a grid, and the third dimension (let's call it cell) is a vector of attributes. Here is an example for array x (a 2x3 grid with 4 attributes in each cell):
[[[1 2 3 4][5 6 7 8][9 8 7 6]]
[[9 8 7 6][5 4 3 2][1 2 3 4]]]
for which I want to get the median of the 8 neighbors of each cell in array x, e.g. for x[i,j,:] it would be the median of all cells with an index combined of i-1, i+1, j-1, j+1. It is clear how to do that, but for the borders the index would get out of range (e.g. if i=0, a general solution where I take x[i-1,j,:] into the calculation wouldn't work).
Now the simple solution would be (simple in the sense of not thought through) to separately treat the 4 corners (e.g. where i=j=0), borders (e.g. where i=0 and j!=0) and the default case for cells in the middle with if statements, but I would hope that there is a more elegant solution for this problem. I thought to extend the n*m grid to a (n+2)*(m+2) grid and fill the border cells on all sides with 0 values, but that would distort the median computation.
I hope I was able to kind of clarify the problem. Thanks in advance for any suggestions for a more elegant way to solve this.
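One way around the border problem, sketched under the assumption that the median should be taken per attribute over whichever of the 8 neighbours exist: pad the grid with NaN instead of 0 and use np.nanmedian, which ignores the padding. The function name neighbour_median is hypothetical.

```python
import numpy as np

def neighbour_median(x):
    """Per-attribute median over the (up to 8) neighbours of every grid cell."""
    n, m, _ = x.shape
    # NaN border: padded cells are skipped by nanmedian, so they do not
    # distort the result the way zero padding would
    padded = np.pad(x.astype(float), ((1, 1), (1, 1), (0, 0)),
                    constant_values=np.nan)
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0)]             # the cell itself is excluded
    # one shifted copy of the grid per neighbour direction
    shifted = np.stack([padded[1 + di:1 + di + n, 1 + dj:1 + dj + m, :]
                        for di, dj in offsets])
    return np.nanmedian(shifted, axis=0)

x = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 8, 7, 6]],
              [[9, 8, 7, 6], [5, 4, 3, 2], [1, 2, 3, 4]]])
med = neighbour_median(x)
print(med[0, 0])
```

For the corner cell x[0, 0] only three neighbours exist, so its result is the median of x[0, 1], x[1, 0] and x[1, 1]. A 1x1 grid has no neighbours at all and would yield NaN with a warning.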
