Minimize sum of distances between mutually disjoint bipartite pairs of points - Python

I have a multidimensional array which represents the distances between two groups of points (colored blue and red respectively).
import numpy as np

distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])
Each column represents a red dot and each row a blue dot; the values in this matrix are the distances between the corresponding red and blue dots. Here is a sketch to understand what it looks like:
Question: How to find the minimum of the sum of distances between mutually disjoint (blue, red) pairs?
Attempt
I am expecting to find the pairing 1=1, 2=2, 3=3 and 4=4 shown in the above image. However, if I use a simple numpy argmin call like:
for liste in distance:
    print(np.argmin(liste))
the result is
1
1
1
3
because the second red point is the nearest one to the first, second and third blue points.
Is there a generic way to handle this case, without resorting to a lot of if statements and while loops?

This is known as the assignment problem in operations research and can be solved efficiently by the Hungarian algorithm. In your case, the distance can be viewed as a kind of "cost" whose total is to be minimized.
Luckily, scipy has a nice linear_sum_assignment() implementation (see the official docs and example), so you don't have to reinvent the wheel. The function returns the matched row and column indices.
import numpy as np
from scipy.optimize import linear_sum_assignment

distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])

row_ind, col_ind = linear_sum_assignment(distance)

# result
col_ind
Out[79]: array([0, 1, 2, 3])
row_ind
Out[80]: array([0, 1, 2, 3])
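If you also want the minimum total distance itself, you can index the cost matrix with the matched indices; a small follow-up sketch using the arrays returned above:

# Sum of the distances of the matched (blue, red) pairs.
total = distance[row_ind, col_ind].sum()
print(total)  # 30 + 15 + 40 + 17 = 102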

You can use itertools.permutations to enumerate all possible assignments, then pick the one that minimizes the total pairwise distance.
import itertools
import numpy as np

distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])

permutations = list(itertools.permutations(range(4)))
x_opt = permutations[0]
d_opt = sum(distance[i, x_opt[i]] for i in range(len(distance)))
for x in permutations:
    d = sum(distance[i, x[i]] for i in range(len(distance)))
    if d < d_opt:
        d_opt, x_opt = d, x
print(x_opt)
In this case the result will be:
(0, 1, 2, 3)
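For reference, the same exhaustive search can be written more compactly with min() and a key function; it also reports the optimal total, which matches what linear_sum_assignment finds. This is just a sketch and still scales factorially with the number of points:

import itertools
import numpy as np

distance = np.array([[30, 18, 51, 55],
                     [35, 15, 50, 49],
                     [36, 17, 40, 32],
                     [40, 29, 29, 17]])

def total(assignment):
    # Total distance when row i is matched to column assignment[i].
    return sum(distance[i, j] for i, j in enumerate(assignment))

x_opt = min(itertools.permutations(range(4)), key=total)
print(x_opt, total(x_opt))  # (0, 1, 2, 3) 102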

Related

Python - Find closest indices from 2 sets

I have 2 sets of indices (i, j).
What I need to get is the pair of indices, one from each set, that are closest to each other.
It is easier to explain graphically:
Assuming I have all the indices that make up the first black shape, and all the indices that make up the second black shape, how do I find the closest indices (the red points in the figure) between those 2 shapes, in an efficient way (a built-in function in Python, not iterating through all the possibilities)?
Any help will be appreciated!
As you asked about a built-in function rather than looping through all combinations, there's a function in scipy.spatial.distance that does just that: it outputs a matrix of distances between all pairs of points from the 2 inputs. If A and B are collections of 2D points, then:
from scipy.spatial import distance
dists = distance.cdist(A,B)
Then you can get the index of the minimal value in the matrix.
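For example, something along these lines (a sketch; A and B are hypothetical (N, 2) arrays of coordinates, and np.unravel_index converts the flat argmin back into a (row, column) pair):

import numpy as np
from scipy.spatial import distance

# Hypothetical point sets; replace with the indices of your two shapes.
A = np.array([[0, 0], [1, 2], [3, 1]])
B = np.array([[5, 5], [2, 2], [4, 0]])

dists = distance.cdist(A, B)
i, j = np.unravel_index(np.argmin(dists), dists.shape)
print(A[i], B[j])  # the closest pair of points, one from each shape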

Fast euclidean distances between two sets of points with missing values in Python

I have two numpy matrices X and Y representing each a set of points in some d-dimensional space. I would like to compute all the euclidean distances from each point in X to each point in Y. scipy provides the function cdist to do exactly this, but there is a catch: some points include missing values in the form of NaN. I would like the distance operation to ignore NaN entries, for example if I'm computing the distance between the following two points
a = [1, 3, nan]
b = [2, nan, 4]
then I would ignore the second and third dimensions, thus getting a distance of sqrt((1-2)**2) = 1.
Unfortunately, in this setting cdist just returns a NaN distance whenever a single NaN is found in a pair of points. The same goes for the euclidean_distances function in scikit-learn.
Of course one could write a double loop to perform all the required operations, but since X and Y are large matrices this turns out to be too slow. Therefore, a solution based on numpy/scipy would be ideal.
numpy does include some mechanisms such as masked arrays that allow performing operations ignoring NaN values, but scipy seems to ignore those masks.
What would be an efficient way to perform this operation?
Using the suggestion from @Daniel F, you can use cdist like this:
cdist(XA, XB, lambda u, v: np.sqrt(np.nansum((u-v)**2)))
For instance:
import numpy as np
from scipy.spatial.distance import cdist
a = np.array([1, 3, np.nan])
b = np.array([2, np.nan, 4])
print(np.sqrt(np.nansum((a-b)**2)))
Output:
1.0
The example above is just to demonstrate the effect of the lambda function.
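Applied to two small 2-D point sets with missing values, the same lambda plugs straight into cdist; a minimal sketch with made-up inputs:

import numpy as np
from scipy.spatial.distance import cdist

XA = np.array([[1.0, 3.0, np.nan],
               [2.0, 2.0, 2.0]])
XB = np.array([[2.0, np.nan, 4.0],
               [0.0, 0.0, 0.0]])

# NaN-aware Euclidean distance: dimensions with a NaN in either point are skipped.
nan_euclidean = lambda u, v: np.sqrt(np.nansum((u - v) ** 2))
print(cdist(XA, XB, nan_euclidean))  # 2x2 matrix of pairwise distances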
The simplest way is to use the standard Euclidean distance formula, but replace the sum with nansum:
np.sqrt(np.nansum((X - Y)**2))
I doubt you're going to get anything easier than that (you'll have to work out the broadcasting yourself, as you only gave 1d inputs). Standard practice is that NaN is always carried through calculations.
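For 2-D inputs X of shape (m, d) and Y of shape (n, d), that broadcasting could look roughly like this (a sketch; note it builds an (m, n, d) intermediate, so memory grows with m*n*d):

import numpy as np

X = np.array([[1.0, 3.0, np.nan],
              [0.0, 1.0, 2.0]])
Y = np.array([[2.0, np.nan, 4.0],
              [1.0, 1.0, 1.0]])

# Pairwise NaN-ignoring Euclidean distances via broadcasting:
# (m, 1, d) - (1, n, d) -> (m, n, d), then reduce over the last axis.
diff = X[:, None, :] - Y[None, :, :]
dists = np.sqrt(np.nansum(diff ** 2, axis=-1))
print(dists)  # shape (m, n)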

Modify k-means algorithm for 1d array where order matters

I want to find groups in a one-dimensional array where order/position matters. I tried to use scipy's kmeans2 but it works only when I have numbers in increasing order.
I have to maximize the average difference between neighbouring sub-arrays.
For example: if I have the array [1,2,2,8,9,0,0,0,1,1,1] and I want to get 4 groups, the result should be something like [1,2,2], [8,9], [0,0,0], [1,1,1].
Is there a way to do it in better than O(n^k)?
Answer: I ended up with a modified dendrogram, where I only merge neighbours.
K-means is about minimizing the sum of squared errors. Among its largest drawbacks (there are many) is that you need to know k. Why do you want to inherit this drawback?
Instead of hacking k-means into not ignoring the order, why don't you instead look at time series segmentation and change detection approaches that are much more appropriate for this problem?
E.g. split your time series wherever abs(x[i] - x[i-1]) > stddev, where stddev is the standard deviation of your data set, or the standard deviation of the last 10 samples. (In the above series the standard deviation is about 3, so it would split as [1,2,2], [8,9], [0,0,0,1,1,1], because the change from 0 to 1 is not significant.)
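A rough sketch of that idea (the threshold choice is up to you; here the global standard deviation is used, as suggested above):

import numpy as np

x = [1, 2, 2, 8, 9, 0, 0, 0, 1, 1, 1]
threshold = np.std(x)  # roughly 3 for this series

# Start a new segment whenever the jump from the previous value exceeds the threshold.
segments = [[x[0]]]
for prev, cur in zip(x, x[1:]):
    if abs(cur - prev) > threshold:
        segments.append([cur])
    else:
        segments[-1].append(cur)

print(segments)  # [[1, 2, 2], [8, 9], [0, 0, 0, 1, 1, 1]]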

Fastest way to Iterate a Matrix with vectors as entries in numpy

I'm using a function in Python's OpenCV library to get the optical flow of my hand as I move it around. Specifically http://docs.opencv.org/modules/video/doc/motion_analysis_and_object_tracking.html#calcopticalflowfarneback
This function outputs a numpy array
flow = cv2.calcOpticalFlowFarneback(prevgray, gray, 0.5, 3, 15, 3, 5, 1.2, 0)
print flow.shape # prints (480,320,2)
So flow is a matrix with each entry a vector. I want a way to quantify this matrix, so I thought of using the L1 matrix norm (numpy.linalg.norm(flow, 1)), which throws an "improper dimensions to norm" error.
I'm thinking about getting around this by calculating the Euclidean norm of every vector and then taking the L1 norm of the matrix of those vector norms.
I'm having trouble iterating through the flow matrix efficiently. I have done it using two for loops by going first through columns and then rows, but it's way too slow.
r, c, d = flow.shape
flowprime = numpy.zeros((r, c), flow.dtype)
for i in range(0, r):
    for j in range(0, c):
        flowprime[i, j] = numpy.linalg.norm(flow[i, j], 2)
print(numpy.linalg.norm(flowprime, 1))
I had also tried using numpy.nditer but
for x in numpy.nditer(flow, op_flags=['readwrite']):
    print x
just prints a single value rather than a vector.
What would be the fastest way to iterate through a numpy matrix with vectors as entries, norm them and then take the L1 norm?
As of numpy version 1.9, norm takes an axis argument.
Aside from that, say what you ideally want, and almost surely you can ask numpy to do it. E.g., assuming no complex entries or missing values, the simplest case is np.sqrt((flow**2).sum()), or the case I think you describe is np.linalg.norm(np.sqrt((flow**2).sum(axis=-1)), 1).
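Concretely, with the axis argument it could look like this (a sketch; random data stands in for the real optical-flow output):

import numpy as np

# Stand-in for the flow array; the real one would come from cv2 with shape (480, 320, 2).
flow = np.random.randn(480, 320, 2).astype(np.float32)

# Euclidean norm of each 2-vector, computed along the last axis.
magnitudes = np.linalg.norm(flow, axis=2)  # shape (480, 320)

# Induced L1 norm of the resulting matrix (maximum absolute column sum),
# matching the numpy.linalg.norm(flowprime, 1) call in the question.
print(np.linalg.norm(magnitudes, 1))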

Uniform Random Numbers

I am trying to understand what this code does. I am going through some examples about numpy and plotting and I can't figure out what u and v are. I know u is an array of two arrays each with size 10000. What does v=u.max(axis=0) do? Is the max function being invoked part of the standard python library? When I plot the histogram I get a pdf defined by 2x as opposed to a normal uniform distribution.
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt
np.random.seed(123)
u=rand.uniform(0,1,[2,10000])
v=u.max(axis=0)
plt.figure()
plt.hist(v,100,normed=1,color='blue')
plt.ylim([0,2])
plt.show()
u.max(), or equivalently np.max(u), will give you the maximum value in the array - i.e. a single value. It's the Numpy function here, not part of the standard library. You often want to find the maximum value along a particular axis/dimension and that's what is happening here.
u has shape (2, 10000), and u.max(axis=0) gives you the max along the 0 axis, returning an array with shape (10000,). If you did u.max(axis=1) you would get an array with shape (2,).
Simple illustration/example:
>>> a = np.array([[1,2],[3,4]])
>>> a
array([[1, 2],
[3, 4]])
>>> a.max(axis=0)
array([3, 4])
>>> a.max(axis=1)
array([2, 4])
>>> a.max()
4
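As for the 2x shape of the histogram in the question: taking the maximum of two independent uniform variables changes the distribution. For 0 <= x <= 1, P(max(U1, U2) <= x) = P(U1 <= x) * P(U2 <= x) = x**2, so the density is the derivative 2x, which is exactly the ramp you see. A quick empirical check of the CDF:

import numpy as np

np.random.seed(123)
u = np.random.uniform(0, 1, (2, 10000))
v = u.max(axis=0)

# Empirical CDF at a few points vs. the theoretical x**2.
for x in (0.25, 0.5, 0.75):
    print(x, (v <= x).mean(), x ** 2)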
In the first three lines you load in different modules (libraries that are relied upon in the rest of the code): numpy, which is a numerical library; numpy.random, which does much of the work of creating random numbers; and matplotlib, which allows for plotting.
The rest is described here:
np.random.seed(123)
A computer does not really generate a random number; rather, it picks a number from a long list of numbers (for a more correct explanation of how this is done see http://en.wikipedia.org/wiki/Random_number_generation). In essence, if you want to reproduce the work with the same random numbers, the computer needs to know where in this list to start picking numbers. That is what this line does: if anybody else runs the same piece of code, they end up with the same 'random' numbers.
u=rand.uniform(0,1,[2,10000])
This generates two rows of 10000 random numbers between 0 and 1. It is a uniform distribution, so every point between 0 and 1 is equally likely. (Again, more information can be found here: http://en.wikipedia.org/wiki/Uniform_distribution_(continuous) ). You are creating two arrays within one array, which can be checked with len(u) and len(u[0]).
v=u.max(axis=0)
The u.max? command in IPython refers you to the docs. It basically selects a maximum, and the axis argument determines how the maximum is chosen. Try the following:
a = np.arange(4).reshape((2,2))
np.amax(a, axis=0) # gives array([2, 3])
np.amax(a, axis=1) # gives array([1, 3])
The rest of the code sets up the histogram plot. There are 100 bins in the histogram and the bars will be colored blue. The y-axis is limited to a maximum of 2, and normed=1 normalizes the histogram so that it shows a density (the total bar area is 1) rather than raw counts.
I can't quite make out what the true purpose or application of the code was, but this is in essence what it is doing.
