Minimize total distance between two sets of points in Python

Given two sets of points in n-dimensional space, how can one map points from one set to the other such that each point is only used once and the total Euclidean distance between the pairs of points is minimized?
For example,
import matplotlib.pyplot as plt
import numpy as np
# create six points in 2d space; the first three belong to set "A" and the
# second three belong to set "B"
x = [1, 2, 3, 1.8, 1.9, 3.4]
y = [2, 3, 1, 2.6, 3.4, 0.4]
colors = ['red'] * 3 + ['blue'] * 3
plt.scatter(x, y, c=colors)
plt.show()
So in the example above, the goal would be to map each red point to a blue point such that each blue point is only used once and the sum of the distances between points is minimized.
I came across this question which helps to solve the first part of the problem -- computing the distances between all pairs of points across sets using the scipy.spatial.distance.cdist() function.
From there, I could probably test every permutation of single elements from each row, and find the minimum.
The application I have in mind involves a fairly small number of datapoints in 3-dimensional space, so the brute force approach might be fine, but I thought I would check to see if anyone knows of a more efficient or elegant solution first.
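For concreteness, the brute-force approach I have in mind would look something like the sketch below; it enumerates all n! assignments, so it is only feasible for small sets:
from itertools import permutations
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1, 2], [2, 3], [3, 1]])              # red points from the example
B = np.array([[1.8, 2.6], [1.9, 3.4], [3.4, 0.4]])  # blue points
D = cdist(A, B)                                     # pairwise distance matrix
rows = np.arange(len(A))
# try every one-to-one assignment of red points to blue points
best = min(permutations(range(len(B))), key=lambda p: D[rows, list(p)].sum())
print(best, D[rows, list(best)].sum())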

An example of assigning (mapping) the elements of one set of points to the elements of another set of points, such that the sum of Euclidean distances is minimized.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

np.random.seed(100)
# a regular 7x7 grid of points and an equal number of random points
points1 = np.array([(x, y) for x in np.linspace(-1, 1, 7) for y in np.linspace(-1, 1, 7)])
N = points1.shape[0]
points2 = 2 * np.random.rand(N, 2) - 1
# cost matrix of pairwise Euclidean distances, then solve the assignment problem
C = cdist(points1, points2)
_, assignment = linear_sum_assignment(C)
plt.plot(points1[:, 0], points1[:, 1], 'bo', markersize=10)
plt.plot(points2[:, 0], points2[:, 1], 'rs', markersize=7)
for p in range(N):
    plt.plot([points1[p, 0], points2[assignment[p], 0]],
             [points1[p, 1], points2[assignment[p], 1]], 'k')
plt.xlim(-1.1, 1.1)
plt.ylim(-1.1, 1.1)
plt.gca().set_aspect('equal')
plt.show()

There's a known algorithm for this, the Hungarian method for the assignment problem, which runs in O(n³) time.
In SciPy, you can find an implementation in scipy.optimize.linear_sum_assignment
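For example, applied to the six points from the question (a minimal sketch; the rows of the cost matrix are the red points, and the returned column indices are the matched blue points):
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

red = np.array([[1, 2], [2, 3], [3, 1]])
blue = np.array([[1.8, 2.6], [1.9, 3.4], [3.4, 0.4]])
cost = cdist(red, blue)                         # pairwise Euclidean distances
row_ind, col_ind = linear_sum_assignment(cost)  # Hungarian method
print(col_ind, cost[row_ind, col_ind].sum())    # blue index per red point, total distance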

Related

Creating Integer Range from multiple string math expressions in python

I have a problem in a project and I've searched the internet high and low with no clear answer.
How can I convert math expressions
such as 3x + 5y >= 100
and x, y < 500
into a range of x and a range of y to be used as restrictions in a math problem,
e.g. f = x^2 + 4y?
The end result is to find the largest answer using genetic algorithms, where x and y are restricted in value.
I tried sympy and eval with no luck, and searched everywhere but found only a few helpful, not sufficient, resources.
I just need to translate the user input into code for the genetic algorithm to use.
Your set of linear inequalities defines a polygon in the plane.
The edges of this polygon are the lines you get by replacing the inequality sign with an equals sign in each inequality.
The vertices of this polygon are the intersections of two adjacent edges; equivalently, they are the intersections of two edges that satisfy the whole system of (non-strict) inequalities.
So, one way to find all the vertices of the polygon is to find every intersection point by solving every subsystem of two equalities, then filtering out the points that are outside of the polygon.
import numpy as np
from numpy.linalg import solve, LinAlgError
from itertools import combinations
import matplotlib.pyplot as plt

A = np.array([[-3, -5], [1, 0], [0, 1]])
b = np.array([-100, 500, 500])

# find polygon for system of linear inequalities
# expects input in "less than" form:
#     A X <= b
def get_polygon(A, b, tol=1e-5):
    polygon = []
    for subsystem in map(list, combinations(range(len(A)), 2)):
        try:
            polygon.append(solve(A[subsystem], b[subsystem]))  # solve subsystem of 2 equalities
        except LinAlgError:
            pass
    polygon = np.array(polygon)
    polygon = polygon[(polygon @ A.T <= b + tol).all(axis=1)]  # filter out points outside of polygon
    return polygon

polygon = get_polygon(A, b)
polygon = np.vstack((polygon, polygon[0]))  # duplicate first point to "close the loop" before plotting
plt.plot(polygon[:, 0], polygon[:, 1])
plt.show()
Note that get_polygon will find all the vertices of the polygon, but if there are more than 3, they might not come out in clockwise order.
If you want to sort the vertices in clockwise order before plotting the polygon, I refer you to this question:
How to sort a list of points in clockwise/anticlockwise in python?
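For the convex polygons produced here, a minimal sketch of that ordering (sorting the vertices by angle around their centroid; assumes the polygon array returned by get_polygon above):
center = polygon.mean(axis=0)
angles = np.arctan2(polygon[:, 1] - center[1], polygon[:, 0] - center[0])
polygon = polygon[np.argsort(angles)]  # counter-clockwise; reverse for clockwise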
Using @Stef's approach in SymPy would give the triangular region of interest like this:
>>> from sympy.abc import x, y
>>> from sympy import Eq, intersection, Triangle, Line
>>> eqs = Eq(3*x + 5*y, 100), Eq(x, 500), Eq(y, 500)
>>> Triangle(*intersection(*[Line(eq) for eq in eqs], pairwise=True))
Triangle(Point2D(-800, 500), Point2D(500, -280), Point2D(500, 500))
So x is in range [-800, 500] and y is in range [m, 500] where m is the y value calculated from the equation of the diagonal:
from sympy import solve
m = solve(eqs[0], y)[0]  # m as a function of x
def yval(xi):
    if xi < -800 or xi > 500:
        return
    return m.subs(x, xi)
yval(300)  # -> -160

Integration of a curve generated using matplotlib

I have generated a graph using a basic function:
plt.plot(tm, o1)
tm is a list of all x coordinates and o1 is a list of all y coordinates.
NOTE
There is no specific function such as y = f(x); rather, a certain y value remains constant for a given range of x (see figure for clarity).
My question is how to integrate this function, either using the matplotlib figure or using the lists (tm and o1).
The integral corresponds to the area under the curve.
The easiest way to compute (or approximate) the integral numerically is the rectangle rule, which approximates the area under the curve by summing the areas of rectangles (see https://en.wikipedia.org/wiki/Numerical_integration#Quadrature_rules_based_on_interpolating_functions).
In your case it is quite straightforward, since it is a step function.
First, I recommend using numpy arrays instead of lists (more handy for numerical computing):
import matplotlib.pyplot as plt
import numpy as np
x = np.array([0,1,3,4,6,7,8,11,13,15])
y = np.array([8,5,2,2,2,5,6,5,9,9])
plt.plot(x,y)
Then, we compute the widths of the rectangles using np.diff():
w = np.diff(x)
Then, the height of the same rectangles (multiple possibilities exist):
h = y[:-1]
Here I chose the first of each pair of successive y values, so the top-left corner of each rectangle lies on the curve (exact for a step function that holds its value until the next x). You could instead choose the mean of each two successive y values, h = (y[1:] + y[:-1]) / 2, in which case the middle of the top of each rectangle coincides with the curve.
Then, you multiply and sum:
area = (w*h).sum()
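Incidentally, if you go with the mean-value heights, NumPy implements the trapezoidal rule directly as np.trapz (renamed np.trapezoid in NumPy 2.0); assuming the x and y arrays above, this one-liner gives the same result:
area_trapz = np.trapz(y, x)  # equals ((y[1:] + y[:-1]) / 2 * np.diff(x)).sum()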

Resample points in a Geodataframe so they stay equally spaced but inside area

I have a geopandas dataframe that contains a complex area and some Point objects representing latitude and longitude inside that same area, both read from .kml and .xlsx files, not defined by me. The thing is that those points are really close to each other, and when plotting the whole map they overlap, making it really difficult to spot each one individually, especially when using geoplot.kdeplot().
So what I would like to do is find some way to space them out equally, respecting the boundaries of my area. For the sake of simplicity, consider the Polygon object as the area and the Point objects as the ones to resample:
import random
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon

y = np.array([[100, 100], [200, 100], [200, 200], [100, 200], [50.0, 150]])
p = Polygon(y)
points = []
for i in range(100):
    pt = Point(random.randint(50, 199), random.randint(100, 199))
    if p.contains(pt):
        points.append(pt)
xs = [point.x for point in points]
ys = [point.y for point in points]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x, y = p.exterior.xy
plt.plot(x, y)
ax1.scatter(xs, ys)
plt.show()
That gives something like this:
Any ideas on how to do it without crossing the area? Thank you in advance!
EDIT:
Important to mention that the resample should not be arbitrary, meaning that a point in coordinates (100,100), for example, should be somewhere near its original location.
This kind of problem can be handled in various ways, such as force-based or geometrical methods; here, a geometrical one is used.
I have considered the points to be circles, so an arbitrary diameter can be specified for them, and also the marker area for plotting with matplotlib. So we have the following code for the beginning:
import numpy as np
from scipy.spatial import cKDTree
from shapely.geometry.polygon import Polygon
import matplotlib.pyplot as plt
np.random.seed(85)
# Points' coordinates and the specified diameter
coords = np.array([[3, 4], [7, 8], [3, 3], [1, 8], [5, 4], [3, 5], [7, 7]], dtype=np.float64)
points_dia = 1.1  # can be chosen arbitrarily based on the problem
# Polygon creation
polygon_verts = np.array([[0, 0], [8.1, 0], [8.1, 8.1], [0, 8.1]])
p = Polygon(polygon_verts)
# Plotting
x, y = coords.T
colors = np.random.rand(coords.shape[0])
area = 700  # can be chosen arbitrarily based on the problem
min_ = coords.min() - 2*points_dia
max_ = coords.max() + 2*points_dia
plt.xlim(min_, max_)
plt.ylim(min_, max_)
xc, yc = p.exterior.xy
plt.plot(xc, yc)
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
I separate the answer into two steps:
dispersing (moving away) the suspected points from average position coordinates
curing the point-polygon overlaps
First step
For the first part, based on the number of points and neighbors, we can choose between SciPy and other libraries, e.g. scikit-learn (my earlier explanations on this topic, 1 (4th-5th paragraphs) and 2, can be helpful for such selections). Based on the (not large) sizes that the OP mentioned in the comments, I suggest using scipy.spatial.cKDTree, which in my experience has the best performance in this regard, to query the points. Using cKDTree.query_pairs we identify groups of points that are within the diameter of each other. Then, looping over the groups and averaging each group's coordinates, we can again use cKDTree.query to find the nearest point (k=1) of that group to the averaged coordinate. Now it is easy to compute the distances of the other points of that group to that nearest point and move them outward by a specified distance (here, just as much as the overlaps):
nears = cKDTree(coords).query_pairs(points_dia, output_type='ndarray')
near_ids = np.unique(nears)
col1_ids = np.unique(nears[:, 0])
for i in col1_ids:
    pts_ids = np.unique(nears[nears[:, 0] == i].ravel())
    center_coord = np.average(coords[pts_ids], axis=0)
    nearest_center_id = pts_ids[cKDTree(coords[pts_ids]).query(center_coord, k=1)[1]]
    furthest_center_ids = pts_ids[pts_ids != nearest_center_id]
    vecs = coords[furthest_center_ids] - coords[nearest_center_id]
    dists = np.linalg.norm(vecs, axis=1) - points_dia
    # normalize each vector and push the point outward by the overlap amount
    coords[furthest_center_ids] += vecs / np.linalg.norm(vecs, axis=1)[:, None] * np.abs(dists)[:, None]
Second step
For this part, we can loop over the modified coordinates (from the previous step), find the two nearest polygon vertices (k=2), and project the point onto the edge between them to find the closest coordinate on that line. So, as in step 1, we compute overlaps and move the points to place them inside the polygon. Points that were single (not in groups) are moved shorter distances. These distances can be specified as desired; I have set them to default values:
for i, j in enumerate(coords):
    for k in range(polygon_verts.shape[0] - 1):
        nearest_poly_ids = cKDTree(polygon_verts).query(j, k=2)[1]
        vec_line = polygon_verts[nearest_poly_ids][1] - polygon_verts[nearest_poly_ids][0]
        vec_proj = j - polygon_verts[nearest_poly_ids][0]
        # project the point onto the nearest edge of the polygon
        line_pnt = polygon_verts[nearest_poly_ids][0] + np.dot(vec_line, vec_proj) / np.dot(vec_line, vec_line) * vec_line
        vec = j - line_pnt
        dist = np.linalg.norm(vec)
        if dist < points_dia / 2:
            if i in near_ids:
                coords[i] = coords[i] + vec / dist * 2 * points_dia
            else:
                coords[i] = coords[i] + vec / dist
This code could perhaps be optimized for better performance, but that is not important for this question. I have checked it on my small example; it needs to be checked on larger cases. In all conditions, I think this will be one of the best ways to achieve the goal (perhaps with some changes based on the need) and can serve as a template to be modified for other needs or shortcomings.
So this is not particularly fast due to the loop checking for "point in polygon", but as long as you don't intend to have a very fine grid spacing, something like this would do the job:
from scipy.interpolate import NearestNDInterpolator

# define a regular grid
gx, gy = np.meshgrid(np.linspace(50, 200, 100),
                     np.linspace(100, 200, 100))
# get an interpolation function
# (just used "xs" as values in here...)
i = NearestNDInterpolator(np.column_stack((xs, ys)), xs)
# crop your grid to the shape
newp = []
for x, y in zip(gx.flat, gy.flat):
    pt = Point(x, y)
    if p.contains(pt):
        newp.append(pt.xy)
newp = np.array(newp).squeeze()
# evaluate values on the grid
vals = i(*newp.T)
ax1.scatter(*newp.T, c=vals, zorder=0)

Find the distance traveled from (x,y) coordinates

I currently have a python script that reads in a 3 column text file containing x and y coordinates for a walker and the time they have been walking.
I have read in this data and allocated it in numpy arrays as shown in the code below:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("info.txt", delimiter = ',')
x = data[:,0]
y = data[:,1]
t = data[:,2]
The file has the following format (x, y, t):
5907364.2371 -447070.881709 2193094
5907338.306978 -447058.019176 2193116
5907317.260891 -447042.192668 2193130
I now want to find the distance traveled as a function of time by the walker. One way I can think of doing this is by summing the differences in x coordinates and the differences in y coordinates in a loop. This seems a very long-winded method, however, and I think it could be solved with some kind of numerical integration. Does anyone have any ideas of what I could do?
To compute the distance "along the way", you must first obtain the distance of each step.
This can be obtained, component-wise, by the indexing dx = x[1:] - x[:-1]. The distance per step is then the square root of dx**2 + dy**2. Note that this array is shorter by one element, as there is one fewer interval than there are points; this is completed by assigning a distance of "0" to the first time value, which is the role of the "concatenate" line below.
There is no numerical integration here, just a cumulative sum. To perform numerical integration, you would need equations of motion (for instance).
Extra change: I use np.loadtxt with the unpack=True argument to save a few lines.
import numpy as np
import matplotlib.pyplot as plt
x, y, t = np.loadtxt("info.txt", unpack=True)
dx = x[1:]-x[:-1]
dy = y[1:]-y[:-1]
step_size = np.sqrt(dx**2+dy**2)
cumulative_distance = np.concatenate(([0], np.cumsum(step_size)))
plt.plot(t, cumulative_distance)
plt.show()
There are several ways to obtain the Euclidean distance between two points:
NumPy:
import numpy as np
dist = np.linalg.norm(x - y)
dist1 = np.sqrt(np.sum((x - y)**2))
Scipy:
from scipy.spatial import distance
dist = distance.euclidean(x,y)
Typically, to get the distance walked, you sum up the smaller step distances. Your walker probably isn't walking on a grid (that is, a step in x and then a step in y) but rather in diagonals (think Pythagorean theorem).
So, in Python it might look like this...
distanceWalked = 0
# sum the straight-line length of each step between consecutive points
for prev, cur in zip(listOfPoints, listOfPoints[1:]):
    distanceWalked += ((cur[0] - prev[0])**2 + (cur[1] - prev[1])**2)**.5
where listOfPoints is something like [[0,0],[0,1],[0,2],[1,2],[2,2]]
Alternatively, you can use pandas.
import pandas as pd
# whitespace-separated, as in the sample shown; no header row
df = pd.read_csv('info.txt', sep=r'\s+', names=['x', 'y', 't'])
# step lengths from consecutive coordinate differences
df['helpercol'] = (df['x'].diff()**2 + df['y'].diff()**2)**.5
df['cumDist'] = df['helpercol'].cumsum()
Now you'll have the cumulative distance per time in your dataframe.

Identifying points with the smallest Euclidean distance

I have a collection of n-dimensional points and I want to find which 2 are the closest. The best I could come up with for 2 dimensions is:
import numpy as np

myArr = np.array([[1, 2],
                  [3, 4],
                  [5, 6],
                  [7, 8]])
n = myArr.shape[0]
# squared distance of each pair, plus the indices of the pair
cross = [[np.sum((myArr[i] - myArr[j])**2), i, j]
         for i in range(n)
         for j in range(n)
         if i != j]
print(min(cross))
which gives
[8, 0, 1]
But this is too slow for large arrays. What kind of optimisation can I apply to it?
RELATED:
Euclidean distance between points in two different Numpy arrays, not within
Try scipy.spatial.distance.pdist(myArr). This will give you a condensed distance matrix. You can use argmin on it to find the index of the smallest value, and this index can be converted back into the pair information.
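A minimal sketch of that conversion (pdist's condensed vector follows the same ordering as np.triu_indices):
import numpy as np
from scipy.spatial.distance import pdist

myArr = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
d = pdist(myArr)                         # condensed distances, length n*(n-1)/2
k = np.argmin(d)                         # position of the smallest distance
i, j = np.triu_indices(len(myArr), k=1)  # pair indices matching pdist's ordering
print(d[k], (i[k], j[k]))                # -> 2.828..., (0, 1)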
There's a whole Wikipedia page on just this problem, see: http://en.wikipedia.org/wiki/Closest_pair_of_points
Executive summary: you can achieve O(n log n) with a recursive divide and conquer algorithm (outlined on the Wiki page, above).
You could take advantage of the latest version of SciPy's (v0.9) Delaunay triangulation tools. You can be sure that the closest two points will be an edge of a simplex in the triangulation, which is a much smaller subset of pairs than doing every combination.
Here's the code (updated for general N-D):
import numpy
from scipy import spatial

def closest_pts(pts):
    # set up the triangulation
    # let Delaunay do the heavy lifting
    mesh = spatial.Delaunay(pts)
    # TODO: eliminate redundant edges (numpy.unique?)
    # (mesh.simplices was called mesh.vertices in older SciPy;
    #  dim is the global defined below)
    edges = numpy.vstack((mesh.simplices[:, :dim], mesh.simplices[:, -dim:]))
    # the rest is easy
    x = mesh.points[edges[:, 0]]
    y = mesh.points[edges[:, 1]]
    dists = numpy.sum((x - y)**2, 1)
    idx = numpy.argmin(dists)
    # print('distance: ', dists[idx])
    return edges[idx]

dim = 3
N = 1000 * dim
pts = numpy.random.random(N).reshape(N // dim, dim)
Empirically, the runtime seems close to O(n):
There is a scipy function pdist that will get you the pairwise distances between points in an array in a fairly efficient manner:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
that outputs the N*(N-1)/2 unique pairs (since r_ij == r_ji). You can then search for the minimum value and avoid the whole loop mess in your code.
Perhaps you could proceed along these lines:
In []: from numpy import arange, where
In []: from numpy.random import randn
In []: from scipy.spatial.distance import pdist as pd, squareform as sf
In []: m = 1234
In []: n = 123
In []: p = randn(m, n)
In []: d = sf(pd(p))
In []: a = arange(m)
In []: d[a, a] = d.max()  # mask the zero diagonal before taking the min
In []: where(d < d.min() + 1e-9)
Out[]: (array([701, 730]), array([730, 701]))
With substantially more points you need to be able to somehow utilize the hierarchical structure of your clustering.
How fast is it compared to just doing a nested loop and keeping track of the shortest pair? I think creating a huge cross array is what might be hurting you. Even O(n^2) is still pretty quick if you're only doing 2 dimensional points.
The accepted answer is OK for small datasets, but its execution time scales as n**2. However, as pointed out by @payne, an optimal solution can achieve n*log(n) computation time scaling.
This optimal solution can be obtained using sklearn.neighbors.BallTree as follows.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import BallTree as tree
n = 10
dim = 2
xy = np.random.uniform(size=[n, dim])
# This solution is optimal when xy is very large
res = tree(xy)
dist, ids = res.query(xy, 2)
mindist = dist[:, 1] # second nearest neighbour
minid = np.argmin(mindist)
plt.plot(*xy.T, 'o')
plt.plot(*xy[ids[minid]].T, '-o')
This procedure scales well for very large sets of xy values and even for large dimensions dim (although the example illustrates the case dim=2). The resulting output looks like this
An identical solution can be obtained using scipy.spatial.cKDTree, by replacing the sklearn import with the SciPy one below. Note, however, that unlike BallTree, cKDTree does not scale well for high dimensions
from scipy.spatial import cKDTree as tree
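Since cKDTree.query returns (distances, indices) in the same order as BallTree.query, the rest of the snippet above runs unchanged after the swap:
res = tree(xy)
dist, ids = res.query(xy, 2)  # same (distances, indices) return convention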
