I'm working on a data science project for my Intro to Data Science class, and we've decided to tackle a problem relating to desalination plants in California: "Where should we place k plants to minimize the distance to zip codes?"
The data we have so far includes: zip, city, county, pop, lat, long, and amount of water.
The issue is that I can't find any resources on how to constrain the centroids to stay on the coast. What we've thought of so far is:
- Use a normal k-means algorithm, but move each centroid to the coast once the clusters have settled (bad).
- Use a normal weighted k-means algorithm, giving the coastal zips infinite weight (I've been told this isn't a great solution).
What do you guys think?
K-means does not minimize distances.
It minimizes squared errors, which is quite different.
The difference is roughly that between the median and the mean in one-dimensional data, and the error can be massive.
Here is a counterexample, assuming we have the coordinates:
-1 0
+1 0
0 -1
0 101
The center chosen by k-means would be (0, 25). The optimal location is (0, 0).
The sum of distances using the k-means center is > 152, while the optimal location has a total distance of 104. So here, the centroid is almost 50% worse than the optimum! But the centroid (i.e., the multivariate mean) is what k-means uses!
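To check those numbers quickly (a small NumPy sketch; the location (0, 0) is taken from the text above, not computed):

import numpy as np

pts = np.array([[-1, 0], [1, 0], [0, -1], [0, 101]])   # the counterexample above

centroid = pts.mean(axis=0)        # what k-means uses: (0, 25)
optimum = np.array([0, 0])         # the better location claimed above

print(np.linalg.norm(pts - centroid, axis=1).sum())    # ~152.05
print(np.linalg.norm(pts - optimum, axis=1).sum())     # 104.0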
k-means does not minimize the Euclidean distance!
This is one way in which "k-means is sensitive to outliers".
It does not get better if you try to constrain it to place "centers" on the coast only...
Also, you should at least use the Haversine distance, because in California 1 degree north != 1 degree east; you are not at the Equator.
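For example, a minimal haversine sketch (my own; it assumes a spherical Earth of radius 6371 km and coordinates in decimal degrees) that could stand in for a plain Euclidean norm on lat/long pairs:

import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi/2)**2 + math.cos(phi1)*math.cos(phi2)*math.sin(dlmb/2)**2
    return 2 * radius_km * math.asin(math.sqrt(a))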
Furthermore, you likely should not assume that every location requires its own pipe; rather, the locations will be connected like a tree. This greatly reduces the cost.
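As a rough illustration of the tree idea (a sketch only; it assumes projected x/y coordinates in an (n, 2) array and uses a minimum spanning tree as a stand-in for an actual pipe-network design):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

coords = np.random.rand(10, 2)          # placeholder positions (plant plus served locations)
pairwise = squareform(pdist(coords))    # full distance matrix
mst = minimum_spanning_tree(pairwise)   # sparse matrix holding the tree edges
print("total pipe length:", mst.sum())  # never more than piping every location to a single point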
I strongly suggest treating this as a generic optimization problem rather than k-means. K-means is an optimization too, but it may optimize the wrong function for your problem...
I would approach this by defining the set of possible points that could be centers, i.e. your coastline.
I think this is close to Nathaniel Saul's first comment.
This way, at each iteration, instead of computing a mean, a point from the possible set is chosen based on its proximity to the cluster.
I've simplified the conditions to only two data columns (longitude and latitude), but you should be able to extrapolate the concept. To keep the demonstration simple, I based this on code from here.
In this example, the purple dots are points on the coastline. If I understood correctly, the optimal coastline locations should look something like this:
See code below:
#! /usr/bin/python3.6
# Code based on:
# https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
import matplotlib.pyplot as plt
import numpy as np
import random

##### Simulation START #####
# Generate possible points.
def possible_points(n=20):
    y = list(np.linspace(-1, 1, n))
    x = [-1.2]
    X = []
    for i in list(range(1, n)):
        x.append(x[i-1] + random.uniform(-2/n, 2/n))
    for a, b in zip(x, y):
        X.append(np.array([a, b]))
    X = np.array(X)
    return X

# Generate sample
def init_board_gauss(N, k):
    n = float(N)/k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
##### Simulation END #####

# Identify points for each center.
def cluster_points(X, mu):
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

# Get closest possible point for each cluster.
def closest_point(cluster, possiblePoints):
    closestPoints = []
    # Compute the total distance from each possible point to the cluster.
    for possible in possiblePoints:
        distances = []
        for point in cluster:
            distances.append(np.linalg.norm(possible - point))
        closestPoints.append(np.sum(distances))  # minimize total distance
        # closestPoints.append(np.mean(distances))
    return possiblePoints[closestPoints.index(min(closestPoints))]

# Calculate new centers.
# Here the 'coast constraint' goes.
def reevaluate_centers(clusters, possiblePoints):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(closest_point(clusters[k], possiblePoints))
    return newmu

# Check whether centers converged.
def has_converged(mu, oldmu):
    return set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])

# Meta function that runs the steps of the process in sequence.
def find_centers(X, K, possiblePoints):
    # Initialize to K random centers
    oldmu = random.sample(list(possiblePoints), K)
    mu = random.sample(list(possiblePoints), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Re-evaluate centers
        mu = reevaluate_centers(clusters, possiblePoints)
    return (mu, clusters)

K = 3
X = init_board_gauss(30, K)
possiblePoints = possible_points()
results = find_centers(X, K, possiblePoints)

# Show results
# Show constraints and clusters
# List point types
pointtypes1 = ["gx", "gD", "g*"]
plt.plot(
    np.matrix(possiblePoints).transpose()[0], np.matrix(possiblePoints).transpose()[1], 'm.'
)
for i in list(range(0, len(results[0]))):
    plt.plot(
        np.matrix(results[0][i]).transpose()[0], np.matrix(results[0][i]).transpose()[1], pointtypes1[i]
    )
pointtypes = ["bx", "yD", "c*"]
# Show all cluster points
for i in list(range(0, len(results[1]))):
    plt.plot(
        np.matrix(results[1][i]).transpose()[0], np.matrix(results[1][i]).transpose()[1], pointtypes[i]
    )
plt.show()
Edited to minimize total distance.
I am trying to optimize code for a challenge on Codewars, but I am having some trouble making it fast enough. The kata in question is this. The description is explained in detail in the link I have given, and I do not reproduce it here so as not to bloat the question. My attempted code is the following (in Python):
# first I define the Manhattan distance between any two given points
def manhattan(p1, p2):
    return abs(p1[0]-p2[0]) + abs(p1[1]-p2[1])

# I write a function that takes the minimum of the list of the distances from a given point to all agents
def distance(p, agents):
    return min(manhattan(p, i) for i in agents)

# and the main function
def advice(agents, n):
    # there could be agents outside of the range of the grid, so I redefine agents,
    # cropping the agents that may be outside
    agents = [i for i in agents if i[0] < n and i[1] < n]
    # in case there is no grid or there are agents everywhere
    if n == 0 or len(agents) == n*n:
        return []
    # if there is no agent I output the whole matrix
    if agents == []:
        return [(i, j) for i in range(n) for j in range(n)]
    # THIS FOLLOWING PART IS THE PART THAT I THINK I NEED TO OPTIMIZE
    # I define a dictionary keyed by point coordinates, with the output of distance()
    # as the value, then make a list of the entries with the highest value. This ends the script.
    dct = dict(((i, j), distance([i, j], agents)) for i in range(n) for j in range(n))
    # highest distance to an agent
    mxm = max(dct.values())
    return [nn for nn, mm in dct.items() if mm == mxm]
The part I think I need to improve is marked in the code above. Any ideas on how to make this faster?
Your algorithm is doing a brute force check of all positions against all agents. If you want to accelerate the process, you have to use a strategy that allows you to skip large parts of these n x n x a combinations.
For example, you could group the points by distance in a dictionary starting with all the points at the largest possible distance. Then, go through the agents and redistribute only the farthest points to closer distances. Repeat this process until the farthest distance retains at least one point after going through all agents.
This would not eliminate all the distance checks but it skips a lot of distance computations for points that are already known to be closer to an agent than the farthest ones.
Here's an example:
def myAdvice2(agents, n):
    maxDist = 2*n
    distPoints = {maxDist: [(x, y) for x in range(n) for y in range(n)]}
    index = 0
    retained = 0
    lastMax = None
    # keep refining until farthest distance is kept for all agents
    while retained < len(agents):
        ax, ay = agents[index]
        index = (index + 1) % len(agents)
        # redistribute points at farthest distance for this agent
        points = distPoints.pop(maxDist)
        for x, y in points:
            distance = min(maxDist, abs(x-ax) + abs(y-ay))
            if distance not in distPoints:
                distPoints[distance] = []
            distPoints[distance].append((x, y))
        maxDist = max(distPoints)
        # count retained agents at maxDist
        retained += 1
        # reset count when maxDist is reduced
        if lastMax and maxDist < lastMax:
            retained = 0
        lastMax = maxDist
    # once all agents have been applied to the largest distance, we have a solution
    return distPoints[maxDist]
In my performance tests this is about 3x faster and performs better on larger grids.
[EDIT] This algorithm can be further accelerated by sorting the agents based on their distance to the center. This will ensure that points at the farthest distances are eliminated faster because agents at central positions cover more ground.
You can add this sort at the beginning of the function:
agents = sorted(agents, key=lambda p: abs(p[0]-n//2)+abs(p[1]-n//2))
This will yield a 10% to 15% improvement in speed, depending on the number of agents, their positions, and the size of the area. Further optimisations could determine when this sort is worthwhile based on the data.
[EDIT2] If you're going to use a brute force approach, leveraging parallel processing (using the GPU) could give surprisingly fast results.
This example, using numpy, is 3 to 10 times faster than my original approach. I'm guessing that this will only be the case up to a point (having more agents slows it down considerably) but it was much faster in all small scale tests I did.
import numpy as np

def myAdvice4(agents, n):
    ax, ay = [np.array(c)[:, None, None] for c in zip(*agents)]
    px = np.repeat(np.arange(n)[None, :], n, axis=0)
    py = np.repeat(np.arange(n)[:, None], n, axis=1)
    agentDist = abs(ax - px) + abs(ay - py)
    minDist = np.min(agentDist, axis=0)
    maxDist = np.max(minDist)
    maxCoord = zip(*np.where(minDist == maxDist))
    return list(maxCoord)
You can use sklearn.metrics.pairwise_distances with the manhattan metric.
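For example, a rough sketch of that idea (advice_vectorized is my name for it; I assume the same cropping and edge cases as the advice function above):

import numpy as np
from sklearn.metrics import pairwise_distances

def advice_vectorized(agents, n):
    agents = [a for a in agents if a[0] < n and a[1] < n]
    if n == 0 or len(agents) == n * n:
        return []
    grid = [(i, j) for i in range(n) for j in range(n)]
    if not agents:
        return grid
    # all cell-to-agent Manhattan distances in one call, shape (n*n, n_agents)
    dist = pairwise_distances(np.array(grid), np.array(agents), metric='manhattan')
    nearest = dist.min(axis=1)     # distance from each cell to its nearest agent
    return [grid[i] for i in np.where(nearest == nearest.max())[0]]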
I am trying to find a point (latitude/longitude) that minimizes the sum of Google maps distance to all other N points.
I was able to extract the Google Maps distances between my latitude and longitude arrays but I wasn't able to minimize my function.
Code
import googlemaps
import numpy as np
from scipy.optimize import minimize

def minimize_g(input_g):
    gmaps1 = googlemaps.Client(key="xxx")

    def distance_f(x):
        dist = gmaps1.distance_matrix([x], np.array(input_g)[:, 1:3])
        sum_ = 0
        for obs in range(len(np.array(df[:3]))):
            sum_ += dist['rows'][0]['elements'][obs]['distance']['value']
        return sum_

    # initial guess: centroid
    centroid = input_g.mean(axis=0)
    optimization = minimize(distance_f, centroid, method='COBYLA')
    return optimization.x
Thanks!
If you are looking for any point on the map that results in the shortest total distance to all coordinates in your list, you can try writing a function that calculates the distance from one coordinate to another. Once you have that function ready to go, it's a matter of calculating the total distance to all your points from a test point.
Then, from some artificially created coordinates, you would minimize the distances to all your points with something along the lines of
import numpy as np

lats = [12.3, 12.4, 12.5]
lons = [16.1, 15.1, 14.1]

def total_distance_to_lats_and_lons(lat, lon):
    # some summation over distances from lat, lon to lats, lons
    ...

# create two lists with 0.01 degree precision as an artificial grid of possibilities
test_lats = np.arange(min(lats), max(lats), 0.01)
test_lons = np.arange(min(lons), max(lons), 0.01)

test_distances = []             # empty list to fill with the total_distance to each combination of test_lat, test_lon
coordinate_combinations = []    # corresponding coordinates

for test_lat in test_lats:
    for test_lon in test_lons:
        coordinate_combinations.append([test_lat, test_lon])  # add a combination of coordinates
        test_distances.append(total_distance_to_lats_and_lons(test_lat, test_lon))  # add a distance

index_of_best_test_coordinate = np.argmin(test_distances)  # find index of the minimum value

print('Best match is index {}'.format(index_of_best_test_coordinate))
print('Coordinates: {}'.format(coordinate_combinations[index_of_best_test_coordinate]))
print('Total distance: {}'.format(test_distances[index_of_best_test_coordinate]))
This brute force approach has some precision limitations and becomes an expensive loop quite quickly, so you can also apply the method iteratively: after each round, keep the best match found so far, increase the precision, and narrow the start and end points of the test coordinate lists around it. After a few iterations, you should have a fairly precise estimate. On the other hand, it is possible such an iterative method converges to one of multiple local minima, yielding only one of several solutions.
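A rough sketch of that refinement loop (my own illustration; it assumes total_distance_to_lats_and_lons from the snippet above has been filled in, and it zooms in around the best match for a fixed number of rounds):

import numpy as np

def refine(lat_lo, lat_hi, lon_lo, lon_hi, rounds=3, steps=20):
    for _ in range(rounds):
        test_lats = np.linspace(lat_lo, lat_hi, steps)
        test_lons = np.linspace(lon_lo, lon_hi, steps)
        grid = [(la, lo) for la in test_lats for lo in test_lons]
        dists = [total_distance_to_lats_and_lons(la, lo) for la, lo in grid]
        best_lat, best_lon = grid[int(np.argmin(dists))]
        # zoom into a window around the current best match for the next round
        dlat = (lat_hi - lat_lo) / steps
        dlon = (lon_hi - lon_lo) / steps
        lat_lo, lat_hi = best_lat - dlat, best_lat + dlat
        lon_lo, lon_hi = best_lon - dlon, best_lon + dlon
    return best_lat, best_lon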
Problem
I need to identify a way to find 2-mile clusters of points where each point has a value, i.e. to identify 2-mile areas which have a sum(value) > 50.
Data
I have data that looks like the following
ID COUNT LATITUDE LONGITUDE
187601546 20 025.56394 -080.03206
187601547 25 025.56394 -080.03206
187601548 4 025.56394 -080.03206
187601550 0 025.56298 -080.03285
Roughly 200K records. What I need to determine is whether there are any one-mile-radius (2-mile-diameter) areas where the sum of the count exceeds 65.
Using each point as a center for an area
Now, I have python code from another project that will draw a shapefile around a point of x diameter as follows:
def poly_based_on_distance(center_lat, center_long, distance, bearing):
    # bearing is in degrees
    # distance in miles
    # print('center', center_lat, center_long)
    destination = (vincenty(miles=distance).destination(
        Point(center_lat, center_long), bearing).format_decimal())
And a routine to return destination and then see which points are inside the radius.
## This is the evaluation for overlap between points and
## area polyshapes
area_list = []
store_geo_dict = {}
for stores in locationdict:
    location = Polygon(locationdict[stores])
    for areas in AREAdictionary:
        area = Polygon(AREAdictionary[areas])
        if location.intersects(area):
            area_list.append(areas)
    store_geo_dict[stores] = area_list
    area_list = []
At this point, I am simply drawing a circular shapefile around each of the 200K points, seeing which others fall inside, and doing the count.
Need Clustering Algorithm?
However, there might be an area with the required count density where one of the points is not in the center.
I'm familiar with clustering algorithms such as DBSCAN that use attributes for classification, but this is a matter of finding density clusters using a value for each point. Is there any clustering algorithm that can find a 2-mile-diameter circle where the inside count is >= 50?
Any suggestions? Python or R are the preferred tools, but this is wide-open and probably a one-off, so computational efficiency is not a priority.
Not a complete solution, but maybe it will help simplify the problem, depending on the distribution of your data. I will use planar coordinates and cKDTree in my example; this might work with geographic data if you can ignore curvature in a projection.
The main observation is the following: a point (x,y) does not contribute to a dense cluster if the total weight in a ball of radius 2*r (e.g. 2 miles) around (x,y) is less than the cutoff value (e.g. 50 in your title). In fact, any point within r of (x,y) does not contribute to any dense cluster.
This allows you to repeatedly discard points from consideration. If you are left with no points, there are no dense clusters; if you are left with some points, clusters may exist.
import numpy as np
from scipy.spatial import cKDTree

# test data
N = 1000
data = np.random.rand(N, 2)
x, y = data.T

# test weights of each point
weights = np.random.rand(N)

def filter_noncontrib(pts, weights, radius=0.1, cutoff=60):
    tree = cKDTree(pts)
    contribs = np.array(
        [weights[tree.query_ball_point(pt, 2 * radius)].sum() for pt in pts]
    )
    return contribs >= cutoff

def possible_contributors(pts, weights, radius=0.1, cutoff=60):
    n_pts = len(pts)
    while len(pts):
        mask = filter_noncontrib(pts, weights, radius, cutoff)
        pts = pts[mask]
        weights = weights[mask]
        if len(pts) == n_pts:
            break
        n_pts = len(pts)
    return pts
Example with dummy data:
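A minimal usage sketch on the data and weights generated above (my addition; results vary with the random draw):

# keep only the points that could still belong to a cluster of radius 0.1 with weight >= 60
candidates = possible_contributors(data, weights, radius=0.1, cutoff=60)
print(len(data), "points reduced to", len(candidates), "candidate points")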
DBSCAN can be adapted (see Generalized DBSCAN; define core points as weight sum >= 50), but it will not ensure the maximum cluster size (it computes transitive closures).
You could also try complete linkage. Use it to find clusters with the desired maximum diameter, then check whether these satisfy the desired density. But that is not guaranteed to find all of them.
It's probably faster to (a) build an index for fast radius search and (b) for every point, find the neighbors within radius r and keep the point if they have the desired minimum sum. But that is not guaranteed to find everything either, because the center is not necessarily a data point. Consider a maximum radius of 1 and a minimum weight of 100, with two points of weight 50 each at (0,0) and (1,1). Neither a query at (0,0) nor one at (1,1) will discover the solution, but a cluster centered at (0.5, 0.5) satisfies the conditions.
Unfortunately, I believe your problem is at least NP-hard, so you won't be able to afford the ultimate solution.
Imagine you are given a set S of n points in 3 dimensions. The distance between any 2 points is the simple Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other; in other words, there is no other subset Q' of k points whose minimum pairwise distance is greater than that of Q.
If n is approximately 16 million and k is about 300, how do we efficiently do this?
My guess is that this is NP-hard, so maybe we just want to focus on approximation. One idea I can think of is using multidimensional scaling to sort these points along a line and then using a version of binary search to get the points that are furthest apart on this line.
This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991), and Ravi et al. (1994) give a factor-2 approximation for the problem, with the latter proving that this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i, j])
while size(P) < p:
    Find a node v in V - P such that min_{v' in P} dist(v, v') is maximum.
    (That is: find the node with the greatest minimum distance to the set P.)
    P = P.union({v})
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # Random matrix
d = (d + d.T) / 2          # Make the matrix symmetric

print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
    for j in range(i + 1, N):
        if d[i, j] > maxdist:
            maxdist = d[i, j]
            bestpair = (i, j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    best_mindist = 0
    vbest = None
    for v in range(N):
        if v in P:
            continue
        # distance of v to the set P = distance to its nearest member of P
        mindist = min(d[v, vprime] for vprime in P)
        if mindist > best_mindist:
            best_mindist = mindist
            vbest = v
    P.add(vbest)

print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400 after 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # Random matrix
d = (d + d.T) / 2          # Make the matrix symmetric

m = grb.Model(name="MIP Model")

used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]

objective = grb.quicksum(d[i, j] * used[i] * used[j]
                         for i in range(0, N) for j in range(i + 1, N))

m.addConstr(
    lhs=grb.quicksum(used),
    sense=grb.GRB.EQUAL,
    rhs=p
)

# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)

# m.Params.TimeLimit = 3*60

ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

p = 300
N = 16000000

# Find a convex hull in O(N log N)
points = np.random.rand(N, 3)   # N random points in 3-D

# Returned 420 points in testing
hull = ConvexHull(points)

# Extract the points forming the hull
hullpoints = points[hull.vertices, :]

# Naive way of finding the best pair in O(H^2) time if H is number of points on hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')

# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)

P = np.array([hullpoints[bestpair[0]], hullpoints[bestpair[1]]])

# Now we have a problem
print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    distance_to_P = cdist(points, P)
    minimum_to_each_of_P = np.min(distance_to_P, axis=1)
    best_new_point_idx = np.argmax(minimum_to_each_of_P)
    best_new_point = np.expand_dims(points[best_new_point_idx, :], 0)
    P = np.append(P, best_new_point, axis=0)

print(P)
I am also pretty sure that the problem is NP-hard; the most similar problem I found is the k-Center Problem. If runtime is more important than correctness, a greedy algorithm is probably your best choice:
Q = {}
while |Q| < k:
    Q += p from S where mindist(p, Q) is maximal
Side note: In similar problems, e.g. the set-cover problem, it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up, I see three possibilities:
1. Index your dataset in an R-tree first, then perform a greedy search. Construction of the R-tree is O(n log n), and although it was developed for nearest neighbor search, it can also help you find the furthest point to a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
2. Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximating anyway, so you might be able to spare a little more accuracy. You can also combine this with the first approach.
3. Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (let's call this set Q'). Then in each step you switch the point p_ from Q' that has the minimum distance to another point in Q' with a random point from S. If the resulting set Q'' is better, proceed with Q''; otherwise repeat with Q'. In order not to get stuck, you might want to choose a point from Q' other than p_ if you cannot find an adequate replacement for a couple of iterations. A rough sketch of this swap idea is given after this list.
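Here is that sketch of the third (swap) idea, under assumptions I've added myself: S is an (n, 3) NumPy array, the quality of a set is its minimum pairwise distance, there is a fixed iteration budget, and the fallback of swapping a point other than p_ when stuck is left out. The name swap_heuristic is mine.

import random
import numpy as np
from scipy.spatial.distance import cdist

def swap_heuristic(S, k, max_iters=1000, seed=0):
    rng = random.Random(seed)
    idx = rng.sample(range(len(S)), k)          # Q': k random indices into S

    def min_pairwise(ix):
        d = cdist(S[ix], S[ix])
        np.fill_diagonal(d, np.inf)
        return d.min()                          # quality = smallest pairwise distance

    best = min_pairwise(idx)
    for _ in range(max_iters):
        d = cdist(S[idx], S[idx])
        np.fill_diagonal(d, np.inf)
        worst = int(np.argmin(d.min(axis=1)))   # member closest to another member (p_)
        replacement = rng.randrange(len(S))
        if replacement in idx:
            continue
        candidate = list(idx)
        candidate[worst] = replacement          # Q''
        score = min_pairwise(candidate)
        if score > best:                        # keep the swap only if Q'' is better
            idx, best = candidate, score
    return idx                                  # indices of the chosen points

# usage: chosen = S[swap_heuristic(S, k=300)]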
If you can afford to do ~ k*n distance calculations, then you could:
1. Find the center of the distribution of points.
2. Select the point furthest from the center (and remove it from the set of un-selected points).
3. Find the point furthest from all the currently selected points and select it.
4. Repeat 3. until you end with k points.
Find the maximum extent of all points. Split into 7x7x7 voxels. For all points in a voxel, find the point closest to its centre. Return these (at most) 7x7x7 points. Some voxels may contain no points, hopefully not too many.
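A rough sketch of this voxel idea (my own illustration; it assumes points is an (N, 3) NumPy array and a 7x7x7 grid):

import numpy as np

def voxel_representatives(points, bins=7):
    lo, hi = points.min(axis=0), points.max(axis=0)
    # assign each point to a voxel index in [0, bins)^3
    idx = np.clip(np.floor((points - lo) / (hi - lo + 1e-12) * bins).astype(int), 0, bins - 1)
    keys = idx[:, 0] * bins * bins + idx[:, 1] * bins + idx[:, 2]
    reps = []
    for key in np.unique(keys):
        members = points[keys == key]
        # center of this voxel in the original coordinates
        cell = np.array(np.unravel_index(key, (bins, bins, bins))).ravel()
        center = lo + (cell + 0.5) * (hi - lo) / bins
        # keep the member closest to the voxel centre
        reps.append(members[np.argmin(np.linalg.norm(members - center, axis=1))])
    return np.array(reps)   # at most bins**3 representative points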
I'm trying to come up with an algorithm that will determine turning points in a trajectory of x/y coordinates. The following figures illustrate what I mean: green indicates the starting point and red the final point of the trajectory (the entire trajectory consists of ~1500 points):
In the following figure, I added by hand the possible (global) turning points that an algorithm could return:
Obviously, the true turning point is always debatable and will depend on the angle one requires between points. Furthermore, a turning point can be defined on a global scale (which is what I tried to do with the black circles), but it could also be defined on a high-resolution local scale. I'm interested in the global (overall) direction changes, but I'd love to see a discussion of the different approaches one would use to tease apart global vs. local solutions.
What I've tried so far:
- calculate the distance between subsequent points
- calculate the angle between subsequent points
- look at how the distance / angle changes between subsequent points
Unfortunately this doesn't give me any robust results. I probably have to calculate the curvature along multiple points, but that's just an idea.
I'd really appreciate any algorithms or ideas that might help me here. The code can be in any programming language; MATLAB or Python are preferred.
EDIT: here's the raw data (in case somebody wants to play with it):
mat file
text file (x coordinate first, y coordinate in second line)
You could use the Ramer-Douglas-Peucker (RDP) algorithm to simplify the path. Then you could compute the change in directions along each segment of the simplified path. The points corresponding to the greatest change in direction could be called the turning points:
A Python implementation of the RDP algorithm can be found on github.
import matplotlib.pyplot as plt
import numpy as np
import os
import rdp

def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value
    between 0 and pi.

    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1*dir2).sum(axis=1)/(
        np.sqrt((dir1**2).sum(axis=1)*(dir2**2).sum(axis=1))))

tolerance = 70
min_angle = np.pi*0.22

filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T

# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))

print(len(simplified))
sx, sy = simplified.T

# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)

# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta > min_angle)[0] + 1

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'b-', label='original path')
ax.plot(sx, sy, 'g--', label='simplified path')
ax.plot(sx[idx], sy[idx], 'ro', markersize=10, label='turning points')
ax.invert_yaxis()
plt.legend(loc='best')
plt.show()
Two parameters were used above:
- The RDP algorithm takes one parameter, the tolerance, which represents the maximum distance the simplified path can stray from the original path. The larger the tolerance, the cruder the simplified path.
- The other parameter is min_angle, which defines what is considered a turning point. (I'm taking a turning point to be any point on the original path whose angle between the entering and exiting vectors on the simplified path is greater than min_angle.)
I will be giving numpy/scipy code below, as I have almost no Matlab experience.
If your curve is smooth enough, you could identify your turning points as those of highest curvature. Taking the point index number as the curve parameter and using a central differences scheme, you can compute the curvature with the following code:
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage

def first_derivative(x):
    return x[2:] - x[0:-2]

def second_derivative(x):
    return x[2:] - 2 * x[1:-1] + x[:-2]

def curvature(x, y):
    x_1 = first_derivative(x)
    x_2 = second_derivative(x)
    y_1 = first_derivative(y)
    y_2 = second_derivative(y)
    return np.abs(x_1 * y_2 - y_1 * x_2) / np.sqrt((x_1**2 + y_1**2)**3)
You will probably want to smooth your curve out first, then calculate the curvature, then identify the highest curvature points. The following function does just that:
def plot_turning_points(x, y, turning_points=10, smoothing_radius=3,
                        cluster_radius=10):
    if smoothing_radius:
        weights = np.ones(2 * smoothing_radius + 1)
        new_x = scipy.ndimage.convolve1d(x, weights, mode='constant', cval=0.0)
        new_x = new_x[smoothing_radius:-smoothing_radius] / np.sum(weights)
        new_y = scipy.ndimage.convolve1d(y, weights, mode='constant', cval=0.0)
        new_y = new_y[smoothing_radius:-smoothing_radius] / np.sum(weights)
    else:
        new_x, new_y = x, y
    k = curvature(new_x, new_y)
    turn_point_idx = np.argsort(k)[::-1]
    t_points = []
    while len(t_points) < turning_points and len(turn_point_idx) > 0:
        t_points += [turn_point_idx[0]]
        idx = np.abs(turn_point_idx - turn_point_idx[0]) > cluster_radius
        turn_point_idx = turn_point_idx[idx]
    t_points = np.array(t_points)
    t_points += smoothing_radius + 1
    plt.plot(x, y, 'k-')
    plt.plot(new_x, new_y, 'r-')
    plt.plot(x[t_points], y[t_points], 'o')
    plt.show()
Some explaining is in order:
- turning_points is the number of points you want to identify.
- smoothing_radius is the radius of a smoothing convolution to be applied to your data before computing the curvature.
- cluster_radius is the distance around a selected high-curvature turning point within which no other point should be considered as a candidate.
You may have to play around with the parameters a little, but I got something like this:
>>> x, y = np.genfromtxt('bla.data')
>>> plot_turning_points(x, y, turning_points=20, smoothing_radius=15,
... cluster_radius=75)
Probably not good enough for a fully automated detection, but it's pretty close to what you wanted.
A very interesting question. Here is my solution, which allows for variable resolution, although fine-tuning it may not be simple, as it's mostly intended to narrow down the set of candidate points.
Every k points, calculate the convex hull and store it as a set. Then go through those (at most) k points and remove any points that are not in the convex hull, in such a way that the points don't lose their original order.
The purpose here is that the convex hull will act as a filter, removing all of the "unimportant" points and leaving only the extreme points. Of course, if the k-value is too high, you'll end up with something too close to the actual convex hull, instead of what you actually want.
You should start with a small k, at least 4, then increase it until you get what you seek. You should also probably only include the middle point of every 3 points where the angle is below a certain amount, d. This would ensure that all of the turns are at least d degrees (not implemented in the code below). However, this should probably be done incrementally to avoid loss of information, the same as increasing the k-value. Another possible improvement would be to re-run with the points that were removed, and only remove points that were not in both convex hulls, though this requires a higher minimum k-value of at least 8.
The following code seems to work fairly well, but it could still use improvements for efficiency and noise removal. It's also rather inelegant in determining when it should stop, so the code really only works (as it stands) from around k=4 to k=14.
def convex_filter(points, k):
    new_points = []
    for pts in (points[i:i + k] for i in range(0, len(points), k)):
        hull = set(convex_hull(pts))
        for point in pts:
            if point in hull:
                new_points.append(point)
    return new_points

# How the points are obtained is a minor point, but they need to be in the right order.
x_coords = [float(x) for x in x.split()]
y_coords = [float(y) for y in y.split()]
points = list(zip(x_coords, y_coords))

k = 10
prev_length = 0
new_points = points

# Filter using the convex hull until no more points are removed
while len(new_points) != prev_length:
    prev_length = len(new_points)
    new_points = convex_filter(new_points, k)
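The convex_hull helper is not defined above; a standard monotone-chain implementation (an assumption about what was intended) could look like this:

# Hypothetical helper: Andrew's monotone chain convex hull for a list of (x, y) tuples.
def convex_hull(pts):
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]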
Here is a screen shot of the above code with k=14. The 61 red dots are the ones that remain after the filter.
The approach you took sounds promising, but your data is heavily oversampled. You could filter the x and y coordinates first, for example with a wide Gaussian, and then downsample.
In MATLAB, you could use x = conv(x, normpdf(-10 : 10, 0, 5)) and then x = x(1 : 5 : end). You will have to tweak those numbers depending on the intrinsic persistence of the objects you are tracking and the average distance between points.
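A rough Python equivalent of that MATLAB snippet (my own sketch using SciPy; the sigma of 5 and the downsampling factor of 5 mirror the numbers above and would need the same tweaking):

import numpy as np
from scipy.ndimage import gaussian_filter1d

# x and y are the raw trajectory coordinates as 1-D arrays
x_smooth = gaussian_filter1d(np.asarray(x, dtype=float), sigma=5)[::5]
y_smooth = gaussian_filter1d(np.asarray(y, dtype=float), sigma=5)[::5]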
Then, you will be able to detect changes in direction very reliably, using the same approach you tried before, based on the scalar product, I imagine.
Another idea is to examine the left and the right surroundings at every point. This may be done by creating a linear regression of the N points before and after each point. If the intersecting angle between the two regression lines is below some threshold, then you have a corner.
This may be done efficiently by keeping a queue of the points currently in the linear regression and replacing old points with new points, similar to a running average.
Finally, you have to merge adjacent corners into a single corner, e.g. by choosing the point with the strongest corner property.
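A rough sketch of this windowed-regression idea (my own illustration; it fits x and y against the point index to avoid trouble with vertical segments, flags a corner where the two fitted directions differ by more than a threshold, and leaves out the merging of adjacent corners described above):

import numpy as np

def regression_corners(x, y, N=10, angle_thresh_deg=30.0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    t = np.arange(len(x))
    corners = []
    for i in range(N, len(x) - N):
        # direction of the best-fit line through the N points before / after point i
        bx = np.polyfit(t[i-N:i+1], x[i-N:i+1], 1)[0]
        by = np.polyfit(t[i-N:i+1], y[i-N:i+1], 1)[0]
        ax = np.polyfit(t[i:i+N+1], x[i:i+N+1], 1)[0]
        ay = np.polyfit(t[i:i+N+1], y[i:i+N+1], 1)[0]
        cosang = (bx*ax + by*ay) / (np.hypot(bx, by) * np.hypot(ax, ay) + 1e-12)
        ang = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
        if ang > angle_thresh_deg:
            corners.append(i)
    return corners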