Speed up geolocation algorithm in Python

I have a set of 100k geo locations (lat/lon) and a hexagonal grid (4k polygons). My goal is to calculate the total number of points which are located within each polygon.
My current algorithm uses 2 for loops to loop over all geo points and all polygons, which is really slow if I increase the number of polygons... How would you speed up the algorithm? I have uploaded a minimal example which creates 100k random geo points and uses 561 cells in the grid...
I also saw that reading the GeoJSON file (with 4k polygons) takes some time; maybe I should export the polygons into a CSV?
hexagon_grid.geojson file:
https://gist.github.com/Arnold1/9e41454e6eea910a4f6cd68ff1901db1
minimal python example:
https://gist.github.com/Arnold1/ee37a2e4b2dfbfdca9bfae7c7c3a3755

You don't need to explicitly test each hexagon to see whether a given point is located inside it.
Let's assume, for the moment, that all of your points fall somewhere within the bounds of your hexagonal grid. Because your hexagons form a regular lattice, you only really need to know which of the hexagon centers is closest to each point.
This can be computed very efficiently using a scipy.spatial.cKDTree:
import numpy as np
from scipy.spatial import cKDTree
import json
with open('/tmp/grid.geojson', 'r') as f:
    data = json.load(f)

verts = []
centroids = []

for hexagon in data['features']:
    # a (7, 2) array of xy coordinates specifying the vertices of the hexagon;
    # we ignore the last vertex since it's equal to the first
    xy = np.array(hexagon['geometry']['coordinates'][0][:6])
    verts.append(xy)
    # compute the centroid by taking the average of the vertex coordinates
    centroids.append(xy.mean(0))
verts = np.array(verts)
centroids = np.array(centroids)
# construct a k-D tree from the centroid coordinates of the hexagons
tree = cKDTree(centroids)
# generate 10000 normally distributed xy coordinates
sigma = 0.5 * centroids.std(0, keepdims=True)
mu = centroids.mean(0, keepdims=True)
gen = np.random.RandomState(0)
xy = (gen.randn(10000, 2) * sigma) + mu
# query the k-D tree to find which hexagon centroid is nearest to each point
distance, idx = tree.query(xy, 1)
# count the number of points that are closest to each hexagon centroid
counts = np.bincount(idx, minlength=centroids.shape[0])
Plotting the output:
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, 1, subplot_kw={'aspect': 'equal'})
ax.scatter(xy[:, 0], xy[:, 1], 10, c='b', alpha=0.25, edgecolors='none')
ax.scatter(centroids[:, 0], centroids[:, 1], marker='h', s=(counts + 5),
           c=counts, cmap='Reds')
ax.margins(0.01)
I can think of several different ways you could handle points that fall outside your grid depending on how much accuracy you need:
You could exclude points that fall outside the outer bounding rectangle of your hexagon vertices (i.e. x < xmin, x > xmax etc.). However, this will fail to exclude points that fall within the 'gaps' along the edges of your grid.
Another straightforward option would be to set a cut-off on distance according to the spacing of your hexagon centers, which is equivalent to using a circular approximation for your outer hexagons.
If accuracy is crucial then you could define a matplotlib.path.Path corresponding to the outer vertices of your hexagonal grid, then use its .contains_points() method to test whether your points are contained within it. Compared to the other two methods, this would probably be slower and more fiddly to code.
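For illustration, here is a minimal sketch of the second (distance cut-off) option, reusing tree, xy and centroids from the code above; the sqrt(3) factor assumes a regular hexagonal lattice, so treat it as a starting point rather than a definitive rule:
distance, idx = tree.query(xy, 1)
# centre-to-centre spacing between hexagons: query with k=2, since the
# first hit for each centroid is the centroid itself
spacing = tree.query(centroids, 2)[0][:, 1].min()
# for a regular hexagonal lattice, circumradius = spacing / sqrt(3)
inside = distance <= spacing / np.sqrt(3)
counts = np.bincount(idx[inside], minlength=centroids.shape[0])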

Related

Calculate the area enclosed by a 2D array of unordered points in python

I am trying to calculate the area of a shape enclosed by a large set of unordered points in python. I have a 2D array of points which I can plot as a scatterplot like this.
There are several ways to calculate the area enclosed by points, but these all assume ordered points, such as here and here. This method calculates the area of unordered points, but it doesn't appear to work for complex shapes, as seen here. How would I calculate this area from unordered points in python?
Sample data looks like this:
[[225.93459 -27.25677 ]
[226.98128 -32.001945]
[223.3623 -34.119724]
[225.84741 -34.416553]]
From pen and paper one can see that this shape contains an area of ~12 (unitless) but putting these coordinates into one of the algorithms linked to previously returns an area of ~0.78.
Let's first mention that the phrase 'unordered points', as used in the question How would I calculate this area from unordered points in python?, usually means, in the context of area calculation, that the given points are contour points enclosing the area to be calculated.
But the data sample provided in the question is not a set of contour points; it is just a cloud of points, which, when visualized as a scatterplot, forms a visually perceivable area.
This is why the links in the question to algorithms that calculate areas from 'unordered points' don't apply at all to what the question is actually about.
In other words, the actual title of the question I will answer below is:
Calculate the visually perceivable area a cloud of (x,y) points is forming when visualized as a scatterplot
One of the possible options is mentioned in a comment to the question:
Honestly, you might consider taking THAT graph as a bitmap, and counting the number of non-white pixels in it. That is probably as close as you can get. – Tim Roberts
Given an image that covers all the non-white pixels exactly (without any margin), you can calculate the area the image rectangle covers, in the units used in the underlying (x,y) data, by computing the area TA of the rectangle visible in the image from the underlying list of points P with (x,y) point coordinates ( P = [(x1,y1), (x2,y2), ...] ) as follows:
X = [x for x,y in P]
Y = [y for x,y in P]
TA = (max(X)-min(X))*(max(Y)-min(Y))
Assuming N_white is the number of white pixels in an image with N pixels in total, the actual area A covered by non-white pixels, expressed in the units used in the list of points P, is:
A = TA*(N-N_white)/N
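A minimal sketch of this pixel-counting idea, assuming the scatterplot was saved to a hypothetical, margin-free file scatter.png and that P is the list of points introduced above:
import numpy as np
import matplotlib.pyplot as plt

im = plt.imread('scatter.png')              # hypothetical image, shape (H, W, channels)
N = im.shape[0] * im.shape[1]               # total number of pixels
# a pixel counts as white when all its colour channels are (close to) 1.0
N_white = np.count_nonzero(np.all(im[..., :3] > 0.99, axis=-1))

X = [x for x, y in P]
Y = [y for x, y in P]
TA = (max(X) - min(X)) * (max(Y) - min(Y))  # bounding-box area
A = TA * (N - N_white) / N                  # area covered by non-white pixels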
Another approach, using only a list of points P with (x,y) point coordinates (without creating an image), consists of the following steps:
decide which area Ap a point is covering and calculate half of the size h2 of a rectangle with this area around that point ( h2 = 0.5*sqrt(Ap) )
create a list R with rectangles around all points in the list P: R = [(x-h2, y+h2, x+h2, y-h2) for x,y in P]
use the code provided through a link listed in the stackoverflow question
Area of Union Of Rectangles using Segment Trees to calculate the total area covered by the rectangles in the list R.
Compared to the graphical approach based on the scatterplot, the above approach has the advantage that your choice of the area covered by a single point directly controls the precision/resolution/granularity used for the area calculation.
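If pulling in the segment-tree code from that question is not desired, the same union of rectangles can also be computed with shapely; a minimal sketch, assuming shapely is installed and with an arbitrary Ap value:
from math import sqrt
from shapely.geometry import box
from shapely.ops import unary_union

Ap = 0.01                  # chosen area covered by a single point (arbitrary)
h2 = 0.5 * sqrt(Ap)        # half the side length of the square around each point
rectangles = [box(x - h2, y - h2, x + h2, y + h2) for x, y in P]
area = unary_union(rectangles).area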
Given a 2D array of points, the area covered by the points can also be calculated from the return value of the same hist2d() function provided by the matplotlib module (as matplotlib.pyplot.hist2d()) which is used to show the scatterplot.
The 'trick' is to set the function's cmin parameter to 1 ( cmin=1 ), count the numpy.nan values in the array the function returns, and relate that count to the total number of array values.
In other words, everything necessary to calculate the area is already there, ready for use in simple area calculation formulas, once you know that the histogram-creating function returns it.
Below is the code of a ready-to-use function for the area calculation, along with a demonstration of its usage:
def area_of_points(points, grid_size=[1000, 1000]):
    """
    Returns the area covered by N 2D-points provided in a 'points' array
    points = [ (x1,y1), (x2,y2), ... , (xN, yN) ]
    'grid_size' gives the number of grid cells in x and y direction the
    'points' bounding box is divided into for calculation of the area.
    Larger 'grid_size' values mean smaller grid cells, higher precision
    of the area calculation and longer runtime.
    area_of_points() requires installed matplotlib module. """
    import matplotlib.pyplot as plt
    import numpy as np
    pts_x = [x for x, y in points]
    pts_y = [y for x, y in points]
    pts_bb_area = (max(pts_x) - min(pts_x)) * (max(pts_y) - min(pts_y))
    h2D, _, _, _ = plt.hist2d(pts_x, pts_y, bins=grid_size, cmin=1)
    numberOfWhiteBins = np.count_nonzero(np.isnan(h2D))
    numberOfAll2Dbins = h2D.shape[0] * h2D.shape[1]
    areaFactor = 1.0 - numberOfWhiteBins / numberOfAll2Dbins
    pts_pts_area = areaFactor * pts_bb_area
    print(f'Areas: b-box = {pts_bb_area:8.4f}, points = {pts_pts_area:8.4f}')
    plt.show()
    return pts_pts_area
#:def area_of_points(points, grid_size=[1000, 1000])
import numpy as np
np.random.seed(12345)
x = np.random.normal(size=100000)
y = x + np.random.normal(size=100000)
pts = [[xi,yi] for xi,yi in zip(x,y)]
print(area_of_points(pts))
# ^-- prints: Areas: b-box = 114.5797, points = 7.8001
# ^-- prints: 7.800126875291629
The above code creates the following scatterplot:
Notice that the printed output Areas: b-box = 114.5797, points = 7.8001 and the area value 7.800126875291629 returned by the function give the area in the units in which the x,y coordinates in the array of points are specified.
Instead of using a function, you can also apply the same know-how directly and play around with the parameters of the scatterplot, calculating the area of what can actually be seen in it.
The code below changes the displayed scatterplot while using the same underlying point data:
import numpy as np
np.random.seed(12345)
x = np.random.normal(size=100000)
y = x + np.random.normal(size=100000)
pts = [[xi,yi] for xi,yi in zip(x,y)]
pts_values_example = \
[[0.53005, 2.79209],
[0.73751, 0.18978],
... ,
[-0.6633, -2.0404],
[1.51470, 0.86644]]
# ---
pts_x = [x for x,y in pts]
pts_y = [y for x,y in pts]
pts_bb_area = (max(pts_x)-min(pts_x))*(max(pts_y)-min(pts_y))
# ---
import matplotlib.pyplot as plt
bins = [320, 300] # resolution of the grid (for the scatter plot)
# ^-- resolution of precision for the calculation of area
pltRetVal = plt.hist2d( pts_x, pts_y, bins = bins, cmin=1, cmax=15 )
plt.colorbar() # display the colorbar (for a 2d density histogram)
plt.show()
# ---
h2D, xedges1D, yedges1D, h2DhistogramObject = pltRetVal
numberOfWhiteBins = np.count_nonzero(np.isnan(h2D))
numberOfAll2Dbins = (len(xedges1D)-1)*(len(yedges1D)-1)
areaFactor = 1.0 - numberOfWhiteBins/numberOfAll2Dbins
area = areaFactor * pts_bb_area
print(f'Areas: b-box = {pts_bb_area:8.4f}, points = {area:8.4f}')
# prints "Areas: b-box = 114.5797, points = 20.7174"
creating the following scatterplot:
Notice that the calculated area is now larger because smaller values were used for the grid resolution, resulting in more of the area being colored.

Resample points in a Geodataframe so they stay equally spaced but inside area

I have a geopandas dataframe that contains a complex area and some Point objects representing latitude and longitude inside that same area, both read from .kml and .xlsx files, not defined by me. The thing is that those points are really close to each other, and when plotting the whole map they overlap, making it really difficult to spot each one individually, especially when using geoplot.kdeplot()
So what I would like to do is to find some way to equally space them, respecting the boundaries of my area. For the sake of simplicity, consider the Polygon object as the area and the Point objects as the ones to resample:
import random
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon

y = np.array([[100, 100], [200, 100], [200, 200], [100, 200], [50.0, 150]])
p = Polygon(y)
points = []
for i in range(100):
    pt = Point(random.randint(50, 199), random.randint(100, 199))
    if p.contains(pt):
        points.append(pt)
xs = [point.x for point in points]
ys = [point.y for point in points]
fig = plt.figure()
ax1 = fig.add_subplot(111)
x, y = p.exterior.xy
plt.plot(x, y)
ax1.scatter(xs, ys)
plt.show()
That gives something like this:
Any ideas on how to do it without crossing the area? Thank you in advance!
EDIT:
Important to mention that the resampling should not be arbitrary; a point at coordinates (100,100), for example, should stay somewhere near its original location.
This kind of problem can be handled in various ways, such as force-based or geometrical methods; here, a geometrical one is used.
I have treated the points as circles, so an arbitrary diameter can be specified for them, along with a marker area size for plotting with matplotlib. So we have the following code for the beginning:
import numpy as np
from scipy.spatial import cKDTree
from shapely.geometry.polygon import Polygon
import matplotlib.pyplot as plt
np.random.seed(85)
# Points' coordinates and the specified diameter
coords = np.array([[3, 4], [7, 8], [3, 3], [1, 8], [5, 4], [3, 5], [7, 7]], dtype=np.float64)
points_dia = 1.1 # can be chosen arbitrary based on the problem
# Polygon creation
polygon_verts = np.array([[0, 0], [8.1, 0], [8.1, 8.1], [0, 8.1]])
p = Polygon(polygon_verts)
# Plotting
x, y = coords.T
colors = np.random.rand(coords.shape[0])
area = 700 # can be chosen arbitrary based on the problem
min_ = coords.min() - 2*points_dia
max_ = coords.max() + 2*points_dia
plt.xlim(min_, max_)
plt.ylim(min_, max_)
xc, yc = p.exterior.xy
plt.plot(xc, yc)
plt.scatter(x, y, s=area, c=colors, alpha=0.5)
plt.show()
I separate the total answer into two steps:
dispersing (moving away) the suspected points from the average position coordinates
curing the point-polygon overlaps
First step
For the first part, based on the number of points and neighbors, we can choose between SciPy and other libraries, e.g. scikit-learn (my previous explanations on this topic, 1 (4th-5th paragraphs) and 2, can be helpful for such selections). Based on the (not large) sizes the OP mentioned in the comments, I suggest using scipy.spatial.cKDTree, which in my experience has the best performance in this regard, to query the points. Using cKDTree.query_pairs we identify the groups of points that are within one diameter of each other. Then, looping over the groups and averaging each group's point coordinates, we can use cKDTree.query to find the group's point nearest (k=1) to the averaged coordinate. Now it is easy to calculate the distances of the other points of that group to that nearest point and move them outward by a specified distance (here, just as much as the overlaps):
nears = cKDTree(coords).query_pairs(points_dia, output_type='ndarray')
near_ids = np.unique(nears)
col1_ids = np.unique(nears[:, 0])
for i in range(col1_ids.size):
    pts_ids = np.unique(nears[nears[:, 0] == i].ravel())
    center_coord = np.average(coords[pts_ids], axis=0)
    nearest_center_id = pts_ids[cKDTree(coords[pts_ids]).query(center_coord, k=1)[1]]
    furthest_center_ids = pts_ids[pts_ids != nearest_center_id]
    vecs = coords[furthest_center_ids] - coords[nearest_center_id]
    dists = np.linalg.norm(vecs, axis=1) - points_dia
    coords[furthest_center_ids] = coords[furthest_center_ids] + vecs / np.linalg.norm(vecs, axis=1)[:, None] * np.abs(dists)[:, None]
Second step
For this part, we can loop over the coordinates modified in the previous step, find the two nearest polygon vertices (k=2) for each point, and project the point onto the edge between them to find the closest coordinate on that line. Then, as in step 1, we calculate the overlaps and move the points to place them inside the polygon. The points that were single (not in groups) are moved shorter distances. These distances can be specified as desired; I have set them to a default value:
for i, j in enumerate(coords):
    for k in range(polygon_verts.shape[0] - 1):
        nearest_poly_ids = cKDTree(polygon_verts).query(j, k=2)[1]
        vec_line = polygon_verts[nearest_poly_ids][1] - polygon_verts[nearest_poly_ids][0]
        vec_proj = j - polygon_verts[nearest_poly_ids][0]
        line_pnt = polygon_verts[nearest_poly_ids][0] + np.dot(vec_line, vec_proj) / np.dot(vec_line, vec_line) * vec_line
        vec = j - line_pnt
        dist = np.linalg.norm(vec)
        if dist < points_dia / 2:
            if i in near_ids:
                coords[i] = coords[i] + vec / dist * 2 * points_dia
            else:
                coords[i] = coords[i] + vec / dist
This code could perhaps be optimized for better performance, but that is not of importance for this question. I have checked it on my small example; it still needs to be checked on larger cases. In any case, I think this is one of the better ways to achieve the goal (perhaps with some changes based on need) and can serve as a template to be modified for other requirements or shortcomings.
So this is not particularly fast due to the loop checking for "point in polygon", but as long as you don't intend to have a very fine grid spacing, something like this would do the job:
import numpy as np
from scipy.interpolate import NearestNDInterpolator

# define a regular grid
gx, gy = np.meshgrid(np.linspace(50, 200, 100),
                     np.linspace(100, 200, 100))

# get an interpolation function
# (just used "xs" as values in here...)
i = NearestNDInterpolator(np.column_stack((xs, ys)), xs)

# crop your grid to the shape
newp = []
for x, y in zip(gx.flat, gy.flat):
    pt = Point(x, y)
    if p.contains(pt):
        newp.append(pt.xy)
newp = np.array(newp).squeeze()

# evaluate values on the grid
vals = i(*newp.T)
ax1.scatter(*newp.T, c=vals, zorder=0)

Finding edge from 2d points Python

I have several 2d sets of scattered data that I would like to find the edges of. Some edges may be open lines, others may be polygons.
For example, here is one plot that has an open edge that I would like to be able to keep. I would actually like to create a polygon from the open edges so I can use point_in_poly to check if another point lies inside. The points that would close the polygon are the boundaries of my plot area, btw.
Any ideas on where to get started?
EDIT:
Here is what I have already tried:
KernelDensity from sklearn. The edge point density varies significantly enough that it is not entirely distinguishable from the bulk of the points.
import numpy as np
from sklearn.neighbors import KernelDensity

kde = KernelDensity()
kde.fit(my_data)
dens = np.exp(kde.score_samples(ds))
dmax = dens.max()
dens_mask = (0.4 * dmax < dens) & (dens < 0.8 * dmax)
ax.scatter(ds[dens_mask, 0], ds[dens_mask, 1], ds[dens_mask, 2],
           c=dens[dens_mask], depthshade=False, marker='o', edgecolors='none')
Incidentally, the 'gap' in the left side of the color plot is the same one that is in the black and white plot above. I also am pretty sure that I could be using KDE better. For example, I would like to get the density for a much smaller volume, more like using radius_neighbors from sklearn's NearestNeighbors()
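For what it's worth, such a radius-based density could be sketched with NearestNeighbors like this (the radius value is a placeholder to tune for the data):
import numpy as np
from sklearn.neighbors import NearestNeighbors

# count the neighbours of each point within a fixed radius as a local
# density estimate; radius=0.5 is a placeholder value
nn = NearestNeighbors(radius=0.5).fit(my_data)
neighbor_ids = nn.radius_neighbors(my_data, return_distance=False)
local_density = np.array([len(ids) for ids in neighbor_ids])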
ConvexHull from scipy. I tried removing points from semi-random data (for practice) while still keeping a point of interest (here, 0,0) inside the convex set. This wasn't terribly effective. I had no sophisticated way of excluding points from an iteration and only removed the ones that were used in the last convex hull. This code and the accompanying image show the first and last hull made while keeping the point of interest in the set.
hull = ConvexHull(pts)
contains = True
while contains:
    temp_pts = np.delete(pts, hull.vertices, 0)
    temp_hull = ConvexHull(temp_pts)
    tp = path.Path(np.hstack((temp_pts[temp_hull.vertices, 0][np.newaxis].T,
                              temp_pts[temp_hull.vertices, 1][np.newaxis].T)))
    if not tp.contains_point([0, 0]):
        contains = False
        hull = ConvexHull(pts)
        plt.plot(pts[hull.vertices, 0], pts[hull.vertices, 1])
    else:
        pts = temp_pts
        plt.plot(pts[hull.vertices, 0], pts[hull.vertices, 1], 'r-')
plt.show()
Ideally the goal for the convex hull would be to maximize the area inside the hull while keeping only the point of interest inside the set, but I haven't been able to code this.
KMeans() from sklearn.cluster. Using n=3 clusters, I tried just running the class with default settings and got three horizontal groups of points. I haven't learned how to train the data to recognize points that form edges.
Here is a piece of the model where the data points are coming from. The solid areas contain points while the voids do not.
Here, and here are some other questions I have asked that show some more of what I have been looking at.
So I was able to do this in a roundabout way.
I used images of slices of the model in the xy plane generated from SolidWorks to distinguish the areas of interest.
If you see them, there are points in the corners of the picture that I placed in the model for reference at known distances. These points allowed me to determine the number of pixels per millimeter. From there, I mapped the points in my analysis set to pixels and checked the color of the pixel. If the pixel is white it is masked.
def mask_z_level(xi, yi, msk_img, x0=-14.3887, y0=5.564):
    im = plt.imread(msk_img)
    msk = np.zeros(xi.shape, dtype='bool')
    pxmm = np.zeros((3, 2))
    p = 0
    # locate the red reference markers placed in the model at known distances
    for row in range(im.shape[0]):
        for col in range(im.shape[1]):
            if tuple(im[row, col]) == (1., 0., 0.):
                pxmm[p] = (row, col)
                p += 1
    # pixels per millimeter in x and y
    pxx = pxmm[1, 1] / 5.5
    pxy = pxmm[2, 0] / 6.5
    print(pxx, pxy)
    # map each analysis point to a pixel; mask it if the pixel is white
    for j in range(xi.shape[1]):
        for i in range(xi.shape[0]):
            x, y = xi[i, j], yi[i, j]
            dx, dy = x - x0, y - y0
            dpx = np.round(dx * pxx).astype('int')
            dpy = -np.round(dy * pxy).astype('int')
            if tuple(im[dpy, dpx]) == (1., 1., 1.):
                msk[i, j] = True
    return msk
Here is a plot showing the effects of the masking:
I am still fine-tuning the borders, but the task is now very manageable since the mask is largely complete. The remaining work is needed because some mask points are incorrect, resulting in banding.

Python determine the mean and extract maximum value inside a polygon over a grid in an Array

I have 3 NumPy arrays which consist of UTM-X(256) and UTM-Y(256) coordinates, and the accumulated Rainfall(65536) for a Weather Radar 256x256 (km) in UTM.
I also have a Polygon inside the Grid bounds that is a Catchment Boundary in UTM.
I need to determine the Average Rainfall over just the catchment polygon (a clipped sub set of the RADAR data), and the maximum, and the location of the maximum. I have already determined the average over the entire RADAR grid.
So the question is: How do I perform analysis on a subset of a NumPy array that is determined by the Polygon? I would have thought that this would be a very common operation, but have not found any Python scripts to perform this operation.
Here is an illustration of the data set:
Here is an outline of a possible approach.
First find the polygon that bounds the catchment boundary. Presuming you know which of the UTM coordinates of your full set of points form that catchment boundary, say it's like this,
catchment = an np.array of (UTM_X, UTM_Y) point tuples
you could find the boundary of that point set using scipy.spatial.ConvexHull
boundary = scipy.spatial.ConvexHull(catchment)
Next, for your array of rainfall data, you would have to test whether the coordinates fall inside or outside of the boundary of the convex hull.
This previous SO question has some good answers explaining ways to do that coordinate test.
Finally you would gather those rainfall data points that passed the test of being inside the boundary and perform whatever statistical calculations you want to do with appropriate NumPy/SciPy statistical functions.
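A minimal sketch of that outline, assuming coords is an (M, 2) array of all grid point coordinates and rainfall the matching (M,) array of accumulated values (both names are placeholders), with matplotlib's Path doing the inside test, as the next answer also does:
import numpy as np
from scipy.spatial import ConvexHull
from matplotlib.path import Path

# convex boundary of the catchment points, as described above
hull = ConvexHull(catchment)
boundary = Path(catchment[hull.vertices])

inside = boundary.contains_points(coords)                 # boolean mask, shape (M,)
mean_rain = rainfall[inside].mean()                       # average over the catchment
max_rain = rainfall[inside].max()                         # maximum over the catchment
max_location = coords[inside][rainfall[inside].argmax()]  # UTM coordinates of the maximum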
Assuming the boundary is given as a list of the polygon vertices, you could have matplotlib generate a mask for you over the data coordinates and then use that mask to sum up only the values within the contour.
In other words, when you have a series of coordinates that define the boundary of the polygon that marks the region of interest, then have matplotlib generate a boolean mask indicating all the coordinates that are within this polygon. This mask can then be used to extract only the limited dataset of rainfall within the contour.
The following simple example shows you how this is done:
import numpy as np
from matplotlib.patches import PathPatch
from matplotlib.path import Path
import matplotlib.pyplot as plt
# generate some fake data
xmin, xmax, ymin, ymax = -10, 30, -4, 20
y, x = np.mgrid[ymin:ymax+1, xmin:xmax+1]
z = (x - (xmin + xmax) / 2) ** 2 + (y - (ymin + ymax) / 2) ** 2
extent = [xmin - .5, xmax + .5, ymin - .5, ymax + .5]

# create a contour from three random vertices
xr, yr = [np.random.randint(lo, hi + 1, 3) for lo, hi
          in ((xmin, xmax), (ymin, ymax))]

coordlist = np.vstack((xr, yr)).T  # create an Nx2 array of coordinates
coord_map = np.vstack((x.flatten(), y.flatten())).T  # an Mx2 array of all coordinates in the field
polypath = Path(coordlist)
mask = polypath.contains_points(coord_map).reshape(x.shape)  # have mpl figure out which coords are within the contour
f, ax = plt.subplots(1,1)
ax.imshow(z, extent=extent, interpolation='none', origin='lower', cmap='hot')
ax.imshow(mask, interpolation='none', extent=extent, origin='lower', alpha=.5, cmap='gray')
patch = PathPatch(polypath, facecolor='g', alpha=.5)
ax.add_patch(patch)
plt.show(block=False)
print(z[mask].sum()) # prints out the total accumulated
In this example, x and y represent your UTM-X and UTM-Y data ranges. z represents the weather rainfall data, but is in this case a matrix, unlike your single-column view of average rainfall (which is easily remapped onto a grid).
In the last line, I've summed up all the values of z that are within the contour. If you want the mean, just replace sum by mean.
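The question also asks for the maximum and its location; continuing from the example above, those follow directly from the same mask:
print(z[mask].mean())        # average rainfall within the polygon
print(z[mask].max())         # maximum rainfall within the polygon
# location of the maximum: suppress everything outside the mask, then argmax
iy, ix = np.unravel_index(np.where(mask, z, z.min() - 1).argmax(), z.shape)
print(x[iy, ix], y[iy, ix])  # coordinates of the maximum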

calculate turning points / pivot points in trajectory (path)

I'm trying to come up with an algorithm that will determine turning points in a trajectory of x/y coordinates. The following figure illustrates what I mean: green indicates the starting point and red the final point of the trajectory (the entire trajectory consists of ~1500 points):
In the following figure, I added by hand the possible (global) turning points that an algorithm could return:
Obviously, the true turning point is always debatable and will depend on the angle that one specifies that has to lie between points. Furthermore a turning point can be defined on a global scale (what I tried to do with the black circles), but could also be defined on a high-resolution local scale. I'm interested in the global (overall) direction changes, but I'd love to see a discussion on the different approaches that one would use to tease apart global vs local solutions.
What I've tried so far:
calculate distance between subsequent points
calculate angle between subsequent points
look how distance / angle changes between subsequent points
Unfortunately this doesn't give me any robust results. I probably have to calculate the curvature along multiple points, but that's just an idea.
I'd really appreciate any algorithms / ideas that might help me here. The code can be in any programming language, matlab or python are preferred.
EDIT: here's the raw data (in case somebody wants to play with it):
mat file
text file (x coordinate first, y coordinate in second line)
You could use the Ramer-Douglas-Peucker (RDP) algorithm to simplify the path. Then you could compute the change in directions along each segment of the simplified path. The points corresponding to the greatest change in direction could be called the turning points:
A Python implementation of the RDP algorithm can be found on github.
import matplotlib.pyplot as plt
import numpy as np
import os
import rdp
def angle(dir):
    """
    Returns the angles between vectors.

    Parameters:
    dir is a 2D-array of shape (N,M) representing N vectors in M-dimensional space.

    The return value is a 1D-array of values of shape (N-1,), with each value
    between 0 and pi.
    0 implies the vectors point in the same direction
    pi/2 implies the vectors are orthogonal
    pi implies the vectors point in opposite directions
    """
    dir2 = dir[1:]
    dir1 = dir[:-1]
    return np.arccos((dir1 * dir2).sum(axis=1) / (
        np.sqrt((dir1 ** 2).sum(axis=1) * (dir2 ** 2).sum(axis=1))))
tolerance = 70
min_angle = np.pi*0.22
filename = os.path.expanduser('~/tmp/bla.data')
points = np.genfromtxt(filename).T
print(len(points))
x, y = points.T
# Use the Ramer-Douglas-Peucker algorithm to simplify the path
# http://en.wikipedia.org/wiki/Ramer-Douglas-Peucker_algorithm
# Python implementation: https://github.com/sebleier/RDP/
simplified = np.array(rdp.rdp(points.tolist(), tolerance))
print(len(simplified))
sx, sy = simplified.T
# compute the direction vectors on the simplified curve
directions = np.diff(simplified, axis=0)
theta = angle(directions)
# Select the index of the points with the greatest theta
# Large theta is associated with greatest change in direction.
idx = np.where(theta>min_angle)[0]+1
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x, y, 'b-', label='original path')
ax.plot(sx, sy, 'g--', label='simplified path')
ax.plot(sx[idx], sy[idx], 'ro', markersize = 10, label='turning points')
ax.invert_yaxis()
plt.legend(loc='best')
plt.show()
Two parameters were used above:
The RDP algorithm takes one parameter, the tolerance, which represents the maximum distance the simplified path can stray from the original path. The larger the tolerance, the cruder the simplified path.
The other parameter is the min_angle which defines what is considered a turning point. (I'm taking a turning point to be any point on the original path, whose angle between the entering and exiting vectors on the simplified path is greater than min_angle).
I will be giving numpy/scipy code below, as I have almost no Matlab experience.
If your curve is smooth enough, you could identify your turning points as those of highest curvature. Taking the point index number as the curve parameter, and a central differences scheme, you can compute the curvature with the following code
import numpy as np
import matplotlib.pyplot as plt
import scipy.ndimage
def first_derivative(x):
    return x[2:] - x[0:-2]

def second_derivative(x):
    return x[2:] - 2 * x[1:-1] + x[:-2]

def curvature(x, y):
    x_1 = first_derivative(x)
    x_2 = second_derivative(x)
    y_1 = first_derivative(y)
    y_2 = second_derivative(y)
    return np.abs(x_1 * y_2 - y_1 * x_2) / np.sqrt((x_1**2 + y_1**2)**3)
You will probably want to smooth your curve out first, then calculate the curvature, then identify the highest curvature points. The following function does just that:
def plot_turning_points(x, y, turning_points=10, smoothing_radius=3,
                        cluster_radius=10):
    if smoothing_radius:
        weights = np.ones(2 * smoothing_radius + 1)
        new_x = scipy.ndimage.convolve1d(x, weights, mode='constant', cval=0.0)
        new_x = new_x[smoothing_radius:-smoothing_radius] / np.sum(weights)
        new_y = scipy.ndimage.convolve1d(y, weights, mode='constant', cval=0.0)
        new_y = new_y[smoothing_radius:-smoothing_radius] / np.sum(weights)
    else:
        new_x, new_y = x, y
    k = curvature(new_x, new_y)
    turn_point_idx = np.argsort(k)[::-1]
    t_points = []
    while len(t_points) < turning_points and len(turn_point_idx) > 0:
        t_points += [turn_point_idx[0]]
        idx = np.abs(turn_point_idx - turn_point_idx[0]) > cluster_radius
        turn_point_idx = turn_point_idx[idx]
    t_points = np.array(t_points)
    t_points += smoothing_radius + 1
    plt.plot(x, y, 'k-')
    plt.plot(new_x, new_y, 'r-')
    plt.plot(x[t_points], y[t_points], 'o')
    plt.show()
Some explaining is in order:
turning_points is the number of points you want to identify
smoothing_radius is the radius of a smoothing convolution to be applied to your data before computing the curvature
cluster_radius is the distance from a point of high curvature selected as a turning point where no other point should be considered as a candidate.
You may have to play around with the parameters a little, but I got something like this:
>>> x, y = np.genfromtxt('bla.data')
>>> plot_turning_points(x, y, turning_points=20, smoothing_radius=15,
... cluster_radius=75)
Probably not good enough for a fully automated detection, but it's pretty close to what you wanted.
A very interesting question. Here is my solution, which allows for variable resolution. Although fine-tuning it may not be simple, as it's mostly intended to narrow down the candidate points.
Every k points, calculate the convex hull and store it as a set. Go through the at most k points and remove any points that are not in the convex hull, in such a way that the points don't lose their original order.
The purpose here is that the convex hull will act as a filter, removing all of "unimportant points" leaving only the extreme points. Of course, if the k-value is too high, you'll end up with something too close to the actual convex hull, instead of what you actually want.
This should start with a small k, at least 4, then increase it until you get what you seek. You should also probably only include the middle point for every 3 points where the angle is below a certain amount, d. This would ensure that all of the turns are at least d degrees (not implemented in the code below; a sketch follows after it). However, this should probably be done incrementally to avoid loss of information, the same as increasing the k-value. Another possible improvement would be to actually re-run with the points that were removed, and only remove points that were not in both convex hulls, though this requires a higher minimum k-value of at least 8.
The following code seems to work fairly well, but could still use improvements for efficiency and noise removal. It's also rather inelegant in determining when it should stop, thus the code really only works (as it stands) from around k=4 to k=14.
def convex_filter(points, k):
    # convex_hull() is assumed to return the hull vertices of pts
    # (e.g. from your own implementation or a library helper)
    new_points = []
    for pts in (points[i:i + k] for i in range(0, len(points), k)):
        hull = set(convex_hull(pts))
        for point in pts:
            if point in hull:
                new_points.append(point)
    return new_points

# How the points are obtained is a minor point, but they need to be in the right order.
x_coords = [float(x) for x in x.split()]
y_coords = [float(y) for y in y.split()]
points = list(zip(x_coords, y_coords))

k = 10
prev_length = 0
new_points = points

# Filter using the convex hull until no more points are removed
while len(new_points) != prev_length:
    prev_length = len(new_points)
    new_points = convex_filter(new_points, k)
Here is a screen shot of the above code with k=14. The 61 red dots are the ones that remain after the filter.
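The angle test mentioned above (keeping a middle point only when the path turns through it by at least d degrees) is not implemented there, but a sketch of it could look like this:
import numpy as np

def angle_filter(points, d):
    # drop the middle point of any consecutive triple whose turn angle
    # (deviation from going straight) is below d degrees
    pts = np.asarray(points, dtype=float)
    keep = [0]
    for i in range(1, len(pts) - 1):
        v1 = pts[i] - pts[i - 1]
        v2 = pts[i + 1] - pts[i]
        cos_turn = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        turn = np.degrees(np.arccos(np.clip(cos_turn, -1.0, 1.0)))
        if turn >= d:
            keep.append(i)
    keep.append(len(pts) - 1)
    return pts[keep]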
The approach you took sounds promising but your data is heavily oversampled. You could filter the x and y coordinates first, for example with a wide Gaussian and then downsample.
In MATLAB, you could use x = conv(x, normpdf(-10 : 10, 0, 5)) and then x = x(1 : 5 : end). You will have to tweak those numbers depending on the intrinsic persistence of the objects you are tracking and the average distance between points.
Then, you will be able to detect changes in direction very reliably, using the same approach you tried before, based on the scalar product, I imagine.
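Roughly the same thing as a Python sketch (Gaussian smoothing followed by taking every 5th sample; x and y are the raw coordinate arrays, and the kernel width and downsampling step are the same placeholder values as in the MATLAB snippet):
import numpy as np
from scipy.ndimage import gaussian_filter1d

# smooth with a wide Gaussian, then downsample by a factor of 5
x_s = gaussian_filter1d(x, sigma=5)[::5]
y_s = gaussian_filter1d(y, sigma=5)[::5]

# direction changes via the scalar product of successive displacement vectors
v = np.diff(np.column_stack((x_s, y_s)), axis=0)
cos_turn = (v[:-1] * v[1:]).sum(axis=1) / (
    np.linalg.norm(v[:-1], axis=1) * np.linalg.norm(v[1:], axis=1))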
Another idea is to examine the left and the right surroundings at every point. This may be done by creating a linear regression of N points before and after each point. If the intersecting angle between the two regression lines is below some threshold, then you have a corner.
This may be done efficiently by keeping a queue of the points currently in the linear regression and replacing old points with new points, similar to a running average.
You finally have to merge adjacent corners to a single corner. E.g. choosing the point with the strongest corner property.
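A sketch of this idea using numpy.polyfit, fitting a line to the N points on either side of each point and comparing slopes (note that a plain slope fit degrades on near-vertical segments, so treat this as a starting point rather than a finished detector):
import numpy as np

def corner_strength(x, y, N=10):
    # angle (in degrees) between the left and right regression lines
    # around each point; larger values indicate stronger corners
    angles = np.zeros(len(x))
    for i in range(N, len(x) - N):
        slope_l = np.polyfit(x[i - N:i + 1], y[i - N:i + 1], 1)[0]
        slope_r = np.polyfit(x[i:i + N + 1], y[i:i + N + 1], 1)[0]
        angles[i] = np.degrees(abs(np.arctan(slope_r) - np.arctan(slope_l)))
    return angles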
