How to get the K most distant points, given their coordinates? - python
We have a boring CSV with 10,000 rows of ages (float), titles (enum/int), scores (float), and so on.
The table has N columns, each with int/float values.
You can imagine the rows as points in N-dimensional space.
We want to pick the K points that have the maximum distance between each other.
So if we have 100 points in a tightly packed cluster and one point far away, we would get something like this for three points:
or this:
For 4 points it becomes more interesting, and one of the picks lands somewhere in the middle.
So how can we select the K most distant rows (points) out of N (with any complexity)? It looks like an N-dimensional point-cloud "triangulation" with a given resolution, yet not for 3D points.
I'm looking for a reasonably fast approach (approximate; no precise solution needed) for K=200, N=100000 and ND=6 (probably multigrid-, ANN/KDTree-, SOM- or triangulation-based). Does anyone know one?
From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the group with the largest mean works very well. As someone noted above, it's probably hard to avoid a loop over all combinations (not over all pairs). So a possible implementation can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist

Npoints = 3  # or 4 or 5...
# making up some data (np.array instead of the deprecated np.matrix):
data = np.array([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints)]

distances = []
for i in c:
    # pdist computes all pairwise Euclidean distances in a condensed form
    distances.append(np.mean(pdist(data[i, :])))

ind = distances.index(max(distances))  # index of the max mean distance
rows = c[ind]  # these are the points in question
I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
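For example, a one-line sketch of that adjustment, assuming the raw table has been loaded into an (N, ND) NumPy array (the variable name raw is illustrative):

raw = raw / raw.std(axis=0)   # per-column scaling; centering is optional for distance computations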
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial

N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234

rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)

def distances(ndarray_0, ndarray_1):
    if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
        raise ValueError("bad ndarray dimensions combination")
    return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)

# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf

prev_sum = 0.0
for loop in range(MAX_LOOPS):
    for i in range(K):
        # remove this point from the K points
        old_index = indices[i]
        # calculate its sum of distances from the K points
        distsums[old_index] = distances(data[indices], data[old_index]).sum()
        # update the sums of distances of all points from the K-1 points
        distsums -= distances(data, data[old_index])
        # choose the point with the greatest sum of distances from the K-1 points
        new_index = np.argmax(distsums)
        # add it to the K points, replacing old_index
        indices[i] = new_index
        # don't consider it any more in distsums
        distsums[new_index] = -np.inf
        # update the sums of distances of all points from the K points
        distsums += distances(data, data[new_index])
    # sum all mutual distances of the K points
    curr_sum = spatial.distance.pdist(data[indices]).sum()
    # break if the sum hasn't changed
    if curr_sum == prev_sum:
        break
    prev_sum = curr_sum

if ND == 2:
    X, Y = data.T
    marker_size = 4
    plt.scatter(X, Y, s=marker_size)
    plt.scatter(X[indices], Y[indices], s=marker_size)
    plt.grid(True)
    plt.gca().set_aspect('equal', adjustable='box')
    plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions, the output is this:
Assuming you read your CSV file with N (10000) rows and D dimensions (features) into an N*D matrix X, you can calculate the distance between each pair of points and store it in a distance matrix as follows:
import numpy as np

X = np.asarray(X)  # convert to a numpy array
distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    for j in range(i + 1, X.shape[0]):
        # We only fill the upper triangle: the distance from point A to point B
        # and from point B to point A are the same.
        distance_matrix[i][j] = np.linalg.norm(X[i] - X[j])  # Euclidean distance; other metrics can also be used

# distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix))
# ^ this would fill in the lower triangle as well, which is not really required in your case

K = 5  # number of most-distant entries (pairs) to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-K:], distance_matrix.shape)
print(indexes)
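For reference, the same pairwise-distance matrix can be built in a vectorized way with SciPy (a sketch, assuming X fits in memory as an N x D array):

from scipy.spatial.distance import pdist, squareform
# condensed pairwise Euclidean distances, expanded back to a full square matrix
distance_matrix = squareform(pdist(X))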
Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this is an interesting question, but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2D, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly, so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp, and a linear 2D model is the easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
import pandas as pd
import seaborn as sns

# lin_2_D is assumed to be a pandas DataFrame with 'x' and 'y' columns
# (in the original it held ~100 points along a noisy linear 2D trend)
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])

graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color='gray', marker='+', alpha=.4)
sns.scatterplot(x=right['x'], y=right['y'], color='red')
sns.scatterplot(x=left['x'], y=left['y'], color='green')
fig = graph.figure
fig.set_size_inches(8, 3)
What we have so far: of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left, we can reduce the work further by ordering the results on one side and checking their distances against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all of that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks like his bottom diagram, and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned: you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions, you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for the k furthest points but have a large number of equally valid possible solutions and no way of prioritizing them. Here are two super easy examples to illustrate this:
A) Here we have just four points and only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs; you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the centers; I'll update later, but the point is the same.
B) This is super easy to imagine: think of a D&D 4-sided die. Four points of data in three-dimensional space, all equidistant, so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka 6) possible combinations. Getting rid of valid solutions can be a bit of a problem, because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3: Revenge of the Curse of Dimensionality. Because you're looking for the most distant points, you have to have x, y, z ... n coordinates for each point, or you have to impute them. Now, your data set is much larger and slower to process.
Challenge 4: Because you're looking for the most distant points, dimension-reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points, and they'll scale linearly as you seek other equidistant points (k), but the total volume of the space increases as a power of the number of dimensions. The k furthest points you're after are insignificant compared to the total number of points. Even k^2 becomes insignificant as the number of dimensions increases.
Now, if you had low dimensionality, I would go with the other answers as a solution (except the ones that use nested for loops over NumPy or Pandas data).
If I were in your position, I'd be thinking that I've got code in these other answers I could use as a basis, and maybe wondering why I should trust this answer beyond the fact that it lays out a framework for thinking through the topic. Certainly, there should be some math, and maybe somebody important saying the same thing.
Let me reference chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's as though you put a balloon in a box. You could do this with a sphere in a cube, too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the max or min values in each dimension. When/if you run out of them, pick points that are close to those values if there isn't one exactly at the min/max. Essentially, you're choosing the corners of a box: for a 2D graph you have four points, for 3D you have the 8 corners of the box (2^3), and so on (see the sketch below).
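A minimal sketch of that corner heuristic, assuming data is an (N, D) NumPy array; the function name and the tie handling are illustrative choices, not code from the answer:

import itertools
import numpy as np

def corner_heuristic(data, k):
    # For each corner of the bounding box (every min/max combination across
    # dimensions), take the data point closest to that corner, up to k points.
    lo, hi = data.min(axis=0), data.max(axis=0)
    chosen = []
    for signs in itertools.product((0, 1), repeat=data.shape[1]):
        corner = np.where(signs, hi, lo)
        idx = int(np.argmin(np.linalg.norm(data - corner, axis=1)))
        if idx not in chosen:
            chosen.append(idx)
        if len(chosen) == k:
            break
    return chosen

# e.g. rows = corner_heuristic(data, k=8) for a 3D cloud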
More accurately, this would be 4D or 5D data (depending on how you might assign marker shape and color) projected down to 3D. But you can easily see how such a data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: it's easy to intuit graphically that you have no problem with up to k points, short of deciding which might be slightly closer. And you can see how you might need to randomize your selection if you have k < 2^D. If you added another point (k+1), you can see it would land at the centroid. So here is the check: if you had more points, where would they be? I guess I have to put the answer at the bottom -- a limitation of markdown.
So for a 6D data cloud, values of k less than 64 (really 65, as we'll see in just a moment) are pretty easy. But...
If you don't have a data cloud but instead have data with a linear relationship, you'll have 2^(D-1) points. So, for that linear 2D space you have a line, for a linear 3D space you'd have a plane, then a rhomboid, etc. This is true even if your shape is curved. Rather than make this graph myself, I'm using the one from an excellent post by Inversion Labs on best-fit surfaces for 3D data.
If the number of points, k, is less than 2^D, you need a process to decide which ones you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close as possible to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive, so let's go back to the two circles. For 2D you have just two points that could be candidates for being equidistant. But if that were 3D space and you rotated the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, the candidates would form a sphere, and hyperspheres (n-spheres) from there on. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So it's like having 27 six-sided dice shaped like a giant cube. Each vertex (or the point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember, the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satisfice the solution.
If you're interested in getting the most distant points, you can take advantage of all the methods that were developed for nearest neighbors; you just have to supply a different "metric".
For example, using scikit-learn's nearest neighbors and distance metric tools, you can do something like this:
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance  # note: this private import path may differ in newer scikit-learn versions
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

def inverted_euclidean(x1, x2):
    # You can speed this up using cython like scikit-learn does, or numba
    dist = np.sum((x1 - x2) ** 2)  # squared Euclidean distance (order-preserving)
    # We invert the distance and map coincident points to the biggest possible
    # positive float that isn't inf
    inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
    return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with NumPy, but you can try to do what the scikit-learn folks do with their distance functions and implement it in Cython. You could also try to JIT-compile it with Numba.
Maybe the Euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the ball tree algorithm (or the KD-tree algorithm) is that for each queried point you only have to do about log(N) comparisons to find the furthest point in the training set. Building the ball tree itself requires, I think, on the order of N log(N) comparisons, so in the end, if you want to find the k furthest points for every point in the ball tree's training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with increasing k (see the sketch after this list).
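Continuing the snippet above, a sketch of that all-points query (variable names follow the example; k is illustrative):

# for every point in X, find its k "nearest" neighbors under the inverted metric,
# i.e. its k most distant points in Euclidean terms
k = 10
distances, most_distant_inds = ball_tree.query(X, k=k, return_distance=True)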
Related
Manually find the distance between centroid and labelled data points
I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:

import numpy as np

# 10 random points in 3D space
X = np.random.rand(10, 3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0, clusters) for i in range(10)]).reshape(-1, 1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters, 3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))

Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:

distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)

will do the same thing as your original for loop.

EDIT: Thanks @Kris, I forgot the axis keyword, and since I didn't specify it, numpy automatically computed the norm of the entire flattened matrix, not just along the rows (axis 1). I've updated it now, and it should return an array of distances for each point. Also, einsum was suggested by @Kris for their specific application.
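An equivalent sketch without einsum (assuming y has shape (N, 1), as in the question):

# flatten the label column so c[y.ravel()] has shape (N, 3), then take row-wise norms
distances = np.linalg.norm(X - c[y.ravel()], axis=1)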
Finding two most far away points in plot with many points in Python
I need to find the two points which are furthest away from each other. As the screenshots show, I have an array containing two other arrays: one for the X and one for the Y coordinates. What's the best way to determine the longest line through the data? By that I mean I need to select the two points in the plot that are furthest apart. I hope you can help. Below are some screenshots to help explain the problem.
You can avoid computing all pairwise distances by observing that the two points which are furthest apart will occur as vertices in the convex hull. You can then compute pairwise distances between fewer points. For example, with 100,000 points distributed uniformly in a unit square, there are only 22 points in the convex hull in my instance.

import numpy as np
from scipy import spatial

# test points
pts = np.random.rand(100_000, 2)

# two points which are furthest apart will occur as vertices of the convex hull
candidates = pts[spatial.ConvexHull(pts).vertices]

# get distances between each pair of candidate points
dist_mat = spatial.distance_matrix(candidates, candidates)

# get indices of candidates that are furthest apart
i, j = np.unravel_index(dist_mat.argmax(), dist_mat.shape)

print(candidates[i], candidates[j])
# e.g. [ 1.11251218e-03 5.49583204e-05] [ 0.99989971 0.99924638]

If your data is 2-dimensional, you can compute the convex hull in O(N*log(N)) time, where N is the number of points. By concentration of measure, this method deteriorates in performance for many common distributions as the number of dimensions grows.
Finding n nearest data points to grid locations
I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i]. I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows:

[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]

XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of XStep. I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require the 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points. The only way I can think of is to have one for loop that goes through each grid point and a subsequent embedded for loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones:

for i in range(0, XG.shape[0]):
    for j in range(0, YG.shape[0]):
        for k in range(0, ZG.shape[0]):
            Distance = np.zeros(XD.shape)
            for a in range(0, XD.shape[0]):
                Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
            B = np.zeros([20], int)
            for a in range(0, 20):
                indx = np.argmin(Distance)
                B[a] = indx
                Distance[indx] = float('inf')

This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point. I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
Have a look at a seemingly similar but 2D problem and see if you can improve on it with ideas from there. Off the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point, you can quickly locate candidate points in those three arrays and start from there.
Also, you don't really need the Euclidean distance, since you are only interested in relative distance. It can also be described as abs(deltaX) + abs(deltaY) + abs(deltaZ), and you save on the expensive powers and square roots...
No need to iterate over your data points for each grid location: your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.

Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:

1. Allocate each data point to its closest center point.
2. Reposition each center point to the coordinates that would minimize the average distance from "its" points (to the "centroid" of the data subset).

Some data points now have a different closest center point. Repeat steps 1 and 2 until you converge (or near enough).
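Not from the original answers, but as a point of comparison, SciPy's k-d tree handles the "20 nearest data points per grid point" lookup directly; a sketch assuming the XD/YD/ZD and XGrid/YGrid/ZGrid arrays from the question:

from scipy.spatial import cKDTree
import numpy as np

# build the tree once over the scattered data points ...
tree = cKDTree(np.column_stack([XD, YD, ZD]))
# ... then query all grid points at once for their 20 nearest data points
grid_pts = np.column_stack([XGrid.ravel(), YGrid.ravel(), ZGrid.ravel()])
dists, neighbor_idx = tree.query(grid_pts, k=20)  # both results have shape (n_grid_points, 20)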
These spectrum bands used to be judged by eye, how to do it programmatically?
Operators used to examine the spectrum, knowing the location and width of each peak, and judge the piece the spectrum belongs to. In the new way, the image is captured by a camera onto a screen, and the width of each band must be computed programmatically.

Old system: spectroscope -> human eye
New system: spectroscope -> camera -> program

What is a good method to compute the width of each band, given their approximate X-axis positions, given that this task used to be performed perfectly by eye and must now be performed by program? Sorry if I am short of details, but they are scarce.

Program listing that generated the previous graph; I hope it is relevant:

import Image
from scipy import *
from scipy.optimize import leastsq

# Load the picture with PIL, process if needed
pic = asarray(Image.open("spectrum.jpg"))

# Average the pixel values along vertical axis
pic_avg = pic.mean(axis=2)
projection = pic_avg.sum(axis=0)

# Set the min value to zero for a nice fit
projection /= projection.mean()
projection -= projection.min()
#print projection

# Fit function, two gaussians, adjust as needed
def fitfunc(p,x):
    return p[0]*exp(-(x-p[1])**2/(2.0*p[2]**2)) + \
        p[3]*exp(-(x-p[4])**2/(2.0*p[5]**2))
errfunc = lambda p, x, y: fitfunc(p,x)-y

# Use scipy to fit, p0 is inital guess
p0 = array([0,20,1,0,75,10])
X = xrange(len(projection))
p1, success = leastsq(errfunc, p0, args=(X,projection))
Y = fitfunc(p1,X)

# Output the result
print "Mean values at: ", p1[1], p1[4]

# Plot the result
from pylab import *
#subplot(211)
#imshow(pic)
#subplot(223)
#plot(projection)
#subplot(224)
#plot(X,Y,'r',lw=5)
#show()
subplot(311)
imshow(pic)
subplot(312)
plot(projection)
subplot(313)
plot(X,Y,'r',lw=5)
show()
Given an approximate starting point, you could use a simple algorithm that finds a local maxima closest to this point. Your fitting code may be doing that already (I wasn't sure whether you were using it successfully or not). Here's some code that demonstrates simple peak finding from a user-given starting point: #!/usr/bin/env python from __future__ import division import numpy as np from matplotlib import pyplot as plt # Sample data with two peaks: small one at t=0.4, large one at t=0.8 ts = np.arange(0, 1, 0.01) xs = np.exp(-((ts-0.4)/0.1)**2) + 2*np.exp(-((ts-0.8)/0.1)**2) # Say we have an approximate starting point of 0.35 start_point = 0.35 # Nearest index in "ts" to this starting point is... start_index = np.argmin(np.abs(ts - start_point)) # Find the local maxima in our data by looking for a sign change in # the first difference # From http://stackoverflow.com/a/9667121/188535 maxes = (np.diff(np.sign(np.diff(xs))) < 0).nonzero()[0] + 1 # Find which of these peaks is closest to our starting point index_of_peak = maxes[np.argmin(np.abs(maxes - start_index))] print "Peak centre at: %.3f" % ts[index_of_peak] # Quick plot showing the results: blue line is data, green dot is # starting point, red dot is peak location plt.plot(ts, xs, '-b') plt.plot(ts[start_index], xs[start_index], 'og') plt.plot(ts[index_of_peak], xs[index_of_peak], 'or') plt.show() This method will only work if the ascent up the peak is perfectly smooth from your starting point. If this needs to be more resilient to noise, I have not used it, but PyDSTool seems like it might help. This SciPy post details how to use it for detecting 1D peaks in a noisy data set. So assume at this point you've found the centre of the peak. Now for the width: there are several methods you could use, but the easiest is probably the "full width at half maximum" (FWHM). Again, this is simple and therefore fragile. It will break for close double-peaks, or for noisy data. The FWHM is exactly what its name suggests: you find the width of the peak were it's halfway to the maximum. Here's some code that does that (it just continues on from above): # FWHM... half_max = xs[index_of_peak]/2 # This finds where in the data we cross over the halfway point to our peak. Note # that this is global, so we need an extra step to refine these results to find # the closest crossovers to our peak. # Same sign-change-in-first-diff technique as above hm_left_indices = (np.diff(np.sign(np.diff(np.abs(xs[:index_of_peak] - half_max)))) > 0).nonzero()[0] + 1 # Add "index_of_peak" to result because we cut off the left side of the data! hm_right_indices = (np.diff(np.sign(np.diff(np.abs(xs[index_of_peak:] - half_max)))) > 0).nonzero()[0] + 1 + index_of_peak # Find closest half-max index to peak hm_left_index = hm_left_indices[np.argmin(np.abs(hm_left_indices - index_of_peak))] hm_right_index = hm_right_indices[np.argmin(np.abs(hm_right_indices - index_of_peak))] # And the width is... fwhm = ts[hm_right_index] - ts[hm_left_index] print "Width: %.3f" % fwhm # Plot to illustrate FWHM: blue line is data, red circle is peak, red line # shows FWHM plt.plot(ts, xs, '-b') plt.plot(ts[index_of_peak], xs[index_of_peak], 'or') plt.plot( [ts[hm_left_index], ts[hm_right_index]], [xs[hm_left_index], xs[hm_right_index]], '-r') plt.show() It doesn't have to be the full width at half maximum — as one commenter points out, you can try to figure out where your operators' normal threshold for peak detection is, and turn that into an algorithm for this step of the process. 
A more robust way might be to fit a Gaussian curve (or your own model) to a subset of the data centred around the peak — say, from a local minima on one side to a local minima on the other — and use one of the parameters of that curve (eg. sigma) to calculate the width. I realise this is a lot of code, but I've deliberately avoided factoring out the index-finding functions to "show my working" a bit more, and of course the plotting functions are there just to demonstrate. Hopefully this gives you at least a good starting point to come up with something more suitable to your particular set.
Late to the party, but for anyone coming across this question in the future... Eye-movement data looks very similar to this; I'd base an approach on the one used by Nystrom and Holmqvist, 2010.

Smooth the data using a Savitzky-Golay filter (scipy.signal.savgol_filter in scipy v0.14+) to get rid of some of the low-level noise while keeping the large peaks intact; the authors recommend using order 2 and a window size of about twice the width of the smallest peak you want to be able to detect (see the sketch below).

You can find where the bands are by arbitrarily removing all values above a certain y value (setting them to numpy.nan). Then take the (nan)mean and (nan)standard deviation of the remainder, and remove all values greater than the mean + [parameter]*std (I think they use 6 in the paper). Iterate until you're not removing any data points, although depending on your data, certain values of [parameter] may not stabilise. Then use numpy.isnan() to find events vs non-events, and numpy.diff() to find the start and end of each event (values of -1 and 1 respectively). To get even more accurate start and end points, you can scan along the data backward from each start and forward from each end to find the nearest local minimum with a value smaller than mean + [another parameter]*std (I think they use 3 in the paper). Then you just need to count the data points between each start and end.

This won't work for that double peak; you'd have to do some extrapolation for that.
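A minimal sketch of the smoothing step, assuming projection is the 1D band profile from the listing above (the window length is a guess to be tuned to the narrowest band, and must be odd):

from scipy.signal import savgol_filter
# polyorder=2, window roughly twice the width of the smallest peak of interest
smoothed = savgol_filter(projection, window_length=21, polyorder=2)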
The best method might be to statistically compare a bunch of methods with human results. You would take a large variety of data and a large variety of measurement estimates (widths at various thresholds, areas above various thresholds, different threshold-selection methods, 2nd moments, polynomial curve fits of various degrees, pattern matching, and so on) and compare these estimates to human measurements of the same data set. Pick the estimation method that correlates best with expert human results. Or maybe pick several methods: the best one for each of various peak heights, for various separations from other peaks, and so on.
Peak detection in a 2D array
I'm helping a veterinary clinic measure pressure under a dog's paw. I use Python for my data analysis, and now I'm stuck trying to divide the paws into (anatomical) subregions. I made a 2D array of each paw that consists of the maximal values for each sensor that has been loaded by the paw over time. Here's an example of one paw, where I used Excel to draw the areas I want to 'detect'. These are 2 by 2 boxes around the sensors with local maxima that together have the largest sum.

So I tried some experimenting and decided to simply look for the maximums of each column and row (I can't look in one direction due to the shape of the paw). This seems to 'detect' the location of the separate toes fairly well, but it also marks neighboring sensors. So what would be the best way to tell Python which of these maximums are the ones I want?

Note: the 2x2 squares can't overlap, since they have to be separate toes! Also, I took 2x2 as a convenience; any more advanced solution is welcome, but I'm simply a human movement scientist, so I'm neither a real programmer nor a mathematician, so please keep it 'simple'. Here's a version that can be loaded with np.loadtxt.

Results: So I tried @jextee's solution (see the results below). As you can see, it works very well on the front paws, but it works less well for the hind legs. More specifically, it can't recognize the small peak that's the fourth toe. This is obviously inherent to the fact that the loop looks top-down towards the lowest value, without taking into account where this is. Would anyone know how to tweak @jextee's algorithm, so that it might be able to find the 4th toe too?

Since I haven't processed any other trials yet, I can't supply any other samples. But the data I gave before were the averages of each paw. This file is an array with the maximal data of 9 paws in the order they made contact with the plate. This image shows how they were spatially spread out over the plate.

Update: I have set up a blog for anyone interested, and I have set up a OneDrive with all the raw measurements. So to anyone requesting more data: more power to you!

New update: So after the help I got with my questions regarding paw detection and paw sorting, I was finally able to check the toe detection for every paw! Turns out, it doesn't work so well in anything but paws sized like the one in my own example. Of course, in hindsight, it's my own fault for choosing the 2x2 so arbitrarily. Here's a nice example of where it goes wrong: a nail is being recognized as a toe, and the 'heel' is so wide it gets recognized twice! The paw is too large, so taking a 2x2 size with no overlap causes some toes to be detected twice. The other way around, in small dogs it often fails to find a 5th toe, which I suspect is caused by the 2x2 area being too large.

After trying the current solution on all my measurements, I came to the staggering conclusion that for nearly all my small dogs it didn't find a 5th toe, and that in over 50% of the impacts for the large dogs it would find more! So clearly I need to change it. My own guess was changing the size of the neighborhood to something smaller for small dogs and larger for large dogs. But generate_binary_structure wouldn't let me change the size of the array. Therefore, I'm hoping that anyone else has a better suggestion for locating the toes, perhaps having the toe area scale with the paw size?
I detected the peaks using a local maximum filter. Here is the result on your first dataset of 4 paws. I also ran it on the second dataset of 9 paws and it worked as well. Here is how you do it:

import numpy as np
from scipy.ndimage.filters import maximum_filter
from scipy.ndimage.morphology import generate_binary_structure, binary_erosion
import matplotlib.pyplot as pp

# for some reason I had to reshape. Numpy ignored the shape header.
paws_data = np.loadtxt("paws.txt").reshape(4, 11, 14)

# getting a list of images
paws = [p.squeeze() for p in np.vsplit(paws_data, 4)]

def detect_peaks(image):
    """
    Takes an image and detects the peaks using the local maximum filter.
    Returns a boolean mask of the peaks (i.e. 1 when the pixel's value
    is the neighborhood maximum, 0 otherwise)
    """
    # define an 8-connected neighborhood
    neighborhood = generate_binary_structure(2, 2)

    # apply the local maximum filter; all pixels of maximal value
    # in their neighborhood are set to 1
    local_max = maximum_filter(image, footprint=neighborhood) == image
    # local_max is a mask that contains the peaks we are
    # looking for, but also the background.
    # In order to isolate the peaks we must remove the background from the mask.

    # we create the mask of the background
    background = (image == 0)

    # a little technicality: we must erode the background in order to
    # successfully subtract it from local_max, otherwise a line will
    # appear along the background border (artifact of the local maximum filter)
    eroded_background = binary_erosion(background, structure=neighborhood, border_value=1)

    # we obtain the final mask, containing only peaks,
    # by removing the background from the local_max mask (xor operation)
    detected_peaks = local_max ^ eroded_background

    return detected_peaks

# applying the detection and plotting results
for i, paw in enumerate(paws):
    detected_peaks = detect_peaks(paw)
    pp.subplot(4, 2, (2*i + 1))
    pp.imshow(paw)
    pp.subplot(4, 2, (2*i + 2))
    pp.imshow(detected_peaks)

pp.show()

All you need to do after that is use scipy.ndimage.measurements.label on the mask to label all distinct objects. Then you'll be able to play with them individually.

Note that the method works well because the background is not noisy. If it were, you would detect a bunch of other unwanted peaks in the background. Another important factor is the size of the neighborhood. You will need to adjust it if the peak size changes (it should remain roughly proportional).
Solution Data file: paw.txt. Source code: from scipy import * from operator import itemgetter n = 5 # how many fingers are we looking for d = loadtxt("paw.txt") width, height = d.shape # Create an array where every element is a sum of 2x2 squares. fourSums = d[:-1,:-1] + d[1:,:-1] + d[1:,1:] + d[:-1,1:] # Find positions of the fingers. # Pair each sum with its position number (from 0 to width*height-1), pairs = zip(arange(width*height), fourSums.flatten()) # Sort by descending sum value, filter overlapping squares def drop_overlapping(pairs): no_overlaps = [] def does_not_overlap(p1, p2): i1, i2 = p1[0], p2[0] r1, col1 = i1 / (width-1), i1 % (width-1) r2, col2 = i2 / (width-1), i2 % (width-1) return (max(abs(r1-r2),abs(col1-col2)) >= 2) for p in pairs: if all(map(lambda prev: does_not_overlap(p,prev), no_overlaps)): no_overlaps.append(p) return no_overlaps pairs2 = drop_overlapping(sorted(pairs, key=itemgetter(1), reverse=True)) # Take the first n with the heighest values positions = pairs2[:n] # Print results print d, "\n" for i, val in positions: row = i / (width-1) column = i % (width-1) print "sum = %f # %d,%d (%d)" % (val, row, column, i) print d[row:row+2,column:column+2], "\n" Output without overlapping squares. It seems that the same areas are selected as in your example. Some comments The tricky part is to calculate sums of all 2x2 squares. I assumed you need all of them, so there might be some overlapping. I used slices to cut the first/last columns and rows from the original 2D array, and then overlapping them all together and calculating sums. To understand it better, imaging a 3x3 array: >>> a = arange(9).reshape(3,3) ; a array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) Then you can take its slices: >>> a[:-1,:-1] array([[0, 1], [3, 4]]) >>> a[1:,:-1] array([[3, 4], [6, 7]]) >>> a[:-1,1:] array([[1, 2], [4, 5]]) >>> a[1:,1:] array([[4, 5], [7, 8]]) Now imagine you stack them one above the other and sum elements at the same positions. These sums will be exactly the same sums over the 2x2 squares with the top-left corner in the same position: >>> sums = a[:-1,:-1] + a[1:,:-1] + a[:-1,1:] + a[1:,1:]; sums array([[ 8, 12], [20, 24]]) When you have the sums over 2x2 squares, you can use max to find the maximum, or sort, or sorted to find the peaks. To remember positions of the peaks I couple every value (the sum) with its ordinal position in a flattened array (see zip). Then I calculate row/column position again when I print the results. Notes I allowed for the 2x2 squares to overlap. Edited version filters out some of them such that only non-overlapping squares appear in the results. Choosing fingers (an idea) Another problem is how to choose what is likely to be fingers out of all the peaks. I have an idea which may or may not work. I don't have time to implement it right now, so just pseudo-code. I noticed that if the front fingers stay on almost a perfect circle, the rear finger should be inside of that circle. Also, the front fingers are more or less equally spaced. We may try to use these heuristic properties to detect the fingers. 
Pseudo code:

select the top N finger candidates (not too many, 10 or 12)
consider all possible combinations of 5 out of N (use itertools.combinations)
for each combination of 5 fingers:
    for each finger out of 5:
        fit the best circle to the remaining 4
        => position of the center, radius
        check if the selected finger is inside of the circle
        check if the remaining four are evenly spread
        (for example, consider angles from the center of the circle)
        assign some cost (penalty) to this selection of 4 peaks + a rear finger
        (consider, probably weighted: circle fitting error, whether the rear finger is inside,
         variance in the spreading of the front fingers, total intensity of 5 peaks)
choose the combination of 4 peaks + a rear peak with the lowest penalty

This is a brute-force approach. If N is relatively small, then I think it is doable. For N=12, there are C_12^5 = 792 combinations, times 5 ways to select a rear finger, so 3960 cases to evaluate for every paw.
This is an image registration problem. The general strategy is: Have a known example, or some kind of prior on the data. Fit your data to the example, or fit the example to your data. It helps if your data is roughly aligned in the first place. Here's a rough and ready approach, "the dumbest thing that could possibly work": Start with five toe coordinates in roughly the place you expect. With each one, iteratively climb to the top of the hill. i.e. given current position, move to maximum neighbouring pixel, if its value is greater than current pixel. Stop when your toe coordinates have stopped moving. To counteract the orientation problem, you could have 8 or so initial settings for the basic directions (North, North East, etc). Run each one individually and throw away any results where two or more toes end up at the same pixel. I'll think about this some more, but this kind of thing is still being researched in image processing - there are no right answers! Slightly more complex idea: (weighted) K-means clustering. It's not that bad. Start with five toe coordinates, but now these are "cluster centres". Then iterate until convergence: Assign each pixel to the closest cluster (just make a list for each cluster). Calculate the center of mass of each cluster. For each cluster, this is: Sum(coordinate * intensity value)/Sum(coordinate) Move each cluster to the new centre of mass. This method will almost certainly give much better results, and you get the mass of each cluster which may help in identifying the toes. (Again, you've specified the number of clusters up front. With clustering you have to specify the density one way or another: Either choose the number of clusters, appropriate in this case, or choose a cluster radius and see how many you end up with. An example of the latter is mean-shift.) Sorry about the lack of implementation details or other specifics. I would code this up but I've got a deadline. If nothing else has worked by next week let me know and I'll give it a shot.
Using persistent homology to analyze your data set, I get the following result:

This is the 2D version of the peak detection method described in this SO answer. The above figure simply shows the 0-dimensional persistent homology classes sorted by persistence. I did upscale the original dataset by a factor of 2 using scipy.misc.imresize(). However, note that I did consider the four paws as one dataset; splitting it into four would make the problem easier.

Methodology. The idea behind this is quite simple: consider the function graph of the function that assigns each pixel its level. It looks like this: Now consider a water level at height 255 that continuously descends to lower levels. At local maxima, islands pop up (birth). At saddle points two islands merge; we consider the lower island to be merged into the higher island (death). The so-called persistence diagram (of the 0-th dimensional homology classes, our islands) depicts death over birth values of all islands: The persistence of an island is then the difference between the birth and death level, i.e. the vertical distance of a dot to the grey main diagonal. The figure labels the islands by decreasing persistence.

The very first picture shows the locations of the births of the islands. This method not only gives the local maxima but also quantifies their "significance" by the above-mentioned persistence. One would then filter out all islands with too low a persistence. However, in your example every island (i.e., every local maximum) is a peak you are looking for.

Python code can be found here.
This problem has been studied in some depth by physicists. There is a good implementation in ROOT. Look at the TSpectrum classes (especially TSpectrum2 for your case) and their documentation.

References:
M. Morhac et al.: Background elimination methods for multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 401 (1997) 113-132.
M. Morhac et al.: Efficient one- and two-dimensional Gold deconvolution and its application to gamma-ray spectra decomposition. Nuclear Instruments and Methods in Physics Research A 401 (1997) 385-408.
M. Morhac et al.: Identification of peaks in multidimensional coincidence gamma-ray spectra. Nuclear Instruments and Methods in Physics Research A 443 (2000) 108-125.

...and for those who don't have access to a subscription to NIM: Spectrum.doc, SpectrumDec.ps.gz, SpectrumSrc.ps.gz, SpectrumBck.ps.gz
I'm sure you have enough to go on by now, but I can't help but suggest using the k-means clustering method. k-means is an unsupervised clustering algorithm which will take your data (in any number of dimensions; I happen to do this in 3D) and arrange it into k clusters with distinct boundaries. It's nice here because you know exactly how many toes these canines (should) have. Additionally, it's implemented in SciPy, which is really nice (http://docs.scipy.org/doc/scipy/reference/cluster.vq.html). Here's an example of what it can do to spatially resolve 3D clusters: What you want to do is a bit different (2D and includes pressure values), but I still think you could give it a shot.
Here is an idea: you calculate the (discrete) Laplacian of the image. I would expect it to be (negative and) large at maxima, in a way that is more dramatic than in the original images. Thus, maxima could be easier to find.
Here is another idea: if you know the typical size of the high-pressure spots, you can first smooth your image by convolving it with a Gaussian of the same size. This may give you simpler images to process. (A sketch of both ideas follows.)
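A minimal sketch of both ideas with SciPy, assuming paw is one of the 2D pressure arrays and sigma is a guess at the spot size in pixels:

from scipy import ndimage
# smooth with a Gaussian roughly matching the spot size ...
smoothed = ndimage.gaussian_filter(paw, sigma=2)
# ... or go straight to the Laplacian of Gaussian, which is strongly negative at maxima
log_image = ndimage.gaussian_laplace(paw, sigma=2)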
Just a couple of ideas off the top of my head:

take the gradient (derivative) of the scan, see if that eliminates the false calls
take the maximum of the local maxima

You might also want to take a look at OpenCV, it's got a fairly decent Python API and might have some functions you'd find useful.
thanks for the raw data. I'm on the train and this is as far as I've gotten (my stop is coming up). I massaged your txt file with regexps and have plopped it into a html page with some javascript for visualization. I'm sharing it here because some, like myself, might find it more readily hackable than python. I think a good approach will be scale and rotation invariant, and my next step will be to investigate mixtures of gaussians. (each paw pad being the center of a gaussian). <html> <head> <script type="text/javascript" src="http://vis.stanford.edu/protovis/protovis-r3.2.js"></script> <script type="text/javascript"> var heatmap = [[[0,0,0,0,0,0,0,4,4,0,0,0,0], [0,0,0,0,0,7,14,22,18,7,0,0,0], [0,0,0,0,11,40,65,43,18,7,0,0,0], [0,0,0,0,14,61,72,32,7,4,11,14,4], [0,7,14,11,7,22,25,11,4,14,65,72,14], [4,29,79,54,14,7,4,11,18,29,79,83,18], [0,18,54,32,18,43,36,29,61,76,25,18,4], [0,4,7,7,25,90,79,36,79,90,22,0,0], [0,0,0,0,11,47,40,14,29,36,7,0,0], [0,0,0,0,4,7,7,4,4,4,0,0,0] ],[ [0,0,0,4,4,0,0,0,0,0,0,0,0], [0,0,11,18,18,7,0,0,0,0,0,0,0], [0,4,29,47,29,7,0,4,4,0,0,0,0], [0,0,11,29,29,7,7,22,25,7,0,0,0], [0,0,0,4,4,4,14,61,83,22,0,0,0], [4,7,4,4,4,4,14,32,25,7,0,0,0], [4,11,7,14,25,25,47,79,32,4,0,0,0], [0,4,4,22,58,40,29,86,36,4,0,0,0], [0,0,0,7,18,14,7,18,7,0,0,0,0], [0,0,0,0,4,4,0,0,0,0,0,0,0], ],[ [0,0,0,4,11,11,7,4,0,0,0,0,0], [0,0,0,4,22,36,32,22,11,4,0,0,0], [4,11,7,4,11,29,54,50,22,4,0,0,0], [11,58,43,11,4,11,25,22,11,11,18,7,0], [11,50,43,18,11,4,4,7,18,61,86,29,4], [0,11,18,54,58,25,32,50,32,47,54,14,0], [0,0,14,72,76,40,86,101,32,11,7,4,0], [0,0,4,22,22,18,47,65,18,0,0,0,0], [0,0,0,0,4,4,7,11,4,0,0,0,0], ],[ [0,0,0,0,4,4,4,0,0,0,0,0,0], [0,0,0,4,14,14,18,7,0,0,0,0,0], [0,0,0,4,14,40,54,22,4,0,0,0,0], [0,7,11,4,11,32,36,11,0,0,0,0,0], [4,29,36,11,4,7,7,4,4,0,0,0,0], [4,25,32,18,7,4,4,4,14,7,0,0,0], [0,7,36,58,29,14,22,14,18,11,0,0,0], [0,11,50,68,32,40,61,18,4,4,0,0,0], [0,4,11,18,18,43,32,7,0,0,0,0,0], [0,0,0,0,4,7,4,0,0,0,0,0,0], ],[ [0,0,0,0,0,0,4,7,4,0,0,0,0], [0,0,0,0,4,18,25,32,25,7,0,0,0], [0,0,0,4,18,65,68,29,11,0,0,0,0], [0,4,4,4,18,65,54,18,4,7,14,11,0], [4,22,36,14,4,14,11,7,7,29,79,47,7], [7,54,76,36,18,14,11,36,40,32,72,36,4], [4,11,18,18,61,79,36,54,97,40,14,7,0], [0,0,0,11,58,101,40,47,108,50,7,0,0], [0,0,0,4,11,25,7,11,22,11,0,0,0], [0,0,0,0,0,4,0,0,0,0,0,0,0], ],[ [0,0,4,7,4,0,0,0,0,0,0,0,0], [0,0,11,22,14,4,0,4,0,0,0,0,0], [0,0,7,18,14,4,4,14,18,4,0,0,0], [0,4,0,4,4,0,4,32,54,18,0,0,0], [4,11,7,4,7,7,18,29,22,4,0,0,0], [7,18,7,22,40,25,50,76,25,4,0,0,0], [0,4,4,22,61,32,25,54,18,0,0,0,0], [0,0,0,4,11,7,4,11,4,0,0,0,0], ],[ [0,0,0,0,7,14,11,4,0,0,0,0,0], [0,0,0,4,18,43,50,32,14,4,0,0,0], [0,4,11,4,7,29,61,65,43,11,0,0,0], [4,18,54,25,7,11,32,40,25,7,11,4,0], [4,36,86,40,11,7,7,7,7,25,58,25,4], [0,7,18,25,65,40,18,25,22,22,47,18,0], [0,0,4,32,79,47,43,86,54,11,7,4,0], [0,0,0,14,32,14,25,61,40,7,0,0,0], [0,0,0,0,4,4,4,11,7,0,0,0,0], ],[ [0,0,0,0,4,7,11,4,0,0,0,0,0], [0,4,4,0,4,11,18,11,0,0,0,0,0], [4,11,11,4,0,4,4,4,0,0,0,0,0], [4,18,14,7,4,0,0,4,7,7,0,0,0], [0,7,18,29,14,11,11,7,18,18,4,0,0], [0,11,43,50,29,43,40,11,4,4,0,0,0], [0,4,18,25,22,54,40,7,0,0,0,0,0], [0,0,4,4,4,11,7,0,0,0,0,0,0], ],[ [0,0,0,0,0,7,7,7,7,0,0,0,0], [0,0,0,0,7,32,32,18,4,0,0,0,0], [0,0,0,0,11,54,40,14,4,4,22,11,0], [0,7,14,11,4,14,11,4,4,25,94,50,7], [4,25,65,43,11,7,4,7,22,25,54,36,7], [0,7,25,22,29,58,32,25,72,61,14,7,0], [0,0,4,4,40,115,68,29,83,72,11,0,0], [0,0,0,0,11,29,18,7,18,14,4,0,0], [0,0,0,0,0,4,0,0,0,0,0,0,0], ] ]; </script> </head> <body> <script type="text/javascript+protovis"> 
for (var a=0; a < heatmap.length; a++) { var w = heatmap[a][0].length, h = heatmap[a].length; var vis = new pv.Panel() .width(w * 6) .height(h * 6) .strokeStyle("#aaa") .lineWidth(4) .antialias(true); vis.add(pv.Image) .imageWidth(w) .imageHeight(h) .image(pv.Scale.linear() .domain(0, 99, 100) .range("#000", "#fff", '#ff0a0a') .by(function(i, j) heatmap[a][j][i])); vis.render(); } </script> </body> </html>
Physicist's solution: define 5 paw-markers identified by their positions X_i and initialize them with random positions. Define some energy function combining a reward for locating markers at paw positions with a punishment for overlap of markers; let's say:

E(X_i; S) = -Sum_i(S(X_i)) + alfa * Sum_ij(|X_i - X_j| <= 2*sqrt(2) ? 1 : 0)

(S(X_i) is the mean force in the 2x2 square around X_i, and alfa is a parameter to be picked experimentally.)

Now it's time to do some Metropolis-Hastings magic:
1. Select a random marker and move it by one pixel in a random direction.
2. Calculate dE, the difference of energy this move caused.
3. Get a uniform random number from 0-1 and call it r.
4. If dE < 0 or exp(-beta*dE) > r, accept the move and go to 1; if not, undo the move and go to 1.

This should be repeated until the markers converge to the paws. Beta controls the scanning-versus-optimizing tradeoff, so it should also be optimized experimentally; it can also be constantly increased with the simulation time (simulated annealing).
Just wanna tell you guys there is a nice option to find local maxima in images with Python:

from skimage.feature import peak_local_max

or for skimage 0.8.0:

from skimage.feature.peak import peak_local_max

http://scikit-image.org/docs/0.8.0/api/skimage.feature.peak.html
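Usage is essentially a one-liner; a sketch assuming paw is one of the 2D arrays from the question (min_distance is a guess):

from skimage.feature import peak_local_max
coordinates = peak_local_max(paw, min_distance=2)  # array of (row, col) peak positions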
It's probably worth trying neural networks if you are able to create some training data... but this needs many samples annotated by hand.
Here's another approach that I used when doing something similar for a large telescope:

1) Search for the highest pixel. Once you have that, search around it for the best 2x2 fit (maybe maximizing the 2x2 sum), or do a 2D Gaussian fit inside a sub-region of, say, 4x4 centered on the highest pixel.

2) Then set those 2x2 pixels you have found (or maybe 3x3 around the peak center) to zero.

3) Go back to 1) and repeat until the highest peak falls below a noise threshold, or you have all the toes you need.
A rough outline... you'd probably want to use a connected-components algorithm to isolate each paw region. The wiki has a decent description of this (with some code) here: http://en.wikipedia.org/wiki/Connected_Component_Labeling

You'll have to make a decision about whether to use 4- or 8-connectedness. Personally, for most problems I prefer 6-connectedness. Anyway, once you've separated out each "paw print" as a connected region, it should be easy enough to iterate through the region and find the maxima. Once you've found the maxima, you could iteratively enlarge the region until you reach a predetermined threshold in order to identify it as a given "toe".

One subtle problem here is that as soon as you start using computer vision techniques to identify something as a right/left/front/rear paw and you start looking at individual toes, you have to start taking rotations, skews, and translations into account. This is accomplished through the analysis of so-called "moments". There are a few different moments to consider in vision applications:

central moments: translation invariant
normalized moments: scaling and translation invariant
hu moments: translation, scale, and rotation invariant

More information about moments can be found by searching for "image moments" on the wiki.
Perhaps you can use something like Gaussian Mixture Models. Here's a Python package for doing GMMs (just did a Google search) http://www.ar.media.kyoto-u.ac.jp/members/david/softwares/em/
Interesting problem. The solution I would try is the following:

1. Apply a low-pass filter, such as convolution with a 2D Gaussian mask. This will give you a bunch of (probably, but not necessarily, floating point) values.

2. Perform a 2D non-maximal suppression using the known approximate radius of each paw pad (or toe). This should give you the maximal positions without multiple candidates which are close together.

Just to clarify, the radius of the mask in step 1 should also be similar to the radius used in step 2. This radius could be selectable, or the vet could explicitly measure it beforehand (it will vary with age/breed/etc).

Some of the solutions suggested (mean shift, neural nets, and so on) probably will work to some degree, but are overly complicated and probably not ideal.
It seems you can cheat a bit using jetxee's algorithm. He is finding the first three toes fine, and you should be able to guess where the fourth is based on that.
Well, here's some simple and not terribly efficient code, but for this size of a data set it is fine.

import numpy as np

grid = np.array([[0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                 [0,0,0,0,0,0,0,0,0.4,0.4,0.4,0,0,0],
                 [0,0,0,0,0.4,1.4,1.4,1.8,0.7,0,0,0,0,0],
                 [0,0,0,0,0.4,1.4,4,5.4,2.2,0.4,0,0,0,0],
                 [0,0,0.7,1.1,0.4,1.1,3.2,3.6,1.1,0,0,0,0,0],
                 [0,0.4,2.9,3.6,1.1,0.4,0.7,0.7,0.4,0.4,0,0,0,0],
                 [0,0.4,2.5,3.2,1.8,0.7,0.4,0.4,0.4,1.4,0.7,0,0,0],
                 [0,0,0.7,3.6,5.8,2.9,1.4,2.2,1.4,1.8,1.1,0,0,0],
                 [0,0,1.1,5,6.8,3.2,4,6.1,1.8,0.4,0.4,0,0,0],
                 [0,0,0.4,1.1,1.8,1.8,4.3,3.2,0.7,0,0,0,0,0],
                 [0,0,0,0,0,0.4,0.7,0.4,0,0,0,0,0,0]])

arr = []
for i in range(grid.shape[0] - 1):
    for j in range(grid.shape[1] - 1):
        tot = grid[i][j] + grid[i+1][j] + grid[i][j+1] + grid[i+1][j+1]
        arr.append([(i, j), tot])

best = []
arr.sort(key=lambda x: x[1])
for i in range(5):
    best.append(arr.pop())
    badpos = set([(best[-1][0][0]+x, best[-1][0][1]+y)
                  for x in [-1, 0, 1] for y in [-1, 0, 1] if x != 0 or y != 0])
    for j in range(len(arr)-1, -1, -1):
        if arr[j][0] in badpos:
            arr.pop(j)

for item in best:
    print(grid[item[0][0]:item[0][0]+2, item[0][1]:item[0][1]+2])

I basically just make an array with the position of the upper-left corner and the sum of each 2x2 square, and sort it by the sum. I then take the 2x2 square with the highest sum out of contention, put it in the best array, and remove all other 2x2 squares that used any part of this just-removed 2x2 square.

It seems to work fine except with the last paw (the one with the smallest sum on the far right in your first picture); it turns out that there are two other eligible 2x2 squares with a larger sum (and they have an equal sum to each other). One of them still selects one square from your 2x2 square, but the other is off to the left. Fortunately, by luck we seem to be choosing more of the one that you would want, but this may require some other ideas to be used to get what you actually want all of the time.
I am not sure this answers the question, but it seems like you can just look for the n highest peaks that don't have neighbors. Here is the gist. Note that it's in Ruby, but the idea should be clear. require 'pp' NUM_PEAKS = 5 NEIGHBOR_DISTANCE = 1 data = [[1,2,3,4,5], [2,6,4,4,6], [3,6,7,4,3], ] def tuples(matrix) tuples = [] matrix.each_with_index { |row, ri| row.each_with_index { |value, ci| tuples << [value, ri, ci] } } tuples end def neighbor?(t1, t2, distance = 1) [1,2].each { |axis| return false if (t1[axis] - t2[axis]).abs > distance } true end # convert the matrix into a sorted list of tuples (value, row, col), highest peaks first sorted = tuples(data).sort_by { |tuple| tuple.first }.reverse # the list of peaks that don't have neighbors non_neighboring_peaks = [] sorted.each { |candidate| # always take the highest peak if non_neighboring_peaks.empty? non_neighboring_peaks << candidate puts "took the first peak: #{candidate}" else # check that this candidate doesn't have any accepted neighbors is_ok = true non_neighboring_peaks.each { |accepted| if neighbor?(candidate, accepted, NEIGHBOR_DISTANCE) is_ok = false break end } if is_ok non_neighboring_peaks << candidate puts "took #{candidate}" else puts "denied #{candidate}" end end } pp non_neighboring_peaks
Maybe a naive approach is sufficient here: Build a list of all 2x2 squares on your plane, order them by their sum (in descending order). First, select the highest-valued square into your "paw list". Then, iteratively pick 4 of the next-best squares that don't intersect with any of the previously found squares.
What if you proceed step by step: you first locate the global maximum, process if needed the surrounding points given their value, then set the found region to zero, and repeat for the next one.
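A minimal sketch of that loop, assuming grid is the 2D pressure array and zeroing out a 3x3 region around each maximum (the region size and the count of 5 toes are assumptions):

import numpy as np

peaks = []
work = grid.copy()
for _ in range(5):  # looking for 5 toes
    r, c = np.unravel_index(np.argmax(work), work.shape)
    peaks.append((r, c))
    # zero out a 3x3 neighborhood so the next argmax finds a different toe
    work[max(r-1, 0):r+2, max(c-1, 0):c+2] = 0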