How to compare two 3D curves in Python?

I have a huge array of coordinates describing a 3D curve, ~20000 points. I am trying to use fewer points by ignoring some, say taking 1 out of every 2 points. When I do this and plot the reduced set of points, the shape looks the same. However, I would like to compare the two curves properly, similar to a chi-squared test, to see how much the reduced plot differs from the original.
Is there an easy, built-in way of doing this, or does anyone have any ideas on how to approach the problem?

The general question of "line simplification" is an entire field of research. I recommend having a look, for instance, at the Ramer–Douglas–Peucker algorithm. There are several Python modules for it; I could find rdp and simplification (which also implements the Visvalingam–Whyatt algorithm).
Anyway, here is an attempt at evaluating the difference between two polylines using interpolation. Any two curves can be compared, even if they have no points in common.
The first idea is to compute the distance along the path for both polylines. They are used as landmarks to go from one given point on the first curve to a relatively close point on the other curve.
Then, the points of the first curve can be interpolated on the other curve. These two datasets can now be compared, point by point.
On the graph, the black curve is curve xy1 resampled at the arc-length positions of xy2's nodes. So the distances between the black squares and the orange circles can be computed and averaged.
This gives an average distance measure, but nothing to compare against and decide if the applied reduction is good enough...
import numpy as np
from scipy.interpolate import interp1d

def normed_distance_along_path(polyline):
    polyline = np.asarray(polyline)
    distance = np.cumsum(np.sqrt(np.sum(np.diff(polyline, axis=1)**2, axis=0)))
    return np.insert(distance, 0, 0)/distance[-1]

def average_distance_between_polylines(xy1, xy2):
    s1 = normed_distance_along_path(xy1)
    s2 = normed_distance_along_path(xy2)

    interpol_xy1 = interp1d(s1, xy1)
    xy1_on_2 = interpol_xy1(s2)
    node_to_node_distance = np.sqrt(np.sum((xy1_on_2 - xy2)**2, axis=0))

    return node_to_node_distance.mean()  # or use the max

# Two example polylines, given as (x_list, y_list); it should work in 3D too:
xy1 = [0, 1, 8, 2, 1.7], [1, 0, 6, 7, 1.9]
xy2 = [.1, .6, 4, 8.3, 2.1, 2.2, 2], [.8, .1, 2, 6.4, 6.7, 4.4, 2.3]

average_distance_between_polylines(xy1, xy2)  # 0.45004578069119189

If you subsample the original curve, a simple way to assess the approximation error is by computing the maximum distance between the original curve and the line segments between the resampled vertices. The maximum distance occurs at the original vertices and it suffices to evaluate at these points only.
By the way, this provides a simple way to perform the subsampling by setting a maximum tolerance and decimating until the tolerance is exceeded.
You can also think of computing the average distance, but this probably involves nasty integrals and might give less visually pleasing results.
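To make that concrete, here is a minimal sketch of the max-deviation idea (the function names and the (N, ndim) array layout are my own assumptions, not from the answer above): decimate the curve, then take, for every skipped original vertex, its distance to the segment between the two neighbouring kept vertices, and report the maximum.
import numpy as np

def point_to_segment_distance(p, a, b):
    """Distance from point p to segment a-b (works in 2D or 3D)."""
    ab = b - a
    denom = np.dot(ab, ab)
    if denom == 0.0:
        return np.linalg.norm(p - a)
    t = np.clip(np.dot(p - a, ab) / denom, 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def max_subsampling_error(points, step=2):
    """Max distance from the skipped vertices to the decimated polyline."""
    points = np.asarray(points, dtype=float)   # shape (N, ndim)
    kept = points[::step]
    worst = 0.0
    for i in range(len(kept) - 1):
        a, b = kept[i], kept[i + 1]
        # original vertices that were dropped between the two kept vertices
        for p in points[i * step + 1:(i + 1) * step]:
            worst = max(worst, point_to_segment_distance(p, a, b))
    # (any tail vertices after the last kept vertex are ignored in this sketch)
    return worst

curve = np.random.rand(20, 3)                  # a made-up 3D curve
print(max_subsampling_error(curve, step=2))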

Related

How to determine if an object is flat or not from depth image?

I have a 2x2 matrix of distances from a depth sensor.
The matrix is cropped so that only the points we are interested in are in the frame (all the points in the cropped image belong to the object).
My question is how can we determine if this object is flat or not?
The depth image is acquired from Realsense d435. I read the depth image and then multiply it by depth_scale.
The object is recognized using AI for the rgb image that is aligned with the depth image.
And I have 4 points on the object. So, all the distances in that rectangle contain distances of the object from the sensor.
My first idea was to use the standard deviation of all the points. But this falls apart if the image is taken from an angle (since the standard deviation won't be 0).
From an angle, the distance of a flat object changes uniformly along the y axis. Maybe we can use this information somehow?
The 2x2 matrix is a numpy array in python. Maybe there are some libraries which do this already.
After reprojecting your four depth measurements to the 3D space, it becomes a problem of deciding if your set of points is coplanar. There are several ways you can go about it.
One way to do it is to reproject the points to 3D and fit a plane to all four of them there. Since you're fitting a plane to four points in three dimensions, you get an over-determined system, and it's very unlikely that all points would lie exactly on the estimated plane. At this stage, you could prescribe some tolerance to determine "goodness of fit". For instance, you could look at the R^2 coefficient.
To fit the plane you can use scipy.linalg.lstsq. Here's a good description of how it can be done: Fit plane to a set of points in 3D.
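For illustration, here is a minimal sketch of such a least-squares plane fit (the sample coordinates are made up; in practice you would pass your four reprojected points). It fits z = a*x + b*y + c and reports the deviation from the plane along z.
import numpy as np
from scipy.linalg import lstsq

# four (or more) reprojected 3D points, one per row -- made-up values
pts = np.array([[0.0, 0.0, 1.00],
                [0.1, 0.0, 1.01],
                [0.0, 0.1, 0.99],
                [0.1, 0.1, 1.02]])

# fit z = a*x + b*y + c in the least-squares sense
A = np.c_[pts[:, 0], pts[:, 1], np.ones(len(pts))]
coeffs, residues, rank, sv = lstsq(A, pts[:, 2])

# root-mean-square deviation of the points from the fitted plane (along z)
rms = np.sqrt(np.mean((A @ coeffs - pts[:, 2]) ** 2))
print(coeffs, rms)   # call it "flat" if rms is below some tolerance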
Another way to approach the problem is by calculating the volume of the tetrahedron spanned by the four points in 3D. If they are coplanar (or close to coplanar), the volume of such a tetrahedron should be equal to (or close to) 0. Assuming your points reprojected to 3D can be described by (x_0, y_0, z_0), ..., (x_3, y_3, z_3), the volume of the tetrahedron is equal to:
volume = abs(np.linalg.det(tetrahedron)) / 6, where
tetrahedron = np.array([[x_0, y_0, z_0, 1], [x_1, y_1, z_1, 1], [x_2, y_2, z_2, 1], [x_3, y_3, z_3, 1]])
To check whether your points are on the same plane (equivalently, whether the tetrahedron has a small enough volume), it is then sufficient to check whether
volume < TOL
for some defined small tolerance value, which must be determined experimentally.
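Putting the volume test together, a small self-contained sketch might look like this (the TOL value and the sample points are placeholders of my own, to be replaced with your tolerance and your reprojected points):
import numpy as np

def tetrahedron_volume(p0, p1, p2, p3):
    """Volume of the tetrahedron spanned by four 3D points."""
    m = np.array([np.append(p, 1.0) for p in (p0, p1, p2, p3)])
    return abs(np.linalg.det(m)) / 6.0

TOL = 1e-3  # tolerance, to be tuned experimentally
pts = [np.array([0.0, 0.0, 0.0]),
       np.array([1.0, 0.0, 0.0]),
       np.array([0.0, 1.0, 0.0]),
       np.array([1.0, 1.0, 0.002])]
print(tetrahedron_volume(*pts) < TOL)  # True -> (nearly) coplanar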
You can define a plane by choosing three of the four 3D points.
Then evaluate the distance from the remaining point to that plane.
As for how to choose the three points: it is probably best to pick the triple that maximizes the area of the triangle.
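A minimal sketch of that point-to-plane distance check (the helper name and the sample points are my own, for illustration only):
import numpy as np

def point_to_plane_distance(p, a, b, c):
    """Distance from point p to the plane through points a, b and c."""
    n = np.cross(b - a, c - a)              # plane normal
    norm = np.linalg.norm(n)
    if norm == 0:
        raise ValueError("a, b and c are collinear and do not define a plane")
    return abs(np.dot(p - a, n)) / norm

a, b, c = np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 1., 0.])
p = np.array([0.5, 0.5, 0.01])
print(point_to_plane_distance(p, a, b, c))  # 0.01 -> nearly flat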

Shifting a curve to another curve horizontally

Does anyone have an idea how to fit a curve to another curve, simply by shifting it to the right? For example, in this plot, I want to shift the orange curve to the right (no vertical shift!) so that the curves overlap each other. Can anyone help me do this?
Data of the curves:
y1 = [1.2324, 1.4397, 1.5141, 1.7329, 1.9082, 2.2884, 2.166, 2.8175, 3.1014, 2.8893, 3.673, 4.3875, 4.9817, 5.6906, 6.3667, 7.2854, 8.2703, 9.3432, 10.591, 11.963, 13.579, 15.36, 17.306, 19.508, 21.976, 24.666, 27.692, 31.026, 34.724, 38.702]
y2 = [1.6231, 1.6974, 1.8145, 2.4805, 2.5643, 2.6176, 2.9332, 3.4379, 4.0154, 4.2258, 4.6837, 5.9837, 6.4408, 7.2903, 8.2283, 9.4134, 10.537, 11.947, 13.344, 15.202, 17.073, 19.211, 21.598, 24.216, 27.06, 30.31, 33.933, 37.882, 42.201, 46.978]
x = [0.1, 0.127, 0.161, 0.204, 0.259, 0.329, 0.418, 0.53, 0.672, 0.853, 1.08, 1.37, 1.74, 2.21, 2.81, 3.56, 4.52, 5.74, 7.28, 9.24, 11.7, 14.9, 18.9, 24.0, 30.4, 38.6, 48.9, 62.1, 78.8, 100.0]
This question has a few problems that needed some fiddling to resolve. This is not the ideal solution, I'm sure, but it gives a value close enough to what was expected from the manual method (around 1.5 to 1.6).
The first roadblock is that when you shift the x values, you don't get matching y values, so calculating the residual can be tricky. I brute-forced my way through this by creating a new x array with 1000 points, then interpolating the two original y arrays onto this new x array (this comes in handy later). Therefore, when calculating the residual between the two curves, the x values will be slightly off, but not by much.
import numpy as np

reference_y = y1
to_shift_y = y2

expanded_x = np.logspace(np.log10(x[0]), np.log10(x[-1]), num=1000)
expanded_y_reference = np.interp(expanded_x, x, reference_y)
expanded_y_to_shift = np.interp(expanded_x, x, to_shift_y)
Then, when you shift x by some constant, you'll get two regions where there won't be equivalent x values.
original x: -------------------------------xxxx
shifted x: xxxx-------------------------------
I created a new x array with the shift parameter hor_shift set to some value greater than 1. Then, I found the indices where the original and shifted arrays stop overlapping.
start = np.argmax(expanded_x >= expanded_x_shifted[0])
end = np.argmin(expanded_x_shifted <= expanded_x[-1])
Since these arrays are [False, False, True, True, ...] and [True, True, ..., True, False, False], argmax returns the index of the first True and argmin the index of the first False, i.e. the first position where the value changes.
Now, we have to slice our original and shifted x arrays so they have the same size and values in common, and do the same with the expanded y arrays. Pardon the long names; it's just so I don't get confused.
expanded_x_original_in_common_with_shifted = expanded_x[start:]
expanded_x_shifted_in_common_with_original = expanded_x_shifted[:end]
sliced_expanded_y_reference = expanded_y_reference[start:]
sliced_expanded_y_to_shift = expanded_y_to_shift[:end]
And last, and most importantly, we can calculate a distance between the two curves, assuming the x values are aligned.
residual = ((sliced_expanded_y_reference - sliced_expanded_y_to_shift) ** 2).sum()
By minimizing this, we can get the ideal shift.
We can compare our curves. Here, I used two values for the shift, 1.3 and 1.56, to illustrate good and bad shift values (these were found by testing different values). The vertical lines show the region in common.
Now, we can transform this process into a function and use some minimization method to find the ideal shift value. Here's what I got.
import numpy as np
from lmfit import Parameters, minimize

par = Parameters()
# If the shift parameter is exactly 1, you get an error
par.add('shift', value=1.1, min=1)

def min_function(par, x, reference_y, to_shift_y):
    hor_shift = par['shift'].value
    # print(hor_shift)  # <- in case you want to follow the process

    expanded_x = np.logspace(np.log10(x[0]), np.log10(x[-1]), num=1000)
    expanded_x_shifted = expanded_x * hor_shift

    start = np.argmax(expanded_x >= expanded_x_shifted[0])
    end = np.argmin(expanded_x_shifted <= expanded_x[-1])

    expanded_x_original_in_common_with_shifted = expanded_x[start:]
    expanded_x_shifted_in_common_with_original = expanded_x_shifted[:end]

    expanded_y_reference = np.interp(expanded_x, x, reference_y)
    expanded_y_to_shift = np.interp(expanded_x, x, to_shift_y)

    sliced_expanded_y_reference = expanded_y_reference[start:]
    sliced_expanded_y_to_shift = expanded_y_to_shift[:end]

    residual = ((sliced_expanded_y_reference - sliced_expanded_y_to_shift) ** 2).sum()
    return residual

minimize(min_function, par, method='nelder', args=(x, reference_y, to_shift_y))
This results in an ideal shift parameter of 1.555, confirming the initial guess. Note that you have to change the residual expression to (sliced_expanded_y_reference - sliced_expanded_y_to_shift) if you want the reported chi-squared to match the one in the graphs.
The notation is changed in order to make the equations below clearer:
y(x) = y1(x)
z(x) = y2(x)
A translation by a value c on the logarithmic x scale is equivalent to an expansion by a factor b on the linear x scale, because log(x) + c = log(b x) with c = log(b).
The inverse function x = f(y) has to approximately coincide with b x = f(z). So we consider the sum of residuals [f(y) - x]^2 + [f(z) - b x]^2. This leads to the regression calculation below. The function f(y) is approximated with a polynomial of degree m.
With the given data, the shape of the curve x = f(y) is rather smooth. This suggests that a low degree m might be sufficient.
For example, with m = 2 the result is:
The black curve is the blue curve translated horizontally by c = 0.180 on the logarithmic scale.
Of course one can use a polynomial of higher degree. For example, with m = 3 we get b = 1.536 and c = 0.186.
This numerical example is a favourable case because the curve x = f(y) has a simple shape. For more complicated shapes a larger value of m would probably be necessary, with the risk of an unreliable regression.
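For completeness, here is one possible way to set that regression up as a single linear least-squares system, using the x, y1 and y2 arrays from the question. This is my own reading of the approach (the original answer's matrix formulation is not reproduced above), so treat it as a sketch: it will not necessarily reproduce the quoted figures exactly. The unknowns are the polynomial coefficients of f and the expansion factor b.
import numpy as np

# x, y1, y2 are the lists given in the question
m = 3                                   # polynomial degree of f

x_arr = np.asarray(x)
y_arr = np.asarray(y1)
z_arr = np.asarray(y2)

# unknowns: polynomial coefficients a_0..a_m of f, plus the expansion factor b
# equations:   f(y_i) = x_i      and      f(z_i) - b*x_i = 0
Vy = np.vander(y_arr, m + 1, increasing=True)     # rows [1, y, y^2, ...]
Vz = np.vander(z_arr, m + 1, increasing=True)
A = np.vstack([np.hstack([Vy, np.zeros((len(x_arr), 1))]),
               np.hstack([Vz, -x_arr[:, None]])])
rhs = np.concatenate([x_arr, np.zeros(len(x_arr))])

sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
b = sol[-1]
c = np.log10(b)   # the horizontal shift on the log10(x) scale
print(b, c)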

How to get the K most distant points, given their coordinates?

We have boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that maximise the distance between each other.
So if we have 100 points in a tightly packed cluster and one point far away, for three points we would pick the distant point plus two well-separated points from the cluster (two variants were sketched in the question).
For 4 points it becomes more interesting, and some point in the middle gets picked as well.
So how do we select the K most distant rows (points) out of N (with any complexity)? It looks like an N-dimensional point-cloud "triangulation" with a given resolution, yet not for 3D points.
I am searching for a reasonably fast approach (approximate, no precise solution needed) for K = 200, N = 100000 and ND = 6 (probably something multigrid- or ANN-based on a KDTree, SOM- or triangulation-based...). Does anyone know one?
From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean works very well. As someone noted above, it's probably hard to avoid a loop over all combinations (not over all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist

Npoints = 3  # or 4 or 5...

# making up some data:
data = np.array([[3, 2, 4, 3, 4],
                 [23, 25, 30, 21, 27],
                 [6, 7, 8, 7, 9],
                 [5, 5, 6, 6, 7],
                 [0, 1, 2, 0, 2],
                 [3, 9, 1, 6, 5],
                 [0, 0, 12, 2, 7]])

# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints)]

distances = []
for i in c:
    # pdist computes all pairwise Euclidean distances in condensed form
    distances.append(np.mean(pdist(data[i, :])))

ind = distances.index(max(distances))  # index of the max mean distance
rows = c[ind]                          # these are the points in question
I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
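As a minimal illustration of that adjustment (the data_raw name and the scaling factors here are made up, and this is separate from the synthetic demo data generated in the code below):
import numpy as np

# hypothetical raw table of shape (N, ND), e.g. loaded from the CSV
data_raw = np.random.rand(1000, 6) * [100000, 10, 1, 50, 5, 1]

# z-score each column so that unit differences are comparable across dimensions
data = (data_raw - data_raw.mean(axis=0)) / data_raw.std(axis=0)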
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
    if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
        raise ValueError("bad ndarray dimensions combination")
    return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
    for i in range(K):
        # remove this point from the K points
        old_index = indices[i]
        # calculate its sum of distances from the K points
        distsums[old_index] = distances(data[indices], data[old_index]).sum()
        # update the sums of distances of all points from the K-1 points
        distsums -= distances(data, data[old_index])
        # choose the point with the greatest sum of distances from the K-1 points
        new_index = np.argmax(distsums)
        # add it to the K points replacing the old_index
        indices[i] = new_index
        # don't consider it any more in distsums
        distsums[new_index] = -np.inf
        # update the sums of distances of all points from the K points
        distsums += distances(data, data[new_index])
    # sum all mutual distances of the K points
    curr_sum = spatial.distance.pdist(data[indices]).sum()
    # break if the sum hasn't changed
    if curr_sum == prev_sum:
        break
    prev_sum = curr_sum
if ND == 2:
    X, Y = data.T
    marker_size = 4
    plt.scatter(X, Y, s=marker_size)
    plt.scatter(X[indices], Y[indices], s=marker_size)
    plt.grid(True)
    plt.gca().set_aspect('equal', adjustable='box')
    plt.show()
Output: (plot omitted)
Splitting the data into 3 equidistant Gaussian distributions, the output is as follows: (plot omitted)
Assuming you read your csv file with N (10000) rows and D dimensions (or features) into an N*D matrix X, you can calculate the distance between each pair of points and store it in a distance matrix as follows:
import numpy as np

X = np.asarray(X)  # convert to numpy array
distance_matrix = np.zeros((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    for j in range(i + 1, X.shape[0]):
        # Only the upper triangle is computed: the distance from A to B equals the distance from B to A.
        distance_matrix[i][j] = np.linalg.norm(X[i] - X[j])  # Euclidean distance; other distance measures can also be used
# distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix))  # mirror to the full matrix, not really required here

K = 5  # Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1 * K:], distance_matrix.shape)
print(indexes)
Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this is an interesting question, but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2D, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly, so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
# (assumed setup, not shown in the original snippet: lin_2_D is a pandas
#  DataFrame of 100 roughly linear 2D points with columns 'x' and 'y')
import seaborn as sns

right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])

graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color='gray', marker='+', alpha=.4)
sns.scatterplot(x=right['x'], y=right['y'], color='red')
sns.scatterplot(x=left['x'], y=left['y'], color='green')

fig = graph.figure
fig.set_size_inches(8, 3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks like his bottom diagram, and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions, you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for the k furthest points but have a large number of equally valid possible solutions and no way of prioritizing them. Here are two super easy examples that illustrate this:
A) Here we have just four points, and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs: you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3: Revenge of the Curse of Dimensionality. Because you're looking for the most distant points, you have to have x, y, z ... n coordinates for each point, or you have to impute them. Now, your data set is much larger and slower to process.
Challenge 4 Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k), but the total dimensional volume of the space will increase to the power of the dimensions. The k number of furthest points you're seeking is insignificant compared to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increases.
Now, if you had low dimensionality, I would go with those answers as a solution (except the ones that use nested for loops over NumPy or pandas data).
If I were in your position, I'd be thinking how I've got code in these other answers that I could use as a basis, and maybe wonder why I should trust this one, other than that it lays out a framework on how to think through the topic. Certainly, there should be some math, and maybe somebody important saying the same thing.
Let me refer to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's as though you put a balloon in a box. You could do this with a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the max or min values per dimension. When/if you run out of them, pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box. For a 2D graph you have four points; for 3D you have the 8 corners of the box (2^3).
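A rough sketch of that heuristic, under my own interpretation (taking the argmin and argmax of every dimension as a stand-in for "the corners of the box", and topping up randomly if that yields fewer than k points):
import numpy as np

def extreme_points(data, k):
    """Indices of up to k rows taken from the per-dimension argmin/argmax
    (a rough stand-in for 'the corners of the box')."""
    data = np.asarray(data)
    candidates = []
    for d in range(data.shape[1]):
        candidates.append(int(np.argmin(data[:, d])))
        candidates.append(int(np.argmax(data[:, d])))
    unique = list(dict.fromkeys(candidates))     # drop duplicates, keep order
    if len(unique) >= k:
        return unique[:k]
    # top up randomly if the extremes give fewer than k points
    rest = np.setdiff1d(np.arange(len(data)), unique)
    extra = np.random.choice(rest, size=k - len(unique), replace=False)
    return unique + extra.tolist()

data = np.random.rand(1000, 6)
print(extreme_points(data, 12))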
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: it's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have k < 2^D. And if you added another point, you can see that it (k + 1) would be at the centroid. So here is the check: if you had more points, where would they be? I guess I have to put this at the bottom -- a limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll have 2^(D-1) points. So, for that linear 2D space you have a line, for linear 3D space you'd have a plane, then a rhomboid, etc. This is true even if your shape is curved. Rather than make this graph myself, I'm using the one from an excellent post by Inversion Labs on best-fit surfaces for 3D data.
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for the one that is as close as possible to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive, so let's go back to the two circles. For 2D you have just two points that could be candidates for being equidistant. But if that were 3D space and you rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere; hyperspheres (n-spheres) from there on. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So it's like having 27 six-sided dice shaped like a giant cube. Each vertex (or the point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satisfice the solution.
If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
# note: this import path is from an older scikit-learn; in recent versions this
# private module has moved, and DistanceMetric may offer a "pyfunc" metric instead
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt

def inverted_euclidean(x1, x2):
    # You can speed this up using cython like scikit-learn does, or numba
    dist = np.sum((x1 - x2) ** 2)
    # We invert the euclidean distance and set nearby points to the biggest
    # possible positive float that isn't inf
    inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
    return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you only have to do about log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself requires, I think, about N log(N) comparisons, so in the end, if you want to find the k furthest points for every point in the ball tree's training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with increasing k.

Calculating the density of known groups of points

I'm currently working on a project where the interplay between AD algorithms and visualizations is analysed. I've read a great amount of literature, and concluded that to fit my needs I would like to combine several metrics. I was able to realize most of them, yet, this one is missing:
Say I have a 2D space occupied by points belonging to one of 2 classes. I would like to calculate the density of each group. The labels of the groups are known. When looking around stackoverflow etc., I read about DBSCAN a lot, but to my understanding DBSCAN is used to perform the clustering itself. I already have the clusters and would just like to know their density.
I would appreciate any suggestions or guidance toward a known metric.
If such a metric exists, please also share the needed Python libraries with me.
Thank you very much.
This is what my data can look like:
array([[-3.90611544e+00, -5.47953465e-01],
[-5.22999684e+00, 5.56145331e-01],
[-4.84611012e+00, 5.54304197e-02],
[-4.85019718e+00, -3.19791419e-01],
[-4.59453620e+00, 5.70821744e-01],
[-6.65068624e+00, -9.97229190e-01],
[-6.57787930e+00, -5.03538827e-01],
[-4.80275333e+00, -8.42197968e-02],
[-4.55720113e+00, 8.23122108e-01],
[-4.47469205e+00, -6.77669238e-01],
[-5.84095559e+00, -8.19564981e-01],
[-4.93963103e+00, -8.66167854e-01],
[-4.98336307e+00, -4.45923700e-02],
[-4.56953722e+00, -4.27976712e-01],
[-6.25553298e+00, 1.32863878e-01],
[-6.11860914e+00, -1.09009817e+00],
[-5.60347264e+00, 1.34600670e+00],
[-4.85974421e+00, -2.03600566e-01],
[-4.38049846e+00, 1.27302889e+00],
.......
which plots like this (picture not included; the question links to an external image).
I would now like to get a density value for the red and green clusters each.
Thank you very much in advance!
UPDATE: Corrected my code. Also please note that this algorithm is of complexity O(n^2): for 10240 points it takes almost 1 minute to run on a fast machine.
UPDATE 2: Return the inverse: count/total_distance
UPDATE 3: If, as you mentioned above, density visualization is a goal, I think that the sample plot you provided is, in itself, a good visual representation of density for the observer.
UPDATE 4: Based on the comment below, I eliminated double-counting; the algorithm now does half the work (O(n^2/2)) and, naturally, runs twice as fast.
A marginal improvement, especially in the case of several clusters, would be to paint the points in each cluster with a hue of a single color that varies based on the cluster's average density, say, light blue to dark blue.
As DerekG pointed out, you could use other density measures for the above scheme.
Another idea would be to compute each point's local density by counting the number of its neighbors and, if the number of neighbors exceeds a certain threshold, to visually highlight the point by coloring it with a contrasting color, say black.
Please note that the code sample I provided in this answer can easily be modified to implement any of the above-mentioned approaches, including those by DerekG.
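As an illustration of the neighbor-counting idea above, here is a small sketch using scipy's cKDTree (the sample data, radius and threshold values are arbitrary placeholders of my own):
import numpy as np
from scipy.spatial import cKDTree

# points of one cluster, shape (N, 2); radius chosen to suit the data scale
points = np.random.normal(size=(500, 2))
radius = 0.3

tree = cKDTree(points)
# number of neighbors within `radius` of each point (excluding the point itself)
neighbor_counts = np.array([len(idx) - 1
                            for idx in tree.query_ball_point(points, radius)])

threshold = 10
dense = neighbor_counts > threshold   # mask of points to highlight, e.g. in black
print(dense.sum(), "of", len(points), "points exceed the local-density threshold")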
I'm not an expert at cluster analysis but I'll try to help:
ORIGINAL ANSWER:
The answer depends on the definition of density. If you define density as the inverse of the average distance between all pairs of points in the cluster (see UPDATE 2 above), then this code is the answer:
from math import sqrt

points = [
    [1, 3],
    [2, 4],
    [9, 1],
    [2, 6],
    [5, 3],
]

def density(points):
    total_distance = 0
    count = 0
    i = 0
    for x1, y1 in points:
        # only pairs (i, j) with j > i are visited, so nothing is double-counted
        for x2, y2 in points[i + 1:]:
            count += 1
            total_distance += sqrt((x1 - x2)**2 + (y1 - y2)**2)
        i += 1
    return count / total_distance

print(density(points))
Which prints:
0.2131384729384717

Calculating Point Density using Python

I have a list of X and Y coordinates from geodata of a specific part of the world. I want to assign each coordinate, a weight, based upon where it lies in the graph.
For Example: If a point lies in a place where there are a lot of other nodes around it, it lies in a high density area, and therefore has a higher weight.
The most immediate method I can think of is drawing circles of unit radius around each point, calculating whether the other points lie within them, and then using a function to assign a weight to each point. But this seems primitive.
I've looked at pySAL and NetworkX but it looks like they work with graphs. I don't have any edges in the graph, just nodes.
A standard solution would be to use KDE (Kernel Density Estimation).
Search the web for "kernel density estimation" and you will find plenty of references; in Google, try: kernel density estimation ext:pdf.
Also, SciPy has a KDE implementation; see http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. There is working example code there.
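As a minimal illustration of that gaussian_kde route (the sample coordinates are made up; in practice you would stack your geodata X and Y lists):
import numpy as np
from scipy.stats import gaussian_kde

# made-up coordinates; in practice stack your geodata X and Y lists
x = np.random.normal(0, 1, 500)
y = np.random.normal(0, 1, 500)

kde = gaussian_kde(np.vstack([x, y]))   # fit a 2D KDE to the point cloud
weights = kde(np.vstack([x, y]))        # density evaluated at each point
print(weights[:5])                      # higher value -> denser neighbourhood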
If you have a lot of points, you may compute nearest neighbors more efficiently using a KDTree:
import numpy as np
import scipy.spatial as spatial
points = np.array([(1, 2), (3, 4), (4, 5), (100,100)])
tree = spatial.KDTree(np.array(points))
radius = 3.0
neighbors = tree.query_ball_tree(tree, radius)
print(neighbors)
# [[0, 1], [0, 1, 2], [1, 2], [3]]
tree.query_ball_tree returns indices (of points) of the nearest neighbors. For example, [0,1] (at index 0) means points[0] and points[1] are within radius distance from points[0]. [0,1,2] (at index 1) means points[0], points[1] and points[2] are within radius distance from points[1].
frequency = np.array([len(n) for n in neighbors])  # works in Python 3 (np.array(map(...)) does not)
print(frequency)
# [2 3 2 1]
density = frequency/radius**2
print(density)
# [ 0.22222222 0.33333333 0.22222222 0.11111111]
Yes, you do have edges, and they are the distances between the nodes. In your case, you have a complete graph with weighted edges.
Simply derive the distance from each node to every other node -- which gives you O(N^2) time complexity -- and use both nodes and edges as input to one of the approaches you found.
It happens, though, that your problem seems to be more of an analysis problem than anything else; you should try running some clustering algorithm on your data, like K-means, which clusters nodes based on a distance function, in which you can simply use the Euclidean distance.
The result of this algorithm is exactly what you'll need: you'll have clusters of close elements, you'll know what and how many elements are assigned to each group, and you'll be able to generate, according to these values, the coefficient you want to assign to each node.
The only concern worth pointing out here is that you'll have to determine how many clusters (the k in k-means) you want to create.
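A small sketch of that K-means route with scikit-learn (my own example, with cluster size used as a crude per-point density weight; the choice of k is up to you):
import numpy as np
from sklearn.cluster import KMeans

# made-up 2D coordinates; replace with your geodata
pts = np.random.rand(500, 2)

k = 5                                       # number of clusters, chosen up front
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pts)

# cluster sizes can serve as a crude per-point weight
sizes = np.bincount(km.labels_, minlength=k)
weights = sizes[km.labels_]                 # each point gets its cluster's size
print(weights[:10])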
Your initial inclination to draw a circle around each point and count the number of other points in that circle is a good one, and as mentioned by unutbu, a KDTree will be a fast way to solve this problem.
This can be done very easily with PySAL, which uses scipy's kdtree under the hood.
import pysal
import numpy

pts = numpy.random.random((100, 2))  # generate some random points
radius = 0.2                         # pick an arbitrary radius

# Build a Spatial Weights Matrix
W = pysal.threshold_continuousW_from_array(pts, threshold=radius)
# Note: if your points are in latitude and longitude you can increase the accuracy by
# passing the radius of the earth to this function and it will use arc distances.
# W = pysal.threshold_continuousW_from_array(pts, threshold=radius, radius=pysal.cg.RADIUS_EARTH_KM)

print(W.cardinalities)
# {0: 10, 1: 15, ..... }
If your data is in a Shapefile, simply replace threshold_continuousW_from_array with threshold_continuousW_from_shapefile, see the docs for details.
