Calculating Point Density using Python - python

I have a list of X and Y coordinates from geodata of a specific part of the world. I want to assign each coordinate, a weight, based upon where it lies in the graph.
For Example: If a point lies in a place where there are a lot of other nodes around it, it lies in a high density area, and therefore has a higher weight.
The most immediate method I can think of is drawing circles of unit radius around each point and then calculating if the other points lie within in and then using a function, assign a weight to that point. But this seems primitive.
I've looked at pySAL and NetworkX but it looks like they work with graphs. I don't have any edges in the graph, just nodes.

A standard solution would be using KDE (Kernel Density Estimation).
Search on web: "KDE Estimation" you will find enormous links.
in Google type: KDE Estimation ext:pdf
Also, Scipy has KDE, follow this http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. There is working example codes there ;)

If you have a lot of points, you may compute nearest neighbors more efficiently using a KDTree:
import numpy as np
import scipy.spatial as spatial
points = np.array([(1, 2), (3, 4), (4, 5), (100,100)])
tree = spatial.KDTree(np.array(points))
radius = 3.0
neighbors = tree.query_ball_tree(tree, radius)
print(neighbors)
# [[0, 1], [0, 1, 2], [1, 2], [3]]
tree.query_ball_tree returns indices (of points) of the nearest neighbors. For example, [0,1] (at index 0) means points[0] and points[1] are within radius distance from points[0]. [0,1,2] (at index 1) means points[0], points[1] and points[2] are within radius distance from points[1].
frequency = np.array(map(len, neighbors))
print(frequency)
# [2 3 2 1]
density = frequency/radius**2
print(density)
# [ 0.22222222 0.33333333 0.22222222 0.11111111]

Yes, you do have edges, and they are the distances between the nodes. In your case, you have a complete graph with weighted edges.
Simply derive the distance from each node to each other node -- which gives you O(N^2) in time complexity --, and use both nodes and edges as input to one of these approaches you found.
Happens though your problem seems rather an analysis problem other than anything else; you should try to run some clustering algorithm on your data, like K-means, that clusters nodes based on a distance function, in which you can simply use the euclidean distance.
The result of this algorithm is exactly what you'll need, as you'll have clusters of close elements, you'll know what and how many elements are assigned to each group, and you'll be able to, according to these values, generate the coefficient you want to assign to each node.
The only concern worth pointing out here is that you'll have to determine how many clusters -- k-means, k-clusters -- you want to create.

You initial inclination to draw a circle around each point and count the number of other points in that circle is a good one and as mentioned by unutbu, a KDTree will be a fast way to solve this problem.
This can be done very easily with PySAL, which using scipy's kdtree under the hood.
import pysal
import numpy
pts = numpy.random.random((100,2)) #generate some random points
radius = 0.2 #pick an arbitrary radius
#Build a Spatial Weights Matrix
W = pysal.threshold_continuousW_from_array(pts, threshold=radius)
# Note: if your points are in Latitude and Longitude you can increase the accuracy by
# passing the radius of earth to this function and it will use arc distances.
# W = pysal.threshold_continuousW_from_array(pts, threshold=radius, radius=pysal.cg.RADIUS_EARTH_KM)
print W.cardinalities
#{0: 10, 1: 15, ..... }
If your data is in a Shapefile, simply replace threshold_continuousW_from_array with threshold_continuousW_from_shapefile, see the docs for details.

Related

How to get the K most distant points, given their coordinates?

We have boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that would have maximised distance between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how to select K most distant rows (points) from N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution yet not for 3d points.
I search for a reasonably fast approach (approximate - no precise solution needed) for K=200 and N=100000 and ND=6 (probably multigrid or ANN on KDTree based, SOM or triangulation based..).. Does anyone know one?
From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean, works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist
Npoints = 3 # or 4 or 5...
# making up some data:
data = np.matrix([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints )]
distances = []
for i in c:
distances.append(np.mean(pdist(data[i,:]))) # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances)) # finding the index of the max mean distance
rows = c[ind] # these are the points in question
I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
raise ValueError("bad ndarray dimensions combination")
return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
for i in range(K):
# remove this point from the K points
old_index = indices[i]
# calculate its sum of distances from the K points
distsums[old_index] = distances(data[indices], data[old_index]).sum()
# update the sums of distances of all points from the K-1 points
distsums -= distances(data, data[old_index])
# choose the point with the greatest sum of distances from the K-1 points
new_index = np.argmax(distsums)
# add it to the K points replacing the old_index
indices[i] = new_index
# don't consider it any more in distsums
distsums[new_index] = -np.inf
# update the sums of distances of all points from the K points
distsums += distances(data, data[new_index])
# sum all mutual distances of the K points
curr_sum = spatial.distance.pdist(data[indices]).sum()
# break if the sum hasn't changed
if curr_sum == prev_sum:
break
prev_sum = curr_sum
if ND == 2:
X, Y = data.T
marker_size = 4
plt.scatter(X, Y, s=marker_size)
plt.scatter(X[indices], Y[indices], s=marker_size)
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:
Assuming that if you read your csv file with N (10000) rows and D dimension (or features) into a N*D martix X. You can calculate the distance between each point and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X) ### convert to numpy array
distance_matrix = np.zeros((X.shape[0],X.shape[0]))
for i in range(X.shape[0]):
for j in range(i+1,X.shape[0]):
## We compute triangle matrix and copy the rest. Distance from point A to point B and distance from point B to point A are the same.
distance_matrix[i][j]= np.linalg.norm(X[i]-X[j]) ## Here I am calculating Eucledian distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to get the lower triangle of distance matrix, which is not really required in your case.
K = 5 ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)
Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this an interesting question but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2d, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks his bottom diagram and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for k furthest points but have a large number of equally valid possible solutions and no way prioritizing them. Here are two super easy examples illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3 Revenge of The Curse of Dimensionality Because you're looking for the most distant points, you have to x,y,z ... n coordinates for each point or you have to impute them. Now, your data set is much larger and slower.
Challenge 4 Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k) but the total dimensional volume of space will increase to power of the dimensions. The k number of furthest points you're is insignificant to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increase.
Now, if you had a low dimensionality, I would go with them as a solution (except the ones that are use nested for loops ... in NumPy or Pandas).
If I was in your position, I'd be thinking how I've got code in these other answers that I could use as a basis and maybe wonder why should I should trust this other than it lays out a framework on how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me reference to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's a though you put a balloon in a box. You could do this a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box For a 2D graph you have four points, for a 3D you have the 8 corners of the box (2^3).
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll 2^(D-1) points. So, for that linear 2D space, you have a line, for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post on by Inversion Labs on Best-fit Surfaces for 3D Data
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be a candidate for being equidistant. But if that were 3D space and rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertice (or point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satifice the solution.
If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
# You can speed this up using cython like scikit-learn does or numba
dist = np.sum((x1 - x2) ** 2)
# We invert the euclidean distance and set nearby points to the biggest possible
# positive float that isn't inf
inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.

How to compare two 3D curves in Python?

I have a huge array with coordinates describing a 3D curves, ~20000 points. I am trying to use less points, by ignoring some, say take 1 every 2 points. When I do this and I plot the reduced number of points the shape looks the same. However I would like to compare the two curves properly, similar to the chi squared test to see how much the reduced plot differs from the original.
Is there an easy, built-in way of doing this or does anyone have any ideas on how to approach the problem.
The general question of "line simplification" seems to be an entire field of research. I recommend you to have a look, for instance, at the Ramer–Douglas–Peucker algorithm. There are several python modules, I could find: rdp and simplification (which also implement the Py-Visvalingam-Whyatt algorithm).
Anyway, I am trying something for evaluating the difference between two polylines, using interpolation. Any curves can be compared, even without common points.
The first idea is to compute the distance along the path for both polylines. They are used as landmarks to go from one given point on the first curve to a relatively close point on the other curve.
Then, the points of the first curve can be interpolated on the other curve. These two datasets can now be compared, point by point.
On the graph, the black curve is the interpolation of xy2 on the curve xy1. So the distances between the black squares and the orange circles can be computed, and averaged.
This gives an average distance measure, but nothing to compare against and decide if the applied reduction is good enough...
def normed_distance_along_path( polyline ):
polyline = np.asarray(polyline)
distance = np.cumsum( np.sqrt(np.sum( np.diff(polyline, axis=1)**2, axis=0 )) )
return np.insert(distance, 0, 0)/distance[-1]
def average_distance_between_polylines(xy1, xy2):
s1 = normed_distance_along_path(xy1)
s2 = normed_distance_along_path(xy2)
interpol_xy1 = interp1d( s1, xy1 )
xy1_on_2 = interpol_xy1(s2)
node_to_node_distance = np.sqrt(np.sum( (xy1_on_2 - xy2)**2, axis=0 ))
return node_to_node_distance.mean() # or use the max
# Two example polyline:
xy1 = [0, 1, 8, 2, 1.7], [1, 0, 6, 7, 1.9] # it should work in 3D too
xy2 = [.1, .6, 4, 8.3, 2.1, 2.2, 2], [.8, .1, 2, 6.4, 6.7, 4.4, 2.3]
average_distance_between_polylines(xy1, xy2) # 0.45004578069119189
If you subsample the original curve, a simple way to assess the approximation error is by computing the maximum distance between the original curve and the line segments between the resampled vertices. The maximum distance occurs at the original vertices and it suffices to evaluate at these points only.
By the way, this provides a simple way to perform the subsampling by setting a maximum tolerance and decimating until the tolerance is exceeded.
You can also think of computing the average distance, but this probably involves nasty integrals and might give less visually pleasing results.

Identifying weighted clusters with max distance diameter and sum(weight) > 50

Problem
Need to identify a way to find 2 mile clusters of points where each point has a value. Identify 2 mile areas which have a sum(value) > 50.
Data
I have data that looks like the following
ID COUNT LATITUDE LONGITUDE
187601546 20 025.56394 -080.03206
187601547 25 025.56394 -080.03206
187601548 4 025.56394 -080.03206
187601550 0 025.56298 -080.03285
Roughly 200K records. What I need to determine is if there are any areas where more than sum of the count exceeds 65 in a one mile radius (2 mile diameter) area.
Using each point as a center for an area
Now, I have python code from another project that will draw a shapefile around a point of x diameter as follows:
def poly_based_on_distance(center_lat,center_long, distance, bearing):
# bearing is in degrees
# distance in miles
# print ('center', center_lat, center_long)
destination = (vincenty(miles=distance).destination(Point(center_lat,
center_long), bearing).format_decimal())
And a routine to return destination and then see which points are inside the radius.
## This is the evaluation for overlap between points and
## area polyshapes
area_list = []
store_geo_dict = {}
for stores in locationdict:
location = Polygon(locationdict[stores])
for areas in AREAdictionary:
area = Polygon(AREAdictionary[areass])
if store.intersects(area):
area_list.append(areas)
store_geo_dict[stores] = area_list
area_list = []
At this point, I am simply drawing a circular shapefile around each of the 200K points, see which others were inside and doing the count.
Need Clustering Algorithm?
However, there might be an area with the required count density where one of the points is not in the center.
I'm familiar with clustering algos such as DBSCAN that use attributes for classification but this is a matter of finding a density clusters using a value for each point. Is there any clustering algorithm to find any cluster of a 2 mile diameter circle where the inside count is >= 50?
Any suggestions, python or R are preferred tools but this is wide-open and probably a one-off so computation efficiency is not a priority.
Not a complete solution, but maybe it will help simplify the problem depending on the distribution of your data. I will use planar coordinates and cKDTree in my example, this might work with geographic data if you can ignore curvature in a projection.
The main observation is the following: a point (x,y) does not contribute to a dense cluster if a ball of radius 2*r (e.g. 2 miles) around (x,y) contributes less than the cutoff value (e.g. 50 in your title). In fact, any point within r of (x,y) does not contribute to ant dense cluster.
This allows you to repeatedly discard points from consideration. If you are left with no points, there are no dense clusters; if you are left with some points, clusters may exist.
import numpy as np
from scipy.spatial import cKDTree
# test data
N = 1000
data = np.random.rand(N, 2)
x, y = data.T
# test weights of each point
weights = np.random.rand(N)
def filter_noncontrib(pts, weights, radius=0.1, cutoff=60):
tree = cKDTree(pts)
contribs = np.array(
[weights[tree.query_ball_point(pt, 2 * radius)].sum() for pt in pts]
)
return contribs >= cutoff
def possible_contributors(pts, weights, radius=0.1, cutoff=60):
n_pts = len(pts)
while len(pts):
mask = filter_noncontrib(pts, weights, radius, cutoff)
pts = pts[mask]
weights = weights[mask]
if len(pts) == n_pts:
break
n_pts = len(pts)
return pts
Example with dummy data:
DBSCAN can be adapted (see Generalized DBSCAN; define core points as weight sum >= 50), but it will not ensure the maximum cluster size (it computes transitive closures).
You could also try complete linkage. Use it to find clusters with the desired maximum diameter, then check if these satisfy the desired density. But that does not guarantee to find all.
It's probably faster to (a) build an index for fast radius search. (b) for every point, find neighbors in radius r; keep if they have the desired minimum sum. But that does not guarantee to find everything because the center is not necessarily a data point. Consider a max radius of 1, minimum weight 100. Two points with weight 50 each, at (0,0) and (1,1). Neither a query at (0,0) nor one at (1,1) will discover the solution, but a cluster at (.5,.5) satisfies the conditions.
Unfortunately, I believe your problem is at least NP-hard, so you won't be able to afford the ultimate solution.

Finding two most far away points in plot with many points in Python

I need to find the two points which are most far away from each other.
I have, as the screenshots say, an array containing two other arrays. one for the X and one for the Y coordinates. What's the best way to determine the longest line through the data? by saying this, i need to select the two most far away points in the plot. Hope you guys can help. Below are some screenshots to help explain the problem.
You can avoid computing all pairwise distances by observing that the two points which are furthest apart will occur as vertices in the convex hull. You can then compute pairwise distances between fewer points.
For example, with 100,000 points distributed uniformly in a unit square, there are only 22 points in the convex hull in my instance.
import numpy as np
from scipy import spatial
# test points
pts = np.random.rand(100_000, 2)
# two points which are fruthest apart will occur as vertices of the convex hull
candidates = pts[spatial.ConvexHull(pts).vertices]
# get distances between each pair of candidate points
dist_mat = spatial.distance_matrix(candidates, candidates)
# get indices of candidates that are furthest apart
i, j = np.unravel_index(dist_mat.argmax(), dist_mat.shape)
print(candidates[i], candidates[j])
# e.g. [ 1.11251218e-03 5.49583204e-05] [ 0.99989971 0.99924638]
If your data is 2-dimensional, you can compute the convex hull in O(N*log(N)) time where N is the number of points. By concentration of measure, this method deteriorates in performance for many common distributions as the number of dimensions grows.

How do I calculate a 3D centroid?

Is there even such a thing as a 3D centroid? Let me be perfectly clear—I've been reading and reading about centroids for the last 2 days both on this site and across the web, so I'm perfectly aware at the existing posts on the topic, including Wikipedia.
That said, let me explain what I'm trying to do. Basically, I want to take a selection of edges and/or vertices, but NOT faces. Then, I want to place an object at the 3D centroid position.
I'll tell you what I don't want:
The vertices average, which would pull too far in any direction that has a more high-detailed mesh.
The bounding box center, because I already have something working for this scenario.
I'm open to suggestions about center of mass, but I don't see how this would work, because vertices or edges alone don't define any sort of mass, especially when I just have an edge loop selected.
For kicks, I'll show you some PyMEL that I worked up, using #Emile's code as reference, but I don't think it's working the way it should:
from pymel.core import ls, spaceLocator
from pymel.core.datatypes import Vector
from pymel.core.nodetypes import NurbsCurve
def get_centroid(node):
if not isinstance(node, NurbsCurve):
raise TypeError("Requires NurbsCurve.")
centroid = Vector(0, 0, 0)
signed_area = 0.0
cvs = node.getCVs(space='world')
v0 = cvs[len(cvs) - 1]
for i, cv in enumerate(cvs[:-1]):
v1 = cv
a = v0.x * v1.y - v1.x * v0.y
signed_area += a
centroid += sum([v0, v1]) * a
v0 = v1
signed_area *= 0.5
centroid /= 6 * signed_area
return centroid
texas = ls(selection=True)[0]
centroid = get_centroid(texas)
print(centroid)
spaceLocator(position=centroid)
In theory centroid = SUM(pos*volume)/SUM(volume) when you split the part into finite volumes each with a location pos and volume value volume.
This is precisely the calculation done for finding the center of gravity of a composite part.
There is not just a 3D centroid, there is an n-dimensional centroid, and the formula for it is given in the "By integral formula" section of the Wikipedia article you cite.
Perhaps you are having trouble setting up this integral? You have not defined your shape.
[Edit] I'll beef up this answer in response to your comment. Since you have described your shape in terms of edges and vertices, then I'll assume it is a polyhedron. You can partition a polyedron into pyramids, find the centroids of the pyramids, and then the centroid of your shape is the centroid of the centroids (this last calculation is done using ja72's formula).
I'll assume your shape is convex (no hollow parts---if this is not the case then break it into convex chunks). You can partition it into pyramids (triangulate it) by picking a point in the interior and drawing edges to the vertices. Then each face of your shape is the base of a pyramid. There are formulas for the centroid of a pyramid (you can look this up, it's 1/4 the way from the centroid of the face to your interior point). Then as was said, the centroid of your shape is the centroid of the centroids---ja72's finite calculation, not an integral---as given in the other answer.
This is the same algorithm as in Hugh Bothwell's answer, however I believe that 1/4 is correct instead of 1/3. Perhaps you can find some code for it lurking around somewhere using the search terms in this description.
I like the question. Centre of mass sounds right, but the question then becomes, what mass for each vertex?
Why not use the average length of each edge that includes the vertex? This should compensate nicely areas with a dense mesh.
You will have to recreate face information from the vertices (essentially a Delauney triangulation).
If your vertices define a convex hull, you can pick any arbitrary point A inside the object. Treat your object as a collection of pyramidal prisms having apex A and each face as a base.
For each face, find the area Fa and the 2d centroid Fc; then the prism's mass is proportional to the volume (== 1/3 base * height (component of Fc-A perpendicular to the face)) and you can disregard the constant of proportionality so long as you do the same for all prisms; the center of mass is (2/3 A + 1/3 Fc), or a third of the way from the apex to the 2d centroid of the base.
You can then do a mass-weighted average of the center-of-mass points to find the 3d centroid of the object as a whole.
The same process should work for non-convex hulls - or even for A outside the hull - but the face-calculation may be a problem; you will need to be careful about the handedness of your faces.

Categories