What unit is the distance in when using cKDTree?

I am trying to calculate the distance between the closest points in two geodataframes.
I used the function created by jHUW here. The function is as follows:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def ckdnearest(gdA, gdB):
    nA = np.array(list(gdA.geometry.apply(lambda x: (x.x, x.y))))
    nB = np.array(list(gdB.geometry.apply(lambda x: (x.x, x.y))))
    btree = cKDTree(nB)
    dist, idx = btree.query(nA, k=1)
    gdB_nearest = gdB.iloc[idx].drop(columns="geometry").reset_index(drop=True)
    gdf = pd.concat(
        [
            gdA.reset_index(drop=True),
            gdB_nearest,
            pd.Series(dist, name='dist')
        ],
        axis=1)
    return gdf
It's working fine between my datasets, but I was wondering what unit the returned distance is in. I did some research and found that the unit will be the same as the unit of the array used. I used an array of lat-lons, like so:
array([[-122.3295182, 47.6202074],
[-122.296276 , 37.8789939],
[-122.6857603, 45.5289172],
[-118.3804073, 33.9017057],
[ -93.2911788, 44.860997 ]])
I tried to find out what the units of lat-lons would be, but was unsuccessful. I also checked the distance between some of the point pairs on Google Maps to get some insight, but couldn't make sense of them. For instance, Google Maps shows a distance of 1.5 miles for my first pair, but the distance returned by the function is 0.0087466. I understand that cKDTree calculates the Euclidean distance, but even then the difference seems quite large. Please provide some insight if you have any.

The result from SciPy is indeed an L2 norm (i.e. Euclidean distance). What this distance means depends on the chosen coordinate system. In your case, you appear to be using a geographic coordinate system (which is a spherical coordinate system), so the coordinates are angles in degrees and the number returned is nominally in degrees as well. Angles cannot be converted linearly to meters: near the poles, for example, a change in longitude barely changes the distance in meters. Additionally, one needs to account for the curvature of the space when computing the distance: a straight line for us on Earth is a geodesic in your geographic space. The L2 norm computed by SciPy does not consider this. In fact, using this metric probably gives wrong results: relative to the equator, the L2 norm overestimates the actual distance (the length of the geodesic, whether in meters or radians) in Antarctica. This means two points near the North Pole can appear to be separated by the same distance as two points located in Japan and Europe... Thus, you certainly need to use a better metric. As for the unit of the distance, it does not make much sense, mainly because of this issue. A relatively good metric would be the length of the geodesic (possibly in meters) or the angle between the two points. Unfortunately, AFAIK this is not possible with SciPy... Using a GIS library (like GDAL) may help.
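To make the point concrete, here is a minimal sketch (not from the original answer) of a great-circle distance computed with the haversine formula. It assumes a spherical Earth with a mean radius of 6371.0088 km, and the helper name haversine_km is made up for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumption: mean Earth radius, spherical approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude at the equator vs. at 60 degrees north.
# The Euclidean distance in degrees is 1.0 in both cases, but the
# distance on the ground differs by roughly a factor of two.
at_equator = haversine_km(0.0, 0.0, 1.0, 0.0)
at_60_north = haversine_km(0.0, 60.0, 1.0, 60.0)
print(at_equator, at_60_north)
```

If a tree-based nearest-neighbour search is still wanted, one option to look at is scikit-learn's BallTree with metric='haversine', which expects [lat, lon] pairs in radians; projecting both GeoDataFrames to a metric CRS before building the cKDTree is another.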

Related

What is CRS/units in osmnx python? How to find distance between road edge and a given point in openstreetmap?

Here are the docs:
osmnx.distance.nearest_edges(G, X, Y, interpolate=None, return_dist=False)
Find the nearest edge to a point or to each of several points.
If X and Y are single coordinate values, this will return the nearest edge to that point. If X and Y are lists of coordinate values, this will return the nearest edge to each point.
If interpolate is None, search for the nearest edge to each point, one at a time, using an r-tree and minimizing the euclidean distances from the point to the possible matches. For accuracy, use a projected graph and points. This method is precise and also fastest if searching for few points relative to the graph’s size.
For a faster method if searching for many points relative to the graph’s size, use the interpolate argument to interpolate points along the edges and index them. If the graph is projected, this uses a k-d tree for euclidean nearest neighbor search, which requires that scipy is installed as an optional dependency. If graph is unprojected, this uses a ball tree for haversine nearest neighbor search, which requires that scikit-learn is installed as an optional dependency.
Parameters:
G (networkx.MultiDiGraph) – graph in which to find nearest edges
X (float or list) – points’ x (longitude) coordinates, in same CRS/units as graph and containing no nulls
Y (float or list) – points’ y (latitude) coordinates, in same CRS/units as graph and containing no nulls
interpolate (float) – spacing distance between interpolated points, in same units as graph. smaller values generate more points.
return_dist (bool) – optionally also return distance between points and nearest edges
Returns:
ne or (ne, dist) – nearest edges as (u, v, key) or optionally a tuple where dist contains distances between the points and their nearest edges
Return type:
tuple or list
Here is my question:
But what is CRS? Why can't I use a normal longitude and latitude here? My points look something like 6467474 (dtype: float64). I am new to GIS.
u v key

What is CRS/units in osmnx python?
Are you asking what these terms mean? Or what their default values are? If it is the former, you can refer to any introductory GIS textbook. If it is the latter, as the OSMnx documentation states, the default CRS is EPSG:4326. Regarding distance units, it depends on what you did in your code. You did not provide a complete, minimal, reproducible example. If you projected your graph, then distances are measured in whatever units your projection is in. If you did not, then distances are measured in meters by default.
Why can't I use a normal longitude and latitude here?
As the documentation states, you can pass in latitude and longitude to find the nearest edge(s) to point(s). I would strongly encourage you to work through the OSMnx usage examples to learn how the package works and practice some demonstration code (including finding the nearest edges to lat-lng points).
points are 6467474 something like this(dtype:float64). I am new to GIS. u v key
I don't know what you mean. Again, you need to provide a complete, minimal, reproducible code snippet so we can diagnose and troubleshoot. If you do, I can edit this answer if your code snippet provides enough info for me to give further information.

Python: how to find the offset that minimizes the euclidean distance between two series?

I have two non-identical series where one is lagging the other. I want to find the x_axis offset that minimizes the Euclidean distance between the two series.
import pandas as pd
df = pd.DataFrame({'a':[1,4,5,10,9,3,2,6,8,4], 'b': [1,7,3,4,1,10,5,4,7,4]})
I am using Dynamic Time Warping modules in Python, which give me the minimum distance, but I am not sure how to get the offset.
import numpy as np
import matplotlib.pyplot as plt
from dtw import dtw, accelerated_dtw
d1 = df['a'].values
d2 = df['b'].values
d, cost_matrix, acc_cost_matrix, path = accelerated_dtw(d1,d2, dist='euclidean')
plt.imshow(acc_cost_matrix.T, origin='lower', cmap='gray', interpolation='nearest')
plt.plot(path[0], path[1], 'w')
plt.xlabel('a')
plt.ylabel('b')
plt.title(f'DTW Minimum Path with minimum distance: {np.round(d,2)}')
plt.show()
I am not sure how to interpret the "15" distance measure at the top of the cost matrix. Is it the minimum distance between the already-offset series? Or is it the offset that results in the minimum distance between the two series?
Thank you in advance!

It seems like you have a misunderstanding of how dynamic time warping (DTW) works. DTW tries to find the lowest-cost matching of two time series (in your case, the cost is the Euclidean distance). But the core feature of the algorithm is that the matching is NON-LINEAR, hence the "warping" in the name. The two time series are warped, or twisted, to find the best fit. So DTW doesn't really give you an optimal offset, since it is not about offsetting the whole time series by a fixed amount; rather, it operates on a point-by-point basis.
Look how the matching lines in the DTW are not linear (and some points match to more than one point):
As for the distance, it is the accumulated cost (or the total euclidean distance of the optimal DTW matching).
Another thing worth mentioning about DTW is that one of its default constraints is to match every single point in each timeseries. But in your case, you're trying to offset the entire graph, so some of the points won't be matched. There are, however, ways to relax this constraint and to impose another constraint on the matching (so as to only match once and to force the DTW to perform linear offsetting). But this needs a deep understanding of how the algorithm works and requires complicated configurations.
In short, I don't think DTW is the right choice of an algorithm in your case. You can try writing a script that checks the euclidean distance for different options of offsets (which shouldn't be a hard task since you're dealing with a fixed offset).
You can read more about DTW here: https://towardsdatascience.com/dynamic-time-warping-3933f25fcdd#:~:text=Dynamic%20Time%20Warping%20is%20used,time%20series%20with%20different%20length.&text=How%20to%20do%20that%3F,total%20distance%20of%20each%20component.
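The brute-force offset search suggested in this answer could look roughly like the sketch below. The function name best_offset and the max_lag parameter are made up for illustration, and using the RMS difference over the overlapping part (rather than the raw Euclidean distance) is an assumption, made so that shorter overlaps are not favoured automatically.

```python
import numpy as np

def best_offset(a, b, max_lag):
    """Return (lag, rms) minimising the RMS difference between a[i + lag] and b[i]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if max_lag >= min(len(a), len(b)):
        raise ValueError("max_lag must be smaller than both series lengths")
    best_lag, best_rms = None, np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            x, y = a[lag:], b      # a[i + lag] is compared with b[i]
        else:
            x, y = a, b[-lag:]     # a[i] is compared with b[i - lag]
        n = min(len(x), len(y))    # compare only the overlapping part
        rms = np.sqrt(np.mean((x[:n] - y[:n]) ** 2))
        if rms < best_rms:
            best_lag, best_rms = lag, rms
    return best_lag, best_rms

# Synthetic check: two windows cut from the same curve, three samples apart.
base = np.sin(np.linspace(0.0, 6.0, 40))
a = base[3:33]
b = base[:30]
lag, rms = best_offset(a, b, max_lag=5)
print(lag, rms)
```

With the series from the question, this would be called as best_offset(df['a'].values, df['b'].values, max_lag=3); for series that short, only small lags leave enough overlap to be meaningful.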

Looking for a clustering algorithm, that can cluster by density around a centroid, but with a fixed maximum distance cutoff

I currently have a list of 3D coordinates that I want to cluster by density into an unknown number of clusters. In addition, I want to score the clusters by population and by distance to the centroids.
I would also like to be able to set a maximum possible distance from a given centroid. Ideally each centroid is a point of the data set, but this is not absolutely necessary. I want to do this for lists ranging from approximately 100 to 10000 3D coordinates.
So, for example, say I have a point [x,y,z] which could be my centroid:
Points that are closest to [x,y,z] should contribute the most to its score (i.e. a logistic scoring function like y = (1 + exp(4*(-1.0+x)))** -1, where x represents the Euclidean distance to the point [x,y,z])
( https://www.wolframalpha.com/input/?i=(1+%2B+exp(4(-1.0%2Bx)))**+-1 )
Since this function never reaches 0, a maximum distance is needed, e.g. 2 distance units, to set a limit to the cluster.
I want to do this until no more clusters can be made. I am only interested in the centroid, so it should preferably be a real data point rather than an interpolated one, since it also has other properties connected to it.
I have already tried DBSCAN from sklearn, which is several orders of magnitude faster than my code, but it obviously does not accomplish what I want to do.
Currently I am just calculating the proximity of every point relative to all other points and scoring every point by the number of and distance to its neighbors (with the scoring function discussed above). Then I take the highest-scored point and remove all other, lower-scored points that are within a certain cutoff distance. It gets the job done and is accurate, but it is too slow.
I hope I could be somewhat clear with what I want to do.

Use the neighbor search functions of sklearn to quickly find the points within the maximum distance of 2. Do this only once, and compute the logistic weights only once.
Then do the remainder using only this precomputed data.
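A rough sketch of that idea follows. Note that it swaps in SciPy's cKDTree for the sklearn neighbor search mentioned above (either should work for a fixed-radius query), and the function name greedy_centroids is made up for illustration. The scoring function and the greedy "take the best, drop everything within the cutoff" step follow the question's description.

```python
import numpy as np
from scipy.spatial import cKDTree

def greedy_centroids(points, cutoff=2.0):
    """Score every point once, then greedily pick centroids by descending score."""
    points = np.asarray(points, dtype=float)
    tree = cKDTree(points)
    # One fixed-radius search for all points; each list includes the point itself.
    neighbors = tree.query_ball_point(points, r=cutoff)

    scores = np.empty(len(points))
    for i, nbrs in enumerate(neighbors):
        nbrs = np.asarray(nbrs)
        nbrs = nbrs[nbrs != i]  # do not let a point score itself
        dist = np.linalg.norm(points[nbrs] - points[i], axis=1)
        # Logistic weight from the question: (1 + exp(4 * (-1 + x))) ** -1
        scores[i] = np.sum(1.0 / (1.0 + np.exp(4.0 * (dist - 1.0))))

    available = np.ones(len(points), dtype=bool)
    centroids = []
    for i in np.argsort(-scores):  # highest score first
        if available[i]:
            centroids.append(int(i))
            available[neighbors[i]] = False  # drop everything within the cutoff
    return centroids, scores

pts = [[0, 0, 0], [0.1, 0, 0], [-0.1, 0, 0], [10, 10, 10], [10.1, 10, 10]]
centroids, scores = greedy_centroids(pts, cutoff=2.0)
print(centroids)
```

The neighbor lists and the scores are computed exactly once; the selection loop afterwards only touches the precomputed data.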

Large set of x,y coordinates. Efficient way to find any within certain distance of each other?

I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.ckdtree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
...
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the ID's of all pairs of points that are within a cutoff distance "d" of each other.

scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the "ID's of all pairs of points that are within a cutoff distance "d" of each other", as you desired.
Shortcomings
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
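Since the question asked specifically about the k-d tree, here is a minimal sketch using cKDTree.query_pairs, which avoids building the full distance matrix and reports each unordered pair only once (with i < j), without self-pairs. The random data, the seed, and the threshold value are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pos = rng.uniform(size=(1000, 2))  # stand-in for the (x, y) columns of the dataframe
d = 0.05                           # distance threshold, in the same units as pos

tree = cKDTree(pos)
# Array of shape (n_pairs, 2); each row is (i, j) with i < j and distance <= d.
pairs = tree.query_pairs(r=d, output_type='ndarray')
print(pairs.shape)
```

The row indices in pairs refer to positions in pos, so they can be mapped back to the dataframe with df_sub.index[pairs] if the original IDs are needed.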

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23 and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference.
I'm doing something simple, similar to the following:
from numpy import vstack
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements look something like [22, 418, 192] where the first element is the hour.
Any ideas?

Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure, and therefore use a different library. More importantly, though, the concept of the mean used in k-means does not suit a cyclic dimension. Let's consider the following example:
#current cluster X, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which is incorrect from the point of view of the cyclic measure; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results.
Simple k-means will result in the clusters {x1,y1}, {x2,y2}
Simple k-means + the cyclic distance measure results in a degenerate super-cluster {x1,x2,y1,y2}
The correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires checking, for each point, whether it is better to take the "simple average" or to represent the point as x'=x-24. Unfortunately, given n points, that makes 2^n possibilities.
This seems like a use case for kernelized k-means, where you are actually clustering in an abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here

Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
1. the reassignment step
2. the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in some arbitrary other distance function, as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ.
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
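The sin/cos transformation from the second option can be sketched as follows; the helper name encode_hour is made up for illustration. After the transformation, hour 0 and hour 23 end up close together in the plane, so ordinary squared Euclidean distance (and therefore plain k-means) treats them as neighbours.

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day (0-24) onto the unit circle as (sin, cos) columns."""
    angle = np.atleast_1d(np.asarray(hour, dtype=float)) / 12.0 * np.pi
    return np.column_stack([np.sin(angle), np.cos(angle)])

xy = encode_hour([0, 23, 12])
d_0_23 = np.linalg.norm(xy[0] - xy[1])  # neighbouring hours across midnight
d_0_12 = np.linalg.norm(xy[0] - xy[2])  # opposite sides of the clock
print(d_0_23, d_0_12)
```

The two new columns would replace the raw hour column before calling KMeans. Because they lie in [-1, 1], the remaining features probably need to be scaled to a comparable range, otherwise they will dominate the sum of squares.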

The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5
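The same calculation can be wrapped into reusable helpers, together with a matching wrap-around distance for the distance-to-centroid step. This is only a sketch: the function names and the period parameter are made up for illustration, and wiring them into an actual k-means loop is left out.

```python
import numpy as np

def circular_mean_hours(hours, period=24.0):
    """Circular mean of hour values, returned in the range [0, period)."""
    angles = np.asarray(hours, dtype=float) / period * 2 * np.pi
    # Note: the result is meaningless if the sin/cos sums cancel out,
    # e.g. for hours 0 and 12, which have no well-defined circular mean.
    mean_angle = np.arctan2(np.sin(angles).sum(), np.cos(angles).sum())
    return (mean_angle % (2 * np.pi)) * period / (2 * np.pi)

def circular_distance_hours(a, b, period=24.0):
    """Shortest distance between hour values, going either way around the clock."""
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) % period
    return np.minimum(diff, period - diff)

print(circular_mean_hours([23, 0]))     # halfway between 23:00 and 00:00
print(circular_distance_hours(0, 23))   # one hour apart, not 23
```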
