Pandas how to create a dataframe showing distance metric between every row? - python

I have a dataframe, df, that looks as follows:
I would like to make a 16x16 dataframe, df_distances, where I calculate the euclidean distance of [cx, cy] between every row with every other row. Diagonals will be zero as distance from row i's coordinates and itself is zero, etc.
In pseudocode, numpy.linalg.norm(row_i[cx, cy] - row_j[cx, cy]) for all i, j in range 1..16.
How can I do this without doing some painful double loop? Surely there is a smart and efficient way!

You should be able to do it row-wise by broadcasting.
import numpy as np

points = df[['cx', 'cy']]
distances = {}
for i in range(len(points)):  # .iloc is 0-indexed, so this runs over rows 0..15
    point = points.iloc[i]
    # Broadcasting point against all rows gives one distance per row
    distances[i] = np.linalg.norm(points - point, axis=1)
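The resulting dict maps each row index to a length-16 array of distances, so something like pd.DataFrame(distances) should assemble the 16x16 df_distances (a sketch; it assumes df has exactly 16 rows indexed 0-15).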

I believe what you are looking for is scipy.spatial.distance_matrix (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html). Create a numpy array from the cx and cy columns, then pass it as both the x and the y argument to the distance-matrix function.
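A minimal sketch of that approach, assuming df has the cx and cy columns described above:
import pandas as pd
from scipy.spatial import distance_matrix

coords = df[['cx', 'cy']].to_numpy()
# Passing the same points as both x and y yields the full 16x16 matrix,
# with zeros on the diagonal
df_distances = pd.DataFrame(distance_matrix(coords, coords),
                            index=df.index, columns=df.index)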

Related

Get Index of data point in training set with shortest distance to input matrix with Numpy

I would like to build a function npbatch(U,X) which compares data points in an input matrix (U) with data points in a training matrix (X) and gets me the index of X with the shortest euclidean distance to the data point in U.
I would like to avoid any loops to increase the performance and I would like to use the function scipy.spatial.distance.cdist to compute the distance.
Example Input:
U
array([[0.69646919, 0.28613933, 0.22685145],
       [0.55131477, 0.71946897, 0.42310646],
       [0.9807642 , 0.68482974, 0.4809319 ]])
X
array([[0.24875591, 0.16306678, 0.78364326],
       [0.80852339, 0.62562843, 0.60411363],
       [0.8857019 , 0.75911747, 0.18110506]])
--> Expected Output: Array with the three indices of the data points in X which have the shortest distance to the three data points in U.
My overall target is then to get the label of the corresponding data point using the index which I've got. Example for label input would be:
Y
array([1, 0, 0])
Thank you for any hint!
With scipy.spatial.distance.cdist you already chose a well-suited function for the task. To get the indices, we just have to apply numpy.argmin along axis 0 (or axis 1 for cdist(U, X)):
ix = numpy.argmin(scipy.spatial.distance.cdist(X, U), 0)
Getting the label is then trivial:
Y[ix]
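Putting it together with the example arrays from the question:
import numpy as np
from scipy.spatial.distance import cdist

U = np.array([[0.69646919, 0.28613933, 0.22685145],
              [0.55131477, 0.71946897, 0.42310646],
              [0.9807642 , 0.68482974, 0.4809319 ]])
X = np.array([[0.24875591, 0.16306678, 0.78364326],
              [0.80852339, 0.62562843, 0.60411363],
              [0.8857019 , 0.75911747, 0.18110506]])
Y = np.array([1, 0, 0])

# cdist(X, U)[i, j] is the distance between X[i] and U[j]; argmin over axis 0
# picks, for each point in U, the index of the nearest point in X
ix = np.argmin(cdist(X, U), axis=0)
labels = Y[ix]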

Selecting closest values by Euclidian distance from the mean from a numpy array

I'm sure there's a straightforward answer to this, but I'm very much a Python novice, and trawling Stack Overflow gets me tantalisingly close before I fall at the final hurdle, so apologies. I have an array of one-dimensional arrays (in reality composed of >2000 arrays, each of ~800 values), but for representation's sake:
group = [[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5],...]
I'm trying to select the nearest n 1-d arrays to the mean (by Euclidean distance), but struggling to extract them from the original list. I can figure out the distances and sort them, but can't then extract the corresponding arrays from the original group.
# Compute the mean
group_mean = group.mean(axis=0)
distances = []
for x in group:
    # Compute Euclidean distance from the mean
    distances.append(np.linalg.norm(x - group_mean))
# Sort distances
distances.sort()
print(distances[0:5])  # Prints the five nearest distances
Any advice as to how to select out the five (or whatever) arrays from group corresponding to the nearest distances would be much appreciated.
You can store each array together with its distance and sort on the distance to the mean:
import numpy as np
group = np.array([[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5]])
group_mean = group.mean(axis = 0)
distances = [[np.linalg.norm(x - group_mean),x] for x in group]
distances.sort(key=lambda a : a[0])
print(distances[0:5]) # Prints the five nearest distances
If your arrays get larger, it might be wise to save only the index instead of the whole array:
distances = [[np.linalg.norm(x - group_mean),i] for i,x in enumerate(group)]
If you don't want to save the distances themselves, but just want to sort based on the distance, you can do this:
group = list(group)
group.sort(key=lambda x: np.linalg.norm(x - group_mean))
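For larger groups, a fully vectorized sketch that computes one norm per row and then takes the n nearest rows (here n = 5, as in the question):
import numpy as np

group = np.array([[0, 1, 3, 4, 5], [0, 2, 3, 6, 7], [0, 4, 3, 2, 5]])
group_mean = group.mean(axis=0)
# One Euclidean distance per row, computed in a single call
dists = np.linalg.norm(group - group_mean, axis=1)
# Indices of the n nearest rows, then the rows themselves
n = 5
nearest = group[np.argsort(dists)[:n]]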

Find the corresponding [x,y] coordinates for a given z coordinate value

I have a 3D array, and I want to find the greatest z-coordinate in that array. After that I need to find the corresponding x and y coordinate values based on that z-coordinate. How can I achieve this quickly via numpy?
What I did:
I used argsort to first sort the given 3D array, then used np.max(array) to find the greatest z-coordinate. I do not know how else to continue. Can numpy.where be useful here?
Thanks!
What you are looking for is numpy's argmax.
Quick example:
import numpy as np

data = np.random.rand(5, 3)
print(data)
# Row index of the largest value in the third (z) column
ind = np.argmax(data[:, 2])
print(data[ind, :])
outputs
[[0.92037795 0.59469121 0.02956843]
 [0.82881039 0.23272832 0.97275488]
 [0.98418468 0.45699429 0.44662552]
 [0.62519115 0.16637013 0.40433299]
 [0.98272718 0.01467489 0.57442259]]
[0.82881039 0.23272832 0.97275488]
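To answer the numpy.where part of the question: it can do the same lookup, and additionally reports every matching row in case of ties (a small sketch):
import numpy as np

data = np.random.rand(5, 3)
# Rows whose z value equals the maximum z; np.where returns all of them
rows = np.where(data[:, 2] == data[:, 2].max())[0]
print(data[rows, :])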

Trace max along dim1 with varying index in dim3

I have a "cube" of 3D data where there is some peak in the column, or first dimension. The index of the peak may shift depending what row is examined. The third dimension may do something a bit more complicated, but for now can be thought of as just scaling things by some linear function.
I would like to find the index of the max along the first dimension, subject to the constraint that for each row, the z index is chosen such that the column peak will be closest to 0.5.
Here's a sample image that is a plane in row,column with a fixed z:
These arrays will at times be large -- say, 21x11x200 float64s, so I would like to vectorize this calculation. Written with a for loop, it looks like this:
cols, rows, zs = data.shape
for i in range(rows):
    # for each field point, make an intermediate array that is 2D with focus,frequency dimensions
    arr = data[:, i, :]
    # compute the thru-focus max and find the peak closest to 0.5
    maxs = np.max(arr, axis=0)
    max_manip = np.abs(maxs - 0.5)
    freq_idx = np.argmin(max_manip)
    # take the thru-focus slice that peaks closest to 0.5
    arr2 = data[:, i, freq_idx]
    focus_idx = np.argmax(arr2)
    print(focus_idx)
My issue is that I do not know how to roll these calculations up into a vector operation. I would appreciate any help, thanks!
We just need to use the axis param with the relevant ufuncs there and that would lead us to a vectorized solution, like so -
# Get freq indices along all rows in one go
idx = np.abs(data.max(0)-0.5).argmin(1)
# Index into data with those and get the argmax indices
out = data[:,np.arange(data.shape[1]), idx].argmax(0)
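As a quick sanity check, the vectorized result can be compared against the question's loop on random data of the stated 21x11x200 shape (a sketch):
import numpy as np

data = np.random.rand(21, 11, 200)

# Vectorized version from the answer
idx = np.abs(data.max(0) - 0.5).argmin(1)
out = data[:, np.arange(data.shape[1]), idx].argmax(0)

# Loop version from the question, collected into a list for comparison
expected = [data[:, i, np.abs(data[:, i, :].max(0) - 0.5).argmin()].argmax()
            for i in range(data.shape[1])]
assert np.array_equal(out, np.array(expected))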

Compute numpy array pairwise Euclidean distance except with self

edit: this question is not specifically about calculating distances, but rather about the most efficient way to loop through a numpy array, specifying that for index i all comparisons should be made with the rest of the array, as long as the second index is not i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy
# Point feature class
fc = "MY_FEATURE_CLASS"
# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc, ["SHAPE@X", "SHAPE@Y", "OID@"])
for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow((row[0]-row2[0]), 2) + math.pow((row[1]-row2[1]), 2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!
Using SciPy's pdist, you could write something like
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(npArray, lambda a, b: np.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
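If the per-pair Python lambda is too slow, an alternative sketch is to pull the X and Y columns into a plain 2-D float array first, so pdist's default (C-implemented) Euclidean metric can be used; this assumes npArray is a structured array whose field names match the tokens passed to FeatureClassToNumPyArray:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Plain (n, 2) float array of coordinates; the ID column is simply left out
xy = np.column_stack([npArray['SHAPE@X'], npArray['SHAPE@Y']])
distances = squareform(pdist(xy))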
Each row of X is a 3-dimensional data instance or point.
The output pairwisedist[i, j] is the distance between X[i, :] and X[j, :], computed via the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y:
X = np.array([[6,1,7],[10,9,4],[13,9,3],[10,8,15],[14,4,1]])
a = np.sum(X*X, 1)
b = np.repeat(a[:, np.newaxis], 5, axis=1)
# b + b.T - 2*X.dot(X.T) holds the squared distances; take the square root
pairwisedist = np.sqrt(b + b.T - 2*X.dot(X.T))
I wanted to point out that hand-rolled sqrt-of-sum-of-squares code is prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance:
import math
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(npArray, lambda a, b: math.hypot(a[0]-b[0], a[1]-b[1])))
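np.hypot also works on whole arrays, so the pairwise matrix can be built with broadcasting instead of a per-pair lambda (again assuming the structured-array field names above):
import numpy as np

# Pairwise coordinate differences via broadcasting, then hypot over them
dx = npArray['SHAPE@X'][:, None] - npArray['SHAPE@X'][None, :]
dy = npArray['SHAPE@Y'][:, None] - npArray['SHAPE@Y'][None, :]
distances = np.hypot(dx, dy)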
