I'm doing some calculations with Python. Say I have a DataFrame like this, holding the longitude and latitude of some points:
import pandas as pd
dfa = pd.DataFrame([[1, 2], [1, 3], [1, 1], [1, 4]], columns=['y', 'x'])
Before, I used distance_matrix from scipy.spatial and created another DataFrame with the code below, but it seems it can't precisely calculate the distance between points given as longitude/latitude:
from scipy.spatial import distance_matrix
pd.DataFrame(distance_matrix(dfa.values, dfa.values), index=dfa.index, columns=dfa.index)
Do you think it's possible to switch the calculation to geodesic? Here is what I've tried:
from geopy.distance import geodesic
pd.DataFrame(geodesic(dfa.values[0], dfa.values[0]).kilometers, index=dfa.index, columns=dfa.index)
# I don't know how to make the [0] indices range over all rows and columns
Any suggestions?
Given a list or list-like object locations, you can do
distances = pd.DataFrame([[geodesic(a, b).km for a in locations]
                          for b in locations])
This is redundant, though, since it calculates the distance for both (a, b) and (b, a) even though the two are the same. Depending on the cost of geodesic, you may find some of the following alternatives faster:
distances = pd.DataFrame([[geodesic(a, b).km if a > b else 0
                           for a in locations]
                          for b in locations])
distances = distances.add(distances.T)
size = len(locations)
distances = pd.DataFrame(columns=range(size), index=range(size))

def get_distance(i, j):
    if i == j:
        return 0
    # reuse the symmetric entry if it has already been filled in (cells start as NaN)
    if pd.notna(distances.loc[j, i]):
        return distances.loc[j, i]
    return geodesic(locations[i], locations[j]).km

for i in range(size):
    for j in range(size):
        distances.loc[i, j] = get_distance(i, j)
You can also store the data as a dictionary whose keys are the output of itertools.combinations, as sketched below. There's also this article on creating a symmetric matrix class.
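A minimal sketch of that dictionary approach, assuming locations is a list of (lat, lon) tuples:

from itertools import combinations
from geopy.distance import geodesic

# Each unordered pair of indices appears exactly once as a key.
pair_distances = {(i, j): geodesic(locations[i], locations[j]).km
                  for i, j in combinations(range(len(locations)), 2)}

def lookup(i, j):
    # Look a pair up in either order; the distance to self is 0.
    if i == j:
        return 0
    return pair_distances[(min(i, j), max(i, j))]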
This is my data set:
https://pastebin.com/SsuKP2eH
I'm trying to find the nearest point for every point in the data set. These points are latitude/longitude pairs on the Earth's surface. Of course, the nearest point cannot be the point itself.
I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.
I expected to get an array full of distances, but instead I got an array full of zeroes with a few numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbors, on my data set and got the same result. So I did some debugging and reduced the range of the random numbers the poster used, making them closer to my own data set.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
R = 6367  # Earth radius in km

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2 * np.arcsin(chord_length / (2.0 * R))
        arclength = R * central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y', 'z']])
    distance, index = tree.query(data[['x', 'y', 'z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index

np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude': np.random.uniform(-90, 90, size=N), 'Longitude': np.random.uniform(0, 360, size=N)})
data = pd.DataFrame({'Latitude': np.random.uniform(-49.19, 49.32, size=N), 'Longitude': np.random.uniform(-123.02, -123.23, size=N)})
result = using_kdtree(data)
I found that the resulting distances array had small values, close to 0. This makes me believe that the result array for my data set is full of zeroes because the differences between points are very small: somewhere, the KD-tree/nearest-neighbour search loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method keeps precision, but it is far too slow with 7200 points to iterate through.
I think what's happening is that k=2 in
distance, index = tree.query(data[['x', 'y','z']], k=2)
tells KDTree you want the closest two points to each point. The closest is obviously the point itself, at distance zero. Also, if you print index you see an Nx2 array whose every row starts with its own row number: this is KDTree's way of saying that the closest point to the i-th point is the i-th point itself.
Obviously that is not useful, and you probably want only the 2nd-closest point. Fortunately, I found this in the documentation of the k parameter of query:
Either the number of nearest neighbors to return, or a list of the
k-th nearest neighbors to return, starting from 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
So
distance, index = tree.query(data[['x', 'y','z']], k=[2])
gives only the distance and index of the 2nd-closest point.
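Applied to the question's code, a minimal sketch (assuming tree and the dist_to_arclength helper are hoisted out of using_kdtree so they are available here):

# k=[2] asks only for the 2nd-nearest neighbour, so both outputs have shape (N, 1).
distance, index = tree.query(data[['x', 'y', 'z']], k=[2])
nearest_km = dist_to_arclength(distance[:, 0])  # great-circle distance in km
nearest_idx = index[:, 0]                       # row number of that neighbour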
I am writing some code to calculate the real distance between one point and the rest of the points in the same array. The array holds the positions of particles in 3D space. There are N particles, so the array's shape is (N, 3). I pick one particle and calculate the distance between it and the rest of the particles, all within one array.
Would anyone here have any idea how to do this?
What I have so far:
import random
import numpy as np

xbox = 10
ybox = 10
zbox = 10
nparticles = 15

positions = np.empty([nparticles, 3])
for i in range(nparticles):
    xrandomalocation = random.uniform(0, xbox)
    yrandomalocation = random.uniform(0, ybox)
    zrandomalocation = random.uniform(0, zbox)
    positions[i, 0] = xrandomalocation
    positions[i, 1] = yrandomalocation
    positions[i, 2] = zrandomalocation
And that's pretty much all I have right now. I was thinking of using np.linalg.norm, but I am not sure how to fit it into my code (maybe use it in a loop)?
It sounds like you could use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist for this. For example, to get the distances from point X to the points in coords:
>>> from scipy.spatial import distance
>>> X = [(35.0456, -85.2672)]
>>> coords = [(35.1174, -89.9711),
... (35.9728, -83.9422),
... (36.1667, -86.7833)]
>>> distance.cdist(X, coords, 'euclidean')
array([[ 4.70444794, 1.6171966 , 1.88558331]])
pdist is similar, but takes only one array, and you get the distances between all pairs.
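For instance, a quick sketch with the same coords (pdist returns the condensed upper triangle, in the order (0,1), (0,2), (1,2)):

>>> d = distance.pdist(coords)
>>> d
array([ 6.0892...,  3.3560...,  2.8477...])
>>> distance.squareform(d).shape  # expand to the full symmetric matrix
(3, 3)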
I am using this function:
from scipy.spatial import distance
def closest_node(node, nodes):
    closest = distance.cdist([node], nodes)  # shape (1, len(nodes))
    index = closest.argmin()
    euclidean = closest[0]
    return nodes[index], euclidean[index]
where node is the single point in space that you want to compare with an array of points called nodes. It returns the closest point and the Euclidean distance to your original node.
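A quick usage sketch, reusing the coordinates from the cdist example above:

import numpy as np

nodes = np.array([[35.1174, -89.9711],
                  [35.9728, -83.9422],
                  [36.1667, -86.7833]])
node = (35.0456, -85.2672)

nearest, dist = closest_node(node, nodes)
# nearest -> array([ 35.9728, -83.9422]), dist -> ~1.617 (the smallest cdist entry)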
I have a function:
def update(points, closest, centroids):
    return np.array([points[closest == k].mean(axis=0)
                     for k in range(centroids.shape[0])])
It is basically the centroid-update step of the k-means algorithm. points is a matrix, closest is an assignment of each point to a cluster, and all I am doing is finding the new mean of the points in each cluster. I was wondering if I can get rid of that for loop, i.e. compute the cluster means in one shot?
Here's a vectorized approach based on np.add.reduceat -
c = np.bincount(closest, minlength=centroids.shape[0])  # cluster sizes
mask = c != 0                                           # non-empty clusters only
pts_grp = points[closest.argsort()]                     # rows grouped by cluster
cut_idx = np.append(0, c[mask].cumsum()[:-1])           # start offset of each group
out = np.full((centroids.shape[0], points.shape[1]), np.nan)
out[mask] = np.add.reduceat(pts_grp, cut_idx, axis=0) / c[mask, None].astype(float)
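A quick sanity check against the loop version, wrapping the snippet above in a hypothetical helper named update_vectorized:

import numpy as np

def update_vectorized(points, closest, centroids):
    c = np.bincount(closest, minlength=centroids.shape[0])
    mask = c != 0
    pts_grp = points[closest.argsort()]
    cut_idx = np.append(0, c[mask].cumsum()[:-1])
    out = np.full((centroids.shape[0], points.shape[1]), np.nan)
    out[mask] = np.add.reduceat(pts_grp, cut_idx, axis=0) / c[mask, None]
    return out

np.random.seed(0)
points = np.random.rand(10, 3)
centroids = np.random.rand(4, 3)
closest = np.random.randint(0, 4, size=10)
# Empty clusters come out as NaN rows in both versions.
assert np.allclose(update(points, closest, centroids),
                   update_vectorized(points, closest, centroids),
                   equal_nan=True)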
I am new to Python. I would like to perform hierarchical clustering on an N by P dataset that contains some missing values. I am planning to use the scipy.cluster.hierarchy.linkage function, which takes a distance matrix in condensed form. Does Python have a method to compute a distance matrix for data containing missing values? (In R, the dist function automatically takes care of missing values, but scipy.spatial.distance.pdist does not seem to handle them!)
I could not find a method to compute a distance matrix for data with missing values, so here is my naive solution using Euclidean distance.
import numpy as np
def getMissDist(x, y):
    # mean squared difference over the features observed in both rows
    return np.nanmean((x - y) ** 2)

def getMissDistMat(dat):
    Npat = dat.shape[0]
    dist = np.zeros(shape=(Npat, Npat))
    for ix in range(1, Npat):
        x = dat[ix, :]
        for iy in range(ix):
            y = dat[iy, :]
            dist[ix, iy] = getMissDist(x, y)
            dist[iy, ix] = dist[ix, iy]
    return dist
Then, assuming dat is an N (number of cases) by P (number of features) data matrix with missing values, one can perform hierarchical clustering on dat as:
import scipy.spatial.distance as dist
import scipy.cluster.hierarchy as hier

distMat = getMissDistMat(dat)
condensDist = dist.squareform(distMat)
link = hier.linkage(condensDist, method='average')
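A small end-to-end sketch on toy data with NaNs (using the module aliases assumed above):

import numpy as np

dat = np.array([[1.0, 2.0, np.nan],
                [1.5, np.nan, 3.0],
                [4.0, 5.0, 6.0],
                [4.2, 4.8, np.nan]])

distMat = getMissDistMat(dat)
condensDist = dist.squareform(distMat)
link = hier.linkage(condensDist, method='average')
print(hier.fcluster(link, t=2, criterion='maxclust'))  # rows {0, 1} and {2, 3} land in separate clusters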
I have a collection of n-dimensional points and I want to find which 2 are the closest. The best I could come up with for 2 dimensions is:
import numpy as np

myArr = np.array([[1, 2],
                  [3, 4],
                  [5, 6],
                  [7, 8]])

n = myArr.shape[0]
cross = [[np.sum((myArr[i] - myArr[j]) ** 2), i, j]
         for i in range(n)
         for j in range(n)
         if i != j]

print(min(cross))
which gives
[8, 0, 1]
But this is too slow for large arrays. What kind of optimisation can I apply to it?
RELATED:
Euclidean distance between points in two different Numpy arrays, not within
Try scipy.spatial.distance.pdist(myArr). It gives you a condensed distance matrix; use argmin on it to find the index of the smallest value, and convert that index back into the pair of points.
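A sketch of that conversion, assuming the standard pdist ordering (which matches np.triu_indices with k=1):

import numpy as np
from scipy.spatial.distance import pdist

d = pdist(myArr)                      # condensed distances, length n*(n-1)/2
k = d.argmin()                        # position of the smallest pairwise distance
n = myArr.shape[0]
rows, cols = np.triu_indices(n, k=1)  # same pair order as pdist
i, j = rows[k], cols[k]
print(i, j, d[k])                     # -> 0 1 2.828... for the example array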
There's a whole Wikipedia page on exactly this problem: http://en.wikipedia.org/wiki/Closest_pair_of_points
Executive summary: you can achieve O(n log n) with a recursive divide-and-conquer algorithm (outlined on the wiki page above).
You could take advantage of the Delaunay triangulation tools in the latest version of SciPy (v0.9). You can be sure that the closest two points will form an edge of a simplex in the triangulation, which is a much smaller subset of pairs than every combination.
Here's the code (updated for general N-D):
import numpy
from scipy import spatial

def closest_pts(pts):
    # set up the triangulation; let Delaunay do the heavy lifting
    mesh = spatial.Delaunay(pts)
    # TODO: eliminate redundant edges (numpy.unique?)
    # (mesh.simplices was called mesh.vertices in older SciPy)
    edges = numpy.vstack((mesh.simplices[:, :dim], mesh.simplices[:, -dim:]))
    # the rest is easy
    x = mesh.points[edges[:, 0]]
    y = mesh.points[edges[:, 1]]
    dists = numpy.sum((x - y) ** 2, 1)
    idx = numpy.argmin(dists)
    # print('distance: ', dists[idx])
    # print('coords:\n', pts[edges[idx]])
    return edges[idx]

dim = 3
N = 1000 * dim
pts = numpy.random.random(N).reshape(N // dim, dim)
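A quick usage sketch; note that, as written above, only the first two columns of the returned row were actually compared:

closest = closest_pts(pts)   # vertex indices of the closest edge found
i, j = closest[0], closest[1]
print(i, j, numpy.linalg.norm(pts[i] - pts[j]))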
Empirically, the runtime appears to scale close to O(n) (timing plot omitted).
There is a SciPy function, pdist, that will get you the pairwise distances between points in an array in a fairly efficient manner:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
It outputs the N*(N-1)/2 unique pairs (since r_ij == r_ji). You can then search for the minimum value and avoid the whole loop mess in your code.
Perhaps you could proceed along these lines:
In []: from numpy import arange, where
In []: from numpy.random import randn
In []: from scipy.spatial.distance import pdist as pd, squareform as sf
In []: m = 1234
In []: n = 123
In []: p = randn(m, n)
In []: d = sf(pd(p))
In []: a = arange(m)
In []: d[a, a] = d.max()  # mask the zero diagonal so the minimum is a real pair
In []: where(d < d.min() + 1e-9)
Out[]: (array([701, 730]), array([730, 701]))
With substantially more points, you will need to exploit the hierarchical structure of your clustering somehow.
How fast is it compared to just doing a nested loop and keeping track of the shortest pair? I think creating a huge cross array is what might be hurting you. Even O(n^2) is still pretty quick if you're only working with 2-dimensional points.
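For comparison, a minimal sketch of that nested loop, which keeps only the running best pair instead of materializing all O(n^2) entries:

import numpy as np

def closest_pair_bruteforce(arr):
    """O(n^2) time but O(1) extra memory: track the best pair seen so far."""
    n = arr.shape[0]
    best = (float('inf'), -1, -1)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = float(((arr[i] - arr[j]) ** 2).sum())
            if d2 < best[0]:
                best = (d2, i, j)
    return best

print(closest_pair_bruteforce(myArr))  # smallest squared distance is 8, between rows 0 and 1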
The accepted answer is OK for small data sets, but its execution time scales as n**2. However, as pointed out by @payne, an optimal solution can achieve n*log(n) computation-time scaling.
This optimal solution can be obtained using sklearn.neighbors.BallTree, as follows.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import BallTree as tree
n = 10
dim = 2
xy = np.random.uniform(size=[n, dim])
# This solution is optimal when xy is very large
res = tree(xy)
dist, ids = res.query(xy, 2)
mindist = dist[:, 1] # second nearest neighbour
minid = np.argmin(mindist)
plt.plot(*xy.T, 'o')
plt.plot(*xy[ids[minid]].T, '-o')
This procedure scales well for very large sets of xy values and even for large dimensions dim (although the example illustrates the case dim=2). The resulting plot (not shown here) draws all the points and joins the closest pair with a line segment.
An identical solution can be obtained using scipy.spatial.cKDTree by replacing the sklearn import with the SciPy one below. Note, however, that cKDTree, unlike BallTree, does not scale well for high dimensions:
from scipy.spatial import cKDTree as tree