I have two vectors, say x = [2, 4, 6, 7] and y = [2, 6, 7, 8], and I want to find the Euclidean distance, or any other distance implemented in scipy, between each corresponding pair. That would be
dist=[0, 2, 1, 1].
When I try
dist = scipy.spatial.distance.cdist(x,y, metric='sqeuclidean')
or
dist = [scipy.spatial.distance.cdist(x,y, metric='sqeuclidean') for x,y in zip(x,y)]
I get
ValueError: XA must be a 2-dimensional array.
How am I supposed to calculate dist, and why do I have to reshape the data for that purpose?
cdist does not compute the list of distances between corresponding pairs, but the matrix of distances between all pairs.
np.linalg.norm((np.asarray(x)-np.asarray(y))[:, None], axis=1)
is how I'd typically write this for the Euclidean distance between n-dimensional points; but if you are only dealing with 1-dimensional points, the absolute difference, as suggested by elpres, would be simpler.
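As a quick sanity check with the vectors from the question (a minimal sketch; for 1-D points the per-pair Euclidean distance reduces to the absolute difference):
import numpy as np

x = np.asarray([2, 4, 6, 7])
y = np.asarray([2, 6, 7, 8])

# per-pair Euclidean distance; for 1-D points this is just |x - y|
print(np.linalg.norm((x - y)[:, None], axis=1))  # [0. 2. 1. 1.]
print(np.abs(x - y))                             # [0 2 1 1]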
New to scipy. I am trying to use the cdist function to pick the greatest distance between vectors. My attempt is
dm = cdist(XA, XB, lambda u, v: np.max(np.sqrt(((u-v)**2).sum())))
but it doesn't seem to produce the correct result. Any suggestions?
The cdist function returns an N×M matrix containing all distances between the N vectors of XA and the M vectors of XB. If you want the max distance, regardless of which pair of vectors produces it, you need to ravel() the 2-D array into a 1-D array and then look for the max() value:
dm = cdist(XA, XB, metric='euclidean').ravel().max()
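A minimal runnable version of the same idea, with made-up random data standing in for XA and XB:
import numpy as np
from scipy.spatial.distance import cdist

XA = np.random.rand(5, 3)  # 5 vectors in 3 dimensions
XB = np.random.rand(4, 3)  # 4 vectors in 3 dimensions

# cdist returns the full 5x4 matrix of pairwise distances;
# ravel() flattens it and max() picks the greatest entry
dm = cdist(XA, XB, metric='euclidean').ravel().max()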
I have two arrays of x,y,z coordinates (e.g. a = [(x1,y1,z1)...(xN,yN,zN)], b = [(X1,Y1,Z1)...(XN,YN,ZN)]). I need the fastest way to iterate through them and find, for each point in a, the index of the point in b with the minimum Euclidean distance. Here's the catch: I'm using a modified/weighted Euclidean equation. Currently I'm doing two for loops, which is admittedly the slowest way to do it.
b typically has around 500 coordinate sets to choose from, but a can have tens to hundreds of thousands.
As an example:
a = (1,1,1), b = [(87,87,87),(2,2,2),(50,50,50)]
would return index 1.
You could create a k-d tree of array b and find the nearest distance of a coordinate in array a by traversing down the tree.
For array a of size n and array b of size m, the complexity would be O(m log m) for building the tree and O(n log m) for finding all the nearest distances.
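A minimal sketch with scipy.spatial.cKDTree and the example data above; the per-axis weights are a placeholder for your modified metric (pre-scaling each axis by the square root of its weight turns a weighted Euclidean distance into a plain Euclidean one):
import numpy as np
from scipy.spatial import cKDTree

a = np.array([(1, 1, 1)])
b = np.array([(87, 87, 87), (2, 2, 2), (50, 50, 50)])

# hypothetical per-axis weights for the modified Euclidean metric
w = np.sqrt(np.array([1.0, 1.0, 1.0]))

tree = cKDTree(b * w)               # build the tree once, O(m log m)
dists, indices = tree.query(a * w)  # one nearest neighbour per point in a
print(indices)                      # [1]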
To solve a problem I need the Manhattan distances between all the vectors. I tried sklearn.metrics.pairwise_distances, but the result was too large to hold in memory, so to decrease the memory footprint I used scipy.spatial.distance.pdist to get the condensed 1-D array of distances.
I used the formula below:
index = diagonalShape*(diagonalShape-1)/2 - (diagonalShape-i)*(diagonalShape-i-1)/2 + j - i - 1
to calculate the index into the condensed array that holds the distance between points i and j.
I've observed that for many entries the distances differ between scipy and sklearn. Why is this so, when the formula used for calculating cityblock distances is the same in both libraries?
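For reference, a small sketch checking that indexing formula against squareform on made-up data (n plays the role of diagonalShape):
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(6, 3)
n = X.shape[0]  # "diagonalShape" in the formula above

condensed = pdist(X, metric='cityblock')
square = squareform(condensed)

i, j = 1, 4  # any pair with i < j
index = n*(n - 1)//2 - (n - i)*(n - i - 1)//2 + j - i - 1
assert condensed[index] == square[i, j]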
I'm using the module hcluster to calculate a dendrogram from a distance matrix. My distance matrix is an array of arrays generated like this:
import hcluster
import numpy as np
mols = (..a list of molecules)
distMatrix = np.zeros((10, 10))
for i in range(10):
    for j in range(10):
        sim = OETanimoto(mols[i], mols[j])  # a function to calculate similarity between molecules
        distMatrix[i][j] = 1 - sim
I then use the command distVec = hcluster.squareform(distMatrix) to convert the matrix into a condensed vector and calculate the linkage matrix with vecLink = hcluster.linkage(distVec).
All this works fine, but if I calculate the linkage matrix using the distance matrix rather than the condensed vector, matLink = hcluster.linkage(distMatrix), I get a different linkage matrix (the distances between the nodes are a lot larger and the topology is slightly different).
Now I'm not sure whether this is because hcluster only works with condensed vectors or whether I'm making a mistake along the way.
Thanks for your help!
I knocked up a quick random example similar to yours and experienced the same problem.
In the docstring it does say:
Performs hierarchical/agglomerative clustering on the condensed distance matrix y. y must be a :math:`{n \choose 2}` sized vector where n is the number of original observations paired in the distance matrix.
However, having had a quick look at the code, it seems like the intent is for it to work with both vector-shaped and matrix-shaped input:
In hierarchy.py there is a switch based upon the shape of the matrix.
It seems however that the key bit of info is in the function linkage's docstring:
- Q : ndarray
A condensed or redundant distance matrix. A condensed
distance matrix is a flat array containing the upper
triangular of the distance matrix. This is the form that
``pdist`` returns. Alternatively, a collection of
:math:`m` observation vectors in n dimensions may be passed as
a :math:`m` by :math:`n` array.
So I think that the interface doesn't allow the passing of a square distance matrix. Instead it thinks you are passing it m observation vectors in n dimensions.
Hence the difference in result?
Does that seem reasonable?
Otherwise, just take a look at the code itself; I'm sure you'll be able to debug it and figure out why your examples are different.
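A minimal sketch of the difference, using scipy.cluster.hierarchy (which provides the same linkage/squareform interface as hcluster) and a made-up 3x3 distance matrix:
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# a small symmetric distance matrix with a zero diagonal
D = np.array([[0.0, 0.3, 0.8],
              [0.3, 0.0, 0.5],
              [0.8, 0.5, 0.0]])

Z_condensed = linkage(squareform(D))  # D interpreted as pairwise distances
Z_square = linkage(D)                 # D interpreted as 3 observations in 3-D
print(np.allclose(Z_condensed, Z_square))  # False: different trees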
Cheers
Matt
I am new to NumPy and I would like to ask how to calculate the Euclidean distance between points stored in a vector.
Let's assume that we have a numpy.array in which each row is a vector, plus a single numpy.array holding one point. I would like to know if it is possible to calculate the Euclidean distance between all the points and this single point and store the results in one numpy.array.
Here is an interface:
points  # 2d list of row-vectors
singlePoint  # one row-vector
listOfDistances = procedure(points, singlePoint)
Can we have something like this?
Or is it possible to use a single command, treating the single point like a list of other points, so that at the end we get a matrix of distances?
Thanks
To get the distance you can use the norm function of the linalg module in numpy:
np.linalg.norm(x - y)
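To get all the distances to a single point in one call, the same norm accepts an axis argument (a small sketch, assuming points is an N×2 array of row-vectors):
import numpy as np

points = np.arange(20).reshape((10, 2))
single_point = np.array([3, 4])

# one Euclidean distance per row of points
listOfDistances = np.linalg.norm(points - single_point, axis=1)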
While you can use vectorize, @Karl's approach will be rather slow with numpy arrays.
The easier approach is to just do np.hypot(*(points - single_point).T). (The transpose assumes that points is an Nx2 array, rather than a 2xN one. If it's 2xN, you don't need the .T.)
However, this is a bit unreadable, so you can write it out more explicitly like this (using some canned example data...):
import numpy as np
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
dist = (points - single_point)**2
dist = np.sum(dist, axis=1)
dist = np.sqrt(dist)
import numpy as np
def distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2) ** 2))
To apply a function to each element of a numpy array, try numpy.vectorize.
To do the actual calculation, we need the square root of the sum of squares of differences (whew!) between pairs of coordinates in the two vectors.
We can use zip to pair the coordinates, and sum with a comprehension to sum up the results. That looks like:
sum((x - y) ** 2 for (x, y) in zip(singlePoint, pointFromArray)) ** 0.5
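Putting that together into a runnable sketch (the names singlePoint and pointFromArray follow the interface in the question):
import numpy as np

def distance(singlePoint, pointFromArray):
    # square root of the sum of squared coordinate differences
    return sum((x - y) ** 2 for (x, y) in zip(singlePoint, pointFromArray)) ** 0.5

points = np.arange(20).reshape((10, 2))
singlePoint = [3, 4]
listOfDistances = [distance(singlePoint, p) for p in points]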
import numpy as np

def euclid_dist(t1, t2):
    # one distance per row, via broadcasting over t2
    return np.sqrt(((t1 - t2) ** 2).sum(axis=1))

single_point = [3, 4]
points = np.arange(20).reshape((10, 2))
distance = euclid_dist(single_point, points)