Calculate condensed distance matrix with varying length data points - python

SciPy's pdist function expects an evenly shaped 2-D NumPy array as input.
Working example:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()

dat0 = np.array([-1, 1, -3, 1])
dat1 = np.array([-1, 1, -3, 1])
dat2 = np.array([ 1, 1, 1, 1])

data = np.array([dat0, dat1, dat2])
distance_matrix = pdist(data, dfun)
squareform(distance_matrix)
I have a custom distance function that works with run-length encoded data, so the arrays may vary in size. When using the following input
dat0 = np.array([-1, 1, -4, 1])
dat1 = np.array([-1, 1, -3, 1, 1])
dat2 = np.array([ 1, -6])
a ValueError: A 2-dimensional array must be passed. is raised, even though the distance function would handle the input just fine. Is there an alternative way to calculate these values?
Edit: the distance function in the snippet above is just an example of a metric that does not care about the actual number of elements in a data point. In my case, https://github.com/mclmza/AWarp is used, which computes DTW for sparse data sets (example series: [1, -456, 1, 1, -23, 1]), so padding the data is not a valid option.

If I understand correctly, you want to compute the distances using awarp, but that distance function takes signals of varying length. So you need to avoid creating an array, because NumPy doesn't allow 'ragged' arrays. Then I think you can do this:
from itertools import combinations

import numpy as np
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()

dat0 = np.array([-1, 1, -4, 1])
dat1 = np.array([-1, 1, -3, 1, 1])
dat2 = np.array([ 1, -6])

data = [dat0, dat1, dat2]
dists = [dfun(a, b) for a, b in combinations(data, r=2)]
squareform(dists)
For your example, this yields:
array([[ 0, -4, -8],
       [-4,  0, -6],
       [-8, -6,  0]])
And if dfun = awarp then you get this output for those signals:
array([[0.        , 0.        , 2.23606798],
       [0.        , 0.        , 2.44948974],
       [2.23606798, 2.44948974, 0.        ]])
Note that this approach only works if dfun is symmetric, i.e. dfun(a, b) == dfun(b, a), which I think awarp is.
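If the metric were not symmetric, squareform would not apply; a minimal sketch (reusing data and dfun from the snippet above) that fills the full n-by-n matrix instead:
import numpy as np
from itertools import permutations

n = len(data)
full = np.zeros((n, n))
# evaluate dfun for every ordered pair, so asymmetric metrics are preserved
for i, j in permutations(range(n), r=2):
    full[i, j] = dfun(data[i], data[j])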


How to handle rounding to negative zero in Python docstring tests

Related to this question: How to have negative zero always formatted as positive zero in a python string?
I have the following function that implements MATLAB's orth.m using NumPy. I have a docstring test that relies on np.array2string using suppress_small=True, which will make small values round to zero. However, sometimes they round to positive zero and sometimes to negative zero, depending on whether the answer comes out as 1e-16 or -1e-17 or similar. Which case happens depends on the SVD decomposition and can vary from platform to platform or across Python versions, depending on which underlying linear algebra solver is used (BLAS, LAPACK, etc.).
What's the best way to design the docstring test to account for this?
In the final doctest, sometimes the Q[0, 1] term is -0. and sometimes it's 0.
import doctest
import numpy as np

def orth(A):
    r"""
    Orthogonalization basis for the range of A.

    That is, Q.T @ Q = I, the columns of Q span the same space as the columns of A, and the
    number of columns of Q is the rank of A.

    Parameters
    ----------
    A : 2D ndarray
        Input matrix

    Returns
    -------
    Q : 2D ndarray
        Orthogonalization matrix of A

    Notes
    -----
    #. Based on the MATLAB orth.m function.

    Examples
    --------
    >>> import numpy as np

    Full rank matrix

    >>> A = np.array([[1, 0, 1], [-1, -2, 0], [0, 1, -1]])
    >>> r = np.linalg.matrix_rank(A)
    >>> print(r)
    3

    >>> Q = orth(A)
    >>> with np.printoptions(precision=8):
    ...     print(Q)
    [[-0.12000026 -0.80971228  0.57442663]
     [ 0.90175265  0.15312282  0.40422217]
     [-0.41526149  0.5664975   0.71178541]]

    Rank deficient matrix

    >>> A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
    >>> r = np.linalg.matrix_rank(A)
    >>> print(r)
    2

    >>> Q = orth(A)
    >>> print(np.array2string(Q, precision=8, suppress_small=True))  # Sometimes this fails
    [[-0.70710678 -0.        ]
     [ 0.          1.        ]
     [-0.70710678  0.        ]]

    """
    # compute the SVD
    (Q, S, _) = np.linalg.svd(A, full_matrices=False)
    # calculate a tolerance based on the largest singular value (instead of just using a small number)
    tol = np.max(A.shape) * S[0] * np.finfo(float).eps
    # count the singular values that are greater than the calculated tolerance
    r = np.sum(S > tol, axis=0)
    # return the columns corresponding to the non-negligible singular values
    Q = Q[:, np.arange(r)]
    return Q

if __name__ == '__main__':
    doctest.testmod(verbose=False)
You can print the rounded array plus 0.0 to eliminate the -0:
A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
Q = orth(A)
Q[0, 1] = -1e-16  # simulate a small floating point deviation
print(np.array2string(Q.round(8) + 0.0, precision=8, suppress_small=True))
# [[-0.70710678  0.        ]
#  [ 0.          1.        ]
#  [-0.70710678  0.        ]]
So your docstring should be:
>>> Q = orth(A)
>>> print(np.array2string(Q.round(8) + 0.0, precision=8, suppress_small=True))  # guarantee non-negative zeros
[[-0.70710678  0.        ]
 [ 0.          1.        ]
 [-0.70710678  0.        ]]
Here's another alternative I came up with, although I think I like rounding the array to the given precision better. In this method, you shift the whole array by an amount that is bigger than the round-off error but smaller than the displayed precision. That way the small numbers will still always be slightly positive.
>>> Q = orth(A)
>>> print(np.array2string(Q + np.full(Q.shape, 1e-14), precision=8, suppress_small=True))
[[-0.70710678  0.        ]
 [ 0.          1.        ]
 [-0.70710678  0.        ]]

Normalizing vectors contained in an array

I've got an array, called X, where every element is a 2D vector itself. The diagonal of this array is filled with nothing but zero vectors.
Now I need to normalize every vector in this array, without changing its structure.
First I tried to calculate the norm of every vector and put the norms in an array, called N. After that I wanted to divide every element of X by the corresponding element of N.
Two problems occurred:
1) Many entries of N are zero, which is obviously a problem when I try to divide by them.
2) The shapes of the arrays don't match, so np.divide() doesn't work as expected.
Beyond that, I don't think it's a good idea to calculate N like this, because later on I want to be able to do the same with more than two vectors.
import numpy as np

# Example array
X = np.array([[[0, 0], [1, -1]], [[-1, 1], [0, 0]]])

# Array containing the norms
N = np.vstack((np.linalg.norm(X[0], axis=1), np.linalg.norm(X[1], axis=1)))

R = np.divide(X, N)
I want the output to look like this:
R = np.array([[[0, 0], [0.70710678, -0.70710678]], [[-0.70710678, 0.70710678], [0, 0]]])
You do not need to use sklearn. Just define a function and then use a list comprehension.
Assuming that the 0th dimension of X equals the number of 2D arrays you have, use this:
import numpy as np

# Example array
X = np.array([[[0, 0], [1, -1]], [[-1, 1], [0, 0]]])

def stdmtx(X):
    X = X - X.mean(axis=1)[:, np.newaxis]
    X = X / X.std(axis=1, ddof=1)[:, np.newaxis]
    return np.nan_to_num(X)

R = np.array([stdmtx(X[i, :, :]) for i in range(X.shape[0])])
The desired output R:
array([[[ 0.        ,  0.        ],
        [ 0.70710678, -0.70710678]],

       [[-0.70710678,  0.70710678],
        [ 0.        ,  0.        ]]])

Why is my SVD calculation different than numpy's SVD calculation of this matrix?

I'm trying to manually compute the SVD of the matrix A defined below, but I am having some problems. Computing it manually and with the svd method in NumPy yields two different results.
Computed manually below:
import numpy as np

A = np.array([[3, 2, 2], [2, 3, -2]])

V = np.linalg.eig(A.T @ A)[1]
U = np.linalg.eig(A @ A.T)[1]
S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0, 0]]

print(A)
print(U @ S @ V.T)
And computed via numpy's svd method:
X, Y, Z = np.linalg.svd(A)
Y = np.c_[np.diag(Y), [0, 0]]

print(A)
print(X @ Y @ Z)
When these two snippets are run, the manual calculation doesn't match the svd method. Why is there a discrepancy between the two?
Look at the eigenvalues returned by np.linalg.eig(A.T @ A):
In [57]: evals, evecs = np.linalg.eig(A.T @ A)

In [58]: evals
Out[58]: array([2.50000000e+01, 3.61082692e-15, 9.00000000e+00])
So (ignoring the normal floating point imprecision), it computed [25, 0, 9]. The eigenvectors associated with those eigenvalues are in the columns of evecs, in the same order. But your construction of S doesn't match that order; here's your S:
In [60]: S
Out[60]:
array([[5., 0., 0.],
       [0., 3., 0.]])
When you compute U @ S @ V.T, the values in S @ V.T are not correctly aligned.
As a quick fix, you can rerun your code with S set explicitly as follows:
S = np.array([[5, 0, 0],
              [0, 0, 3]])
With that change, your code outputs
[[ 3  2  2]
 [ 2  3 -2]]
[[-3. -2. -2.]
 [-2. -3.  2.]]
That's better, but why are the signs wrong? Now the problem is that you have independently computed U and V. Eigenvectors are not unique; they are a basis of an eigenspace, and such a basis is not unique. If the eigenvalue is simple, and if the vector is normalized to have length one (which numpy.linalg.eig does), there is still a choice of sign to be made. That is, if v is an eigenvector, then so is -v. The choices made by eig when computing U and V won't necessarily result in restoring the sign of A when U @ S @ V.T is computed.
It turns out that you can get the result that you expect by simply reversing all the signs in either U or V. Here is a modified version of your script that generates the output that you expected:
import numpy as np

A = np.array([[3, 2, 2],
              [2, 3, -2]])

U = np.linalg.eig(A @ A.T)[1]
V = -np.linalg.eig(A.T @ A)[1]
# S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0, 0]]
S = np.array([[5, 0, 0],
              [0, 0, 3]])

print(A)
print(U @ S @ V.T)
Output:
[[ 3  2  2]
 [ 2  3 -2]]
[[ 3.  2.  2.]
 [ 2.  3. -2.]]

How to normalize each vector of np.gradient elegantly?

Using the numpy gradient function, one obtains a list of arrays; e.g. in 3 dimensions, 3 arrays corresponding to the x, y, z axes. I would like to normalize the gradient for each element.
What I have right now is:
gradient = np.gradient(self.image)
gradient_norm = np.sqrt(sum(x**2 for x in gradient))
for dim in gradient:
    np.divide(dim, gradient_norm, out=dim)
    np.nan_to_num(dim, copy=False)
It seems highly verbose and inelegant for something which I think is not an exotic problem. Also the above does quite a bit of copying which I would like to avoid (as a bonus).
Compute the norm with np.linalg.norm and simply divide iteratively -
norms = np.linalg.norm(gradient, axis=0)
gradient = [np.where(norms == 0, 0, g / norms) for g in gradient]
Alternatively, if you don't mind an (n+1)-dimensional array as output -
out = np.where(norms == 0, 0, gradient / norms)
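Putting this together on a small synthetic image (a sketch; image stands in for self.image):
import numpy as np

image = np.arange(20, dtype=float).reshape(4, 5) ** 2
gradient = np.gradient(image)             # list of 2 arrays, one per axis
norms = np.linalg.norm(gradient, axis=0)  # per-pixel gradient magnitude
unit = [np.where(norms == 0, 0, g / norms) for g in gradient]  # zero-magnitude pixels stay 0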
np.linalg.norm can broadcast with the keepdims=True keyword argument:
g = (np.arange(9) - 4).reshape((3, 3))
g
Out[215]:
array([[-4, -3, -2],
       [-1,  0,  1],
       [ 2,  3,  4]])

col_norm = g / np.linalg.norm(g, axis=0, keepdims=True)
col_norm
Out[217]:
array([[-0.87287156, -0.70710678, -0.43643578],
       [-0.21821789,  0.        ,  0.21821789],
       [ 0.43643578,  0.70710678,  0.87287156]])

row_norm = g / np.linalg.norm(g, axis=1, keepdims=True)
row_norm
Out[219]:
array([[-0.74278135, -0.55708601, -0.37139068],
       [-0.70710678,  0.        ,  0.70710678],
       [ 0.37139068,  0.55708601,  0.74278135]])

How to plot pairwise distances of two-dimensional vectors?

I have a set of data in Python with three columns:
x y angle
I want to calculate the distance between every possible pair of points and plot these distances against the difference between the two angles.
import numpy as np
import matplotlib.pyplot as plt

x, y, a = np.loadtxt('w51e2-pa-2pk.log', unpack=True)

n = 0
f = ((x[n] - x[n+1:])**2 + (y[n] - y[n+1:])**2)**0.5
d = a[n] - a[n+1:]
plt.scatter(f, d)
There are 255 points in my data.
f is the distance and d is the difference between two angles.
My question is: can I set n = 1, 2, 3, ..., 255 and redo the calculation to get f and d for all possible pairs?
You can obtain the pairwise distances through broadcasting by considering it as an outer operation on the array of 2-dimensional vectors as follows:
vecs = np.stack((x, y)).T
np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
For example,
In [1]: import numpy as np
   ...: x = np.array([1, 2, 3])
   ...: y = np.array([3, 4, 6])
   ...: vecs = np.stack((x, y)).T
   ...: np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
Out[1]:
array([[0.        , 1.41421356, 3.60555128],
       [1.41421356, 0.        , 2.23606798],
       [3.60555128, 2.23606798, 0.        ]])
Here, the (i, j)'th entry is the distance between the i'th and j'th vectors.
The case of the pairwise differences between angles is similar, but simpler, as you only have one dimension to deal with:
In [2]: a = np.array([10, 12, 15])
   ...: a[np.newaxis, :] - a[:, np.newaxis]
Out[2]:
array([[ 0,  2,  5],
       [-2,  0,  3],
       [-5, -3,  0]])
Moreover, plt.scatter does not care that the results are given as matrices, so putting everything together using the notation of the question, you can plot the angle differences against the distances with something like
vecs = np.stack((x, y)).T
f = np.linalg.norm(vecs[np.newaxis, :] - vecs[:, np.newaxis], axis=2)
d = angle[np.newaxis, :] - angle[:, np.newaxis]
plt.scatter(f, d)
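Equivalently (a sketch using SciPy instead of the broadcasting above), scipy.spatial.distance.pdist computes the same pairwise distances in condensed form, and squareform expands them to the full matrix:
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = np.array([1, 2, 3])
y = np.array([3, 4, 6])
vecs = np.stack((x, y)).T
f = squareform(pdist(vecs))  # same matrix as the broadcasting approach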
You can use a for loop and range() to iterate over n, e.g. like this:
n = len(x)
for i in range(n):
    # do something with the current index
    # e.g. print the points
    print(x[i])
    print(y[i])
But note that if you use i+1 inside the last iteration, this will already be outside of your list.
Also note that x[n] is a single value in your data while x[n+1:] is the subsequence starting at the (n+1)'th element; subtracting them works here only because np.loadtxt returns NumPy arrays, which broadcast the scalar over the slice (with plain Python lists it would raise a TypeError).
You may even have to use two nested loops to do what you want. Since you want the distance between each pair of points, a two-dimensional array may be the data structure you want.
If you are interested in all combinations of the points in x and y, I suggest using itertools, which gives you all possible combinations. Then you can do it as follows:
import itertools

n = len(x)
f = [((x[i] - x[j])**2 + (y[i] - y[j])**2)**0.5
     for i, j in itertools.product(range(n), range(n)) if i != j]
# and similar for the angles
But maybe there is even an easier way...
