Scipy: Calculation of standardized euclidean via cdist

The formula is available in the docs and pointed to in this answer. However, when I try to apply it I don't get a matching answer. I'm sure there's some silly mistake I'm making somewhere, so thanks for bearing with me:
Setup
Say I have 2 matrices:
X: array([[0, 1, 0],
[1, 1, 1]])
X2: array([[1, 1, 0],
[1, 1, 1],
[1, 2, 0]])
Now applying Xans = scipy.spatial.distance.cdist(X, X2, 'seuclidean') gives:
Xans: array([[2.23606798, 2.88675135, 3.16227766],
[1.82574186, 0. , 2.88675135]])
Let's just focus on Xans[0][0] = 2.23606798, which should have been obtained by applying seuclidean(X[0], X2[0]).
Method 1: Using pdist
I tried doing this via pdist but get a NaN:
In [104]: scipy.spatial.distance.pdist([X[0], X2[0]], metric='seuclidean')
Out[104]: array([nan])
Why is this happening?
Method 2: Direct Formula Application
I tried manually using the formula linked in the answer above as follows:
In [107]: (((X[0] - X2[0])**2).sum()/(np.var([X[0], X2[0]])))**0.5
Out[107]: 2.0
As can be seen, this gives 2.0 rather than 2.23606798.
I'm clearly doing something very wrong. What is it?

The standardized Euclidean distance weights each variable with a separate variance. If you don't provide the variances with the V argument, it computes them from the input array. This is mentioned in the pdist docstring in the "Parameters" section under **kwargs, where it shows:
V : ndarray
The variance vector for standardized Euclidean.
Default: var(X, axis=0, ddof=1)
For example:
In [39]: A
Out[39]:
array([[3, 0, 2],
[2, 1, 2],
[0, 0, 1],
[3, 1, 2],
[1, 0, 0]])
In [40]: from scipy.spatial.distance import pdist
In [41]: pdist(A, metric='seuclidean')
Out[41]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
We get the same result if we provide the variances computed as explained in the docstring:
In [42]: pdist(A, metric='seuclidean', V=np.var(A, axis=0, ddof=1))
Out[42]:
array([ 1.98029509, 2.55814731, 1.82574186, 2.71163072, 2.63368079,
0.76696499, 2.9868995 , 3.14284123, 1.35581536, 3.26898677])
Of course, if you provide variances that are all 1, you get the regular Euclidean distance:
In [43]: pdist(A, metric='seuclidean', V=np.ones(A.shape[1]))
Out[43]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
In [44]: pdist(A, metric='euclidean')
Out[44]:
array([ 1.41421356, 3.16227766, 1. , 2.82842712, 2.44948974,
1. , 2.44948974, 3.31662479, 1.41421356, 3. ])
The problem with your "Method 1" is that in your input array of just two points (i.e. [X[0], X2[0]]), the second and third components of the points don't change, so the variance associated with those components is 0:
In [45]: p = np.array([X[0], X2[0]])
In [46]: p
Out[46]:
array([[0, 1, 0],
[1, 1, 0]])
In [47]: np.var(p, axis=0, ddof=1)
Out[47]: array([ 0.5, 0. , 0. ])
When the seuclidean code divides by these variances, the result is either infinity or NaN. It is NaN when the numerator is also 0, which is the case in the second and third components of the input [X[0], X2[0]].
To work around this, you have to decide how you want to handle the case where the variance of a component is 0, and handle it explicitly. For example, if you want it to act like that variance is 1 in that case (just to avoid dividing by 0) you could do something like the following.
Suppose B is our array of points. The third column of B is all 1s.
In [63]: B
Out[63]:
array([[3, 0, 1],
[2, 1, 1],
[0, 0, 1],
[3, 1, 1],
[1, 0, 1]])
Compute the variances of the columns:
In [64]: V = np.var(B, axis=0, ddof=1)
In [65]: V
Out[65]: array([ 1.7, 0.3, 0. ])
Replace the variances that are 0 with 1:
In [66]: V[V == 0] = 1
In [67]: V
Out[67]: array([ 1.7, 0.3, 1. ])
Use V to compute the standardized Euclidean distances:
In [68]: pdist(B, metric='seuclidean', V=V)
Out[68]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
This has the same effect as simply removing the constant column:
In [69]: pdist(B[:, :2], metric='seuclidean')
Out[69]:
array([ 1.98029509, 2.30089497, 1.82574186, 1.53392998, 2.38459106,
0.76696499, 1.98029509, 2.93725228, 0.76696499, 2.38459106])
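Applied to the two-point input from the question's "Method 1", the same workaround might look like the sketch below. The helper name seuclidean_pdist_safe is hypothetical, and treating a zero variance as 1 is only one possible policy, not something SciPy does for you.

```python
import numpy as np
from scipy.spatial.distance import pdist

def seuclidean_pdist_safe(points):
    """Hypothetical helper: seuclidean pdist with zero variances replaced by 1."""
    points = np.asarray(points, dtype=float)
    V = np.var(points, axis=0, ddof=1)  # per-component sample variance
    V[V == 0] = 1.0                     # assumed policy: avoid dividing by 0
    return pdist(points, metric='seuclidean', V=V)

X = np.array([[0, 1, 0], [1, 1, 1]])
X2 = np.array([[1, 1, 0], [1, 1, 1], [1, 2, 0]])

d = seuclidean_pdist_safe([X[0], X2[0]])
print(d)  # a finite distance instead of nan
```

Here only the first component differs, and its variance is 0.5, so the result is sqrt(1 / 0.5).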
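Your "Method 2" is wrong because the formula is applied incorrectly. You have to keep a separate variance for each component, but np.var([X[0], X2[0]]) computes the (single) variance of all the values in the input. Instead, you need to use the axis and ddof arguments shown above. Note also that, according to the cdist docstring, the default V for cdist is var(vstack([XA, XB]), axis=0, ddof=1), so the variances come from all rows of both inputs and not just from the two points being compared.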
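As a sketch of how Xans[0][0] can be reproduced by hand, the snippet below assumes the default variances are taken over the stacked rows of X and X2 with ddof=1, as the cdist docstring describes. The variable name manual is only for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0, 1, 0], [1, 1, 1]])
X2 = np.array([[1, 1, 0], [1, 1, 1], [1, 2, 0]])

# Per-component variances over all rows of both inputs (assumed cdist default).
V = np.var(np.vstack([X, X2]), axis=0, ddof=1)

# Standardized Euclidean distance between X[0] and X2[0], computed by hand:
# each squared difference is divided by that component's variance.
manual = np.sqrt((((X[0] - X2[0]) ** 2) / V).sum())

Xans = cdist(X, X2, 'seuclidean')
print(V)
print(manual, Xans[0, 0])
```

With these inputs the hand-computed value agrees with the 2.23606798 reported by cdist in the question.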

Related

Losing decimal when doing array operation in Python

I am writing a function that, among other things, divides each column by its column sum. Here is what I came up with:
A = np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
print(A)
A = A.T
Asum = A.sum(axis=1)
print(Asum)
for i in range(len(Asum)):
A[:,i] = A[:,i]/Asum[i]
I was hoping for a matrix of decimals, but the values are automatically turned into integers, so I get a zero matrix. Where did I go wrong?

Replace:
Asum = A.sum(axis=1)
with:
Asum = A.sum(axis=0)
to get the column-by-column sum.
You can also do the division with numpy.divide:
np.divide(A, Asum)
#array([[0.1, 0.1, 0.1],
# [0.2, 0.2, 0.2],
# [0.3, 0.3, 0.3],
# [0.4, 0.4, 0.4]])
Or simply with:
A/Asum

Your A is integer dtype; assigned floats get truncated. If A started as a float array your iteration would work. But you don't need to iterate to perform this calculation:
In [108]: A = np.array([[1,2,3,4],[1,2,3,4],[1,2,3,4]]).T
In [109]: A
Out[109]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [110]: Asum = A.sum(axis=1)
In [111]: Asum
Out[111]: array([ 3, 6, 9, 12])
A is (4,3), Asum is (4,). If we make it (4,1):
In [114]: Asum[:,None]
Out[114]:
array([[ 3],
[ 6],
[ 9],
[12]])
we can perform the divide without iteration (review broadcasting if necessary):
In [115]: A/Asum[:,None]
Out[115]:
array([[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333],
[0.33333333, 0.33333333, 0.33333333]])
sum has a keepdims parameter that makes this kind of calculation easier:
In [117]: Asum = A.sum(axis=1, keepdims=True)
In [118]: Asum
Out[118]:
array([[ 3],
[ 6],
[ 9],
[12]])
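Putting the dtype point and keepdims together, a minimal sketch of the column normalization asked about in the question could look like this. It assumes the goal is to divide each column of the transposed array by its own sum; the names A_float, col_sums and normalized are illustrative.

```python
import numpy as np

# Transposed array from the question: shape (4, 3), integer dtype.
A = np.array([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]).T

A_float = A.astype(float)                      # float copy so fractions are kept
col_sums = A_float.sum(axis=0, keepdims=True)  # shape (1, 3)
normalized = A_float / col_sums                # broadcasting divides each column by its sum
print(normalized)
```

Working on a float copy avoids the truncation that happens when float results are assigned back into an integer array.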

How to index elements from a column of a ndarray such that the output is a column vector?

I have an nx2 array of points represented as a ndarray. I want to index some of the elements (indices are given in a ndarray as well) of one of the two column vectors such that the output is a column vector. If however the index array contains only one index, a (1,)-shaped array should be returned.
I already tried the following things without success:
import numpy as np
points = np.array([[0, 1], [1, 1.5], [2.5, 0.5], [4, 1], [5, 2]])
index = np.array([0, 1, 2])
points[index, [0]] -> array([0. , 1. , 2.5]) -> shape (3,)
points[[index], 0] -> array([[0. , 1. , 2.5]]) -> shape (1, 3)
points[[index], [0]] -> array([[0. , 1. , 2.5]]) -> shape (1, 3)
points[index, 0, np.newaxis] -> array([[0. ], [1. ], [2.5]]) -> shape(3, 1) # desired
np.newaxis works for this scenario however if the index array only contains one value it does not deliver the right shape:
import numpy as np
points = np.array([[0, 1], [1, 1.5], [2.5, 0.5], [4, 1], [5, 2]])
index = np.array([0])
points[index, 0, np.newaxis] -> array([[0.]]) -> shape (1, 1)
points[index, [0]] -> array([0.]) -> shape (1,) # desired
Is there a way to index the ndarray such that the output has shape (3,1) for the first example and (1,) for the second, without case distinctions based on the size of the index array?
Thanks in advance for your help!

In [329]: points = np.array([[0, 1], [1, 1.5], [2.5, 0.5], [4, 1], [5, 2]])
...: index = np.array([0, 1, 2])
We can select 3 rows with:
In [330]: points[index,:]
Out[330]:
array([[0. , 1. ],
[1. , 1.5],
[2.5, 0.5]])
However if we select a column as well, the result is 1d, even if we use [0]. That's because the (3,) row index is broadcast against the (1,) column index, resulting in a (3,) result:
In [331]: points[index,0]
Out[331]: array([0. , 1. , 2.5])
In [332]: points[index,[0]]
Out[332]: array([0. , 1. , 2.5])
If we give the row index a (3,1) shape, the result is also (3,1):
In [333]: points[index[:,None],[0]]
Out[333]:
array([[0. ],
[1. ],
[2.5]])
In [334]: points[index[:,None],0]
Out[334]:
array([[0. ],
[1. ],
[2.5]])
We get the same thing if we use a row slice:
In [335]: points[0:3,[0]]
Out[335]:
array([[0. ],
[1. ],
[2.5]])
Using [index] doesn't help because it makes the row index (1,3) shape, resulting in a (1,3) result. Of course you could transpose it to get (3,1).
With a 1 element index:
In [336]: index1 = np.array([0])
In [337]: points[index1[:,None],0]
Out[337]: array([[0.]])
In [338]: _.shape
Out[338]: (1, 1)
In [339]: points[index1,0]
Out[339]: array([0.])
In [340]: _.shape
Out[340]: (1,)
If the row index was a scalar, as opposed to 1d:
In [341]: index1 = np.array(0)
In [342]: points[index1[:,None],0]
...
IndexError: too many indices for array
In [343]: points[index1[...,None],0] # use ... instead
Out[343]: array([0.])
In [344]: points[index1, 0] # scalar result
Out[344]: 0.0
I think handling the np.array([0]) case separately requires an if test. At least I can't think of a builtin numpy way of burying it.
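For completeness, here is one way the if test could be wrapped in a small function. The name select_column is hypothetical, and the sketch assumes index is a 0-d or 1-d integer array.

```python
import numpy as np

def select_column(points, index, col=0):
    """Hypothetical helper: (n, 1) for several indices, (1,) for a single index."""
    index = np.atleast_1d(index)
    if index.size == 1:
        return points[index, col]       # shape (1,)
    return points[index[:, None], col]  # shape (n, 1)

points = np.array([[0, 1], [1, 1.5], [2.5, 0.5], [4, 1], [5, 2]])
print(select_column(points, np.array([0, 1, 2])).shape)  # (3, 1)
print(select_column(points, np.array([0])).shape)        # (1,)
```

The branch is still there, but it lives in one place instead of at every call site.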

I'm not certain I understand the wording in your question, but it seems as though you may be after the ndarray.swapaxes method (see https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.ndarray.swapaxes.html#numpy.ndarray.swapaxes)
for your snippet:
points = np.array([[0, 1], [1, 1.5], [2.5, 0.5], [4, 1], [5, 2]])
swapped = points.swapaxes(0,1)
print(swapped)
gives
[[0. 1. 2.5 4. 5. ]
[1. 1.5 0.5 1. 2. ]]

Numpy calculation of eigenvectors is incorrect

I ran the following in Python and expected the columns in E[1] to be the eigenvectors of A, but they are not. Only Sympy.Matrix.eigenvects() seems to get it right. Why this error?
A
Out[194]:
matrix([[-3, 3, 2],
[ 1, -1, -2],
[-1, -3, 0]])
E = np.linalg.eig(A)
E
Out[196]:
(array([ 2., -4., -2.]),
matrix([[ -2.01889132e-16, 9.48683298e-01, 8.94427191e-01],
[ 5.54700196e-01, -3.16227766e-01, -3.71551690e-16],
[ -8.32050294e-01, 2.73252305e-17, 4.47213595e-01]]))
A*E[1] / E[1]
Out[205]:
matrix([[ 6.59900617, -4. , -2. ],
[ 2. , -4. , -3.88449298],
[ 2. , 8.125992 , -2. ]])

The eigenvectors are correct, within an expected margin of error.
What you discovered is that testing eigenvectors with element-wise division is a bad idea.
A better way is to compute the norm of the difference between matrix*vector and eigenvalue*vector.
NumPy performs computations in floating point arithmetic, which in double precision has about 53 bits of precision (52 of them stored explicitly). This means any of its answers may contain numerical errors of relative size around 2**(-52), which is about 2e-16. So, when you see a number like 2e-16 coming from a calculation with numbers of size 1-3, the conclusion is: "that number should probably be zero, and the value we have for it is likely just noise". And if you divide by that number, noise is all you get.
SymPy, on the other hand, performs symbolic manipulations, so its answer (when it can get one) is exactly what the theory predicts.
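A minimal sketch of the norm-based check suggested above, using the matrix from the question. The variable name residuals is only for illustration.

```python
import numpy as np

A = np.array([[-3, 3, 2],
              [1, -1, -2],
              [-1, -3, 0]])

w, v = np.linalg.eig(A)

# Residual norm ||A v_i - w_i v_i|| for each eigenpair; the eigenvectors are
# the columns of v, and v * w scales column i by w[i] through broadcasting.
residuals = np.linalg.norm(A @ v - v * w, axis=0)
print(residuals)
```

Residuals close to machine precision indicate that each column of v is an eigenvector to within rounding error, with no division by near-zero entries involved.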

From its docs:
The number w is an eigenvalue of a if there exists a vector v such that dot(a,v) = w * v. Thus, the arrays a, w, and v satisfy the equations dot(a[:,:], v[:,i]) = w[i] * v[:,i] for i \in {0,...,M-1}.
With your matrix:
In [1]: A = np.array([[-3, 3, 2],
...: [ 1, -1, -2],
...: [-1, -3, 0]])
...:
In [2]: w,v=np.linalg.eig(A)
In [3]: w
Out[3]: array([ 2., -4., -2.])
In [4]: v
Out[4]:
array([[ -9.39932874e-17, 9.48683298e-01, 8.94427191e-01],
[ 5.54700196e-01, -3.16227766e-01, 1.93473310e-16],
[ -8.32050294e-01, -4.08811066e-17, 4.47213595e-01]])
In [5]: np.dot(A,v)
Out[5]:
array([[ -2.22044605e-16, -3.79473319e+00, -1.78885438e+00],
[ 1.10940039e+00, 1.26491106e+00, -7.77156117e-16],
[ -1.66410059e+00, 4.44089210e-16, -8.94427191e-01]])
In [6]: w*v
Out[6]:
array([[ -1.87986575e-16, -3.79473319e+00, -1.78885438e+00],
[ 1.10940039e+00, 1.26491106e+00, -3.86946619e-16],
[ -1.66410059e+00, 1.63524427e-16, -8.94427191e-01]])
In [7]: np.dot(A,v)-w*v
Out[7]:
array([[ -3.40580301e-17, 8.88178420e-16, 2.22044605e-16],
[ 8.88178420e-16, -6.66133815e-16, -3.90209498e-16],
[ -2.22044605e-16, 2.80564783e-16, -3.33066907e-16]])
In [8]: np.allclose(np.dot(A,v), w*v)
Out[8]: True
So, yes, the documented test is satisfied, within floating point limits.
einsum can be used to highlight the i axis in the dot calculation.
In [10]: np.einsum('...k,ki->...i',A,v)
Out[10]:
array([[ -2.22044605e-16, -3.79473319e+00, -1.78885438e+00],
[ 1.10940039e+00, 1.26491106e+00, -7.77156117e-16],
[ -1.66410059e+00, 3.88578059e-16, -8.94427191e-01]])
When I divide by v (element-wise), the result matches the eigenvalues 2, -4, -2, except where v and the dot product are virtually 0 (1e-16 or smaller).
In [11]: np.einsum('...k,ki->...i',A,v)/v
Out[11]:
array([[ 2.36234534, -4. , -2. ],
[ 2. , -4. , -4.01686475],
[ 2. , -9.50507681, -2. ]])

Replace all elements of a matrix by their inverses

I've got a simple problem and I can't figure out how to solve it.
Here is a matrix: A = np.array([[1,0,3],[0,7,9],[0,0,8]]).
I want to find a quick way to replace all elements of this matrix by their inverses, excluding of course the zero elements.
I know, thanks to the search engine of Stackoverflow, how to replace an element by a given value with a condition. However, I cannot figure out how to replace elements by new elements that depend on the previous ones (e.g. squared elements, inverses, etc.)

Use 1. / A (notice the dot for Python 2):
>>> A
array([[1, 0, 3],
[0, 7, 9],
[0, 0, 8]])
>>> 1./A
array([[ 1. , inf, 0.33333333],
[ inf, 0.14285714, 0.11111111],
[ inf, inf, 0.125 ]])
Or if your array has dtype float, you can do it in-place without warnings:
>>> A = np.array([[1,0,3], [0,7,9], [0,0,8]], dtype=np.float64)
>>> A[A != 0] = 1. / A[A != 0]
>>> A
array([[ 1. , 0. , 0.33333333],
[ 0. , 0.14285714, 0.11111111],
[ 0. , 0. , 0.125 ]])
Here we use A != 0 to select only those elements that are non-zero.
However if you try this on your original array you'd see
array([[1, 0, 0],
[0, 0, 0],
[0, 0, 0]])
because your array can only hold integers, and the inverses of all the other elements are truncated to 0.
In general, NumPy arithmetic on arrays is element-wise and vectorized, so to square the elements you can write:
>>> A = np.array([[1,0,3],[0,7,9],[0,0,8]])
>>> A * A
array([[ 1, 0, 9],
[ 0, 49, 81],
[ 0, 0, 64]])

And just a note on Antti Haapala's answer, (Sorry, I can't comment yet)
if you wanted to keep the 0's, you could use
B=1./A #I use the 1. to make sure it uses floats
B[B==np.inf]=0
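Another option, sketched here as an alternative to the answers above, is to pass the where and out arguments of numpy.divide so that the zero entries are never divided at all. This assumes the zeros should stay 0 in the result.

```python
import numpy as np

A = np.array([[1, 0, 3], [0, 7, 9], [0, 0, 8]])

# Start from zeros so the entries skipped by `where` stay 0.
B = np.zeros(A.shape, dtype=float)
np.divide(1.0, A, out=B, where=(A != 0))
print(B)
```

Because the division is skipped where A is 0, no divide-by-zero warning is raised and no inf values need to be cleaned up afterwards.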

Weird behavior when squaring elements in numpy array

I have two numpy arrays of shape (1, 250000):
a = [[ 0 254 1 ..., 255 0 1]]
b = [[ 1 0 252 ..., 0 255 255]]
I want to create a new numpy array whose elements are the square root of the sum of squares of elements in the arrays a and b, but I am not getting the correct result:
>>> c = np.sqrt(np.square(a)+np.square(b))
>>> print c
[[ 1. 2. 4.12310553 ..., 1. 1. 1.41421354]]
Am I missing something simple here?

Presumably your arrays a and b are arrays of unsigned 8 bit integers--you can check by inspecting the attribute a.dtype. When you square them, the data type is preserved, and the 8 bit values overflow, which means the values "wrap around" (i.e. the squared values are modulo 256):
In [7]: a = np.array([[0, 254, 1, 255, 0, 1]], dtype=np.uint8)
In [8]: np.square(a)
Out[8]: array([[0, 4, 1, 1, 0, 1]], dtype=uint8)
In [9]: b = np.array([[1, 0, 252, 0, 255, 255]], dtype=np.uint8)
In [10]: np.square(a) + np.square(b)
Out[10]: array([[ 1, 4, 17, 1, 1, 2]], dtype=uint8)
In [11]: np.sqrt(np.square(a) + np.square(b))
Out[11]:
array([[ 1. , 2. , 4.12310553, 1. , 1. ,
1.41421354]], dtype=float32)
To avoid the problem, you can tell np.square to use a floating point data type:
In [15]: np.sqrt(np.square(a, dtype=np.float64) + np.square(b, dtype=np.float64))
Out[15]:
array([[ 1. , 254. , 252.00198412, 255. ,
255. , 255.00196078]])
You could also use the function numpy.hypot, but you might still want to use the dtype argument; otherwise, for uint8 inputs like these, the result has data type np.float16:
In [16]: np.hypot(a, b)
Out[16]: array([[ 1., 254., 252., 255., 255., 255.]], dtype=float16)
In [17]: np.hypot(a, b, dtype=np.float64)
Out[17]:
array([[ 1. , 254. , 252.00198412, 255. ,
255. , 255.00196078]])
You might wonder why the dtype argument that I used in numpy.square and numpy.hypot is not shown in the functions' docstrings. Both of these functions are numpy "ufuncs", and the authors of numpy decided that it was better to show only the main arguments in the docstring. The optional arguments are documented in the reference manual.
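If you would rather not pass dtype to each ufunc, a sketch of an alternative is to convert the arrays once with astype before doing any arithmetic. The names a64 and b64 are illustrative, and the short arrays stand in for the (1, 250000) arrays from the question.

```python
import numpy as np

a = np.array([[0, 254, 1, 255, 0, 1]], dtype=np.uint8)
b = np.array([[1, 0, 252, 0, 255, 255]], dtype=np.uint8)

# Convert once to float64 so the squares cannot wrap around modulo 256.
a64 = a.astype(np.float64)
b64 = b.astype(np.float64)
c = np.sqrt(a64 ** 2 + b64 ** 2)
print(c)
```

The conversion costs extra memory for the float copies, which may matter for very large arrays.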

For this simple case, it works perfectly fine:
In [1]: a = np.array([[ 0, 2, 4, 6, 8]])
In [2]: b = np.array([[ 1, 3, 5, 7, 9]])
In [3]: c = np.sqrt(np.square(a) + np.square(b))
In [4]: print(c)
[[ 1. 3.60555128 6.40312424 9.21954446 12.04159458]]
You must be doing something wrong.
