How to handle rounding to negative zero in Python docstring tests - python

Related to this question: How to have negative zero always formatted as positive zero in a python string?
I have the following function that implements Matlab's orth.m using NumPy. I have a doctest that relies on np.array2string with suppress_small=True, which makes small values round to zero. However, sometimes they round to positive zero and sometimes to negative zero, depending on whether the answer comes out as 1e-16 or -1e-17 or similar. Which case happens depends on the SVD, and can vary from platform to platform or across Python versions depending on which underlying linear algebra library is used (BLAS, LAPACK, etc.).
What's the best way to design the docstring test to account for this?
In the final doctest, sometimes the Q[0, 1] term is -0. and sometimes it's 0.
import doctest
import numpy as np

def orth(A):
    r"""
    Orthogonalization basis for the range of A.

    That is, Q.T @ Q = I, the columns of Q span the same space as the columns of A, and the number
    of columns of Q is the rank of A.

    Parameters
    ----------
    A : 2D ndarray
        Input matrix

    Returns
    -------
    Q : 2D ndarray
        Orthogonalization matrix of A

    Notes
    -----
    #. Based on the Matlab orth.m function.

    Examples
    --------
    >>> import numpy as np

    Full rank matrix

    >>> A = np.array([[1, 0, 1], [-1, -2, 0], [0, 1, -1]])
    >>> r = np.linalg.matrix_rank(A)
    >>> print(r)
    3
    >>> Q = orth(A)
    >>> with np.printoptions(precision=8):
    ...     print(Q)
    [[-0.12000026 -0.80971228  0.57442663]
     [ 0.90175265  0.15312282  0.40422217]
     [-0.41526149  0.5664975   0.71178541]]

    Rank deficient matrix

    >>> A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
    >>> r = np.linalg.matrix_rank(A)
    >>> print(r)
    2
    >>> Q = orth(A)
    >>> print(np.array2string(Q, precision=8, suppress_small=True))  # Sometimes this fails
    [[-0.70710678 -0.        ]
     [ 0.          1.        ]
     [-0.70710678  0.        ]]

    """
    # compute the SVD
    (Q, S, _) = np.linalg.svd(A, full_matrices=False)
    # calculate a tolerance based on the largest singular value (instead of just using a small number)
    tol = np.max(A.shape) * S[0] * np.finfo(float).eps
    # count the singular values that are greater than the calculated tolerance
    r = np.sum(S > tol, axis=0)
    # return the columns corresponding to the non-zero singular values
    Q = Q[:, np.arange(r)]
    return Q

if __name__ == '__main__':
    doctest.testmod(verbose=False)

You can print the rounded array plus 0.0 to eliminate the -0:
A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
Q = orth(A)
Q[0,1] = -1e-16 # simulate a small floating point deviation
print(np.array2string(Q.round(8)+0.0, precision=8, suppress_small=True))
#[[-0.70710678  0.        ]
# [ 0.          1.        ]
# [-0.70710678  0.        ]]
So your doc string should be:
>>> Q = orth(A)
>>> print(np.array2string(Q.round(8)+0.0, precision=8, suppress_small=True)) # guarantee non-negative zeros
[[-0.70710678  0.        ]
 [ 0.          1.        ]
 [-0.70710678  0.        ]]
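The reason the + 0.0 trick works is that, under the default IEEE 754 rounding mode, -0.0 + 0.0 evaluates to +0.0. The following is a minimal sketch of that behaviour; the helper name normalize_zeros and the sample values are made up for illustration and are not part of the original answer.

```python
import numpy as np


def normalize_zeros(arr, decimals=8):
    """Round to `decimals` places, then add 0.0 so that any -0.0 becomes +0.0.

    Hypothetical helper, named here only for illustration.
    """
    return np.round(arr, decimals) + 0.0


# Sample values chosen to mimic round-off noise of either sign.
q = np.array([[-0.70710678, -1e-16],
              [1e-17, 1.0]])

rounded = np.round(q, 8)      # the tiny negative entry becomes -0.0
cleaned = normalize_zeros(q)  # the same entry is now +0.0

# np.signbit reports the sign bit, which distinguishes -0.0 from +0.0.
print(np.signbit(rounded[0, 1]), np.signbit(cleaned[0, 1]))
```

Numerically the two arrays compare equal; only the sign bit of the zero differs, which is exactly what the printed doctest output is sensitive to.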

Here's another alternative I came up with, although I think I like rounding the array to the given precision better. In this method, you shift the whole array by an amount that is bigger than the round-off error but smaller than the printed precision. That way the small numbers always end up slightly positive.
>>> Q = orth(A)
>>> print(np.array2string(Q + np.full(Q.shape, 1e-14), precision=8, suppress_small=True))
[[-0.70710678  0.        ]
 [ 0.          1.        ]
 [-0.70710678  0.        ]]
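A different option, not taken from either answer above, is to have the doctest check the properties the docstring promises instead of the printed digits. The sketch below assumes the same orth definition as in the question and tests that the columns of Q are orthonormal and that projecting A onto them reproduces A. Such checks are insensitive to the sign of zero; the variable names is_orthonormal and spans_range are illustrative only.

```python
import numpy as np


def orth(A):
    """Same algorithm as in the question: keep the left singular vectors
    whose singular values exceed a scale-dependent tolerance."""
    Q, S, _ = np.linalg.svd(A, full_matrices=False)
    tol = np.max(A.shape) * S[0] * np.finfo(float).eps
    r = int(np.sum(S > tol))
    return Q[:, :r]


A = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])
Q = orth(A)

# Columns of Q are orthonormal: Q.T @ Q is the identity.
is_orthonormal = np.allclose(Q.T @ Q, np.eye(Q.shape[1]))

# Columns of Q span the range of A: projecting A onto span(Q) gives A back.
spans_range = np.allclose(Q @ (Q.T @ A), A)

print(Q.shape, is_orthonormal, spans_range)
```

In a docstring, each of these would print a plain True, which is stable across platforms and linear algebra backends.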

Related

Calculate means of array with specific elements

I'm implementing the Nearest Centroid Classification algorithm and I'm kind of blocked on how to use numpy.mean in my case.
So suppose I have some spherical datasets X:
[[ 0.39151059  3.48203037]
 [-0.68677876  1.45377717]
 [ 2.30803493  4.19341503]
 [ 0.50395297  2.87076658]
 [ 0.06677012  3.23265678]
 [-0.24135103  3.78044279]
 [-0.05660036  2.37695381]
 [ 0.74210998 -3.2654815 ]
 [ 0.05815341 -2.41905942]
 [ 0.72126958 -1.71081388]
 [ 1.03581142 -4.09666955]
 [ 0.23209714 -1.86675298]
 [-0.49136284 -1.55736028]
 [ 0.00654881 -2.22505305]]
and the labeled vector Y:
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
An example with 100 2D data points gives the following result:
The NCC algorithm consists of first calculating the class mean of each class (0 and 1: that's blue and red) and then calculating the nearest class centroid for the next data point.
This is my current function:
def mean_ncc(X, Y):
    # find unique classes
    m_cids = np.unique(Y)  # [0. 1.]
    # compute class means
    mu = np.zeros((len(m_cids), X.shape[1]))  # [[0. 0.] [0. 0.]] (when Y has 2 unique labels, 0 and 1)
    for class_idx, class_label in enumerate(m_cids):
        mu[class_idx, :] =  # problem here
    return mu
So here I want an array containing the class means of '0' (blue) points and '1' (red) points:
How can I select only the rows of X that belong to a given class before taking the mean?
I would like to do something like this:
for class_idx, class_label in enumerate(m_cids):
    mu[class_idx, :] = np.mean(X[only the elements that have the same class_label], axis=0)
Is this possible, or is there another way to implement it?
You could use something like this:
import numpy as np
tags = [0, 0, 1, 1, 0, 1]
values = [5, 4, 2, 5, 9, 8]
tags_np = np.array(tags)
values_np = np.array(values)
print(values_np[tags_np == 1].mean())
EDIT: You will surely need to look more into the axis parameter for the mean function:
import numpy as np
values = [[5, 4],
          [5, 4],
          [4, 3],
          [4, 3]]
values_np = np.array(values)
tags_np = np.array([0, 0, 1, 1])
print(values_np[tags_np == 0].mean(axis=0))
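Applied to the function from the question, the boolean-mask idea could look like the sketch below. It keeps the asker's names (mean_ncc, m_cids, mu); the small X and Y arrays are made-up sample data, not the dataset from the question.

```python
import numpy as np


def mean_ncc(X, Y):
    """Return one row per class containing the mean of that class's points."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    # find unique classes
    m_cids = np.unique(Y)
    # compute class means
    mu = np.zeros((len(m_cids), X.shape[1]))
    for class_idx, class_label in enumerate(m_cids):
        # Y == class_label is a boolean mask that selects the rows of X
        # belonging to this class; axis=0 averages over those rows.
        mu[class_idx, :] = X[Y == class_label].mean(axis=0)
    return mu


# Made-up sample data: two points per class.
X = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]])
Y = np.array([0.0, 0.0, 1.0, 1.0])
mu = mean_ncc(X, Y)
print(mu)
```

Row i of mu is the centroid of the class m_cids[i], so the nearest centroid for a new point can then be found by comparing its distance to each row.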

Calculate condensed distance matrix with varying length data points

Scipy's pdist function expects an evenly shaped numpy array as input.
Working example:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()

dat0 = np.array([-1, 1,-3, 1])
dat1 = np.array([-1, 1,-3, 1])
dat2 = np.array([ 1, 1, 1, 1])
data = np.array([dat0, dat1, dat2])
distance_matrix = pdist(data, dfun)
squareform(distance_matrix)
I have a custom distance function that works on run-length encoded data, so the arrays may vary in length. With the following input
dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
the error "ValueError: A 2-dimensional array must be passed." is raised, even though the distance function would handle the input just fine. Is there an alternative way to calculate these values?
Edit: the distance function in the above snippet is just an example of a metric which does not care about the actual number of elements inside the data point. In my case https://github.com/mclmza/AWarp is used, which computes the DTW for sparse data sets (example series: [1,-456,1,1,-23,1]), so padding the data is not a valid option.
If I understand correctly, you want to compute the distances using awarp, but that distance function takes signals of varying length. So you need to avoid creating an array, because NumPy doesn't allow 'ragged' arrays. Then I think you can do this:
from itertools import combinations
import numpy as np
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()

dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
data = [dat0, dat1, dat2]
dists = [dfun(a, b) for a, b in combinations(data, r=2)]
squareform(dists)
For your example, this yields:
array([[ 0, -4, -8],
       [-4,  0, -6],
       [-8, -6,  0]])
And if dfun = awarp then you get this output for those signals:
array([[ 0.        ,  0.        ,  2.23606798],
       [ 0.        ,  0.        ,  2.44948974],
       [ 2.23606798,  2.44948974,  0.        ]])
I guess this approach only works if dfun is symmetric, that is dfun(u, v) == dfun(v, u), which I think awarp is.
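The same idea can be wrapped in a small helper that fills the square matrix directly with NumPy, so that squareform is not needed. This is only a sketch: the name ragged_distance_matrix is hypothetical, it assumes the metric is symmetric with zero self-distance, and dfun is the toy metric from the question standing in for awarp.

```python
from itertools import combinations

import numpy as np


def ragged_distance_matrix(data, metric):
    """Pairwise distances for a list of 1-D arrays of possibly different lengths.

    Assumes metric(a, b) == metric(b, a) and leaves the diagonal at zero.
    """
    n = len(data)
    out = np.zeros((n, n))
    # Visit every unordered pair once and mirror the value across the diagonal.
    for (i, a), (j, b) in combinations(enumerate(data), 2):
        out[i, j] = out[j, i] = metric(a, b)
    return out


# Toy metric from the question; it ignores the lengths of its inputs.
def dfun(u, v):
    return u.sum() + v.sum()


data = [np.array([-1, 1, -4, 1]),
        np.array([-1, 1, -3, 1, 1]),
        np.array([1, -6])]
D = ragged_distance_matrix(data, dfun)
print(D)
```

Keeping data as a plain Python list is what avoids the ragged-array problem; NumPy is only used for the fixed-size result.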

Why is my SVD calculation different than numpy's SVD calculation of this matrix?

I'm trying to manually compute the SVD of the matrix A defined below but I am having some problems. Computing it manually and with the svd method in numpy yields two different results.
Computed manually below:
import numpy as np
A = np.array([[3,2,2], [2,3,-2]])
V = np.linalg.eig(A.T @ A)[1]
U = np.linalg.eig(A @ A.T)[1]
S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
print(A)
print(U @ S @ V.T)
And computed via numpy's svd method:
X,Y,Z = np.linalg.svd(A)
Y = np.c_[np.diag(Y), [0,0]]
print(A)
print(X @ Y @ Z)
When these two snippets are run, the manual calculation doesn't match the result of the svd method. Why is there a discrepancy between the two calculations?
Look at the eigenvalues returned by np.linalg.eig(A.T @ A):
In [57]: evals, evecs = np.linalg.eig(A.T @ A)
In [58]: evals
Out[58]: array([2.50000000e+01, 3.61082692e-15, 9.00000000e+00])
So (ignoring the normal floating point imprecision), it computed [25, 0, 9]. The eigenvectors associated with those eigenvalues are in the columns of evecs, in the same order. But your construction of S doesn't match that order; here's your S:
In [60]: S
Out[60]:
array([[5., 0., 0.],
       [0., 3., 0.]])
When you compute U @ S @ V.T, the values in S @ V.T are not correctly aligned.
As a quick fix, you can rerun your code with S set explicitly as follows:
S = np.array([[5, 0, 0],
              [0, 0, 3]])
With that change, your code outputs
[[ 3  2  2]
 [ 2  3 -2]]
[[-3. -2. -2.]
 [-2. -3.  2.]]
That's better, but why are the signs wrong? Now the problem is that you have independently computed U and V. Eigenvectors are not unique; they are the basis of an eigenspace, and such a basis is not unique. If the eigenvalue is simple, and if the vector is normalized to have length one (which numpy.linalg.eig does), there is still a choice of the sign to be made. That is, if v is an eigenvector, then so is -v. The choices made by eig when computing U and V won't necessarily result in restoring the sign of A when U @ S @ V.T is computed.
It turns out that you can get the result that you expect by simply reversing all the signs in either U or V. Here is a modified version of your script that generates the output that you expected:
import numpy as np
A = np.array([[3, 2, 2],
              [2, 3, -2]])
U = np.linalg.eig(A @ A.T)[1]
V = -np.linalg.eig(A.T @ A)[1]
#S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
S = np.array([[5, 0, 0],
              [0, 0, 3]])
print(A)
print(U @ S @ V.T)
Output:
[[ 3  2  2]
 [ 2  3 -2]]
[[ 3.  2.  2.]
 [ 2.  3. -2.]]
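Flipping signs by hand works for this matrix but does not generalize. A common way to avoid both the ordering and the sign problem is to take only V from an eigendecomposition and derive each column of U as A @ v / sigma, so the two sets of vectors are consistent by construction. The sketch below is a swapped-in technique, not the code from the answer; it uses np.linalg.eigh (suitable because A.T @ A is symmetric) and assumes every singular value that is kept is non-zero, which holds for this A.

```python
import numpy as np

A = np.array([[3.0, 2.0, 2.0],
              [2.0, 3.0, -2.0]])

# eigh returns eigenvalues in ascending order; reorder them to descending
# so that they line up with the usual SVD convention.
evals, evecs = np.linalg.eigh(A.T @ A)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Keep as many components as A has rows (assumes none of them is zero).
r = min(A.shape)
sigma = np.sqrt(evals[:r])
V = evecs[:, :r]

# Derive U from V instead of computing it independently: u_i = A v_i / sigma_i.
U = (A @ V) / sigma

reconstructed = U @ np.diag(sigma) @ V.T
print(sigma)
print(reconstructed)
```

Whatever sign eigh picks for a column of V, the matching column of U inherits it, so the product reproduces A up to round-off.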

Numpy returns .00...002

Sorry if this post is a duplicate, I couldn't find an answer... I have the following code:
import numpy as np
arr = np.array([[6, 10, 0],
                [2, 5, 0],
                [0, 0, 0]])
subarr = np.array([[arr[0][0], arr[0][1]], [arr[1][0], arr[1][1]]])
det = np.linalg.det(subarr)
cross = np.cross(arr[0], arr[1])
print(f"Det: {det}")
print(f"Cross: {cross}")
I would expect det to be 10.0 and cross to be [0, 0, 10] in this case, the last number being equal to the determinant. For some reason, Python returns
Det: 10.000000000000002
Cross: [ 0 0 10]
Can someone please explain why?
What you're seeing is floating point inaccuracies.
And in case you're wondering how you end up with floats when finding the determinant of a matrix made up of integers (where the usual calculation method is just 6*5 - 2*10 = 10), np.linalg.det uses LU decomposition to find the determinant. This isn't very efficient for 2x2 matrices, but is much more efficient when you have bigger matrices.
For your 2x2, you get:
scipy.linalg.lu(A, 1)
Out:
(array([[ 1.        ,  0.        ],
        [ 0.33333333,  1.        ]]),
 array([[ 6.        , 10.        ],
        [ 0.        ,  1.66666667]]))
The determinant is just the product of the diagonal entries of the upper-triangular factor (up to a sign if rows were permuted), which here is 6. * 1.66666667. In floating point that evaluates to 10.000000000000002 rather than exactly 10.
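If an exact integer is needed, two simple options are to compute the 2x2 determinant with integer arithmetic, or to compare and round the floating point result with a tolerance. The following is a small sketch of both; the names det_exact and det_rounded are illustrative, and rounding is only appropriate when the matrix is known to contain integers.

```python
import numpy as np

arr = np.array([[6, 10, 0],
                [2, 5, 0],
                [0, 0, 0]])
sub = arr[:2, :2]

# Floating point result from the LU-based routine; may differ from 10
# in the last few bits depending on the platform.
det_float = np.linalg.det(sub)

# Exact 2x2 determinant using integer arithmetic only.
det_exact = int(sub[0, 0] * sub[1, 1] - sub[0, 1] * sub[1, 0])

# Round the float result, valid here because the input entries are integers.
det_rounded = int(round(det_float))

# Compare with a tolerance rather than with ==.
print(np.isclose(det_float, det_exact), det_exact, det_rounded)
```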

Range of index in NumPy correlation function

I am looking into the NumPy correlation function
numpy.correlate(a, v, mode='valid')
Cross-correlation of two 1-dimensional sequences.
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
Then for the example:
a = [1, 2, 3]
v = [0, 1, 0.5]
np.correlate([1, 2, 3], [0, 1, 0.5], "full")
array([ 0.5, 2. , 3.5, 3. , 0. ])
So the k in the output array is from 0 to 4 in this example. However, I am wondering how a[n+k] is defined when (n+k) > 2 in this case?
Also, how is conj(v[n]) defined, and how is each element of the output array computed?
The formula c_{av}[k] = sum_n a[n+k] * conj(v[n]) is a little misleading because k on the left is not necessarily the Python index of the output array. In the 'full' mode, the possible values of k are those for which there exists at least one n such that a[n+k] * conj(v[n]) is defined (that is, both n+k and n fall in the ranges of respective arrays).
In your examples, k in sum_n a[n+k] * conj(v[n]) can be -2, -1, 0, 1, 2. These generate 5 values that you see. For example, k being -2 results in a[2-2]*conj(v[2]) which is 0.5, and so on.
In general, the range of k in the 'full' mode is from 1-len(v) to len(a)-1 inclusive (in your example both lengths are 3, so k runs from -2 to 2). So, if k is really understood as the Python index of the output array, then the formula should be
c_{av}[k] = sum_n a[n+k-len(v)+1] * conj(v[n])
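To make the index bookkeeping concrete, here is a brute-force sketch of the 'full' mode. It assumes that output index i corresponds to lag k = i - (len(v) - 1) and simply skips terms whose index n + k falls outside a; the comparison against np.correlate with inputs of different lengths is there to support that assumption. The function name correlate_full is made up for illustration. For real-valued inputs np.conj returns its argument unchanged, so it only matters for complex data.

```python
import numpy as np


def correlate_full(a, v):
    """Brute-force 'full' cross-correlation, written to expose the indices."""
    a = np.asarray(a)
    v = np.asarray(v)
    out = np.zeros(len(a) + len(v) - 1, dtype=np.result_type(a, v))
    for i in range(len(out)):
        k = i - (len(v) - 1)  # lag for output index i (assumed offset)
        for n in range(len(v)):
            # Terms whose index n + k is outside a are skipped (treated as zero).
            if 0 <= n + k < len(a):
                out[i] += a[n + k] * np.conj(v[n])
    return out


a = [1, 2, 3]
v = [0, 1, 0.5]
result = correlate_full(a, v)
reference = np.correlate(a, v, "full")
print(result)
print(reference)

# Same check with inputs of different lengths.
unequal_ok = np.allclose(correlate_full([1, 2, 3, 4], [1, 0.5]),
                         np.correlate([1, 2, 3, 4], [1, 0.5], "full"))
print(unequal_ok)
```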
