SciPy - NaN when calculating Mahalanobis distance - Python

When I try to calculate the Mahalanobis distance with the following Python code I get some NaN entries in the result. Do you have any insight into why this happens?
My data.shape = (181, 1500)
import numpy as np
from scipy.spatial.distance import pdist, squareform
data_log = np.log2(data + 1) # A log transform that I usually apply to my data
data_centered = data_log - data_log.mean(0) # zero centering
D = squareform( pdist(data_centered, 'mahalanobis' ) )
I also tried:
data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform( pdist(data_standard, 'mahalanobis' ) )
I also got NaNs.
The input is not corrupted and other distances, such as correlation distance, can be computed just fine.
For some reason, when I reduce the number of features I stop getting NaNs. E.g. the following examples do not produce any NaNs:
D = squareform( pdist(data_centered[:,:200], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:480], 'mahalanobis' ) )
while these others do produce NaNs:
D = squareform( pdist(data_centered[:,:300], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:600], 'mahalanobis' ) )
Any clue? Is this expected behaviour when some condition on the input is not satisfied?

You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code doesn't check this, and blindly computes the "inverse" of the covariance matrix. Because this numerically computed inverse is basically garbage, the product (x-y)*inv(V)*(x-y) (where x and y are observations) might turn out to be negative. Then the square root of that value results in nan.
For example, this array also results in a nan:
In [265]: x
Out[265]:
array([[-1. , 0.5, 1. , 2. , 2. ],
[ 2. , 1. , 2.5, -1.5, 1. ],
[ 1.5, -0.5, 1. , 2. , 2.5]])
In [266]: squareform(pdist(x, 'mahalanobis'))
Out[266]:
array([[ 0. , nan, 1.90394328],
[ nan, 0. , nan],
[ 1.90394328, nan, 0. ]])
Here's the Mahalanobis calculation done "by hand":
In [279]: V = np.cov(x.T)
In theory, V is singular; the following value is effectively 0:
In [280]: np.linalg.det(V)
Out[280]: -2.968550671342364e-47
But inv doesn't see the problem, and returns an inverse:
In [281]: VI = np.linalg.inv(V)
Let's compute the distance between x[0] and x[2] and verify that we get the same non-nan value (1.9039) returned by pdist when we use VI:
In [295]: delta = x[0] - x[2]
In [296]: np.dot(np.dot(delta, VI), delta)
Out[296]: 3.625
In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
Out[297]: 1.9039432764659772
Here's what happens when we try to compute the distance between x[0] and x[1]:
In [300]: delta = x[0] - x[1]
In [301]: np.dot(np.dot(delta, VI), delta)
Out[301]: -1.75
Then the square root of that value gives nan.
In scipy 0.16 (to be released in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:
In [4]: x = array([[-1. , 0.5, 1. , 2. , 2. ],
...: [ 2. , 1. , 2.5, -1.5, 1. ],
...: [ 1.5, -0.5, 1. , 2. , 2.5]])
In [5]: pdist(x, 'mahalanobis')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-a3453ff6fe48> in <module>()
----> 1 pdist(x, 'mahalanobis')
/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1298 "singular. For observations with %d "
1299 "dimensions, at least %d observations "
-> 1300 "are required." % (m, n, n + 1))
1301 V = np.atleast_2d(np.cov(X.T))
1302 VI = _convert_to_double(np.linalg.inv(V).T.copy())
ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
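If you still want a Mahalanobis-like distance despite the singular covariance, one possible workaround (my suggestion, not something scipy does for you) is to compute a Moore-Penrose pseudo-inverse yourself and pass it to pdist through the VI argument. A minimal sketch on made-up data:
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data with more features than observations, like the 181 x 1500 case.
rng = np.random.RandomState(0)
data = rng.rand(30, 100)
data_centered = data - data.mean(0)

V = np.cov(data_centered.T)   # singular: its rank is at most 29 here
VI = np.linalg.pinv(V)        # pseudo-inverse instead of the meaningless inverse

# pdist accepts a user-supplied inverse covariance for the 'mahalanobis' metric.
D = squareform(pdist(data_centered, 'mahalanobis', VI=VI))
The pseudo-inverse of a positive semi-definite matrix is itself positive semi-definite, so the quadratic form should not go negative beyond rounding error; whether a Mahalanobis distance is meaningful at all with a singular covariance is a separate question.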


How to make a ufunc output a matrix given two array_like operands (instead of trying to broadcast them)?

I would like to get a matrix of values given two ndarray's from a ufunc, for example:
degs = numpy.array(range(5))
pnts = numpy.array([0.0, 0.1, 0.2])
values = scipy.special.eval_chebyt(degs, pnts)
The above code doesn't work (it gives a ValueError because it tries to broadcast two arrays and fails since they have different shapes: (5,) and (3,)); I would like to get a matrix of values with rows corresponding to degrees and columns to points at which polynomials are evaluated (or vice versa, it doesn't matter).
Currently my workaround is simply to use a for-loop:
values = numpy.zeros((5, 3))
for j in range(5):
    values[j] = scipy.special.eval_chebyt(j, pnts)
Is there a way to do that? In general, how would you let a ufunc know you want an n-dimensional array if you have n array_like arguments?
I know about numpy.vectorize, but that seems neither faster nor more elegant than just a simple for-loop (and I'm not even sure you can apply it to an existent ufunc).
UPDATE: What about ufuncs that take 3 or more parameters? Trying the outer method gives ValueError: outer product only supported for binary functions. For example, scipy.special.eval_jacobi.
What you need is exactly the outer method of ufuncs:
ufunc.outer(A, B, **kwargs)
Apply the ufunc op to all pairs (a, b) with a in A and b in B.
values = scipy.special.eval_chebyt.outer(degs, pnts)
#array([[ 1. , 1. , 1. ],
# [ 0. , 0.1 , 0.2 ],
# [-1. , -0.98 , -0.92 ],
# [-0. , -0.296 , -0.568 ],
# [ 1. , 0.9208, 0.6928]])
UPDATE
For more parameters, you must broadcast by hand. meshgrid often helps with that, spanning each parameter along its own dimension. For example:
n = 3
alpha = numpy.array(range(5))
beta = numpy.array(range(3))
x = numpy.array(range(2))
data = numpy.meshgrid(n, alpha, beta, x)
values = scipy.special.eval_jacobi(*data)
Reshape the input arguments for broadcasting. In this case, change the shape of degs to be (5, 1) instead of just (5,). The shape (5, 1) broadcast with the shape (3,) results in the shape (5, 3):
In [185]: import numpy as np
In [186]: import scipy.special
In [187]: degs = np.arange(5).reshape(-1, 1) # degs has shape (5, 1)
In [188]: pnts = np.array([0.0, 0.1, 0.2])
In [189]: values = scipy.special.eval_chebyt(degs, pnts)
In [190]: values
Out[190]:
array([[ 1. , 1. , 1. ],
[ 0. , 0.1 , 0.2 ],
[-1. , -0.98 , -0.92 ],
[-0. , -0.296 , -0.568 ],
[ 1. , 0.9208, 0.6928]])
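For the three-or-more-parameter case from the update (e.g. scipy.special.eval_jacobi), the same reshaping idea works without meshgrid; here is a sketch with made-up parameter grids:
import numpy as np
import scipy.special

n = 3                      # fixed degree
alpha = np.arange(5.0)     # shape (5,)
beta = np.arange(3.0)      # shape (3,)
x = np.array([0.0, 0.5])   # shape (2,)

# Put each parameter on its own axis; broadcasting does the rest.
values = scipy.special.eval_jacobi(n,
                                   alpha[:, None, None],   # (5, 1, 1)
                                   beta[None, :, None],    # (1, 3, 1)
                                   x[None, None, :])       # (1, 1, 2)
print(values.shape)   # (5, 3, 2)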

Vectorizing A Function With Array Parameter

I'm currently trying to apply Chi-Squared analysis to some data.
I want to plot a colourmap of values that vary with the two coefficients of a model:
def f(x, coeff):
    return coeff[0] + numpy.exp(coeff[1] * x)
def chi_squared(coeff, x, y, y_err):
    return numpy.sum(((y - f(x, coeff) / y_err)**2)
us = numpy.linspace(u0, u1, n)
vs = numpy.linspace(v0, v1, n)
rs = numpy.meshgrid(us, vs)
chi = numpy.vectorize(chi_squared)
chi(rs, x, y, y_error)
I tried vectorizing the function to be able to pass a meshgrid of the varying coefficents to produce the colormap.
The values of x, y, y_err are all 1D arrays of length n.
And u, v are the various changing coefficients.
However this doesn't work, resulting in
IndexError: invalid index to scalar variable.
This is because coeff is passed as a scalar rather than a vector, however I don't know how to correct this.
Update
My aim is to take an array of coordinates
rs = [[[u0, v0], [u1, v0], ..., [un, v0]], ..., [[u0, vm], ..., [un, vm]]]
Where each coordinate is the coefficient parameters to be passed to the chi-squared method.
This should return a 2D array populated with Chi-Squared values for the appropriate coordinate
chi = [[c00, c10, ..., cn0], ..., [c0m, c1m, ..., cnm]]
I can then use this data to plot a colormap using imshow
Here's my first attempt to run your code:
In [44]: def f(x, coeff):
...: return coeff[0] + numpy.exp(coeff[1] * x)
...:
...: def chi_squared(coeff, x, y, y_err):
...: return numpy.sum((y - f(x, coeff) / y_err)**2)
(I had to remove the ( in that last line)
First guess at possible array values:
In [45]: x = np.arange(3)
In [46]: y = x
In [47]: y_err = x
In [48]: us = np.linspace(0,1,3)
In [49]: rs = np.meshgrid(us,us)
In [50]: rs
Out[50]:
[array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]]),
array([[ 0. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 1. , 1. , 1. ]])]
In [51]: chi_squared(rs, x, y, y_err)
/usr/local/bin/ipython3:5: RuntimeWarning: divide by zero encountered in true_divide
import sys
Out[51]: inf
oops, y_err shouldn't have a 0. Try again:
In [52]: y_err = np.array([1,1,1])
In [53]: chi_squared(rs, x, y, y_err)
Out[53]: 53.262865105526018
It also works if I turn the rs list into an array:
In [55]: np.array(rs).shape
Out[55]: (2, 3, 3)
In [56]: chi_squared(np.array(rs), x, y, y_err)
Out[56]: 53.262865105526018
Now, what was the purpose of vectorize?
The f function returns a (n,n) array:
In [57]: f(x, rs)
Out[57]:
array([[ 1. , 1.5 , 2. ],
[ 1. , 2.14872127, 3.71828183],
[ 1. , 3.21828183, 8.3890561 ]])
Let's modify chi_squared to give sum an axis parameter:
In [61]: def chi_squared(coeff, x, y, y_err, axis=None):
...: return numpy.sum((y - f(x, coeff) / y_err)**2, axis=axis)
In [62]: chi_squared(np.array(rs), x, y, y_err)
Out[62]: 53.262865105526018
In [63]: chi_squared(np.array(rs), x, y, y_err, axis=0)
Out[63]: array([ 3. , 6.49033483, 43.77253028])
In [64]: chi_squared(np.array(rs), x, y, y_err, axis=1)
Out[64]: array([ 1.25 , 5.272053 , 46.74081211])
I'm tempted to change the coeff to coeff0, coeff1, to give more control from the start on how this parameter is passed, but it probably doesn't make a difference.
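For what it's worth, splitting coeff into two scalar arguments does make np.vectorize workable (at the cost of a Python-level loop). A sketch, with the function names being mine:
import numpy as np

def f2(x, c0, c1):
    return c0 + np.exp(c1 * x)

def chi_squared2(c0, c1, x, y, y_err):
    # same formula as above, just with the coefficients split into scalars
    return np.sum((y - f2(x, c0, c1) / y_err)**2)

x = np.arange(3)
y = x
y_err = np.array([1, 1, 1])

# Vectorize over the two scalar coefficients; the data arrays are excluded
# and must then be passed by keyword.
chi_vec = np.vectorize(chi_squared2, excluded=['x', 'y', 'y_err'])

us, vs = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3), indexing='ij')
print(chi_vec(us, vs, x=x, y=y, y_err=y_err))   # shape (3, 3), one chi^2 per (u, v) pair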
update
Now that you've been more specific about how the coeff values relate to x, y etc, I see that this can be solved with simple broadcasting. No need to use np.vectorize.
First, define a grid that has a different size; that way we, and the code, won't think that each dimension of the coeff grid has anything to do with the x,y values.
In [134]: rs = np.meshgrid(np.linspace(0,1,4), np.linspace(0,1,5), indexing='ij')
In [135]: coeff=np.array(rs)
In [136]: coeff.shape
Out[136]: (2, 4, 5)
Now look at what f looks like when given this coeff and x.
In [137]: f(x, coeff[...,None]).shape
Out[137]: (4, 5, 3)
coeff is effectively (4,5,1), while x behaves as (1,1,3), resulting in a (4,5,3) array (by broadcasting rules).
The same thing happens inside chi_squared, with the final step of sum on the last axis (size 3):
In [138]: chi_squared(coeff[...,None], x, y, y_err, axis=-1)
Out[138]:
array([[ 2. , 1.20406718, 1.93676807, 8.40646968,
32.99441808],
[ 2.33333333, 2.15923164, 3.84810347, 11.80559574,
38.73264336],
[ 3.33333333, 3.78106277, 6.42610554, 15.87138846,
45.13753532],
[ 5. , 6.06956056, 9.67077427, 20.60384785,
52.20909393]])
In [139]: _.shape
Out[139]: (4, 5)
One value for each pair of coeff values, over the (4,5) grid.
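Putting the pieces together for the colourmap the question is after, here is a sketch with made-up x, y and y_err; note that I also moved a parenthesis so that the residual (y - f) is divided by y_err, which is the usual chi-squared definition, rather than dividing f by y_err as in the original snippet:
import numpy as np
import matplotlib.pyplot as plt

def f(x, coeff):
    return coeff[0] + np.exp(coeff[1] * x)

def chi_squared(coeff, x, y, y_err, axis=None):
    return np.sum(((y - f(x, coeff)) / y_err)**2, axis=axis)

# Made-up data for illustration only.
x = np.linspace(0, 1, 20)
y = 0.5 + np.exp(0.8 * x)
y_err = np.full_like(x, 0.1)

# Grid of (u, v) coefficient pairs, shape (2, 50, 60).
us = np.linspace(0, 1, 50)
vs = np.linspace(0, 1, 60)
coeff = np.array(np.meshgrid(us, vs, indexing='ij'))

# The trailing axis broadcasts against x; summing over it gives one chi^2 per (u, v).
chi = chi_squared(coeff[..., None], x, y, y_err, axis=-1)   # shape (50, 60)

plt.imshow(chi.T, origin='lower', extent=[us[0], us[-1], vs[0], vs[-1]], aspect='auto')
plt.xlabel('u (coeff[0])')
plt.ylabel('v (coeff[1])')
plt.colorbar(label='chi-squared')
plt.show()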

numpy interpolation to increase a vector size

Hi, I have to increase the number of points inside a vector in order to enlarge it to a fixed size. For example:
for this simple vector
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
Now I want to get a vector of size 11, taking the vector a as the base; the result would be
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
What I need is a function that takes the base vector and the number of values the resulting vector must have, and returns a new vector whose size equals that parameter. Something like
def enlargeVector(vector, size):
    .....
    return newVector
to use like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the result of linear, cubic, or whatever interpolation method is used.
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which fits a spline of degree k (continuously differentiable up to order k-1).
To use it:
import numpy as np
from scipy.interpolate import UnivariateSpline

old_indices = np.arange(0, len(a))
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)
spl = UnivariateSpline(old_indices, a, k=3, s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill: the np.linspace call for new_indices already gives the desired output.
EDIT: clarified that the length is arbitrary
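If it helps, here is one way to wrap the above into the enlargeVector helper the question asks for (the function name comes from the question; k=3 and s=0 are the same choices as above):
import numpy as np
from scipy.interpolate import UnivariateSpline

def enlargeVector(vector, size, k=3):
    # Resample `vector` to `size` points with a degree-k spline; s=0 interpolates exactly.
    old_indices = np.arange(len(vector))
    new_indices = np.linspace(0, len(vector) - 1, size)
    spl = UnivariateSpline(old_indices, vector, k=k, s=0)
    return spl(new_indices)

a = np.array([0, 1, 2, 3, 4, 5])
print(enlargeVector(a, 11))        # approximately [0. 0.5 1. ... 5.]
print(len(enlargeVector(a, 200)))  # 200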
As AGML pointed out, there are tools to do this, but how about a pure NumPy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])

Difference between scipy pairwise distance and X.X+Y.Y - X.Y^t

Let's imagine we have data as
d1 = np.random.uniform(low=0, high=2, size=(3,2))
d2 = np.random.uniform(low=3, high=5, size=(3,2))
X = np.vstack((d1,d2))
X
array([[ 1.4930674 , 1.64890721],
[ 0.40456265, 0.62262546],
[ 0.86893397, 1.3590808 ],
[ 4.04177045, 4.40938126],
[ 3.01396153, 4.60005842],
[ 3.2144552 , 4.65539323]])
I want to compare two methods for generating the pairwise distances:
assuming that X and Y are the same:
(X-Y)^2 = X.X + Y.Y - 2*X.Y^t
Here is the first method, as it is used in scikit-learn for computing the pairwise distances and, later, the kernel matrix.
import numpy as np
def cal_pdist1(X):
    Y = X
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]
    YY = XX.T
    distances = -2*np.dot(X, Y.T)
    distances += XX
    distances += YY
    return distances
cal_pdist1(X)
array([[ 0. , 2.2380968 , 0.47354188, 14.11610424,
11.02241244, 12.00213414],
[ 2.2380968 , 0. , 0.75800718, 27.56880003,
22.62893544, 24.15871196],
[ 0.47354188, 0.75800718, 0. , 19.37122424,
15.1050792 , 16.36714548],
[ 14.11610424, 27.56880003, 19.37122424, 0. ,
1.09274896, 0.74497242],
[ 11.02241244, 22.62893544, 15.1050792 , 1.09274896,
0. , 0.04325965],
[ 12.00213414, 24.15871196, 16.36714548, 0.74497242,
0.04325965, 0. ]])
Now, if I use scipy pairwise distance function as below, I get
import scipy, scipy.spatial
pd_sparse = scipy.spatial.distance.pdist(X, metric='seuclidean')
scipy.spatial.distance.squareform(pd_sparse)
array([[ 0. , 0.92916653, 0.45646989, 2.29444795, 1.89740167,
2.00059442],
[ 0.92916653, 0. , 0.50798432, 3.22211357, 2.78788236,
2.90062103],
[ 0.45646989, 0.50798432, 0. , 2.72720831, 2.28001564,
2.39338343],
[ 2.29444795, 3.22211357, 2.72720831, 0. , 0.71411943,
0.58399694],
[ 1.89740167, 2.78788236, 2.28001564, 0.71411943, 0. ,
0.14102567],
[ 2.00059442, 2.90062103, 2.39338343, 0.58399694, 0.14102567,
0. ]])
The results are completely different! Shouldn't they be the same?
pdist(..., metric='seuclidean') computes the standardized Euclidean distance, not the squared Euclidean distance (which is what cal_pdist1 returns).
From the docs:
Y = pdist(X, 'seuclidean', V=None)
Computes the standardized Euclidean distance. The standardized Euclidean distance between two n-vectors u and v is
sqrt( sum( (u_i - v_i)^2 / V[x_i] ) )
V is the variance vector; V[i] is the variance computed over all the i’th components of the points. If not passed, it is automatically computed.
Try passing metric='sqeuclidean', and you will see that both functions return the same result to within rounding error.
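As a quick check, here is a small sketch (re-implementing the question's cal_pdist1 on made-up data) showing that metric='sqeuclidean' reproduces the dot-product formula:
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cal_pdist1(X):
    # Squared Euclidean distances via the identity |x - y|^2 = x.x + y.y - 2*x.y
    XX = np.einsum('ij,ij->i', X, X)[np.newaxis, :]
    distances = -2 * np.dot(X, X.T)
    distances += XX
    distances += XX.T
    return distances

X = np.random.uniform(0, 5, size=(6, 2))
D_manual = cal_pdist1(X)
D_scipy = squareform(pdist(X, metric='sqeuclidean'))
print(np.allclose(D_manual, D_scipy))   # True, up to rounding error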

Numpy's eigh and eig yield inconsistent eigenvalues

Currently I'm trying to solve the generalized eigenvalue problem in NumPy for two symmetric matrices, and I've been running into massive trouble: I expect all eigenvalues to be positive, but eigh returns several very large numbers that are not all positive, while eig returns the correct, expected values (but is, of course, very, very slow).
In this case, note that K is symmetric as expected from its construction (here is the code in question):
# Calculate K matrix (<i|pHp|j> in the LGL-nodes basis)
for i in range(Ne):
    idx_s, idx_e = i*(Np-1), i*(Np-1)+Np
    K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(diag(w*peq[idx_s:idx_e])).dot(dmat)
# Re-make matrix for efficient vector products
K = sparse.csr_matrix(K)
# Make matrix for <i|p|j> in the LGL basis as efficient diagonal sparse matrix
S = sparse.diags(peq*w_d, 0)
# Solve the generalized eigenvalue problem: Kc = lSc for hermitian matrices K and S
lQ, Q = linalg.eigh(K.todense(), S.todense())
_lQ, _Q = linalg.eig(K.todense(), S.todense())
lQ.sort()
_lQ.sort()
if not allclose(lQ, _lQ):
    print('Literally why')
    print(lQ)
    print(_lQ)
    return
For testing, dmat is defined as
array([[ -896. , 1212.00631086, -484.43454844, 275.06612251,
-179.85209531, 124.26620323, -83.05199285, 32. ],
[ -205.43460499, 0. , 290.78944413, -135.17191772,
82.83085126, -55.64467829, 36.70818656, -14.07728095],
[ 50.7185076 , -179.61445086, 0. , 184.03311398,
-87.85829324, 54.08144362, -34.37053351, 13.01021241],
[ -23.81762789, 69.05246008, -152.20398294, 0. ,
152.89115899, -72.66291308, 42.31407046, -15.57316561],
[ 15.57316561, -42.31407046, 72.66291308, -152.89115899,
0. , 152.20398294, -69.05246008, 23.81762789],
[ -13.01021241, 34.37053351, -54.08144362, 87.85829324,
-184.03311398, 0. , 179.61445086, -50.7185076 ],
[ 14.07728095, -36.70818656, 55.64467829, -82.83085126,
135.17191772, -290.78944413, 0. , 205.43460499],
[ -32. , 83.05199285, -124.26620323, 179.85209531,
-275.06612251, 484.43454844, -1212.00631086, 896. ]])
And all of w[i], w_d[i], peq[i] are essentially arbitrary positive-valued arrays. w_d and w are of the same order (~ 1e-1) and peq[i] ranges on the order of (~ 1e-10 to 1e1)
Some of the output I'm getting is
Literally why
[ -6.25540943e+07 -4.82660391e+07 -2.62629052e+07 ..., 1.07960873e+10
1.07967334e+10 4.26007915e+10]
[ -5.25462340e-12+0.j 4.62614812e-01+0.j 1.23357898e+00+0.j ...,
2.17613917e+06+0.j 1.07967334e+10+0.j 4.26007915e+10+0.j]
EDIT:
Here's a self-contained version of the code for easier debugging
import numpy as np
from math import *
from scipy import sparse, linalg
# Variable declarations and such (pre-computed)
Ne, Np = 256, 8
N = Ne*Np - Ne + 1
domain_size = 4/Ne
x = np.array([-0.015625 , -0.01362094, -0.00924532, -0.0032703 , 0.0032703 ,
0.00924532, 0.01362094, 0.015625 ])
w = np.array([ 0.00055804, 0.00329225, 0.00533004, 0.00644467, 0.00644467,
0.00533004, 0.00329225, 0.00055804])
dmat = np.array([[ -896. , 1212.00631086, -484.43454844, 275.06612251,
-179.85209531, 124.26620323, -83.05199285, 32. ],
[ -205.43460499, 0. , 290.78944413, -135.17191772,
82.83085126, -55.64467829, 36.70818656, -14.07728095],
[ 50.7185076 , -179.61445086, 0. , 184.03311398,
-87.85829324, 54.08144362, -34.37053351, 13.01021241],
[ -23.81762789, 69.05246008, -152.20398294, 0. ,
152.89115899, -72.66291308, 42.31407046, -15.57316561],
[ 15.57316561, -42.31407046, 72.66291308, -152.89115899,
0. , 152.20398294, -69.05246008, 23.81762789],
[ -13.01021241, 34.37053351, -54.08144362, 87.85829324,
-184.03311398, 0. , 179.61445086, -50.7185076 ],
[ 14.07728095, -36.70818656, 55.64467829, -82.83085126,
135.17191772, -290.78944413, 0. , 205.43460499],
[ -32. , 83.05199285, -124.26620323, 179.85209531,
-275.06612251, 484.43454844, -1212.00631086, 896. ]])
# More declarations
x_d = np.zeros(N)
w_d = np.zeros(N)
dmat_d = np.zeros((N, N))
for i in range(Ne):
    x_d[i*(Np-1):i*(Np-1)+Np] = x + i*domain_size
    w_d[i*(Np-1):i*(Np-1)+Np] += w
    dmat_d[i*(Np-1):i*(Np-1)+Np, i*(Np-1):i*(Np-1)+Np] += dmat
peq = (np.cos((x_d-2)*pi/4))**2
# Normalization
peq = peq/np.sum(w_d*peq)
p0 = np.maximum(peq, 1e-10)
p0 /= np.sum(p0*w_d)
# Make efficient matrix that can be built
K = sparse.lil_matrix((N, N))
# Calculate K matrix (<i|pHp|j> in the LGL-nodes basis)
for i in range(Ne):
    idx_s, idx_e = i*(Np-1), i*(Np-1)+Np
    K[idx_s:idx_e, idx_s:idx_e] += dmat.T.dot(np.diag(w*p0[idx_s:idx_e])).dot(dmat)
# Re-make matrix for efficient vector products
K = sparse.csr_matrix(K)
# Make matrix for <i|p|j> in the LGL basis as efficient diagonal sparse matrix
S = sparse.diags(p0*w_d, 0)
# Solve the generalized eigenvalue problem: Kc = lSc for hermitian matrices K and S
lQ, Q = linalg.eigh(K.todense(), S.todense())
_lQ, _Q = linalg.eig(K.todense(), S.todense())
lQ.sort()
_lQ.sort()
if not np.allclose(lQ, _lQ):
    print('Literally why')
    print(lQ)
    print(_lQ)
EDIT2: This is really odd. Running all of the NumPy/SciPy tests on my machine, I receive no errors. But even running the simple test (with large enough matrices) as
import numpy as np
from scipy import linalg
M = np.random.random((1000,1000))
M += M.T
np.allclose(sorted(linalg.eigh(M)[0]), sorted(linalg.eig(M)[0]))
fails on my machine. Though running the same test with a 50x50 matrix does work---even after rebuilding the SciPy/NumPy stack and passing all unit tests.
EDIT3: Actually, this seems to fail everywhere, after testing it on a cluster computer. I'm not sure why.
The test above fails because of the in-place behaviour of += combined with .T being a view rather than a copy: M += M.T reads M.T while M is being overwritten, so the result is not actually symmetric, and eigh (which assumes symmetry) then disagrees with eig.
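For what it's worth, here is a sketch of the out-of-place fix; whether the in-place version actually corrupts the symmetry depends on your NumPy version, so treat the failing case as version-dependent:
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
M = rng.random_sample((500, 500))

# Out-of-place: M.T is fully read before the result is written, so M_sym is symmetric.
M_sym = M + M.T
print(np.allclose(M_sym, M_sym.T))   # True

# eigh assumes symmetry (it only looks at one triangle), so it now agrees with eig.
print(np.allclose(sorted(linalg.eigh(M_sym)[0]),
                  sorted(linalg.eig(M_sym)[0].real)))   # True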
