Using numpy.cov on a vector yields NaNs - python

Good afternoon.
I am faced with a PCA task which simply involves reducing the dimensionality of a vector. I'm not interested in a two-dimensional matrix in this case, but merely in a D-dimensional vector which I would like to project onto its K principal eigenvectors.
In order to implement PCA, I need to retrieve the covariance matrix of this vector. Let's try to do this on an example vector:
someVec = np.array([[1.0, 1.0, 2.0, -1.0]])
I've defined this vector as a 1 x 4 matrix, i.e. a row vector, in order to make it compatible with numpy.cov. Taking the covariance of this vector through numpy.cov yields a scalar covariance matrix, because numpy.cov assumes that the features are in the rows:
print np.cov(someVec)
1.58333333333
but this is (or rather, should be) merely a difference in dimensionality assumptions, and taking the covariance of the transpose vector should work fine, right? Except that it doesn't:
print np.cov(someVec.T)
/usr/lib/python2.7/site-packages/numpy/lib/function_base.py:2005: RuntimeWarning:
invalid value encountered in divide
return (dot(X, X.T.conj()) / fact).squeeze()
[[ nan nan nan nan]
[ nan nan nan nan]
[ nan nan nan nan]
[ nan nan nan nan]]
I'm not exactly sure what I've done wrong here. Any advice?
Thanks,
Jason

If you want to pass in the transpose, you'll need to set rowvar to zero.
In [10]: np.cov(someVec, rowvar=0)
Out[10]: array(1.5833333333333333)
In [11]: np.cov(someVec.T, rowvar=0)
Out[11]: array(1.5833333333333333)
From the docs:
rowvar : int, optional
If rowvar is non-zero (default), then each row
represents a variable, with observations in the columns. Otherwise,
the relationship is transposed: each column represents a variable,
while the rows contain observations.
If you want to find a full covariance matrix, you'll need more than one observation. With a single observation and numpy's default estimator, NaN is exactly what you'd expect. If you would like the normalization to be done by N instead of (N - 1), you can pass bias=1.
In [12]: np.cov(someVec.T, bias=1)
Out[12]:
array([[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]])
Again, from the docs.
bias : int, optional
Default normalization is by (N - 1), where N is
the number of observations given (unbiased estimate). If bias is 1,
then normalization is by N. These values can be overridden by using
the keyword ddof in numpy versions >= 1.5.
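To actually see a non-degenerate 4 x 4 covariance matrix, you need several observations of the 4-dimensional vector. A minimal sketch, with made-up data purely for illustration:
import numpy as np
# Hypothetical data: 3 observations of a 4-dimensional vector, one per row.
X = np.array([[1.0, 1.0, 2.0, -1.0],
              [0.5, 2.0, 1.0,  0.0],
              [2.0, 0.0, 3.0, -2.0]])
C = np.cov(X, rowvar=0)                 # columns are variables -> 4 x 4 matrix
C_biased = np.cov(X, rowvar=0, bias=1)  # normalize by N instead of N - 1
print(C.shape)                          # (4, 4)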

You should use the option rowvar=0 in numpy.cov:
In [1]: a = np.array([[1, 2, 3, 4]])
In [2]: np.cov(a)
Out[2]: array(1.6666666666666667)
In [3]: np.cov(a.T)
Out[3]:
array([[ nan, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan],
[ nan, nan, nan, nan]])
In [4]: np.cov(a.T, rowvar=0)
Out[4]: array(1.6666666666666667)

Not really, shouldn't that be returning a matrix of size 4 x 4? I mean, the vector has 4 "features", so given that I want to measure the covariances between the features and store them in the appropriate places, I need a covariance matrix.
Since you only have one observation, you can't compute a covariance matrix. Depending on the estimator the covariances would either be zero or undefined.
If that's not intuitively clear, try answering the following questions:
what is the variance of 1.0?
what is the covariance of 1.0 and 2.0?
In essence, these are the computations that you're asking numpy.cov() to perform.
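Tying this back to the PCA task in the question: once you have N > 1 observations of your D-dimensional vector stacked into an N x D matrix, the projection onto the top K eigenvectors is straightforward. A rough sketch, with illustrative variable names and random data that are not from the question:
import numpy as np

def project_top_k(X, K):
    # X: (N, D) data matrix, one observation per row; K: target dimensionality.
    Xc = X - X.mean(axis=0)                  # center the data
    C = np.cov(Xc, rowvar=0)                 # D x D covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:K]]  # top-K eigenvectors as columns
    return np.dot(Xc, top)                   # (N, K) projected data

X = np.random.rand(10, 4)                    # 10 observations in 4 dimensions
print(project_top_k(X, 2).shape)             # (10, 2)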

Concatenate unequal sized numpy arrays keeping index positioning fixed

Let's say I have data for pairs of 3 variables, A, B, and C (in my actual application the number of variables is anywhere from 1000 to 3000, but could be even higher).
Let's also say that the data comes in pieces, each piece being an array.
For example:
Array X:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
Where:
X[0,0] corresponds to data for variables A and A
X[0,1] corresponds to data for variables A and B
X[0,2] corresponds to data for variables A and C
X[1,0] corresponds to data for variables B and A
X[1,1] corresponds to data for variables B and B
X[1,2] corresponds to data for variables B and C
X[2,0] corresponds to data for variables C and A
X[2,1] corresponds to data for variables C and B
X[2,2] corresponds to data for variables C and C
Array Y:
np.array([[2,12],
[-12, 2]])
Y[0,0] corresponds to data for variables A and C
Y[0,1] corresponds to data for variables A and B
Y[1,0] corresponds to data for variables B and A
Y[1,1] corresponds to data for variables C and A
Array Z:
np.array([[ 99, 77],
[-77, -99]])
Z[0,0] corresponds to data for variables A and C
Z[0,1] corresponds to data for variables B and C
Z[1,0] corresponds to data for variables C and B
Z[1,1] corresponds to data for variables C and A
I want to concatenate the above arrays keeping the variable position fixed as follows:
END_RESULT_ARRAY index 0 corresponds to variable A
END_RESULT_ARRAY index 1 corresponds to variable B
END_RESULT_ARRAY index 2 corresponds to variable C
Basically, there are N variables in the universe, but the set can change every month (new ones can be introduced and existing ones can drop out and then return, or never return). Within the N variables in the universe I compute permutation pairs, and the position of each variable is fixed, i.e. index 0 corresponds to variable A, index 1 corresponds to variable B (as described above).
Given the above requirement the end END_RESULT_ARRAY should look like the following:
array([[[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]],
[[ nan, 12., 2.],
[-12., nan, nan],
[ 2., nan, nan]],
[[ nan, nan, 99.],
[ nan, nan, 77.],
[-99., -77., nan]]])
Keep in mind that the above is an illustration.
In my actual application, I have about 125 arrays and a new one is generated every month. Each monthly array may have different sizes and may only have data for a portion of the variables defined in my universe. Also, as new arrays are created each month there is no way of knowing what its size will be or which variables will have data (or which ones will be missing).
So, up until the most recent monthly array, we can determine the max size from the available historical data. Each month we will have to re-check the max size of all the arrays as a new array becomes available. Once we have the max size we can then re-stitch/concatenate all the arrays together, if this is something that is doable in numpy. This will be an ongoing operation done every month.
I want a general mechanism to be able to stitch these arrays together keeping the requirements I describe regarding the index position for the variables fixed.
I actually want to use H5PY arrays, as my data set will grow exponentially in the not too distant future. However, I would like to get this working with numpy as a first step.
Based on the comment made by @user3483203, the next step is to concatenate the arrays:
import numpy as np
a = np.array([[ 0.,  2.,  3.],
              [-2.,  0.,  4.],
              [-3., -4.,  0.]])
b = np.array([[0, 12], [-12, 0]])
# Pad b with NaNs up to the shape of a, then stack the two arrays.
out = np.full_like(a, np.nan)
i, j = b.shape
out[:i, :j] = b
res = np.array([a, out])
print(res)
This answers the original question which has since been changed:
Let's say I have the following arrays:
np.array([[ 0., 2., 3.],
[ -2., 0., 4.],
[ -3., -4., 0.]])
np.array([[0,12],
[-12, 0]])
I want to concatenate the above 2 arrays such that the end result is
as follows:
array([[[0, 2, 3],
[-2, 0, 4],
[-3,-4, 0]],
[[0,12, np.nan],
[-12, 0, np.nan],
[np.nan, np.nan, np.nan]]])
Find out how much each array falls short of the max size in each dimension, then use np.pad to pad at the end of each dimension, and finally np.stack to stack them together:
import numpy as np
a = np.arange(12).reshape(4, 3).astype(float)
b = np.arange(4).reshape(1, 4).astype(float)
arrs = (a, b)
# Maximum extent of every dimension across all the arrays.
dims = len(arrs[0].shape)
maxshape = tuple(max(x.shape[i] for x in arrs) for i in range(dims))
# Pad each array at the end of each dimension up to maxshape, filling with NaN.
paddedarrs = (np.pad(x, tuple((0, maxshape[i] - x.shape[i]) for i in range(dims)),
                     'constant', constant_values=(np.nan,)) for x in arrs)
c = np.stack(paddedarrs, 0)
print(a)
print(b, "\n======================")
print(c)
[[ 0. 1. 2.]
[ 3. 4. 5.]
[ 6. 7. 8.]
[ 9. 10. 11.]]
[[0. 1. 2. 3.]]
======================
[[[ 0. 1. 2. nan]
[ 3. 4. 5. nan]
[ 6. 7. 8. nan]
[ 9. 10. 11. nan]]
[[ 0. 1. 2. 3.]
[nan nan nan nan]
[nan nan nan nan]
[nan nan nan nan]]]
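Padding by size alone does not place each value at the index of its variable pair, which is the core requirement above. If every monthly array comes with an explicit list of (row variable, column variable) labels for its entries, one way to drop it into the fixed-size grid is sketched below; the place helper and the labels dictionary are hypothetical, not from the question:
import numpy as np

def place(values, pairs, labels):
    # values: one monthly array; pairs: (row_label, col_label) for each entry,
    # listed in row-major order; labels: dict mapping variable name -> fixed index.
    n = len(labels)
    out = np.full((n, n), np.nan)
    for (r, c), v in zip(pairs, values.ravel()):
        out[labels[r], labels[c]] = v
    return out

labels = {'A': 0, 'B': 1, 'C': 2}
Y = np.array([[2.0, 12.0], [-12.0, 2.0]])
y_pairs = [('A', 'C'), ('A', 'B'), ('B', 'A'), ('C', 'A')]  # mapping given for array Y
print(place(Y, y_pairs, labels))  # matches the middle slab of END_RESULT_ARRAY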

How to replace rows containing float values in a numpy array with rows of `NaN`s?

Say I have a numpy array:
a=np.array([[7,2,4],[1.2,7.4,3],[1.5,3.6,3.4]])
My goal is to replace rows which contain non-integer floats with a row of NaNs, and so far this is my attempt:
a[a.dtype==float]=np.nan
This works, but only for the first row that should be NaN; there's a second row that should be NaN that's left alone.
So my desired output would look like:
[[ 7. 2. 4.]
[ nan nan nan]
[ nan nan nan]]
Try rounding:
a[np.round(a)!=a] = np.nan
#array([[ 7., 2., 4.],
# [nan, nan, 3.],
# [nan, nan, nan]])
a.dtype==float returns True, so indexing with it doesn't really make sense. Also, all of your values are floats (you can check this with type(a[0][0])).
You could use the .is_integer method on floats, but I think np.mod will be faster
a[np.mod(a, 1) != 0] = np.nan
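Note that both snippets above only blank out the individual non-integer elements. To get the desired output, where an entire row becomes NaN if any of its elements is non-integer, one option (a sketch, not from the original answers) is to reduce the mask along axis 1:
mask = np.any(np.mod(a, 1) != 0, axis=1)  # rows containing any non-integer value
a[mask] = np.nan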

scipy.stats.skew nan_policy parameter

According to the scipy docs for skew, we have:
scipy.stats.skew(a, axis=0, bias=True, nan_policy='propagate')
where,
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional Defines how to
handle when input contains nan. ‘propagate’ returns nan, ‘raise’
throws an error, ‘omit’ performs the calculations ignoring nan values.
Default is ‘propagate’
So the default NaN handling is 'propagate'. How exactly are the NaNs propagated? I can understand the 'omit' method, since it performs the calculations ignoring the NaN values, and the 'raise' method, but the docs don't make it clear how the missing values are treated in the 'propagate' case and how the results would be plotted.
Also, it would be great if someone explained the bias parameter too.
bias : bool, optional
If False, then the calculations are corrected for statistical bias.
With 'propagate', it just does the computations with the data as is, so any NaN flows through the arithmetic and the result for that slice is NaN:
In [43]: x = np.arange(12, dtype=float).reshape(4, -1)
In [44]: x[2, 1] = np.nan
In [45]: x
Out[45]:
array([[ 0., 1., 2.],
[ 3., 4., 5.],
[ 6., nan, 8.],
[ 9., 10., 11.]])
In [46]: stats.skew(x, nan_policy='propagate')
Out[46]: array([ 0., nan, 0.])
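As for bias: with bias=True (the default) skew returns the plain sample estimate, while bias=False applies the standard small-sample correction (the adjusted Fisher-Pearson coefficient). A minimal sketch, with made-up data purely for illustration:
import numpy as np
from scipy import stats
data = np.array([2., 8., 0., 4., 1., 9., 9., 0.])
print(stats.skew(data))              # default bias=True: plain sample estimate
print(stats.skew(data, bias=False))  # corrected for statistical bias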

Delete nan AND corresponding elements in two same-length arrays

I have two lists of the same length that I can convert into arrays to use with the scipy.stats.pearsonr method. Now, some of the elements of these lists are nan, and thus cannot be used by that method. The best thing to do in my case is to remove those elements, along with the corresponding elements in the other list. Is there a practical and pythonic way to do it?
Example: I have
[1 2 nan 4 5 6 ] and [1 nan 3 nan 5 6]
and in the end I need
[1 5 6 ]
[1 5 6 ]
(here the numbers are representative of the positions/indices, not of the actual numbers I am dealing with). EDIT: The tricky part is to remove, from each array, both its own nans and the elements corresponding to nans in the other array, and vice versa. Although it can certainly be done by manipulating the arrays, I am sure there is a clear and not overcomplicated way to do it in a pythonic way.
The accepted answer to the proposed duplicate gets you half-way there. Since you're using Numpy already, you should make these into numpy arrays, then generate a boolean indexing expression and use it to index both arrays. Here indices will be a new boolean array of the same shape, where each element is True iff neither the respective element of x nor the respective element of y is nan:
>>> x
array([ 1., 2., nan, 4., 5., 6.])
>>> y
array([ 1., nan, 3., nan, 5., 6.])
>>> indices = np.logical_not(np.logical_or(np.isnan(x), np.isnan(y)))
>>> x = x[indices]
>>> y = y[indices]
>>> x
array([ 1., 5., 6.])
>>> y
array([ 1., 5., 6.])
Notably, this works for any 2 arrays of same shape.
P.S., if you know that the element type in the operand arrays is boolean, as is the case for arrays returned from isnan here, you can use ~ instead of logical_not and | instead of logical_or: indices = ~(np.isnan(x) | np.isnan(y))
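A short usage follow-up, assuming the goal from the question is the Pearson correlation and scipy is available:
from scipy import stats
r, p = stats.pearsonr(x, y)   # x, y are the filtered arrays from above
print(r, p)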

NumPy array sum reduce

I have a numpy array with three columns of the form:
x1 y1 f1
x2 y2 f2
...
xn yn fn
The (x,y) pairs may repeat. I would need another array such that each (x,y) pair appears once and the corresponding third column is the sum of all the f values that appeared next to (x,y).
For example, the array
1 2 4.0
1 1 5.0
1 2 3.0
0 1 9.0
would give
0 1 9.0
1 1 5.0
1 2 7.0
The order of rows is not relevant. What is the fastest way to do this in Python?
Thank you!
This would be one approach to solve it -
import numpy as np
# Input array
A = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
# Extract xy columns
xy = A[:,0:2]
# Perform lex sort and get the sorted indices and xy pairs
sorted_idx = np.lexsort(xy.T)
sorted_xy = xy[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_xy,axis=0)
df2 = np.append([True],np.any(df1!=0,1),0)
# OR df2 = np.append([True],np.logical_or(df1[:,0]!=0,df1[:,1]!=0),0)
# OR df2 = np.append([True],np.dot(df1!=0,[True,True]),0)
# Get unique sorted labels
sorted_labels = df2.cumsum(0)-1
# Get labels
labels = np.zeros_like(sorted_idx)
labels[sorted_idx] = sorted_labels
# Get unique indices
unq_idx = sorted_idx[df2]
# Get counts and unique rows and setup output array
counts = np.bincount(labels, weights=A[:,2])
unq_rows = xy[unq_idx,:]
out = np.append(unq_rows,counts.ravel()[:,None],1)
Input & Output -
In [169]: A
Out[169]:
array([[ 1., 2., 4.],
[ 1., 1., 5.],
[ 1., 2., 3.],
[ 0., 1., 9.]])
In [170]: out
Out[170]:
array([[ 0., 1., 9.],
[ 1., 1., 5.],
[ 1., 2., 7.]])
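On newer numpy (1.13+, where np.unique accepts an axis argument) the same grouping can be written much more compactly. A sketch of that alternative, not part of the original answer:
unq_xy, inverse = np.unique(A[:, :2], axis=0, return_inverse=True)
sums = np.bincount(inverse.ravel(), weights=A[:, 2])   # sum f per unique (x, y) pair
out = np.column_stack([unq_xy, sums])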
Thanks to @hpaulj, finally found the simplest solution. If d contains the 3-column data:
ind = d[:, 0:2].astype(int)              # integer (x, y) index pairs
x = np.zeros((N, N))
np.add.at(x, (ind[:, 0], ind[:, 1]), d[:, 2])
This solution assumes that the (x,y) indices in the first two columns are integer and smaller than N. This is what I need and should have mentioned in the post.
Edit: Note that the above solution produces a full N x N matrix (mostly zeros) with the summed values stored at position (x, y), rather than the 3-column format.
Certainly easily done in Python:
arr = np.array([[1,2,4.0],
[1,1,5.0],
[1,2,3.0],
[0,1,9.0]])
d={}
for x, y, z in arr:
d.setdefault((x,y), 0)
d[x,y]+=z
>>> d
{(1.0, 2.0): 7.0, (0.0, 1.0): 9.0, (1.0, 1.0): 5.0}
Then translate back to numpy:
>>> np.array([[x,y,d[(x,y)]] for x,y in d.keys()])
array([[ 1., 2., 7.],
[ 0., 1., 9.],
[ 1., 1., 5.]])
If you have scipy, the sparse module does this kind of addition - again for an array where the 1st 2 columns are integers, i.e. indexes.
from scipy import sparse
# data in column 2, row/column indices in columns 0 and 1
M = sparse.csr_matrix((d[:, 2], (d[:, 0].astype(int), d[:, 1].astype(int))))
M = M.tocoo()  # there may be a shortcut to this csr/coo round trip
x = np.column_stack([M.row, M.col, M.data])  # needs testing
For convenience in constructing certain kinds of linear algebra matrices, the csr sparse array format sums values with duplicate indices. It's implemented in compiled code so should be fairly fast. But putting the data into M and taking it back out might slow it down.
(ps. I haven't tested this script since I'm writing this on a machine without scipy).
