Rows of sparse matrix where no column is zero - python

I have a matrix like this:
A = sp.csr_matrix(np.array(
[[1, 1, 2, 1],
[0, 0, 2, 0],
[1, 4, 1, 1],
[0, 1, 0, 0]]))
I want to get all the rows where all columns are nonzero, so I can then get their sum. Either as an array:
rows = [True, False, True, False]
result = A[rows].sum()
Or as indices:
rows = [0, 2]
result = A[rows].sum()
I am stuck however at the first part, figuring out which rows to include in the sum, as most results seem to be looking for the opposite (rows where all columns are zero).

In [35]: from scipy import sparse
In [36]: A = sparse.csr_matrix(np.array(
...: [[1, 1, 2, 1],
...: [0, 0, 2, 0],
...: [1, 4, 1, 1],
...: [0, 1, 0, 0]]))
In [37]: A
Out[37]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
Sparse doesn't do 'all/any' kinds of operations because they treat 0's as significant values.
all on the dense equivalent works nicely:
In [41]: A.A.all(axis=1)
Out[41]: array([ True, False, True, False])
On the sparse one we can turn the dtype to boolean, and sum along the axis. And then test it for the full value:
In [42]: A.astype(bool).sum(axis=1)
Out[42]:
matrix([[4],
[1],
[4],
[1]])
In [43]: A.astype(bool).sum(axis=1).A1==4
Out[43]: array([ True, False, True, False])
Notice that the sparse sum returns a np.matrix. I used A1 to turn that into a 1d array.
If the matrix isn't too large, working with the dense array may be faster. Sparse operations like sum are actually performed with matrix multiplication.
In [51]: A.astype(bool)@np.ones(4,int)
Out[51]: array([4, 1, 4, 1])
Or we could convert it to lil format, and look at the length of the 'rows':
In [67]: A.tolil().data
Out[67]:
array([list([1, 1, 2, 1]), list([2]), list([1, 4, 1, 1]), list([1])],
dtype=object)
In [68]: [len(i) for i in A.tolil().data]
Out[68]: [4, 1, 4, 1]
But wait, there's more. The indptr attribute of the csr is:
In [69]: A.indptr
Out[69]: array([ 0, 4, 5, 9, 10], dtype=int32)
In [70]: np.diff(A.indptr)
Out[70]: array([4, 1, 4, 1], dtype=int32)
I've omitted some test timings, but this last is clearly the fastest!
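To finish the original task with the indptr approach, the per-row counts can be compared against the number of columns and the matching rows summed. This is only a sketch: it assumes the matrix may contain explicitly stored zeros (which would inflate the counts), so it calls eliminate_zeros() first, and the names full_rows and result are just illustrative.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array(
    [[1, 1, 2, 1],
     [0, 0, 2, 0],
     [1, 4, 1, 1],
     [0, 1, 0, 0]]))

# Assumption: drop explicitly stored zeros so indptr counts true nonzeros only.
A.eliminate_zeros()

# A row is "all nonzero" when it stores one entry per column.
full_rows = np.diff(A.indptr) == A.shape[1]

# Select those rows by integer index and sum every element in them.
result = A[np.flatnonzero(full_rows)].sum()
print(full_rows, result)
```

For the example matrix this selects rows 0 and 2.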

It is a bit easier to do for numpy arrays than for sparse ones. If you do not mind converting to numpy as an intermediate step, you can get the right rows via
(A.toarray() != 0).all(axis=1)
to produce
array([ True, False, True, False])
and then use it in indexing A as such:
A[(A.toarray() != 0).all(axis=1),:].sum()
returns 12

Related

Scipy create sparse row matrix from a list of indices and a list of list data

Suppose I have a list of lists of data, and a list containing the row number of each entry. How do I convert them to a sparse matrix?
Example:
import numpy as np
data = np.array([[1,2,3],[4,5,6],[7,8,9]])
indices = np.array([0,0,4]) # row number, sum when duplicated
expected output is:
[[5, 7, 9], # row 0: [5,7,9]=[1,2,3]+[4,5,6]
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]] # row 4
I understand that I can construct it using scipy.sparse.csr_matrix with data, and row, col, or indptr, but I now have already calculated data and indices, is there a way to simply construct a sparse matrix using these two? Thanks!
According to the documentation, there is a constructor that utilizes the CSR information directly:
csr_matrix((data, indices, indptr), [shape=(M, N)])
So in your specific case, you could write it like:
data = np.array([1,2,3,4,5,6,7,8,9])
indices = np.array([0,1,2,0,1,2,0,1,2]) # col numbers
indptr = np.array([0,6,6,6,9]) # row pointers
mat = csr_matrix((data, indices, indptr), shape=(4, 3))
For an example of how the CSR format works, you can look at a reference on sparse matrices. I will explain the code nonetheless:
First, the data needs to be flattened to a single list. The indices of the CSR format relate to the column-indices, while the indptr is used to point to the rows.
So having an indptr value of 0 at position 0 in the list tells us that the 1st row (position + 1) of the matrix starts after 0 data entries. Similarly, a value of 6 at position 1 in the list tells us that the 2nd row (position + 1) of the matrix starts after 6 data entries.
The column-indices list is as you would expect it to behave: data[i] is positioned in column indices[i].
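The indptr / indices description above can be checked by hand without scipy. The following is a rough sketch that walks the three CSR arrays and rebuilds the dense matrix; the variable names dense, start and stop are illustrative, and the column count of 3 is taken from the example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
indices = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])  # column of each data entry
indptr = np.array([0, 6, 6, 6, 9])               # row r owns data[indptr[r]:indptr[r+1]]

n_rows = len(indptr) - 1
dense = np.zeros((n_rows, 3), dtype=int)
for r in range(n_rows):
    start, stop = indptr[r], indptr[r + 1]
    # np.add.at accumulates, so repeated column indices in a row are summed.
    np.add.at(dense[r], indices[start:stop], data[start:stop])

print(dense)
```

Rows 1 and 2 own empty slices (6:6), which is why they stay all zero.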
In [131]: data = np.array([[1,2,3],[4,5,6],[7,8,9]])
...: indices = np.array([0,0,3]) # row number, sum when duplicated
I corrected the indices for 0 based indexing.
We don't need sparse to sum the duplicates. There's a np.add.at that does this nicely:
In [135]: res = np.zeros((4,3),int)
In [136]: np.add.at(res, indices, data)
In [137]: res
Out[137]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
If we make a csr from that:
In [141]: M = sparse.csr_matrix(res)
In [142]: M
Out[142]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [143]: M.data
Out[143]: array([5, 7, 9, 7, 8, 9])
In [144]: M.indices
Out[144]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [145]: M.indptr
Out[145]: array([0, 3, 3, 3, 6], dtype=int32)
To make a csr directly, it's often easier to use the coo style of inputs. They are easier to understand.
Those inputs are 3 1d arrays of the same size:
In [160]: data.ravel()
Out[160]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [161]: row = np.repeat(indices,3)
In [162]: row
Out[162]: array([0, 0, 0, 0, 0, 0, 3, 3, 3])
In [163]: col = np.tile(np.arange(3),3)
In [164]: col
Out[164]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])
In [165]: M1 = sparse.coo_matrix((data.ravel(),(row, col)))
In [166]: M1.data
Out[166]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
The coo format leaves the inputs as given; but on conversion to csr duplicates are summed.
In [168]: M2 = M1.tocsr()
In [169]: M2
Out[169]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
In [170]: M2.data
Out[170]: array([5, 7, 9, 7, 8, 9])
In [171]: M2.indices
Out[171]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [172]: M2.indptr
Out[172]: array([0, 3, 3, 3, 6], dtype=int32)
In [173]: M2.A
Out[173]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
@Erik shows how to use the csr format directly:
In [174]: M3 =sparse.csr_matrix((data.ravel(), col, [0,6,6,6,9]))
In [175]: M3
Out[175]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Row format>
In [176]: M3.A
Out[176]:
array([[5, 7, 9],
[0, 0, 0],
[0, 0, 0],
[7, 8, 9]])
In [177]: M3.indices
Out[177]: array([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32)
Note this has 9 nonzero elements; it hasn't summed the duplicates for storage (though the .A display shows them summed). To sum, we need an extra step:
In [179]: M3.sum_duplicates()
In [180]: M3.data
Out[180]: array([5, 7, 9, 7, 8, 9])
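Putting the coo route together, a small helper could look like the sketch below. The function name rows_to_csr and its n_rows parameter are hypothetical (not part of scipy); passing an explicit shape is a choice made here so that trailing empty rows are not lost.

```python
import numpy as np
from scipy import sparse

def rows_to_csr(data, row_ids, n_rows):
    """Hypothetical helper: place each row of `data` at `row_ids[i]`, summing duplicates."""
    data = np.asarray(data)
    row_ids = np.asarray(row_ids)
    n_cols = data.shape[1]
    row = np.repeat(row_ids, n_cols)                 # row index for every data entry
    col = np.tile(np.arange(n_cols), len(row_ids))   # column index for every data entry
    coo = sparse.coo_matrix((data.ravel(), (row, col)), shape=(n_rows, n_cols))
    return coo.tocsr()                               # conversion to csr sums duplicates

M = rows_to_csr([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [0, 0, 3], n_rows=4)
print(M.toarray())
```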

Understanding numpy.where

I want to get first index of numpy array element which is greater than some specific element of that same array. I tried following:
>>> Q5=[[1,2,3],[4,5,6]]
>>> Q5 = np.array(Q5)
>>> Q5[0][Q5[0]>Q5[0,0]]
array([2, 3])
>>> np.where(Q5[0]>Q5[0,0])
(array([1, 2], dtype=int32),)
>>> np.where(Q5[0]>Q5[0,0])[0][0]
1
Q1. Is above correct way to obtain first index of an element in Q5[0] greater than Q5[0,0]?
I am more concerned with np.where(Q5[0]>Q5[0,0]) returning tuple (array([1, 2], dtype=int32),) and hence requiring me to double index [0][0] at the end of np.where(Q5[0]>Q5[0,0])[0][0].
Q2. Why this return tuple, but below returns proper numpy array?
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)
array([-1, 2, 3])
So that I can index directly:
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)[1]
2
In [58]: A = np.arange(1,10).reshape(3,3)
In [59]: A.shape
Out[59]: (3, 3)
In [60]: A
Out[60]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.where with just the condition is really np.nonzero.
Generate a boolean array:
In [63]: A==6
Out[63]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
Find where that is true:
In [64]: np.nonzero(A==6)
Out[64]: (array([1]), array([2]))
The result is a tuple, one element per dimension of the condition. Each element is an indexing array, together they define the location of the True(s)
Another test with several True
In [65]: (A%3)==1
Out[65]:
array([[ True, False, False],
[ True, False, False],
[ True, False, False]])
In [66]: np.nonzero((A%3)==1)
Out[66]: (array([0, 1, 2]), array([0, 0, 0]))
Using the tuple to index the original array:
In [67]: A[np.nonzero((A%3)==1)]
Out[67]: array([1, 4, 7])
Using the 3 argument where to create a new array with a mix of values from A and A+10
In [68]: np.where((A%3)==1,A+10, A)
Out[68]:
array([[11, 2, 3],
[14, 5, 6],
[17, 8, 9]])
If the condition has multiple True, nonzero isn't the best tool for finding the "first", since it necessarily finds all.
The nonzero tuple can be turned into a 2d array with a transpose. It actually may be easier to get the "first" from this array:
In [73]: np.argwhere((A%3)==1)
Out[73]:
array([[0, 0],
[1, 0],
[2, 0]])
You are looking in a 1d array, a row of A:
In [77]: A[0]>A[0,0]
Out[77]: array([False, True, True])
In [78]: np.nonzero(A[0]>A[0,0])
Out[78]: (array([1, 2]),) # 1 element tuple
In [79]: np.argwhere(A[0]>A[0,0])
Out[79]:
array([[1],
[2]])
In [81]: np.where(A[0]>A[0,0], 100, 0) # 3 argument where
Out[81]: array([ 0, 100, 100])
So whether you are searching a 1d array or a 2d (or 3 or 4), nonzero returns a tuple with one array element per dimension. That way it can always be used to index a like sized array. The 1d tuple might look redundant, but it is consistent with other dimensional results.
When trying to understand operations like this, read the docs carefully, and look at individual steps. Here I look at the conditional matrix, the nonzero result, and its various uses.
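For the 1d case in the question, np.flatnonzero avoids the one-element tuple, so a single [0] is enough. A minimal sketch, assuming you want None when nothing matches (that fallback is my choice, not something numpy does for you):

```python
import numpy as np

Q5 = np.array([[1, 2, 3], [4, 5, 6]])

cond = Q5[0] > Q5[0, 0]          # boolean mask over the first row
idx = np.flatnonzero(cond)       # plain 1d array of matching positions, no tuple

# Guard against an empty result before taking the first match.
first = int(idx[0]) if idx.size else None
print(idx, first)
```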
Using argmax with a boolean array will give you the index of the first True.
In [54]: q
Out[54]:
array([[1, 2, 3],
[4, 5, 6]])
In [55]: q > q[0,0]
Out[55]:
array([[False, True, True],
[ True, True, True]], dtype=bool)
argmax can take an axis/dimension argument.
In [56]: np.argmax(q > q[0,0], 0)
Out[56]: array([1, 0, 0], dtype=int64)
That says the first True is index one for column zero and index zero for columns one and two.
In [57]: np.argmax(q > q[0,0], 1)
Out[57]: array([1, 0], dtype=int64)
That says the first True is index one for row zero and index zero for row one.
Q1. Is above correct way to obtain first index of an element in Q5[0] greater than Q5[0,0]?
No, I would use argmax with 1 for the axis argument, then select the first item from that result.
Q2. Why this return tuple
You told it to return -1 for False values and return Q5[0] items for True values.
Q2 ...but below returns proper numpy array?
You got lucky and chose the correct index.
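One caveat with the argmax approach: argmax also returns 0 when the boolean array contains no True at all, so the result is ambiguous unless any() is checked as well. A minimal sketch, where first_true is a hypothetical helper name and -1 is an arbitrary "not found" marker:

```python
import numpy as np

def first_true(mask):
    """Hypothetical helper: index of the first True in a 1d mask, or -1 if there is none."""
    mask = np.asarray(mask, dtype=bool)
    return int(mask.argmax()) if mask.any() else -1

q = np.array([[1, 2, 3],
              [4, 5, 6]])

hit = first_true(q[0] > q[0, 0])   # some elements of row 0 exceed q[0, 0]
miss = first_true(q[0] > 10)       # nothing in row 0 exceeds 10
print(hit, miss)
```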
numpy.where() is like a for loop with an if/else.
numpy.where(condition, x, y)
condition - just like an if condition, evaluated for every element.
x - the values to use where the condition is true
y - the values to use where the condition is false
If we would like to write it for a 1-dimensional array it should look something like this:
[xv if c else yv for c, xv, yv in zip(condition, x, y)]
Example:
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
First we create an array with the numbers from 0 to 9 (0, 1, 2 ... 7, 8, 9).
Then np.where checks each value against the condition a < 5.
The values that are less than 5 stay the same, and the values that are 5 or greater are multiplied by 10.
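To make the "loop with an if/else" reading concrete, the sketch below compares the three-argument np.where with the equivalent list comprehension on the same example. The names vectorized and looped are only illustrative.

```python
import numpy as np

a = np.arange(10)

# Vectorized: take a where a < 5, otherwise take 10 * a.
vectorized = np.where(a < 5, a, 10 * a)

# Same selection spelled out element by element.
looped = np.array([x if c else y for c, x, y in zip(a < 5, a, 10 * a)])

print(vectorized)
print(np.array_equal(vectorized, looped))
```

Note that both 10 * a and a are fully computed before the selection happens; np.where only picks between already evaluated arrays.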

2d numpy mask not working as expected

I'm trying to turn a 2x3 numpy array into a 2x2 array by removing select indexes.
I think I can do this with a mask array with true/false values.
Given
[ 1, 2, 3],
[ 4, 1, 6]
I want to remove one element from each row to give me:
[ 2, 3],
[ 4, 6]
However this method isn't working quite like I would expect:
import numpy as np
in_array = np.array([
[ 1, 2, 3],
[ 4, 1, 6]
])
mask = np.array([
[False, True, True],
[True, False, True]
])
print(in_array[mask])
Gives me:
[2 3 4 6]
Which is not what I want. Any ideas?
The only thing 'wrong' with that is its shape - 1d rather than 2d. But what if your mask was
mask = np.array([
[False, True, False],
[True, False, True]
])
1 value in the first row, 2 in second. It couldn't return that as a 2d array, could it?
So the default behavior when masking like this is to return a 1d, or raveled result.
Boolean indexing like this is effectively a where indexing:
In [19]: np.where(mask)
Out[19]: (array([0, 0, 1, 1], dtype=int32), array([1, 2, 0, 2], dtype=int32))
In [20]: in_array[_]
Out[20]: array([2, 3, 4, 6])
It finds the elements of the mask which are true, and then selects the corresponding elements of the in_array.
Maybe the transpose of where is easier to visualize:
In [21]: np.argwhere(mask)
Out[21]:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2]], dtype=int32)
and indexing iteratively:
In [23]: for ij in np.argwhere(mask):
...: print(in_array[tuple(ij)])
...:
2
3
4
6
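If every row of the mask keeps the same number of elements, as in the question, the 1d result can simply be reshaped back to 2d. A small sketch under that assumption; the explicit check is there because a reshape would silently misalign rows if the per-row counts differed but the total happened to divide evenly.

```python
import numpy as np

in_array = np.array([[1, 2, 3],
                     [4, 1, 6]])
mask = np.array([[False, True, True],
                 [True, False, True]])

# Assumption: each row keeps the same number of elements.
kept_per_row = mask.sum(axis=1)
if not (kept_per_row == kept_per_row[0]).all():
    raise ValueError("mask keeps a different number of elements per row")

# Boolean indexing returns elements in row-major order, so reshaping restores the rows.
out = in_array[mask].reshape(in_array.shape[0], -1)
print(out)
```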

indices of sparse_csc matrix are reversed after extracting some columns

I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:
In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]
In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
How come b.indices is not [0, 1, 2, 3, 0, 1, 2, 3] ?
And since this behaviour is not the one I expect, is a[:, ind] not the correct way to extract columns from a csc matrix?
The indices are not sorted. You can either get sorted indices by reversing a's rows while indexing, which is not that intuitive, or enforce sorted indices with sorted_indices() (you can also do it in-place, but I prefer getting a copy). What I find funny is that the has_sorted_indices attribute does not always return a boolean; sometimes it returns an integer instead.
a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()
b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True
csc and csr indices are not guaranteed to be sorted. I can't offhand find documentation to that effect, but the has_sorted_indices attribute and the sort methods suggest it.
In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:
In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
This indexing is the equivalent to constructing a 'selection' matrix:
In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]:
array([[1, 0],
[0, 1],
[0, 0],
[0, 0],
[0, 0]], dtype=int32)
and doing this matrix multiplication:
In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
The order is the result of how the matrix multiplication was done. In fact a * a.T does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc and csr calculation code doesn't require sorted indices, and doesn't bother to ensure the results are sorted.
https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details
Further Details
CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).
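The two methods named in the docs differ only in whether they copy. The sketch below builds a csc matrix with deliberately unsorted row indices (constructed by hand, so it does not depend on how a particular scipy version implements column indexing) and shows that sorting reorders data together with indices while leaving the matrix itself unchanged. The data values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Two columns of a 4x2 matrix, stored with row indices in descending order.
data = np.array([40, 30, 20, 10, 41, 31, 21, 11])
indices = np.array([3, 2, 1, 0, 3, 2, 1, 0])
indptr = np.array([0, 4, 8])
b = sparse.csc_matrix((data, indices, indptr), shape=(4, 2))

dense_before = b.toarray()

c = b.sorted_indices()   # returns a sorted copy, b is left as it was
b.sort_indices()         # sorts b in place

print(c.indices, c.data)
print(np.array_equal(dense_before, b.toarray()))
```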

Instantiate a matrix with x zeros and the rest ones

I would like to be able to quickly instantiate a matrix where the first few (variable number of) cells in a row are 0, and the rest are ones.
Imagine we want a 4x3 matrix (4 rows, 3 columns).
I have instantiated the matrix first as all ones:
ones = np.ones([4,3])
Then imagine we have an array that announces how many leading zeros there are:
arr = np.array([2,1,3,0]) # first row has 2 zeroes, second row 1 zero, etc
Required result:
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
Obviously this can be done in the opposite way as well, but I'd consider the approach where 1 is a default value, and zeros would be replaced.
What would be the best way to avoid some silly loop?
Here's one way. n is the number of columns in the result. The number of rows is determined by len(arr).
In [29]: n = 5
In [30]: arr = np.array([1, 2, 3, 0, 3])
In [31]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[31]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
There are two parts to the explanation of how this works. First, how to create a row with m zeros and n-m ones? For that, we use np.arange to create a row with values [0, 1, ..., n-1]:
In [35]: n
Out[35]: 5
In [36]: np.arange(n)
Out[36]: array([0, 1, 2, 3, 4])
Next, compare that array to m:
In [37]: m = 2
In [38]: np.arange(n) >= m
Out[38]: array([False, False, True, True, True], dtype=bool)
That gives an array of boolean values; the first m values are False and the rest are True. By casting those values to integers, we get an array of 0s and 1s:
In [39]: (np.arange(n) >= m).astype(int)
Out[39]: array([0, 0, 1, 1, 1])
To perform this over an array of m values (your arr), we use broadcasting; this is the second key idea of the explanation.
Note what arr[:, np.newaxis] gives:
In [40]: arr
Out[40]: array([1, 2, 3, 0, 3])
In [41]: arr[:, np.newaxis]
Out[41]:
array([[1],
[2],
[3],
[0],
[3]])
That is, arr[:, np.newaxis] reshapes arr into a 2-d array with shape (5, 1). (arr.reshape(-1, 1) could have been used instead.) Now when we compare this to np.arange(n) (a 1-d array with length n), broadcasting kicks in:
In [42]: np.arange(n) >= arr[:, np.newaxis]
Out[42]:
array([[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, True, True],
[ True, True, True, True, True],
[False, False, False, True, True]], dtype=bool)
As @RogerFan points out in his comment, this is basically an outer product of the arguments, using the >= operation.
A final cast to type int gives the desired result:
In [43]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[43]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
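Applied to the exact values from the question (3 columns, arr = [2, 1, 3, 0]), the same broadcasting comparison could be written as below. This is just the technique above with the question's numbers plugged in; n_cols is an illustrative name.

```python
import numpy as np

n_cols = 3
arr = np.array([2, 1, 3, 0])   # leading zeros wanted in each row

# Column index >= number of leading zeros -> 1, otherwise 0.
# arr[:, np.newaxis] has shape (4, 1) and broadcasts against shape (3,).
result = (np.arange(n_cols) >= arr[:, np.newaxis]).astype(int)
print(result)
```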
Not as concise as I wanted (I was experimenting with mask_indices), but this will also do the work:
>>> n = 3
>>> zeros = [2, 1, 3, 0]
>>> numpy.array([[0] * zeros[i] + [1]*(n - zeros[i]) for i in range(len(zeros))])
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
>>>
It works very simply: for each row, the one-element lists [0] and [1] are repeated the required number of times and concatenated, building the array row by row.
