Find first nonzero column in scipy.sparse matrix - python

I am looking for the first column containing a nonzero element in a sparse matrix (scipy.sparse.csc_matrix). More precisely, I want the first column at or after the i-th one that contains a nonzero element.
This is part of a certain type of linear equation solver. For dense matrices I had the following: (Relevant line is pcol = ...)
import numpy
D = numpy.matrix([[1,0,0],[2,0,0],[3,0,1]])
i = 1
pcol = i + numpy.argmax(numpy.any(D[:,i:], axis=0))
if pcol != i:
    # Pivot columns i, pcol
    D[:,[i,pcol]] = D[:,[pcol,i]]
print(D)
# Result should be numpy.matrix([[1,0,0],[2,0,0],[3,1,0]])
The above should swap columns 1 and 2. If we set i = 0 instead, D is unchanged since column 0 already contains nonzero entries.
What is an efficient way to do this for scipy.sparse matrices? Are there analogues for the numpy.any() and numpy.argmax() functions?

With a csc matrix it is easy to find the nonzero columns.
In [302]: arr=sparse.csc_matrix([[0,0,1,2],[0,0,0,2]])
In [303]: arr.A
Out[303]:
array([[0, 0, 1, 2],
       [0, 0, 0, 2]])
In [304]: arr.indptr
Out[304]: array([0, 0, 0, 1, 3])
In [305]: np.diff(arr.indptr)
Out[305]: array([0, 0, 1, 2])
The last line shows how many nonzero terms there are in each column.
np.nonzero(np.diff(arr.indptr))[0][0] would be the index of the first nonzero value in that diff.
Do the same on a csr matrix to find the first nonzero row.
I can elaborate on indptr if you want.
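The question asks for the first nonzero column at or after column i, and the indptr trick extends to that case by slicing. Below is a minimal sketch, assuming explicitly stored zeros do not matter (np.diff(indptr) counts stored entries, so call eliminate_zeros() on the matrix first if they might be present). The helper name first_nonzero_col is made up for illustration.

```python
import numpy as np
from scipy import sparse

def first_nonzero_col(mat, start=0):
    """Index of the first column >= start with a stored entry, or None."""
    csc = mat.tocsc()
    # indptr[start:] has one more element than the number of remaining columns,
    # so the diff gives the stored-entry count of columns start, start+1, ...
    counts = np.diff(csc.indptr[start:])
    hits = np.flatnonzero(counts)
    return start + int(hits[0]) if hits.size else None

D = sparse.csc_matrix([[1, 0, 0], [2, 0, 0], [3, 0, 1]])
print(first_nonzero_col(D, 1))  # column 1 is empty, column 2 is not
print(first_nonzero_col(D, 0))  # column 0 already has entries
```

Returning None when no column qualifies avoids the pitfall of numpy.argmax, which returns 0 when every remaining column is empty.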

Related

numpy get row index where elements in certain columns are zero

I want to find indexes of row based on criteria over certain columns
So, something like:
import numpy as np
x = np.random.rand(4, 5)
x[2, 2] = 0
x[2, 3] = 0
x[3, 1] = 0
x[1, 3] = 0
Now, I want to get the index of the rows where either of columns 3 or 4 are zeros. How can one do that with numpy? Do I need to make multiple calls to nonzero for each column and combine these indices using a set or something like that?
Use np.where; the first array in the returned tuple holds the row indices:
np.where(x[:,[3,4]]==0)
Out[79]: (array([1, 2], dtype=int64), array([0, 0], dtype=int64))
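If a row has zeros in both columns, np.where reports it twice. Here is a small sketch that returns each matching row once. It uses a fixed array rather than random data so the result is reproducible, and adds one extra zero (not in the question) to show the duplicate case.

```python
import numpy as np

x = np.ones((4, 5))
x[2, 2] = 0
x[2, 3] = 0
x[3, 1] = 0
x[1, 3] = 0
x[2, 4] = 0  # extra zero, not in the question: row 2 now matches in both columns

# True for every row that has a zero in column 3 or column 4
mask = (x[:, [3, 4]] == 0).any(axis=1)
rows = np.flatnonzero(mask)
print(rows)
```

This needs a single comparison and no set operations, since any(axis=1) collapses the two columns into one flag per row.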

How to find where in numpy array a zero element is preceded by at least N-1 consecutive zeros?

Given a numpy array (let it be a bit array for simplicity), how can I construct a new array of the same shape where 1 stands exactly at the positions where in the original array there was a zero, preceded by at least N-1 consecutive zeros?
For example, what is the best way to implement function nzeros having two arguments, a numpy array and the minimal required number of consecutive zeros:
import numpy as np
a = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
b = nzeros(a, 3)
Function nzeros(a, 3) should return
array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
Approach #1
We can use 1D convolution -
def nzeros(a, n):
    # Define kernel for 1D convolution
    k = np.ones(n, dtype=int)
    # Get sliding summations for zero matches with that kernel
    s = np.convolve(a==0, k)
    # Look for summations that are equal to n, which occur for n
    # consecutive 0s. We are using the "full" version of convolution, so
    # the output is n-1 elements longer than the input. We need 1s at the
    # places where runs of n consecutive 0s end, so we keep only the first
    # len(a) elements (this also handles n = 1, where [:-n+1] would be
    # empty) and convert to int dtype.
    return (s==n).astype(int)[:len(a)]
Sample run -
In [46]: a
Out[46]: array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
In [47]: nzeros(a,3)
Out[47]: array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
In [48]: nzeros(a,2)
Out[48]: array([0, 1, 1, 1, 0, 0, 1, 1, 0, 0])
Approach #2
Another way, which can be seen as a variant of the 1D convolution approach, is to use erosion: looking at the outputs, we simply erode the mask of 0s by n-1 places from the start of each run. We can use scipy.ndimage's binary_erosion, which also lets us place the kernel center with its origin argument, so we avoid any slicing. The implementation would look something like this -
import numpy as np
from scipy.ndimage import binary_erosion
def nzeros(a, n):
    return binary_erosion(a==0, np.ones(n), origin=(n-1)//2).astype(int)
Using for loop:
def nzeros(a, n):
    # Create a numpy array of zeros of length equal to n
    b = np.zeros(n)
    # Create a numpy array of zeros of same length as array a
    c = np.zeros(len(a), dtype=int)
    # Slide a window of length n over a; the + 1 includes the last window
    for i in range(len(a) - n + 1):
        if (b == a[i : i+n]).all():  # Check if array b is equal to slice in a
            c[i+n-1] = 1
    return c
Sample Output:
print(nzeros(a, 3))
[0 0 1 1 0 0 0 1 0 0]
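For checking any of the implementations above, a plain run-length counter states the definition directly. This is a sketch for reference only; the name nzeros_reference is made up here.

```python
import numpy as np

def nzeros_reference(a, n):
    """1 where a zero ends a run of at least n consecutive zeros, else 0."""
    a = np.asarray(a)
    out = np.zeros(len(a), dtype=int)
    run = 0  # length of the current run of zeros ending at position i
    for i, v in enumerate(a):
        run = run + 1 if v == 0 else 0
        out[i] = int(run >= n)
    return out

a = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1])
print(nzeros_reference(a, 3))
print(nzeros_reference(np.array([1, 0, 0, 0]), 3))  # run ends at the last element
```

The second call covers an array that ends in zeros, an edge case the sample array above does not exercise.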

How to do <= and >= on sparse matrices?

Is it possible to do <= or >= operations on Scipy sparse matrices, such that the expression returns True if the operation is true for all corresponding elements? For example, a <= b means that for all corresponding elements (a, b) in matrices (A, B), a <= b? Here's an example to consider:
import numpy as np
from scipy.sparse import csr_matrix
np.random.seed(0)
mat = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(mat.A)
print()
np.random.seed(1)
matb = csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
print(matb.A)
Running this gives the warning: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead and gives the error: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I'd like to be able to take 2 sparse matrices, A and B, and determine if A <= B for each pair of corresponding elements (a, b) in (A, B). Is this possible? What would the performance of such an operation be?
In [402]: np.random.seed = 0
...: mat = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [403]: mat
Out[403]:
<10x12 sparse matrix of type '<class 'numpy.int64'>'
with 40 stored elements in Compressed Sparse Row format>
In [404]: mat.A
Out[404]:
array([[1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       ...
       [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]], dtype=int64)
In [405]: np.random.seed = 1
...: matb = sparse.csr_matrix(np.random.rand(10, 12)>0.7, dtype=int)
In [407]: mat<matb
Out[407]:
<10x12 sparse matrix of type '<class 'numpy.bool_'>'
with 27 stored elements in Compressed Sparse Row format>
In [408]: mat>=matb
/home/paul/.local/lib/python3.6/site-packages/scipy/sparse/compressed.py:295: SparseEfficiencyWarning: Comparing sparse matrices using >= and <= is inefficient, using <, >, or !=, instead.
"using <, >, or !=, instead.", SparseEfficiencyWarning)
Out[408]:
<10x12 sparse matrix of type '<class 'numpy.float64'>'
with 93 stored elements in Compressed Sparse Row format>
In your case, neither mat nor matb is particularly sparse: 40 and 36 nonzeros out of a possible 120. Even so, mat<matb results in 27 nonzero (True) values, while the >= test results in 93. Wherever both matrices are 0, the result of >= is True.
It's warning us that using sparse matrices isn't going to save us space or time (compared to dense arrays) if we do this kind of testing. It's not going to kill us, it just won't be as efficient.
(Pulling some comments together for this answer):
To simply do elementwise <= on two sparse matrices A and B, you can do (A <= B). However, as @hpaulj points out, this is inefficient because any pair of corresponding 0 elements (e.g. (1,1) is 0 in both A and B) will be turned into a 1 by this operation. Assuming both A and B are sparse (mostly 0s), you will destroy their sparsity by making the result mostly 1s.
To get around this, consider the following:
from scipy.sparse import csr_matrix
A = csr_matrix((3, 3))
A[1, 1] = 1
print(A.toarray())
print()
B = csr_matrix((3, 3))
B[0, 0] = 1
B[1, 1] = 2
print(B.toarray())
print(not (A > B).count_nonzero())
To explain that last line, A > B will do the opposite of A <= B, so corresponding 0s will remain 0, and any place where a > b will become a 1. Therefore, if the resulting matrix has any non-zero elements, it means that there is some (a, b) in (A, B) where a > b. This means that it is not the case that A <= B (elementwise).
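The same test can be wrapped in a small helper. This is a sketch; the function name all_less_equal is made up for illustration, and the matrices are built from dense arrays only to keep the example short.

```python
import numpy as np
from scipy.sparse import csr_matrix

def all_less_equal(A, B):
    """True if A[i, j] <= B[i, j] at every position."""
    # A > B stays sparse: positions where both are 0 compare as False,
    # so the result stores only the violations of A <= B.
    return (A > B).count_nonzero() == 0

A = csr_matrix(np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]]))
B = csr_matrix(np.array([[1, 0, 0], [0, 2, 0], [0, 0, 0]]))
print(all_less_equal(A, B))  # A <= B everywhere
print(all_less_equal(B, A))  # B exceeds A at (0, 0) and (1, 1)
```

Because only the strict comparison is evaluated, no SparseEfficiencyWarning is expected and the intermediate result is no denser than the inputs combined.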

Looping over large sparse array

Let's say I have a (sparse) matrix M size (N*N, N*N). I want to select elements from this matrix where the outer product of grid (a (n,m) array, where n*m=N) is True (it is a boolean 2D array, and na=grid.sum()). This can be done as follows
result = M[np.outer( grid.flatten(),grid.flatten() )].reshape (( N, N ) )
result is an (na,na) sparse array (and na < N). The previous line is what I want to achieve: get the elements of M that are true from the product of grid, and squeeze the ones that aren't true out of the array.
As n and m (and hence N) grow, and M and result are sparse matrices, I am not able to do this efficiently in terms of memory or speed. Closest I have tried is:
result = sp.lil_matrix ( (1, N*N), dtype=np.float32 )
# Calculate outer product
A = np.einsum("i,j", grid.flatten(), grid.flatten())
cntr = 0
it = np.nditer ( A, flags=['multi_index'] )
while not it.finished:
    if it[0]:
        result[0,cntr] = M[it.multi_index[0], it.multi_index[1]]
        cntr += 1
# reshape result to be a N*N sparse matrix
The last reshape could be done by this approach, but I haven't got there yet, as the while loop is taking forever.
I have also tried selecting nonzero elements of A too, and looping over but this eats up all of the memory:
A=np.einsum("i,j", grid.flatten(), grid.flatten())
nzero = A.nonzero() # This eats lots of memory
cntr = 0
for (i,j) in zip (*nzero):
    temp_mat[0,cntr] = M[i,j]
    cntr += 1
'n' and 'm' in the example above are around 300.
I don't know if it was a typo, or code error, but your example is missing an iternext:
R=[]
it = np.nditer ( A, flags=['multi_index'] )
while not it.finished:
    if it[0]:
        R.append(M[it.multi_index])
    it.iternext()
I think appending to a list is simpler and faster than result[0, cntr] = .... Assigning by index is competitive if the target is a regular array, but sparse indexing is slower (even with the fastest format, lil).
ndindex wraps this use of a nditer as:
R=[]
for index in np.ndindex(A.shape):
    if A[index]:
        R.append(M[index])
ndenumerate also works:
R = []
for index,a in np.ndenumerate(A):
    if a:
        R.append(M[index])
But I wonder if you really want to advance the cntr each it step, not just the True cases. Otherwise reshaping result to (N,N) doesn't make much sense. But in that case, isn't your problem just
M[:N, :N].multiply(A)
or if M was a dense array:
M[:N, :N]*A
In fact if both M and A are sparse, then the .data attribute of that multiply will be the same as the R list.
In [76]: N=4
In [77]: M=np.arange(N*N*N*N).reshape(N*N,N*N)
In [80]: a=np.array([0,1,0,1])
In [81]: A=np.einsum('i,j',a,a)
In [82]: A
Out[82]:
array([[0, 0, 0, 0],
       [0, 1, 0, 1],
       [0, 0, 0, 0],
       [0, 1, 0, 1]])
In [83]: M[:N, :N]*A
Out[83]:
array([[ 0,  0,  0,  0],
       [ 0, 17,  0, 19],
       [ 0,  0,  0,  0],
       [ 0, 49,  0, 51]])
In [84]: c=sparse.csr_matrix(M)[:N,:N].multiply(sparse.csr_matrix(A))
In [85]: c.data
Out[85]: array([17, 19, 49, 51], dtype=int32)
In [89]: [M[index] for index, a in np.ndenumerate(A) if a]
Out[89]: [17, 19, 49, 51]

Find Indices Of Columns Having Some Nonzero Element In A 2d array

I have a numpy array with dim (157,1944).
I want to get indices of columns that have a Nonzero element in any row.
example: [[0,0,3,4], [0,0,1,1]] ----> [2,3]
If you look at each row, there is a nonzero element in columns [2, 3].
So if I have
[[0,1,3,4], [0,0,1,1]]
I should get [1,2,3] because column index 0 has no Nonzero elements in any row.
Not sure if your question is completely defined. However, say we start with
import numpy as np
a = np.array([[0,0,3,4], [0,0,1,1]])
then
>>> np.nonzero(np.all(a != 0, axis=0))[0]
array([2, 3])
are the indices of the columns in which every row is nonzero (no zero entries at all), and
>>> np.nonzero(np.any(a != 0, axis=0))[0]
array([2, 3])
are the indices of the columns for which not all of the rows are zero (it happens to be the same for the example you gave).
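The two expressions differ on the question's second example: any picks the columns with at least one nonzero entry, while all picks the columns with no zero entries. A short sketch:

```python
import numpy as np

a = np.array([[0, 1, 3, 4],
              [0, 0, 1, 1]])

# Columns with at least one nonzero entry (what the question asks for)
any_cols = np.flatnonzero((a != 0).any(axis=0))
# Columns in which every entry is nonzero
all_cols = np.flatnonzero((a != 0).all(axis=0))

print(any_cols)  # [1 2 3]
print(all_cols)  # [2 3]
```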
