I am wondering about the best way to replace rows that do not satisfy a certain condition with zeros in a sparse matrix. For example (I use a plain array for illustration): I want to replace every row whose sum is greater than 10 with a row of zeros.
a = np.array([[0, 0, 0, 1, 1],
              [1, 2, 0, 0, 0],
              [6, 7, 4, 1, 0],  # sum > 10
              [0, 1, 1, 0, 1],
              [7, 3, 2, 2, 8],  # sum > 10
              [0, 1, 0, 1, 2]])
I want to replace a[2] and a[4] with zeros, so my output should look like this:
array([[0, 0, 0, 1, 1],
       [1, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 1, 0, 1, 2]])
This is fairly straightforward for dense matrices:
row_sum = a.sum(axis=1)
to_zero = row_sum > 10
a[to_zero] = np.zeros(a.shape[1])
However, when I try:
s = sparse.csr_matrix(a)
s[to_zero, :] = np.zeros(a.shape[1])
I get this error:
raise NotImplementedError("Fancy indexing in assignment not "
NotImplementedError: Fancy indexing in assignment not supported for csr matrices.
Hence, I need a different solution for sparse matrices. I came up with this:
def zero_out_unfit_rows(s_mat, limit_row_sum):
    row_sum = s_mat.sum(axis=1).T.A[0]
    to_keep = row_sum <= limit_row_sum
    to_keep = to_keep.astype('int8')
    temp_diag = get_sparse_diag_mat(to_keep)
    return temp_diag * s_mat

def get_sparse_diag_mat(my_diag):
    N = len(my_diag)
    my_diags = my_diag[np.newaxis, :]
    return sparse.dia_matrix((my_diags, [0]), shape=(N, N))
This relies on the fact that if we set the 2nd and 4th elements of the diagonal of the identity matrix to zero, the corresponding rows of the pre-multiplied matrix are set to zero.
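As a dense illustration of that identity (a minimal sketch using the example array a from above):
I = np.eye(6)
I[2, 2] = I[4, 4] = 0
# I.dot(a) equals a with rows 2 and 4 zeroed out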
However, I feel that there should be a better, more scipythonic, solution. Is there one?
Not sure if it is very scipythonic, but a lot of the operations on sparse matrices are better done by accessing the guts directly. For your case, I personally would do:
import numpy as np
import scipy.sparse as sps

a = np.array([[0, 0, 0, 1, 1],
              [1, 2, 0, 0, 0],
              [6, 7, 4, 1, 0],  # sum > 10
              [0, 1, 1, 0, 1],
              [7, 3, 2, 2, 8],  # sum > 10
              [0, 1, 0, 1, 2]])
sps_a = sps.csr_matrix(a)
# get sum of each row:
row_sum = np.add.reduceat(sps_a.data, sps_a.indptr[:-1])
# set values to zero
row_mask = row_sum > 10
nnz_per_row = np.diff(sps_a.indptr)
sps_a.data[np.repeat(row_mask, nnz_per_row)] = 0
# ask scipy.sparse to remove the zeroed entries
sps_a.eliminate_zeros()
>>> sps_a.toarray()
array([[0, 0, 0, 1, 1],
       [1, 2, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 0, 0],
       [0, 1, 0, 1, 2]])
>>> sps_a.nnz # it does remove the entries, not simply set them to zero
10
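For reference, a small sketch of the CSR layout this answer relies on: all stored values live in .data, with row i occupying data[indptr[i]:indptr[i+1]], so repeating each row's mask value nnz-per-row times lines the mask up with .data element for element. A tiny hypothetical matrix makes this visible:
demo = sps.csr_matrix(np.array([[0, 5, 0],
                                [7, 0, 8]]))
print(demo.data)             # [5 7 8]
print(demo.indptr)           # [0 1 3]: row 0 is data[0:1], row 1 is data[1:3]
print(np.diff(demo.indptr))  # [1 2]: number of nonzeros per row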
Related
I am trying to find a way to replace all of the duplicate 1s in each row with 0, keeping only the first. As an example:
[[0, 1, 0, 1, 0],
 [1, 0, 0, 1, 0],
 [1, 1, 1, 0, 1]]
Should become:
[[0, 1, 0, 0, 0],
 [1, 0, 0, 0, 0],
 [1, 0, 0, 0, 0]]
I found a similar problem, but the solution there does not seem to work for me: numpy: setting duplicate values in a row to 0
Assuming the array contains only zeros and ones, you can find the first 1 in each row using numpy.argmax (which returns the index of the first occurrence of the maximum) and then use advanced indexing to copy just those values into an array of zeros.
arr = np.array([[0, 1, 0, 1, 0],
                [1, 0, 0, 1, 0],
                [1, 1, 1, 0, 1]])

res = np.zeros_like(arr)
idx = (np.arange(len(res)), np.argmax(arr, axis=1))
res[idx] = arr[idx]
res
array([[0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]])
Try looping through each row of the grid
In each row, find all the 1s. In particular you want their indices (positions within the row). You can do this with a list comprehension and enumerate, which automatically gives an index for each element.
Then, still within that row, go through every 1 except for the first, and set it to zero.
grid = [[0, 1, 0, 1, 0], [1, 0, 0, 1, 0], [1, 1, 1, 0, 1]]

for row in grid:
    ones = [i for i, element in enumerate(row) if element == 1]
    for i in ones[1:]:
        row[i] = 0

print(grid)
Gives: [[0, 1, 0, 0, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
You can use cumsum:
(arr.cumsum(axis=1).cumsum(axis=1) == 1) * 1
this computes a double cumulative sum along each row; the first 1 in a row is the only position where that running total equals exactly 1, so comparing against 1 isolates the first 1s (the trailing * 1 converts the boolean mask back to integers)
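To make the intermediate steps visible (a quick sketch using arr from above; the outputs are computed for that array):
step1 = arr.cumsum(axis=1)    # running count of 1s per row
step2 = step1.cumsum(axis=1)  # equals 1 only at the first 1
step2
array([[ 0,  1,  2,  4,  6],
       [ 1,  2,  3,  5,  7],
       [ 1,  3,  6,  9, 13]])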
I would like to know the fastest way to extract the indices of the first n non-zero values per column in a 2D array.
For example, with the following array:
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would have xs = [0, 0, 1, 1, 2] and ys = [0, 3, 2, 5, 3]: two values in each of the first two columns and one in the third.
Here is how it is currently done:
x = []
y = []
n = 2
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i] * len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array (only a linear scan over the nonzero values is needed), although it is somewhat convoluted because it chains several functions:
n = 2
# get indices of the non-null values, column indices first
nnull = np.stack(np.where(arr.T != 0))
# split the positions by unique value of the column index
cols_ids = np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] + 1)
# take (at most) n in each group and concatenate the whole
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis=1)
outputs:
array([[0, 0, 1, 1, 2],
       [0, 3, 2, 5, 3]], dtype=int64)
Here is one approach using argsort; it returns the pairs in a different order, though:
n = 2
m = arr != 0
# non-zero values first
idx = np.argsort(~m, axis=0)
# get the first n rows of the sorted mask and keep only the non-zero ones
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y, x = np.where(m2)
# slice
x, idx[y, x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
Compare the column indices returned by nonzero() on the transposed array against themselves shifted by n. Since these indices come out sorted, an entry belongs to the first n of its column exactly when it differs from the entry n positions earlier:
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
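Spelled out on the example (i and mask computed from arr above): column 0 contributes five i values of 0, so from the third one onward i[k] == i[k-2] and the mask drops them.
>>> i
array([0, 0, 0, 0, 0, 1, 1, 1, 2], dtype=int64)
>>> mask
array([ True,  True, False, False, False,  True, False, False,  True])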
From an array like db (which will have a shape of approximately (1e6, 300)) and a mask = [1, 0, 1] vector, I define the target as a 1 in the first column.
I want to create an out vector that consists of ones where the corresponding row in db matches the mask and target==1, and zeros everywhere else.
db = np.array([      # out for mask = [1, 0, 1]
    # target, vector
    [1, 1, 0, 1],    # 1
    [0, 1, 1, 1],    # 0 (fits the mask, but target == 0)
    [0, 0, 1, 0],    # 0
    [1, 1, 0, 1],    # 1
    [0, 1, 1, 0],    # 0
    [1, 0, 0, 0],    # 0
])
I have defined a vline function that applies a mask to each array line using np.array_equal(mask, mask & vector) to check that vectors 101 and 111 fit the mask, then retains only the indices where target == 1.
out is initialized to a list of zeros:
out = [0, 0, 0, 0, 0, 0]
The vline function is defined as:
def vline(idx, mask):
    line = db[idx]
    target, vector = line[0], line[1:]
    if np.array_equal(mask, mask & vector):
        if target == 1:
            out[idx] = 1
I get the correct result by applying this function line-by-line in a for loop:
def check_mask(db, out, mask=[1, 0, 1]):
    # idx iterates over the rows of db without enumerate
    for idx in np.arange(db.shape[0]):
        vline(idx, mask=mask)
    return out
assert check_mask(db, out, [1, 0, 1]) == [1, 0, 0, 1, 0, 0] # it works !
Now I want to vectorize vline by creating a ufunc:
ufunc_vline = np.frompyfunc(vline, 2, 1)
out = [0, 0, 0, 0, 0, 0]
ufunc_vline(db, [1, 0, 1])
print(out)
But the ufunc complains about broadcasting inputs with those shapes:
In [217]: ufunc_vline(db, [1, 0, 1])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-217-9008ebeb6aa1> in <module>()
----> 1 ufunc_vline(db, [1, 0, 1])
ValueError: operands could not be broadcast together with shapes (6,4) (3,)
Converting vline to a numpy ufunc fundamentally doesn't make sense, since ufuncs are always applied to numpy arrays in an elementwise fashion. Because of this, the input arguments must either have the same shape, or must be broadcastable to the same shape. You are passing two arrays with incompatible shapes to your ufunc_vline function (db.shape == (6, 4) and mask.shape == (3,)), hence the ValueError you are seeing.
There are a couple of other issues with ufunc_vline:
np.frompyfunc(vline, 2, 1) specifies that vline should return a single output argument, whereas vline actually returns nothing (but modifies out in place).
You are passing db as the first argument to ufunc_vline, whereas vline expects the first argument to be idx, which is used as an index into the rows of db.
Also, bear in mind that creating a ufunc from a Python function using np.frompyfunc will not yield any noticeable performance benefit over a standard Python for loop. To see any serious improvement you would probably need to code the ufunc in a low-level language such as C (see this example in the documentation).
Having said that, your vline function can be easily vectorized using standard boolean array operations:
def vline_vectorized(db, mask):
    return db[:, 0] & np.all((mask & db[:, 1:]) == mask, axis=1)
For example:
db = np.array([      # out for mask = [1, 0, 1]
    # target, vector
    [1, 1, 0, 1],    # 1
    [0, 1, 1, 1],    # 0 (fits the mask, but target == 0)
    [0, 0, 1, 0],    # 0
    [1, 1, 0, 1],    # 1
    [0, 1, 1, 0],    # 0
    [1, 0, 0, 0],    # 0
])
mask = np.array([1, 0, 1])
print(repr(vline_vectorized(db, mask)))
# array([1, 0, 0, 1, 0, 0])
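To connect this back to the original vline: np.all((mask & db[:, 1:]) == mask, axis=1) is the row-wise equivalent of np.array_equal(mask, mask & vector), and the leading db[:, 0] & applies the target == 1 condition. The intermediate mask-fit vector (computed for the db above) is:
>>> np.all((mask & db[:, 1:]) == mask, axis=1)
array([ True,  True, False,  True, False, False])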
I am trying to turn a second-order tensor into a binary third-order tensor. Given a second-order tensor as an m x n numpy array A, I need to take each element value x in A and replace it with a vector v, with dimensions equal to the maximum value of A, but with a value of 1 at the index of v corresponding to the value x (i.e. v[x] = 1).
I have been following this question: Increment given indices in a matrix, which addresses producing an array with increments at indices given by 2-dimensional coordinates. I have been reading the answers and trying to use np.ravel_multi_index() and np.bincount() to do the same with 3-dimensional coordinates, but I keep getting a ValueError: "invalid entry in coordinates array". This is what I have been using:
def expand_to_tensor_3(array):
    (x, y) = array.shape
    (a, b) = np.indices((x, y))
    a = a.reshape(x*y)
    b = b.reshape(x*y)
    tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
                                                (x, y, np.amax(array))))
    return tensor_3
If you know what is wrong here or know an even better method to accomplish my goal, both would be really helpful, thanks.
You can use (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int).
Here's a demonstration:
In [52]: A
Out[52]:
array([[2, 0, 0, 2],
       [3, 1, 2, 3],
       [3, 2, 1, 0]])
In [53]: B = (A[:,:,np.newaxis] == np.arange(A.max()+1)).astype(int)
In [54]: B
Out[54]:
array([[[0, 0, 1, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 0]],

       [[0, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]],

       [[0, 0, 0, 1],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [1, 0, 0, 0]]])
Check a few individual elements of A:
In [55]: A[0,0]
Out[55]: 2
In [56]: B[0,0,:]
Out[56]: array([0, 0, 1, 0])
In [57]: A[1,3]
Out[57]: 3
In [58]: B[1,3,:]
Out[58]: array([0, 0, 0, 1])
The expression A[:,:,np.newaxis] == np.arange(A.max()+1) uses broadcasting to compare each element of A to np.arange(A.max()+1). For a single value, this looks like:
In [63]: 3 == np.arange(A.max()+1)
Out[63]: array([False, False, False, True], dtype=bool)
In [64]: (3 == np.arange(A.max()+1)).astype(int)
Out[64]: array([0, 0, 0, 1])
A[:,:,np.newaxis] is a three-dimensional view of A with shape (3,4,1). The extra dimension is added so that the comparison to np.arange(A.max()+1) will broadcast to each element, giving a result with shape (3, 4, A.max()+1).
With a trivial change, this will work for an n-dimensional array. Indexing a numpy array with the ellipsis ... means "all the other dimensions". So
(A[..., np.newaxis] == np.arange(A.max()+1)).astype(int)
converts an n-dimensional array to an (n+1)-dimensional array, where the last dimension is the binary indicator of the integer in A. Here's an example with a one-dimensional array:
In [6]: a = np.array([3, 4, 0, 1])
In [7]: (a[...,np.newaxis] == np.arange(a.max()+1)).astype(int)
Out[7]:
array([[0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0]])
You can make it work this way:
tensor_3 = np.bincount(np.ravel_multi_index((a, b, array.reshape(x*y)),
                                            (x, y, np.amax(array) + 1)))
The difference is that I add 1 to the amax() result, because ravel_multi_index() expects that the indexes are all strictly less than the dimensions, not less-or-equal.
I'm not 100% sure if this is what you wanted; another way to make the code run is to specify mode='clip' or mode='wrap' in ravel_multi_index(), which does something a bit different and I'm guessing is less correct. But you can try it.
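A further note, not from the answer above: np.bincount returns a flat count vector, so to obtain the actual third-order tensor you would still reshape it; passing minlength guards against trailing indices that never occur being dropped. A minimal sketch of the fully fixed function:
def expand_to_tensor_3(array):
    (x, y) = array.shape
    d = np.amax(array) + 1  # indices must be strictly less than the dimension
    (a, b) = np.indices((x, y))
    flat = np.ravel_multi_index((a.reshape(x*y), b.reshape(x*y), array.reshape(x*y)),
                                (x, y, d))
    return np.bincount(flat, minlength=x*y*d).reshape(x, y, d)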
I have a sparse matrix (numpy.array) and I would like to have the index of the nonzero elements in it.
In Matlab I would write:
[i, j] = find(CM)
and in Python what should I do?
I have tried numpy.nonzero (but I don't know how to take the indices from that) and flatnonzero (but it's not convenient for me, I need both the row and column index).
Thanks in advance!
Assuming that by "sparse matrix" you don't actually mean a scipy.sparse matrix, but merely a numpy.ndarray with relatively few nonzero entries, then I think nonzero is exactly what you're looking for. Starting from an array:
>>> a = (np.random.random((5,5)) < 0.10)*1
>>> a
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
nonzero returns the indices along each axis (here row and column) where the nonzero entries live:
>>> a.nonzero()
(array([1, 2, 3]), array([4, 2, 0]))
We can assign these to i and j:
>>> i, j = a.nonzero()
We can also use them to index back into a, which should give us only 1s:
>>> a[i,j]
array([1, 1, 1])
We can even modify a using these indices:
>>> a[i,j] = 2
>>> a
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2],
       [0, 0, 2, 0, 0],
       [2, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])
If you want a combined array from the indices, you can do that too:
>>> np.array(a.nonzero()).T
array([[1, 4],
       [2, 2],
       [3, 0]])
(there are lots of ways to do this reshaping; I chose one almost at random.)
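As an aside not in the original answer: np.argwhere(a), which is documented as equivalent to np.transpose(np.nonzero(a)), produces the same row-column pairs directly:
>>> np.argwhere(a)
array([[1, 4],
       [2, 2],
       [3, 0]])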
This goes slightly beyond what you ask, and I only mention it since I once faced a similar problem. If you want to use the indices to access some other array, there is some very simple syntax:
import numpy as np
array = np.random.randint(0, 2, size=(3, 3))
data = np.random.random(size=(3, 3))
Now array looks something like
>>> array
array([[0, 1, 0],
       [1, 0, 1],
       [1, 1, 0]])
while data could be
>>> data
array([[ 0.92824816,  0.43605604,  0.16627849],
       [ 0.00301434,  0.94342538,  0.95297402],
       [ 0.32665135,  0.03504204,  0.86902492]])
Then if we want the elements of data at the positions where array is zero:
>>> data[array == 0]
array([ 0.92824816,  0.16627849,  0.94342538,  0.86902492])
Which is nice and simple.