I would like to be able to quickly instantiate a matrix where the first few (variable number of) cells in a row are 0, and the rest are ones.
Imagine we want a 4x3 matrix.
I have instantiated the matrix first as all ones:
ones = np.ones([4,3])
Then imagine we have an array that announces how many leading zeros there are:
arr = np.array([2,1,3,0]) # first row has 2 zeroes, second row 1 zero, etc
Required result:
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
Obviously this can be done in the opposite way as well, but I'd consider the approach where 1 is a default value, and zeros would be replaced.
What would be the best way to avoid some silly loop?
Here's one way. n is the number of columns in the result. The number of rows is determined by len(arr).
In [29]: n = 5
In [30]: arr = np.array([1, 2, 3, 0, 3])
In [31]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[31]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
There are two parts to the explanation of how this works. First, how to create a row with m zeros and n-m ones? For that, we use np.arange to create a row with values [0, 1, ..., n-1]:
In [35]: n
Out[35]: 5
In [36]: np.arange(n)
Out[36]: array([0, 1, 2, 3, 4])
Next, compare that array to m:
In [37]: m = 2
In [38]: np.arange(n) >= m
Out[38]: array([False, False, True, True, True], dtype=bool)
That gives an array of boolean values; the first m values are False and the rest are True. By casting those values to integers, we get an array of 0s and 1s:
In [39]: (np.arange(n) >= m).astype(int)
Out[39]: array([0, 0, 1, 1, 1])
To perform this over an array of m values (your arr), we use broadcasting; this is the second key idea of the explanation.
Note what arr[:, np.newaxis] gives:
In [40]: arr
Out[40]: array([1, 2, 3, 0, 3])
In [41]: arr[:, np.newaxis]
Out[41]:
array([[1],
[2],
[3],
[0],
[3]])
That is, arr[:, np.newaxis] reshapes arr into a 2-d array with shape (5, 1). (arr.reshape(-1, 1) could have been used instead.) Now when we compare this to np.arange(n) (a 1-d array with length n), broadcasting kicks in:
In [42]: np.arange(n) >= arr[:, np.newaxis]
Out[42]:
array([[False, True, True, True, True],
[False, False, True, True, True],
[False, False, False, True, True],
[ True, True, True, True, True],
[False, False, False, True, True]], dtype=bool)
As @RogerFan points out in his comment, this is basically an outer product of the arguments, using the >= operation.
A final cast to type int gives the desired result:
In [43]: (np.arange(n) >= arr[:, np.newaxis]).astype(int)
Out[43]:
array([[0, 1, 1, 1, 1],
[0, 0, 1, 1, 1],
[0, 0, 0, 1, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 1, 1]])
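The broadcasting steps above can be wrapped into a small helper. This is only a sketch: the function name leading_zeros_mask and its parameter names are made up for illustration, and it is applied here to the 4x3 example from the question.

```python
import numpy as np

def leading_zeros_mask(zero_counts, n_cols):
    """Return a 0/1 matrix whose row i starts with zero_counts[i] zeros.

    Hypothetical helper; the name and signature are illustrative only.
    """
    zero_counts = np.asarray(zero_counts)
    # Column indices 0..n_cols-1 are broadcast against a (rows, 1) column
    # of counts, giving a (rows, n_cols) boolean array.
    return (np.arange(n_cols) >= zero_counts[:, np.newaxis]).astype(int)

# The example from the question: 4 rows, 3 columns.
result = leading_zeros_mask([2, 1, 3, 0], 3)
print(result)
```

A count equal to or larger than n_cols simply produces a row of zeros, since no column index reaches it.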
Not as concise as I wanted (I was experimenting with mask_indices), but this will also do the work:
>>> n = 3
>>> zeros = [2, 1, 3, 0]
>>> numpy.array([[0] * zeros[i] + [1]*(n - zeros[i]) for i in range(len(zeros))])
array([[0, 0, 1],
[0, 1, 1],
[0, 0, 0],
[1, 1, 1]])
>>>
It works very simply: the one-element lists [0] and [1] are each repeated the required number of times and concatenated, building the array row by row.
I have a matrix like this:
A = sp.csr_matrix(np.array(
[[1, 1, 2, 1],
[0, 0, 2, 0],
[1, 4, 1, 1],
[0, 1, 0, 0]]))
I want to get all the rows where all columns are nonzero, so I can then get their sum. Either as an array:
rows = [True, False, True, False]
result = A[rows].sum()
Or as indices:
rows = [0, 2]
result = A[rows].sum()
I am stuck however at the first part, figuring out which rows to include in the sum, as most results seem to be looking for the opposite (rows where all columns are zero).
In [35]: from scipy import sparse
In [36]: A = sparse.csr_matrix(np.array(
...: [[1, 1, 2, 1],
...: [0, 0, 2, 0],
...: [1, 4, 1, 1],
...: [0, 1, 0, 0]]))
In [37]: A
Out[37]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
Sparse matrices don't implement all/any-style operations, because those operations treat 0s as significant values.
all on the dense equivalent works nicely:
In [41]: A.A.all(axis=1)
Out[41]: array([ True, False, True, False])
On the sparse one we can turn the dtype to boolean, and sum along the axis. And then test it for the full value:
In [42]: A.astype(bool).sum(axis=1)
Out[42]:
matrix([[4],
[1],
[4],
[1]])
In [43]: A.astype(bool).sum(axis=1).A1==4
Out[43]: array([ True, False, True, False])
Notice that the sparse sum returns a np.matrix. I used A1 to turn that into a 1d array.
If the matrix isn't too large, working with the dense array may be faster. Sparse operations like sum are actually performed with matrix multiplication.
In [51]: A.astype(bool)@np.ones(4,int)
Out[51]: array([4, 1, 4, 1])
Or we could convert it to lil format, and look at the length of the 'rows':
In [67]: A.tolil().data
Out[67]:
array([list([1, 1, 2, 1]), list([2]), list([1, 4, 1, 1]), list([1])],
dtype=object)
In [68]: [len(i) for i in A.tolil().data]
Out[68]: [4, 1, 4, 1]
But wait, there's more. The indptr attribute of the csr is:
In [69]: A.indptr
Out[69]: array([ 0, 4, 5, 9, 10], dtype=int32)
In [70]: np.diff(A.indptr)
Out[70]: array([4, 1, 4, 1], dtype=int32)
I've omitted some test timings, but this last is clearly the fastest!
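Putting the indptr idea together with the sum asked for in the question might look like the sketch below. It assumes that every stored entry is really nonzero, which is why eliminate_zeros() is called first; the variable names stored_per_row and full_rows are illustrative.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array(
    [[1, 1, 2, 1],
     [0, 0, 2, 0],
     [1, 4, 1, 1],
     [0, 1, 0, 0]]))

# Assumption: np.diff(A.indptr) counts *stored* entries, so explicitly
# stored zeros would be miscounted. Remove them (in place) to be safe.
A.eliminate_zeros()

# Stored entries per row; a row has no zeros when the count equals
# the number of columns.
stored_per_row = np.diff(A.indptr)
full_rows = np.flatnonzero(stored_per_row == A.shape[1])

# Sum over the selected rows only.
result = A[full_rows].sum()
print(full_rows, result)
```

Integer row indices are used for the final selection here, matching the second form shown in the question.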
It is a bit easier to do for numpy arrays than for sparse ones. If you do not mind converting to numpy as an intermediate step, you can get the right rows via
(A.toarray() != 0).all(axis=1)
to produce
array([ True, False, True, False])
and then use it in indexing A as such:
A[(A.toarray() != 0).all(axis=1),:].sum()
returns 12
Given list like indice = [1, 0, 2] and dimension m = 3, I want to get the mask array like this
>>> import numpy as np
>>> mask_array = np.array([ [1, 1, 0], [1, 0, 0], [1, 1, 1] ])
>>> mask_array
[[1, 1, 0],
[1, 0, 0],
[1, 1, 1]]
With m = 3, mask_array has 3 columns (axis 1), and its number of rows equals the length of indice.
To convert indice to mask_array, the rule is: in row i, set to 1 every item whose column index is less than or equal to indice[i]. For example, indice[0] = 1, so the first row is [1, 1, 0], given that the dimension is 3.
In NumPy, are there any APIs which can be used to do this?
Sure, just use broadcasting with arange(m). Make sure to use an np.array for the indices, not a list:
>>> indice = [1, 0, 2]
>>> m = 3
>>> np.arange(m) <= np.array(indice)[..., None]
array([[ True, True, False],
[ True, False, False],
[ True, True, True]])
Note, the [..., None] just reshapes the indices array so that the broadcasting works like we want, like this:
>>> indices = np.array(indice)
>>> indices
array([1, 0, 2])
>>> indices[...,None]
array([[1],
[0],
[2]])
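As a self-contained sketch of the same broadcasting comparison, with a cast to integers so the result matches the 0/1 mask_array from the question (the function name prefix_mask is hypothetical):

```python
import numpy as np

def prefix_mask(last_index, m):
    """Row i is True for columns 0..last_index[i], False afterwards.

    Hypothetical helper; the name and signature are illustrative only.
    """
    last_index = np.asarray(last_index)  # accepts a plain list as well
    # (m,) compared against (rows, 1) broadcasts to (rows, m).
    return np.arange(m) <= last_index[:, None]

mask_array = prefix_mask([1, 0, 2], 3).astype(int)
print(mask_array)
```

Leaving out the final astype(int) keeps the boolean array, which is usually what indexing or np.where want anyway.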
I'm looking for a vectorized way to change the array values above the first non-zero element in each column.
for x in range(array.shape[1]):
    for y in range(array.shape[0]):
        if array[y,x]>0:
            break
        else:
            array[y,x]=255
As you wrote about an array (not a DataFrame), I assume that you have
a Numpy array and want to use Numpy methods.
To do your task, run the following code:
np.where(np.cumsum(np.not_equal(array, 0), axis=0), array, 255)
Example and explanation of steps:
The source array:
array([[0, 1, 0],
[0, 0, 1],
[1, 1, 0],
[1, 0, 0]])
np.not_equal(array, 0) computes a boolean array with True for
elements != 0:
array([[False, True, False],
[False, False, True],
[ True, True, False],
[ True, False, False]])
np.cumsum(..., axis=0) computes cumulative sum (True counted as 1)
along axis 0 (in columns):
array([[0, 1, 0],
[0, 1, 1],
[1, 2, 1],
[2, 2, 1]], dtype=int32)
The above array is a mask used in where. For masked values (where
the corresponding element of the mask is True (actually, != 0)),
take values from corresponding elements of array, otherwise take 255:
np.where(..., array, 255)
The result (for my array) is:
array([[255, 1, 255],
[255, 0, 1],
[ 1, 1, 0],
[ 1, 0, 0]])
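The steps above, run end to end on the same example array, could look like this sketch. The intermediate name seen_nonzero is illustrative. Like the answer, it tests != 0, whereas the loop in the question tests > 0, so the two differ if the array contains negative values.

```python
import numpy as np

array = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 1, 0],
                  [1, 0, 0]])

# True from the first non-zero element of each column downwards:
# the running count of non-zeros along axis 0 becomes positive there.
seen_nonzero = np.cumsum(array != 0, axis=0) > 0

# Keep original values where a non-zero has been seen, otherwise use 255.
result = np.where(seen_nonzero, array, 255)
print(result)
```

np.where returns a new array, so the input is left unchanged; assign the result back if an in-place update is wanted.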
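Use masking (note that this replaces every zero in the array, not only the zeros above the first non-zero element of each column):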
array[array == 0] = 255
I'm trying to extract columns of a scipy sparse column matrix, but the result is not stored as I'd expect. Here's what I mean:
In [77]: a = scipy.sparse.csc_matrix(np.ones([4, 5]))
In [78]: ind = np.array([True, True, False, False, False])
In [79]: b = a[:, ind]
In [80]: b.indices
Out[80]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
In [81]: a.indices
Out[81]: array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
How come b.indices is not [0, 1, 2, 3, 0, 1, 2, 3] ?
And since this behaviour is not the one I expect, is a[:, ind] not the correct way to extract columns from a csc matrix?
The indices are not sorted. You can either force sorted output by reversing a's rows when indexing, which is not that intuitive, or enforce sorted indices explicitly (you can also do this in place, but I prefer making a copy). What I find funny is that the has_sorted_indices attribute does not always return a boolean; sometimes it returns the integer representation instead.
a = scipy.sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])
b = a[::-1, ind]
b2 = a[:, ind]
b3 = b2.sorted_indices()
b.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b.has_sorted_indices
>>1
b2.indices
>>array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
b2.has_sorted_indices
>>0
b3.indices
>>array([0, 1, 2, 3, 0, 1, 2, 3], dtype=int32)
b3.has_sorted_indices
>>True
csc and csr indices are not guaranteed to be sorted. I can't offhand find documentation to that effect, but the has_sorted_indices attribute and the sort methods suggest that.
In your case the order is the result of how the indexing is done. I found in previous SO questions, that multicolumn indexing is performed with a matrix multiplication:
In [165]: a = sparse.csc_matrix(np.ones([4,5]))
In [166]: b = a[:,[0,1]]
In [167]: b.indices
Out[167]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
This indexing is the equivalent to constructing a 'selection' matrix:
In [169]: I = sparse.csr_matrix(np.array([[1,0,0,0,0],[0,1,0,0,0]]).T)
In [171]: I.A
Out[171]:
array([[1, 0],
[0, 1],
[0, 0],
[0, 0],
[0, 0]], dtype=int32)
and doing this matrix multiplication:
In [172]: b1 = a * I
In [173]: b1.indices
Out[173]: array([3, 2, 1, 0, 3, 2, 1, 0], dtype=int32)
The order is the result of how the matrix multiplication was done. In fact a * a.T does the same reversal. We'd have to examine the multiplication code to know exactly why. Evidently the csc and csr calculation code doesn't require sorted indices, and doesn't bother to ensure the results are sorted.
https://docs.scipy.org/doc/scipy-0.19.1/reference/sparse.html#further-details
Further Details
CSR column indices are not necessarily sorted. Likewise for CSC row indices. Use the .sorted_indices() and .sort_indices() methods when sorted indices are required (e.g. when passing data to other libraries).
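A short sketch of the two methods named in that documentation excerpt, applied to the matrix from the question. The helper columns_are_sorted is hypothetical and only there to check the result. Whether the indices of b come out unsorted right after indexing may depend on the SciPy version, so the sketch does not rely on it.

```python
import numpy as np
from scipy import sparse

a = sparse.csc_matrix(np.ones([4, 5]))
ind = np.array([True, True, False, False, False])

b = a[:, ind]

# sorted_indices() returns a copy with sorted row indices per column;
# sort_indices() sorts the matrix itself, in place.
b_sorted = b.sorted_indices()
b.sort_indices()

def columns_are_sorted(mat):
    """Check that row indices increase within each column of a CSC matrix.

    Hypothetical helper, for illustration only.
    """
    return all(
        np.all(np.diff(mat.indices[start:stop]) > 0)
        for start, stop in zip(mat.indptr[:-1], mat.indptr[1:])
    )

print(b_sorted.indices, columns_are_sorted(b_sorted), columns_are_sorted(b))
```

Either way, a[:, ind] selects the right columns; sorting only matters when other code expects ordered indices.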
How to remove rows while iterating in numpy, as Java does:
Iterator<Message> itMsg = messages.iterator();
while (itMsg.hasNext()) {
    Message m = itMsg.next();
    if (m != null) {
        itMsg.remove();
        continue;
    }
}
Here is my pseudo code. Remove the rows whose entries are all 0 or all 1 while iterating.
#! /usr/bin/env python
import numpy as np
M = np.array(
    [
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0], #remove this row whose entries are all 0
        [1, 1, 1, 1] #remove this row whose entries are all 1
    ])
it = np.nditer(M, order="K", op_flags=['readwrite'])
while not it.finished :
    row = it.next() #how to get a row?
    sumRow = np.sum(row)
    if sumRow==4 or sumRow==0 : #remove rows whose entries are all 0 and 1 as well
        #M = np.delete(M, row, axis =0)
        it.remove_axis(i) #how to get i?
Writing good numpy code requires you to think in a vectorized fashion. Not every problem has a good vectorization, but for those that do, you can write clean and fast code pretty easily. In this case, we can decide on what rows we want to remove/keep and then use that to index into your array:
>>> M
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 0],
[1, 1, 1, 1]])
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
Step by step, we can compare M to something to make a boolean array:
>>> M == 0
array([[ True, False, True, True],
[ True, True, False, True],
[ True, True, True, True],
[False, False, False, False]], dtype=bool)
We can use all to see if a row or column is all true:
>>> (M == 0).all(1)
array([False, False, True, False], dtype=bool)
We can use | to do an or operation:
>>> (M == 0).all(1) | (M == 1).all(1)
array([False, False, True, True], dtype=bool)
We can use this to select rows:
>>> M[(M == 0).all(1) | (M == 1).all(1)]
array([[0, 0, 0, 0],
[1, 1, 1, 1]])
But since these are the rows we want to throw away, we can use ~ (NOT) to flip False and True:
>>> M[~((M == 0).all(1) | (M == 1).all(1))]
array([[0, 1, 0, 0],
[0, 0, 1, 0]])
If instead we wanted to keep columns which weren't all 1 or all 0, we simply need to change what axis we're working on:
>>> M
array([[1, 1, 0, 1],
[1, 0, 1, 1],
[1, 0, 0, 1],
[1, 1, 1, 1]])
>>> M[:, ~((M == 0).all(axis=0) | (M == 1).all(axis=0))]
array([[1, 0],
[0, 1],
[0, 0],
[1, 1]])
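For reuse, the row-filtering expression can be wrapped in a function. A minimal sketch, assuming the hypothetical name drop_constant_rows and a values parameter listing the constants to test for:

```python
import numpy as np

def drop_constant_rows(M, values=(0, 1)):
    """Return M without the rows that consist entirely of one of `values`.

    Hypothetical helper; the name and signature are illustrative only.
    """
    M = np.asarray(M)
    constant = np.zeros(M.shape[0], dtype=bool)
    for v in values:
        # Mark rows in which every entry equals v.
        constant |= (M == v).all(axis=1)
    # Boolean indexing keeps the rows that were not marked.
    return M[~constant]

M = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 0],
              [1, 1, 1, 1]])
kept = drop_constant_rows(M)
print(kept)
```

The result is a new array; nothing is deleted while iterating, which is the point of the vectorized approach.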