Numpy: sums 1D array indexed by rows of 2D boolean array - python

Assume that I have an (m,)-array a and an (n, m)-array b of booleans. For each row b[i] of b, I want to take np.sum(a, where=b[i]), which should return an (n,)-array. I could do the following:
a = np.array([1,2,3])
b = np.array([[True, False, False], [True, True, False], [False, True, True]])
c = np.array([np.sum(a, where=r) for r in b])
# c is [1,3,5]
but this seems quite unelegant to me. I would have hoped that broadcasting magic makes something like
c = np.sum(a, where=b)
# c is 9
work, but apparently, np.sum then sums over the rows of b, which I do not want. Is there a numpy-inherent way of achieving the desired behavour with np.sum (or any ufunc.reduce)?

How about:
a = np.array([1,2,3])
b = np.array([[True, False, False], [True, True, False], [False, True, True]])
c = np.sum(a*b, axis = 1)
output:
array([1, 3, 5])

Related

How to return a numpy array of the indices of the first element in each row of a numpy array with a given value?

Given a numpy array of shape (2, 4):
input = np.array([[False, True, False, True], [False, False, True, True]])
I want to return an array of shape (N,) where each element of the array is the index of the first True value:
expected = np.array([1, 2])
Is there an easy way to do this using numpy functions and without resorting to standard loops?
np.max with axis finds the max along the dimension; argmax finds the first max index:
In [42]: arr = np.array([[False, True, False, True], [False, False, True, True]])
In [43]: np.argmax(arr, axis=1)
Out[43]: array([1, 2])
This worked for me:
nonzeros = np.nonzero(input)
u, indices = np.unique(nonzeros[0], return_index=True)
expected = nonzeros[1][indices]

Python/Numpy: Combining boolean masks by row in grouped columns in multidimensional array

I have a 3D boolean array (a 2D numpy array of boolean mask arrays) with r rows and c cols. In the example below the array shape is (3, 6, 2); 3 rows and 6 columns, where each column contains 2 elements.
maskArr = np.array([
[[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
[[False, True], [False, True], [True, True], [False, True], [True, True], [True, True]],
[[True, False], [True, True], [True, True], [True, True], [True, True], [True, True]],
])
# If n=2: |<- AND these 2 cols ->|<- AND these 2 cols ->|<- AND these 2 cols ->|
# If n=3: |<----- AND these 3 cols ----->|<----- AND these 3 cols ----->|
I know I can use np.all(maskArr, axis=1) to and together all the mask arrays in each row as in previous answer, but instead I would like to and together the boolean arrays in each row in increments of n columns.
So if we start with 6 columns, as above, and n=2, I would like to apply the equivalent of np.all on every 2 columns for an end result of 3 columns, where:
The first column of the result array equals the rows of the first (2) columns of the original array ANDed together - result[:,0] = np.all(maskArr[:,0:1], axis=1)
The second column of the result array equals the rows of the second (2) columns of the original array ANDed together. - result[:,1] = np.all(maskArr[:,2:3], axis=1)
And the third column of the result array equals the rows of the last (2) columns of the original array ANDed together. - result[:,2] = np.all(maskArr[:,4:5], axis=1)
Is there a way to use np.all (or another vectorized approach) to get this result?
Expected result with n=2:
>>> np.array([
[[True, False], [True, True], [True, True]],
[[False, True], [False, True], [True, True]],
[[True, False], [True, True], [True, True]],
])
Note: The array I'm working with is extremely large so I'm looking for a vectorized approach to minimize performance impact. The actual boolean arrays can be thousands of elements long.
I've tried:
n = 2
c = len(maskArr[0]) ## c = 6 (number of columns)
nResultColumns = int(c / n) ## nResultColumns = 3
combinedMaskArr = [np.all(maskArr[:,i*n:i*n+n], axis=1) for i in range(nResultColumns)]
which gives me:
>>> [
array([[True, False], [False, True], [True, False]]),
array([[True, True], [False, True], [True, True]]),
array([[True, True], [True, True], [True, True]])
]
The output above is not the expected format or values.
Any guidance or suggestions on how to get to the expected result?
Thank you in advance.
The following works, if I understood your problem correctly.
n = 2
cols = mask_arr.shape[1]
chunks = math.ceil(cols / n)
groups = np.array_split(np.swapaxes(mask_arr, 0, 1), chunks)
combined = np.array([np.all(g, axis=0) for g in groups])
result = np.swapaxes(combined, 0, 1)
If cols is divisible by n, I think this works:
n = 2
rows, cols = mask_arr.shape[0:2]
result = np.all(mask_arr.reshape(rows, cols // n, n, -1), axis=2)

Using entrywise sum of boolean arrays as inclusive `or`

I would like to compare many m-by-n boolean numpy arrays and get an array of the same shape whose entries are True if the corresponding entry in at least one of the inputs is True.
The easiest way I've found to do this is:
In [5]: import numpy as np
In [6]: a = np.array([True, False, True])
In [7]: b = np.array([True, True, False])
In [8]: a + b
Out[8]: array([ True, True, True])
But I can also use
In [11]: np.stack([a, b]).sum(axis=0) > 0
Out[11]: array([ True, True, True])
Are these equivalent operations? Are there any gotchas I should be aware of? Is one method preferable to the other?
You can use np.logical_or
a = np.array([True, False, True])
b = np.array([True, True, False])
np.logical_or(a,b)
it also works for (m,n) arrays
a = np.random.rand(3,4) < 0.5
b = np.random.rand(3,4) < 0.5
print('a\n',a)
print('b\n',b)
np.logical_or(a,b)

numpy where operation on 2D array

I have a numpy array 'A' of size 571x24 and I am trying to find the index of zeros in it so I do:
>>>A.shape
(571L, 24L)
import numpy as np
z1 = np.where(A==0)
z1 is a tuple with following size:
>>> len(z1)
2
>>> len(z1[0])
29
>>> len(z1[1])
29
I was hoping to create a z1 of same size as A. How do I achieve that?
Edit: I want to create array z1 of booleans for presence of zero in A such that:
>>>z1.shape
(571L, 24L)
You can just check this with the equality operator in python with numpy. Example:
>>> A = np.array([[0,2,2,1],[2,0,0,3]])
>>> A == 0
array([[ True, False, False, False],
[False, True, True, False]], dtype=bool)
np.where() does something else, see documentation. Although, it is possible to achieve this with np.where() using broadcasting. See documentation.
>>> np.where(A == 0, True, False)
array([[ True, False, False, False],
[False, True, True, False]], dtype=bool)
Try this:
import numpy as np
myarray = np.array([[0,3,4,5],[9,4,0,4],[1,2,3,4]])
ix = np.in1d(myarray.ravel(), 0).reshape(myarray.shape)
Output of ix:
array([[ True, False, False, False],
[False, False, True, False],
[False, False, False, False]], dtype=bool)

pandas build matrix of row by row comparisons

I have two dataframes, a (10,2) and a (4,2) and I am looking for a faster/more pythonic way to compare them row by row.
x = pd.DataFrame([range(10),range(2,12)])
x = x.transpose()
y = pd.DataFrame([[5,8],[2,3],[5,5]])
I'd like to build a comparison matrix (10,3) that shows which of the rows in the first dataframe fit the following requirements in the second dataframe. the x1 value must be >= the y[0] value and the x[0] value must be <= the y1 value. In reality, the data are dates, but for simplicity I have just used integers to make this example easier to follow. We're testing for overlap in time periods, so the logic shows that there must be some overlap in the periods of the respective tables.
arr = np.zeros((len(x),len(y)), dtype=bool)
for xrow in x.index:
for yrow in y.index:
if x.loc[xrow,1] >= y.loc[yrow,0] and x.loc[xrow,0] <= y.loc[yrow,1]:
arr[xrow,yrow] = True
arr
The brute force approach above is too slow. Any suggestions for how I could vectorize this or do some sort of transposed matrix comparisons?
You can convert x, y to NumPy arrays and then extend dimensions with np.newaxis/None, which would bring in NumPy's broadcasting when performing the same operations. Thus, all those comparisons and the output boolean array would be created in a vectorized fashion. The implementation would look like this -
X = np.asarray(x)
Y = np.asarray(y)
arr = (X[:,None,1] >= Y[:,0]) & (X[:,None,0] <= Y[:,1])
Sample run -
In [207]: x = pd.DataFrame([range(10),range(2,12)])
...: x = x.transpose()
...: y = pd.DataFrame([[5,8],[2,3],[5,5]])
...:
In [208]: X = np.asarray(x)
...: Y = np.asarray(y)
...: arr = (X[:,None,1] >= Y[:,0]) & (X[:,None,0] <= Y[:,1])
...:
In [209]: arr
Out[209]:
array([[False, True, False],
[False, True, False],
[False, True, False],
[ True, True, True],
[ True, False, True],
[ True, False, True],
[ True, False, False],
[ True, False, False],
[ True, False, False],
[False, False, False]], dtype=bool)

Categories