Let's suppose we have 2 tensors like
A = [[1, 2, 3, 4],
[5, 6, 7, 8]]
B = [[True, True, True, True],
[True, False, True, True]]
I want to extract, from each row of A, the K left-most elements whose corresponding boolean mask in B is True. In the above example, if K=2, the result should be
C = [[1, 2],
[5, 7]]
6 is not included in C because its corresponding boolean mask is False.
I was able to do that with the following code:
batch_size = 2
C = tf.zeros((batch_size, K), tf.int32)
for batch_idx in tf.range(batch_size):
    a = A[batch_idx]
    b = B[batch_idx]
    tmp = tf.boolean_mask(a, b)
    tmp = tmp[:K]
    C = tf.tensor_scatter_nd_update(
        C, [[batch_idx]], tf.expand_dims(tmp, axis=0))
But I don't want to iterate over A and B with a for loop.
Is there any way to do this with matrix operations only?
Not sure if it will work for all corner cases, but you could try using tf.ragged.boolean_mask:
import tensorflow as tf
A = [[1, 2, 3, 4],
[5, 6, 7, 8]]
B = [[True, True, True, True],
[True, False, True, True]]
K = 2
tmp = tf.ragged.boolean_mask(A, B)
C = tmp[:, :K].to_tensor()
With K = 2 this gives:
tf.Tensor(
[[1 2]
[5 7]], shape=(2, 2), dtype=int32)
With K = 3:
tf.Tensor(
[[1 2 3]
[5 7 8]], shape=(2, 3), dtype=int32)
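One corner case worth noting (my addition, not from the original answer): if a row has fewer than K True entries, .to_tensor() pads the short row, with zeros by default. A minimal sketch using the documented default_value argument to make the padding explicit:

import tensorflow as tf

A = tf.constant([[1, 2, 3, 4],
                 [5, 6, 7, 8]])
B = tf.constant([[True, False, False, False],
                 [True, False, True, True]])
K = 2

tmp = tf.ragged.boolean_mask(A, B)
# The first row keeps only one element, so it is padded up to length K;
# default_value=-1 marks the padding explicitly.
C = tmp[:, :K].to_tensor(default_value=-1)
# [[ 1 -1]
#  [ 5  7]]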
I'm confused about the way numpy array slicing is working in the example below. I can't figure out how exactly the slicing is working and would appreciate an explanation.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
# Test 1 - Expected behaviour
print(arr[m])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 2 - Expected behaviour
print(arr[m,:])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 3 - Expected behaviour
print(arr[:,m])
Out:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
### What's going on here? ###
# Test 4
print(arr[m,m])
Out:
array([ 6, 11]) # <--- diagonal components. I expected [[6,7],[10,11]].
I found that I could achieve the desired result with arr[:,m][m]. But I'm still curious about how this works.
You can use matrix multiplication to create a 2d mask.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
mask2d = np.array([m]).T * m
print(arr[mask2d])
Output:
[ 6 7 10 11]
Alternatively, you can have the output in matrix format.
print(np.ma.masked_array(arr, ~mask2d))
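Another option (my addition, not part of the original answer): np.ix_ accepts boolean masks directly and converts them into broadcastable integer indices, which preserves the 2-D block shape instead of flattening:

import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12],
                [13, 14, 15, 16]])
m = [False, True, True, False]

# np.ix_ turns each mask into a column/row of integer indices,
# so the result is the intersection block rather than the diagonal.
print(arr[np.ix_(m, m)])
# [[ 6  7]
#  [10 11]]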
It's just the way indexing works for numpy arrays. Usually if you have specific "slices" of rows and columns you want to select you just do:
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
# You want to check out rows 2-3 cols 2-3
print(arr[2:4,2:4])
Out:
[[11 12]
[15 16]]
Now say you want to select arbitrary combinations of specific row and column indices, for example you want row0-col2 and row2-col3
print(arr[[0, 2], [2, 3]])
Out:
[ 3 12]
What you are doing is identical to the above. [m,m] is equivalent to:
[m,m] == [[False,True,True,False], [False,True,True,False]]
Which is in turn equivalent to saying you want row1-col1 and row2-col2
print(arr[[1, 2], [1, 2]])
Out:
[ 6 11]
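To see the conversion explicitly (a quick illustration I've added): NumPy turns a boolean index into the integer positions of its True values, essentially via nonzero:

import numpy as np

m = np.array([False, True, True, False])
print(np.nonzero(m)[0])  # [1 2]
# Hence arr[m, m] is evaluated as arr[[1, 2], [1, 2]] -> the 'diagonal' pairs.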
I don't know why, but this is the way numpy treats indexing by a tuple of 1d boolean arrays:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
# Pseudocode for what NumPy does
# def index_with_bool_tuple(arr, m1, m2):
#     intm1 = np.flatnonzero(m1)  # [True, False, True] -> [0, 2]
#     intm2 = np.flatnonzero(m2)  # [False, False, True, True] -> [2, 3]
#     return arr[intm1, intm2]    # arr[[0, 2], [2, 3]]
print(arr[m1,m2]) # --> [3 12]
What I was expecting was slicing behaviour for non-contiguous segments of the array: selecting the intersection of rows and columns. That can be achieved with:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
def row_col_select(arr, *ms):
    n = arr.ndim
    assert len(ms) == n
    # Accumulate a full boolean mask which will have the shape of `arr`
    accum_mask = np.reshape(True, (1,) * n)
    for i in range(n):
        shape = tuple([1] * i + [arr.shape[i]] + [1] * (n - i - 1))
        m = np.reshape(ms[i], shape)
        accum_mask = np.logical_and(accum_mask, m)
    # Select `arr` according to the full boolean mask.
    # The mask is the logical AND of the boolean arrays broadcast across
    # their corresponding dimensions. E.g. for m1 and m2 above it is:
    # m1:   | m2: False False True True
    #       |
    # True  | [[False False  True  True]
    # False |  [False False False False]
    # True  |  [False False  True  True]]
    return arr[accum_mask]
print(row_col_select(arr,m1,m2)) # --> [ 3 4 11 12]
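For comparison (my addition, using the same arr, m1 and m2): np.ix_ performs the same row/column intersection but keeps the 2-D block shape instead of returning a flat array:

print(arr[np.ix_(m1, m2)])
# [[ 3  4]
#  [11 12]]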
In [55]: arr = np.array([
...: [1,2,3,4],
...: [5,6,7,8],
...: [9,10,11,12],
...: [13,14,15,16]
...: ])
...: m = [False,True,True,False]
In all your examples we can use this m1 instead of the boolean list:
In [58]: m1 = np.where(m)[0]
In [59]: m1
Out[59]: array([1, 2])
If m were a 2d array like arr, then we could use it to select elements from arr - but they would be raveled. When m is used to select along one dimension, the equivalent integer index array is clearer. We could even use np.array([2,1]) or np.array([2,1,1,2]) to select rows in a different order, or multiple times. Substituting m1 for m does not lose any information or control.
Select rows, or columns:
In [60]: arr[m1]
Out[60]:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [61]: arr[:,m1]
Out[61]:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
With 2 arrays, we get 2 elements, arr[1,1] and arr[2,2].
In [62]: arr[m1, m1]
Out[62]: array([ 6, 11])
Note that in MATLAB we have to use sub2ind to do the same thing. What's easy in numpy is a bit harder in MATLAB; for blocks it's the other way.
To get a block, we have to create a column array to broadcast with the row one:
In [63]: arr[m1[:,None], m1]
Out[63]:
array([[ 6, 7],
[10, 11]])
If that's too hard to remember, np.ix_ can do it for us:
In [64]: np.ix_(m1,m1)
Out[64]:
(array([[1],
[2]]),
array([[1, 2]]))
[63] is doing the same thing as [62]; the difference is that the 2 arrays broadcast differently. It's the same broadcasting as done in these additions:
In [65]: m1+m1
Out[65]: array([2, 4])
In [66]: m1[:,None]+m1
Out[66]:
array([[2, 3],
[3, 4]])
This indexing behavior is perfectly consistent - provided we don't import expectations from other languages.
I used m1 because boolean arrays don't broadcast, as shown below:
In [67]: np.array(m)
Out[67]: array([False, True, True, False])
In [68]: np.array(m)[:,None]
Out[68]:
array([[False],
[ True],
[ True],
[False]])
In [69]: arr[np.array(m)[:,None], np.array(m)]
...
IndexError: too many indices for array
In fact, the 'column' boolean doesn't work either:
In [70]: arr[np.array(m)[:,None]]
...
IndexError: boolean index did not match indexed array along dimension 1; dimension is 4 but corresponding boolean dimension is 1
We can use logical_and to broadcast a column boolean against a row boolean:
In [72]: mb = np.array(m)
In [73]: mb[:,None]&mb
Out[73]:
array([[False, False, False, False],
[False, True, True, False],
[False, True, True, False],
[False, False, False, False]])
In [74]: arr[_]
Out[74]: array([ 6, 7, 10, 11]) # 1d result
This is the case you quoted: "If obj.ndim == x.ndim, x[obj] returns a 1-dimensional array filled with the elements of x corresponding to the True values of obj"
Your other quote:
*"Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view)." *
means that if arr1 = arr[m,:], arr1 is a copy, and any modifications to arr1 will not affect arr. However I could use arr[m,:]=10to modify arr. The alternative to a copy is a view, as in basic indexing, arr2=arr[0::2,:]. modifications to arr2 do modify arr as well.
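A minimal demonstration of the copy/view distinction (my sketch, not from the original answer):

import numpy as np

arr = np.arange(16).reshape(4, 4)
m = np.array([False, True, True, False])

arr1 = arr[m, :]      # advanced indexing -> copy
arr1[0, 0] = 99       # arr is unchanged

arr2 = arr[0::2, :]   # basic slicing -> view
arr2[0, 0] = 99       # arr[0, 0] is now 99

arr[m, :] = 10        # direct masked assignment does modify arr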
I'm trying to vectorize the following function using numpy and am completely lost.
A = ndarray: Z x 3
B = ndarray: Z x 3
C = integer
D = ndarray: C x 3
Pseudocode:
means = []
for i in range(C):
    entries = []
    for p in range(len(B)):
        if B[p] == D[i]:
            entries.append(A[p])
    means.append(columnwise_means(entries))
return means
An example would be:
A = [[1,2,3],[1,2,3],[4,5,6],[4,5,6]]
B = [[9,8,7],[7,6,5],[1,2,3],[3,4,5]]
C = 2
D = [[1,2,3],[4,5,6]]
Returns:
[average([9,8,7],[7,6,5]), average([1,2,3],[3,4,5])] = [[8,7,6],[2,3,4]]
I've tried using np.where, np.argwhere, np.mean, etc but can't seem to get the desired effect. Any help would be greatly appreciated.
Thanks!
Going by the expected output of the question, I am assuming that in the actual code you would have:
the IF conditional as: if A[p] == D[i], and
entries appended from B: entries.append(B[p]).
So, here's one vectorized approach with NumPy broadcasting and dot-product -
mask = (D[:,None,:] == A).all(-1)
out = mask.dot(B)/(mask.sum(1)[:,None])
If the input arrays are integer arrays, then you can save on memory and boost up performance, considering the arrays as indices of a n-dimensional array and thus create the 2D mask without going 3D like so -
dims = np.maximum(A.max(0),D.max(0))+1
mask = np.ravel_multi_index(D.T,dims)[:,None] == np.ravel_multi_index(A.T,dims)
Sample run -
In [107]: A
Out[107]:
array([[1, 2, 3],
[1, 2, 3],
[4, 5, 6],
[4, 5, 6]])
In [108]: B
Out[108]:
array([[9, 8, 7],
[7, 6, 5],
[1, 2, 3],
[3, 4, 5]])
In [109]: mask = (D[:,None,:] == A).all(-1)
...: out = mask.dot(B)/(mask.sum(1)[:,None])
...:
In [110]: out
Out[110]:
array([[8, 7, 6],
[2, 3, 4]])
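For completeness, a runnable sketch of the integer-index variant on the same data (my assembly of the two snippets above; note that in Python 3 the division yields floats):

import numpy as np

A = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]])
B = np.array([[9, 8, 7], [7, 6, 5], [1, 2, 3], [3, 4, 5]])
D = np.array([[1, 2, 3], [4, 5, 6]])

# Encode each row as a single linear index, then compare 1-D arrays.
dims = np.maximum(A.max(0), D.max(0)) + 1
mask = np.ravel_multi_index(D.T, dims)[:, None] == np.ravel_multi_index(A.T, dims)
out = mask.dot(B) / mask.sum(1)[:, None]
print(out)
# [[8. 7. 6.]
#  [2. 3. 4.]]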
I see two hints:
First, comparing arrays by rows. A way to do that is to reduce your index system to 1D:
def indexer(M, base=256):
    return (M * base**np.arange(3)).sum(axis=1)
base is an integer greater than A.max(). Then the selection can be done like this:
indices=np.equal.outer(indexer(D),indexer(A))
which gives:
array([[ True, True, False, False],
[False, False, True, True]], dtype=bool)
Second, each group can have a different length, so vectorisation is difficult for the last step. Here is a way to achieve the job:
B = np.array(B)
means = [B[i].mean(axis=0) for i in indices]
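Putting the two pieces together on the example data (my assembly; it assumes the A/B roles as in the question's example, i.e. rows of A matched against D, means taken from B):

import numpy as np

A = np.array([[1, 2, 3], [1, 2, 3], [4, 5, 6], [4, 5, 6]])
B = np.array([[9, 8, 7], [7, 6, 5], [1, 2, 3], [3, 4, 5]])
D = np.array([[1, 2, 3], [4, 5, 6]])

def indexer(M, base=256):
    # Collapse each length-3 row into a single integer key.
    return (M * base ** np.arange(3)).sum(axis=1)

indices = np.equal.outer(indexer(D), indexer(A))
means = [B[i].mean(axis=0) for i in indices]
print(means)
# [array([8., 7., 6.]), array([2., 3., 4.])]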
I have an m x 3 matrix A and its row subset B (n x 3). Both are sets of indices into another, large 4D matrix; their data type is dtype('int64'). I would like to generate a boolean vector x, where x[i] = True if B does not contain row A[i,:].
There are no duplicate rows in either A or B.
I was wondering if there's an efficient way how to do this in Numpy? I found an answer that's somewhat related: https://stackoverflow.com/a/11903368/265289; however, it returns the actual rows (not a boolean vector).
You could follow the same pattern as shown in jterrace's answer, except use np.in1d instead of np.setdiff1d:
import numpy as np
np.random.seed(2015)
m, n = 10, 5
A = np.random.randint(10, size=(m,3))
B = A[np.random.choice(m, n, replace=False)]
print(A)
# [[2 2 9]
# [6 8 5]
# [7 8 0]
# [6 7 8]
# [3 8 6]
# [9 2 3]
# [1 2 6]
# [2 9 8]
# [5 8 4]
# [8 9 1]]
print(B)
# [[2 2 9]
# [1 2 6]
# [2 9 8]
# [3 8 6]
# [9 2 3]]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)
print(using_view(A, B, assume_unique=True))
yields
[False True True True False False False False True True]
You can use assume_unique=True (which can speed up the calculation) since
there are no duplicate rows in A or B.
Beware that A.view(...) will raise
ValueError: new type not compatible with array.
if A.flags['C_CONTIGUOUS'] is False (i.e. if A is not a C-contiguous array).
Therefore, in general we need to use np.ascontiguousarray(A) before calling view.
As B.M. suggests, you could instead view each row using the "void"
dtype:
def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd, assume_unique=True)
This is safe to use with integer dtypes. However, note that
In [342]: np.array([-0.], dtype='float64').view('V8') == np.array([0.], dtype='float64').view('V8')
Out[342]: array([False], dtype=bool)
so using np.in1d after viewing as void may return incorrect results for arrays
with float dtype.
Here is a benchmark of some of the proposed methods:
import numpy as np
np.random.seed(2015)
m, n = 10000, 5000
# Note A may contain duplicate rows,
# so don't use assume_unique=True for these benchmarks.
# In this case, using assume_unique=False does not improve the speed much anyway.
A = np.random.randint(10, size=(2*m,3))
# make A not C_CONTIGUOUS; the view methods fail for non-contiguous arrays
A = A[::2]
B = A[np.random.choice(m, n, replace=False)]
def using_view(A, B, assume_unique=False):
    Ad = np.ascontiguousarray(A).view([('', A.dtype)] * A.shape[1])
    Bd = np.ascontiguousarray(B).view([('', B.dtype)] * B.shape[1])
    return ~np.in1d(Ad, Bd, assume_unique=assume_unique)

from scipy.spatial import distance

def using_distance(A, B):
    return ~np.any(distance.cdist(A, B) == 0, 1)

from functools import reduce

def using_loop(A, B):
    pred = lambda i: A[:, i:i+1] == B[:, i]
    return ~reduce(np.logical_and, map(pred, range(A.shape[1]))).any(axis=1)

from pandas.core.groupby import get_group_index, _int64_overflow_possible
from functools import partial

def using_pandas(A, B):
    shape = [1 + max(A[:, i].max(), B[:, i].max()) for i in range(A.shape[1])]
    assert not _int64_overflow_possible(shape)
    encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
    a1, b1 = map(encode, (A.T, B.T))
    return ~np.in1d(a1, b1)

def using_void(A, B):
    dtype = 'V{}'.format(A.dtype.itemsize * A.shape[-1])
    Ad = np.ascontiguousarray(A).view(dtype)
    Bd = np.ascontiguousarray(B).view(dtype)
    return ~np.in1d(Ad, Bd)

# Sanity check: make sure all the functions return the same result
for func in (using_distance, using_loop, using_pandas, using_void):
    assert (func(A, B) == using_view(A, B)).all()
In [384]: %timeit using_pandas(A, B)
100 loops, best of 3: 1.99 ms per loop
In [381]: %timeit using_void(A, B)
100 loops, best of 3: 6.72 ms per loop
In [378]: %timeit using_view(A, B)
10 loops, best of 3: 35.6 ms per loop
In [383]: %timeit using_loop(A, B)
1 loops, best of 3: 342 ms per loop
In [379]: %timeit using_distance(A, B)
1 loops, best of 3: 502 ms per loop
Since there are only 3 columns, one solution would be to just reduce across columns:
>>> a
array([[2, 2, 9],
[6, 8, 5],
[7, 8, 0],
[6, 7, 8],
[3, 8, 6],
[9, 2, 3],
[1, 2, 6],
[2, 9, 8],
[5, 8, 4],
[8, 9, 1]])
>>> b
array([[2, 2, 9],
[1, 2, 6],
[2, 9, 8],
[3, 8, 6],
[9, 2, 3]])
>>> from functools import reduce
>>> pred = lambda i: a[:, i:i+1] == b[:,i]
>>> reduce(np.logical_and, map(pred, range(a.shape[1]))).any(axis=1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
This gives True where a row of a is present in b; negate with ~ for the asked-for "B does not contain row A[i]" vector. Note it creates an m x n intermediate array per column, which may not be memory efficient.
Alternatively, if the values are indices, i.e. non-negative integers, you may use pandas.core.groupby.get_group_index to reduce to one-dimensional arrays. This is an efficient algorithm which pandas uses internally for groupby operations; the only caveat is that you may need to verify that there will not be any integer overflow:
>>> from pandas.core.groupby import get_group_index, _int64_overflow_possible
>>> from functools import partial
>>> shape = [1 + max(a[:, i].max(), b[:, i].max()) for i in range(a.shape[1])]
>>> assert not _int64_overflow_possible(shape)
>>> encode = partial(get_group_index, shape=shape, sort=False, xnull=False)
>>> a1, b1 = map(encode, (a.T, b.T))
>>> np.in1d(a1, b1)
array([ True, False, False, False, True, True, True, True, False, False], dtype=bool)
Again, negate with ~np.in1d(a1, b1) for the "not contained" result.
You can treat A and B as two sets of XYZ points and calculate the euclidean distances between them with scipy.spatial.distance.cdist. The zero distances are the ones of interest to us. cdist is a pretty efficient implementation, so hopefully this gives an efficient solution for our case. The implementation to find such a boolean output would look like this -
from scipy.spatial import distance
out = ~np.any(distance.cdist(A,B)==0,1)
# OR np.all(distance.cdist(A,B)!=0,1)
Sample run -
In [582]: A
Out[582]:
array([[0, 2, 2],
[1, 0, 3],
[3, 3, 3],
[2, 0, 3],
[2, 0, 1],
[1, 1, 1]])
In [583]: B
Out[583]:
array([[2, 0, 3],
[2, 3, 3],
[1, 1, 3],
[2, 0, 1],
[0, 2, 2],
[2, 2, 2],
[1, 2, 3]])
In [584]: out
Out[584]: array([False, True, True, False, False, True], dtype=bool)
I have a NumPy array 'boolarr' of boolean type. I want to count the number of elements whose values are True. Is there a NumPy or Python routine dedicated for this task? Or, do I need to iterate over the elements in my script?
You have multiple options; two of them are shown below.
boolarr.sum()
numpy.count_nonzero(boolarr)
Here's an example:
>>> import numpy as np
>>> boolarr = np.array([[0, 0, 1], [1, 0, 1], [1, 0, 1]], dtype=bool)
>>> boolarr
array([[False, False, True],
[ True, False, True],
[ True, False, True]], dtype=bool)
>>> boolarr.sum()
5
Of course, that is a bool-specific answer. More generally, you can use numpy.count_nonzero.
>>> np.count_nonzero(boolarr)
5
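np.count_nonzero also accepts an axis argument (NumPy 1.12+, an addition of mine), which gives per-column or per-row counts and anticipates the axis-based answers below:

>>> np.count_nonzero(boolarr, axis=0)
array([2, 0, 3])
>>> np.count_nonzero(boolarr, axis=1)
array([1, 2, 2])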
This question solved a quite similar problem for me, and I thought I should share:
In raw Python you can use sum() to count True values in a list:
>>> sum([True,True,True,False,False])
3
But this won't work for nested lists:
>>> sum([[False, False, True], [True, False, True]])
TypeError...
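One way around it (my addition) is to sum the per-row counts:

>>> sum(sum(row) for row in [[False, False, True], [True, False, True]])
3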
In terms of comparing two numpy arrays and counting the number of matches (e.g. correct class prediction in machine learning), I found the below example for two dimensions useful:
import numpy as np
result = np.random.randint(3, size=(5, 2))  # 5x2 random integer array
target = np.random.randint(3, size=(5, 2))  # 5x2 random integer array
res = np.equal(result, target)
print(result)
print(target)
print(np.sum(res[:, 0]))
print(np.sum(res[:, 1]))
which can be extended to D dimensions.
The results are:
Prediction:
[[1 2]
[2 0]
[2 0]
[1 2]
[1 2]]
Target:
[[0 1]
[1 0]
[2 0]
[0 0]
[2 1]]
Count of correct prediction for D=1: 1
Count of correct prediction for D=2: 2
You can use boolarr.sum(axis=1) or boolarr.sum(axis=0): axis=1 counts the Trues in each row, and axis=0 counts the Trues in each column. So with
boolarr = np.array([[True, True, True],
                    [False, False, True]])
print(boolarr.sum(axis=1))
the output will be
[3 1]
b[b].size
where b is the Boolean ndarray in question. It filters b for True, and then counts the length of the filtered array. This probably isn't as efficient as np.count_nonzero() mentioned previously, but it is useful if you forget the other syntax. Plus, this shorter syntax saves programmer time.
Demo:
In [1]: a = np.array([0,1,3])
In [2]: a
Out[2]: array([0, 1, 3])
In [3]: a[a>=1].size
Out[3]: 2
In [5]: b=a>=1
In [6]: b
Out[6]: array([False, True, True])
In [7]: b[b].size
Out[7]: 2
For a 1D array, this is what worked for me:
import numpy as np
numbers = np.array([3, 1, 5, 2, 5, 1, 1, 5, 1, 4, 2, 1, 4, 5, 3, 4,
                    5, 2, 4, 2, 6, 6, 3, 6, 2, 3, 5, 6, 5])
numbersGreaterThan2 = np.count_nonzero(numbers > 2)