Suppose I have a column vector y with length n, and I have a matrix X of size n*m. I want to check for each element i in y, whether the element is in the corresponding row in X. What is the most efficient way of doing this?
For example:
y = [1,2,3,4].T
and
X =[[1, 2, 3],[3, 4, 5],[4, 3, 2],[2, 2, 2]]
Then the output should be
[1, 0, 1, 0] or [True, False, True, False]
which ever is easier.
Of course we can use a for loop to iterate through both y and X, but is there any more efficient way of doing this?
Vectorized approach using broadcasting -
((X == y[:,None]).any(1)).astype(int)
Sample run -
In [41]: X # Input 1
Out[41]:
array([[1, 2, 3],
[3, 4, 5],
[4, 3, 2],
[2, 2, 2]])
In [42]: y # Input 2
Out[42]: array([1, 2, 3, 4])
In [43]: X == y[:,None] # Broadcasted comparison
Out[43]:
array([[ True, False, False],
[False, False, False],
[False, True, False],
[False, False, False]], dtype=bool)
In [44]: (X == y[:,None]).any(1) # Check for any match along each row
Out[44]: array([ True, False, True, False], dtype=bool)
In [45]: ((X == y[:,None]).any(1)).astype(int) # Convert to 1s and 0s
Out[45]: array([1, 0, 1, 0])
Related
I have a large sample (M) of boolean arrays of length 'N'. So there are 2^N unique boolean arrays possible.
I would like to know how many arrays are duplicates of each other and create a histogram.
One possibility is to create a unique integer (a[0] + a[1]*2 + a[3]*4 + ...+a[N]*2^(N-1)) for each unique array and histogram that integer.
But this is going to be O(M*N). What is the best way to do this?
numpy.ravel_multi_index is able to do this for you:
arr = np.array([[True, True, True],
[True, False, True],
[True, False, True],
[False, False, False],
[True, False, True]], dtype=int)
nums = np.ravel_multi_index(arr.T, (2,) * arr.shape[1])
>>> nums
array([7, 5, 5, 0, 5], dtype=int64)
And since you need a histogram, use
>>> np.histogram(nums, bins=np.arange(2**arr.shape[1]+1))
(array([1, 0, 0, 0, 0, 3, 0, 1], dtype=int64),
array([0, 1, 2, 3, 4, 5, 6, 7, 8]))
Another option is to use np.unique:
>>> np.unique(arr, return_counts=True, axis=0)
(array([[0, 0, 0],
[1, 0, 1],
[1, 1, 1]]),
array([1, 3, 1], dtype=int64))
With vectorized operation, the creation of a key is much more faster than a[0] + a[1]x2 + a[3]x4 + ...+a[N]*2^(N-1). I think that's no a better solution... in any case you need to almost "read" one time each value, and this require MxN step.
N = 3
M = 5
sample = [
np.array([True, True, True]),
np.array([True, False, True]),
np.array([True, False, True]),
np.array([False, False, False]),
np.array([True, False, True]),
]
multipliers = [2<<i for i in range(N-2, -1, -1)]+[1]
buckets = {}
buck2vec = {}
for s in sample:
key = sum(s*multipliers)
if key not in buckets:
buckets[key] = 0
buck2vec[key] = s
buckets[key]+=1
for key in buckets:
print(f"{buck2vec[key]} -> {buckets[key]} occurency")
Results:
[False False False] -> 1 occurency
[ True False True] -> 3 occurency
[ True True True] -> 1 occurency
suppose we have two arrays like these two:
A=np.array([[1, 4, 3, 0, 5],[6, 0, 7, 12, 11],[20, 15, 34, 45, 56]])
B=np.array([[4, 5, 6, 7]])
I intend to write a code in which I can find the indexes of an array such as A based on values in
the array B
for example, I want the final results to be something like this:
C=[[0 1]
[0 4]
[1 0]
[1 2]]
can anybody provide me with a solution or a hint?
Do you mean?
In [375]: np.isin(A,B[0])
Out[375]:
array([[False, True, False, False, True],
[ True, False, True, False, False],
[False, False, False, False, False]])
In [376]: np.argwhere(np.isin(A,B[0]))
Out[376]:
array([[0, 1],
[0, 4],
[1, 0],
[1, 2]])
B shape of (1,4) where the initial 1 isn't necessary. That's why I used B[0], though isin, via in1d ravels it anyways.
where is result is often more useful
In [381]: np.where(np.isin(A,B))
Out[381]: (array([0, 0, 1, 1]), array([1, 4, 0, 2]))
though it's a bit harder to understand.
Another way to get the isin array:
In [383]: (A==B[0,:,None,None]).any(axis=0)
Out[383]:
array([[False, True, False, False, True],
[ True, False, True, False, False],
[False, False, False, False, False]])
You can try in this way by using np.where().
index = []
for num in B:
for nums in num:
x,y = np.where(A == nums)
index.append([x,y])
print(index)
>>array([[0,1],
[0,4],
[1,0],
[1,2]])
With zip and np.where:
>>> list(zip(*np.where(np.in1d(A, B).reshape(A.shape))))
[(0, 1), (0, 4), (1, 0), (1, 2)]
Alternatively:
>>> np.vstack(np.where(np.isin(A,B))).transpose()
array([[0, 1],
[0, 4],
[1, 0],
[1, 2]], dtype=int64)
Using numpy, I have a matrix called points.
points
=> matrix([[0, 2],
[0, 0],
[1, 3],
[4, 6],
[0, 7],
[0, 3]])
If I have the tuple (1, 3), I want to find the row in points that matches these numbers (in this case, the row index is 2).
I tried using np.where:
np.where(points == (1, 3))
=> (array([2, 2, 5]), array([0, 1, 1]))
What is the meaning of this output? Can it be used to find the row where (1, 3) occurs?
You were just needed to look for ALL matches along each row, like so -
np.where((a==(1,3)).all(axis=1))[0]
Steps involved using given sample -
In [17]: a # Input matrix
Out[17]:
matrix([[0, 2],
[0, 0],
[1, 3],
[4, 6],
[0, 7],
[0, 3]])
In [18]: (a==(1,3)) # Matrix of broadcasted matches
Out[18]:
matrix([[False, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[False, True]], dtype=bool)
In [19]: (a==(1,3)).all(axis=1) # Look for ALL matches along each row
Out[19]:
matrix([[False],
[False],
[ True],
[False],
[False],
[False]], dtype=bool)
In [20]: np.where((a==(1,3)).all(1))[0] # Use np.where to get row indices
Out[20]: array([2])
I'm wondering if it's possible without iterating with a for loop to do something like this:
a = np.array([[1, 2, 5, 3, 4],
[4, 5, 6, 7, 8]])
cleaver = np.argmax(a == 5, axis=1) # np.array([2, 1])
foo(a, cleaver)
>>> np.array([False, False, True, True, True],
[False, True, True, True, True])
Is there a way to accomplish this through slicing or some other non-iterative function? The arrays I'm using are quite large and iterating over them row by row is prohibitively expensive.
You can use some broadcasting magic -
cleaver[:,None] <= np.arange(a.shape[1])
Sample run -
In [60]: a
Out[60]:
array([[1, 2, 5, 3, 4],
[4, 5, 6, 7, 8]])
In [61]: cleaver
Out[61]: array([2, 1])
In [62]: cleaver[:,None] <= np.arange(a.shape[1])
Out[62]:
array([[False, False, True, True, True],
[False, True, True, True, True]], dtype=bool)
How do I get the masked data only without flattening the data into a 1D array? That is, suppose I have a numpy array
a = np.array([[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])
and I mask all elements greater than 1,
b = ma.masked_greater(a, 1)
masked_array(data =
[[0 1 -- --]
[0 1 -- --]
[0 1 -- --]],
mask =
[[False False True True]
[False False True True]
[False False True True]],
fill_value = 999999)
How do I get only the masked elements without flattening the output? That is, I need to get
array([[ 2, 3],
[2, 3],
[2, 3]])
Lets try an example that produces a ragged result - different number of 'masked' values in each row.
In [292]: a=np.arange(12).reshape(3,4)
In [293]: a
Out[293]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [294]: a<6
Out[294]:
array([[ True, True, True, True],
[ True, True, False, False],
[False, False, False, False]], dtype=bool)
The flattened list of values that match this condition. It can't return a regular 2d array, so it has to revert to a flattened array.
In [295]: a[a<6]
Out[295]: array([0, 1, 2, 3, 4, 5])
do the same thing, but iterating row by row
In [296]: [a1[a1<6] for a1 in a]
Out[296]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)]
Trying to make an array of the result produces an object type array, which is little more than a list in an array wrapper:
In [297]: np.array([a1[a1<6] for a1 in a])
Out[297]: array([array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)], dtype=object)
The fact that the result is ragged is a good indicator that it is difficult, if not impossible, to perform that action with one vectorized operation.
Here's another way of producing the list of arrays. With sum I find how many elements there are in each row, and then use this to split the flattened array into sublists.
In [320]: idx=(a<6).sum(1).cumsum()[:-1]
In [321]: idx
Out[321]: array([4, 6], dtype=int32)
In [322]: np.split(a[a<6], idx)
Out[322]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=float64)]
It does use 'flattening'. And for these small examples it is slower than the row iteration. (Don't worry about the empty float array, split had to construct something and used a default dtype. )
A different mask, without empty rows clearly shows the equivalence of the 2 approaches.
In [344]: mask=np.tri(3,4,dtype=bool) # lower tri
In [345]: mask
Out[345]:
array([[ True, False, False, False],
[ True, True, False, False],
[ True, True, True, False]], dtype=bool)
In [346]: idx=mask.sum(1).cumsum()[:-1]
In [347]: idx
Out[347]: array([1, 3], dtype=int32)
In [348]: [a1[m] for a1,m in zip(a,mask)]
Out[348]: [array([0]), array([4, 5]), array([ 8, 9, 10])]
In [349]: np.split(a[mask],idx)
Out[349]: [array([0]), array([4, 5]), array([ 8, 9, 10])]
Zip the two lists together, and then filter them out:
data = [[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]
mask = [[False, False, True, True],
[False, False, True, True],
[False, False, True, True]]
zipped = zip(data, mask) # [([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True])]
masked = []
for lst, mask in zipped:
pairs = zip(lst, mask) # [(0, False), (1, False), (1, True), (1, True)]
masked.append([num for num, b in pairs if b])
print(masked) # [[1, 1], [1, 1], [1, 1]]
or more succinctly:
zipped = [...]
masked = [[num for num, b in zip(lst, mask) if b] for lst, mask in zipped]
print(masked) # [[1, 1], [1, 1], [1, 1]]
Due to vectorization in numpy you can use np.where to select items from the first array and use None (or some arbitrary value) to indicate the places that a value has been masked out. Note that this means you have to use a less compact representation for the array so may want to use -1 or some special value.
import numpy as np
a = np.array([
[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])
mask = np.array([[ True, True, True, True],
[ True, False, True, True],
[False, True, True, False]])
np.where(a, np.array, None)
This produces
array([[0, 1, 2, 3],
[0, None, 2, 3],
[None, 1, 2, None]], dtype=object)