I have a big array of integers and a second array of arrays. I want to create a boolean mask for the first array based on data from the second array of arrays. Preferably I would use numpy.isin, but its documentation clearly states:
The values against which to test each value of element. This argument is flattened if it is an array or array_like. See notes for behavior with non-array-like parameters.
Do you maybe know some performant way of doing this instead of a list comprehension?
So, for example, having these arrays:
a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
I would like to have a result like:
np.array([
[True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]
])
You can use broadcasting to avoid any loop (this is, however, more memory-expensive):
(a == b[...,None]).any(-2)
Output:
array([[ True, True, False, False, False, False, False, False, False, False],
[False, False, True, True, False, False, False, False, False, False],
[False, False, False, False, True, True, False, False, False, False],
[False, False, False, False, False, False, True, True, False, False],
[False, False, False, False, False, False, False, False, True, True]])
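To make the broadcasting explicit, here is the same expression with the intermediate shapes spelled out (an annotated restatement, using the arrays from the question):
import numpy as np

a = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])          # shape (1, 10)
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])  # shape (5, 2)

# b[..., None] has shape (5, 2, 1); comparing it with a broadcasts to
# (5, 2, 10), and .any(-2) collapses the axis holding each row's candidate
# values, leaving the (5, 10) mask from the question.
mask = (a == b[..., None]).any(-2)
print(mask.shape)  # (5, 10)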
Try numpy.apply_along_axis to work with numpy.isin:
np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b)
returns
array([[[ True, True, False, False, False, False, False, False, False, False]],
[[False, False, True, True, False, False, False, False, False, False]],
[[False, False, False, False, True, True, False, False, False, False]],
[[False, False, False, False, False, False, True, True, False, False]],
[[False, False, False, False, False, False, False, False, True, True]]])
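Note that because a has shape (1, 10), each np.isin(a, x) call returns a (1, 10) array, so the stacked result is (5, 1, 10). If the flat (5, 10) mask from the question is wanted, a squeeze takes care of it (a small addition, not part of the original answer):
np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b).squeeze(1)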
I will update with an edit comparing the runtime with a list comp
EDIT:
Whelp, I tested the runtime, and wouldn't you know, the list comp is faster:
timeit.timeit("[np.isin(a,x) for x in b]",number=10000, globals=globals())
0.37380070000654086
vs
timeit.timeit("np.apply_along_axis(lambda x: np.isin(a, x), axis=1, arr=b) ",number=10000, globals=globals())
0.6078917000122601
The other answer to this post by @mozway is much faster:
timeit.timeit("(a == b[...,None]).any(-2)",number=100, globals=globals())
0.007107900004484691
and should probably be accepted.
This is a bit of a cheat, but it is an ultra-fast solution. The cheat is that I sort the second matrix beforehand so that I can use binary search.
import time
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def isin_multi(a, b):
    out = np.zeros((b.shape[0], a.shape[0]), dtype=nb.boolean)
    for i in nb.prange(a.shape[0]):
        for j in nb.prange(b.shape[0]):
            index = np.searchsorted(b[j], a[i])
            if index >= len(b[j]) or b[j][index] != a[i]:
                out[j][i] = False
            else:
                out[j][i] = True
                break  # stop at the first row of b that contains a[i]
    return out
a = np.random.randint(200000, size=200000)
b = np.random.randint(200000, size=(50, 5000))
b = np.sort(b, axis=1)

start = time.perf_counter()
for _ in range(20):
    isin_multi(a, b)
print(f"isin_multi {time.perf_counter() - start:.3f} seconds")

start = time.perf_counter()
for _ in range(20):
    np.array([np.isin(a, ids) for ids in b])
print(f"comprehension {time.perf_counter() - start:.3f} seconds")
Results:
isin_multi 2.951 seconds
comprehension 21.093 seconds
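The same binary-search idea also works without numba by looping only over the rows of b (a sketch, not part of the benchmark above; it assumes b is already sorted along axis 1):
def isin_multi_np(a, b_sorted):
    # One vectorized np.searchsorted call per row of b_sorted.
    out = np.empty((b_sorted.shape[0], a.shape[0]), dtype=bool)
    for j, row in enumerate(b_sorted):
        idx = np.searchsorted(row, a)
        idx[idx == len(row)] = len(row) - 1  # clamp insertion points past the end
        out[j] = row[idx] == a
    return out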
Suppose I have a 2d boolean array with shape (nrows, ncols). I'm trying to efficiently extract the indices of the topmost True value for each column in the array. If a column has all False values, then no indices are returned for that column. Below is an example of a boolean array with shape (4, 6), where the desired output would be the indices of the topmost True in each non-empty column.
False False False False False False
True  False False True  False False
True  False True  False False True
True  False True  True  False False
Desired output of indices (row,col): [(1,0),(2,2),(1,3),(2,5)]
I tried using numpy.where and also an implementation of the skyline algorithm but both options are slow. Is there a more efficient way to solve this problem?
Thank you in advance for your help.
You can use np.argmax to detect the first True values.
Prepare the example array.
import numpy as np
a = np.array(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 0, 1, 0, 0],
     [1, 0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0, 0]]).astype('bool')
a
Output
array([[False, False, False, False, False, False],
[ True, False, False, True, False, False],
[ True, False, True, False, False, True],
[ True, False, True, True, False, False]])
Stack one row of False on top to deal with columns without any True. Find the first True in every column with np.argmax and pair it with an arange of the column indices. You have to adjust the row indices by -1 because we added one row to the array. Then select only the columns where the first True's index is greater than 0.
b = np.vstack([np.zeros_like(a[0]),a])
t = b.argmax(axis=0)
np.vstack([t - 1, np.arange(len(a[0]))]).T[t > 0]
Output
array([[1, 0],
[2, 2],
[1, 3],
[2, 5]])
Translating @HenryYik's answer to numpy gives a one-line solution:
np.vstack([a.argmax(axis=0), np.arange(len(a[0]))]).T[a.sum(0) > 0]
Output
array([[1, 0],
[2, 2],
[1, 3],
[2, 5]])
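A small variant of the one-liner above (equivalent in result, just a sketch) uses a.any(0) instead of a.sum(0) > 0 to drop the all-False columns:
np.vstack([a.argmax(axis=0), np.arange(a.shape[1])]).T[a.any(axis=0)]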
If you are open to using pandas, you can construct a df, drop the columns containing only False, and then use idxmax:
import pandas as pd

arr = [[False, False, False, False, False, False],
       [True, False, False, True, False, False],
       [True, False, True, False, False, True],
       [True, False, True, True, False, False]]

df = pd.DataFrame(arr, columns=range(len(arr[0])))
s = df.loc[:, df.sum() > 0].idxmax()
print (s)
Result:
0 1
2 2
3 1
5 2
dtype: int64
This gives the column as the index and the row as the value. You can convert it back to your desired form:
print (list(zip(s, s.index)))
[(1, 0), (2, 2), (1, 3), (2, 5)]
I suggest you try this:
def get_topmost(ar: np.ndarray):
    return [(row.index(True), i) for i, row in enumerate(ar.T.tolist()) if True in row]
Example (should work as is):
>>> test = np.array([
...     [False, False, False, False, False, False],
...     [True, False, False, True, False, False],
...     [True, False, True, False, False, True],
...     [True, False, True, True, False, False],
... ])
>>> print(get_topmost(test))
[(1, 0), (2, 2), (1, 3), (2, 5)]
X = np.arange(1, 26).reshape(5, 5)
X[:,1:2] % 2 == 0
The condition should only be applied to the second column, but I want the result as the whole matrix, with True only where the condition holds in that column, like:
[array([[False,  True, False, False, False],
        [False, False, False, False, False],
        [False,  True, False, False, False],
        [False, False, False, False, False],
        [False,  True, False, False, False]])]
It's giving the error
IndexError: boolean index did not match indexed array along dimension 1; dimension is 5 but corresponding boolean dimension is 1
Is this what you want?
import numpy as np
X = np.arange(1, 26).reshape(5, 5)
X=[X[::] % 2 == 0]
print(X)
Output
[array([[False, True, False, True, False],
[ True, False, True, False, True],
[False, True, False, True, False],
[ True, False, True, False, True],
[False, True, False, True, False]])]
If you want to get the whole matrix where the condition is true, you can simply do this:
X % 2 == 0
If you want to test only the second column where the condition is true, then:
X[:, 1:2] % 2 == 0
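If the goal is the full-shape mask shown in the question (True only where the second column is even, False everywhere else), one possible sketch combines the two ideas above:
X = np.arange(1, 26).reshape(5, 5)
mask = np.zeros(X.shape, dtype=bool)  # start from an all-False matrix
mask[:, 1] = X[:, 1] % 2 == 0         # test only the second column
print(mask)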
I want to find up/down patterns in a time series. This is what I use for simple up/down:
diff = np.diff(source, n=1)
encoding = np.where(diff > 0, 1, 0)
Is there a way with NumPy to do that for patterns with a given lookback length, without a slow loop? For example: up/up/up = 0, down/down/down = 1, up/down/up = 2, up/down/down = 3, ...
Thank you for your help.
I learned yesterday about np.lib.stride_tricks.as_strided from one of the StackOverflow answers to a similar question. This is an awesome trick and not as hard to understand as I expected. Now, if you get it, let's define a function called rolling that lists all the patterns to check against:
def rolling(a, window):
    shape = (a.size - window + 1, window)
    strides = (a.itemsize, a.itemsize)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
compare_with = [True, False, True]
bool_arr = np.random.choice([True, False], size=15)
patterns = rolling(bool_arr, len(compare_with))
And after that you can calculate the indices of pattern matches as discussed here:
idx = np.where(np.all(patterns == compare_with, axis=1))
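On NumPy 1.20 and later, np.lib.stride_tricks.sliding_window_view builds the same rolling windows without computing strides by hand (a safer alternative to as_strided, shown here only as a sketch):
patterns = np.lib.stride_tricks.sliding_window_view(bool_arr, len(compare_with))
idx = np.where(np.all(patterns == compare_with, axis=1))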
Sample run:
bool_arr
array([ True, False, True, False, True, True, False, False, False,
False, False, False, True, True, False])
patterns
array([[ True, False, True],
[False, True, False],
[ True, False, True],
[False, True, True],
[ True, True, False],
[ True, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, True],
[False, True, True],
[ True, True, False]])
idx
(array([0, 2], dtype=int64),)
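To get one integer code per lookback-length window, as the question asks, one hypothetical approach (a sketch; it uses a binary weighting rather than the exact numbering in the question) is to treat each window of the 0/1 encoding as a binary number:
source = np.random.rand(20)                      # any time series
encoding = np.where(np.diff(source) > 0, 1, 0)   # 1 = up, 0 = down, as in the question

lookback = 3
windows = np.lib.stride_tricks.sliding_window_view(encoding, lookback)
codes = windows @ (2 ** np.arange(lookback - 1, -1, -1))  # e.g. up/up/up -> 7, down/down/down -> 0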
I have the following array:
[(True,False,True), (False,False,False), (False,False,True)]
If any element of a tuple is True, then the whole tuple should become True. So the above should become:
[(True,True,True), (False,False,False), (True,True,True)]
My code below attempts to do that, but it simply converts all elements to True:
a = np.array([(True,False,True), (False,False,False), (False,True,False)], dtype='bool')
aint = a.astype('int')
print(aint)
aint[aint.sum() > 0] = (1,1,1)
print(aint.astype('bool'))
The output is:
[[1 0 1]
[0 0 0]
[0 1 0]]
[[ True True True]
[ True True True]
[ True True True]]
You could try np.any, which tests whether any array element along a given axis evaluates to True.
Here's a quick line of code that uses a list comprehension to get your intended result.
lst = [(True,False,True), (False,False,False), (False,False,True)]
result = [(np.any(x),) * len(x) for x in lst]
# result is [(True, True, True), (False, False, False), (True, True, True)]
I'm no numpy wizard but this should return what you want.
import numpy as np

def switch(arr):
    if np.any(arr):
        return np.ones(*arr.shape).astype(bool)
    return arr.astype(bool)

np.apply_along_axis(switch, 1, a)
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
ndarray.any along axis=1 and np.tile will get the job done:
np.tile(a.any(1)[:,None], a.shape[1])
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
Create an array of True's based on the original array's second dimension and assign it to all rows that have a True in it.
>>> a
array([[ True, False, True],
[False, False, False],
[False, True, False]])
>>> a[a.any(1)] = np.ones(a.shape[1], dtype=bool)
>>> a
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
Relies on Broadcasting.
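A related broadcasting-only variant (a sketch, not one of the answers above): any() with keepdims=True gives a column vector that broadcasts back across each row.
a = np.array([(True, False, True), (False, False, False), (False, False, True)])
result = a | a.any(axis=1, keepdims=True)  # rows with any True become all True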