Is there a faster alternative to np.where for determining indices? - python

I have an array like this:
arrayElements = [[1, 4, 6],[2, 4, 6],[3, 5, 6],...,[2, 5, 6]]
I need to know, for example, the indices where arrayElements is equal to 1.
Right now, I am doing:
rows, columns = np.where(arrayElements == 1)
This works, but I am doing this in a loop over all possible element values, which in my case is 1 to 500,000+. Depending on how big my array is, this takes 30-40 minutes to run. Can anyone suggest a better way of going about this? (Additional information: I don't care about the column that the value is in, just the row, not sure if that's useful.)
Edit: I need to know the indices for every element value separately. That is, I need the row indices for each value that the array contains.

So you are generating thousands of arrays like this:
In [271]: [(i,np.where(arr==i)[0]) for i in range(1,7)]
Out[271]:
[(1, array([0])),
(2, array([1, 3])),
(3, array([2])),
(4, array([0, 1])),
(5, array([2, 3])),
(6, array([0, 1, 2, 3]))]
I could do the == test for all values at once with a bit of broadcasting:
In [281]: arr==np.arange(1,7)[:,None,None]
Out[281]:
array([[[ True, False, False],
[False, False, False],
[False, False, False],
[False, False, False]],
[[False, False, False],
[ True, False, False],
[False, False, False],
[ True, False, False]],
[[False, False, False],
[False, False, False],
[ True, False, False],
[False, False, False]],
[[False, True, False],
[False, True, False],
[False, False, False],
[False, False, False]],
[[False, False, False],
[False, False, False],
[False, True, False],
[False, True, False]],
[[False, False, True],
[False, False, True],
[False, False, True],
[False, False, True]]])
and since you only care about rows, apply an any:
In [282]: (arr==np.arange(1,7)[:,None,None]).any(axis=2)
Out[282]:
array([[ True, False, False, False],
[False, True, False, True],
[False, False, True, False],
[ True, True, False, False],
[False, False, True, True],
[ True, True, True, True]])
The where on this is the same values as in Out[271], but grouped differently:
In [283]: np.where((arr==np.arange(1,7)[:,None,None]).any(axis=2))
Out[283]:
(array([0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5, 5]),
array([0, 1, 3, 2, 0, 1, 2, 3, 0, 1, 2, 3]))
It can be split up with:
In [284]: from collections import defaultdict
In [285]: dd = defaultdict(list)
In [287]: for i,j in zip(*Out[283]): dd[i].append(j)
In [288]: dd
Out[288]:
defaultdict(list,
{0: [0], 1: [1, 3], 2: [2], 3: [0, 1], 4: [2, 3], 5: [0, 1, 2, 3]})
This 2nd approach may be faster for some arrays, though it may not scale well to your full problem.
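If the broadcast version is too memory-hungry (it builds an n_values x n_rows x n_cols boolean array), one alternative the answer doesn't spell out is a sort-based grouping: one argsort plus one split replaces the hundreds of thousands of separate np.where scans. This is a sketch of that idea, not part of the original answer:

```python
import numpy as np

arr = np.array([[1, 4, 6], [2, 4, 6], [3, 5, 6], [2, 5, 6]])

# Flatten to (value, row) pairs, then sort once by value so each
# value's rows form one contiguous run that np.unique can split out.
vals = arr.ravel()
rows = np.repeat(np.arange(arr.shape[0]), arr.shape[1])
order = np.argsort(vals, kind='stable')
uniq, starts = np.unique(vals[order], return_index=True)
grouped = dict(zip(uniq, np.split(rows[order], starts[1:])))
```

The cost is O(n log n) in the number of elements rather than O(n * n_values), and `grouped` maps each value to the same row arrays as Out[271].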

By using np.isin (see documentation), you can test for multiple element values.
For example:
import numpy as np
a = np.array([1,2,3,4])
check_for = np.array([1,2])
locs = np.isin(a, check_for)
# [True, True, False, False]
np.where(locs)
#[0, 1]
Note: This assumes that you do not need to know the indices for every element value separately.
In the case that you need to track every element value separately, use a default dictionary and iterate through the matrix.
from collections import defaultdict
tracker = defaultdict(set)
for (row, column), value in np.ndenumerate(arrayElements):
    tracker[value].add(row)

You could try looping over the values and indices using numpy.ndenumerate and using Counter, defaultdict, or dict where the keys are the values in the array.
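A minimal sketch of that suggestion, using a defaultdict keyed by value (a Counter would work the same way if you only needed occurrence counts rather than row indices):

```python
import numpy as np
from collections import defaultdict

arrayElements = np.array([[1, 4, 6], [2, 4, 6], [3, 5, 6], [2, 5, 6]])

# One pass over the whole array; each value collects its row indices.
rows_by_value = defaultdict(list)
for (row, col), value in np.ndenumerate(arrayElements):
    rows_by_value[value].append(row)
```

This is a single Python-level pass, so it avoids rescanning the array once per value, though it is still slower per element than the vectorized approaches above.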

Related

Unexpected masked element in numpy's isin() with masked arrays. Bug?

Using numpy and the following masked arrays
import numpy.ma as ma
a = ma.MaskedArray([[1,2,3],[4,5,6]], [[True,False,False],[False,False,False]])
ta = ma.array([1,4,5])
>>> a
masked_array(
data=[[--, 2, 3],
[4, 5, 6]],
mask=[[ True, False, False],
[False, False, False]],
fill_value=999999)
>>> ta
masked_array(data=[1, 4, 5],
mask=False,
fill_value=999999)
To check, for each element of a, whether it is in ta, I use
ma.isin(a, ta)
This command gives
masked_array(
data=[[False, False, False],
[True, True, --]],
mask=[[False, False, False],
[False, False, True]],
fill_value=True)
Why is the last element in the result masked? Neither of the input arrays is masked at this point.
Using the standard numpy version produces the expected results:
>>> import numpy as np
>>> np.isin(a, ta)
array([[ True, False, False],
[ True, True, False]])
Here, however, the very first element is True because the mask of a was ignored.
Tested with Python 3.9.4 and numpy 1.20.3.
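The question is left open here, but one workaround (a sketch, not an official fix) is to run the plain np.isin on the underlying data and then re-apply the original mask, so only cells that were masked in the input come out masked:

```python
import numpy as np
import numpy.ma as ma

a = ma.MaskedArray([[1, 2, 3], [4, 5, 6]],
                   [[True, False, False], [False, False, False]])
ta = ma.array([1, 4, 5])

# isin on the raw .data, then restore a's mask on the boolean result,
# so the result is masked exactly where the input was masked.
result = ma.masked_array(np.isin(a.data, ta), mask=a.mask)
```

Note this treats the masked value in a.data as a real value during the membership test; if that is not acceptable, fill the masked cells with a sentinel not present in ta before calling np.isin.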

Assign scalar value to all elements of numpy array which are in another array

For example, I have a numpy array a = np.arange(10), and I need to set every value that appears in ignore_value = [3,4,5] to 255, like this:
a = np.arange(10)
ignore_value = [3,4,5]
a[a in ignore_value] = 255 # what is the correct way to implement this?
The last line above raises an error (Python cannot evaluate a in ignore_value when a is an array), but it shows what I want to do.
Edit:
I found a solution, but it's not vectorized.
for el in ignore_value:
    a[a == el] = 255
This looks really ugly and is very slow because of the Python-level for loop, so is there a better way?
In [500]: a = np.arange(10)
In [501]: ignore_value = [3,4,5]
In [502]: np.isin(a, ignore_value)
Out[502]:
array([False, False, False, True, True, True, False, False, False,
False])
In [503]: a[np.isin(a, ignore_value)]=255
In [504]: a
Out[504]: array([ 0, 1, 2, 255, 255, 255, 6, 7, 8, 9])
You could also construct the mask with:
In [506]: a[:,None]==ignore_value
Out[506]:
array([[False, False, False],
[False, False, False],
[False, False, False],
[ True, False, False],
[False, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]])
In [507]: (a[:,None]==ignore_value).any(axis=1)
Out[507]:
array([False, False, False, True, True, True, False, False, False,
False])
You can use numpy.isin with boolean indexing.
>>> a = np.arange(10)
>>> ignore_value = [3,4,5]
>>> a[np.isin(a, ignore_value)] = 255
>>> a
array([ 0, 1, 2, 255, 255, 255, 6, 7, 8, 9])
... or with numpy.where:
>>> a = np.arange(10)
>>> a = np.where(np.isin(a, ignore_value), 255, a)
>>> a
array([ 0, 1, 2, 255, 255, 255, 6, 7, 8, 9])
In both cases, np.isin(a, ignore_value) will give you a boolean array indicating where a has a value occurring in ignore_value.
>>> np.isin(a, ignore_value)
array([False, False, False, True, True, True, False, False, False, False])

Generate three non-overlapping mask for 2-D matrix that covers all of it

I have a 2-d array and I want to divide it into 3 non-overlapping and random sub-matrix by mask generation. For example I have a matrix like follow:
input = [[1,2,3],
[4,5,6],
[7,8,9]]
I want three random zero-one masks like follow:
mask1 = [[0,1,0],
[1,0,1],
[0,0,0]]
mask2 = [[1,0,0],
[0,1,0],
[1,0,0]]
mask3 =[[0,0,1],
[0,0,0],
[0,1,1]]
But my input matrix is very large, so I need to do this quickly. I also want to be able to specify the ratio of ones for every mask as an input. In the above example the ratio is equal for all masks.
To produce one random mask, I use the following code:
np.random.choice([0, 1], size=(matrix.shape[0], matrix.shape[1]))
My problem is how to produce non-overlapping masks.
IIUC, you can make a random matrix of 0, 1, and 2, and then extract the m == 0, m == 1, and m == 2 values:
groups = np.random.randint(0, 3, (5,5))
masks = (groups[...,None] == np.arange(3)[None,:]).T
However, this wouldn't guarantee an equal number of elements in each mask. To achieve that, you could permute a balanced allocation:
a = np.arange(25).reshape(5,5) # dummy input
groups = np.random.permutation(np.arange(a.size) % 3).reshape(a.shape)
masks = (groups[...,None] == np.arange(3)[None,:]).T
If you wanted a random probability to be in a group:
groups = np.random.choice([0,1,2], p=[0.3, 0.6, 0.1], size=a.shape)
or something. All you need to do is decide how you want to assign cells to groups, and then you can build your masks.
For example:
In [431]: groups = np.random.permutation(np.arange(a.size) % 3).reshape(a.shape)
In [432]: groups
Out[432]:
array([[1, 0, 0, 2, 0],
[1, 2, 0, 0, 1],
[2, 0, 2, 0, 2],
[1, 1, 2, 1, 0],
[2, 2, 1, 1, 0]], dtype=int32)
In [433]: masks = (groups[...,None] == np.arange(3)[None,:]).T
In [434]: masks
Out[434]:
array([[[False, False, False, False, False],
[ True, False, True, False, False],
[ True, True, False, False, False],
[False, True, True, False, False],
[ True, False, False, True, True]],
[[ True, True, False, True, False],
[False, False, False, True, False],
[False, False, False, False, True],
[False, False, False, True, True],
[False, True, False, False, False]],
[[False, False, True, False, True],
[False, True, False, False, True],
[False, False, True, True, False],
[ True, False, False, False, False],
[False, False, True, False, False]]])
which gives me a full mask:
In [450]: masks.sum(axis=0)
Out[450]:
array([[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1],
[1, 1, 1, 1, 1]])
and reasonably balanced. If the number of cells were a multiple of 3, these numbers would all agree.
In [451]: masks.sum(2).sum(1)
Out[451]: array([9, 8, 8])
You can use .astype(int) to convert from a bool array to an int array of 0s and 1s if you like.
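Putting the pieces together, a short end-to-end sketch (using the question's 3x3 shape) that produces three balanced, non-overlapping 0/1 masks:

```python
import numpy as np

shape = (3, 3)
# Balanced group labels 0/1/2, shuffled, then reshaped to the input shape.
groups = np.random.permutation(np.arange(shape[0] * shape[1]) % 3).reshape(shape)
# One boolean layer per group, converted to the 0/1 masks from the question.
masks = (groups[..., None] == np.arange(3)).T.astype(int)
```

Each cell carries exactly one group label, so the masks are guaranteed non-overlapping and exhaustive, and with 9 cells each mask gets exactly 3 ones.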

Identifying column indices that a subsetted np array came from in larger np array

I have a "large" numpy array like follows:
from numpy import array
large = array([[-0.047391 , -0.10926778, -0.00899118, 0.07461428, -0.07667476,
0.06961918, 0.09440736, 0.01648382, -0.04102225, -0.05038805,
-0.00930337, 0.3667651 , -0.02803499, 0.02597451, -0.1218804 ,
0.00561949],
[-0.00253788, -0.08670117, -0.00466262, 0.07330351, -0.06403728,
0.00301005, 0.12807456, 0.01198117, -0.04290793, -0.06138136,
-0.01369276, 0.37094407, -0.03747804, 0.04444246, -0.01162705,
0.00554793]])
And a "small" array that was subsetted from large.
small = array([[-0.10926778, -0.07667476, 0.09440736],
[-0.08670117, -0.06403728, 0.12807456]])
Without any other information, how could we identify the column indices in large from which the small array was generated?
In this case, the answer is 1, 4, 6 (starting at 0 as done in python).
What would be a generalizable way to determine this?
Something like this (not sure how you want to squeeze the result from 2D down to 1D?):
>>> np.isin(large,small)
array([[False, True, False, False, True, False, True, False, False,
False, False, False, False, False, False, False],
[False, True, False, False, True, False, True, False, False,
False, False, False, False, False, False, False]], dtype=bool)
>>> np.where(np.isin(large,small)) # tuple of arrays
(array([0, 0, 0, 1, 1, 1]), array([1, 4, 6, 1, 4, 6]))
# And generalizing, if you really want that as 2x2x3 array of indices:
idxs = array(np.where(np.isin(large,small)))
idxs.reshape( (2,) + small.shape )
array([[[0, 0, 0],
[1, 1, 1]],
[[1, 4, 6],
[1, 4, 6]]])

Mask from max values in numpy array, specific axis

Input example:
I have a numpy array, e.g.
a=np.array([[0,1], [2, 1], [4, 8]])
Desired output:
I would like to produce a mask array with the max value along a given axis, in my case axis 1, being True and all others being False. e.g. in this case
mask = np.array([[False, True], [True, False], [False, True]])
Attempt:
I have tried approaches using np.amax, but this returns only the max values along that axis, not a mask:
>>> np.amax(a, axis=1)
array([1, 2, 8])
and np.argmax similarly returns the indices of the max values along that axis.
>>> np.argmax(a, axis=1)
array([1, 0, 1])
I could iterate over this in some way but once these arrays become bigger I want the solution to remain something native in numpy.
Method #1
Using broadcasting, we can use comparison against the max values, while keeping dims to facilitate broadcasting -
a.max(axis=1,keepdims=1) == a
Sample run -
In [83]: a
Out[83]:
array([[0, 1],
[2, 1],
[4, 8]])
In [84]: a.max(axis=1,keepdims=1) == a
Out[84]:
array([[False, True],
[ True, False],
[False, True]], dtype=bool)
Method #2
Alternatively with argmax indices for one more case of broadcasted-comparison against the range of indices along the columns -
In [92]: a.argmax(axis=1)[:,None] == range(a.shape[1])
Out[92]:
array([[False, True],
[ True, False],
[False, True]], dtype=bool)
Method #3
To finish off the set, and if we are looking for performance, use initialization and then advanced-indexing -
out = np.zeros(a.shape, dtype=bool)
out[np.arange(len(a)), a.argmax(axis=1)] = 1
Create an identity matrix and select from its rows using argmax on your array:
np.identity(a.shape[1], bool)[a.argmax(axis=1)]
# array([[False, True],
# [ True, False],
# [False, True]], dtype=bool)
Please note that this ignores ties, it just goes with the value returned by argmax.
You're already halfway to the answer. Once you compute the max along an axis, you can compare it with the input array and you'll have the required binary mask!
In [7]: maxx = np.amax(a, axis=1)
In [8]: maxx
Out[8]: array([1, 2, 8])
In [12]: a >= maxx[:, None]
Out[12]:
array([[False, True],
[ True, False],
[False, True]], dtype=bool)
Note: This uses NumPy broadcasting when doing the comparison between a and maxx
In one line: np.equal(a.max(1)[:,None], a) or np.equal(a.max(1), a.T).T .
But this can mark several entries per row when the max value is tied.
In a multi-dimensional case you can also use np.indices. Let's suppose you have an array:
a = np.array([[
[0, 1, 2],
[3, 8, 5],
[6, 7, -1],
[9, 5, 8]],[
[5, 2, 8],
[7, 6, -3],
[-1, 2, 1],
[3, 5, 6]]
])
you can access argmax values calculated for axis 0 like so:
ind = np.indices(a.shape[1:])  # grids of row and column indices, shape (2, 4, 3)
k = np.zeros((2, 4, 3), bool)
k[a.argmax(0), ind[0], ind[1]] = 1
The output would be:
array([[[False, False, False],
[False, True, True],
[ True, True, False],
[ True, True, True]],
[[ True, True, True],
[ True, False, False],
[False, False, True],
[False, False, False]]])
