What is the proper way to specify a numpy masked array's masked value? - python

I basically want to run something like the following
x = np.array([1,2,3,4,5])
a = ma.masked_array(x, mask=[0, 0, 0, 1, 0])
for i in range(5):
    if (a[i] == "--"):
        print("a[{0:d}] is masked value".format(i))
I am not sure how to specify the -- value of the masked array in the if (a[i] == "--") part, where "--" stands for something I could not figure out. I know there are a few other ways of doing it by converting the entire masked array into boolean values, but I don't want that.
Edit.
The array a is a masked array, and when I print it out I get
masked_array(data=[1, 2, 3, --, 5],
mask=[False, False, False, True, False],
fill_value=999999)
What I want to do is to skip the -- values in that output using the if statement.

A masked array has two key attributes, data and mask.
In [63]: a.mask
Out[63]: array([False, False, False, True, False])
In [64]: a.data
Out[64]: array([1, 2, 3, 4, 5])
The getmask docs say it's equivalent to getting the attribute:
In [65]: np.ma.getmask(a)
Out[65]: array([False, False, False, True, False])
That mask can then be used to select values from data:
In [66]: a.data[a.mask]
Out[66]: array([4])
More commonly we are interested in the unmasked values:
In [67]: a.compressed()
Out[67]: array([1, 2, 3, 5])
After all, if we are using masking, we aren't "supposed" to care about the masked values. The sum, for instance, is taken over the unmasked values only:
In [68]: a.sum()
Out[68]: 11
Alternatively the masked values can be filled with something innocuous
In [69]: a.filled()
Out[69]: array([ 1, 2, 3, 999999, 5])
In [70]: a.filled(0)
Out[70]: array([1, 2, 3, 0, 5])

The proper way should be:
mask_a = numpy.ma.getmask(a)
which following your example returns the mask array:
array([False, False, False, True, False])
If I understand correctly how numpy works internally, this does not "process" the masked array to get boolean out of it. The mask is already there, you are just getting it in a proper array which can be used in your for loop, so if you are worried about performance... don't worry.
for i in range(5):
    if mask_a[i]:
        print("a[{0:d}] is masked value".format(i))
However, if for whatever reason you don't want to use the getmask function, you can get the string representation of a.
str_a = str(a)
which in your example is: '[1 2 3 -- 5]'
Then you can strip the square brackets and split the string on white spaces:
str_a = str(a)[1:-1].split()
which in your example is ['1', '2', '3', '--', '5'].
Then you have a list where you can filter out the "--" values with your for loop:
for i in range(5):
    if str_a[i] == "--":
        print("a[{0:d}] is masked value".format(i))
But honestly, using the getmask function should be the way to go: I didn't profile it, but I expect it to be faster.
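For the loop in the question, one further option is to compare each element against the ma.masked constant by identity, since indexing a masked position of a 1-D masked array returns that singleton. The sketch below assumes the question's array; masked_positions and unmasked_values are illustrative names that do not appear in the original. It also shows ma.getmaskarray, which always returns a full boolean array, whereas ma.getmask may return the scalar nomask when nothing is masked.

```python
import numpy as np
import numpy.ma as ma

x = np.array([1, 2, 3, 4, 5])
a = ma.masked_array(x, mask=[0, 0, 0, 1, 0])

masked_positions = []  # illustrative name: indices of masked entries
unmasked_values = []   # illustrative name: values that are kept

for i in range(len(a)):
    # Indexing a masked element returns the ma.masked singleton,
    # so an identity test is enough to skip it.
    if a[i] is ma.masked:
        masked_positions.append(i)
        continue
    unmasked_values.append(int(a[i]))

print(masked_positions)  # [3]
print(unmasked_values)   # [1, 2, 3, 5]

# getmaskarray always yields a boolean array with the shape of the data,
# even for an array that was created without any mask.
b = ma.masked_array(x)
print(ma.getmask(b) is ma.nomask)  # True
print(ma.getmaskarray(b).shape)    # (5,)
```

The identity test avoids comparing against the printed "--" placeholder, which is only how the masked entry is displayed.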

Related

Python "in" keyword-function does not work properly on numpy arrays

Why does this piece of code return True when it clearly can be seen that the element [1, 1] is not present in the first array and what am I supposed to change in order to make it return False?
aux = np.asarray([[0, 1], [1, 2], [1, 3]])
np.asarray([1, 1]) in aux
True
Checking for equality for the two arrays broadcasts the 1d array so the == operator checks if the corresponding indices are equal.
>>> np.array([1, 1]) == aux
array([[False, True],
[ True, False],
[ True, False]])
Since none of the inner arrays are all True, no array in aux is completely equal to the other array. We can check for this using
np.any(np.all(np.array([1, 1]) == aux, axis=1))
You can think of the in operator looping through an iterable and comparing each item for equality with what's being matched. I think what happens here can be demonstrated with the comparison to the first vector in the matrix/list:
>>> np.array([1, 1]) == np.array([0,1])
array([False, True])
Because that result contains a True entry, the in test is truthy (for NumPy arrays, in behaves like (aux == x).any()), so the in operator returns True.
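The row-wise check described above can be wrapped in a small helper. This is only a sketch: contains_row is a hypothetical name, and it assumes a 2-D array whose rows have the same length as the candidate row.

```python
import numpy as np

def contains_row(matrix, row):
    """Return True if `row` equals one complete row of the 2-D `matrix`.

    Hypothetical helper for illustration; assumes len(row) == matrix.shape[1].
    """
    matrix = np.asarray(matrix)
    row = np.asarray(row)
    # Broadcast the comparison, require every column to match within a row,
    # then ask whether any row satisfied that.
    return bool(np.any(np.all(matrix == row, axis=1)))

aux = np.asarray([[0, 1], [1, 2], [1, 3]])

print(contains_row(aux, [1, 1]))  # False: no row is exactly [1, 1]
print(contains_row(aux, [1, 2]))  # True: the second row matches

# For comparison, the element-wise "any match" reduction is True here.
print(bool((aux == np.asarray([1, 1])).any()))  # True
```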

How to convert a pytorch tensor of ints to a tensor of booleans?

I would like to cast a tensor of ints to a tensor of booleans.
Specifically I would like to be able to have a function which transforms tensor([0,10,0,16]) to tensor([0,1,0,1])
This is trivial in Tensorflow by just using tf.cast(x,tf.bool).
I want the cast to change all ints greater than 0 to a 1 and all ints equal to 0 to a 0. This is the equivalent of !! in most languages.
Since pytorch does not seem to have a dedicated boolean type to cast to, what is the best approach here?
Edit: I am looking for a vectorized solution as opposed to looping through each element.
What you're looking for is a boolean mask for the given integer tensor. For this, you can simply check whether the values in the tensor are greater than 0, using the comparison operator (>) or torch.gt(), which gives the desired result.
# input tensor
In [76]: t
Out[76]: tensor([ 0, 10, 0, 16])
# generate the needed boolean mask
In [78]: t > 0
Out[78]: tensor([0, 1, 0, 1], dtype=torch.uint8)
# sanity check
In [93]: mask = t > 0
In [94]: mask.type()
Out[94]: 'torch.ByteTensor'
Note: In PyTorch version 1.4+, the above operation would return 'torch.BoolTensor'
In [9]: t > 0
Out[9]: tensor([False, True, False, True])
# alternatively, use `torch.gt()` API
In [11]: torch.gt(t, 0)
Out[11]: tensor([False, True, False, True])
If you indeed want single bits (either 0s or 1s), cast it using:
In [14]: (t > 0).type(torch.uint8)
Out[14]: tensor([0, 1, 0, 1], dtype=torch.uint8)
# alternatively, use `torch.gt()` API
In [15]: torch.gt(t, 0).int()
Out[15]: tensor([0, 1, 0, 1], dtype=torch.int32)
The reason for this change has been discussed in this feature-request issue: issues/4764 - Introduce torch.BoolTensor ...
TL;DR: Simple one liner
t.bool().int()
PyTorch's to(dtype) method has convenient data-type named aliases. You can simply call bool:
>>> t.bool()
tensor([False, True, False, True])
>>> t.bool().int()
tensor([0, 1, 0, 1], dtype=torch.int32)
Convert boolean to number value:
a = torch.tensor([0,4,0,0,5,0.12,0.34,0,0])
print(a.gt(0)) # output in boolean dtype
# output: tensor([False, True, False, False, True, True, True, False, False])
print(a.gt(0).to(torch.float32)) # output in float32 dtype
# output: tensor([0., 1., 0., 0., 1., 1., 1., 0., 0.])
Another option would be to simply do:
temp = torch.tensor([0,10,0,16])
temp.bool()
#Returns
tensor([False, True, False, True])
You can use comparisons as shown below. Note that a == 0 marks the zero entries; use a != 0 to mark the non-zero entries instead:
>>> a = torch.tensor([0,10,0,16])
>>> result = (a == 0)
>>> result
tensor([ True, False, True, False])

Determine indices of entries in an array that start with a certain string

How can I determine the indices of elements in a numpy array that start with a certain string (e.g. using startswith)?
Example
Array:
test1234
testworld
hello
mynewcar
test5678
Now I need the indices where the value starts with test. My desired outcome is:
[0,1,4]
You could use np.char.startswith to get the mask of matches and then np.flatnonzero to get the matching indices -
np.flatnonzero(np.char.startswith(a, 'test'))
Sample run -
In [61]: a = np.array(['test1234', 'testworld','hello','mynewcar','test5678'])
In [62]: np.char.startswith(a, 'test')
Out[62]: array([ True, True, False, False, True], dtype=bool)
In [63]: np.flatnonzero(np.char.startswith(a, 'test'))
Out[63]: array([0, 1, 4])
@Divakar's answer is the way to go, but just as an alternative, you can also use a list comprehension:
a = np.array(['test1234', 'testworld', 'hello', 'mynewcar', 'test5678'])
[i for i, si in enumerate(a) if si.startswith('test')]
will give
[0, 1, 4]
This list you could also convert back to a numpy array:
np.array([i for i, si in enumerate(a) if si.startswith('test')])
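Both approaches above can be compared side by side. The sketch below is illustrative only: indices_starting_with is a hypothetical helper name, and it assumes the input can be converted to a NumPy string array.

```python
import numpy as np

def indices_starting_with(values, prefix):
    """Return the indices of entries in `values` that start with `prefix`.

    Hypothetical helper for illustration; `values` is converted to a
    NumPy unicode string array first.
    """
    values = np.asarray(values, dtype=str)
    # np.char.startswith gives a boolean mask; flatnonzero turns it into indices.
    return np.flatnonzero(np.char.startswith(values, prefix))

a = np.array(['test1234', 'testworld', 'hello', 'mynewcar', 'test5678'])

vectorized = indices_starting_with(a, 'test')
looped = [i for i, s in enumerate(a) if s.startswith('test')]

print(vectorized.tolist())  # [0, 1, 4]
print(looped)               # [0, 1, 4]
```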

Find indices in an array that contain one of values from another array

How can I get the indices of the values in an array (a) that match any of several "markers" given in another array (label)? For example, given
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
the goal is to find the indices of a with the value of 1 or 2; that is, 0, 1, 2, 3.
I tried several combinations. None of the following seems to work.
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
idx = where(a==label) # gives me only the index of the last value in label
idx = where(a==label[0] or label[1]) # Is confused by all or any?
idx = where(a==label[0] | label[1]) # gives me results as if nor. idx = [4,5]
idx = where(a==label[0] || label[1]) # syntax error
idx = where(a==bolean.or(label,0,1) # I know, this is not the correct form but I don`t remember it correctly but remember the error: also asks for a.all or a.any
idx = where(label[0] or label[1] in a) # gives me only the first appearance. index = 0. Also without where().
idx = where(a==label[0] or a==label[1]).all()) # syntax error
idx = where(a.any(0,label[0] or label[1])) # gives me only the first appearance. index=0. Also without where().
idx = where(a.any(0,label[0] | label[1])) # gives me only the first appearance. index=0. Also without where().
idx=where(a.any(0,label)) # Datatype not understood
Ok, I think you get my problem. Does anyone know how to do it correctly? Best would be a solution with a general label instead of label[x] so that the use of label is more variable for later changes.
You can use numpy.in1d:
>>> a = numpy.array([1, 1, 2, 2, 3, 3])
>>> label = numpy.array([1, 2])
>>> numpy.in1d(a, label)
array([ True, True, True, True, False, False], dtype=bool)
The above returns a mask. If you want indices, you can call numpy.nonzero on the mask array.
Also, if the values in label array are unique, you can pass assume_unique=True to in1d to possibly speed it up.
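As a sketch of the mask-then-indices step described above: the example below uses np.isin, which performs the same membership test as in1d and is the spelling that recent NumPy releases favor (in1d is deprecated there). The variable names mask and idx are illustrative.

```python
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

# Boolean mask: True wherever an element of `a` appears in `label`.
mask = np.isin(a, label)

# Convert the mask to indices; flatnonzero returns a plain 1-D index array.
idx = np.flatnonzero(mask)

print(mask.tolist())  # [True, True, True, True, False, False]
print(idx.tolist())   # [0, 1, 2, 3]
```

Because label is passed as a whole array, adding or removing markers later only requires changing label.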
np.where(a==label) is the same as np.nonzero(a==label). It tells us the coordinates (indexes) of all non-zero (or True) elements in the array a==label.
So instead of trying all these different where expressions, focus on the conditional array.
Without the where, here's what some of your expressions produce:
In [40]: a==label # 2 arrays don't match in size, scalar False
Out[40]: False
In [41]: a==label[0] # result is the size of a
Out[41]: array([ True, True, False, False, False, False], dtype=bool)
In [42]: a==label[0] or label[1] # or is a Python scalar operation
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In [43]: a==label[0] | label[1]
Out[43]: array([False, False, False, False, True, True], dtype=bool)
This last is the same as a==(label[0] | label[1]), the | is evaluated before the ==.
You need to understand how each of those arrays (or scalar or error) are produced before you understand what where gives you.
Correct combination of 2 equality tests (the extra () are important):
In [44]: (a==label[1]) | (a==label[0])
Out[44]: array([ True, True, True, True, False, False], dtype=bool)
Using broadcasting to separately test the 2 elements of label. Result is 2d array:
In [45]: a==label[:,None]
Out[45]:
array([[ True, True, False, False, False, False],
[False, False, True, True, False, False]], dtype=bool)
In [47]: (a==label[:,None]).any(axis=0)
Out[47]: array([ True, True, True, True, False, False], dtype=bool)
As I understand it, you want the indices of 1 and 2 in array "a".
In that case, try
label = [1, 2]
a = [1, 1, 2, 2, 3, 3]
idx_list = list()
for x in label:
    for i in range(len(a)):  # cover every index, including the last one
        if a[i] == x:
            idx_list.append(i)
I think what I'm reading as your intent is to get the indices in the second list, 'a', of the values in the first list, 'labels'. I think that a dictionary is a good way to store this information where the labels will be keys and indices will be the values.
Try this:
labels = [1, 2]
a = [1, 1, 2, 2, 3, 3]
results = {}
for label in labels:
    results[label] = [i for i, x in enumerate(a) if x == label]
If you want the indices of 1, just call results[1]. The list comprehension and the enumerate function are the real MVPs here.

How to invert numpy.where (np.where) function

I frequently use the numpy.where function to gather a tuple of indices of a matrix having some property. For example
import numpy as np
X = np.random.rand(3,3)
>>> X
array([[ 0.51035326, 0.41536004, 0.37821622],
[ 0.32285063, 0.29847402, 0.82969935],
[ 0.74340225, 0.51553363, 0.22528989]])
>>> ix = np.where(X > 0.5)
>>> ix
(array([0, 1, 2, 2]), array([0, 2, 0, 1]))
ix is now a tuple of ndarray objects that contain the row and column indices, whereas the sub-expression X>0.5 contains a single boolean matrix indicating which cells had the >0.5 property. Each representation has its own advantages.
What is the best way to take ix object and convert it back to the boolean form later when it is desired? For example
>>> G = np.zeros(X.shape, dtype=bool)
>>> G[ix] = True
Is there a one-liner that accomplishes the same thing?
Something like this maybe?
mask = np.zeros(X.shape, dtype='bool')
mask[ix] = True
but if it's something simple like X > 0, you're probably better off doing mask = X > 0 unless mask is very sparse or you no longer have a reference to X.
For example:
mask = X > 0
imask = np.logical_not(mask)
Edit: Sorry for being so concise before. Shouldn't be answering things on the phone :P
As I noted in the example, it's better to just invert the boolean mask. Much more efficient/easier than going back from the result of where.
The bottom of the np.where docstring suggests to use np.in1d for this.
>>> x = np.array([1, 3, 4, 1, 2, 7, 6])
>>> indices = np.where(x % 3 == 1)[0]
>>> indices
array([0, 2, 3, 5])
>>> np.in1d(np.arange(len(x)), indices)
array([ True, False, True, True, False, True, False], dtype=bool)
(While this is a nice one-liner, it is a lot slower than @Bi Rico's solution.)
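The zeros-then-assign approach from the answers above can be packaged as a small function, which gives the one-line call the question asks for. This is a minimal sketch: indices_to_mask is a hypothetical name, and the sample matrix is made up for illustration.

```python
import numpy as np

def indices_to_mask(ix, shape):
    """Rebuild a boolean mask of the given shape from np.where-style indices.

    Hypothetical helper for illustration; `ix` is a tuple of index arrays
    (as returned by np.where) or a single 1-D index array.
    """
    mask = np.zeros(shape, dtype=bool)
    mask[ix] = True
    return mask

# 2-D case: round-trip through np.where and back.
X = np.array([[0.51, 0.42, 0.38],
              [0.32, 0.30, 0.83],
              [0.74, 0.52, 0.23]])
ix = np.where(X > 0.5)
G = indices_to_mask(ix, X.shape)
print(np.array_equal(G, X > 0.5))  # True

# 1-D case, matching the in1d example above.
x = np.array([1, 3, 4, 1, 2, 7, 6])
indices = np.where(x % 3 == 1)[0]
print(indices_to_mask(indices, x.shape).tolist())
# [True, False, True, True, False, True, False]
```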
