NumPy masked array not considering fill_value when comparing to scalar

I have a masked numpy array like the following:
mar = np.ma.array([0, 0, 100, 100], mask=[False, True, True, False], fill_value=-1)
The two middle values are masked, so calling mar.filled() returns [0, -1, -1, 100].
I want to compare this array to a scalar 0, i.e.:
mar == 0
which returns
masked_array(data = [True -- -- False],
mask = [False True True False],
fill_value = True)
Note that the fill_value is now True, which is the default fill value for bool arrays but does not make sense in this case (I would have expected it to be -1 == 0, i.e. False).
To illustrate my problem more clearly: (mar == 0).filled() and mar.filled() == 0 do not return the same result.
Is this intended behaviour or is it a bug? In any case, is there a workaround to achieve my desired behaviour? I know that I can just convert to a normal array before comparison using .filled() but I would like to avoid that if possible, since the code should not care whether it is a masked array or a normal one.

mar == 0 uses mar.__eq__(0)
docs for that method say:
When either of the elements is masked, the result is masked as well,
but the underlying boolean data are still set, with self and other
considered equal if both are masked, and unequal otherwise.
That method in turn uses mar._comparison
This first performs the comparison on the .data attributes
In [16]: mar.data
Out[16]: array([ 0, 0, 100, 100])
In [17]: mar.data == 0
Out[17]: array([ True, True, False, False])
But then it compares the masks and adjusts values. 0 is not masked, so its 'mask' is False. Since the mask for the masked elements of mar is True, the masks don't match, and the .data of the comparison is set to False at those positions.
In [19]: np.ma.getmask(0)
Out[19]: False
In [20]: mar.mask
Out[20]: array([False, True, True, False])
In [21]: (mar==0).data
Out[21]: array([ True, False, False, False])
I get a different fill_value in the comparison. That could be a change in v 1.14.0.
In [24]: mar==0
Out[24]:
masked_array(data=[True, --, --, False],
mask=[False, True, True, False],
fill_value=-1)
In [27]: (mar==0).filled()
Out[27]: array([True, -1, -1, False], dtype=object)
This is confusing. Comparisons (and in general most functions) on masked arrays have to deal with the .data, the mask, and the fill value. Numpy code that isn't ma-aware usually works with the .data and ignores the masking. ma methods may work with the filled() values, or the compressed ones. This comparison method attempts to take all three attributes into account.
Testing equality with a masked zero array (same mask and fill_value):
In [34]: mar0 = np.ma.array([0, 0, 0, 0], mask=[False, True, True, False], fill_value=-1)
In [35]: mar0
Out[35]:
masked_array(data=[0, --, --, 0],
mask=[False, True, True, False],
fill_value=-1)
In [36]: mar == mar0
Out[36]:
masked_array(data=[True, --, --, False],
mask=[False, True, True, False],
fill_value=-1)
In [37]: _.data
Out[37]: array([ True, True, True, False])
mar == 0 is the same as mar == np.ma.array([0, 0, 0, 0], mask=False)

I don't know why (mar == 0) does not yield the desired output. But you can consider
np.equal(mar, 0)
which retains the original fill value.
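Another workaround sketch (my own suggestion, not from the numpy docs): do the comparison as usual, then propagate the fill_value through the comparison by hand. Since filled() only depends on the fill_value at masked positions, this makes (mar == 0).filled() agree with mar.filled() == 0:

```python
import numpy as np

mar = np.ma.array([0, 0, 100, 100], mask=[False, True, True, False], fill_value=-1)

res = (mar == 0)
# Replace the default bool fill_value (True) with the comparison
# applied to the original fill_value, i.e. -1 == 0 -> False.
res.fill_value = (mar.fill_value == 0)

print(res.filled())       # now consistent with the line below
print(mar.filled() == 0)
```

This keeps res a masked array, so downstream code that doesn't care about masking still works.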

Related

Efficiently selecting rows that end with zeros in numpy

I have a tensor / array of shape N x M, where M is less than 10 but N can potentially be > 2000. All entries are greater than or equal to zero. I want to keep only rows that either
Do not contain any zeros
End with zeros only, i.e. [1,2,0,0] would be valid but not [1,0,2,0] or [0,0,1,2]. Put differently, once a zero appears, all following entries of that row must also be zero; otherwise the row should be ignored.
as efficiently as possible. Consider the following example
Example:
[[35, 25, 17], # no zeros -> valid
[12, 0, 0], # ends with zeros -> valid
[36, 2, 0], # ends with zeros -> valid
[8, 0, 9]] # contains zeros and does not end with zeros -> invalid
should yield [True, True, True, False]. The straightforward implementation I came up with is:
import numpy as np
T = np.array([[35, 25, 17], [12, 0, 0], [36, 2, 0], [8, 0, 9]])
N,M = T.shape
valid = [i*[True,] + (M-i)*[False,] for i in range(1, M+1)]
mask = [((row > 0).tolist() in valid) for row in T]
Is there a more elegant and efficient solution to this? Any help is greatly appreciated!
Here's one way:
x[np.all((x == 0) == (x.cumprod(axis=1) == 0), axis=1)]
This calculates the row-wise cumulative product, matches the original array's zeros up with the cumprod array, then filters any rows where there's one or more False.
Workings:
In [3]: x
Out[3]:
array([[35, 25, 17],
[12, 0, 0],
[36, 2, 0],
[ 8, 0, 9]])
In [4]: x == 0
Out[4]:
array([[False, False, False],
[False, True, True],
[False, False, True],
[False, True, False]])
In [5]: x.cumprod(axis=1) == 0
Out[5]:
array([[False, False, False],
[False, True, True],
[False, False, True],
[False, True, True]])
In [6]: (x == 0) == (x.cumprod(axis=1) == 0)
Out[6]:
array([[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, False]]) # bad row!
In [7]: np.all((x == 0) == (x.cumprod(axis=1) == 0), axis=1)
Out[7]: array([ True, True, True, False])
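Since the question asks for the boolean mask [True, True, True, False] rather than the filtered rows, the same trick can be kept as a mask (a minimal sketch of the approach above):

```python
import numpy as np

x = np.array([[35, 25, 17],
              [12,  0,  0],
              [36,  2,  0],
              [ 8,  0,  9]])

# A row is valid iff its zeros sit exactly where the cumulative product
# has already hit zero, i.e. zeros only appear as a suffix of the row.
mask = np.all((x == 0) == (x.cumprod(axis=1) == 0), axis=1)
print(mask)           # True for the first three rows, False for the last
valid_rows = x[mask]  # the filtered rows, as in the answer above
```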

Count number of transitions in each row of a Numpy array

I have a 2D boolean array
a=np.array([[True, False, True, False, True],[True , True, True , True, True], [True , True ,False, False ,False], [False, True , True, False, False], [True , True ,False, True, False]])
I would like to create a new array containing the count of True-False transitions in each row of this array.
The desired result is count=[2, 0, 1, 1, 2]
I am working with a large numpy array, so I would like to avoid an explicit loop over the rows.
I tried to adapt existing solutions to the 2D case, counting for each row separately, but did not succeed.
Here is a possible solution:
b = a.astype(int)
c = (b[:, :-1] - b[:, 1:])
count = (c == 1).sum(axis=1)
Result:
>>> count
array([2, 0, 1, 1, 2])
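The subtraction of shifted columns is exactly what np.diff computes (with the opposite sign), so an equivalent sketch is:

```python
import numpy as np

a = np.array([[True, False, True, False, True],
              [True, True, True, True, True],
              [True, True, False, False, False],
              [False, True, True, False, False],
              [True, True, False, True, False]])

# A True-False transition shows up as a -1 in the row-wise difference
# of the 0/1 representation.
count = (np.diff(a.astype(int), axis=1) == -1).sum(axis=1)
print(count)  # [2 0 1 1 2]
```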

Apply numpy 'where' along one of axes

I have an array like this:
array = np.array([
[True, False],
[True, False],
[True, False],
[True, True],
])
I would like to find the last occurrence of True in each row of the array.
If it was 1d array I would do it in this way:
np.where(array)[0][-1]
How do I do something similar in 2D? Kind of like:
np.where(array, axis = 1)[0][:,-1]
but there is no axis argument in np.where.
Since True is greater than False, find the position of the largest element in each row. Unfortunately, argmax finds the first largest element, not the last one. So, reverse the array sideways, find the first True from the end, and recalculate the indexes:
(array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)
# array([0, 0, 0, 1])
The method fails if there are no True values in a row. You can check whether that's the case by dividing by array.max(axis=1): a row with no Trues will have its last True at infinity :)
array[0, 0] = False
((array.shape[1] - 1) - array[:, ::-1].argmax(axis=1)) / array.max(axis=1)
#array([inf, 0., 0., 1.])
I found an older answer but didn't like that it returns 0 both for a True in the first position and for a row of all False.
So here's a way to solve that problem, if it's important to you:
import numpy as np
arr = np.array([[False, False, False], # -1
[False, False, True], # 2
[True, False, False], # 0
[True, False, True], # 2
[True, True, False], # 1
[True, True, True], # 2
])
# Make an adjustment for rows with no Trues at all.
adj = np.sum(arr, axis=1) == 0
# Get the position and adjust.
x = np.argmax(np.cumsum(arr, axis=1), axis=1) - adj
# Compare to expected result:
assert np.all(x == np.array([-1, 2, 0, 2, 1, 2]))
print(x)
Gives [-1 2 0 2 1 2].
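Combining both answers, here is a sketch of a reversed-argmax helper that returns -1 for all-False rows (the function name is my own):

```python
import numpy as np

def last_true_index(arr):
    """Index of the last True per row, or -1 if the row has no True."""
    # argmax on the reversed rows finds the first True from the end;
    # converting back gives the last True from the front.
    idx = arr.shape[1] - 1 - arr[:, ::-1].argmax(axis=1)
    # For all-False rows, argmax returns 0, so mask them out explicitly.
    return np.where(arr.any(axis=1), idx, -1)

arr = np.array([[False, False, False],
                [False, False, True],
                [True, False, False],
                [True, False, True],
                [True, True, False],
                [True, True, True]])
print(last_true_index(arr))  # gives -1, 2, 0, 2, 1, 2
```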

Can someone please explain np.less_equal.outer(range(1,18),range(1,13))

I was debugging code written by someone who has left the organization and came across a line which uses the np.less_equal.outer and np.greater_equal.outer functions. I know that np.outer computes the outer product of two 1-dimensional arrays, producing a 2-D array, and that np.less_equal compares the elements of two arrays and returns True or False. Can someone please explain how this combined form works?
Thanks!
less_equal and greater_equal are special numpy functions called ufuncs, which come with extra methods, including accumulate, at, and outer.
In this case ufunc.outer extends the function to work like the outer product - but while the actual outer product would be multiply.outer, this instead performs the less-than-or-equal (or greater-than-or-equal) comparison.
So you get a 2d array of booleans indicating, for each element of the first array, whether it is less than or equal (or greater than or equal) to each of the elements in the second array.
np.less_equal.outer(range(1,18),range(1,13))
Out[]:
array([[ True, True, True, ..., True, True, True],
[False, True, True, ..., True, True, True],
[False, False, True, ..., True, True, True],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]], dtype=bool)
EDIT: a much more pythonic way of doing this would be:
np.triu(np.ones((17, 12), dtype=bool), 0)
That is, the upper triangle of a boolean array of shape (17, 12), which matches the shape of the outer call above (range(1, 18) has 17 elements and range(1, 13) has 12).
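A quick sketch verifying this equivalence:

```python
import numpy as np

lhs = np.less_equal.outer(range(1, 18), range(1, 13))
# range(1, 18) has 17 elements and range(1, 13) has 12,
# so both sides have shape (17, 12); element [i, j] is (i <= j).
rhs = np.triu(np.ones((17, 12), dtype=bool))
print(lhs.shape, np.array_equal(lhs, rhs))  # (17, 12) True
```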
From the documentation, we have that for one-dimensional arrays A and B, the operation np.less_equal.outer(A, B) is equivalent to:
m = len(A)
n = len(B)
r = np.empty((m, n), dtype=bool)
for i in range(m):
    for j in range(n):
        r[i, j] = (A[i] <= B[j])
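The loop above can be run directly and checked against the ufunc; a quick sketch:

```python
import numpy as np

A, B = np.arange(1, 18), np.arange(1, 13)
r = np.empty((len(A), len(B)), dtype=bool)
for i in range(len(A)):
    for j in range(len(B)):
        r[i, j] = A[i] <= B[j]

# The explicit double loop matches the vectorized ufunc.outer call.
print(np.array_equal(r, np.less_equal.outer(A, B)))  # True
```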
here is an example:
np.less_equal([4, 2, 1], [2, 2, 2])
array([False, True, True])
np.greater_equal([4, 2, 1], [2, 2, 2])
array([ True, True, False], dtype=bool)
and here is the plain outer function on its own:
np.outer(range(1, 3), range(1, 4))
array([[1, 2, 3],
       [2, 4, 6]])
hope that helps.

np.isreal behavior different in pandas.DataFrame and numpy.array

I have an array like below
np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
and a pandas DataFrame like below
df = pd.DataFrame({'A': ["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]})
When I apply np.isreal to DataFrame
df.applymap(np.isreal)
Out[811]:
A
0 False
1 False
2 True
3 False
4 False
5 True
When I apply np.isreal to the numpy array directly:
np.isreal( np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}]))
Out[813]: array([ True, True, True, True, True, True], dtype=bool)
I must be using np.isreal in the wrong use case, but can you help me understand why the results differ?
A partial answer is that isreal is only intended to be used with an array-like as the first argument.
You can use isrealobj on each element to reproduce the array-level behavior you see here:
In [11]: a = np.array(["hello","world",{"a":5,"b":6,"c":8},"usa","india",{"d":9,"e":10,"f":11}])
In [12]: a
Out[12]:
array(['hello', 'world', {'a': 5, 'b': 6, 'c': 8}, 'usa', 'india',
{'d': 9, 'e': 10, 'f': 11}], dtype=object)
In [13]: [np.isrealobj(aa) for aa in a]
Out[13]: [True, True, True, True, True, True]
In [14]: np.isreal(a)
Out[14]: array([ True, True, True, True, True, True], dtype=bool)
That leaves the question of what np.isreal does on something that isn't array-like, e.g.:
In [21]: np.isrealobj("")
Out[21]: True
In [22]: np.isreal("")
Out[22]: False
In [23]: np.isrealobj({})
Out[23]: True
In [24]: np.isreal({})
Out[24]: True
It turns out this stems from .imag since the test that isreal does is:
return imag(x) == 0 # note imag == np.imag
and that's it.
In [31]: np.imag(a)
Out[31]: array([0, 0, 0, 0, 0, 0], dtype=object)
In [32]: np.imag("")
Out[32]:
array('',
dtype='<U1')
In [33]: np.imag({})
Out[33]: array(0, dtype=object)
This looks up the .imag attribute on the array.
In [34]: np.asanyarray("").imag
Out[34]:
array('',
dtype='<U1')
In [35]: np.asanyarray({}).imag
Out[35]: array(0, dtype=object)
I'm not sure why this isn't set in the string case yet...
I think this is a small bug in Numpy, to be honest. Here Pandas is just looping over each item in the column and calling np.isreal() on it. E.g.:
>>> np.isreal("a")
False
>>> np.isreal({})
True
I think the paradox here has to do with how np.isreal() treats inputs of dtype=object. My guess is it's taking the object pointer and treating it like an int, so of course np.isreal(<some object>) returns True. Over an array of mixed types like np.array(["A", {}]), the array has dtype=object, so np.isreal() treats all the elements (including the strings) the way it would anything with dtype=object.
To be clear, I think the bug is in how np.isreal() treats arbitrary objects in a dtype=object array, but I haven't confirmed this explicitly.
There are a couple of things going on here. The first is pointed out by the previous answers: np.isreal acts strangely when passed objects.
However, I think you are also confused about what applymap is doing. Difference between map, applymap and apply methods in Pandas is always a great reference.
In this case what you think you are doing is actually:
df.apply(np.isreal, axis=1)
Which essentially calls np.isreal(df), whereas df.applymap(np.isreal) essentially calls np.isreal on each individual element of df, e.g.:
np.isreal(df.A)
array([ True, True, True, True, True, True], dtype=bool)
np.array([np.isreal(x) for x in df.A])
array([False, False, True, False, False, True], dtype=bool)
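If the goal is a per-element "is this a real number" test that behaves the same whether you go through pandas or NumPy, checking the Python type directly sidesteps np.isreal's object-array quirks. A sketch (the helper name and the numbers.Real check are my own suggestion, not part of either library's API):

```python
import numbers

import numpy as np

def isreal_elementwise(values):
    """True for real numeric scalars (ints, floats), False for
    strings, dicts, complex numbers, etc. Note bools count as Real."""
    return np.array([isinstance(v, numbers.Real) for v in values])

a = np.array(["hello", "world", {"a": 5}, 3, 2.5, 1 + 2j], dtype=object)
print(isreal_elementwise(a))  # strings, dict, and complex are False; 3 and 2.5 are True
```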
