unexpected behaviour of numpy.median on masked arrays - python

I have a question about the behaviour of numpy.median() on masked arrays created with numpy.ma.masked_array().
As I understand from debugging my own code, numpy.median() does not work as expected on masked arrays (see Using numpy.median on a masked array for a description of the problem).
The answer provided was:
Explanation: If I remember correctly, the np.median does not support subclasses, so it fails to work correctly on np.ma.MaskedArray.
The conclusion, therefore, is that the way to calculate the median of the elements in a masked array is to use numpy.ma.median(), since this is a median function dedicated to masked arrays.
My problem is that I've just spent a considerable amount of time tracking this down, and there is no obvious way of knowing about it.
There is no warning or exception raised when trying to calculate the median of a masked array via numpy.median().
The answer returned by this function is not what is expected, and it causes serious problems when people are not aware of this.
Does anyone know if this might be considered a bug?
In my opinion, the expected behaviour should be that using numpy.median on a masked array raises an exception of some sort.
Any thoughts?
The test script below shows the unwanted and unexpected behaviour of using numpy.median on a masked array (note that the correct and expected median of the valid elements is 2.5!):
In [1]: import numpy as np
In [2]: test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
In [3]: valid_elements = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
In [4]: testm = np.ma.masked_array(test, ~valid_elements)
In [5]: testm
Out[5]:
masked_array(data = [1 2 3 4 -- -- -- --],
mask = [False False False False True True True True],
fill_value = 999999)
In [6]: np.median(test)
Out[6]: 52.0
In [7]: np.median(test[valid_elements])
Out[7]: 2.5
In [8]: np.median(testm)
Out[8]: 4.0
In [9]: np.ma.median(testm)
Out[9]: 2.5

Does anyone know if this might be considered a bug?
Well, it is a bug! I posted it a few months ago on their issue tracker (Link to the bug report).
The reason for this behaviour is that np.median uses the partition method of the input array, but np.ma.MaskedArray doesn't override partition. So when arr.partition is called inside np.median, it simply falls back to the basic numpy.ndarray.partition method (which is bogus for a masked array!).
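To make that concrete, here is a minimal sketch reproducing the internal steps described above on the question's data; this mirrors the behaviour of the NumPy versions current when the bug was reported, and the details may differ in later releases:
import numpy as np
test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
mask = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)
testm = np.ma.masked_array(test, mask)
# Roughly what np.median does internally: partition the array around the
# two middle positions, then average them.  The partition call falls back
# to ndarray.partition, which reorders the raw data while the mask stays
# where it was, so the masked 100s take part in choosing the middle values.
part = testm.copy()
part.partition([3, 4])        # mask-unaware ndarray.partition
middle = part[[3, 4]]         # masked_array([4, --]); one element is masked
print(middle.mean())          # 4.0 -- matches the bogus np.median(testm)
print(np.ma.median(testm))    # 2.5 -- the mask-aware result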

Related

ndarray .floor() and .ceil() methods missing?

I'm processing a numpy.matrix and I'm missing the round-up and round-down functions.
I.e. I can do:
data = [[1, -20],[-30, 2]]
np.matrix(data).mean(0).round().astype(int).tolist()[0]
Out[58]: [-14, -9]
So I can use .round(), but I cannot use .floor() or .ceil().
They are also not mentioned in the NumPy 1.14 reference (hosted on docs.scipy.org).
Why are these (quite essential) functions missing?
edit:
I've found that you can do np.floor(np.matrix(data).mean(0)).astype(int).tolist()[0]. But why the difference? Why is .round() a method and .floor not?
As with most of these "why" questions, we can only deduce likely reasons from patterns and some knowledge of the history.
https://docs.scipy.org/doc/numpy/reference/ufuncs.html#floating-functions
floor and ceil are classed as floating ufuncs. rint is also a ufunc that performs like round. ufuncs have a standardized interface, including parameters like out and where.
np.round lives in numpy/core/fromnumeric.py (here /usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py). numeric is one of the original packages that were merged to form the current numpy. np.round is an alias for np.round_, which ends up calling np.around, also in fromnumeric. Note that the available parameters include out, but also decimals (which rint lacks), and that it delegates to the array's .round method.
One advantage to having a function is that you don't have to first convert the list into an array:
In [115]: data = [[1, -20],[-30, 2]]
In [119]: np.mean(data,0)
Out[119]: array([-14.5, -9. ])
In [120]: np.mean(data,0).round()
Out[120]: array([-14., -9.])
In [121]: np.rint(np.mean(data,0))
Out[121]: array([-14., -9.])
using other parameters:
In [138]: np.mean(data,axis=0, keepdims=True,dtype=int)
Out[138]: array([[-14, -9]])
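For completeness, a small sketch (reusing the data above, with outputs in comments assuming default settings) of calling floor, ceil and rint as ufuncs:
import numpy as np
data = [[1, -20], [-30, 2]]
col_means = np.mean(data, axis=0)        # array([-14.5,  -9. ])
# floor, ceil and rint are ufuncs: they are called as functions rather than
# as array methods, and they accept the standard ufunc keywords such as out.
print(np.floor(col_means))               # [-15.  -9.]
print(np.ceil(col_means))                # [-14.  -9.]
print(np.rint(col_means))                # [-14.  -9.]  (behaves like .round())
# the out= keyword writes the result into a preallocated array
result = np.empty_like(col_means)
np.floor(col_means, out=result)
print(result.astype(int).tolist())       # [-15, -9]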

python numpy where returning unexpected warning

Using python 2.7, scipy 1.0.0-3
Apparently I have a misunderstanding of how the numpy where function is supposed to operate, or there is a known bug in its operation. I'm hoping someone can tell me which it is, and explain a workaround that suppresses the annoying warning I am trying to avoid. I get the same behavior when I use the pandas Series where().
To make it simple, I'll use a numpy array as my example. Say I want to apply np.log() to the array, and do so only where the value is a valid input, i.e., myArray>0.0. For values where the function should not be applied, I want to set an output flag of -999.9:
myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
np.where(myArray>0.0, np.log(myArray), -999.9)
I expected numpy.where() not to complain about the 0.0 value in the array, since the condition is False there, yet it does, and it appears to actually evaluate np.log for that False condition:
-c:2: RuntimeWarning: divide by zero encountered in log
array([ 0.00000000e+00, -2.87682072e-01, -6.93147181e-01,
-1.38629436e+00, -9.99900000e+02])
The numpy documentation states:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
I beg to differ with this statement since
[np.log(val) if val>0.0 else -999.9 for val in myArray]
provides no warning at all:
[0.0, -0.2876820724517809, -0.69314718055994529, -1.3862943611198906, -999.9]
So, is this a known bug? I don't want to suppress the warning for my entire code.
You can have the log evaluated only at the relevant places by using its optional where parameter:
np.where(myArray>0.0, np.log(myArray, where=myArray>0.0), -999.9)
or more efficiently
mask = myArray > 0.0
np.where(mask, np.log(myArray, where=mask), -999)
or if you find the "double where" ugly
np.log(myArray, where=myArray>0.0, out=np.full(myArray.shape, -999.9))
Any one of those three should suppress the warning.
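For reference, a quick sketch (assuming default error settings) that the where= keyword gives the intended output without triggering the warning:
import numpy as np
myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
mask = myArray > 0.0
# log is only evaluated where mask is True; the prefilled out array supplies
# -999.9 everywhere else, so log(0.0) is never computed and nothing warns
result = np.log(myArray, where=mask, out=np.full(myArray.shape, -999.9))
print(result)   # log values for the positive entries, -999.9 in the last slot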
This behavior of where should be understandable with a basic grasp of Python: the call is an ordinary Python expression that uses a couple of numpy functions.
What happens in this expression?
np.where(myArray>0.0, np.log(myArray), -999.9)
The interpreter first evaluates all the arguments of the function, and then passes the results to where. Effectively, then:
cond = myArray>0.0
A = np.log(myArray)
B = -999.9
np.where(cond, A, B)
The warning is produced in the 2nd line, not in the 4th.
The 4th line is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(cond, A, B)]
or
[A[i] if c else B for i,c in enumerate(cond)]
np.where is most often used with one argument, where it is a synonym for np.nonzero. We don't see this three-argument form that often on SO. It isn't that useful, in part because it doesn't save on calculations.
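For illustration, the one-argument form simply returns the indices of the True elements (the sample mask here is arbitrary):
import numpy as np
mask = np.array([True, False, True, False, True])
print(np.where(mask))     # (array([0, 2, 4]),)
print(np.nonzero(mask))   # (array([0, 2, 4]),)  -- the same thing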
Masked assignment is used more often, especially if there are more than 2 alternatives.
In [123]: mask = myArray>0
In [124]: out = np.full(myArray.shape, np.nan)
In [125]: out[mask] = np.log(myArray[mask])
In [126]: out
Out[126]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
Paul Panzer showed how to do the same with the where parameter of log. That feature isn't being used as much as it could be.
In [127]: np.log(myArray, where=mask, out=out)
Out[127]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
This is not a bug. See this related answer to a similar question. The example in the docs is misleading, but that answer looks at it in detail.
The issue is that a conditional expression is evaluated lazily by the interpreter (only the branch that is needed gets executed), while numpy.where is a regular function. Conditional expressions therefore allow short-circuiting, which is not possible when the arguments are evaluated beforehand.
In other words, the arguments of numpy.where are calculated in full before the Boolean array is even consulted.
You may think this is inefficient: why build 2 separate arrays and then use a 3rd Boolean array to decide which item to choose? Surely that's double the work / double the memory?
However, this inefficiency is more than offset by the vectorisation provided by numpy functions acting on an entire array, e.g. np.log(arr).
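A rough timing sketch of that trade-off; the array size and repeat count are arbitrary, and the data is kept strictly positive so neither version triggers the log warning:
import numpy as np
import timeit
arr = np.random.rand(100_000) + 0.5     # strictly positive values
# both branches are computed in full, but every step is vectorised C code
t_where = timeit.timeit(lambda: np.where(arr > 1.0, np.log(arr), -999.9), number=5)
# the "equivalent" comprehension runs a Python-level loop, one np.log call per element
t_loop = timeit.timeit(lambda: [np.log(v) if v > 1.0 else -999.9 for v in arr], number=5)
print(t_where, t_loop)                  # the loop is typically orders of magnitude slower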
Consider the example provided in the docs:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
Notice that the inputs are arrays. Try running:
c = np.array([0])
result = [xv if c else yv for (c, xv, yv) in zip(c==0, np.array([1]), np.log(c))]
You will notice that the divide-by-zero warning still appears: np.log(c) is evaluated while the arguments to zip are being built, before the comprehension ever tests the condition.

Understanding views/copies of Python numpy arrays: why doesn't a double boolean index work?

I am reading http://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html (and http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy) and trying to understand why Python would not interpret the last line in the following in the intuitive way:
import datetime
import numpy as np
import pandas as pd
n = 100
tested = np.zeros(n, dtype=bool)
dt = pd.DatetimeIndex([datetime.datetime(2000, 1, 1, 0, 0, 0, 0)
                       + pd.Timedelta(minutes=idx) for idx in range(n)])
data = np.random.normal(size=n)
data[np.random.uniform(size=n) > 0.9] = np.nan
mask = np.isnan(data)
dt_ = dt[~mask]
data_ = data[~mask]
select = pd.DatetimeIndex(dt_) > datetime.datetime(2000, 1, 1, 1, 0, 0)
data__ = data_[select] # do something with this later, otherwise, why mask?
tested[~mask][select] = True # or np.ones(select.sum(), dtype=bool)
i.e., please, Python gods, update those elements of tested selected by the first and second boolean arrays. It is quite annoying that the first boolean index returns a copy (thanks for the catch, @user2357112) of the tested array which, when indexed and updated by the second, does not write the results back into the original array. I think I am right in saying that some other interpreted languages allow such syntax, so why not here? Or perhaps there is some computer-science best practice that I am not understanding correctly? What actually happens is that tested remains an all-False boolean vector.
The key to using both mask and select is to convert ~mask to indices, and to apply select to that index array.
print(select.shape)
print(data__.shape)
idx = np.where(~mask)[0]
tested[idx[select]] = True
print(tested.sum())
producing for one run:
(92,)
(34,)
34
I've found with timing tests that data[mask] is effectively data[np.where(mask)], so there is no loss of performance in using index arrays rather than boolean masks.
If this kind of double masking occurred often enough, numpy developers could package this code in a function, even in compiled code.
data[mask][select] = 1 # translates to:
data.__getitem__(mask).__setitem__(select, 1)
is consistent Python syntax. If you want different interpreter action, such as:
data.__setitem__(<foo(mask, select)>, 1)
write a PEP, and convince other Python developers that it is a good idea.
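A minimal sketch (with a small made-up array) of the silently discarded assignment and the index-based workaround from the answer above:
import numpy as np
tested = np.zeros(6, dtype=bool)
mask = np.array([True, False, True, True, False, True])
select = np.array([True, False, True, True])   # applies to the 4 unmasked slots
# chained boolean indexing: tested[mask] returns a copy, so the assignment
# below updates that copy and is silently thrown away
tested[mask][select] = True
print(tested)                  # still all False
# workaround: convert the mask to positions first, then combine with select
idx = np.where(mask)[0]        # array([0, 2, 3, 5])
tested[idx[select]] = True
print(tested)                  # [ True False False  True False  True]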

Ignoring negative values when using np.log(array)

When taking the log of a specific column within a numpy array, i.e., logSFROIIdC = np.log(data_dC[:, 9]), the interpreter emits the warning:
-c:13: RuntimeWarning: divide by zero encountered in log.
Now, I know why this happens, i.e., log(-1) = Math Error.
However, I want to be able to call something or write some code that skips any value in the array which would cause this warning, ignoring that row altogether and making the data column usable again.
I have tried various methods and this is a last resort asking the community.
You can control this behavior with np.seterr. Here's an example.
First, tell numpy to ignore invalid values:
In [4]: old = np.seterr(invalid='ignore')
Now log(-1) doesn't generate a warning:
In [5]: x = np.array([-1.,1])
In [6]: np.log(x)
Out[6]: array([ nan, 0.])
Restore the previous settings:
In [7]: np.seterr(**old)
Out[7]: {'divide': 'warn', 'invalid': 'ignore', 'over': 'warn', 'under': 'ignore'}
And now we get the warning:
In [8]: np.log(x)
/Users/warren/anaconda/bin/ipython:1: RuntimeWarning: invalid value encountered in log
#!/Users/warren/anaconda/python.app/Contents/MacOS/python
Out[8]: array([ nan, 0.])
There is also a context manager, np.errstate. For example,
In [10]: with np.errstate(invalid='ignore'):
....: y = np.log(x)
....:
In [11]: y
Out[11]: array([ nan, 0.])
You can also use a masked array and NumPy will automatically apply a mask for the invalid values after you perform the np.log() calculation:
a = np.array([1,2,3,0,4,-1,-2])
b = np.log(np.ma.array(a))
print(b.sum())
# 3.17805383035
Here np.ma.array(a) creates a masked array with no masked elements. It works because NumPy automatically masks elements that come out as inf (or any other invalid value) in calculations with masked arrays.
Alternatively, you could have created the mask yourself (which I recommend) like:
a = np.ma.array(a, mask=(a<=0))
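To address the original goal of ignoring the offending rows altogether, here is a short sketch building on the masked-array approach above (the sample values are made up, and errstate from the first answer is used to silence the warning for the masked slots):
import numpy as np
a = np.array([1., 2., 3., 0., 4., -1., -2.])
a_masked = np.ma.array(a, mask=(a <= 0))   # mask non-positive entries up front
with np.errstate(divide='ignore', invalid='ignore'):
    log_a = np.log(a_masked)
print(log_a.compressed())    # only the valid log values
print(a[~a_masked.mask])     # the original rows that survive: [1. 2. 3. 4.]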
One hack is to keep the values from being negative in the first place: np.clip to the rescue.
positive_array = np.clip(array, some_small_positive_value, None) avoids negative values in your array, though I am not sure whether bringing the values close to zero serves your purpose.
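A rough illustration of the clip idea; the 1e-10 floor is an arbitrary choice:
import numpy as np
array = np.array([2.0, 1.0, 0.0, -1.0])
positive_array = np.clip(array, 1e-10, None)   # floor everything at a tiny positive value
print(np.log(positive_array))                  # no invalid inputs, but 0 and -1 both become log(1e-10)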

What happens when I set values that are too big for a given dtype?

So I have always created numpy arrays like that:
>>> u = np.zeros( 10, int )
>>> v = np.zeros( 10, float )
I have always been oblivious to maximum permitted values until now. I assumed it would simply work, and if it didn't I would get an OverflowError and then find some workaround, like taking the logarithm.
But recently I started to use the other dtypes:
>>> v8 = np.zeros( 10, np.uint8 )
>>> v8[0] = 2 ** 8 - 1
>>> v8[1] = 2 ** 8
>>> v8
>>> array([255, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
Ok so I don't get any warning when I assign a value bigger than 255. That's a bit scary.
So my questions are:
when I used arrays with types int and float, is it possible that I set a value that was too big (resulting in completely wrong calculations) without knowing it?
if I want to use uint8, do I have to manually check all assigned values are in [ 0, 255 ]?
numpy works very close to the machine level. Range checks are time-consuming, so testing is left to the developer. Python is much more high-level: many checks are done automatically, and Python ints can hold arbitrarily large values. Everywhere you have to decide between speed and safety; numpy sits further toward the speed side.
In situations where it is necessary to test the range of values, you have to check it yourself.
The clip method may help you:
>>> u = np.array([124,-130, 213])
>>> u.astype('b')
array([124, 126, -43], dtype=int8)
>>> u.clip(-128,127).astype('b')
array([ 124, -128, 127], dtype=int8)
As explained in the other answers, too large values get 'wrapped around', so you need to clip them by hand to the minimum and maximum allowed values before converting. For integers, these limits can be obtained using np.iinfo. You could write your own utility function to do this conversion in a safe way for a given dtype:
def safe_convert(x, new_dtype):
    info = np.iinfo(new_dtype)
    return x.clip(info.min, info.max).astype(new_dtype)
Quick test:
In [31]: safe_convert(np.array([-1,0,1,254,255,256]), np.uint8)
Out[31]: array([ 0, 0, 1, 254, 255, 255], dtype=uint8)
In [32]: safe_convert(np.array([-129,-128,-127,126,127,128]), np.int8)
Out[32]: array([-128, -128, -127, 126, 127, 127], dtype=int8)
Yes, uint8 will mask your values (keep only the 8 least significant bits), so you need to check manually:
>>> a = numpy.uint8(256)
>>> a
0
And yes, overflow can occur without you realizing it. It's a common source of errors in many programming languages. However, integers in Python behave in an uncommon way: they have no explicitly defined limit.
I've written about it in this answer.
As already explained, numpy wraps around to avoid doing checks.
If clipping is not acceptable, you can use numpy.min_scalar_type before you cast to get the minimum dtype that will hold your data without losing information.
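A rough sketch of checking a range with np.min_scalar_type and np.iinfo before casting (the sample values are arbitrary):
import numpy as np
values = np.array([0, 255, 256, 70000])
# smallest dtype able to hold the largest value without wrapping
print(np.min_scalar_type(int(values.max())))   # uint32 (uint16 tops out at 65535)
# or check the whole range against a candidate dtype explicitly
info = np.iinfo(np.uint8)
fits = values.min() >= info.min and values.max() <= info.max
print(fits)                                    # False -- uint8 would wrap 256 and 70000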
Also note that practically the only reason to use uint8 is to save memory in very big arrays, as the computation speed is usually roughly the same (some operations will even be cast upwards internally). If your arrays are not so big that memory is a concern, you are safer using uint16 or even uint32 for intermediate computations. If memory is your problem, consider moving to out-of-core storage such as PyTables; and if you are already close to filling memory, then with a bigger dataset not even uint8 will be enough.
