Ignoring negative values when using np.log(array) - python

When taking the log of a specific column within a numpy array, e.g. logSFROIIdC = np.log(data_dC[:, 9]), the interpreter returns the warning:
-c:13: RuntimeWarning: divide by zero encountered in log.
Now, I know why this happens, i.e., log(-1) = Math Error.
However, I want to be able to write some code that skips any value in the array which would cause this error, ignoring that row altogether and making the data column usable again.
I have tried various methods, and asking the community is a last resort.

You can control this behavior with np.seterr. Here's an example.
First, tell numpy to ignore invalid values:
In [4]: old = np.seterr(invalid='ignore')
Now log(-1) doesn't generate a warning:
In [5]: x = np.array([-1.,1])
In [6]: np.log(x)
Out[6]: array([ nan, 0.])
Restore the previous settings:
In [7]: np.seterr(**old)
Out[7]: {'divide': 'warn', 'invalid': 'ignore', 'over': 'warn', 'under': 'ignore'}
And now we get the warning:
In [8]: np.log(x)
/Users/warren/anaconda/bin/ipython:1: RuntimeWarning: invalid value encountered in log
#!/Users/warren/anaconda/python.app/Contents/MacOS/python
Out[8]: array([ nan, 0.])
There is also a context manager, np.errstate. For example,
In [10]: with np.errstate(invalid='ignore'):
....: y = np.log(x)
....:
In [11]: y
Out[11]: array([ nan, 0.])
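If, as in the question, you'd rather drop the offending rows entirely than just silence the warning, a boolean mask over the column does it in one step. A minimal sketch, using a small stand-in for the question's data_dC (with the value column at index 1 rather than 9):
import numpy as np
data = np.array([[1.0, 5.0], [2.0, -3.0], [3.0, 0.0], [4.0, 7.0]])
valid = data[:, 1] > 0           # rows whose value has a finite log
logSFR = np.log(data[valid, 1])  # no warning: every input is strictly positive
print(logSFR)                    # [1.60943791 1.94591015]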

You can also use a masked array and NumPy will automatically apply a mask for the invalid values after you perform the np.log() calculation:
a = np.array([1,2,3,0,4,-1,-2])
b = np.log(np.ma.array(a))
print(b.sum())
# 3.17805383035
Here np.ma.array(a) creates a masked array with no masked elements. It works because NumPy automatically masks elements that come out as inf (or any other invalid value) in calculations with masked arrays.
Alternatively, you could have created the mask yourself (which I recommend) like:
a = np.ma.array(a, mask=(a<=0))
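For completeness, np.ma.log is the masked version of log; it applies the domain mask itself, so the non-positive inputs simply come out masked, with no warning at all. A small sketch:
import numpy as np
a = np.array([1, 2, 3, 0, 4, -1, -2])
b = np.ma.log(a)        # non-positive inputs come out masked (--)
print(b.sum())          # same total as above
print(b.compressed())   # only the valid results, as a plain ndarray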

One hack is to prevent the values from being negative in the first place; np.clip to the rescue.
positive_array = np.clip(array, some_small_positive_value, None) avoids negative values in your array, though I am not sure whether bringing the values close to zero serves your purpose.
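A sketch of that approach; eps here is a hypothetical floor you would choose based on the scale of your data:
import numpy as np
arr = np.array([3.0, 0.5, -1.0, 0.0])
eps = 1e-12                               # hypothetical small positive floor
positive_array = np.clip(arr, eps, None)
print(np.log(positive_array))             # no warning; clipped entries become log(eps) ~ -27.6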

Related

python numpy where returning unexpected warning

Using python 2.7, scipy 1.0.0-3
Apparently I have a misunderstanding of how the numpy where function is supposed to operate or there is a known bug in its operation. I'm hoping someone can tell me which and explain a work-around to suppress the annoying warning that I am trying to avoid. I'm getting the same behavior when I use the pandas Series where().
To make it simple, I'll use a numpy array as my example. Say I want to apply np.log() to the array, and only do so where a value is a valid input, i.e., myArray>0.0. For values where the function should not be applied, I want to set an output flag of -999.9:
myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
np.where(myArray>0.0, np.log(myArray), -999.9)
I expected numpy.where() to not complain about the 0.0 value in the array since the condition is False there, yet it does and it appears to actually execute for that False condition:
-c:2: RuntimeWarning: divide by zero encountered in log
array([ 0.00000000e+00, -2.87682072e-01, -6.93147181e-01,
       -1.38629436e+00, -9.99900000e+02])
The numpy documentation states:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
I beg to differ with this statement since
[np.log(val) if val>0.0 else -999.9 for val in myArray]
provides no warning at all:
[0.0, -0.2876820724517809, -0.69314718055994529, -1.3862943611198906, -999.9]
So, is this a known bug? I don't want to suppress the warning for my entire code.
You can have the log evaluated only at the relevant places by using its optional where parameter
np.where(myArray>0.0, np.log(myArray, where=myArray>0.0), -999.9)
or more efficiently
mask = myArray > 0.0
np.where(mask, np.log(myArray, where=mask), -999)
or if you find the "double where" ugly
np.log(myArray, where=myArray>0.0, out=np.full(myArray.shape, -999.9))
Any one of those three should suppress the warning.
This behavior of where should be understandable given a basic understanding of Python. This is a Python expression that uses a couple of numpy functions.
What happens in this expression?
np.where(myArray>0.0, np.log(myArray), -999.9)
The interpreter first evaluates all the arguments of the function, and then passes the results to the where. Effectively then:
cond = myArray>0.0
A = np.log(myArray)
B = -999.9
np.where(cond, A, B)
The warning is produced in the 2nd line, not in the 4th.
The 4th line is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(cond, A, B)]
or
[A[i] if c else B for i,c in enumerate(cond)]
np.where is most often used with one argument, where it is a synonym for np.nonzero. We don't see this three-argument form that often on SO. It isn't that useful, in part because it doesn't save on calculations.
Masked assignment is used more often, especially if there are more than 2 alternatives.
In [123]: mask = myArray>0
In [124]: out = np.full(myArray.shape, np.nan)
In [125]: out[mask] = np.log(myArray[mask])
In [126]: out
Out[126]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
Paul Panzer showed how to do the same with the where parameter of log. That feature isn't being used as much as it could be.
In [127]: np.log(myArray, where=mask, out=out)
Out[127]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
This is not a bug. See this related answer to a similar question. The example in the docs is misleading, but that answer looks at it in detail.
The issue is that a ternary expression is evaluated lazily by the interpreter, which only evaluates the branch it selects, while numpy.where is a regular function whose arguments are all evaluated before the call. Therefore, ternary expressions allow short-circuiting, whereas this is not possible when the arguments are computed up front.
In other words, the arguments of numpy.where are calculated before the Boolean array is processed.
You may think this is inefficient: why build 2 separate arrays and then use a 3rd Boolean array to decide which item to choose? Surely that's double the work / double the memory?
However, this inefficiency is more than offset by the vectorisation provided by numpy functions acting on an entire array, e.g. np.log(arr).
Consider the example provided in the docs:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
Notice the inputs are arrays. Even this "equivalent" list comprehension evaluates its inputs eagerly: the expressions handed to zip are computed in full before the loop runs. Try running:
c = np.array([0])
result = [xv if c else yv for (c, xv, yv) in zip(c == 0, np.array([1]), np.log(c))]
You will notice that this emits the very same divide-by-zero warning, because np.log(c) is evaluated before zip ever sees it.

unexpected behaviour of numpy.median on masked arrays

I have a question about the behaviour of numpy.median() on masked arrays created with numpy.ma.masked_array().
As I've understood from debugging my own code, numpy.median() does not work as expected on masked arrays (see Using numpy.median on a masked array for a definition of the problem)
The answer provided was:
Explanation: If I remember correctly, the np.median does not support subclasses, so it fails to work correctly on np.ma.MaskedArray.
The conclusion, therefore, is that the way to calculate the median of the elements in a masked array is to use numpy.ma.median(), since this is a median function dedicated to masked arrays.
My problem lies in the fact that I've just spent a considerable amount of time tracking this down, since there is no way of knowing about the problem in advance.
There is no warning or exception raised when trying to calculate the median of a masked array via numpy.median().
The answer returned by this function is not what is expected, and this causes serious problems when people are not aware of it.
Does anyone know if this might be considered a bug?
In my opinion, the expected behaviour should be that using numpy.median on a masked array raises an exception of some sort.
Any thoughts?
The below test script shows the unwanted and unexpected behaviour of using numpy.median on a masked array (note that the correct and expected median value of the valid elements is 2.5):
In [1]: import numpy as np
In [2]: test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
In [3]: valid_elements = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
In [4]: testm = np.ma.masked_array(test, ~valid_elements)
In [5]: testm
Out[5]:
masked_array(data = [1 2 3 4 -- -- -- --],
             mask = [False False False False  True  True  True  True],
       fill_value = 999999)
In [6]: np.median(test)
Out[6]: 52.0
In [7]: np.median(test[valid_elements])
Out[7]: 2.5
In [8]: np.median(testm)
Out[8]: 4.0
In [9]: np.ma.median(testm)
Out[9]: 2.5
Does anyone know if this might be considered a bug?
Well, it is a Bug! I posted it a few months ago on their issue tracker (Link to the bug report).
The reason for this behaviour is that np.median uses the partition method of the input-array but np.ma.MaskedArray doesn't override the partition method. So when arr.partition is called in np.median it simply defaults to the basic numpy.ndarray.partition method (which is bogus for a masked array!).
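Until that is fixed, the safe pattern is to be explicit about the mask: either call the dedicated np.ma.median, or take a plain median over only the valid data via .compressed(). A sketch with the arrays from the session above:
import numpy as np
test = np.array([1, 2, 3, 4, 100, 100, 100, 100])
testm = np.ma.masked_array(test, mask=[False] * 4 + [True] * 4)
print(np.ma.median(testm))             # 2.5 -- the masked-aware routine
print(np.median(testm.compressed()))   # 2.5 -- plain median over the valid data only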

numpy: Invalid value encountered in true_divide

I have two numpy arrays and I am trying to divide one by the other; at the same time, I want to make sure that the entries where the divisor is 0 are just replaced with 0.
So, I do something like:
log_norm_images = np.where(b_0 > 0, np.divide(diff_images, b_0), 0)
This gives me a run time warning of:
RuntimeWarning: invalid value encountered in true_divide
Now, I wanted to see what was going on and I did the following:
xx = np.isfinite(diff_images)
print (xx[xx == False])
xx = np.isfinite(b_0)
print (xx[xx == False])
However, both of these return empty arrays meaning that all the values in the arrays are finite. So, I am not sure where the invalid value is coming from. I am assuming checking b_0 > 0 in the np.where function takes care of the divide by 0.
The shape of the two arrays are (96, 96, 55, 64) and (96, 96, 55, 1)
You may have a NAN, INF, or NINF floating around somewhere. Try this:
np.isfinite(diff_images).all()
np.isfinite(b_0).all()
If one or both of those returns False, that's likely the cause of the runtime warning.
The reason you get the runtime warning when running this:
log_norm_images = np.where(b_0 > 0, np.divide(diff_images, b_0), 0)
is that the inner expression
np.divide(diff_images, b_0)
gets evaluated first, and is run on all elements of diff_images and b_0 (even though you end up ignoring the elements that involve division-by-zero). In other words, the warning happens before the code that ignores those elements. That is why it's a warning and not an error: there are legitimate cases like this one where the division-by-zero is not a problem because it's being handled in a later operation.
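If you only want to silence the warning for this one expression, the np.errstate context manager scopes the suppression to a with-block. A minimal sketch, with small stand-ins for the question's diff_images and b_0:
import numpy as np
diff_images = np.array([1.0, 2.0, 3.0])   # stand-in for the real (96, 96, 55, 64) array
b_0 = np.array([2.0, 0.0, 4.0])           # stand-in for the real (96, 96, 55, 1) array
with np.errstate(invalid='ignore', divide='ignore'):
    log_norm_images = np.where(b_0 > 0, np.divide(diff_images, b_0), 0)
print(log_norm_images)                    # [0.5  0.   0.75]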
Another useful NumPy function is np.nan_to_num(diff_images).
By default it replaces, elementwise, NaN with zero, -inf with a very large negative number, and +inf with a very large positive number.
You can change the defaults; see https://numpy.org/doc/stable/reference/generated/numpy.nan_to_num.html
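A quick sketch of the defaults:
import numpy as np
x = np.array([np.nan, np.inf, -np.inf, 2.0])
print(np.nan_to_num(x))
# nan -> 0.0, +inf -> ~1.798e+308, -inf -> ~-1.798e+308, 2.0 unchanged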
As @drammock pointed out, the cause of the warning is that some of the values in b_0 are 0, and the runtime warning is generated before the np.where is evaluated. While @Luca's suggestion of wrapping the expression in with np.errstate(invalid='ignore', divide='ignore'): will prevent the warning in this case, there may be other legitimate cases where the warning should still surface. For instance, if corresponding elements of b_0 and diff_images were both np.inf, the division would return np.nan.
So to suppress the warning for the known case (i.e. b_0 = 0) while still allowing warnings for unknown cases, evaluate the np.where first, then evaluate the arithmetic:
# First, create log_norm_images filled with the fallback value
log_norm_images = np.zeros(b_0.shape)
# Now get the valid indexes
valid = np.where(b_0 > 0)
# Lastly, evaluate the division only at the valid indexes
log_norm_images[valid] = np.divide(diff_images[valid], b_0[valid])
Another option is an object array that keeps None as the flag for the undefined entries:
import numpy as np
num = np.array([1, 2, 3, 4, 5])
den = np.array([1, 1, 0, 1, 1])
res = np.array([None] * 5)   # object dtype, so it can hold both floats and None
ix = (den != 0)
res[ix] = np.divide(num[ix], den[ix])
print(res)
# [1.0 2.0 None 4.0 5.0]

numpy.random.multinomial bad outputs?

I have this function:
import numpy as np
def unhot(vec):
    """ takes a one-hot vector and returns the corresponding integer """
    assert np.sum(vec) == 1  # this assertion shouldn't fail, but it did...
    return list(vec).index(1)
that I call on the output of a call to:
numpy.random.multinomial(1, coe)
and I got an assertion error at some point when I ran it. How is this possible? Isn't the output of numpy.random.multinomial guaranteed to be a one-hot vector?
Then I removed the assertion error, and now I have:
ValueError: 1 is not in list
Is there some fine-print I am missing, or is this just broken?
Well, this is the problem, and I should've realized, because I've encountered it before:
np.random.multinomial(1, np.array([0., 0., np.nan, 0.]))
returns
array([0, 0, -9223372036854775807, 0])
I was using an unstable softmax implementation that gave the NaNs.
Now, I was trying to ensure that the parameters I passed to multinomial had a sum <= 1, but I did it like this:
coe = softmax(coeffs)
while np.sum(coe) > 1 - 1e-9:
    coe /= (1 + 1e-5)
and with NaNs in there, the while condition will never even be triggered, I think, since any comparison with NaN evaluates to False.
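For what it's worth, the usual fix for an unstable softmax is to subtract the maximum before exponentiating, so np.exp never overflows to inf (and inf/inf is exactly what produces the NaNs). A minimal sketch; the question's softmax isn't shown, so this signature is an assumption:
import numpy as np

def stable_softmax(x):
    # shifting by the max makes the largest exponent exp(0) = 1,
    # so np.exp cannot overflow and the normalization stays finite
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

print(stable_softmax(np.array([1000.0, 1000.0])))  # [0.5 0.5]; the naive version gives [nan nan]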

Division by zero in numpy (sub)arrays

I have three arrays that are processed with a mathematical function to get a final result array. Some of the arrays contain NaNs and some contain 0. A division by zero logically raises a warning, and a calculation with NaN gives NaN. So I'd like to do certain operations on certain parts of the arrays where zeros are involved:
import numpy
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
1.0*n*numpy.exp(r*(1-(n/k)))
e.g. in cases where k == 0, I'd like to get 0 as the result. In all other cases I'd like to calculate the function above. So what is the way to do such calculations on parts of the array (via indexing) to get a final single result array?
import numpy
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
indxZeros=numpy.where(k==0)
indxNonZeros=numpy.where(k!=0)
d=numpy.empty(k.shape)
d[indxZeros]=0
d[indxNonZeros]=n[indxNonZeros]/k[indxNonZeros]
print(d)
Is the following what you need?
>>> rv = 1.0*n*numpy.exp(r*(1-(n/k)))
>>> rv[k==0] = 0
>>> rv
array([ nan, 0., nan])
So, you may think that the solution to this problem is to use numpy.where, but the following:
numpy.where(k==0, 0, 1.0*n*numpy.exp(r*(1-(n/k))))
still gives a warning, as the expression is actually evaluated for the cases where k is zero, even if those results aren't used.
If this really bothers you, you can use numexpr for this expression, which will actually branch on the where statement and not evaluate the k==0 case:
import numexpr
numexpr.evaluate('where(k==0, 0, 1.0*n*exp(r*(1-(n/k))))')
Another way, based on indexing as you asked for, involves a little loss in legibility
result = numpy.zeros_like(k)
good = k != 0
result[good] = 1.0*n[good]*numpy.exp(r[good]*(1-(n[good]/k[good])))
This can be mitigated somewhat by factoring the expression into a function:
def gaussian(r, k, n):
    return 1.0*n*numpy.exp(r*(1-(n/k)))
result = numpy.zeros_like(k)
good = k != 0
result[good] = gaussian(r[good], k[good], n[good])
