I have three arrays that are processed with a mathematical function to get a final result array. Some of the arrays contain NaNs and some contain 0. However, a division by zero raises a warning, and a calculation with NaN gives NaN. So I'd like to do certain operations only on the parts of the arrays where zeros are involved:
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
1.0*n*numpy.exp(r*(1-(n/k)))
e.g. in cases where k == 0, I'd like to get 0 as a result. In all other cases I'd like to calculate the function above. So what is the way to do such calculations on parts of the array (via indexing) to get a single final result array?
import numpy
r=numpy.array([3,3,3])
k=numpy.array([numpy.nan,0,numpy.nan])
n=numpy.array([numpy.nan,0,0])
indxZeros=numpy.where(k==0)
indxNonZeros=numpy.where(k!=0)
d=numpy.empty(k.shape)
d[indxZeros]=0
d[indxNonZeros]=n[indxNonZeros]/k[indxNonZeros]
print(d)
Is the following what you need?
>>> rv = 1.0*n*numpy.exp(r*(1-(n/k)))
>>> rv[k==0] = 0
>>> rv
array([ nan, 0., nan])
So, you may think that the solution to this problem is to use numpy.where, but the following:
numpy.where(k==0, 0, 1.0*n*numpy.exp(r*(1-(n/k))))
still gives a warning, as the expression is actually evaluated for the cases where k is zero, even if those results aren't used.
If this really bothers you, you can use numexpr for this expression, which will actually branch on the where statement and not evaluate the k==0 case:
import numexpr
numexpr.evaluate('where(k==0, 0, 1.0*n*exp(r*(1-(n/k))))')
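If numexpr is not an option, another possibility (my addition, not part of the original answers; it assumes that silencing the warning locally is acceptable) is to evaluate the expression inside numpy.errstate, which suppresses the floating-point warnings just for that block:

import numpy
with numpy.errstate(divide='ignore', invalid='ignore'):
    result = numpy.where(k==0, 0, 1.0*n*numpy.exp(r*(1-(n/k))))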
Another way, based on indexing as you asked for, involves a small loss in legibility:
result = numpy.zeros_like(k)
good = k != 0
result[good] = 1.0*n[good]*numpy.exp(r[good]*(1-(n[good]/k[good])))
The loss in legibility can be mitigated somewhat by wrapping the expression in a function:
def gaussian(r, k, n):
    return 1.0*n*numpy.exp(r*(1-(n/k)))
result = numpy.zeros_like(k)
good = k != 0
result[good] = gaussian(r[good], k[good], n[good])
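With the example arrays from the question, this reproduces the result from the first answer (my own quick check):

>>> result
array([ nan, 0., nan])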
Related
I'm writing unit tests for my simulation and want to check that for specific parameters the result, a numpy array, is zero. Due to calculation inaccuracies, small values are also accepted (1e-7). What is the best way to assert this array is close to 0 in all places?
np.testing.assert_array_almost_equal(a, np.zeros(a.shape)) and assert_allclose fail, as the relative difference is inf (or 1 if you switch the arguments) (docs).
I feel like np.testing.assert_array_almost_equal_nulp(a, np.zeros(a.shape)) is not precise enough: it compares the difference to the spacing, so it is always true for nulp >= 1 and false otherwise, but it says nothing about the magnitude of a (docs).
Using np.testing.assert_(np.all(np.absolute(a) < 1e-7)), based on this question, does not give any of the detailed output I am used to from other np.testing methods.
Is there another way to test this? Maybe another testing package?
If you compare a numpy array with all zeros, you can use the absolute tolerance, as the relative tolerance does not make sense here:
import numpy as np
from numpy.testing import assert_allclose

def test_zero_array():
    a = np.array([0, 1e-07, 1e-08])
    assert_allclose(a, 0, atol=1e-07)
The rtol value does not matter in this case, as it is multiplied by 0 when calculating the tolerance:
atol + rtol * abs(desired)
Update: Replaced np.zeros_like(a) with the simpler scalar 0. As pointed out by @hintze, np array comparisons also work against scalars.
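For illustration (my own example, not from the original answer), the default call fails precisely because the allowed error collapses to zero when the desired value is 0, while an explicit atol passes:

import numpy as np
from numpy.testing import assert_allclose

a = np.array([0, 1e-07, 1e-08])
# assert_allclose(a, 0)            # would fail: tolerance is atol + rtol*|0| = 0
assert_allclose(a, 0, atol=1e-07)  # passes: |a| <= 1e-07 everywhere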
Question
Suppose we are given a numpy array arr of doubles and a small positive integer n. I am looking for an efficient way to set the n least significant entries of each element of arr to 0 or to 1. Is there a ufunc for that? If not, are there suitable C functions that I could apply to the elements from Cython?
Motivation
Below I will provide the motivation for the question. If you find that an answer to the question above is not needed to reach the end goal, I am happy to receive comments to that effect; I will then create a separate question to keep things tidy.
The motivation for this question is to implement a version of np.unique(arr, True) that accepts a relative tolerance parameter. The second argument of np.unique matters here: I need to know the indices of the unique elements (first occurrence!) in the original array. It is not important that the elements are sorted.
I am aware of questions and solutions on np.unique with tolerance. However, I have not found a solution that also returns the indices of the first occurrences of the unique elements in the original array. Furthermore, the solutions I have seen are based on sorting, which runs in O(arr.size log(arr.size)), whereas a hash map allows a solution that runs in linear time on average.
The idea is to round each element in arr up and down and to put both values in a hash map. If either of the values is in the hash map already, the element is ignored. Otherwise, the element is included in the result. As insertion and lookup run in constant average time for hash maps, this method should in theory be faster than a sorting-based method.
Below find my Cython implementation:
import numpy as np
cimport numpy as np
import cython
from libcpp.unordered_map cimport unordered_map

# Assumed type definitions (not shown in the original snippet):
ctypedef np.float64_t DOUBLE_t
ctypedef np.int64_t INT_t

@cython.boundscheck(False)
@cython.wraparound(False)
def unique_tol(np.ndarray[DOUBLE_t, ndim=1] lower,
               np.ndarray[DOUBLE_t, ndim=1] higher):
    cdef long i, count
    cdef long endIndex = lower.size
    cdef unordered_map[double, short] vals = unordered_map[double, short]()
    cdef np.ndarray[DOUBLE_t, ndim=1] result_vals = np.empty_like(lower)
    cdef np.ndarray[INT_t, ndim=1] result_indices = np.empty_like(lower,
                                                                  dtype=int)
    count = 0
    for i in range(endIndex):
        if not vals.count(lower[i]) and not vals.count(higher[i]):
            # insert in result
            result_vals[count] = lower[i]
            result_indices[count] = i
            # put lower[i] and higher[i] in the hash map
            vals[lower[i]]
            vals[higher[i]]
            # update the index in the result
            count += 1
    return result_vals[:count], result_indices[:count]
This method, called with appropriate rounding, does the job. For example, if differences smaller than 10^-6 are to be ignored, we would write
unique_tol(np.round(a, 6), np.round(a+1e-6, 6))
Now I would like to replace np.round with a relative rounding procedure based on manipulation of the mantissa. I am aware of alternative ways of relative rounding, but I think manipulating the mantissa directly should be more efficient and elegant. (Admittedly, I do not think that the performance gain is significant. But I would be interested in the solution.)
EDIT
The solution by Warren Weckesser works like a charm. However, the result is not usable the way I was hoping, since two numbers with a very small difference can have different exponents, and unifying the mantissa will then not lead to similar numbers. I guess I have to stick with the relative rounding solutions that are out there.
"I am looking for an efficient way to set the n least significant entries of each element of arr to 0 or to 1."
You can create a view of the array with data type numpy.uint64, and then manipulate the bits in that view as needed.
For example, I'll set the lowest 21 bits in the mantissa of this array to 0.
In [46]: np.set_printoptions(precision=15)
In [47]: x = np.array([0.0, -1/3, 1/5, -1/7, np.pi, 6.02214076e23])
In [48]: x
Out[48]:
array([ 0.000000000000000e+00, -3.333333333333333e-01,
2.000000000000000e-01, -1.428571428571428e-01,
3.141592653589793e+00, 6.022140760000000e+23])
Create a view of the data in x with data type numpy.uint64:
In [49]: u = x.view(np.uint64)
Take a look at the binary representation of the values.
In [50]: [np.binary_repr(t, width=64) for t in u]
Out[50]:
['0000000000000000000000000000000000000000000000000000000000000000',
'1011111111010101010101010101010101010101010101010101010101010101',
'0011111111001001100110011001100110011001100110011001100110011010',
'1011111111000010010010010010010010010010010010010010010010010010',
'0100000000001001001000011111101101010100010001000010110100011000',
'0100010011011111111000011000010111001010010101111100010100010111']
Set the lower n bits to 0, and take another look.
In [51]: n = 21
In [52]: u &= ~np.uint64(2**n-1)
In [53]: [np.binary_repr(t, width=64) for t in u]
Out[53]:
['0000000000000000000000000000000000000000000000000000000000000000',
'1011111111010101010101010101010101010101010000000000000000000000',
'0011111111001001100110011001100110011001100000000000000000000000',
'1011111111000010010010010010010010010010010000000000000000000000',
'0100000000001001001000011111101101010100010000000000000000000000',
'0100010011011111111000011000010111001010010000000000000000000000']
Because u is a view of the same data as in x, x has also been modified in-place.
In [54]: x
Out[54]:
array([ 0.000000000000000e+00, -3.333333332557231e-01,
1.999999999534339e-01, -1.428571428405121e-01,
3.141592653468251e+00, 6.022140758954589e+23])
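For convenience, the technique can be wrapped in a small helper that works on a copy, so the input array is left untouched (this wrapper and its name are my own, not part of the original answer):

import numpy as np

def zero_low_mantissa_bits(x, n):
    out = np.asarray(x, dtype=np.float64).copy()  # work on a copy, keep the input intact
    u = out.view(np.uint64)                       # reinterpret the float64 bits as uint64
    u &= ~np.uint64(2**n - 1)                     # clear the n least significant mantissa bits
    return out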
Similar to @WarrenWeckesser's, but without the black magic, using "official" ufuncs instead. Downside: I'm pretty sure it's slower, quite possibly significantly so:
>>> a = np.random.normal(size=10)**5
>>> a
array([ 9.87664561e-12, -1.79654870e-03, 4.36740261e-01, 7.49256141e+00,
-8.76894617e-01, 2.93850753e+00, -1.44149959e-02, -1.03026094e-03,
3.18390143e-03, 3.05521581e-03])
>>>
>>> mant,expn = np.frexp(a)
>>> mant
array([ 0.67871792, -0.91983293, 0.87348052, 0.93657018, -0.87689462,
0.73462688, -0.92255974, -0.5274936 , 0.81507877, 0.78213525])
>>> expn
array([-36, -9, -1, 3, 0, 2, -6, -9, -8, -8], dtype=int32)
>>> a_binned = np.ldexp(np.round(mant,5),expn)
>>> a_binned
array([ 9.87667590e-12, -1.79654297e-03, 4.36740000e-01, 7.49256000e+00,
-8.76890000e-01, 2.93852000e+00, -1.44150000e-02, -1.03025391e-03,
3.18390625e-03, 3.05523437e-03])
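The same steps can be bundled into a small helper (the name and the default number of digits are mine; this is just a sketch of the approach shown above):

import numpy as np

def round_mantissa(a, decimals=5):
    # Split into mantissa and exponent, round the mantissa to the given number
    # of decimal digits, and reassemble the floats.
    mant, expn = np.frexp(a)
    return np.ldexp(np.round(mant, decimals), expn)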
Using python 2.7, scipy 1.0.0-3
Apparently I have a misunderstanding of how the numpy where function is supposed to operate, or there is a known bug in its operation. I'm hoping someone can tell me which, and explain a work-around to suppress the annoying warning that I am trying to avoid. I get the same behavior when I use the pandas Series where().
To make it simple, I'll use a numpy array as my example. Say I want to apply np.log() to the array, and only where a value is a valid input, i.e. myArray > 0.0. For values where the function should not be applied, I want to set an output flag of -999.9:
myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
np.where(myArray>0.0, np.log(myArray), -999.9)
I expected numpy.where() not to complain about the 0.0 value in the array, since the condition is False there, yet it does, and it appears to actually evaluate the expression for that False case:
-c:2: RuntimeWarning: divide by zero encountered in log
array([ 0.00000000e+00, -2.87682072e-01, -6.93147181e-01,
-1.38629436e+00, -9.99900000e+02])
The numpy documentation states:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
I beg to differ with this statement since
[np.log(val) if val>0.0 else -999.9 for val in myArray]
provides no warning at all:
[0.0, -0.2876820724517809, -0.69314718055994529, -1.3862943611198906, -999.9]
So, is this a known bug? I don't want to suppress the warning for my entire code.
You can have the log evaluated only at the relevant places by using its optional where parameter:
np.where(myArray>0.0, np.log(myArray, where=myArray>0.0), -999.9)
or more efficiently
mask = myArray > 0.0
np.where(mask, np.log(myArray, where=mask), -999)
or if you find the "double where" ugly
np.log(myArray, where=myArray>0.0, out=np.full(myArray.shape, -999.9))
Any one of those three should suppress the warning.
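For reference, with the example array the third form gives the following (my own check; the log values match the ones shown in the question):

import numpy as np

myArray = np.array([1.0, 0.75, 0.5, 0.25, 0.0])
mask = myArray > 0.0
result = np.log(myArray, where=mask, out=np.full(myArray.shape, -999.9))
# result: [0., -0.28768207, -0.69314718, -1.38629436, -999.9], with no warning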
This behavior of where is understandable with a basic grasp of how Python evaluates expressions. The call is a Python expression that uses a couple of numpy functions.
What happens in this expression?
np.where(myArray>0.0, np.log(myArray), -999.9)
The interpreter first evaluates all the arguments of the function, and then passes the results to the where. Effectively then:
cond = myArray>0.0
A = np.log(myArray)
B = -999.9
np.where(cond, A, B)
The warning is produced in the 2nd line, not in the 4th.
The 4th line is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(cond, A, B)]
or
[A[i] if c else B for i,c in enumerate(cond)]
np.where is most often used with one argument, where it is a synonym for np.nonzero. We don't see this three-argument form that often on SO. It isn't that useful, in part because it doesn't save on calculations.
Masked assignment is used more often, especially when there are more than 2 alternatives.
In [123]: mask = myArray>0
In [124]: out = np.full(myArray.shape, np.nan)
In [125]: out[mask] = np.log(myArray[mask])
In [126]: out
Out[126]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
Paul Panzer showed how to do the same with the where parameter of log. That feature isn't being used as much as it could be.
In [127]: np.log(myArray, where=mask, out=out)
Out[127]: array([ 0. , -0.28768207, -0.69314718, -1.38629436, nan])
This is not a bug. See this related answer to a similar question. The example in the docs is misleading, but that answer looks at it in detail.
The issue is that ternary expressions are evaluated lazily by the interpreter, while numpy.where is a regular function. Ternary expressions therefore allow short-circuiting, whereas this is not possible when the arguments are fully evaluated beforehand.
In other words, the arguments of numpy.where are calculated before the Boolean array is processed.
You may think this is inefficient: why build 2 separate arrays and then use a 3rd Boolean array to decide which item to choose? Surely that's double the work / double the memory?
However, this inefficiency is more than offset by the vectorisation provided by numpy functions acting on an entire array, e.g. np.log(arr).
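If you want to see the effect, here is a rough timing sketch (my own, with an assumed array size; the exact numbers will depend on your machine):

import timeit
import numpy as np

arr = np.random.rand(100_000) + 0.5   # assumed test data, strictly positive

# vectorised: both branches are computed, but each is a single C-level loop
t_where = timeit.timeit(lambda: np.where(arr > 0, np.log(arr), -999.9), number=10)

# element-wise Python loop, as in the docs' "equivalent" comprehension
t_loop = timeit.timeit(lambda: [np.log(v) if v > 0 else -999.9 for v in arr], number=10)

print(t_where, t_loop)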
Consider the example provided in the docs:
If x and y are given and input arrays are 1-D, where is equivalent to:
[xv if c else yv for (c,xv,yv) in zip(condition,x,y)]
Notice the inputs are arrays. Try running:
c = np.array([0])
result = [xv if c else yv for (c, xv, yv) in zip(c==0, np.array([1]), np.log(c))]
You will notice that this still triggers the divide-by-zero warning, because np.log(c) is evaluated in full before the comprehension even starts iterating.
For writing “piecewise functions” in Python, I'd normally use if (in either the control-flow or ternary-operator form).
def spam(x):
    return x+1 if x>=0 else 1/(1-x)
Now, with NumPy, the mantra is to avoid working on single values in favour of vectorisation, for performance. So I reckon something like this would be preferred (as Leon remarks, the following is actually wrong):
def eggs(x):
    y = np.zeros_like(x)
    positive = x>=0
    y[positive] = x+1
    y[np.logical_not(positive)] = 1/(1-x)
    return y
(Correct me if I've missed something here, because frankly I find this very ugly.)
Now, of course eggs will only work if x is actually a NumPy array, because otherwise x>=0 simply yields a single boolean, which can't be used for indexing (at least doesn't do the right thing).
Is there a good way to write code that looks more like spam but works idiomatically on NumPy arrays, or should I just use vectorize(spam)?
Use np.where. You'll get an array as the output even for plain number input, though.
def eggs(x):
    y = np.asarray(x)
    return np.where(y>=0, y+1, 1/(1-y))
This works for both arrays and plain numbers:
>>> eggs(5)
array(6.0)
>>> eggs(-3)
array(0.25)
>>> eggs(np.arange(-3, 3))
/home/praveen/.virtualenvs/numpy3-mkl/bin/ipython3:2: RuntimeWarning: divide by zero encountered in true_divide
array([ 0.25 , 0.33333333, 0.5 , 1. , 2. , 3. ])
>>> eggs(1)
/home/praveen/.virtualenvs/numpy3-mkl/bin/ipython3:3: RuntimeWarning: divide by zero encountered in long_scalars
# -*- coding: utf-8 -*-
array(2.0)
As ayhan remarks, this raises a warning, since 1/(1-x) gets evaluated for the whole range. But a warning is just that: a warning. If you know what you're doing, you can ignore the warning. In this case, you're only choosing 1/(1-x) from indices where it can never be inf, so you're safe.
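If you do want a clean console, one variation (my own, not part of this answer) is to silence the warning locally with np.errstate:

import numpy as np

def eggs(x):
    y = np.asarray(x, dtype=float)
    with np.errstate(divide='ignore'):          # ignore the 1/(1-y) warning at y == 1
        return np.where(y >= 0, y + 1, 1/(1 - y))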
I would use numpy.asarray (which is a no-op if the argument is already a NumPy array) if I want to handle both numbers and numpy arrays:
def eggs(x):
    x = np.asfarray(x)
    m = x>=0
    x[m] = x[m] + 1
    x[~m] = 1 / (1 - x[~m])
    return x
(here I used asfarray to enforce a floating-point type, since your function requires floating-point computations).
This is less efficient than your spam function for single inputs, and arguably uglier. However it seems to be the easiest choice.
EDIT: If you want to ensure that x is not modified (as pointed out by Leon), you can replace np.asfarray(x) with np.array(x, dtype=np.float64); the array constructor copies by default.
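For comparison with the np.where version above, a quick check (my own example) shows this masked version does not emit the warning:

result = eggs(np.arange(-3.0, 3.0))
# -> [0.25, 0.33333333, 0.5, 1., 2., 3.], and no warning, since 1/(1-x) is only
#    evaluated for the negative entries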
Basically I have an array that may vary between any two numbers, and I want to preserve the distribution while constraining it to the [0,1] space. The function to do this is very very simple. I usually write it as:
def to01(array):
    array -= array.min()
    array /= array.max()
    return array
Of course it can and should be more complex to account for tons of situations, such as all the values being the same (divide by zero) and float vs. integer division (use np.subtract and np.divide instead of operators). But this is the most basic.
The problem is that I do this very frequently across stuff in my project, and it seems like a fairly standard mathematical operation. Is there a built in function that does this in NumPy?
Don't know if there's a builtin for that (probably not, it's not really a difficult thing to do as is). You can use vectorize to apply a function to all the elements of the array:
import numpy

def to01(array):
    a = array.min()
    # ignore the RuntimeWarning when max == min
    with numpy.errstate(divide='ignore'):
        b = 1. / (array.max() - array.min())
    if not numpy.isfinite(b):
        b = 0
    return numpy.vectorize(lambda x: b * (x - a))(array)
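A quick usage check of the function above (my own example, using the import and definition just shown):

result = to01(numpy.array([2.0, 4.0, 6.0]))   # -> [0., 0.5, 1.]
result = to01(numpy.array([3.0, 3.0, 3.0]))   # -> [0., 0., 0.]  (constant input: b falls back to 0)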