shortcut evaluation of numpy's array comparison - python

In numpy if I want to compare two arrays, say for example I want to test if all elements in A are less than values in B, I use if (A < B).all():. But in practice this requires allocation and evaluation of complete array C = A < B and then calling C.all() on it. This is a bit of waste. Is there any way to 'shortcut' the comparison, i.e. directly evaluate A < B element by element (without allocation and calculation of temporary C) and stop and return False when first invalid element comparison is found?

Plain Python and and or use shortcut evaluation, but numpy does not.
(A < B).all()
uses numpy building blocks, the broadcasting, the element by element comparison with < and the all reduction. The < works just other binary operations, plus, times, and, or, gt, le, etc. And all is like other reduction methods, any, max, sum, mean, and can operate on the whole array or by rows or by columns.
It is possible to write a function that combines the all and < into one iteration, but it would be difficult to get the generality that I just described.
But if you must implement an iterative solution, with a shortcut action, and do it fast, I'd suggest developing the idea with nditer, and then compile it with cython.
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html is a good tutorial on using nditer, and it takes you through using it in cython. nditer takes care of broadcasting and iteration, letting you concentrate on the comparison and any shortcutting.
Here's a sketch of an iterator that could be cast into cython:
import numpy as np
a = np.arange(4)[:,None]
b = np.arange(2,5)[None,:]
c = np.array(True)
it = np.nditer([a, b, c], flags=['reduce_ok'],
op_flags = [['readonly'], ['readonly'],['readwrite']])
for x, y, z in it:
z[...] = x<y
if not z:
print('>',x,y)
break
else:
print(x,y)
print(z)
with a sample run:
1420:~/mypy$ python stack34852272.py
(array(0), array(2))
(array(0), array(3))
(array(0), array(4))
(array(1), array(2))
(array(1), array(3))
(array(1), array(4))
('>', array(2), array(2))
False
Start with a default False, and a different break condition and you get a shortcutting any. Generalizing the test to handle <, <=, etc will be more work.
Get something like this working in Python, and then try it in Cython. If you have trouble with that step, come back with a new question. SO has a good base of Cython users.

How large are you arrays? I would imagine they are very large, e.g. A.shape = (1000000) or larger before performance becomes an issue. Would you consider using numpy views?
Instead of comparing (A < B).all() or (A < B).any() you can try defining a view, such as (A[:10] < B[:10]).all(). Here's a simple loop that might work:
k = 0
while( (A[k*10: (k+1)*10] < B[k*10: (k+1)*10] ).all() ):
k += 1
Instead of 10 you can use 100 or 10**3 segment size you wish. Obviously if your segment size is 1, you are saying:
k = 0
while ( A[k] < B[k] ):
k+= 1
Sometimes, comparing the entire array can become memory intensive. If A and B have length of 10000 and I need to compare each pair of elements, I am going to run out of space.

Related

Numpy - writing a function in vector form?

I'm quite new to NumPy (or SciPy) and coming from Octave/Matlab, this seems a bit challenging to me.
I'm reading through the docs and writing some basic functions. I came across this section: Vectorizing functions (vectorize)
It defines this function:
def addsubtract(a, b):
if a > b:
return a - b
else:
return a + b
Then vectorizes it:
vec_addsubtract = np.vectorize(addsubtract)
But at the end, it says:
This particular function could have been written in vector form without the use of vectorize.
I wouldn't know any other way to write such function. So what is the vector form?
np.vectorize is a glorified python for loop, which means that it effectively strips away any optimizations that numpy offers.
To actually vectorize addsubtract, we can use the fact that numpy offers three things: a vectorized add function, a vectorized subtract function, and all sorts of boolean mask operations.
The simplest, but least efficient, way to write this is using np.where:
np.where(a > b, a - b, a + b)
This is inefficient because it pre-computes a - b and a + b in all cases, and then selects from one or the other for each element.
A more efficient solution would only compute the values where the condition required it:
result = np.empty_like(a)
mask = a > b
np.subtract(a, b, where=mask, out=result)
np.add(a, b, where=~mask, out=result)
For very small arrays, the overhead of the complicated method makes it less worthwhile. But for large arrays, it's the fastest solution.
Fun fact: the page in the tutorial you are referencing will not be available in future versions of the SciPy tutorial exactly because it is an intro to NumPy, as explained in PR #12432.
You can do this with np.where, which computes both results (a-b and a+b) and selects the values depending on an boolean array (a>b):
def addsubtract(a, b):
return np.where(a>b, a-b, a+b)
It can be seen as a vectorized ternary operator: "Where a>b, take the value from a-b, else take the value from a+b".
Despite computing both possible results, it was significantly faster than the vectorized if/else function you wrote (at least on my machine).

How to effectively check if numpy array can be cast to another integer type?

Let's say I have a numpy array of some integer type (say np.int64) and want to cast it to another type (say np.int8). How can I most effectively check if the operation is safe (preserving all values)?
There are two approaches I've come up with:
Approach 1: Use the type information
def is_safe(data, new_type):
if np.can_cast(data, new_type):
return True # Handle the trivial allowed cases
type_info = np.iinfo(new_type)
return np.all((data >= type_info.min) & (data <= type_info.max))
Approach 2: Use np.can_cast on all items
def is_safe(data, new_type):
if np.can_cast(data, new_type):
return True # Handle the trivial allowed cases
return all(np.can_cast(item, new_type) for item in np.nditer(item))
Both of these approaches seem to be valid (and work for trivial cases) but are they correct and efficient? Is there another, better approach?
P.S. To complicate things further, np.can_cast(np.int8, np.uint64) returns False (naturally) so changing between signed and unsigned integers has to be checked somewhat separately.
If you already know that the array is of a NumPy integer type, then the only check needed is that the values are within the range specified by min/max of the target integer range. This is a much simpler check than the generic can_cast, which has no a priori knowledge of the things it is fed. Consequently, can_cast takes longer. I tested this on casting integers 0-99 from np.int64 to np.int8.
So, while both approaches are correct, the first one is preferable if you know that data is a NumPy integer array.
>>> timeit.timeit("np.all((data >= type_info.min) & (data <= type_info.max))", setup="import numpy as np\ndata = np.array(range(100), dtype=np.int64)\ntype_info = np.iinfo(np.int8)")
6.745509549000417
>>> timeit.timeit("all(np.can_cast(item, np.uint8) for item in np.nditer(data))", setup="import numpy as np\ndata = np.array(range(100), dtype=np.int64)")
51.0065170609887
It is slightly faster (20% or so) to assign the min and max values to new variables:
type_info = np.iinfo(new_type)
a = type_info.min
b = type_info.max
return np.all((data >= a) & (data <= b))

Allowing for deviations in exact values during matrix multiplication, python

I need to solve this:
Check if AT * n * A = n, where A is the test matrix, AT is the transposed test matrix and n = [[1,0,0,0],[0,-1,0,0],[0,0,-1,0],[0,0,0,-1]].
I don't know how to check for equality due to the numerical errors in the float multiplication. How do I go about doing this?
Current code:
def trans(A):
n = numpy.matrix([[1,0,0,0],[0,-1,0,0],[0,0,-1,0],[0,0,0,-1]])
c = numpy.matrix.transpose(A) * n * numpy.matrix(A)
Have then tried
>if c == n:
return True
I have also tried assigning variables to every element of matrix and then checking that each variable is within certain limits.
Typically, the way that numerical-precision limitations are overcome is by allowing for some epsilon (or error-value) between the actual value and expected value that is still considered 'equal'. For example, I might say that some value a is equal to some value b if they are within plus/minus 0.01. This would be implemented in python as:
def float_equals(a, b, epsilon):
return abs(a-b)<epsilon
Of course, for matrixes entered as lists, this isn't quite so simple. We have to check if all values are within the epsilon to their partner. One example solution would be as follows, assuming your matrices are standard python lists:
from itertools import product # need this to generate indexes
def matrix_float_equals(A, B, epsilon):
return all(abs(A[i][j]-B[i][j])<epsilon for i,j in product(xrange(len(A)), repeat = 2))
all returns True iff all values in a list are True (list-wise and). product effectively dot-products two lists, with the repeat keyword allowing easy duplicate lists. Therefore given a range repeated twice, it will produce a list of tuples for each index. Of course, this method of index generation assumes square, equally-sized matrices. For non-square matrices you have to get more creative, but the idea is the same.
However, as is typically the way in python, there are libraries that do this kind of thing for you. Numpy's allclose does exactly this; compares two numpy arrays for equality element-wise within some tolerance. If you're working with matrices in python for numeric analysis, numpy is really the way to go, I would get familiar with its basic API.
If a and b are numpy arrays or matrices of the same shape, then you can use allclose:
if numpy.allclose(a, b): # a is approximately equal to b
# do something ...
This checks that for all i and all j, |aij - bij| < εa for some absolute error εa (by default 10-5) and that |aij - bij| < |bij| εr for some relative error εr (by default 10-8). Thus it is safe to use, even if your calculations introduce numerical errors.

NumPy array to bounded by 0 and 1?

Basically I have an array that may vary between any two numbers, and I want to preserve the distribution while constraining it to the [0,1] space. The function to do this is very very simple. I usually write it as:
def to01(array):
array -= array.min()
array /= array.max()
return array
Of course it can and should be more complex to account for tons of situations, such as all the values being the same (divide by zero) and float vs. integer division (use np.subtract and np.divide instead of operators). But this is the most basic.
The problem is that I do this very frequently across stuff in my project, and it seems like a fairly standard mathematical operation. Is there a built in function that does this in NumPy?
Don't know if there's a builtin for that (probably not, it's not really a difficult thing to do as is). You can use vectorize to apply a function to all the elements of the array:
def to01(array):
a = array.min()
# ignore the Runtime Warning
with numpy.errstate(divide='ignore'):
b = 1. /(array.max() - array.min())
if not(numpy.isfinite(b)):
b = 0
return numpy.vectorize(lambda x: b * (x - a))(array)

Optimizing Boolean Compares Between Arrays with Python

I've got some Python code I'm trying to optimize. It deals with two 2D arrays of identical size (their size can be arbitrary). The first array is full of arbitrary Boolean values, and the second is full of semi-random numbers between 0 and 1.
What I'm trying to do is change the binary values based on the values in the modifier array. Here's a code snippet that works just fine and encapsulates what I'm trying to do within two for-loops:
import numpy as np
xdim = 3
ydim = 4
binaries = np.greater(np.random.rand(xdim,ydim), 0.5)
modifier = np.random.rand(xdim,ydim)
for i in range(binaries.shape[0]):
for j in range(binaries.shape[1]):
if np.greater(modifier[i,j], 0.2):
binaries[i,j] = False
My question: is there a better (or more proper) way to do this? I'd rather use things like slices instead of nested for loops, but the comparisons and Boolean logic make me think that this might be the best way.
binaries &= ~(modifier > 0.2)
modifiler > 0.2 create a binary array, ~ operator does boolean not, and &= does boolean and.
NOTE ~ &= are bitwise operators, but you can use them as boolean operators.

Categories