Check if two vectors are equal in Python

I have one vector called cm which does not change:
cm = np.array([[99, 99, 0]])
and another vector called pt that I want to loop through certain values. When the two are equal, I want it to skip over and not perform the operation. For the sake of this post I just have it print out the value of pt, but I actually have a whole host of operations to run. Here is my code:
for i in range(95, 103):
    for j in range(95, 103):
        pt = np.array([[i, j, 0]])
        if pt == cm:
            continue
        print(pt)
I have tried changing the fourth line to
if pt.all == cm.all
but that prints everything, including the one I want to skip. And if I turn it into
if pt.all() == cm.all()
that also doesn't work. What is the difference between those two, anyway?
Does anyone know how I can fix it so that when pt is [99, 99, 0] it will skip the operations and go back to the beginning of the loop? Thanks!

You're probably looking for (pt == cm).all(), although if floats are involved np.allclose(pt, cm) is probably a better idea in case you have numerical errors.
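Applied to your loop, that looks like the following (a minimal sketch; replace the print with your real operations):

import numpy as np

cm = np.array([[99, 99, 0]])

for i in range(95, 103):
    for j in range(95, 103):
        pt = np.array([[i, j, 0]])
        if (pt == cm).all():   # True only when every component matches
            continue
        print(pt)

To explain the three variants you tried: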
(1) pt.all == cm.all
This compares the two bound method objects themselves, which are never equal for two different arrays:
>>> pt.all
<built-in method all of numpy.ndarray object at 0x16cbbe0>
>>> pt.all == cm.all
False
(2) pt.all() == cm.all()
This checks whether the result of calling all matches in each case. For example:
>>> pt
array([[99, 99, 0]])
>>> pt.all()
False
>>> cm = np.array([10, 10, 0])
>>> cm.all()
False
>>> pt.all() == cm.all()
True
(3) (pt == cm).all()
This creates an array testing to see whether the two are equal, and returns whether the result is all True:
>>> pt
array([[99, 99, 0]])
>>> cm
array([[99, 99, 0]])
>>> pt == cm
array([[ True, True, True]], dtype=bool)
>>> (pt == cm).all()
True
One downside is that this constructs a temporary array, but often that's not an issue in practice.
Aside: when you're writing nested loops with numpy arrays you've usually made a mistake somewhere. Python-level loops are slow, and so you lose a lot of the benefits you get from using numpy in the first place. But that's a separate issue.
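For illustration, here is a hedged sketch of how the whole grid of points could be built without Python-level loops (the meshgrid/stack construction is my own example, not from the question):

import numpy as np

cm = np.array([99, 99, 0])

# Build every (i, j, 0) point in one shot instead of a nested loop.
ii, jj = np.meshgrid(np.arange(95, 103), np.arange(95, 103), indexing="ij")
pts = np.stack([ii.ravel(), jj.ravel(), np.zeros(ii.size, dtype=int)], axis=1)

# Keep only the rows that differ from cm in at least one component.
mask = ~(pts == cm).all(axis=1)
for pt in pts[mask]:   # or hand pts[mask] to vectorized operations directly
    print(pt)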

Related

Is there an efficient way to pass "all" as a numpy index?

I have code that generates a boolean array that acts as a mask on numpy arrays, along the lines of:
def func():
    a = numpy.arange(10)
    mask = a % 2 == 0
    return a[mask]
Now, I need to separate this into a case where the mask is created, and one where it is not created and all values are used instead. This could be achieved as follows:
def func(use_mask):
    a = numpy.arange(10)
    if use_mask:
        mask = a % 2 == 0
    else:
        mask = numpy.ones(10, dtype=bool)
    return a[mask]
However, this becomes extremely wasteful for large arrays, since an equally large boolean array must first be created.
My question is thus: Is there something I can pass as an "index" to recreate the behavior of such an everywhere-true array?
Systematically changing occurrences of a[mask] to something else involving some indexing magic etc. is a valid solution, but just avoiding the masking entirely via an expanded case distinction or something else that changes the structure of the code is not desired, as it would impair readability and maintainability (see next paragraph).
For the sake of completeness, here's what I'm currently considering doing, though this makes the code messier and less streamlined since it expands the if/else beyond where it technically needs to be (in reality, the mask is used more than once, hence every occurrence would need to be contained within the case distinction; I used f1 and f2 as examples here):
def func(use_mask):
    a = numpy.arange(10)
    if use_mask:
        mask = a % 2 == 0
        r = f1(a[mask])
        q = f2(a[mask], r)
        return q
    else:
        r = f1(a)
        q = f2(a, r)
        return q
Recall that a[:] returns the contents of a (even if a is multidimensional). We cannot store the : in the mask variable, but we can use a slice object equivalently:
def func(use_mask):
    a = numpy.arange(10)
    if use_mask:
        mask = a % 2 == 0
    else:
        mask = slice(None)
    return a[mask]
This does not use any memory to create the index array. I'm not sure what the CPU usage of the a[slice(None)] operation is, though.
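For what it's worth, a quick check suggests the cost is negligible: a[slice(None)] is a view of a, not a copy (a small sketch, not a benchmark):

import numpy

a = numpy.arange(10)
b = a[slice(None)]      # equivalent to a[:]
print(b.base is a)      # True: basic slicing returns a view, no new data
b[0] = 99
print(a[0])             # 99: the view shares memory with a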

How exactly does the .any() Python method work? [duplicate]

This question already has answers here:
How do Python's any and all functions work?
(10 answers)
Closed 2 years ago.
I'm trying to write a script that simulates a system of chemical reactions over time. One of the inputs to the function is the following array:
popul_num = np.array([200, 100, 0, 0])
Which contains the number of discrete molecules of each species in the system. Part of the main function has an if statement that's meant to check that the number of molecules is positive. If it is, proceed to the next iteration; else break out of the whole simulation:
if popul_num.any() < 0:  # Any method isn't working! --> Does .any() work on arrays or just lists?
    print("Break out of loop negative molecule numbers")
    tao_all = tao_all[0:-1]
    popul_num_all = popul_num_all[0:-1]
else:
    break
I've used .any() to try to find whether any element of the popul_num array is negative, but it doesn't work. It doesn't throw an error; the system just never enters the if statement, and I can't figure out why.
I've just run the program and the final number of molecules the system returned was [135 -19 65 54]; the program should have broken out before the second element got to -19.
Any suggestions?
Cheers
You should use .any() on a boolean array after doing the comparison, not on the values of popul_num themselves. It will return True if any of the values of the boolean array are True, otherwise False.
In fact, .any() tests for any "truthy" values, which for integers means non-zero values. So it does work on an array of integers, but it tests whether any of them are non-zero, which is not the thing you are interested in knowing. The code then compounds the problem by applying the < 0 test to the boolean value returned by any(), which always evaluates False, because boolean values are treated as 0 and 1 (for False and True respectively) in comparisons with integers.
You can do:
if (popul_num < 0).any():
    do_whatever
Here popul_num < 0 is a boolean array containing the results of element-by-element comparisons. In your example:
>>> popul_num < 0
array([False, False, False, False], dtype=bool)
You are, however, correct to use array.any() (or np.any(array)) rather than using the builtin any(). The latter happens to work for a 1-d array, but would not work with more dimensions. This is because iterating e.g. over a 4d array (which is what the builtin any() would do) gives a sequence of 3d arrays, not the individual elements.
There is also similarly .all(). The above test is equivalent to:
if not (popul_num >= 0).all():
The any method of numpy arrays returns a boolean value, so when you write:
if popul_num.any() < 0:
popul_num.any() will be either True (=1) or False (=0) so it will never be less than zero. Thus, you will never enter this if-statement.
What any() does is evaluate each element of the array as a boolean and return whether any of them are truthy. For example:
>>> np.array([0.0]).any()
False
>>> np.array([1.0]).any()
True
>>> np.array([0.0, 0.35]).any()
True
As you can see, Python/numpy considers 0 to be falsy and all other numbers to be truthy. So calling any on an array of numbers tells us whether any number in the array is nonzero. But you want to know whether any number is negative, so we have to transform the array first. Let's introduce a negative number into your array to demonstrate.
>>> popul_num = np.array([200, 100, 0, -1])
>>> popul_num < 0 # Test is applied to all elements in the array
array([False, False, False,  True])
>>> (popul_num < 0).any()
True
You asked about any on lists versus arrays. Python's builtin list has no any method:
>>> [].any()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'any'
There is a builtin function (not method since it doesn't belong to a class) called any that serves the same purpose as the numpy .any method. These two expressions are logically equivalent:
any(popul_num < 0)
(popul_num < 0).any()
We would generally expect the second one to be faster since numpy is implemented in C. However only the first one will work with non-numpy types such as list and set.
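A small sketch illustrating both points, with example values of my own:

import numpy as np

popul_num = np.array([200, 100, 0, -1])

print(any(popul_num < 0))      # True: builtin any() over a 1-d boolean array
print((popul_num < 0).any())   # True: the numpy method, works for any shape

# With more dimensions only the numpy method works: iterating over a 2-d
# array yields whole rows, whose truth value is ambiguous.
m = popul_num.reshape(2, 2) < 0
print(m.any())                 # True
# any(m) would raise ValueError: the truth value of an array is ambiguous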
any() is a built-in function that works on iterables. It returns True if any of the items in the iterable are truthy. You want something more like:
if any(x > 0 for x in popul_num):
    print("at least one element was greater than zero!")
any() returns True, if at least one of the elements is True:
L1 = [False, False, False]
any(L1)
# >>> False
L1 = [True, False, False]
any(L1)
# >>> True
L1 = ["Hi", False, False]
any(L1)
# >>> True
L1 = []
any(L1)
# >>> False
any() returns True if at least one element in a NumPy array evaluates to True, and np.all tests whether all array elements along a given axis evaluate to True. What you need to solve your problem is the all method.
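A minimal sketch of that all-based variant, with example values of my own:

import numpy as np

popul_num = np.array([135, -19, 65, 54])

if (popul_num >= 0).all():   # True only if every count is non-negative
    pass                     # carry on with the next iteration
else:
    print("Break out of loop negative molecule numbers")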

Evaluating a function using numpy

What is the significance of the return part when evaluating functions? Why is this necessary?
Your assumption is right: dfdx[0] is indeed the first value in that array, so according to your code that would correspond to evaluating the derivative at x=-1.0.
To know the correct index where x is equal to 0, you will have to look for it in the x array.
One way to find this is the following, where we find the index of the value for which |x - 0| is minimal (essentially where x = 0, but float arithmetic requires taking some precautions), using argmin:
index0 = np.argmin(np.abs(x - 0))
And we then get what we want, dfdx at the index where x is 0:
print(dfdx[index0])
Another way, less robust with respect to float arithmetic, is the following:
# We make a boolean array that is True where x is zero and False everywhere else.
bool_array = (x == 0)
# NumPy allows using a boolean array to index an array.
# Doing so gets you all the values of dfdx where bool_array is True;
# in our case that will hopefully give us dfdx where x == 0.
print(dfdx[bool_array])
# The same thing as a one-liner:
print(dfdx[x == 0])
You gave the answer yourself: x[0] is -1.0, and you want the value at the middle of the array. np.linspace is the right function to build such a series of values:
import numpy as np

def f1(x):
    g = np.sin(np.pi * np.exp(-x))
    return g

n = 1001                           # odd!
x = np.linspace(-1, 1, n)          # x[n//2] is 0
f1x = f1(x)
df1 = np.diff(f1x, 1)
dx = np.diff(x)
# Exact derivative, trimmed to match the length of the forward differences:
df1dx = (-np.pi * np.exp(-x) * np.cos(np.pi * np.exp(-x)))[:-1]
# The forward difference df1/dx only matches the derivative at the left end
# of each interval to first order, so a loose tolerance is needed:
# np.allclose(df1/dx, df1dx, atol=0.1)  ->  True
As another tip: numpy arrays are used more efficiently and readably without loops.
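As a hedged aside (not part of the original answer), np.gradient does the same differencing with less index bookkeeping, using central differences in the interior:

import numpy as np

n = 1001
x = np.linspace(-1, 1, n)
f1x = np.sin(np.pi * np.exp(-x))

dfdx = np.gradient(f1x, x, edge_order=2)   # same length as x, no trimming

exact = -np.pi * np.exp(-x) * np.cos(np.pi * np.exp(-x))
print(np.allclose(dfdx, exact, atol=1e-2))   # True, up to discretization error
print(dfdx[n // 2])                          # the derivative at x = 0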

Strange outcome when testing identity with numpy [duplicate]

Is there a Pythonic and efficient way to check whether a Numpy array contains at least one instance of a given row? By "efficient" I mean it terminates upon finding the first matching row rather than iterating over the entire array even if a result has already been found.
With plain Python lists this can be accomplished very cleanly with if row in a:, but this does not work as I would expect for NumPy arrays, as illustrated below.
With Python lists:
>>> a = [[1,2],[10,20],[100,200]]
>>> [1,2] in a
True
>>> [1,20] in a
False
but Numpy arrays give different and rather odd-looking results. (The __contains__ method of ndarray seems to be undocumented.)
>>> a = np.array([[1,2],[10,20],[100,200]])
>>> np.array([1,2]) in a
True
>>> np.array([1,20]) in a
True
>>> np.array([1,42]) in a
True
>>> np.array([42,1]) in a
False
You can use .tolist():
>>> a = np.array([[1,2],[10,20],[100,200]])
>>> [1,2] in a.tolist()
True
>>> [1,20] in a.tolist()
False
>>> [1,42] in a.tolist()
False
>>> [42,1] in a.tolist()
False
Or use a view:
>>> any((a[:]==[1,2]).all(1))
True
>>> any((a[:]==[1,20]).all(1))
False
Or generate over the numpy list (potentially VERY SLOW):
any(([1,2] == x).all() for x in a) # stops on first occurrence
Or use numpy logic functions:
any(np.equal(a,[1,2]).all(1))
If you time these:
import numpy as np
import time

n = 300000
a = np.arange(n * 3).reshape(n, 3)
b = a.tolist()
t1, t2, t3 = a[n // 100][0], a[n // 2][0], a[-10][0]

tests = [('early hit', [t1, t1 + 1, t1 + 2]),
         ('middle hit', [t2, t2 + 1, t2 + 2]),
         ('late hit', [t3, t3 + 1, t3 + 2]),
         ('miss', [0, 2, 0])]

fmt = '\t{:20}{:.5f} seconds and is {}'
for test, tgt in tests:
    print('\n{}: {} in {:,} elements:'.format(test, tgt, n))

    name = 'view'
    start = time.time()
    result = (a[...] == tgt).all(1).any()
    stop = time.time()
    print(fmt.format(name, stop - start, result))

    name = 'python list'
    start = time.time()
    result = tgt in b
    stop = time.time()
    print(fmt.format(name, stop - start, result))

    name = 'gen over numpy'
    start = time.time()
    result = any((tgt == x).all() for x in a)
    stop = time.time()
    print(fmt.format(name, stop - start, result))

    name = 'logic equal'
    start = time.time()
    result = np.equal(a, tgt).all(1).any()   # the original forgot to assign this
    stop = time.time()
    print(fmt.format(name, stop - start, result))
You can see that hit or miss, the numpy routines are the same speed to search the array. The Python in operator is potentially a lot faster for an early hit, and the generator is just bad news if you have to go all the way through the array.
Here are the results for 300,000 x 3 element array:
early hit: [9000, 9001, 9002] in 300,000 elements:
    view                0.01002 seconds and is True
    python list         0.00305 seconds and is True
    gen over numpy      0.06470 seconds and is True
    logic equal         0.00909 seconds and is True
middle hit: [450000, 450001, 450002] in 300,000 elements:
    view                0.00915 seconds and is True
    python list         0.15458 seconds and is True
    gen over numpy      3.24386 seconds and is True
    logic equal         0.00937 seconds and is True
late hit: [899970, 899971, 899972] in 300,000 elements:
    view                0.00936 seconds and is True
    python list         0.30604 seconds and is True
    gen over numpy      6.47660 seconds and is True
    logic equal         0.00965 seconds and is True
miss: [0, 2, 0] in 300,000 elements:
    view                0.00936 seconds and is False
    python list         0.01287 seconds and is False
    gen over numpy      6.49190 seconds and is False
    logic equal         0.00965 seconds and is False
And for 3,000,000 x 3 array:
early hit: [90000, 90001, 90002] in 3,000,000 elements:
    view                0.10128 seconds and is True
    python list         0.02982 seconds and is True
    gen over numpy      0.66057 seconds and is True
    logic equal         0.09128 seconds and is True
middle hit: [4500000, 4500001, 4500002] in 3,000,000 elements:
    view                0.09331 seconds and is True
    python list         1.48180 seconds and is True
    gen over numpy      32.69874 seconds and is True
    logic equal         0.09438 seconds and is True
late hit: [8999970, 8999971, 8999972] in 3,000,000 elements:
    view                0.09868 seconds and is True
    python list         3.01236 seconds and is True
    gen over numpy      65.15087 seconds and is True
    logic equal         0.09591 seconds and is True
miss: [0, 2, 0] in 3,000,000 elements:
    view                0.09588 seconds and is False
    python list         0.12904 seconds and is False
    gen over numpy      64.46789 seconds and is False
    logic equal         0.09671 seconds and is False
Which seems to indicate that np.equal is the fastest pure numpy way to do this...
NumPy's __contains__ is, at the time of writing, (a == b).any(), which is arguably only correct if b is a scalar. (It is a bit hairy, but I believe, in 1.7 or later, the right general method would be (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any(), which makes sense for all combinations of the dimensionality of a and b.)
EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Also, someone might argue that it should handle the items in a separately, as np.in1d does. I am not sure there is one clear way it should work.
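That (a == b).any() rule reproduces the odd-looking results from the question (a quick check, under the behavior described above):

import numpy as np

a = np.array([[1, 2], [10, 20], [100, 200]])

print((a == np.array([1, 42])).any())   # True: the 1 matches a[0, 0]
print((a == np.array([42, 1])).any())   # False: nothing matches after broadcasting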
Now you want numpy to stop when it finds the first occurrence. This AFAIK does not exist at this time. It is difficult because numpy is based mostly on ufuncs, which do the same thing over the whole array.
Numpy does optimize these kind of reductions, but effectively that only works when the array being reduced is already a boolean array (i.e. np.ones(10, dtype=bool).any()).
Otherwise it would need a special function for __contains__ which does not exist. That may seem odd, but you have to remember that numpy supports many data types and has a bigger machinery to select the correct ones and select the correct function to work on it. So in other words, the ufunc machinery cannot do it, and implementing __contains__ or such specially is not actually that trivial because of data types.
You can of course write it in python, or since you probably know your data type, writing it yourself in Cython/C is very simple.
That said, often it is much better anyway to use a sorting-based approach for these things. It is a little tedious, as there is no such thing as searchsorted for a lexsort, but it works (you could also abuse scipy.spatial.cKDTree if you like). This assumes you want to compare along the last axis only:
# Unfortunately you need to use structured arrays:
sorted_rows = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
# Actually at this point you can also use np.in1d; if you already have many b,
# then that is even better.
sorted_rows.sort()
b_comp = np.ascontiguousarray(b).view(sorted_rows.dtype)
ind = sorted_rows.searchsorted(b_comp)
result = sorted_rows[ind] == b_comp
This also works for an array b, and if you keep the sorted array around, it is also much better if you do it for a single value (row) in b at a time while a stays the same (otherwise I would just use np.in1d after viewing a as a recarray). Important: you must do the np.ascontiguousarray for safety. It will typically do nothing, but skipping it would be a big potential bug otherwise.
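A hedged end-to-end run of that recipe, using the arrays from the question (the looked-up rows are my own choice):

import numpy as np

a = np.array([[1, 2], [10, 20], [100, 200]])
b = np.array([[1, 20], [10, 20]])   # rows to look up

s = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
s.sort()
b_comp = np.ascontiguousarray(b).view(s.dtype).ravel()
ind = s.searchsorted(b_comp)
print(s[ind] == b_comp)             # [False  True]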
I think
np.equal([1, 2], a).all(axis=1)  # also: ([1, 2] == a).all(axis=1)
# array([ True, False, False])
will list the rows that match. As Jamie points out, to know whether at least one such row exists, use any:
np.equal([1, 2], a).all(axis=1).any()
# True
Aside: I suspect in (and __contains__) is just as above but using any instead of all.
I've compared the suggested solutions with perfplot and found that, if you're looking for a 2-tuple in a long unsorted list,
np.any(np.all(a == b, axis=1))
is the fastest solution. An explicit short-circuit loop can always be faster if a match is found in the first few rows.
Code to reproduce the plot:
import numpy as np
import perfplot

target = [6, 23]

def setup(n):
    return np.random.randint(0, 100, (n, 2))

def any_all(data):
    return np.any(np.all(target == data, axis=1))

def tolist(data):
    return target in data.tolist()

def loop(data):
    for row in data:
        if np.all(row == target):
            return True
    return False

def searchsorted(a):
    s = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
    s.sort()
    t = np.ascontiguousarray(target).view(s.dtype)
    ind = s.searchsorted(t)
    return (s[ind] == t)[0]

perfplot.save(
    "out02.png",
    setup=setup,
    kernels=[any_all, tolist, loop, searchsorted],
    n_range=[2 ** k for k in range(2, 20)],
    xlabel="len(array)",
)
If you really want to stop at the first occurrence, you could write a loop, like:
import numpy as np

needle = np.array([10, 20])
haystack = np.array([[1, 2], [10, 20], [100, 200]])

found = False
for row in haystack:
    if np.all(row == needle):
        found = True
        break
print("Found: ", found)
However, I strongly suspect that it will be much slower than the other suggestions, which use numpy routines to process the whole array at once.

find stretches of Trues in numpy array

Is there a good way to find stretches of Trues in a numpy boolean array? If I have an array like:
x = numpy.array([True,True,False,True,True,False,False])
Can I get an array of indices like:
starts = [0,3]
ends = [1,4]
or any other appropriate way to store this information. I know this can be done with some complicated while loops, but I'm looking for a better way.
You can pad x with Falses (one at the beginning and one at the end), and use np.diff. A "diff" of 1 means transition from False to True, and of -1 means transition from True to False.
The convention is to represent range's end as the index one after the last. This example complies with the convention (you can easily use ends-1 instead of ends to get the array in your question):
x1 = np.hstack([ [False], x, [False] ]) # padding
d = np.diff(x1.astype(int))
starts = np.where(d == 1)[0]
ends = np.where(d == -1)[0]
starts, ends
=> (array([0, 3]), array([2, 5]))
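Wrapped up as a small reusable function, with the example from the question and inclusive end indices as asked for:

import numpy as np

def true_runs(x):
    # Pad with False so runs touching either edge are detected too.
    x1 = np.hstack([[False], x, [False]])
    d = np.diff(x1.astype(int))
    starts = np.where(d == 1)[0]
    ends = np.where(d == -1)[0] - 1   # inclusive end indices
    return starts, ends

x = np.array([True, True, False, True, True, False, False])
print(true_runs(x))   # (array([0, 3]), array([1, 4]))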
