Strange outcome when testing identity with numpy [duplicate]

Is there a Pythonic and efficient way to check whether a Numpy array contains at least one instance of a given row? By "efficient" I mean it terminates upon finding the first matching row rather than iterating over the entire array even if a result has already been found.
With Python lists this can be accomplished very cleanly with if row in list:, but it does not work as I would expect for NumPy arrays, as illustrated below.
With Python lists:
>>> a = [[1,2],[10,20],[100,200]]
>>> [1,2] in a
True
>>> [1,20] in a
False
but Numpy arrays give different and rather odd-looking results. (The __contains__ method of ndarray seems to be undocumented.)
>>> a = np.array([[1,2],[10,20],[100,200]])
>>> np.array([1,2]) in a
True
>>> np.array([1,20]) in a
True
>>> np.array([1,42]) in a
True
>>> np.array([42,1]) in a
False

You can use .tolist(), which converts the whole array to a Python list (and therefore copies the data):
>>> a = np.array([[1,2],[10,20],[100,200]])
>>> [1,2] in a.tolist()
True
>>> [1,20] in a.tolist()
False
>>> [1,42] in a.tolist()
False
>>> [42,1] in a.tolist()
False
Or use a view:
>>> any((a[:]==[1,2]).all(1))
True
>>> any((a[:]==[1,20]).all(1))
False
Or use a generator expression over the numpy array (potentially VERY SLOW):
any(([1,2] == x).all() for x in a) # stops on first occurrence
Or use numpy logic functions:
any(np.equal(a,[1,2]).all(1))
If you time these:
import numpy as np
import time

n = 300000
a = np.arange(n*3).reshape(n, 3)
b = a.tolist()
t1, t2, t3 = a[n//100][0], a[n//2][0], a[-10][0]
tests = [('early hit', [t1, t1+1, t1+2]),
         ('middle hit', [t2, t2+1, t2+2]),
         ('late hit', [t3, t3+1, t3+2]),
         ('miss', [0, 2, 0])]
fmt = '\t{:20}{:.5f} seconds and is {}'
for test, tgt in tests:
    print('\n{}: {} in {:,} elements:'.format(test, tgt, n))

    name = 'view'
    t1 = time.time()
    result = (a[...] == tgt).all(1).any()
    t2 = time.time()
    print(fmt.format(name, t2-t1, result))

    name = 'python list'
    t1 = time.time()
    result = tgt in b
    t2 = time.time()
    print(fmt.format(name, t2-t1, result))

    name = 'gen over numpy'
    t1 = time.time()
    result = any((tgt == x).all() for x in a)
    t2 = time.time()
    print(fmt.format(name, t2-t1, result))

    name = 'logic equal'
    t1 = time.time()
    result = np.equal(a, tgt).all(1).any()
    t2 = time.time()
    print(fmt.format(name, t2-t1, result))
You can see that, hit or miss, the numpy routines take the same time to search the array. The Python in operator is potentially a lot faster for an early hit, and the generator is just bad news if you have to go all the way through the array.
Here are the results for a 300,000 x 3 element array:

early hit: [9000, 9001, 9002] in 300,000 elements:
    view                0.01002 seconds and is True
    python list         0.00305 seconds and is True
    gen over numpy      0.06470 seconds and is True
    logic equal         0.00909 seconds and is True

middle hit: [450000, 450001, 450002] in 300,000 elements:
    view                0.00915 seconds and is True
    python list         0.15458 seconds and is True
    gen over numpy      3.24386 seconds and is True
    logic equal         0.00937 seconds and is True

late hit: [899970, 899971, 899972] in 300,000 elements:
    view                0.00936 seconds and is True
    python list         0.30604 seconds and is True
    gen over numpy      6.47660 seconds and is True
    logic equal         0.00965 seconds and is True

miss: [0, 2, 0] in 300,000 elements:
    view                0.00936 seconds and is False
    python list         0.01287 seconds and is False
    gen over numpy      6.49190 seconds and is False
    logic equal         0.00965 seconds and is False

And for a 3,000,000 x 3 array:

early hit: [90000, 90001, 90002] in 3,000,000 elements:
    view                0.10128 seconds and is True
    python list         0.02982 seconds and is True
    gen over numpy      0.66057 seconds and is True
    logic equal         0.09128 seconds and is True

middle hit: [4500000, 4500001, 4500002] in 3,000,000 elements:
    view                0.09331 seconds and is True
    python list         1.48180 seconds and is True
    gen over numpy      32.69874 seconds and is True
    logic equal         0.09438 seconds and is True

late hit: [8999970, 8999971, 8999972] in 3,000,000 elements:
    view                0.09868 seconds and is True
    python list         3.01236 seconds and is True
    gen over numpy      65.15087 seconds and is True
    logic equal         0.09591 seconds and is True

miss: [0, 2, 0] in 3,000,000 elements:
    view                0.09588 seconds and is False
    python list         0.12904 seconds and is False
    gen over numpy      64.46789 seconds and is False
    logic equal         0.09671 seconds and is False
Which seems to indicate that np.equal is the fastest pure numpy way to do this...

Numpy's __contains__ is, at the time of writing this, (a == b).any(), which is arguably only correct if b is a scalar. (It is a bit hairy, but I believe the right general method, in numpy 1.7 or later, would be (a == b).all(np.arange(a.ndim - b.ndim, a.ndim)).any(), which makes sense for all combinations of the dimensionality of a and b)...
EDIT: Just to be clear, this is not necessarily the expected result when broadcasting is involved. Also, someone might argue that it should handle the items in a separately, as np.in1d does. I am not sure there is one clear way it should work.
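For illustration, here is a quick check (using the example array from the question, not code from the original answer) showing how (a == b).any() produces the surprising in results above:
import numpy as np

a = np.array([[1, 2], [10, 20], [100, 200]])

# __contains__ currently amounts to an element-wise comparison
# followed by .any(): True if ANY element matches anywhere.
print(np.array([1, 42]) in a)          # True: 1 matches a[0, 0]
print((np.array([1, 42]) == a).any())  # True: same computation
print(np.array([42, 1]) in a)          # False: column-wise, nothing matches
print((np.array([42, 1]) == a).any())  # False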
Now you want numpy to stop when it finds the first occurrence. As far as I know, this does not exist at this time. It is difficult because numpy is based mostly on ufuncs, which do the same thing over the whole array.
Numpy does optimize these kinds of reductions, but effectively that only works when the array being reduced is already a boolean array (i.e. np.ones(10, dtype=bool).any()).
Otherwise it would need a special function for __contains__, which does not exist. That may seem odd, but you have to remember that numpy supports many data types and has a large machinery for selecting the correct one and the correct function to work on it. So in other words, the ufunc machinery cannot do it, and implementing __contains__ or the like specially is not actually that trivial because of data types.
You can of course write it in Python, or, since you probably know your data type, writing it yourself in Cython/C is very simple.
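As one hedged sketch of that do-it-yourself route (my illustration, assuming the optional numba package is available, rather than Cython/C):
import numba
import numpy as np

@numba.njit
def contains_row(a, row):
    # Scan row by row and stop at the first full match.
    for i in range(a.shape[0]):
        match = True
        for j in range(a.shape[1]):
            if a[i, j] != row[j]:
                match = False
                break
        if match:
            return True
    return False

a = np.arange(900000).reshape(300000, 3)
print(contains_row(a, np.array([9000, 9001, 9002])))  # True, stops early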
That said, it is often much better anyway to use a sorting-based approach for these things. That is a little tedious, as there is no such thing as searchsorted for a lexsort, but it works (you could also abuse scipy.spatial.cKDTree if you like). This assumes you want to compare along the last axis only:
# Unfortunatly you need to use structured arrays:
sorted = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
# Actually at this point, you can also use np.in1d, if you already have many b
# then that is even better.
sorted.sort()
b_comp = np.ascontiguousarray(b).view(sorted.dtype)
ind = sorted.searchsorted(b_comp)
result = sorted[ind] == b_comp
This also works for an array b, and if you keep the sorted array around, it is also much better if you do it for a single value (row) in b at a time while a stays the same (otherwise I would just use np.in1d after viewing it as a structured array). Important: you must do the np.ascontiguousarray for safety. It will typically do nothing, but if it does, skipping it would be a big potential bug.
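Wrapped up as a membership test, the approach might look like the sketch below. The np.clip guard is my addition, since searchsorted can return len(s) when the probe row sorts past the last element, which would otherwise raise an IndexError:
import numpy as np

def contains_row_sorted(a, b):
    # View each row as one structured element, then sort and bisect.
    s = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
    s.sort()
    b_comp = np.ascontiguousarray(b).view(s.dtype)
    ind = np.clip(s.searchsorted(b_comp), 0, len(s) - 1)
    return bool((s[ind] == b_comp).any())

a = np.array([[1, 2], [10, 20], [100, 200]])
print(contains_row_sorted(a, np.array([10, 20])))  # True
print(contains_row_sorted(a, np.array([42, 1])))   # False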

I think
np.equal([1,2], a).all(axis=1)  # also: ([1,2] == a).all(axis=1)
# array([ True, False, False], dtype=bool)
will list the rows that match. As Jamie points out, to know whether at least one such row exists, use any:
np.equal([1,2], a).all(axis=1).any()
# True
Aside: I suspect in (and __contains__) is just as above but using any instead of all.

I've compared the suggested solutions with perfplot and found that, if you're looking for a 2-tuple in a long unsorted list,
np.any(np.all(a == b, axis=1))
is the fastest solution. An explicit short-circuit loop can always be faster if a match is found in the first few rows.
Code to reproduce the plot:
import numpy as np
import perfplot

target = [6, 23]

def setup(n):
    return np.random.randint(0, 100, (n, 2))

def any_all(data):
    return np.any(np.all(target == data, axis=1))

def tolist(data):
    return target in data.tolist()

def loop(data):
    for row in data:
        if np.all(row == target):
            return True
    return False

def searchsorted(a):
    s = np.ascontiguousarray(a).view([('', a.dtype)] * a.shape[-1]).ravel()
    s.sort()
    t = np.ascontiguousarray(target).view(s.dtype)
    ind = s.searchsorted(t)
    return (s[ind] == t)[0]

perfplot.save(
    "out02.png",
    setup=setup,
    kernels=[any_all, tolist, loop, searchsorted],
    n_range=[2 ** k for k in range(2, 20)],
    xlabel="len(array)",
)

If you really want to stop at the first occurrence, you could write a loop, like:
import numpy as np

needle = np.array([10, 20])
haystack = np.array([[1, 2], [10, 20], [100, 200]])

found = False
for row in haystack:
    if np.all(row == needle):
        found = True
        break
print("Found:", found)
However, I strongly suspect that it will be much slower than the other suggestions, which use numpy routines on the whole array.

Related

Is there an elegant way to check if index can be requested in a numpy array?

I am looking for an elegant way to check if a given index is inside a numpy array (for example for BFS algorithms on a grid).
The following code does what I want:
import numpy as np

def isValid(np_shape: tuple, index: tuple):
    if min(index) < 0:
        return False
    for ind, sh in zip(index, np_shape):
        if ind >= sh:
            return False
    return True

arr = np.zeros((3, 5))
print(isValid(arr.shape, (0, 0)))  # True
print(isValid(arr.shape, (2, 4)))  # True
print(isValid(arr.shape, (4, 4)))  # False
But I'd prefer something built-in or more elegant than writing my own function including Python for-loops (yikes).
You can try:
def isValid(np_shape: tuple, index: tuple):
    index = np.array(index)
    return (index >= 0).all() and (index < np.array(np_shape)).all()

arr = np.zeros((3, 5))
print(isValid(arr.shape, (0, 0)))  # True
print(isValid(arr.shape, (2, 4)))  # True
print(isValid(arr.shape, (4, 4)))  # False
I have benchmarked the answers quite a bit, and come to the conclusion that the explicit for loop as provided in my code actually performs best.
Dmitri's solution is wrong for several reasons (tuple1 < tuple2 compares tuples lexicographically rather than element-wise; ideas like np.all(ind < sh for ind, sh in zip(index, np_shape)) fail because the input to np.all is a generator, not an array; etc.).
@mozway's solution is correct, but all the casts make it a lot slower. Also, it always needs to consider all coordinates, while an explicit loop can stop earlier, I suppose.
Here is my benchmark (Method 0 is @mozway's solution, Method 1 is my solution).
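For what it's worth, a short-circuiting variant without numpy casts is a plain generator expression with Python's built-in all (a sketch of the idea, not one of the benchmarked answers):
def isValid(np_shape: tuple, index: tuple) -> bool:
    # Python's all() short-circuits on the first out-of-range coordinate.
    return all(0 <= ind < sh for ind, sh in zip(index, np_shape))

print(isValid((3, 5), (2, 4)))  # True
print(isValid((3, 5), (4, 4)))  # False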

What is the fastest way I can determine whether the intersection of multiple sets of integers is nonempty?

I am part of an undergrad mathematics group. I have a collection of sets of unique integers (the sets vary in length). Very often I need to determine whether the intersection of all sets in the collection is nonempty. I do not need to know what the intersection is, just whether it is nonempty. I have to do this a lot. I don't have much experience with time complexity or with designing efficient algorithms. What is the fastest way to go about this?
I included what I have so far. It's horribly slow: if S has 15+ sets, the script takes forever.
# S is a list of integer lists (one per set)
def intersects(S):
    if S == []:
        return True  # if S is empty, I deem the intersection nonempty for reasons
    A = S[0]
    for i in range(1, len(S)):
        B = S[i]
        A = get_intersection(A, B)  # returns the intersection of A and B
        if A == []:
            return False
    return True
set.intersection can accept multiple sets
S[0].intersection(*S[1:])
(or even just set.intersection(*S))
for example
>>> s1 = set([1,2,3])
>>> s2 = set([2,3,4])
>>> s3 = set([3,4,5])
>>> s1.intersection(s2, s3)
set([3])
Another approach is to put everything into set.intersection:
import numpy as np
S = [set(np.random.randint(0,100,100)) for _ in range(20)]
set.intersection(*S)
# set()
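Since you only need to know whether the intersection is nonempty, here is a sketch of an incremental variant that stops as soon as the running intersection becomes empty (my addition, not from the answers above):
def has_common_element(S):
    # Intersect incrementally, smallest set first, and bail out early.
    if not S:
        return True  # empty collection treated as nonempty, per the question
    sets = sorted((set(s) for s in S), key=len)
    common = sets[0]
    for s in sets[1:]:
        common &= s
        if not common:
            return False
    return True

print(has_common_element([{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]))  # True
print(has_common_element([{1, 2}, {3, 4}]))                   # False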

check if two vectors are equal python

I have one vector called cm which does not change:
cm = np.array([[99,99,0]])
and another vector called pt that I want to loop through certain values. When the two are equal, I want to skip over and not perform the operation. For the sake of this post I just have it print out the value of pt, but I actually have a whole host of operations to run. Here is my code:
for i in range(95, 103):
    for j in range(95, 103):
        pt = np.array([[i, j, 0]])
        if pt == cm:
            continue
        print(pt)
I have tried changing the condition to
if pt.all == cm.all
but that prints everything, including the one I want to skip.
And then if I turn it into
if pt.all() == cm.all()
that also doesn't work. What is the difference between those two anyway?
Does anyone know how I can fix it so that when pt = [99,99,0] it will skip the operations and go back to the beginning of the loop? Thanks!
You're probably looking for (pt == cm).all(), although if floats are involved np.allclose(pt, cm) is probably a better idea in case you have numerical errors.
(1) pt.all == cm.all
This checks to see whether the two methods are equal:
>>> pt.all
<built-in method all of numpy.ndarray object at 0x16cbbe0>
>>> pt.all == cm.all
False
(2) pt.all() == cm.all()
This checks whether the result of all matches in each case. For example:
>>> pt
array([[99, 99, 0]])
>>> pt.all()
False
>>> cm = np.array([10, 10, 0])
>>> cm.all()
False
>>> pt.all() == cm.all()
True
(3) (pt == cm).all()
This creates an array testing to see whether the two are equal, and returns whether the result is all True:
>>> pt
array([[99, 99, 0]])
>>> cm
array([[99, 99, 0]])
>>> pt == cm
array([[ True, True, True]], dtype=bool)
>>> (pt == cm).all()
True
One downside is that this constructs a temporary array, but often that's not an issue in practice.
Aside: when you're writing nested loops with numpy arrays you've usually made a mistake somewhere. Python-level loops are slow, and so you lose a lot of the benefits you get from using numpy in the first place. But that's a separate issue.
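For what it's worth, here is a hedged sketch of how the nested loops could be vectorized (my illustration; it expresses the skip-cm logic as a boolean mask over all candidate points):
import numpy as np

cm = np.array([99, 99, 0])

# Build every (i, j, 0) point at once instead of looping.
i, j = np.meshgrid(np.arange(95, 103), np.arange(95, 103), indexing='ij')
pts = np.stack([i.ravel(), j.ravel(), np.zeros(i.size, dtype=int)], axis=1)

# Mask out the row equal to cm, then operate on the rest in bulk.
keep = ~(pts == cm).all(axis=1)
for pt in pts[keep]:
    print(pt)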

Python - Intersection of 2D Numpy Arrays

I'm desperately searching for an efficient way to check if two 2D numpy arrays intersect.
So what I have is two arrays, each holding an arbitrary number of vectors, like:
A=np.array([[2,3,4],[5,6,7],[8,9,10]])
B=np.array([[5,6,7],[1,3,4]])
C=np.array([[1,2,3],[6,6,7],[10,8,9]])
All I need is True if there is at least one vector that also occurs in the other array, otherwise False. So it should give results like this:
f(A,B) -> True
f(A,C) -> False
I'm kind of new to Python, and at first I wrote my program with Python lists, which works but of course is very inefficient. The program takes days to finish, so I am working on a numpy.array solution now, but these arrays really are not so easy to handle.
Here's some context about my program and the Python list solution:
What I'm doing is something like a self-avoiding random walk in 3 dimensions (http://en.wikipedia.org/wiki/Self-avoiding_walk). But instead of doing a random walk and hoping that it will reach a desirable length (e.g. I want chains built up of 1000 beads) without reaching a dead end, I do the following:
I create a "flat" chain with the desired length N:
X = []
for i in range(0, N+1):
    X.append((i, 0, 0))
Now I fold this flat chain:

1. randomly choose one of the elements ("pivotelement")
2. randomly choose one direction (either all elements to the left or to the right of the pivotelement)
3. randomly choose one out of 9 possible rotations in space (3 axes * 3 possible rotations: 90°, 180°, 270°)
4. rotate all the elements of the chosen direction with the chosen rotation
5. check if the new elements of the chosen direction intersect with the other direction
6. no intersection -> accept the new configuration, else -> keep the old chain.

Steps 1-6 have to be done a large number of times (e.g. for a chain of length 1000, ~5000 times), so they have to be done efficiently. My list-based solution is the following:
def PivotFold(chain):
    randPiv = random.randint(1, N)  # choose a random pivotelement; N is the chain length
    Pivot = chain[randPiv]          # get that pivotelement
    C = []                          # C is going to be a shifted copy of the chain
    intersect = False
    # Shift the whole chain so the pivotelement sits at the origin,
    # so simple rotations around the origin can be used.
    for j in range(0, N+1):
        C.append((chain[j][0]-Pivot[0], chain[j][1]-Pivot[1], chain[j][2]-Pivot[2]))
    # rotRand chooses a direction and a rotation (2 directions * 9 rotations = 18 possibilities)
    rotRand = random.randint(1, 18)
    # Rotations around the Z-axis
    if rotRand == 1:
        for j in range(randPiv, N+1):
            C[j] = (-C[j][1], C[j][0], C[j][2])
            if C[j] in C[0:randPiv]:
                intersect = True
                break
    elif rotRand == 2:
        for j in range(randPiv, N+1):
            C[j] = (C[j][1], -C[j][0], C[j][2])
            if C[j] in C[0:randPiv]:
                intersect = True
                break
    # ...etc
    if not intersect:  # return C if there was no intersection in C
        Shizz = C
    else:
        Shizz = chain
    return Shizz
The function PivotFold(chain) will be applied to the initially flat chain X a large number of times. It's pretty naively written, so maybe you have some protips to improve it. I thought that numpy arrays would be good because I can efficiently shift and rotate entire chains without looping over all the elements...
This should do it:
In [11]:
def f(arrA, arrB):
    return not set(map(tuple, arrA)).isdisjoint(map(tuple, arrB))
In [12]:
f(A, B)
Out[12]:
True
In [13]:
f(A, C)
Out[13]:
False
In [14]:
f(B, C)
Out[14]:
False
To find an intersection? OK, set sounds like a logical choice.
But numpy.array or list are not hashable? OK, convert them to tuple.
That is the idea.
A numpy way of doing it involves rather unreadable broadcasting:
In [34]:
(A[...,np.newaxis]==B[...,np.newaxis].T).all(1)
Out[34]:
array([[False, False],
[ True, False],
[False, False]], dtype=bool)
In [36]:
(A[...,np.newaxis]==B[...,np.newaxis].T).all(1).any()
Out[36]:
True
Some timeit result:
In [38]:
#Dan's method
%timeit set_comp(A,B)
10000 loops, best of 3: 34.1 µs per loop
In [39]:
#Avoiding lambda will speed things up
%timeit f(A,B)
10000 loops, best of 3: 23.8 µs per loop
In [40]:
#numpy way probably will be slow, unless the size of the array is very big (my guess)
%timeit (A[...,np.newaxis]==B[...,np.newaxis].T).all(1).any()
10000 loops, best of 3: 49.8 µs per loop
Also, the numpy method will be RAM hungry, as the A[...,np.newaxis] == B[...,np.newaxis].T step creates a 3D boolean array of shape (len(A), 3, len(B)).
Using the same idea outlined here, you could do the following:
def make_1d_view(a):
    a = np.ascontiguousarray(a)
    dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(dt).ravel()

def f(a, b):
    return len(np.intersect1d(make_1d_view(a), make_1d_view(b))) != 0

>>> f(A, B)
True
>>> f(A, C)
False
This doesn't work for floating point types (it will not consider +0.0 and -0.0 the same value), and np.intersect1d uses sorting, so it has linearithmic, not linear, performance. You may be able to squeeze out some performance by replicating the source of np.intersect1d in your code and, instead of checking the length of the returned array, calling np.any on the boolean indexing array.
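As a sketch of that last suggestion (my illustration, reusing make_1d_view from above and subject to the same float caveats), one could skip intersect1d and feed the void views to np.in1d, which already returns a boolean array:
import numpy as np

def rows_intersect(a, b):
    # True if any row of a also occurs in b.
    return bool(np.any(np.in1d(make_1d_view(a), make_1d_view(b))))

A = np.array([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
B = np.array([[5, 6, 7], [1, 3, 4]])
print(rows_intersect(A, B))  # True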
You can also get the job done with some np.tile and np.swapaxes business!
def intersect2d(X, Y):
    """
    Find the intersection of two 2D arrays.
    Returns the indices of rows in X that also occur in Y.
    """
    X = np.tile(X[:, :, None], (1, 1, Y.shape[0]))
    Y = np.swapaxes(Y[:, :, None], 0, 2)
    Y = np.tile(Y, (X.shape[0], 1, 1))
    eq = np.all(np.equal(X, Y), axis=1)
    eq = np.any(eq, axis=1)
    return np.nonzero(eq)[0]
To answer the question more specifically, you'd only need to check if the returned array is empty.
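Applied to the example arrays from the question, that check might look like:
print(len(intersect2d(A, B)) > 0)  # True
print(len(intersect2d(A, C)) > 0)  # False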
This should be much faster than the for-loop solution, since it is not O(n^2), but it isn't fully numpythonic. Not sure how better to leverage numpy here:
def set_comp(a, b):
    # use tuples, not frozensets: a frozenset would ignore the order
    # of elements within a row and collapse duplicates
    rows_a = set(map(tuple, a))
    rows_b = set(map(tuple, b))
    return not rows_a.isdisjoint(rows_b)
I think you want True if the two arrays share at least one row. You can use a plain double loop:
def has_common_row(A, B):
    for i in A:
        for j in B:
            if (i == j).all():  # i == j alone is ambiguous for numpy rows
                return True
    return False
This problem can be solved efficiently using the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
len(npi.intersection(A, B)) > 0

Find the number of 1s in the same position in two arrays

I have two lists:
A = [0,0,0,1,0,1]
B = [0,0,1,1,1,1]
I want to find the number of 1s in the same position in both lists.
The answer for these arrays would be 2.
A little shorter and hopefully more pythonic way:
>>> A = [0,0,0,1,0,1]
>>> B = [0,0,1,1,1,1]
>>> x = sum(1 for a, b in zip(A, B) if a == b == 1)
>>> x
2
I'm not an expert in Python, but what is wrong with a simple loop from start to end of the first array?
In C# I would do something like:
int match = 0;
for (int cnt = 0; cnt < A.Count; cnt++)
{
    if (A[cnt] == 1 && B[cnt] == 1) match++;
}
Would that be possible in your language?
Motivated by brief need to be perverse, I offer the following solution:
A = [0,0,0,1,0,1]
B = [0,0,1,1,1,1]
print(len(set(i for i, n in enumerate(A) if n == 1) &
          set(i for i, n in enumerate(B) if n == 1)))
(Drakosha's suggestion is a far more reasonable way to solve this problem. This just demonstrates that one can often look at the same problem in different ways.)
With NumPy (the original answer used from scipy import array; scipy's top-level re-exports of numpy functions are deprecated, so import from numpy directly):
>>> from numpy import array
>>> A = array([0,0,0,1,0,1])
>>> B = array([0,0,1,1,1,1])
>>> B=array([0,0,1,1,1,1])
>>> A==B
array([ True, True, False, True, False, True], dtype=bool)
>>> sum(A==B)
4
>>> A != B
array([False, False,  True, False,  True, False], dtype=bool)
>>> sum(A != B)
2
Note that sum(A != B) counts positions where the arrays differ, which only coincidentally equals 2 here; the direct computation of "both are 1" is sum((A == 1) & (B == 1)).
[A[i]+B[i] for i in range(min([len(A), len(B)]))].count(2)
Basically this just creates a new list which has all the elements of the other two added together. You know there were two 1's if the sum is 2 (assuming only 0's and 1's in the list). Therefore just perform the count operation on 2.
Here comes another method which exploits the fact that the array just contains zeros and ones.
The scalar product of two vectors x and y is sum(x[i] * y[i]); the only terms that are nonzero are those where x[i] == y[i] == 1. Thus, using numpy, for instance:
from numpy import array, dot

x = array([0,0,0,1,0,1])
y = array([0,0,1,1,1,1])
print(dot(x, y))
Simple and nice. This method does n multiplications and n-1 additions; however, there are fast implementations of dot products (scalar products) using SSE, GPGPU, vectorisation, (add your fancy word here).
I timed the numpy-method against this method:
sum(1 for a,b in zip(x,y) if (a==b==1))
and found that for 1000000 loops the numpy version took 2121 ms and the zip method took 9502 ms, so the numpy version is a lot faster.
I did a better analysis of the effectiveness and found that for n elements in the array, the zip method took t1 ms and the dot product took t2 ms per iteration:

elements       zip       dot
       1    0.0030    0.0207
      10    0.0063    0.0230
     100    0.0393    0.0476
    1000    0.3696    0.2932
   10000    7.6144    2.7781
  100000  115.8824   30.1305
From this data one could conclude that if the number of elements in the array is expected (on average) to be more than 350 (or, say, 1000), one should consider using the dot-product method instead.
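A minimal sketch of how such a crossover measurement could be reproduced with timeit (my own harness, not the original benchmark):
import timeit
import numpy as np

for n in (10, 100, 1000, 10000):
    x = np.random.randint(0, 2, n)
    y = np.random.randint(0, 2, n)
    t_zip = timeit.timeit(lambda: sum(1 for a, b in zip(x, y) if a == b == 1), number=100)
    t_dot = timeit.timeit(lambda: np.dot(x, y), number=100)
    # report milliseconds per iteration (total seconds * 1000 / 100 runs)
    print(f"{n:>7}  zip {t_zip * 10:.4f} ms  dot {t_dot * 10:.4f} ms")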
