Check if arrays share common elements - python

I want to check if two arrays share at least one common element. For two arrays of equal size I can do the following:
import numpy as np
A = np.array([0,1,2,3,4])
B = np.array([5,6,7,8,9])
print(np.isin(A,B).any())
False
In my task, however, I want to do this over a 2d array of variable size. Example:
A = np.array([[0,1,2,3,4],[3,4,5], [2,4,7], [12,14]])
B = np.array([5,6,7,8,9])
function(A,B)
should return:
[False, True, True, False]
How can this task be performed efficiently?

A = np.array([[0,1,2,3,4], [3,4,5], [2,4,7], [12,14]])
B = np.array([5,6,7,8,9])
result = [np.isin(x, B).any() for x in A]
This might be what you're looking for.

Solution without a loop:
import numpy as np
A = np.array([[0,1,2,3,4],[3,4,5], [2,4,7], [12,14]], dtype=object)
B = np.array([5,6,7,8,9])
result = np.intersect1d(np.hstack(A), B)
print(result)
Prints:
[5 7]

Related

Why aren't my matrices being correctly compared? [duplicate]

What is the simplest way to compare two NumPy arrays for equality (where equality is defined as: A = B iff for all indices i: A[i] == B[i])?
Simply using == gives me a boolean array:
>>> numpy.array([1,1,1]) == numpy.array([1,1,1])
array([ True, True, True], dtype=bool)
Do I have to and the elements of this array to determine if the arrays are equal, or is there a simpler way to compare?
(A==B).all()
test if all values of array (A==B) are True.
Note: maybe you also want to test A and B shape, such as A.shape == B.shape
Special cases and alternatives (from dbaupp's answer and yoavram's comment)
It should be noted that:
this solution can have a strange behavior in a particular case: if either A or B is empty and the other one contains a single element, then it return True. For some reason, the comparison A==B returns an empty array, for which the all operator returns True.
Another risk is if A and B don't have the same shape and aren't broadcastable, then this approach will raise an error.
In conclusion, if you have a doubt about A and B shape or simply want to be safe: use one of the specialized functions:
np.array_equal(A,B) # test if same shape, same elements values
np.array_equiv(A,B) # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if same shape, elements have close enough values
The (A==B).all() solution is very neat, but there are some built-in functions for this task. Namely array_equal, allclose and array_equiv.
(Although, some quick testing with timeit seems to indicate that the (A==B).all() method is the fastest, which is a little peculiar, given it has to allocate a whole new array.)
If you want to check if two arrays have the same shape AND elements you should use np.array_equal as it is the method recommended in the documentation.
Performance-wise don't expect that any equality check will beat another, as there is not much room to optimize comparing two elements. Just for the sake, i still did some tests.
import numpy as np
import timeit
A = np.zeros((300, 300, 3))
B = np.zeros((300, 300, 3))
C = np.ones((300, 300, 3))
timeit.timeit(stmt='(A==B).all()', setup='from __main__ import A, B', number=10**5)
timeit.timeit(stmt='np.array_equal(A, B)', setup='from __main__ import A, B, np', number=10**5)
timeit.timeit(stmt='np.array_equiv(A, B)', setup='from __main__ import A, B, np', number=10**5)
> 51.5094
> 52.555
> 52.761
So pretty much equal, no need to talk about the speed.
The (A==B).all() behaves pretty much as the following code snippet:
x = [1,2,3]
y = [1,2,3]
print all([x[i]==y[i] for i in range(len(x))])
> True
Let's measure the performance by using the following piece of code.
import numpy as np
import time
exec_time0 = []
exec_time1 = []
exec_time2 = []
sizeOfArray = 5000
numOfIterations = 200
for i in xrange(numOfIterations):
A = np.random.randint(0,255,(sizeOfArray,sizeOfArray))
B = np.random.randint(0,255,(sizeOfArray,sizeOfArray))
a = time.clock()
res = (A==B).all()
b = time.clock()
exec_time0.append( b - a )
a = time.clock()
res = np.array_equal(A,B)
b = time.clock()
exec_time1.append( b - a )
a = time.clock()
res = np.array_equiv(A,B)
b = time.clock()
exec_time2.append( b - a )
print 'Method: (A==B).all(), ', np.mean(exec_time0)
print 'Method: np.array_equal(A,B),', np.mean(exec_time1)
print 'Method: np.array_equiv(A,B),', np.mean(exec_time2)
Output
Method: (A==B).all(), 0.03031857
Method: np.array_equal(A,B), 0.030025185
Method: np.array_equiv(A,B), 0.030141515
According to the results above, the numpy methods seem to be faster than the combination of the == operator and the all() method and by comparing the numpy methods the fastest one seems to be the numpy.array_equal method.
Usually two arrays will have some small numeric errors,
You can use numpy.allclose(A,B), instead of (A==B).all(). This returns a bool True/False
Now use np.array_equal. From documentation:
np.array_equal([1, 2], [1, 2])
True
np.array_equal(np.array([1, 2]), np.array([1, 2]))
True
np.array_equal([1, 2], [1, 2, 3])
False
np.array_equal([1, 2], [1, 4])
False
On top of the other answers, you can now use an assertion:
numpy.testing.assert_array_equal(x, y)
You also have similar function such as numpy.testing.assert_almost_equal()
https://numpy.org/doc/stable/reference/generated/numpy.testing.assert_array_equal.html
Just for the sake of completeness. I will add the
pandas approach for comparing two arrays:
import numpy as np
a = np.arange(0.0, 10.2, 0.12)
b = np.arange(0.0, 10.2, 0.12)
ap = pd.DataFrame(a)
bp = pd.DataFrame(b)
ap.equals(bp)
True
FYI: In case you are looking of How to
compare Vectors, Arrays or Dataframes in R.
You just you can use:
identical(iris1, iris2)
#[1] TRUE
all.equal(array1, array2)
#> [1] TRUE

Check whether array elements of array are present in another array in python

I have two arrays as:
a = [[1,2,3,4],[3,1,2,4],[4,3,1,2],...] and b = [[4,1,2,3],[1,2,3,4],[3,2,4,1]....]
I want to return a True for every row element of a if it is found in b. For the visible part of the above example the result should be c = [True, False, False]. Numpy solution is welcomed.
The most naive solution:
[(x in b) for x in a]
#[True, False, False]
A more efficient solution (works much faster for large lists because sets have constant lookup time but lists have linear lookup time):
set_b = set(map(tuple, b)) # Convert b into a set of tuples
[(x in set_b) for x in map(tuple,a)]
#[True, False, False]
You can use numpy for this:
import numpy as np
a = np.array([[1,2,3,4],[3,1,2,4],[4,3,1,2]])
b = np.array([[4,1,2,3],[1,2,3,4],[3,2,4,1]])
(a[:, None] == b).all(-1).any(-1)
out: array([ True, False, False])

Python: check if every element in a list exists a specific time

I have the following question. I have a list of ranges like this:
parameterRanges2 = [(1,5),(1,5),(1,7),(1,7),(0,10),(1,20),(1,3),(0,1)]
And I have a numpy array like this :
arr = np.array([[2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0],
[4.0,2.0,3.0,4.0,2.0,4.0,5.0,1.0,2.0,4.0,1.0,3.0,4.0,2.0,3.0,5.0,1.0,3.0,4.0,2.0],
[2.0,3.0,4.0,6.0,7.0,1.0,2.0,3.0,5.0,6.0,1.0,2.0,4.0,5.0,6.0,2.0,3.0,4.0,5.0,6.0],
[6.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,2.0,2.0,3.0,4.0,5.0],
[8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,7.0,8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0],
[11.0,13.0,14.0,16.0,17.0,19.0,1.0,3.0,4.0,6.0,7.0,9.0,10.0,11.0,13.0,14.0,16.0,17.0,19.0,1.0],
[1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0],
[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0]])
Every parameterRange tuple in the list corresponds with the subarray in the numpy array. Is there a way to check if all elements in the corresponding range exists at least one time? so for example that in the first sublist in the numpy array all numbers 1,2,3,4,5 exists at least one time, in the second sublists exists one time and in the third list for example the numbers 1,2,3,4,5,6,7 exists one time and so on.
Exploiting that the ranges are integer we can give an O(nm) solution, nxm being the shape of arr. The algo works as follows:
discard all non-int elements and all that are outside their range
use np.add.at to efficiently (O(mn)) generate bincounts for in-range numbers
count the above threshold bins in each row and compare to the range
.
import numpy as np
parameterRanges2 = np.array([(1,5),(1,5),(1,7),(1,7),(0,10),(1,20),(1,3),(0,1)])
arr = np.array([[2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0],
[4.0,2.0,3.0,4.0,2.0,4.0,5.0,1.0,2.0,4.0,1.0,3.0,4.0,2.0,3.0,5.0,1.0,3.0,4.0,2.0],
[2.0,3.0,4.0,6.0,7.0,1.0,2.0,3.0,5.0,6.0,1.0,2.0,4.0,5.0,6.0,2.0,3.0,4.0,5.0,6.0],
[6.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,2.0,2.0,3.0,4.0,5.0],
[8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,7.0,8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0],
[11.0,13.0,14.0,16.0,17.0,19.0,1.0,3.0,4.0,6.0,7.0,9.0,10.0,11.0,13.0,14.0,16.0,17.0,19.0,1.0],
[1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0],
[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0]])
min_occ = 2
dp = np.diff(parameterRanges2, axis=-1)
m = np.zeros((parameterRanges2.shape[0], np.max(dp) + 2), dtype=int)
arr = arr - parameterRanges2[:, :1]
ia = arr.astype(int)
idx = np.where((arr==ia) & (ia>=0) & (ia<=dp), ia, -1)
np.add.at(m, (np.arange(parameterRanges2.shape[0])[:, None], idx), 1)
res = (m[:, :-1] >= min_occ).sum(axis=-1) == dp.ravel() + 1
print(res)
Output:
[ True True False True False False True True]
There may be a more efficient way just using Numpy functions, but the code below works. I can't think of a simple Numpy way to do it since we can't make a standard Numpy array of ranges, since all the row aren't the same length.
import numpy as np
arr = np.array([
[2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0,1.0,3.0,2.0,4.0,2.0,4.0,3.0,5.0],
[4.0,2.0,3.0,4.0,2.0,4.0,5.0,1.0,2.0,4.0,1.0,3.0,4.0,2.0,3.0,5.0,1.0,3.0,4.0,2.0],
[2.0,3.0,4.0,6.0,7.0,1.0,2.0,3.0,5.0,6.0,1.0,2.0,4.0,5.0,6.0,2.0,3.0,4.0,5.0,6.0],
[6.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,2.0,2.0,3.0,4.0,5.0],
[8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,7.0,8.0,9.0,0.0,1.0,2.0,3.0,4.0,5.0,6.0],
[11.0,13.0,14.0,16.0,17.0,19.0,1.0,3.0,4.0,6.0,7.0,9.0,10.0,11.0,13.0,14.0,16.0,17.0,19.0,1.0],
[1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,3.0,3.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0],
[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0]
])
parameterRanges2 = [(1,5),(1,5),(1,7),(1,7),(0,10),(1,20),(1,3),(0,1)]
ranges = [np.arange(u, v+1, dtype='float64') for u, v in parameterRanges2]
print([np.all(np.isin(u,v)) for u, v in zip(ranges, arr)])
output
[True, True, True, True, False, False, True, True]

numpy.where for 2+ specific values

Can the numpy.where function be used for more than one specific value?
I can specify a specific value:
>>> x = numpy.arange(5)
>>> numpy.where(x == 2)[0][0]
2
But I would like to do something like the following. It gives an error of course.
>>> numpy.where(x in [3,4])[0][0]
[3,4]
Is there a way to do this without iterating through the list and combining the resulting arrays?
EDIT: I also have a lists of lists of unknown lengths and unknown values so I cannot easily form the parameters of np.where() to search for multiple items. It would be much easier to pass a list.
You can use the numpy.in1d function with numpy.where:
import numpy
numpy.where(numpy.in1d(x, [2,3]))
# (array([2, 3]),)
I guess np.in1d might help you, instead:
>>> x = np.arange(5)
>>> np.in1d(x, [3,4])
array([False, False, False, True, True], dtype=bool)
>>> np.argwhere(_)
array([[3],
[4]])
If you only need to check for a few values you can:
import numpy as np
x = np.arange(4)
ret_arr = np.where([x == 1, x == 2, x == 4, x == 0])[1]
print "Ret arr = ",ret_arr
Output:
Ret arr = [1 2 0]

adding rows in python to table

I know in R there is a rbind function:
list = c(1,2,3)
blah = NULL
blah = rbind(blah,list)
How would I replicate this in python? I know you could possible write:
a = NULL
b= array([1,2,3])
for i in range(100):
a = np.vstack((a,b))
but I am not sure what to write in the a=NULL spot. I am essentially looping and adding rows to a table. What's the most efficient way to do this?
In numpy, things will be more efficient if you first pre-allocate space and then loop to fill that space than if you dynamically create successively larger arrays. If, for instance, the size is 500, you would:
a = np.empty((500, b.shape[0]))
Then, loop and enter values as needed:
for i in range(500):
a[i,:] = ...
Note that if you really just want to repeat b 500 times, you can just do:
In [1]: import numpy as np
In [2]: b = np.array([1,2,3])
In [3]: a = np.empty((500, b.shape[0]))
In [4]: a[:] = b
In [5]: a[0,:] == b
Out[5]: array([ True, True, True], dtype=bool)
To answer your question exactly
a = []
b= np.array([1,2,3])
for i in xrange(100):
a.append(b)
a = np.vstack( tuple(a) )
The tuple function casts an iterable (in this case a list) as a tuple object, and np.vstack takes a tuple of numpy arrays as argument.

Categories