Finding indices of matches of one array in another array - python

I have two numpy arrays, A and B. A contains unique values and B is a sub-array of A.
Now I am looking for a way to get the index of B's values within A.
For example:
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
# I need a function fun() that:
fun(A,B)
>> 0,6,9

You can use np.in1d with np.nonzero -
np.nonzero(np.in1d(A,B))[0]
If A is sorted, you can also use np.searchsorted, which returns the indices in the order of the values in B -
np.searchsorted(A,B)
For a generic case, when A & B are unsorted arrays, you can bring in the sorter option in np.searchsorted, like so -
sort_idx = A.argsort()
out = sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
I would add in my favorite broadcasting too in the mix to solve a generic case -
np.nonzero(B[:,None] == A)[1]
Sample run -
In [125]: A
Out[125]: array([ 7, 5, 1, 6, 10, 9, 8])
In [126]: B
Out[126]: array([ 1, 10, 7])
In [127]: sort_idx = A.argsort()
In [128]: sort_idx[np.searchsorted(A,B,sorter = sort_idx)]
Out[128]: array([2, 4, 0])
In [129]: np.nonzero(B[:,None] == A)[1]
Out[129]: array([2, 4, 0])
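The argsort + searchsorted idea above can be wrapped into a small helper. This is only a sketch: the name `indices_in` and the check for missing values are my additions, not part of the answer, and it assumes A is non-empty and holds unique values.

```python
import numpy as np

def indices_in(A, B):
    """For each value in B, return its position in A (A unique, possibly unsorted)."""
    A = np.asarray(A)
    B = np.asarray(B)
    sort_idx = A.argsort()
    pos = np.searchsorted(A, B, sorter=sort_idx)
    # searchsorted returns len(A) for values above max(A); clip before indexing
    pos = np.clip(pos, 0, len(A) - 1)
    out = sort_idx[pos]
    # Added safety check (not in the original answer): every value must be found
    if not np.array_equal(A[out], B):
        raise ValueError("some values of B are not present in A")
    return out

print(indices_in([7, 5, 1, 6, 10, 9, 8], [1, 10, 7]))
```

Without the check, a value of B that is missing from A would silently get the index of a neighbouring value.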

Have you tried searchsorted?
A = np.array([1,2,3,4,5,6,7,8,9,10])
B = np.array([1,7,10])
A.searchsorted(B)
# array([0, 6, 9])

Just for completeness: If the values in A are non negative and reasonably small:
lookup = np.empty((np.max(A) + 1), dtype=int)
lookup[A] = np.arange(len(A))
indices = lookup[B]
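One caveat with the lookup table: np.empty leaves the unused slots uninitialized, so a value of B that is not in A returns garbage. A possible variant, assuming non-negative integer values and using -1 as a "not found" marker (the helper name `lookup_indices` and the marker are my own choices):

```python
import numpy as np

def lookup_indices(A, B):
    """Position of each B value in A via a lookup table; -1 where the value is absent."""
    A = np.asarray(A)
    B = np.asarray(B)
    # Table indexed by value; -1 marks values that do not occur in A
    lookup = np.full(A.max() + 1, -1, dtype=np.intp)
    lookup[A] = np.arange(len(A))
    # Only index the table with values that fit inside it
    in_range = (B >= 0) & (B <= A.max())
    out = np.full(B.shape, -1, dtype=np.intp)
    out[in_range] = lookup[B[in_range]]
    return out

print(lookup_indices([7, 5, 1, 6, 10, 9, 8], [1, 10, 7, 3, 42]))
```

The table needs max(A) + 1 entries, so this is only reasonable when the values are small compared to available memory.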

I had the same question these days. However, the timing performance is very critical for me. Therefore, I guess the timing comparison of different solutions may be useful for others.
As Divakar mentioned, you can use np.in1d(A, B) with np.where, np.nonzero. Moreover, you can use the np.in1d(A, B) with np.intersect1d (based on this page). Also, you can use np.searchsorted as another useful approach for sorted arrays.
I want to add another simple solution. You can use a list comprehension. It may take longer than the previous ones. However, if you take advantage of the Numba Python package, it is much less time-consuming.
In [1]: import numpy as np
In [2]: from numba import njit
In [3]: a = np.array([1,2,3,4,5,6,7,8,9,10])
In [4]: b = np.array([1,7,10])
In [5]: np.where(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [6]: np.nonzero(np.in1d(a, b))[0]
...: array([0, 6, 9])
In [7]: np.searchsorted(a, b)
...: array([0, 6, 9])
In [8]: np.searchsorted(a, np.intersect1d(a, b))
...: array([0, 6, 9])
In [9]: [i for i, x in enumerate(a) if x in b]
...: [0, 6, 9]
In [10]: @njit
...: def func(a, b):
...:     return [i for i, x in enumerate(a) if x in b]
In [11]: func(a, b)
...: [0, 6, 9]
Now, let's compare the timing performance of these solutions.
In [12]: %timeit np.where(np.in1d(a, b))[0]
4.26 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit np.nonzero(np.in1d(a, b))[0]
4.39 µs ± 14.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit np.searchsorted(a, b)
800 ns ± 6.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [15]: %timeit np.searchsorted(a, np.intersect1d(a, b))
8.8 µs ± 73.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit [i for i, x in enumerate(a) if x in b]
15.4 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [17]: %timeit func(a, b)
336 ns ± 0.579 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
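A side note on the np.in1d calls used above: NumPy 2.0 deprecates np.in1d in favour of np.isin, which takes the same two arguments here. A minimal sketch of the same lookup with the newer name (np.flatnonzero is simply a shortcut for nonzero(...)[0] on a 1-D mask):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b = np.array([1, 7, 10])

# Boolean mask of the elements of a that occur in b, then the positions of the True entries
idx = np.flatnonzero(np.isin(a, b))
print(idx)
```

As with np.in1d, the indices come back in the order of a, not in the order of b.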

Related

Python numpy split with indices

I'm looking for a numpy equivalent of my suboptimal Python code. The calculation I want to do can be summarized by:
The average of the peak of each section for each row.
Here is the code with a sample array and list of indices. Sections can be of different sizes.
x = np.array([[1, 2, 3, 4],
[5, 6, 7, 8]])
indices = [2]
result = np.empty((1, x.shape[0]))
for i, row in enumerate(x):
    splited = np.array_split(row, indices)
    peak = [np.amax(a) for a in splited]
    result[0, i] = np.average(peak)
Which gives: result = array([[3., 7.]])
What is the optimized numpy way to eliminate both loops?
You could just take off the for loop and use axis instead:
result2 = np.mean([np.max(arr, 1) for arr in np.array_split(x_large, indices, 1)], axis=0)
Output:
array([3., 7.])
Benchmark:
x_large = np.array([[1, 2, 3, 4],
[5, 6, 7, 8]] * 1000)
%%timeit
result = []
for row in x_large:
    splited = np.array_split(row, indices)
    peak = [np.amax(a) for a in splited]
    result.append(np.average(peak))
# 29.9 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.mean([np.max(arr, 1) for arr in np.array_split(x_large, indices, 1)], axis=0)
# 37.4 µs ± 499 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Validation:
np.array_equal(result, result2)
# True
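If the remaining list comprehension over sections is also unwanted, np.maximum.reduceat may be an option. This is an alternative I am adding, not the method from the answer above, and it assumes the split points are strictly increasing and lie strictly between 0 and the row length:

```python
import numpy as np

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])
indices = [2]

# Section start offsets: 0 plus every split point
starts = np.r_[0, indices]
# Per-row maximum of each section, shape (n_rows, n_sections)
peaks = np.maximum.reduceat(x, starts, axis=1)
# Average of the section peaks for each row
result = peaks.mean(axis=1)
print(result)
```

Outside those assumptions reduceat behaves differently from np.array_split (for example with repeated split points), so it is worth validating against the loop version on real data.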

Python 2D array reformating to named dict

I have a 2D array in this format:
arr = [
[100000, 5],
[100060, 3],
[100120, 7],
...
]
I want to reformat it as a dict:
dct = {
    "x_values": [100000, 100060, 100120],
    "y_values": [5, 3, 7]
}
What is the most performant way to do it?
Note: values are not always integer.
I don't know if there is a faster solution:
arr = [[100000, 5],
[100060, 3],
[100120, 7],
]
dict_ = {"x_value": [], "y_value": []}
for x_value, y_value in arr:
    dict_["x_value"].append(x_value)
    dict_["y_value"].append(y_value)
print(dict_)
This is probably faster:
In [19]: %timeit d = dict(zip(('x', 'y'), zip(*arr)))
553 ns ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [18]: d
Out[18]: {'x': (100000, 100060, 100120), 'y': (5, 3, 7)}
Edit: marginally at least
In [38]: %%timeit
...: dict_ = {"x_value": [], "y_value": []}
...: for x_value, y_value in arr:
...:     dict_["x_value"].append(x_value)
...:     dict_["y_value"].append(y_value)
...:
620 ns ± 20 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
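Note that the zip version yields tuples, while the question asks for lists. A small sketch that keeps the zip(*arr) transpose but converts each column to a list; the function name `to_columns` is hypothetical, the key names are taken from the question, and a non-empty arr with two-element rows is assumed:

```python
def to_columns(arr):
    """Transpose a list of [x, y] pairs into a dict of two lists."""
    # zip(*arr) yields one tuple per column; unpacking fails if arr is empty
    x_values, y_values = (list(col) for col in zip(*arr))
    return {"x_values": x_values, "y_values": y_values}

arr = [[100000, 5], [100060, 3.5], [100120, 7]]
print(to_columns(arr))
```

The values are passed through untouched, so non-integer entries are preserved as they are.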

Find largest row in a matrix with numpy (row with highest length)

I have a massive array with rows and columns. Some rows are longer than others. I need to get the maximum row length, that is, the length of the longest row. I wrote a simple function for this, but I wanted it to be as fast as possible, like numpy fast. Currently, it looks like this:
Example array:
values = [
[1,2,3],
[4,5,6,7,8,9],
[10,11,12,13]
]
def values_max_width(values):
    max_width = 1
    for row in values:
        if len(row) > max_width:
            max_width = len(row)
    return max_width
Is there any way to accomplish this with numpy?
In [261]: values = [
...: [1,2,3],
...: [4,5,6,7,8,9],
...: [10,11,12,13]
...: ]
...:
In [262]:
In [262]: values
Out[262]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [10, 11, 12, 13]]
In [263]: def values_max_width(values):
...:     max_width = 1
...:     for row in values:
...:         if len(row) > max_width:
...:             max_width = len(row)
...:     return max_width
...:
In [264]: values_max_width(values)
Out[264]: 6
In [265]: [len(v) for v in values]
Out[265]: [3, 6, 4]
In [266]: max([len(v) for v in values])
Out[266]: 6
In [267]: np.max([len(v) for v in values])
Out[267]: 6
Your loop and the list comprehension are similar in speed, np.max is much slower - it has to first turn the list into an array.
In [268]: timeit max([len(v) for v in values])
656 ns ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [269]: timeit np.max([len(v) for v in values])
13.9 µs ± 181 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [271]: timeit values_max_width(values)
555 ns ± 13 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
If you are starting with a list, it's a good idea to thoroughly test the list implementation. numpy is fast when it is doing compiled array stuff, but creating an array from a list is time consuming.
Making an array directly from values isn't much help. The result is an object dtype array (recent NumPy versions require an explicit dtype=object for ragged input like this):
In [272]: arr = np.array(values)
In [273]: arr
Out[273]:
array([list([1, 2, 3]), list([4, 5, 6, 7, 8, 9]), list([10, 11, 12, 13])],
dtype=object)
Math on such an array is hit-or-miss, and always slower than math on pure numeric arrays. We can iterate on such an array, but that iteration is slower than on a list.
In [275]: values_max_width(arr)
Out[275]: 6
In [276]: timeit values_max_width(arr)
1.3 µs ± 8.27 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Not sure how you can make it faster. I've tried using np.max over the length of each item, but that will take even longer:
import numpy as np
import time
values = []
for k in range(100000):
    values.append(list(np.random.randint(100, size=np.random.randint(1000))))
def timeit(func):
    def wrapper(*args, **kwargs):
        now = time.time()
        retval = func(*args, **kwargs)
        print('{} took {:.5f}s'.format(func.__name__, time.time() - now))
        return retval
    return wrapper
@timeit
def values_max_width(values):
    max_width = 1
    for row in values:
        if len(row) > max_width:
            max_width = len(row)
    return max_width
@timeit
def value_max_width_len(values):
    return np.max([len(l) for l in values])
values_max_width(values)
value_max_width_len(values)
values_max_width took 0.00598s
value_max_width_len took 0.00994s
* Edit *
As @Mstaino suggested, using map does make this code faster:
@timeit
def value_max_width_len(values):
    return max(map(len, values))
values_max_width took 0.00598s
value_max_width_len took 0.00499s
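For reference, a self-contained version of the map-based approach without the timing decorator. The function name `max_row_length` and the `default` argument are my additions; `default` only decides what an empty outer list returns (the original loop returns 1 in that case).

```python
def max_row_length(values, default=0):
    """Length of the longest row in a list of lists (pure Python, no NumPy)."""
    # map(len, ...) avoids building an intermediate list of lengths
    return max(map(len, values), default=default)

values = [
    [1, 2, 3],
    [4, 5, 6, 7, 8, 9],
    [10, 11, 12, 13],
]
print(max_row_length(values))
```

This stays entirely in built-ins, which fits the observation above that converting a list to an array costs more than the computation itself.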

How to find the index of the element in a list that first appears in another given list?

a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
The answer should be 1, because 4 is the first element of a that also appears in b, and its index in a is 1.
Is there any fast code in Python to achieve this?
PS: Actually a is a random permutation and b is a subset of a, but it's represented as a list.
If b is to be seen as a subset (order doesn't matter, all values are present in a), then use min() with a map():
min(map(a.index, b))
This returns the lowest index. This is an O(NK) solution (where N is the length of a, K that of b), but all looping is executed in C code.
Another option is to convert a to a set and use next() on a loop over enumerate():
bset = set(b)
next(i for i, v in enumerate(a) if v in bset)
This is an O(N) solution, but it has a higher constant cost (Python bytecode to execute). Which one is faster depends heavily on the sizes of a and b.
For the small input example in the question, min(map(...)) wins:
In [86]: a = [3, 4, 2, 1, 7, 6, 5]
...: b = [4, 6]
...:
In [87]: %timeit min(map(a.index, b))
...:
608 ns ± 64.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [88]: bset = set(b)
...:
In [89]: %timeit next(i for i, v in enumerate(a) if v in bset)
...:
717 ns ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
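Both variants from this answer, written as functions so they can be compared on other inputs. The function names are hypothetical, and the `None` fallback in the set version is my addition for the case where nothing matches (the original next() call would raise StopIteration instead).

```python
def first_index_map(a, b):
    """O(N*K): look up every element of b in a and keep the lowest index."""
    # a.index raises ValueError if an element of b is missing from a
    return min(map(a.index, b))

def first_index_set(a, b):
    """O(N): scan a once and stop at the first element that is in b."""
    bset = set(b)
    return next((i for i, v in enumerate(a) if v in bset), None)

a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
print(first_index_map(a, b), first_index_set(a, b))
```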
In one line :
print("".join([str(index) for item in b for index,item1 in enumerate(a) if item==item1][:1]))
output:
1
In detail :
a = [3, 4, 2, 1, 7, 6, 5]
b = [4, 6]
new=[]
for item in b:
    for index, item1 in enumerate(a):
        if item == item1:
            new.append(index)
print("".join([str(x) for x in new[:1]]))
For a small B, the set approach is output-dependent: its execution time grows linearly with the index it returns. NumPy can provide a better solution in this case.
N=10**6
A=np.unique(np.random.randint(0,N,N))
np.random.shuffle(A)
B=A[:3].copy()
np.random.shuffle(A)
def find(A,B):
    pos=np.in1d(A,B).nonzero()[0]
    return pos[A[pos].argsort()][B.argsort().argsort()].min()
def findset(A,B):
    bset = set(B)
    return next(i for i, v in enumerate(A) if v in bset)
#In [29]: find(A,B)==findset(A,B)
#Out[29]: True
#In [30]: %timeit findset(A,B)
# 63.5 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#
# In [31]: %timeit find(A,B)
# 2.24 ms ± 52.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
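Since only the minimum index is returned, the reordering inside find() does not seem to be needed: permuting pos cannot change its minimum. A shorter sketch under that reading, using np.isin (the newer name for np.in1d); the function name `find_first` is hypothetical and at least one match is assumed:

```python
import numpy as np

def find_first(A, B):
    """Lowest index in A whose value also occurs in B (assumes at least one match)."""
    # flatnonzero returns the matching positions in ascending order, so the first is the minimum
    return np.flatnonzero(np.isin(A, B))[0]

A = np.array([3, 4, 2, 1, 7, 6, 5])
B = np.array([6, 4])
print(find_first(A, B))
```

I have not timed this variant against the two functions above.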
