Merging sub-lists that share common elements - Python

I need to merge sub-lists and already have a function that works, but when the number of sub-lists is large the merging becomes unbearably slow, so I wonder if there is a more efficient way.
Merge condition: sub-lists that contain an identical number are combined. Thank you
Simple merge:
[7,8,9] = [7,8]+[8,9] # shared number 8
Cascading merges:
[1,2,3,4] = [1,2,3]+[3,4] # shared number 3
[3,4,5,6] = [3,4]+[4,5,6] # shared number 4
[1,2,3,4,5,6] = [1,2,3]+[3,4,5,6] # shared number 3
Function:
a = [ [1,2,3],[4,5,6],[3,4],[7,8],[8,9],[6,12,13] ]
b = len(a)
for i in range(b):
    for j in range(b):
        x = list(set(a[i]+a[j]))
        y = len(a[j])+len(a[i])
        if i == j or a[i] == 0 or a[j] == 0:
            break
        elif len(x) < y:
            a[i] = x
            a[j] = [0]
print(a)
print([i for i in a if i != [0]])
Result:
[[8, 9, 7], [1, 2, 3, 4, 5, 6, 12, 13]]
The above is just an example; in the actual calculation every sub-list has a length of only 2, e.g.
a = [[1,3],[5,6],[3,4],[7,8],[8,9],[12,13]]
and I need to process far more data than this. Here is some simulated data:
import numpy as np

a = np.random.rand(150,150) > 0.99
a[np.tril_indices(a.shape[1], -1)] = 0
a[np.diag_indices(a.shape[1])] = 0
a = [list(x) for x in np.c_[np.where(a)]]
consolidate(a)

I think your algorithm is close to optimal, except that the inner loop can be shortened, because the intersection check is symmetric: if you have already checked that (A, B) intersect, there is no need to check (B, A).
This roughly halves the work, from n² pair checks to n * (n / 2) (still O(n²), but with half the constant).
However, I would rewrite the piece of code more cleanly and I would also avoid modifying the input.
Note also that, since sets do not guarantee ordering, it is a good idea to sort before converting back to lists.
Here is my proposed code (EDITED to reduce the number of castings and sortings):
def consolidate(items):
    items = [set(item.copy()) for item in items]
    for i, x in enumerate(items):
        for j, y in enumerate(items[i + 1:]):
            if x & y:
                items[i + j + 1] = x | y
                items[i] = None
    return [sorted(x) for x in items if x]
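A quick usage sketch with the question's sample input (the ordering of the returned groups depends on where the merges land, so it may vary):

consolidate([[1,2,3], [4,5,6], [3,4], [7,8], [8,9], [6,12,13]])
# [[7, 8, 9], [1, 2, 3, 4, 5, 6, 12, 13]]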
Encapsulating your code in a function, I would get:
def consolidate_orig(a):
    a = [x.copy() for x in a]
    b = len(a)
    for i in range(b):
        for j in range(b):
            x = list(set(a[i]+a[j]))
            y = len(a[j])+len(a[i])
            if i == j or a[i] == 0 or a[j] == 0:
                break
            elif len(x) < y:
                a[i] = x
                a[j] = [0]
    return [i for i in a if i != [0]]
This would allow us to do some clean micro-benchmarking (for completeness I have included also @zipa's merge()):
EDIT:
@zipa's code is not properly encapsulated, here is an equivalent version with proper encapsulation:
def merge(iterable, base=None):
    if base is None:
        base = iterable
    merged = set([tuple(set(i).union(
        *[j for j in base if set(i).intersection(j)])) for i in iterable])
    if merged == iterable:
        return merged
    else:
        return merge(merged, base)
and updated timings:
in_list = [[1,2,3], [4,5,6], [3,4], [7,8], [8,9], [6,12,13]]
%timeit consolidate_orig(in_list)
# 17.9 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit consolidate(in_list)
# 6.15 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit merge(in_list)
# 53.6 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
in_list = [[1, 3], [5, 6], [3, 4], [7, 8], [8, 9], [12, 13]]
%timeit consolidate_orig(in_list)
# 16.1 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit consolidate(in_list)
# 5.87 µs ± 71.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit merge(in_list)
# 27 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Showing that, at least for this input, the proposed solution is consistently faster.
Since it is not too straightforward to generate large meaningful inputs, I'll leave it to you to check that this is more efficient than your approach for the larger inputs you have in mind.
EDIT
With larger, but probably meaningless inputs, the timings are still favorable for the proposed version:
in_list = [[1,2,3], [4,5,6], [3,4], [7,8], [8,9], [6,12,13]] * 300
%timeit consolidate_orig(in_list)
# 1.04 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit consolidate(in_list)
# 724 ms ± 7.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit merge(in_list)
# 1.04 s ± 7.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
in_list = [[1, 3], [5, 6], [3, 4], [7, 8], [8, 9], [12, 13]] * 300
%timeit consolidate_orig(in_list)
# 1.03 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit consolidate(in_list)
# 354 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit merge(in_list)
# 967 ms ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This approach should perform faster on larger nested lists:
def merge(iterable):
    merged = set([tuple(set(i).union(*[j for j in a if set(i).intersection(j)])) for i in iterable])
    if merged == iterable:
        return merged
    else:
        return merge(merged)
merge(a)
#set([(1, 2, 3, 4, 5, 6, 12, 13), (8, 9, 7)])
It recursively combines lists until all the combinations are exhausted.
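Neither answer above uses it, but for very large inputs a disjoint-set (union-find) structure is a common way to consolidate overlapping groups in close to linear time. A minimal sketch, with hypothetical names, not taken from either answer:

from collections import defaultdict

def consolidate_uf(items):
    # union-find over the numbers appearing in the sub-lists
    parent = {}

    def find(x):
        # follow parents to the root, compressing the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    for item in items:
        for x in item:
            parent.setdefault(x, x)
        # linking consecutive elements is enough to link the whole sub-list
        for x, y in zip(item, item[1:]):
            union(x, y)

    groups = defaultdict(set)
    for x in parent:
        groups[find(x)].add(x)
    return [sorted(g) for g in groups.values()]

Sub-lists that share any number end up under the same root, so each root's group is one merged list.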

Related

Python - Get individual elements of dataframe column

I have a dataframe with a column whose values are lists. I want to extract the individual elements of every list in the column. So given this input dataframe:
A
0 [5, 4, 3, 6]
1 [7, 8, 9, 6]
The intended output should be a list:
[5, 4, 3, 6, 7, 8, 9, 6]
You can use a list comprehension to flatten:
a = [y for x in df.A for y in x]
Or use itertools.chain:
from itertools import chain
a = list(chain.from_iterable(df.A))
Or use numpy.concatenate:
a = np.concatenate(df.A).tolist()
Or Series.explode, working for pandas 0.25+:
a = df.A.explode().tolist()
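For reference, a minimal end-to-end sketch of the explode option with the sample data (assuming pandas 0.25+):

import pandas as pd

df = pd.DataFrame({'A': [[5, 4, 3, 6], [7, 8, 9, 6]]})
a = df.A.explode().tolist()
print(a)  # [5, 4, 3, 6, 7, 8, 9, 6]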
Performance with sample data for 100k rows:
df = pd.DataFrame({
    'A': [[5, 4, 3, 6], [7, 8, 9, 6]] * 50000})
print(df)
In [263]: %timeit [y for x in df.A for y in x]
37.7 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit list(chain.from_iterable(df.A))
27.3 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [265]: %timeit np.concatenate(df.A).tolist()
1.71 s ± 86.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [266]: %timeit df.A.explode().tolist()
207 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#ansev1
In [267]: %timeit np.hstack(df['A']).tolist()
328 ms ± 6.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Generate a list of zeros and ones by comparing two different lists of integers in python

I recently got started in Python and have the following problem, which I will try my best to explain in words:
I have two different lists as below:
list_a = [1,2,3,4,5]
list_b = [[2,5],[1,4]]
I would like to compare both lists and generate a third list such that, for each number in list_a, a one is generated if it appears in the corresponding sub-list of list_b, and a zero is generated where there is no match.
The length of each list in my output list should equal the length of list_a (i.e. length 5, with a one where there is a match and a zero where there is not).
Therefore the output list that I am seeking should be as follows:
out = [[0,1,0,0,1],[1,0,0,1,0]]
Would greatly appreciate if you could help me out. Thanks!
Use nested list comprehensions
[[int(el in list_b_el) for el in list_a] for list_b_el in list_b]
result
[[0, 1, 0, 0, 1], [1, 0, 0, 1, 0]]
Multiple loops in a list comprehension can get a bit confusing to read, so it's easier to write the loops out for readability:
result = []
for b in list_b:
    sublist = []
    for a in list_a:
        if a in b:
            sublist.append(1)
        else:
            sublist.append(0)
    result.append(sublist)
You can create a five-zeros list for every sublist in list_b and then iterate only over the numbers in each sublist, using them as indices to switch those zeros to 1:
list_b = [[2,5],[1,4]]
out = []
for lb in list_b:
    out.append([0]*5)
    for idx in lb:
        out[-1][idx-1] = 1
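This variant relies on list_a being exactly [1, 2, 3, 4, 5], so each value can serve directly as a one-based index; a quick check with the sample data:

print(out)
# [[0, 1, 0, 0, 1], [1, 0, 0, 1, 0]]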
Performance
If anybody is interested in execution speed, here's a glance at the timings:
def pythonic():
    [[int(el in list_b_el) for el in list_a] for list_b_el in list_b]

def Tyger():
    result = []
    for b in list_b:
        sublist = []
        for a in list_a:
            if a in b:
                sublist.append(1)
            else:
                sublist.append(0)
        result.append(sublist)

def SpghttCd():
    out = []
    for lb in list_b:
        out.append([0]*5)
        for idx in lb:
            out[-1][idx-1] = 1
list_a = [1,2,3,4,5]
list_b = [[2,5],[1,4]]
%timeit pythonic()
%timeit Tyger()
%timeit SpghttCd()
3.03 µs ± 147 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.63 µs ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.02 µs ± 9.64 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Now with these short sublists in list_b the functions SpghttCd and Tyger do far fewer iterations than pythonic, so a worst-case trial:
list_a = [1,2,3,4,5]
list_b = [[1,2,3,4,5],[1,2,3,4,5]]
%timeit pythonic()
%timeit Tyger()
%timeit SpghttCd()
3.03 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.8 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
1.53 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

sum( condition ) equivalent in python numpy

I'm trying to convert a piece of MATLAB code to Python.
a = [1 2 3; 4 5 6]
b = sum(a<5)
% output:
% ans =
%     2  1  1
It returns the number of elements in every column that satisfy the condition.
Is there an equivalent function in numpy (Python) to do this?
It's the same.
a=np.array([[1, 2, 3],[4, 5, 6]])
b=np.sum(a<5,axis=0) # the only difference is that you need to explicitly set the dimension
Although not made for this purpose, an alternate solution would be
a=np.array([[1, 2, 3],[4, 5, 6]])
np.count_nonzero(a<5, axis=0)
# array([2, 1, 1])
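Both produce the same per-column counts for the example above (a quick check, assuming numpy is imported as np):

a = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sum(a < 5, axis=0))            # [2 1 1]
print(np.count_nonzero(a < 5, axis=0))  # [2 1 1]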
Performance
For small arrays, np.sum seems to be slightly faster
x = np.repeat([1, 2, 3], 100)
y = np.repeat([4, 5, 6], 100)
a=np.array([x,y])
%timeit np.sum(a<5, axis=0)
# 7.18 µs ± 669 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.count_nonzero(a<5, axis=0)
# 11.8 µs ± 386 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For very large arrays, np.count_nonzero seems to be slightly faster
x = np.repeat([1, 2, 3], 5000000)
y = np.repeat([4, 5, 6], 5000000)
a=np.array([x,y])
%timeit np.sum(a<5, axis=0)
# 126 ms ± 6.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.count_nonzero(a<5, axis=0)
# 100 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The fastest way to create np.arrays for each item from list of tuples

There is a list of tuples l = [(x,y,z), (x,y,z), (x,y,z)].
The idea is to find the fastest way to create separate np.arrays for the x-s, y-s and z-s. I need help finding the fastest solution to do it. For the speed comparison I use the code attached below:
import time

def myfast():
    code

n = 1000000
t0 = time.time()
for i in range(n): myfast()
t1 = time.time()
total_n = t1-t0
1. np.array([i[0] for i in l])
np.array([i[1] for i in l])
np.array([i[2] for i in l])
output: 0.9980638027191162
2. array_x = np.zeros((len(l), 1), dtype="float")
array_y = np.zeros((len(l), 1), dtype="float")
array_z = np.zeros((len(l), 1), dtype="float")
for i, zxc in enumerate(l):
    array_x[i] = zxc[0]
    array_y[i] = zxc[1]
    array_z[i] = zxc[2]
output 5.5509934425354
3. [np.array(x) for x in zip(*l)]
output 2.5070037841796875
5. array_x, array_y, array_z = np.array(list(zip(*l)))
output 2.725318431854248
There are some really good options in here, so I summarized them and compared speed:
import numpy as np
def f1(input_data):
    array_x = np.array([elem[0] for elem in input_data])
    array_y = np.array([elem[1] for elem in input_data])
    array_z = np.array([elem[2] for elem in input_data])
    return array_x, array_y, array_z

def f2(input_data):
    array_x = np.zeros((len(input_data), ), dtype="float")
    array_y = np.zeros((len(input_data), ), dtype="float")
    array_z = np.zeros((len(input_data), ), dtype="float")
    for i, elem in enumerate(input_data):
        array_x[i] = elem[0]
        array_y[i] = elem[1]
        array_z[i] = elem[2]
    return array_x, array_y, array_z

def f3(input_data):
    return [np.array(elem) for elem in zip(*input_data)]

def f4(input_data):
    return np.array(list(zip(*input_data)))

def f5(input_data):
    return np.array(input_data).transpose()

def f6(input_data):
    array_all = np.array(input_data)
    array_x = array_all[:, 0]
    array_y = array_all[:, 1]
    array_z = array_all[:, 2]
    return array_x, array_y, array_z
First I asserted that all of them return the same data (using np.array_equal()):
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for array_list in zip(f1(data), f2(data), f3(data), f4(data), f5(data), f6(data)):
    # print()
    # for i, arr in enumerate(array_list):
    #     print('array from function', i+1)
    #     print(arr)
    for i, arr in enumerate(array_list[:-1]):
        assert np.array_equal(arr, array_list[i+1])
And the time comparison:
import timeit

for f in [f1, f2, f3, f4, f5, f6]:
    t = timeit.timeit('f(data)', 'from __main__ import data, f', number=100000)
    print('{:5s} {:10.4f} seconds'.format(f.__name__, t))
gives these results:
data = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] # 3 tuples
timeit number=100000
f1 0.3184 seconds
f2 0.4013 seconds
f3 0.2826 seconds
f4 0.2091 seconds
f5 0.1732 seconds
f6 0.2159 seconds
data = [(1, 2, 3) for _ in range(10**6)] # 1 millon tuples
timeit number=10
f1 2.2168 seconds
f2 2.8657 seconds
f3 2.0150 seconds
f4 1.9790 seconds
f5 2.6380 seconds
f6 2.6586 seconds
making f5() the fastest option for short input and f4() the fastest option for big input.
If the number of elements in each tuple is more than 3, then only 3 of the functions apply to that case (the others are hardcoded for 3 elements per tuple):
data = [tuple(range(10**4)) for _ in range(10**3)]
timeit number=10
f3 11.8396 seconds
f4 13.4672 seconds
f5 4.6251 seconds
making f5() again the fastest option for these criteria.
You could try:
import numpy
array_x, array_y, array_z = numpy.array(list(zip(*l)))
or just:
numpy.array(list(zip(*l)))
and a more elegant way:
numpy.array(l).transpose()
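A quick sketch of what the transpose variant unpacks to, with a small made-up input:

import numpy as np

l = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
array_x, array_y, array_z = np.array(l).transpose()
print(array_x)  # [1 4 7]
print(array_y)  # [2 5 8]
print(array_z)  # [3 6 9]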
Maybe I am missing something, but why not just pass the list of tuples directly to np.array? Say, if:
n = 100
l = [(0, 1, 2) for _ in range(n)]
arr = np.array(l)
x = arr[:, 0]
y = arr[:, 1]
z = arr[:, 2]
Btw, I prefer to use the following to time code:
from timeit import default_timer as timer
t0 = timer()
do_heavy_calculation()
print("Time taken [sec]:", timer() - t0)
I believe most (but not all) of the ingredients of this answer are actually present in the other answers, but in none of the answers so far have I seen an apples-to-apples comparison, in the sense that some approaches were not returning a list of np.ndarray objects, but rather a (convenient, in my opinion) single np.ndarray().
It is not clear whether this is acceptable to you, so I am adding proper code for this.
Besides that, the performances may differ because in some cases you are adding an extra step, while in others you may not need to create large objects (which could reside in different memory pages).
In the end, for smaller inputs (3 x 10), the list of np.ndarray()s is just some additional burden that adds up significantly to the timing.
For larger inputs (3 x 1000) and above, the extra computation is no longer significant, and an approach involving comprehensions that avoids the creation of a large numpy array can become as fast as (or even faster than) the fastest methods for smaller inputs.
Also, all the code I present works for arbitrary sizes of the tuples/lists (as long as the inner tuples all have the same size, of course).
(EDIT: added a comment on the final results)
The tested methods are:
import numpy as np
def to_arrays_zip(items):
    return np.array(list(zip(*items)))

def to_arrays_transpose(items):
    return np.array(items).transpose()

def to_arrays_zip_split(items):
    return [arr for arr in np.array(list(zip(*items)))]

def to_arrays_transpose_split(items):
    return [arr for arr in np.array(items).transpose()]

def to_arrays_comprehension(items):
    return [np.array([items[i][j] for i in range(len(items))]) for j in range(len(items[0]))]

def to_arrays_comprehension2(items):
    return [np.array([item[j] for item in items]) for j in range(len(items[0]))]
(This is a convenience function to check that the results are the same.)
def test_equal(items1, items2):
    return all(np.all(x == y) for x, y in zip(items1, items2))
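For example, with a small hypothetical input it confirms that two of the methods agree:

ll = [(1, 2, 3), (4, 5, 6)]
print(test_equal(to_arrays_zip_split(ll), to_arrays_comprehension2(ll)))  # True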
For small inputs:
N = 3
M = 10
ll = [tuple(range(N)) for _ in range(M)]
print(to_arrays_comprehension2(ll))
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 2.82 µs ± 28 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose(ll)
# 3.18 µs ± 30 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 3.71 µs ± 47 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_transpose_split(ll)
# 3.97 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension(ll)
# 5.91 µs ± 96.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit to_arrays_comprehension2(ll)
# 5.14 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Where the podium is:
to_arrays_zip_split() (the non-_split if you are OK with a single array)
to_arrays_transpose_split() (the non-_split if you are OK with a single array)
to_arrays_comprehension2()
For somewhat larger inputs:
N = 3
M = 1000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 146 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose(ll)
# 222 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 147 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit to_arrays_transpose_split(ll)
# 221 µs ± 2.58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension(ll)
# 261 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit to_arrays_comprehension2(ll)
# 212 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The podium becomes:
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_comprehension2()
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
For even larger inputs:
N = 3
M = 1000000
ll = [tuple(range(N)) for _ in range(M)]
print('Returning `np.ndarray()`')
%timeit to_arrays_zip(ll)
# 215 ms ± 4.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose(ll)
# 220 ms ± 4.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print('Returning a list')
%timeit to_arrays_zip_split(ll)
# 218 ms ± 6.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_transpose_split(ll)
# 222 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension(ll)
# 248 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit to_arrays_comprehension2(ll)
# 186 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The podium becomes:
to_arrays_comprehension2()
to_arrays_zip_split() (whether you use the _split or non-_split variant does not make much difference)
to_arrays_transpose_split() (whether you use the _split or non-_split variant does not make much difference)
and the _zip and _transpose variants are pretty close to each other.
(I also tried to speed things up with Numba, but that didn't go well.)

Searching large array by two columns

I have a large array that looks something like this:
np.random.seed(42)
arr = np.random.permutation(np.array([
    (1,1,2,2,2,2,3,3,4,4,4),
    (8,9,3,4,7,9,1,9,3,4,50000)
]).T)
It isn't sorted, the rows of this array are unique, and I also know the bounds for the values in both columns; they are [0, n] and [0, k]. So the maximum possible size of the array is (n+1)*(k+1), but the actual size is closer to the log of that.
I need to search the array by both columns to find the row such that arr[row,:] == (i,j), and return -1 when (i,j) is absent from the array. The naive implementation of such a function is:
def get(arr, i, j):
    cond = (arr[:,0] == i) & (arr[:,1] == j)
    if np.any(cond):
        return np.where(cond)[0][0]
    else:
        return -1
Unfortunately, since in my case arr is very large (>90M rows), this is very inefficient, especially since I would need to call get() multiple times.
Alternatively I tried translating this to a dict with (i,j) keys, such that
index[(i,j)] = row
that can be accessed by:
def get(index, i, j):
    try:
        return index[(i,j)]
    except KeyError:
        return -1
This works (and is much faster when tested on smaller data than I have), but again, creating the dict on-the-fly by
index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    index[(i,j)] = row
takes a huge amount of time and eats lots of RAM in my case. I was also thinking of first sorting arr and then using something like np.searchsorted, but this didn't lead me anywhere.
So what I need is a fast function get(arr, i, j) that returns
>>> get(arr, 2, 3)
4
>>> get(arr, 4, 100)
-1
A partial solution would be:
In [36]: arr
Out[36]:
array([[ 2, 9],
[ 1, 8],
[ 4, 4],
[ 4, 50000],
[ 2, 3],
[ 1, 9],
[ 4, 3],
[ 2, 7],
[ 3, 9],
[ 2, 4],
[ 3, 1]])
In [37]: (i,j) = (2, 3)
# we can use `assume_unique=True` which can speed up the calculation
In [38]: np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
Out[38]:
array([[False],
[False],
[False],
[False],
[ True],
[False],
[False],
[False],
[False],
[False],
[False]])
# we can use `assume_unique=True` which can speed up the calculation
In [39]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1, keepdims=True)
In [40]: np.argwhere(mask)
Out[40]: array([[4, 0]])
If you need the final result as a scalar, then don't use keepdims argument and cast the array to a scalar like:
# we can use `assume_unique=True` which can speed up the calculation
In [41]: mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
In [42]: np.argwhere(mask)
Out[42]: array([[4]])
In [43]: np.asscalar(np.argwhere(mask))
Out[43]: 4
Solution
Python offers a set type to store unique values, but sadly no ordered version of a set. But you can use the ordered-set package.
Create an OrderedSet from the data. Fortunately, this only needs to be done once:
import ordered_set

o = ordered_set.OrderedSet(map(tuple, arr))

def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1
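With the example arr from the question this reproduces the expected lookups (a quick check):

print(ordered_get(o, 2, 3))    # 4
print(ordered_get(o, 4, 100))  # -1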
Runtime
Finding the index of a value should be O(1), according to the documentation:
In [46]: %timeit get(arr, 2, 3)
10.6 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [47]: %timeit ordered_get(o, 2, 3)
1.16 µs ± 14.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [48]: %timeit ordered_get(o, 2, 300)
1.05 µs ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Testing this for a much larger array:
a2 = np.random.randint(10000, size=1000000).reshape(-1,2)
o2 = ordered_set.OrderedSet()
for t in map(tuple, a2):
    o2.add(t)
In [65]: %timeit get(a2, 2, 3)
1.05 ms ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [66]: %timeit ordered_get(o2, 2, 3)
1.03 µs ± 2.12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [67]: %timeit ordered_get(o2, 2, 30000)
1.06 µs ± 28.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Looks like it indeed is O(1) runtime.
def get_agn(arr, i, j):
    idx = np.flatnonzero((arr[:,0] == i) & (arr[:,1] == j))
    return -1 if idx.size == 0 else idx[0]
Also, just in case you are considering the ordered_set solution, here is a better one (in both cases, though, see the timing tests below):
d = { (i, j): k for k, (i, j) in enumerate(arr)}

def unordered_get(d, i, j):
    return d.get((i, j), -1)
and it's "full" equivalent (that builds the dictionary inside the function):
def unordered_get_full(arr, i, j):
    d = { (i, j): k for k, (i, j) in enumerate(arr)}
    return d.get((i, j), -1)
Timing tests:
First, define @kmario23's function:
def get_kmario23(arr, i, j):
    # fundamentally, kmario23's code re-arranged to return scalars
    # and -1 when (i, j) is not found:
    mask = np.all(np.isin(arr, [i,j], assume_unique=True), axis=1)
    idx = np.argwhere(mask)[0]
    return -1 if idx.size == 0 else np.asscalar(idx[0])
Second, define @ChristophTerasa's functions (the original and the full version):
import ordered_set

o = ordered_set.OrderedSet(map(tuple, arr))

def ordered_get(o, i, j):
    try:
        return o.index((i,j))
    except KeyError:
        return -1

def ordered_get_full(arr, i, j):
    # "Full" version that builds the ordered set inside the function
    o = ordered_set.OrderedSet(map(tuple, arr))
    try:
        return o.index((i,j))
    except KeyError:
        return -1
Generate some large data:
arr = np.random.randint(1, 2000, 200000).reshape((-1, 2))
Timing results:
In [55]: %timeit get_agn(arr, *arr[-1])
149 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [56]: %timeit get_kmario23(arr, *arr[-1])
1.42 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [57]: %timeit get_kmario23(arr, *arr[0])
1.2 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Ordered set tests:
In [80]: o = ordered_set.OrderedSet(map(tuple, arr))
In [81]: %timeit ordered_get(o, *arr[-1])
1.74 µs ± 32.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [82]: %timeit ordered_get_full(arr, *arr[-1]) # include ordered set creation time
166 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unordered dictionary tests:
In [83]: d = { (i, j): k for k, (i, j) in enumerate(arr)}
In [84]: %timeit unordered_get(d, *arr[-1])
1.18 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [85]: %timeit unordered_get_full(arr, *arr[-1])
102 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, when taking into account the time needed to create either ordered set or unordered dictionary, these methods are quite slow. You must plan running several hundred searches on the same data for these methods to make sense. Even then, there is no need to use ordered_set package - regular dictionaries are faster.
It seems I was over-thinking this problem; there is an easy solution. I was considering either filtering and subsetting the array, or using a dict with index[(i,j)] = row. Filtering and subsetting was slow (O(n) per search), while using a dict was fast (O(1) access time), but creating the dict was slow and memory intensive.
The simple solution for this problem is using nested dicts.
index = {}
for row in range(arr.shape[0]):
    i,j = arr[row, :]
    try:
        index[i][j] = row
    except KeyError:
        index[i] = {}
        index[i][j] = row

def get(index, i, j):
    try:
        return index[i][j]
    except KeyError:
        return -1
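With the index built from the example arr, this matches the behaviour asked for in the question (a quick check):

print(get(index, 2, 3))    # 4
print(get(index, 4, 100))  # -1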
Alternatively, instead of a plain dict on the higher level, I could use index = defaultdict(dict), which would allow assigning index[i][j] = row directly, without the try ... except. But then the defaultdict(dict) object would create an empty {} whenever queried for a nonexistent i by the get(index, i, j) function, expanding the index unnecessarily.
The access time is O(1) for the first dict and O(1) for the nested dicts, so basically it's O(1). The upper level dict has manageable size (bounded by n < n*k), while the nested dicts are small (the nesting order is chosen based on the fact that in my case k << n). Building the nested dict is also very fast, even for >90M rows in the array. Moreover, it can be easily extended to more complicated cases.
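A minimal sketch of that defaultdict variant (the get_dd helper is hypothetical; the lookup checks membership explicitly so the defaultdict is not expanded on misses):

from collections import defaultdict

index = defaultdict(dict)
for row in range(arr.shape[0]):
    i, j = arr[row, :]
    index[i][j] = row

def get_dd(index, i, j):
    # avoid indexing a missing i, which would auto-create an empty {}
    if i in index and j in index[i]:
        return index[i][j]
    return -1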
