Python: faster operation for indexing

I have the following snippet that extracts the indices of all unique (hashable) values in sequence-like data with canonical indexing and stores them in a dictionary as lists:
from collections import defaultdict

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
This looks to me like a fairly common use case, and it turns out that 90% of my code's execution time is spent in these few lines. This part is executed over 10,000 times per run, len(data) is around 50,000 to 100,000 each time, and the number of unique elements is roughly 50 to 150.
Is there a faster way, perhaps vectorized/C-extended (e.g. numpy or pandas methods), that achieves the same thing?
Many many thanks.

Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:
import numpy as np, pandas as pd
from collections import defaultdict

def by_dd(data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(data):
    return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

def by_pand2(data):
    return data.groupby(data.values).indices

data = pd.Series(np.random.randint(0, 100, size=10**5))
gives me
>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
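For what it's worth (my addition, not part of the original answer), here is a quick sanity check that the pandas result matches the defaultdict version: groupby(data.values).indices maps each unique value to a NumPy array of positional indices.
# Sanity-check sketch, using the data Series and functions defined above.
expected = by_dd(data)   # dict of Python lists
got = by_pand2(data)     # dict of NumPy index arrays
assert set(expected) == set(got)
assert all(got[k].tolist() == v for k, v in expected.items())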

Though it's not a perfect solution (it's O(N log N) instead of O(N)), a much faster, vectorized way to do it is:
import numpy as np

def data_to_idxlists(data):
    sorting_ixs = np.argsort(data)
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
Another solution that is O(N*U) (where U is the number of unique groups):
def data_to_idxlists(data):
    u, ixs = np.unique(data, return_inverse=True)
    return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}
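A small usage note (my addition): both variants return NumPy index arrays rather than plain Python lists, and the second one returns the 1-tuple that np.nonzero produces. If you need exact parity with the original defaultdict-of-lists output, a conversion sketch:
# Sketch: convert the O(N*U) variant's output back to a plain dict of lists.
# np.nonzero returns a tuple of arrays, so take element [0] before .tolist().
idx_arrays = data_to_idxlists(np.asarray(data))
idx_lists = {k: v[0].tolist() for k, v in idx_arrays.items()}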

I found this question pretty interesting, and while I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was slightly faster than them.
import numpy as np
import pandas as pd
from collections import defaultdict

data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)

def get_values_and_indices(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort()  # Get the sorted indices
    # Get the sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # Find the locations where the values change, including the first and last positions
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # Get the unique values themselves
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)
    # Return each unique value along with the indices associated with that value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist() for i in range(num_values)}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(input_data):
    return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

def by_pand2(input_data):
    return series.groupby(input_data).indices

def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}

def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
The resulting timings were (from fastest to slowest):
>>> %timeit get_values_and_indices(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
and it should be noted that, unlike by_pand2, it returns a dict of lists as in the example. If you would prefer a defaultdict, you can simply change the last line to return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values))), which increased the overall timing in my tests to 4.4 ms.
Lastly, I should note that these timings are data-sensitive. When I used only 10 different values I got:
get_values_and_indices: 4.34 ms per loop
data_to_idxlists_unique: 4.42 ms per loop
by_pand2: 4.83 ms per loop
data_to_idxlists: 6.09 ms per loop
by_pand1: 9.39 ms per loop
by_dd: 22.4 ms per loop
while if I used 10,000 different values I got:
get_values_and_indices: 7.00 ms per loop
data_to_idxlists_unique: 14.8 ms per loop
by_dd: 29.8 ms per loop
by_pand2: 47.7 ms per loop
by_pand1: 67.3 ms per loop
data_to_idxlists: 869 ms per loop

Related

Is there a faster method than np.isin and np.where for large arrays?

I have a 1xN array A and a 2xM array B. I want to make two new 1xN arrays:
a boolean one that checks, for each element of A, whether it appears in the first row of B
another whose entry i is the corresponding value B[1,j] (where B[0,j] == A[i]) if A[i] appears in the first row of B, and np.nan otherwise
Whatever method I use needs to be super fast as it’ll be called a lot. I can do the first part using this: Is there method faster than np.isin for large array?
But I’m stumped on a good way to do the second part. Here’s what I’ve got so far (adapting the code in the post above):
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def isinvals(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = list_vals.index(arr[i])  ## THIS LINE IS WAY TOO SLOW
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
N = int(1e5)
M = int(20e3)
num_arr = 100e3
num_vals = 20e3
num_types = 6
arr = np.random.randint(0, num_arr, N)
vals_col1 = np.random.randint(0, num_vals, M)
vals_col2 = np.random.randint(0, num_types, M)
vals = np.array([vals_col1, vals_col2])
%timeit result, result_vals = isinvals(arr,vals)
46.4 ms ± 3.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The line I've marked above (list_vals.index(arr[i])) is the slow part. If I don't use that I can make a super fast version:
@nb.njit(parallel=True)
def isinvals_cheating(arr, vals):
    n = len(arr)
    result = np.full(n, False)
    result_vals = np.full(n, np.nan)
    set_vals = set(vals[0, :])
    list_vals = list(vals[0, :])
    for i in nb.prange(n):
        if arr[i] in set_vals:
            ind = 0  ## TEMPORARILY SETTING TO 0 TO INDICATE SPEED DIFFERENCE
            result[i] = True
            result_vals[i] = vals[1, ind]
    return result, result_vals
%timeit result, result_vals = isinvals_cheating(arr,vals)
1.13 ms ± 59.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
i.e. that single line is making it 40 times slower.
Any ideas? I've also tried using np.where() but it's even slower.
I'm assuming the OP's solution gives the desired result, since the question is ambiguous for non-unique values in vals[0, idx] that have different corresponding values in vals[1, idx]. A lookup table is faster, but it needs len(arr) additional space.
@nb.njit  # tested with numba 0.55.1
def isin_nb(arr, vals):
    lookup = np.empty(len(arr), np.float32)
    lookup.fill(np.nan)
    lookup[vals[0, ::-1]] = vals[1, ::-1]
    res_val = lookup[arr]
    return ~np.isnan(res_val), res_val
With the example data used in the question:
res, res_val = isin_nb(arr, vals)
# %timeit 1000 loops, best of 5: 294 µs per loop
Asserting equal results
np.testing.assert_equal(res, result)
np.testing.assert_equal(res_val, result_vals)
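For completeness, here is a pure-NumPy alternative (my addition, not part of the original answer) based on np.searchsorted. It avoids both numba and the len(arr)-sized lookup table, at the cost of an O((N+M) log M) sort-and-search; like the lookup table, the first occurrence of a duplicated key in vals[0] wins.
import numpy as np

def isin_searchsorted(arr, vals):
    # Sort the keys once; a stable sort keeps the first occurrence of each
    # duplicated key ahead of its later occurrences.
    order = np.argsort(vals[0], kind="stable")
    keys = vals[0][order]
    mapped = vals[1][order]
    # Position of each arr element among the sorted keys (side='left' by default).
    pos = np.searchsorted(keys, arr)
    pos = np.minimum(pos, len(keys) - 1)
    found = keys[pos] == arr
    res_val = np.full(arr.shape, np.nan)
    res_val[found] = mapped[pos[found]]
    return found, res_val

res_ss, res_val_ss = isin_searchsorted(arr, vals)   # same outputs as isin_nb above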

Python dict: does the size affect timing?

Say you have one key in dictionary A vs. one billion keys in dictionary B.
Algorithmically, a lookup is O(1).
However, would the actual time (program execution time) to look up a key differ based on the size of the dict?
onekey_stime = time.time()
print one_key_dict.get('firstkey')
onekey_dur = time.time() - onekey_stime

manykeys_stime = time.time()
print manykeys_dict.get('randomkey')
manykeys_dur = time.time() - manykeys_stime

Would I see any time difference between onekey_dur and manykeys_dur?
Pretty much identical in a test with a small and large dict:
In [31]: random_key = lambda: ''.join(np.random.choice(list(string.ascii_letters), 20))
In [32]: few_keys = {random_key(): np.random.random() for _ in xrange(100)}
In [33]: many_keys = {random_key(): np.random.random() for _ in xrange(1000000)}
In [34]: few_lookups = np.random.choice(few_keys.keys(), 50)
In [35]: many_lookups = np.random.choice(many_keys.keys(), 50)
In [36]: %timeit [few_keys[k] for k in few_lookups]
100000 loops, best of 3: 6.25 µs per loop
In [37]: %timeit [many_keys[k] for k in many_lookups]
100000 loops, best of 3: 7.01 µs per loop
EDIT: For you, @ShadowRanger -- missed lookups are pretty close too:
In [38]: %timeit [few_keys.get(k) for k in many_lookups]
100000 loops, best of 3: 7.99 µs per loop
In [39]: %timeit [many_keys.get(k) for k in few_lookups]
100000 loops, best of 3: 8.78 µs per loop
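(My addition.) The snippets above are Python 2 (xrange, passing dict.keys() straight to np.random.choice). A minimal Python 3 re-run sketch, with the large dict scaled down to 100,000 keys so it builds quickly; only the relative timings matter:
import string
import timeit
import numpy as np

random_key = lambda: ''.join(np.random.choice(list(string.ascii_letters), 20))
few_keys = {random_key(): np.random.random() for _ in range(100)}
many_keys = {random_key(): np.random.random() for _ in range(100000)}
few_lookups = np.random.choice(list(few_keys), 50)
many_lookups = np.random.choice(list(many_keys), 50)

# Hit lookups in the small and the large dict; expect near-identical timings.
print(timeit.timeit(lambda: [few_keys[k] for k in few_lookups], number=10000))
print(timeit.timeit(lambda: [many_keys[k] for k in many_lookups], number=10000))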

Optimizing Fourier transformed signal length

I recently stumbled on an interesting problem when computing the Fourier transform of a signal with np.fft.fft. The reproduced problem is:
%timeit np.fft.fft(np.random.rand(59601))
1 loops, best of 3: 1.34 s per loop
I found the amount of time unexpectedly long. For instance, let's look at some other FFTs with slightly longer/shorter signals:
%timeit np.fft.fft(np.random.rand(59600))
100 loops, best of 3: 6.18 ms per loop
%timeit np.fft.fft(np.random.rand(59602))
10 loops, best of 3: 61.3 ms per loop
%timeit np.fft.fft(np.random.rand(59603))
10 loops, best of 3: 113 ms per loop
%timeit np.fft.fft(np.random.rand(59604))
1 loops, best of 3: 182 ms per loop
%timeit np.fft.fft(np.random.rand(59605))
100 loops, best of 3: 6.53 ms per loop
%timeit np.fft.fft(np.random.rand(59606))
1 loops, best of 3: 2.17 s per loop
%timeit np.fft.fft(np.random.rand(59607))
100 loops, best of 3: 8.14 ms per loop
We can observe that the times are now in milliseconds, except for np.random.rand(59606), which takes 2.17 s.
Note, the numpy documentation states:
FFT (Fast Fourier Transform) refers to a way the discrete Fourier Transform (DFT) can be calculated efficiently, by using symmetries in the calculated terms. The symmetry is highest when n is a power of 2, and the transform is therefore most efficient for these sizes.
However, these vectors do not have power-of-2 lengths. Could someone explain how to avoid/predict the cases in which computation times are considerably higher?
As some comments have pointed out, the prime factorization of the length lets you predict the time needed to calculate the FFT. The following graph shows your results; note the logarithmic scale.
This image is generated with the following code:
import numpy as np
import matplotlib.pyplot as plt

def prime_factors(n):
    """Returns all the prime factors of a positive integer."""
    # from http://stackoverflow.com/questions/23287/largest-prime-factor-of-a-number/412942#412942
    factors = []
    d = 2
    while n > 1:
        while n % d == 0:
            factors.append(d)
            n //= d  # integer division keeps n an int
        d = d + 1
    return factors

times = []
decomp = []
for i in range(59600, 59613):
    print(i)
    t = %timeit -o np.fft.fft(np.random.rand(i))
    times.append(t.best)
    decomp.append(max(prime_factors(i)))

plt.loglog(decomp, times, 'o')
plt.ylabel("best time")
plt.xlabel("largest prime in prime factor decomposition")
plt.title("FFT timings")
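To the "how to avoid" part of the question (my addition, not from the original answer): a common workaround is to zero-pad the signal up to the next length that factors into small primes before transforming, for example with SciPy's next_fast_len. Note that zero-padding changes the frequency sampling of the spectrum, so whether this is acceptable depends on the application. A minimal sketch, assuming SciPy is available:
import numpy as np
from scipy.fft import fft, next_fast_len

x = np.random.rand(59601)
n_fast = next_fast_len(len(x))   # smallest "fast" length >= 59601 (small prime factors only)
X = fft(x, n=n_fast)             # fft zero-pads x up to n_fast internally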

Counting elements matching a pattern in a tuple of tuples

I have a matrix m in which I would like to count the number of zeros.
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
My current code is as follows:
def zeroCount(M):
    # The list of lists is flattened to form a single list, and the number of 0s is counted
    return [item for row in M for item in row].count(0)
Is there any way to do this more quickly? Currently it takes 0.4 s to execute the function 20,000 times on 4x4 matrices that are as likely to contain zeros as not.
Some possible places to start (but which I could not make work faster than my code) are these other questions: counting non-zero elements in numpy array, finding the indices of non-zero elements, and counting non-zero elements in an iterable.
The fastest so far:
def count_zeros(matrix):
    total = 0
    for row in matrix:
        total += row.count(0)
    return total
For a 2D tuple you could use a generator expression:
def count_zeros_gen(matrix):
    return sum(row.count(0) for row in matrix)
Time comparison:
%timeit [item for row in m for item in row].count(0)  # OP
1000000 loops, best of 3: 1.15 µs per loop
%timeit len([item for row in m for item in row if item == 0])  # @thefourtheye
1000000 loops, best of 3: 913 ns per loop
%timeit sum(row.count(0) for row in m)
1000000 loops, best of 3: 1 µs per loop
%timeit count_zeros(m)
1000000 loops, best of 3: 775 ns per loop
For the baseline:
def f(m): pass
%timeit f(m)
10000000 loops, best of 3: 110 ns per loop
Here is my answer.
reduce(lambda a, b: a + b, m).count(0)
Time:
%timeit count_zeros(m)  # @J.F. Sebastian
1000000 loops, best of 3: 813 ns per loop
%timeit len([item for row in m for item in row if item == 0])  # @thefourtheye
1000000 loops, best of 3: 974 ns per loop
%timeit reduce(lambda a, b: a + b, m).count(0)  # mine
1000000 loops, best of 3: 1.02 us per loop
%timeit countzeros(m)  # @frostnational
1000000 loops, best of 3: 1.07 us per loop
%timeit sum(row.count(0) for row in m)  # @J.F. Sebastian
1000000 loops, best of 3: 1.28 us per loop
%timeit [item for row in m for item in row].count(0)  # OP
1000000 loops, best of 3: 1.53 us per loop
@thefourtheye's is the fastest. This is because of fewer function calls.
@J.F. Sebastian's is the fastest in my environment. I don't know why...
The problem with your solution is that you have to iterate the list again to get the count, which is O(N), whereas the len function gets the count in O(1).
You can make this a lot quicker with this:
def zeroCount(M):
    return len([item for row in M for item in row if item == 0])
Check this out:
from itertools import chain, filterfalse  # ifilterfalse for Python 2

def zeroCount(m):
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total
Performance tests on Python 3.3.3:
from timeit import timeit
from itertools import chain, filterfalse
import functools

m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))

def zeroCountOP():
    return [item for row in m for item in row].count(0)

def zeroCountTFE():
    return len([item for row in m for item in row if item == 0])

def zeroCountJFS():
    return sum(row.count(0) for row in m)

def zeroCountuser2931409():
    # `reduce` is in `functools` in Py3k
    return functools.reduce(lambda a, b: a + b, m).count(0)

def zeroCount():
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total

print('Original code   ', timeit(zeroCountOP, number=100000))
print('@J.F. Sebastian ', timeit(zeroCountJFS, number=100000))
print('@thefourtheye   ', timeit(zeroCountTFE, number=100000))
print('@user2931409    ', timeit(zeroCountuser2931409, number=100000))
print('@frostnational  ', timeit(zeroCount, number=100000))
The above gives me these results:
Original code    0.244224319984056
@thefourtheye    0.22169152169497108
@user2931409     0.19247795242092186
@frostnational   0.18846473728790825
@J.F. Sebastian  0.1439318853410907
@J.F. Sebastian's solution is the winner; mine is the runner-up (about 20% slower).
Comprehensive solution for both Python 2 and Python 3:
import sys
import itertools

if sys.version_info < (3, 0, 0):
    filterfalse = getattr(itertools, 'ifilterfalse')
else:
    filterfalse = getattr(itertools, 'filterfalse')

def countzeros(matrix):
    '''Make good use of `itertools.filterfalse`
    (`itertools.ifilterfalse` in Python 2) to count
    all 0s in `matrix`.'''
    counter = 0
    for _ in filterfalse(bool, itertools.chain(*matrix)):
        counter += 1
    return counter

if __name__ == '__main__':
    # Benchmark
    from timeit import repeat
    print(repeat('countzeros(((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0)))',
                 'from __main__ import countzeros',
                 repeat=10,
                 number=100000))
Use numpy:
import numpy
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
numpy_m = numpy.array(m)
print numpy.sum(numpy_m == 0)
How does the above work? First, your "matrix" is converted to a numpy array (numpy.array(m)). Then, each entry is checked for equality with zero (numpy_m == 0). This yields a binary array. Summing over this binary array gives the number of zero elements in the original array.
Note that numpy will clearly be more efficient for larger matrices. 4x4 might be too small to see a large performance difference vs. ordinary Python code, especially if you are initializing a Python "matrix" of tuples as above.
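(My addition.) If you do go the numpy route, np.count_nonzero is a slightly more direct way to count the matches than summing the boolean array:
import numpy as np

m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
numpy_m = np.array(m)
print(np.count_nonzero(numpy_m == 0))   # counts the True entries, i.e. the zeros -> 4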
One numpy solution is:
import numpy as np
m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
mm = np.array(m)
def zeroCountSmci():
    return (mm == 0).sum()  # sums across all axes, by default

Optimizing access on numpy arrays for numba

I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant auto-jitted Python code. Unfortunately I wasn't happy when I tried a first, quick benchmark. It seems that numba is not doing much better than ordinary Python here, though I would have expected nearly C-like performance:
from numba import jit, autojit, uint, int_, double
import numpy as np
import imp
import logging
logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)

def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res

autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]),
                      locals=dict(i=uint))(sum_accum)

accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)

ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))

%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)

accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))
%timeit accumarray.accum(accmap, a)
This gives on my machine:
10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop
I'm running the latest numba version from PyPI, 0.11.0. Any suggestions on how to fix the code so that it runs reasonably fast with numba?
I figured it out myself. numba wasn't able to determine the type of the result of np.max(accmap), even when the type of accmap was set to int. This somehow slowed everything down, but the fix is easy:
@autojit(locals=dict(reslen=uint))
def sum_accum(accmap, a):
    reslen = np.max(accmap) + 1
    res = np.zeros(reslen, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res
The result is quite impressive, about 2/3 of the C version:
10000 loops, best of 3: 192 us per loop
Update 2022:
The work on this issue led to the python package numpy_groupies, which is available here:
https://github.com/ml31415/numpy-groupies
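A usage sketch for numpy_groupies (my addition, based on the package's documented aggregate entry point; check the README at the link above for the current API):
import numpy as np
import numpy_groupies as npg

accmap = np.repeat(np.arange(1000), 20)
a = np.random.randn(accmap.size)

# Group-wise sum, equivalent to the sum_accum loop above.
res = npg.aggregate(accmap, a, func='sum')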
@autojit
def numbaMax(arr):
    MAX = arr[0]
    for i in arr:
        if i > MAX:
            MAX = i
    return MAX

@autojit
def autonumba_sum_accum2(accmap, a):
    res = np.zeros(numbaMax(accmap) + 1)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
10 loops, best of 3: 26.5 ms per loop <- original
100 loops, best of 3: 15.1 ms per loop <- with numba but the slow numpy max
10000 loops, best of 3: 47.9 µs per loop <- with numbamax
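For readers on a current numba release (my addition): np.max on an array is supported in nopython mode these days, so the whole function can simply be compiled with njit and the manual max loop is no longer necessary. A minimal sketch, assuming float64 input for a:
import numpy as np
from numba import njit

@njit
def sum_accum_njit(accmap, a):
    # np.max compiles in nopython mode, so no separate numbaMax helper is needed.
    res = np.zeros(np.max(accmap) + 1)   # float64 result, matching a = np.random.randn(...)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res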
