I have a matrix m where I would like to calculate the number of zeros.
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
My current code is as follows:
def zeroCount(M):
    # flatten the list of lists into a single list, then count the zeros
    return [item for row in M for item in row].count(0)
Is there any way to do this more quickly? Currently it takes me 0.4 s to execute the function 20,000 times on 4 by 4 matrices, where a matrix is as likely to contain zeros as not.
Some possible places to start (but which I could not make to work faster than my code) are these other questions: counting non-zero elements in numpy array, finding the indices of non-zero elements, and counting non-zero elements in iterable.
The fastest so far:
def count_zeros(matrix):
    total = 0
    for row in matrix:
        total += row.count(0)
    return total
For a 2D tuple you could use a generator expression:
def count_zeros_gen(matrix):
    return sum(row.count(0) for row in matrix)
Time comparison:
%timeit [item for row in m for item in row].count(0) # OP
1000000 loops, best of 3: 1.15 µs per loop
%timeit len([item for row in m for item in row if item == 0])  # @thefourtheye
1000000 loops, best of 3: 913 ns per loop
%timeit sum(row.count(0) for row in m)
1000000 loops, best of 3: 1 µs per loop
%timeit count_zeros(m)
1000000 loops, best of 3: 775 ns per loop
For the baseline:
def f(m): pass
%timeit f(m)
10000000 loops, best of 3: 110 ns per loop
Here is my answer.
reduce(lambda a, b: a + b, m).count(0)
Time:
%timeit count_zeros(m)  # @J.F. Sebastian
1000000 loops, best of 3: 813 ns per loop
%timeit len([item for row in m for item in row if item == 0])  # @thefourtheye
1000000 loops, best of 3: 974 ns per loop
%timeit reduce(lambda a, b: a + b, m).count(0)  # mine
1000000 loops, best of 3: 1.02 us per loop
%timeit countzeros(m)  # @frostnational
1000000 loops, best of 3: 1.07 us per loop
%timeit sum(row.count(0) for row in m)  # @J.F. Sebastian
1000000 loops, best of 3: 1.28 us per loop
%timeit [item for row in m for item in row].count(0)  # OP
1000000 loops, best of 3: 1.53 us per loop
@thefourtheye's was the fastest at first, presumably because it makes fewer function calls.
@J.F. Sebastian's is now the fastest in my environment. I don't know why...
The problem with your solution is that count has to iterate the flattened list a second time to get the count, which is O(N), whereas the len function can report the length in O(1).
You can make this a lot quicker with this
def zeroCount(M):
    return len([item for row in M for item in row if item == 0])
Check this out:
from itertools import chain, filterfalse  # ifilterfalse for Python 2

def zeroCount(m):
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total
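Side note: filterfalse(bool, ...) counts every falsy element, not just integer zeros, so it only agrees with the other approaches when 0 is the only falsy value that can appear. A minimal sketch with a made-up matrix:

from itertools import chain, filterfalse

m2 = ((0, '', None), (0.0, 1, 2))                       # hypothetical matrix with other falsy values
print(sum(1 for _ in filterfalse(bool, chain(*m2))))    # -> 4 (counts '', None and 0.0 as well)
print([item for row in m2 for item in row].count(0))    # -> 2 (0 == 0.0, but '' and None are not counted)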
Performance tests on Python 3.3.3:
from timeit import timeit
from itertools import chain, filterfalse
import functools

m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))

def zeroCountOP():
    return [item for row in m for item in row].count(0)

def zeroCountTFE():
    return len([item for row in m for item in row if item == 0])

def zeroCountJFS():
    return sum(row.count(0) for row in m)

def zeroCountuser2931409():
    # `reduce` is in `functools` in Py3k
    return functools.reduce(lambda a, b: a + b, m).count(0)

def zeroCount():
    total = 0
    for x in filterfalse(bool, chain(*m)):
        total += 1
    return total

print('Original code  ', timeit(zeroCountOP, number=100000))
print('#J.F.Sebastian ', timeit(zeroCountJFS, number=100000))
print('#thefourtheye  ', timeit(zeroCountTFE, number=100000))
print('#user2931409   ', timeit(zeroCountuser2931409, number=100000))
print('#frostnational ', timeit(zeroCount, number=100000))
The above gives me these results:
Original code 0.244224319984056
#thefourtheye 0.22169152169497108
#user2931409 0.19247795242092186
#frostnational 0.18846473728790825
#J.F.Sebastian 0.1439318853410907
#J.F.Sebastian's solution is the winner; mine is the runner-up (about 30% slower).
Comprehensive solution for both Python 2 and Python 3:
import sys
import itertools

if sys.version_info < (3, 0, 0):
    filterfalse = getattr(itertools, 'ifilterfalse')
else:
    filterfalse = getattr(itertools, 'filterfalse')

def countzeros(matrix):
    '''Make good use of `itertools.filterfalse`
    (`itertools.ifilterfalse` in Python 2) to count
    all 0s in `matrix`.'''
    counter = 0
    for _ in filterfalse(bool, itertools.chain(*matrix)):
        counter += 1
    return counter

if __name__ == '__main__':
    # Benchmark
    from timeit import repeat
    print(repeat('countzeros(((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0)))',
                 'from __main__ import countzeros',
                 repeat=10,
                 number=100000))
Use numpy:
import numpy
m=((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
numpy_m = numpy.array(m)
print numpy.sum(numpy_m == 0)
How does the above work? First, your "matrix" is converted to a numpy array (numpy.array(m)). Then, each entry is checked for equality with zero (numpy_m == 0). This yields a binary array. Summing over this binary array gives the number of zero elements in the original array.
Note that numpy will clearly be more efficient for larger matrices; 4x4 may be too small to see a large performance difference versus ordinary Python code, especially if you are starting from a Python "matrix" (tuple of tuples) like the one above.
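For illustration, a minimal sketch of the intermediate boolean mask described above (np.count_nonzero on the mask is an equivalent way to sum it):

import numpy as np

m = ((2, 0, 2, 2), (4, 4, 5, 4), (0, 9, 4, 8), (2, 2, 0, 0))
numpy_m = np.array(m)
mask = (numpy_m == 0)                    # boolean array, True where an entry is zero
print(mask.sum())                        # -> 4
print(np.count_nonzero(numpy_m == 0))    # -> 4, equivalent count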
One numpy solution is:
import numpy as np
m = ((2,0,2,2),(4,4,5,4),(0,9,4,8),(2,2,0,0))
mm = np.array(m)
def zeroCountSmci():
    return (mm == 0).sum()  # sums across all axes, by default
Here is an example:
There are 4 digits.
The first and second digits range over 0 ~ 5 (six values each).
The third and fourth digits range over 0 ~ 4 (five values each).
So 0000, 0040, 0111 and 4455 are OK, but 5555, 4555 and 4466 are not.
What I want is to find the ordinal of 2345 (counting from a zero-based start).
For example, 0001 is "1" in ordinal. Likewise, 0010 is "5".
It can be calculated as
(5*6*6*1)*2 + (6*6*1)*3 + (6*1)*4 + (1)*5 = 497
I made a function in Python:
import numpy as np

def find_real_index_of_state(state, num_cnt_in_each_digit):
    """
    Parameters
    ==========
    state (str)
    num_cnt_in_each_digit (list): the number of possible values in each digit
    """
    num_of_digit = len(state)
    digit_list = [int(i) for i in state]
    num_cnt_in_each_digit.append(1)

    real_index = 0
    for i in range(num_of_digit):
        real_index += np.product(num_cnt_in_each_digit[num_of_digit-i:]) * digit_list[num_of_digit-i-1]
    return real_index

find_real_index_of_state("2345", [5,5,6,6])
Its result is 497, the same as the manual calculation.
The problem, though, is that this function is really slow. I need a much faster version, but this is the best I can come up with.
I really need your advice to improve its performance (e.g. vectorization).
Thanks
Hope I understood you correctly.
The first thing I notice is that you do not need to recalculate everything on each iteration: you compute (5*6*6*1), (6*6*1), (6*1), (1) individually, when each factor can be obtained from the previous one with a single multiplication.
def find_real_index_of_state(state, num_cnt_in_each_digit):
    factor = 1
    total = 0
    for digit, num_cnt in zip(reversed(state), reversed(num_cnt_in_each_digit)):
        digit = int(digit)
        total += digit * factor
        factor *= num_cnt
    return total
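A quick sanity check (not part of the original answer) with the question's inputs gives the same value:

>>> find_real_index_of_state("2345", [5, 5, 6, 6])
497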
Here's one vectorized approach making use of np.cumprod to perform the iterative np.product and then np.dot for the sum-reductions -
def real_index_vectorized(n, count):
    num = [int(d) for d in str(n)]
    # Or np.array([n]).view((str,1)).astype(int)  # Thanks to @Eric
    # Or (int(n)//(10**np.arange(len(n)-1,-1,-1)))%10
    return np.dot(np.cumprod(count[:0:-1]), num[-2::-1]) + num[-1]
Runtime test -
1) Original sample :
In [66]: %timeit find_real_index_of_state("2345",[5,5,6,6])
100000 loops, best of 3: 14.1 µs per loop
In [67]: %timeit real_index_vectorized("2345",[5,5,6,6])
100000 loops, best of 3: 8.19 µs per loop
2) A bit bigger sample :
In [69]: %timeit find_real_index_of_state("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
10000 loops, best of 3: 52.7 µs per loop
In [70]: %timeit real_index_vectorized("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
100000 loops, best of 3: 12.5 µs per loop
Being a vectorized solution, it would scale well when it competes against a loopy version that has a good number of loop iterations.
For performance, I propose you vectorize your states first:
base=np.array([5*6*6,6*6,6,1])
states=np.array(["2345","0010"])
numbers=np.frombuffer(states,np.uint32).reshape(-1,4)-48 # faster
ordinals=(base*numbers).sum(1)
#array([497, 6], dtype=int64)
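For anyone wondering how the frombuffer trick works, here is a small sketch (my illustration, assuming the usual layout where NumPy unicode strings are stored as 4-byte code points):

import numpy as np

states = np.array(["2345", "0010"])        # dtype '<U4': four code points per string
codes = np.frombuffer(states, np.uint32)   # raw code points: 50, 51, 52, 53, 48, 48, 49, 48
digits = codes.reshape(-1, 4) - 48         # '0' has code point 48, so subtracting 48 yields the digits
print(digits)                              # [[2 3 4 5]
                                           #  [0 0 1 0]]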
Consider the following operation for short (low-length) iterables:
d = (3, slice(None, None, None), slice(None, None, None))
In [215]: %timeit any([type(i) == slice for i in d])
1000000 loops, best of 3: 695 ns per loop
In [214]: %timeit any(type(i) == slice for i in d)
1000000 loops, best of 3: 929 ns per loop
Building a list is 25% faster than using a generator expression?
Why is this the case, when building the list is an extra operation?
Note: in both runs I obtained the warning: "The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached".
Analysis
In this particular test, list comprehensions are faster up to a length of about 4, after which the generator expression has the better performance.
In the plot produced by the code below, the red line shows where this crossover occurs and the black line shows where both perform equally.
The code takes about 1 minute to run on my MacBook Pro, utilising all the cores:
import timeit, pylab, multiprocessing
import numpy as np

manager = multiprocessing.Manager()
g = manager.list([])
l = manager.list([])

rng = range(1, 16)  # list lengths
max_series = [3, slice(None, None, None)] * rng[-1]  # alternate array types
series = [max_series[:n] for n in rng]

number, reps = 1000000, 5

def func_l(d):
    l.append(timeit.repeat("any([type(i) == slice for i in {}])".format(d), repeat=reps, number=number))
    print "done List, len:{}".format(len(d))

def func_g(d):
    g.append(timeit.repeat("any(type(i) == slice for i in {})".format(d), repeat=reps, number=number))
    print "done Generator, len:{}".format(len(d))

p = multiprocessing.Pool(processes=min(16, rng[-1]))  # optimize for 16 processors
p.map(func_l, series)  # pool list
p.map(func_g, series)  # pool gens

ratio = np.asarray(g).mean(axis=1) / np.asarray(l).mean(axis=1)

pylab.plot(rng, ratio, label='av. generator time / av. list time')
pylab.title("{} iterations, averaged over {} runs".format(number, reps))
pylab.xlabel("length of iterable")
pylab.ylabel("Time Ratio (Higher is worse)")
pylab.legend()
lt_zero = np.argmax(ratio < 1.)
pylab.axhline(y=1, color='k')
pylab.axvline(x=lt_zero + 1, color='r')
pylab.ion(); pylab.show()
The catch is the size of the iterable you are applying any to. Repeat the same process on a larger dataset:
In [2]: d = ([3] * 1000) + [slice(None, None, None), slice(None, None, None)]*1000
In [3]: %timeit any([type(i) == slice for i in d])
1000 loops, best of 3: 736 µs per loop
In [4]: %timeit any(type(i) == slice for i in d)
1000 loops, best of 3: 285 µs per loop
Then, using a list (loading all the items into memory) becomes much slower, and the generator expression plays out better.
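Short-circuiting is also part of the picture: with a generator, any() can stop at the first truthy result, while the list comprehension always processes every element before any() even starts. A minimal sketch (made-up data, Python 3.5+ for the globals argument of timeit):

import timeit

d = [slice(None)] + [3] * 2000   # the matching element comes first

list_time = timeit.timeit("any([type(i) == slice for i in d])", globals={"d": d}, number=10000)
gen_time = timeit.timeit("any(type(i) == slice for i in d)", globals={"d": d}, number=10000)
print(list_time, gen_time)       # the generator version should be far faster here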
I have the following snippet that extracts indices of all unique values (hashable) in a sequence-like data with canonical indices and store them in a dictionary as lists:
from collections import defaultdict

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
This looks to me like a quite common use case. And it happens that 90% of the execution time of my code is spent in these few lines. This part is passed through over 10000 times during execution, and len(data) is around 50000 to 100000 each time it is run. The number of unique elements ranges roughly from 50 to 150.
Is there a faster way, perhaps vectorized/c-extended (e.g. numpy or pandas methods), that achieves the same thing?
Many many thanks.
Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:
import numpy as np, pandas as pd
from collections import defaultdict

def by_dd(data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(data):
    return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

def by_pand2(data):
    return data.groupby(data.values).indices

data = pd.Series(np.random.randint(0, 100, size=10**5))
gives me
>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
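For reference, a small sketch of what groupby(...).indices returns, namely a dict mapping each unique value to a NumPy array of its positions, which is why by_pand1 only needs a .tolist() per value:

import pandas as pd

s = pd.Series([3, 1, 3, 2, 1])
print(s.groupby(s.values).indices)
# e.g. {1: array([1, 4]), 2: array([3]), 3: array([0, 2])}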
Though it's not the perfect solution (it's O(N log N) instead of O(N)), a much faster, vectorized way to do it is:
def data_to_idxlists(data):
    sorting_ixs = np.argsort(data)
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
Another solution, which is O(N*U) (where U is the number of unique groups):
def data_to_idxlists(data):
    u, ixs = np.unique(data, return_inverse=True)
    return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}
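A quick look at the output shape (toy data of my own); note that np.nonzero returns a tuple of arrays, so each dict value here is a 1-tuple wrapping the index array:

data = np.array([3, 1, 3, 2, 1])
print(data_to_idxlists(data))
# e.g. {1: (array([1, 4]),), 2: (array([3]),), 3: (array([0, 2]),)}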
I found this question pretty interesting, and while I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was slightly faster.
import numpy as np
import pandas as pd
from collections import defaultdict

data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)

def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort()  # Get the sorted indices
    # Get the sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # Find the locations where the values change and include the first and last values
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # Get the unique values themselves
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)  # number of unique values, used below
    # Return the unique values along with the indices associated with each value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist() for i in range(num_values)}

def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(input_data):
    return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

def by_pand2(input_data):
    return series.groupby(input_data).indices

def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {u: np.nonzero(ixs == i) for i, u in enumerate(u)}

def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices, list(unique_indices[1:]) + [None])}
The resulting timings were (from fastest to slowest):
>>> %timeit get_values_and_indicies(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
It should be noted that, unlike by_pand2, it returns a dict of lists as given in the example. If you would prefer a defaultdict, you can simply change the last line to return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values))), which increased the overall timing in my tests to 4.4 ms.
Lastly, I should note that these timings are data-sensitive. When I used only 10 different values I got:
get_values_and_indicies: 4.34 ms per loop
data_to_idxlists_unique: 4.42 ms per loop
by_pand2: 4.83 ms per loop
data_to_idxlists: 6.09 ms per loop
by_pand1: 9.39 ms per loop
by_dd: 22.4 ms per loop
while if I used 10,000 different values I got:
get_values_and_indicies: 7.00 ms per loop
data_to_idxlists_unique: 14.8 ms per loop
by_dd: 29.8 ms per loop
by_pand2: 47.7 ms per loop
by_pand1: 67.3 ms per loop
data_to_idxlists: 869 ms per loop
I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant autojitted python code. Unfortunately I wasn't happy, when I tried a first, quick benchmark. It seems like numba is not doing much better than ordinary python here, though I would have expected nearly C-like performance:
from numba import jit, autojit, uint, int_, double
import numpy as np
import imp
import logging

logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)

def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res

autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]),
                      locals=dict(i=uint))(sum_accum)

accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)

ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))

%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)

accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))
%timeit accumarray.accum(accmap, a)
This gives on my machine:
10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop
I'm running the latest numba version from pypi, 0.11.0. Any suggestions, how to fix the code, so it runs reasonably fast with numba?
I figured it out myself. numba wasn't able to determine the type of the result of np.max(accmap), even though the type of accmap was set to int. This somehow slowed everything down, but the fix is easy:
@autojit(locals=dict(reslen=uint))
def sum_accum(accmap, a):
    reslen = np.max(accmap) + 1
    res = np.zeros(reslen, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res
The result is quite impressive, about 2/3 of the C version:
10000 loops, best of 3: 192 us per loop
Update 2022:
The work on this issue led to the python package numpy_groupies, which is available here:
https://github.com/ml31415/numpy-groupies
@autojit
def numbaMax(arr):
    MAX = arr[0]
    for i in arr:
        if i > MAX:
            MAX = i
    return MAX

@autojit
def autonumba_sum_accum2(accmap, a):
    res = np.zeros(numbaMax(accmap) + 1)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
10 loops, best of 3: 26.5 ms per loop <- original
100 loops, best of 3: 15.1 ms per loop <- with numba but the slow numpy max
10000 loops, best of 3: 47.9 µs per loop <- with numbamax
I know I can do it like the following:
import numpy as np

N = 10
a = np.arange(1, 100, 1)
a.argsort()[-N:]
However, it is very slow since it does a full sort.
I wonder whether numpy provides some method to do it faster.
numpy 1.8 implements partition and argpartition, which perform a partial sort (in O(n) time, as opposed to a full sort, which is O(n log n)).
import numpy as np
test = np.array([9,1,3,4,8,7,2,5,6,0])
temp = np.argpartition(-test, 4)
result_args = temp[:4]
temp = np.partition(-test, 4)
result = -temp[:4]
Result:
>>> result_args
array([0, 4, 8, 5]) # indices of highest vals
>>> result
array([9, 8, 6, 7]) # highest vals
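Note that the 4 indices returned by argpartition point to the 4 largest values, but in no guaranteed order; if you also want them sorted, one option (a small sketch, not part of the original answer) is to sort just those 4 entries afterwards, continuing from the arrays above:

idx = np.argpartition(-test, 4)[:4]              # indices of the 4 largest, unordered
top_sorted = test[idx][np.argsort(-test[idx])]   # sort only those 4 values
print(top_sorted)                                # [9 8 7 6]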
Timing:
In [16]: a = np.arange(10000)
In [17]: np.random.shuffle(a)
In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop
In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop
In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop
The bottleneck module has a fast partial sort method that works directly with Numpy arrays: bottleneck.partition().
Note that bottleneck.partition() returns the actual (partitioned) values; if you want the indexes of those values (what numpy.argsort() returns), you should use bottleneck.argpartition().
I've benchmarked:
z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)
where a is a random 1,000,000-element array.
The timings were as follows:
bottleneck.partition(): 25.6 ms per loop
np.argsort(): 198 ms per loop
heapq.nlargest(): 358 ms per loop
I had this problem and, since this question is 5 years old, I had to redo all benchmarks and change the syntax of bottleneck (there is no partsort anymore, it's partition now).
I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).
I got these results:
bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas : 01.65 ms per loop
heapq : 08.61 ms per loop
numpy : 12.37 ms per loop
numpy 2 : 00.95 ms per loop
So, bottleneck_2 and numpy_2 (adas's solution) were tied.
But, using np.percentile (numpy_2) you get those top-N elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested in the indexes of those elements, percentile is not useful.
I added pandas too, which uses bottleneck underneath, if available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands, just use nlargest and you're done.
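For instance, the pandas route is essentially a one-liner (a small sketch, not part of the benchmark below); the result is a Series sorted in descending order whose index holds the original positions:

import numpy as np
import pandas as pd

a = np.random.rand(1000000)
top50 = pd.Series(a).nlargest(50)   # values sorted descending; top50.index holds their positions in a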
The code used for the benchmark is as follows (python 3, please):
import time
import numpy as np
import bottleneck as bn
import pandas as pd
import heapq

def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]

def bottleneck_2(a, n):
    return bn.partition(a, a.size-n)[-n:]

def numpy(a, n):
    return a[a.argsort()[-n:]]

def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M-n, M) + 1.0) / M * 100
    return np.percentile(a, perc)

def pandas(a, n):
    return pd.Series(a).nlargest(n)

def hpq(a, n):
    return heapq.nlargest(n, a)

def do_nothing(a, n):
    return a[:n]

def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop

t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)

print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas      : {:05.2f} ms per loop".format(t3 - t0))
print("heapq       : {:05.2f} ms per loop".format(t4 - t0))
print("numpy       : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2     : {:05.2f} ms per loop".format(t6 - t0))
Each negative sign in the proposed bottleneck solution
-bottleneck.partsort(-a, 10)[:10]
makes a copy of the data. We can remove the copies by doing
bottleneck.partsort(a, a.size-10)[-10:]
Also the proposed numpy solution
a.argsort()[-10:]
returns indices not values. The fix is to use the indices to find the values:
a[a.argsort()[-10:]]
The relative speed of the two bottleneck solutions depends on the ordering of the elements in the initial array because the two approaches partition the data at different points.
In other words, timing with any one particular random array can make either method look faster.
Averaging the timing across 100 random arrays, each with 1,000,000 elements, gives
-bn.partsort(-a, 10)[:10]: 1.76 ms per loop
bn.partsort(a, a.size-10)[-10:]: 0.92 ms per loop
a[a.argsort()[-10:]]: 15.34 ms per loop
where the timing code is as follows:
import time
import numpy as np
import bottleneck as bn

def bottleneck_1(a):
    return -bn.partsort(-a, 10)[:10]

def bottleneck_2(a):
    return bn.partsort(a, a.size-10)[-10:]

def numpy(a):
    return a[a.argsort()[-10:]]

def do_nothing(a):
    return a

def benchmark(func, size=1000000, ntimes=100):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop

t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(numpy)
t4 = benchmark(do_nothing)

print "-bn.partsort(-a, 10)[:10]: %0.2f ms per loop" % (t1 - t4)
print "bn.partsort(a, a.size-10)[-10:]: %0.2f ms per loop" % (t2 - t4)
print "a[a.argsort()[-10:]]: %0.2f ms per loop" % (t3 - t4)
Perhaps heapq.nlargest
import numpy as np
import heapq
x = np.array([1,-5,4,6,-3,3])
z = heapq.nlargest(3,x)
Result:
>>> z
[6, 4, 3]
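If you also need the indices of those largest elements, heapq can give them as well via a key function (a small addition of mine, not part of the original answer):

>>> heapq.nlargest(3, range(len(x)), key=x.__getitem__)
[3, 2, 5]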
If you want to find the indices of the n largest elements using bottleneck you could use
bottleneck.argpartsort
>>> x = np.array([1,-5,4,6,-3,3])
>>> z = bottleneck.argpartsort(-x, 3)[:3]
>>> z
array([3, 2, 5])
You can also use numpy's percentile function. In my case it was slightly faster than bottleneck.partsort():
import timeit
import numpy as np
import bottleneck as bn

N, M, K = 10, 1000000, 100

start = timeit.default_timer()
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = -bn.partsort(-a, N)[:N]
stop = timeit.default_timer()
print (stop - start)/K

start = timeit.default_timer()
perc = (np.arange(M-N, M) + 1.0) / M * 100
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = np.percentile(a, perc)
stop = timeit.default_timer()
print (stop - start)/K
Average time per loop:
bottleneck.partsort(): 59 ms
np.percentile(): 54 ms
If storing the array as a list of numbers isn't problematic, you can use
import heapq
heapq.nlargest(N, a)
to get the N largest members.