Is there a faster way to write the compute_optimal_weights function in Python? I run it hundreds of millions of times, so any speed increase would help. The arguments of the function are different each time I call it.
import random
import time

c1 = 0.25
c2 = 0.67

def compute_optimal_weights(input_prices):
    input_weights_optimal = {}
    for i in input_prices:
        price = input_prices[i]
        input_weights_optimal[i] = c2 / sum([(price/n) ** c1 for n in input_prices.values()])
    return input_weights_optimal

input_sellers_ID = range(10)
input_prices = {}
for i in input_sellers_ID:
    input_prices[i] = random.uniform(0,1)

t0 = time.time()
for i in xrange(1000000):
    compute_optimal_weights(input_prices)
t1 = time.time()
print "old time", (t1 - t0)
The number of elements in the list and dictionary varies, but on average there are about 10 elements. The keys in input_prices are the same across all calls, but the values change, so the same key will have different values over different runs.
Using a little bit of math, you can compute part of your sum_price_ratio_scaled as a constant before the loop and speed up your program by ~80% (for the average input size of 10).
Optimized Implementation (Python 3):
def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result
Edit, in response to this answer: While using numpy will prove more performant with massive data sets, given that "on average there are about 10 elements" in your input_sellers_ID list, I doubt that this approach is worth its own weight for your particular application.
Although it might be tempting to leverage the terseness of generator expressions and dictionary comprehensions, I noticed when running on my machine that the best performance was obtained by using regular for-in loops and avoiding function calls like sum(...). For the sake of completeness, though, here is what the above implementation would look like in a more 'pythonic' style:
def compute_optimal_weights(ids, prices):
    scaled_sum = sum(prices[i] ** -0.25 for i in ids)
    return {i: 0.67 * (prices[i] ** -0.25) / scaled_sum for i in ids}
Reasoning / Math:
Based on your posted algorithm, you are trying to create a dictionary with values represented by the function f(i) below, where i is one of the elements in your input_sellers_ID list.
When you initially write out the formula for f(i), it appears as though prices[i] must be recalculated at every step of the summation, which is costly. Simplifying the expression using the rules of exponents, however, you can see that the summation needed to determine f(i) is actually independent of i (only the index j is ever used), meaning that term is a constant and can be calculated outside the loop that sets the dictionary values.
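Written out from the code in the question (with the sum running over every price p_j in input_prices, and p_i standing for prices[i]), the simplification is:

f(i) = c2 / sum_j (p_i / p_j) ** c1
     = c2 / (p_i ** c1 * sum_j p_j ** -c1)
     = c2 * p_i ** -c1 / sum_j p_j ** -c1

with c1 = 0.25 and c2 = 0.67. The sum over j does not involve i at all, so it is exactly the scaled_sum that the optimized version computes once.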
Note that above I refer to input_prices as prices and input_sellers_ID as ids.
Performance Profile (~80% speed improvement on my machine, size 10):
import time
import random
def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result

def compute_optimal_weights_old(input_sellers_ID, input_prices):
    input_weights_optimal = {}
    for i in input_sellers_ID:
        sum_price_ratio_scaled = 0
        for j in input_sellers_ID:
            price_ratio = input_prices[i] / input_prices[j]
            scaled_price_ratio = price_ratio ** c1
            sum_price_ratio_scaled += scaled_price_ratio
        input_weights_optimal[i] = c2 / sum_price_ratio_scaled
    return input_weights_optimal
c1 = 0.25
c2 = 0.67
input_sellers_ID = range(10)
input_prices = {i: random.uniform(0,1) for i in input_sellers_ID}
start = time.perf_counter()  # time.clock() was removed in Python 3.8
for _ in range(1000000):
    compute_optimal_weights_old(input_sellers_ID, input_prices) and None
old_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(1000000):
    compute_optimal_weights(input_sellers_ID, input_prices) and None
new_time = time.perf_counter() - start

print('Old:', compute_optimal_weights_old(input_sellers_ID, input_prices))
print('New:', compute_optimal_weights(input_sellers_ID, input_prices))
print('New algorithm is {:.2%} faster.'.format(1 - new_time / old_time))
I believe we can speed up the function by factoring the loop. Let a = price, b = n and c = c1; if my maths are not wrong (e.g. (5/6)**3 == 5**3 / 6**3):
(5./6.)**2 + (5./4.)**2
==
5**2 / 6.**2 + 5**2 / 4.**2
==
5**2 * (1/6.**2 + 1/4.**2)
With variables:
sum( (a / b) ** c for each b)
==
sum( a**c * (1/b) ** c for each b)
==
a**c * sum((1./b)**c for each b)
The second factor is constant and can be computed once, outside the per-key work, which leaves:
Faster implementation - Raw Python
Using generators and dict-comprehension:
def compute_optimal_weights(input_prices):
    sconst = sum(1/w**c1 for w in input_prices.values())
    return {k: c2 / (v**c1 * sconst) for k, v in input_prices.items()}
NOTE: if you are using Python 2, replace .values() and .items() with .itervalues() and .iteritems() for an extra speedup (a few ms with large inputs).
Even Faster - Numpy
Additionally, if you don't care that much about the dictionary and just want the values, you could speed it up using numpy (for large inputs >100):
import numpy as np

def compute_optimal_weights_np(input_prices):
    # on Python 3, materialise the dict view so numpy builds a numeric array
    data = np.asarray(list(input_prices.values())) ** c1
    return c2 / (data * np.sum(1./data))
A few timings for different input sizes:
N = 10 inputs:
MINE: 100000 loops, best of 3: 6.02 µs per loop
NUMPY: 100000 loops, best of 3: 10.6 µs per loop
YOURS: 10000 loops, best of 3: 23.8 µs per loop
N = 100 inputs:
MINE: 10000 loops, best of 3: 49.1 µs per loop
NUMPY: 10000 loops, best of 3: 22.6 µs per loop
YOURS: 1000 loops, best of 3: 1.86 ms per loop
N = 1000 inputs:
MINE: 1000 loops, best of 3: 458 µs per loop
NUMPY: 10000 loops, best of 3: 121 µs per loop
YOURS: 10 loops, best of 3: 173 ms per loop
N = 100000 inputs:
MINE: 10 loops, best of 3: 54.2 ms per loop
NUMPY: 100 loops, best of 3: 11.1 ms per loop
YOURS: didn't finish in a couple of minutes
Both options here are considerably faster than the one presented in the question. The benefit of numpy, if you can provide consistent input (an array instead of a dictionary), becomes apparent as the size grows.
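For illustration only, here is a minimal sketch (my own, not from either answer) of what the array-based variant could look like if the caller keeps the prices in a NumPy array with a fixed key order:

import numpy as np

c1, c2 = 0.25, 0.67

def compute_optimal_weights_arr(prices):
    # prices: 1-D array of prices in a fixed, agreed-upon key order
    scaled = prices ** -c1           # p_i ** -0.25, elementwise
    return c2 * scaled / scaled.sum()

# usage: weights = compute_optimal_weights_arr(np.random.uniform(0, 1, 10))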
Related
I'm working with DNA sequence alignments and trying to implement a simple scoring algorithm. Since I have to use a matrix for the calculations, I thought numpy would be way faster than a list of lists, but when I tested both, the Python lists seem to be way faster. I found this thread (Why use numpy over list based on speed?), but still: I'm using preallocated numpy vs preallocated lists, and the lists of lists are the clear winners.
Here is my code:
Lists
def edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = []
    for i in range(x_dim):
        D.append([0] * (y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
Numpy
import numpy as np

def NP_edirDistance(x, y):
    x_dim = len(x)+1
    y_dim = len(y)+1
    D = np.zeros((x_dim, y_dim))
    #Filling the matrix borders
    for i in range(x_dim):
        D[i][0] = i
    for i in range(y_dim):
        D[0][i] = i
    for i in range(1, x_dim):
        for j in range(1, y_dim):
            distHor = D[i][j-1] + 1
            distVer = D[i-1][j] + 1
            if x[i-1] == y[j-1]:
                distDiag = D[i-1][j-1]
            else:
                distDiag = D[i-1][j-1] + 1
            D[i][j] = min(distHor, distVer, distDiag)
    return D
I'm not timing the np import.
a = 'ACGTACGACTATCGACTAGCTACGAA'
b = 'ACCCACGTATAACGACTAGCTAGGGA'
%%time
edirDistance(a, b)
total: 1.41 ms
%%time
NP_edirDistance(a, b)
total: 4.43 ms
Replacing D[i][j] with D[i,j] greatly improved the time, but it is still slower. (Thanks @Learning is a mess!)
total: 2.64 ms
I tested with even larger DNA sequences (around 10,000 letters each) and the lists are still winning.
Can someone help me improve timing?
Are lists better for this use?
One way to get a faster run is to use GPU/TPU-aided accelerators such as numba and …. I have tested your code with those a and b on Google Colab (TPU runtime) without using accelerators:
1000 loops, best of 5: 563 µs per loop
1000 loops, best of 5: 1.95 ms per loop # NumPy
But using numba with nopython=True, without any changes to your code:
import numba as nb

@nb.njit()
def edirDistance(x, y):
    .
    .

@nb.njit()
def NP_edirDistance(x, y):
    .
    .
It gets:
1000 loops, best of 5: 213 µs per loop
1000 loops, best of 5: 153 µs per loop # NumPy
The difference between them will become significant with huge samples, or by improving and vectorizing your NumPy code. This method gives the results below for samples of length 10000 each:
35.50053691864014
22.95994758605957 # NumPy (seconds)
So I am experimenting with the performance boost of combining vectorization and a for-loop powered by @njit in numba (I am currently using numba 0.45.1). Disappointingly, I found that it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit
@njit
def func3(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        w = w + (1-alpha_arr)**i
        e = e*(1-alpha_arr) + arr_in[i]
        result[i, :] = e / w
    return result

@njit
def func4(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in range(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col]*(1-alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
if __name__ == '__main__':
    np.random.seed(0)
    data_size = 200000
    winarr_size = 1000
    data = np.random.uniform(0, 1000, size=data_size) + 29000
    win_array = np.arange(1, winarr_size+1)
    abc_test3 = func3(data, win_array)
    abc_test4 = func4(data, win_array)
    print(np.allclose(abc_test3, abc_test4, equal_nan=True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
I found that the pure nested-for-loop implementation (func4) is consistently faster (by about 2-5%) than the implementation with a for-loop mixed with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as if they were scalars, but that's just how the code looks; on its own it says nothing about performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
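A quick way to convince yourself of this (my own sketch, not from the question) is to time a single vectorized reduction on growing inputs; the cost clearly tracks the number of elements:

import time
import numpy as np

for n in (10**5, 10**6, 10**7):
    x = np.random.rand(n)
    start = time.perf_counter()
    for _ in range(100):
        x.sum()                      # vectorized, but still visits every element
    print(n, time.perf_counter() - start)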
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def func6(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in nb.prange(len(win_arr)):
            w[col] = w[col] + (1-alpha_arr[col])**i
            e[col] = e[col] * (1-alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
This makes the code a bit faster on my machine (4 cores).
However, there's also a problem: your algorithm may be numerically unstable. The (1-alpha_arr[col])**i term will underflow at some point when you raise it to powers in the hundreds of thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
... print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0 # <-- underflow
0.0
Always think twice about expensive mathematical operations (pow, division, ...). If you can replace them with cheap operations like multiplication, addition and subtraction, it is always worth a try.
Please note that multiplying alpha repeatedly by itself is only algebraically the same as direct exponentiation. Since this is floating-point math, the results can differ.
Also avoid unnecessary temporary arrays.
First try
@nb.njit(error_model="numpy", parallel=True)
def func5(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            alpha_exp *= alpha
    return result.T
Second try (avoiding underflow)
@nb.njit(error_model="numpy", parallel=True)
def func7(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1]+1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            if np.abs(alpha_exp) >= 1e-308:
                alpha_exp *= alpha
            else:
                alpha_exp = 0.
    return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I'm concerned with the speed of the following function:
def cch(tau):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)
Where "cartprod" is a variable for a list that looks like this:
cartprod = np.ndarray([[0.0123,0.0123],[0.0123,0.0459],...])
The length of this list is about 25 million. Basically, I'm trying to find a significantly faster way to return a list of differences for every pair list in that np.ndarray. Is there an algorithmic way or function that's faster than np.diff? Or, is np.diff the end all be all? I'm also open to anything else.
EDIT: Thank you all for your solutions!
I think you're hitting a wall by repeatedly returning multiple np.arrays of length ~25 million rather than np.diff being slow. I wrote an equivalent function that iterates over the array and tallies the results as it goes along. The function needs to be jitted with numba to be fast. I hope that is acceptable.
import numpy as np
from numba import jit

arr = np.random.rand(25000000, 2)

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)

%timeit cch(0.01, arr)

@jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

%timeit cch_jit(0.01, arr)
produces
294 ms ± 2.82 ms
42.7 ms ± 483 µs
which is about ~6 times faster.
We can leverage multiple cores with the numexpr module for large data, and gain memory efficiency (and hence performance) with some help from array slicing:
import numpy as np
import numexpr as ne

def cch_numexpr(a, tau):
    # pass tau in explicitly so numexpr can see it inside the expression
    d = {'a0': a[:,0], 'a1': a[:,1], 'tau': tau}
    return np.count_nonzero(ne.evaluate('abs(a0-a1-tau)<0.001', d))
Sample run and timings on 25M sized data -
In [83]: cartprod = np.random.rand(25000000,2)
In [84]: cch(cartprod, tau=0.5) == cch_numexpr(cartprod, tau=0.5)
Out[84]: True
In [85]: %timeit cch(cartprod, tau=0.5)
10 loops, best of 3: 150 ms per loop
In [86]: %timeit cch_numexpr(cartprod, tau=0.5)
10 loops, best of 3: 25.5 ms per loop
Around 6x speedup.
This was with 8 threads. Thus, with more threads available for compute, it should improve further. See the related post on how to control the multi-core functionality.
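If you want to pin that down, numexpr exposes the thread count directly; a minimal sketch (assuming a reasonably recent numexpr):

import numexpr as ne

# Cap numexpr at 4 threads for this process. set_num_threads returns the
# previous setting, so it can be restored later if needed.
previous = ne.set_num_threads(4)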
Just out of curiosity I compared the solutions of @Divakar (numexpr) and @alexdor (numba.jit). Going by the numbers below, the numba jit implementation is roughly twice as fast as numexpr.evaluate. The results are shown for 100 runs each:
np.sum: 111.07543396949768
numexpr: 12.282189846038818
JIT: 6.2505223751068115
'np.sum' returns same result as 'numexpr'
'np.sum' returns same result as 'jit'
'numexpr' returns same result as 'jit'
Script to reproduce the results:
import numpy as np
import time
import numba
import numexpr

arr = np.random.rand(25000000, 2)
runs = 100

def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)

def cch_ne(tau, cartprod):
    d = {'a0': cartprod[:,0], 'a1': cartprod[:,1], 'tau': tau}
    count = np.count_nonzero(numexpr.evaluate('abs(a0-a1-tau)<0.001', d))
    return count

@numba.jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1] - cartprod[i, 0])), 0.001)
    return count

start = time.time()
for x in range(runs):
    x1 = cch(0.01, arr)
print('np.sum:\t\t', time.time() - start)

start = time.time()
for x in range(runs):
    x2 = cch_ne(0.01, arr)
print('numexpr:\t', time.time() - start)

x3 = cch_jit(0.01, arr)  # warm-up call so the JIT compilation is not timed
start = time.time()
for x in range(runs):
    x3 = cch_jit(0.01, arr)
print('JIT:\t\t', time.time() - start)

if x1 == x2: print('\'np.sum\' returns same result as \'numexpr\'')
if x1 == x3: print('\'np.sum\' returns same result as \'jit\'')
if x2 == x3: print('\'numexpr\' returns same result as \'jit\'')
Here is an example:
4 digits
first, second digit's range is : 0 ~ 5 (total six number)
third, fourth digit's range is : 0 ~ 4 (total five number)
So, 0000, 0040, 0111, 4455 are ok but 5555, 4555, 4466 are not ok.
What I want is to find the ordinal of 2345 (counting from a zero start index).
For example, 0001 is "1" in ordinal. Likewise, 0010 is "5".
It can be calculated by:
(5*6*6*1)*2 + (6*6*1)*3 + (6*1)*4 + (1)*5 = 497
I made a function in Python
import numpy as np
def find_real_index_of_state(state, num_cnt_in_each_digit):
    """
    Parameters
    ==========
    state (str)
    num_cnt_in_each_digit (list): the number of possible values for each digit
    """
    num_of_digit = len(state)
    digit_list = [int(i) for i in state]
    num_cnt_in_each_digit.append(1)
    real_index = 0
    for i in range(num_of_digit):
        real_index += np.product(num_cnt_in_each_digit[num_of_digit-i:]) * digit_list[num_of_digit-i-1]
    return real_index
find_real_index_of_state("2345", [5,5,6,6])
Its result is the same, 497.
The problem, though, is that this function is really slow. I need a much faster version, but this is the best I can come up with.
I really need your advice on improving its performance (e.g. vectorization, etc.).
Thanks
Hope I understood you correctly.
The first thing I notice is that you do not need to recalculate everything on each loop iteration: you compute (5*6*6*1), (6*6*1), (6*1), (1) individually, when each factor can be built from the previous one and only needs to be calculated once.
def find_real_index_of_state(state, num_cnt_in_each_digit):
    factor = 1
    total = 0
    for digit, num_cnt in zip(reversed(state), reversed(num_cnt_in_each_digit)):
        digit = int(digit)
        total += digit*factor
        factor *= num_cnt
    return total
Here's one vectorized approach making use of np.cumprod to perform the iterative np.product and then np.dot for the sum-reductions -
def real_index_vectorized(n, count):
    num = [int(d) for d in str(n)]
    # Or np.array([n]).view((str,1)).astype(int)  (thanks to @Eric)
    # Or (int(n)//(10**np.arange(len(n)-1,-1,-1)))%10
    return np.dot(np.cumprod(count[:0:-1]), num[-2::-1]) + num[-1]
Runtime test -
1) Original sample :
In [66]: %timeit find_real_index_of_state("2345",[5,5,6,6])
100000 loops, best of 3: 14.1 µs per loop
In [67]: %timeit real_index_vectorized("2345",[5,5,6,6])
100000 loops, best of 3: 8.19 µs per loop
2) A bit bigger sample :
In [69]: %timeit find_real_index_of_state("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
10000 loops, best of 3: 52.7 µs per loop
In [70]: %timeit real_index_vectorized("234532321321323",[5,5,6,6,3,5,4,6,4,5,2,3,5,3,3])
100000 loops, best of 3: 12.5 µs per loop
Being a vectorized solution, it would scale well when it competes against a loopy version that has a good number of loop iterations.
For performance, I propose you vectorize your states first:
base=np.array([5*6*6,6*6,6,1])
states=np.array(["2345","0010"])
numbers=np.frombuffer(states,np.uint32).reshape(-1,4)-48 # faster
ordinals=(base*numbers).sum(1)
#array([497, 6], dtype=int64)
I want to find the N largest elements of a numpy array. I know I can do it like the following:
import numpy as np

N = 10
a = np.arange(1, 100, 1)
a.argsort()[-N:]
However, it is very slow since it does a full sort.
I wonder whether numpy provides some method to do it fast.
numpy 1.8 implements partition and argpartition, which perform a partial sort (in O(n) time, as opposed to a full sort, which is O(n log n)).
import numpy as np
test = np.array([9,1,3,4,8,7,2,5,6,0])
temp = np.argpartition(-test, 4)
result_args = temp[:4]
temp = np.partition(-test, 4)
result = -temp[:4]
Result:
>>> result_args
array([0, 4, 8, 5]) # indices of highest vals
>>> result
array([9, 8, 6, 7]) # highest vals
Timing:
In [16]: a = np.arange(10000)
In [17]: np.random.shuffle(a)
In [18]: %timeit np.argsort(a)
1000 loops, best of 3: 1.02 ms per loop
In [19]: %timeit np.argpartition(a, 100)
10000 loops, best of 3: 139 us per loop
In [20]: %timeit np.argpartition(a, 1000)
10000 loops, best of 3: 141 us per loop
The bottleneck module has a fast partial sort method that works directly with Numpy arrays: bottleneck.partition().
Note that bottleneck.partition() returns the actual (partitioned) values; if you want the indexes of those values (analogous to what numpy.argsort() returns), you should use bottleneck.argpartition().
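A minimal usage sketch (assuming bottleneck >= 1.0, where the functions are named partition/argpartition rather than the older partsort/argpartsort):

import numpy as np
import bottleneck as bn

a = np.random.rand(1000000)
n = 10

idx = bn.argpartition(a, a.size - n)[-n:]   # indices of the n largest values (unordered)
vals = a[idx]                               # the n largest values themselves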
I've benchmarked:
z = -bottleneck.partition(-a, 10)[:10]
z = a.argsort()[-10:]
z = heapq.nlargest(10, a)
where a is a random 1,000,000-element array.
The timings were as follows:
bottleneck.partition(): 25.6 ms per loop
np.argsort(): 198 ms per loop
heapq.nlargest(): 358 ms per loop
I had this problem and, since this question is 5 years old, I had to redo all benchmarks and change the syntax of bottleneck (there is no partsort anymore, it's partition now).
I used the same arguments as kwgoodman, except the number of elements retrieved, which I increased to 50 (to better fit my particular situation).
I got these results:
bottleneck 1: 01.12 ms per loop
bottleneck 2: 00.95 ms per loop
pandas : 01.65 ms per loop
heapq : 08.61 ms per loop
numpy : 12.37 ms per loop
numpy 2 : 00.95 ms per loop
So, bottleneck_2 and numpy_2 (adas's solution) were tied.
But, using np.percentile (numpy_2), you get those topN elements already sorted, which is not the case for the other solutions. On the other hand, if you are also interested in the indexes of those elements, percentile is not useful.
I added pandas too, which uses bottleneck underneath, if available (http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies). If you already have a pandas Series or DataFrame to start with, you are in good hands: just use nlargest and you're done.
The code used for the benchmark is as follows (python 3, please):
import time
import numpy as np
import bottleneck as bn
import pandas as pd
import heapq
def bottleneck_1(a, n):
    return -bn.partition(-a, n)[:n]

def bottleneck_2(a, n):
    return bn.partition(a, a.size-n)[-n:]

def numpy(a, n):
    return a[a.argsort()[-n:]]

def numpy_2(a, n):
    M = a.shape[0]
    perc = (np.arange(M-n, M)+1.0)/M*100
    return np.percentile(a, perc)

def pandas(a, n):
    return pd.Series(a).nlargest(n)

def hpq(a, n):
    return heapq.nlargest(n, a)

def do_nothing(a, n):
    return a[:n]

def benchmark(func, size=1000000, ntimes=100, topn=50):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a, topn)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(pandas)
t4 = benchmark(hpq)
t5 = benchmark(numpy)
t6 = benchmark(numpy_2)
t0 = benchmark(do_nothing)
print("bottleneck 1: {:05.2f} ms per loop".format(t1 - t0))
print("bottleneck 2: {:05.2f} ms per loop".format(t2 - t0))
print("pandas : {:05.2f} ms per loop".format(t3 - t0))
print("heapq : {:05.2f} ms per loop".format(t4 - t0))
print("numpy : {:05.2f} ms per loop".format(t5 - t0))
print("numpy 2 : {:05.2f} ms per loop".format(t6 - t0))
Each negative sign in the proposed bottleneck solution
-bottleneck.partsort(-a, 10)[:10]
makes a copy of the data. We can remove the copies by doing
bottleneck.partsort(a, a.size-10)[-10:]
Also the proposed numpy solution
a.argsort()[-10:]
returns indices not values. The fix is to use the indices to find the values:
a[a.argsort()[-10:]]
The relative speed of the two bottleneck solutions depends on the ordering of the elements in the initial array because the two approaches partition the data at different points.
In other words, timing with any one particular random array can make either method look faster.
Averaging the timing across 100 random arrays, each with 1,000,000 elements, gives
-bn.partsort(-a, 10)[:10]: 1.76 ms per loop
bn.partsort(a, a.size-10)[-10:]: 0.92 ms per loop
a[a.argsort()[-10:]]: 15.34 ms per loop
where the timing code is as follows:
import time
import numpy as np
import bottleneck as bn
def bottleneck_1(a):
    return -bn.partsort(-a, 10)[:10]

def bottleneck_2(a):
    return bn.partsort(a, a.size-10)[-10:]

def numpy(a):
    return a[a.argsort()[-10:]]

def do_nothing(a):
    return a

def benchmark(func, size=1000000, ntimes=100):
    t1 = time.time()
    for n in range(ntimes):
        a = np.random.rand(size)
        func(a)
    t2 = time.time()
    ms_per_loop = 1000000 * (t2 - t1) / size
    return ms_per_loop
t1 = benchmark(bottleneck_1)
t2 = benchmark(bottleneck_2)
t3 = benchmark(numpy)
t4 = benchmark(do_nothing)
print "-bn.partsort(-a, 10)[:10]: %0.2f ms per loop" % (t1 - t4)
print "bn.partsort(a, a.size-10)[-10:]: %0.2f ms per loop" % (t2 - t4)
print "a[a.argsort()[-10:]]: %0.2f ms per loop" % (t3 - t4)
Perhaps heapq.nlargest
import numpy as np
import heapq
x = np.array([1,-5,4,6,-3,3])
z = heapq.nlargest(3,x)
Result:
>>> z
[6, 4, 3]
If you want to find the indices of the n largest elements using bottleneck you could use
bottleneck.argpartsort
>>> x = np.array([1,-5,4,6,-3,3])
>>> z = bottleneck.argpartsort(-x, 3)[:3]
>>> z
array([3, 2, 5])
You can also use numpy's percentile function. In my case it was slightly faster than bottleneck.partsort():
import timeit
import numpy as np
import bottleneck as bn

N, M, K = 10, 1000000, 100

start = timeit.default_timer()
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = -bn.partsort(-a, N)[:N]
stop = timeit.default_timer()
print (stop - start)/K

start = timeit.default_timer()
perc = (np.arange(M-N, M)+1.0)/M*100
for k in range(K):
    a = np.random.uniform(size=M)
    tmp = np.percentile(a, perc)
stop = timeit.default_timer()
print (stop - start)/K
Average time per loop:
bottleneck.partsort(): 59 ms
np.percentile(): 54 ms
If storing the array as a list of numbers isn't problematic, you can use
import heapq
heapq.nlargest(N, a)
to get the N largest members.
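And if you also need the positions of those N members, one (unbenchmarked) variation is to rank the indices by value through the key argument:

import heapq
import numpy as np

a = np.random.rand(1000000)
N = 10

top_idx = heapq.nlargest(N, range(len(a)), key=a.__getitem__)  # indices, largest value first
top_vals = [a[i] for i in top_idx]                             # corresponding values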