numpy.take can be applied in 2 dimensions with
np.take(np.take(T,ix,axis=0), iy,axis=1 )
I tested the stencil of the discrete 2-dimensional Laplacian
ΔT = T[ix-1,iy] + T[ix+1,iy] + T[ix,iy-1] + T[ix,iy+1] - 4 * T[ix,iy]
with two take-based schemes and the usual numpy slicing scheme. The functions p and q are introduced for leaner code and address axes 0 and 1 in different order. This is the code:
nx = 300; ny= 300
T = np.arange(nx*ny).reshape(nx, ny)
ix = np.linspace(1,nx-2,nx-2,dtype=int)
iy = np.linspace(1,ny-2,ny-2,dtype=int)
#------------------------------------------------------------
def p(Φ,kx,ky):
    return np.take(np.take(Φ, ky, axis=1), kx, axis=0)
#------------------------------------------------------------
def q(Φ,kx,ky):
    return np.take(np.take(Φ, kx, axis=0), ky, axis=1)
#------------------------------------------------------------
%timeit ΔT_n = T[0:nx-2,1:ny-1] + T[2:nx,1:ny-1] + T[1:nx-1,0:ny-2] + T[1:nx-1,2:ny] - 4.0 * T[1:nx-1,1:ny-1]
%timeit ΔT_t = p(T,ix-1,iy) + p(T,ix+1,iy) + p(T,ix,iy-1) + p(T,ix,iy+1) - 4.0 * p(T,ix,iy)
%timeit ΔT_t = q(T,ix-1,iy) + q(T,ix+1,iy) + q(T,ix,iy-1) + q(T,ix,iy+1) - 4.0 * q(T,ix,iy)
Output:
1000 loops, best of 3: 944 µs per loop
100 loops, best of 3: 3.11 ms per loop
100 loops, best of 3: 2.02 ms per loop
The results seem obvious:
usual numpy index arithmetic is fastest
take-scheme q takes 100% longer (= C ordering?)
take-scheme p takes 200% longer (= Fortran ordering?)
Not even the 1-dimensional
example of the scipy manual indicates that numpy.take is fast:
a = np.array([4, 3, 5, 7, 6, 8])
indices = [0, 1, 4]
%timeit np.take(a, indices)
%timeit a[indices]
Output:
The slowest run took 6.58 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.32 µs per loop
The slowest run took 7.34 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.87 µs per loop
Does anybody have experience with how to make numpy.take fast? It would be a flexible and attractive way of writing lean code that is quick to write and
is said to be fast in execution as well. Thank you for any hints to improve my approach!
The indexed version might be cleaned up with slice objects like this:
T[0:nx-2,1:ny-1] + T[2:nx,1:ny-1] + T[1:nx-1,0:ny-2] + T[1:nx-1,2:ny] - 4.0 * T[1:nx-1,1:ny-1]
sy1 = slice(1,ny-1)
sx1 = slice(1,nx-1)
sy2 = slice(2,ny)
sy_2 = slice(0,ny-2)
T[0:nx-2,sy1] + T[2:nx,sy1] + T[sx1,sy_2] + T[sx1,sy2] - 4.0 * T[sx1,sy1]
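For completeness (my addition, following the same naming pattern; sx2 and sx_2 are new names not in the answer), the x-direction can get slice objects too, which makes the stencil fully symbolic:

sx2 = slice(2,nx)
sx_2 = slice(0,nx-2)
T[sx_2,sy1] + T[sx2,sy1] + T[sx1,sy_2] + T[sx1,sy2] - 4.0 * T[sx1,sy1]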
Thanks @Divakar and @hpaulj! Yes, working with slice is viable too. Comparing all 4 approaches gives:
fastest ex aequo: t(usual np) and t(slice)
t(take) = 2 * t(slice)
t(ix_) = 3 * t(slice)
Here are the code and the results:
import numpy as np
from numpy import ix_ as r
nx = 500; ny = 500
T = np.arange(nx*ny).reshape(nx, ny)
ix = np.arange(1,nx-1);
iy = np.arange(1,ny-1);
jx = slice(1,nx-1); jxm = slice(0,nx-2); jxp = slice(2,nx)
jy = slice(1,ny-1); jym = slice(0,ny-2); jyp = slice(2,ny)
#------------------------------------------------------------
def p(U,kx,ky):
    return np.take(np.take(U, kx, axis=0), ky, axis=1)
#------------------------------------------------------------
%timeit ΔT_slice= -T[jxm,jy] + T[jxp,jy] - T[jx,jym] + T[jx,jyp] - 0.0 * T[jx,jy]
%timeit ΔT_npy = -T[0:nx-2,1:ny-1] + T[2:nx,1:ny-1] - T[1:nx-1,0:ny-2] + T[1:nx-1,2:ny] - 0.0 * T[1:nx-1,1:ny-1]
%timeit ΔT_take = -p(T,ix-1,iy) + p(T,ix+1,iy) - p(T,ix,iy-1) + p(T,ix,iy+1) - 0.0 * p(T,ix,iy)
%timeit ΔT_ix_ = -T[r(ix-1,iy)] + T[r(ix+1,iy)] - T[r(ix,iy-1)] + T[r(ix,iy+1)] - 0.0 * T[r(ix,iy)]
Output:
100 loops, best of 3: 3.14 ms per loop
100 loops, best of 3: 3.13 ms per loop
100 loops, best of 3: 7.03 ms per loop
100 loops, best of 3: 9.58 ms per loop
Concerning the discussion about views and copies, the following might be instructive:
print("if False --> a view ; if True --> a copy" )
print("_slice_ :", T[jx,jy].base is None)
print("_npy_ :", T[1:nx-1,1:ny-1].base is None)
print("_take_ :", p(T,ix,iy).base is None)
print("_ix_ :", T[r(ix,iy)].base is None)
Output:
if False --> a view ; if True --> a copy
_slice_ : False
_npy_ : False
_take_ : True
_ix_ : True
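Since np.take always returns a copy anyway, one further variant worth trying (my own sketch, not benchmarked above) is to gather the whole sub-block with a single np.take on flat indices instead of two chained calls:

# flat_idx[i, j] = ix[i]*ny + iy[j] is the C-order flat index of T[ix[i], iy[j]]
flat_idx = ix[:, None] * ny + iy[None, :]
block = np.take(T.ravel(), flat_idx)   # same values as p(T, ix, iy)

Whether this beats the chained takes would have to be timed; it trades the second gather pass for an index computation.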
I have two dataframes s and sk with around 1M elements each and I need to generate a new dataframe df from them where:
df.iloc[i] = s.iloc[f(i)] / sk.iloc[g(i)]
where f and g are functions that return integers.
Currently I'm doing:
data = []
for i in range(s.shape[0]):
    data.append(s.iloc[f(i)] / sk.iloc[g(i)])
df = pd.DataFrame(data, columns=s.columns)
But this seems slow. It's taking about 5 minutes (the dataframes have 9 float columns).
There are only 10M divisions, so 5 minutes seems sub-par. All the time seems to be spent iterating over s and sk, so I was wondering if there is a way to build s[f] and sk[g] quickly?
Edit:
f and g are simple functions similar to
def f(i): return math.ceil(i / 23)
def g(i): return math.ceil(i / 23) + ((i - 1) % 23)
Your functions are easily vectorized.
def f_vec(i):
    return np.ceil(i / 23).astype(int)

def g_vec(i):
    return (np.ceil(i / 23) + ((i - 1) % 23)).astype(int)
As @Wen points out, we can further optimize this by writing a wrapper that calculates the ceiling only once.
def wrapper(i, a, b):
    cache_ceil = np.ceil(i / 23).astype(int)
    fidx = cache_ceil
    gidx = cache_ceil + ((i - 1) % 23)
    return a.iloc[fidx].to_numpy() / b.iloc[gidx].to_numpy()
Index alignment is also not working in your favor here. If you truly want the elementwise division of the two results, drop down to numpy before dividing:
s.iloc[f_vec(idx)].to_numpy() / sk.iloc[g_vec(idx)].to_numpy()
Now to test out the speed.
Setup
a = np.random.randint(1, 10, (1_000_000, 10))
s = pd.DataFrame(a)
sk = pd.DataFrame(a)
idx = np.arange(1_000_000)
Performance
%timeit s.iloc[f_vec(idx)].to_numpy() / sk.iloc[g_vec(idx)].to_numpy()
265 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit wrapper(idx, s, sk)
200 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
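A further sketch (my addition, not timed in the answer above): because f_vec and g_vec return plain positional indices, you can index the underlying NumPy arrays directly and skip .iloc entirely, rebuilding the DataFrame once at the end.

# assumes s, sk, idx from the setup above (default RangeIndex, positional indexing)
s_arr = s.to_numpy()
sk_arr = sk.to_numpy()
out = s_arr[f_vec(idx)] / sk_arr[g_vec(idx)]
df = pd.DataFrame(out, columns=s.columns)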
So I am experimenting with the performance boost of combining vectorization and a for-loop powered by @njit in numba (I am currently using numba 0.45.1). Disappointingly, I found out it is actually slower than the pure nested-loop implementation in my code.
This is my code:
import numpy as np
from numba import njit
@njit
def func3(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        w = w + (1 - alpha_arr)**i
        e = e*(1 - alpha_arr) + arr_in[i]
        result[i, :] = e / w
    return result
@njit
def func4(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in range(len(win_arr)):
            w[col] = w[col] + (1 - alpha_arr[col])**i
            e[col] = e[col]*(1 - alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
if __name__ == '__main__':
    np.random.seed(0)
    data_size = 200000
    winarr_size = 1000
    data = np.random.uniform(0, 1000, size=data_size) + 29000
    win_array = np.arange(1, winarr_size + 1)
    abc_test3 = func3(data, win_array)
    abc_test4 = func4(data, win_array)
    print(np.allclose(abc_test3, abc_test4, equal_nan=True))
I benchmarked the two functions using the following configurations:
(data_size,winarr_size) = (200000,100), (200000,200),(200000,1000), (200000,2000), (20000,10000), (2000,100000).
I found that the pure nested-for-loop implementation (func4) is consistently faster (by about 2-5%) than the implementation that mixes a for-loop with vectorization (func3).
My questions are the following:
1) what needs to be changed to further improve the speed of the code?
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
It seems you misunderstood what "vectorized" means. Vectorized means that you write code that operates on arrays as if they were scalars - but that's just what the code looks like; by itself it says nothing about performance.
In the Python/NumPy world vectorized also carries the meaning that the overhead of the loop in vectorized operations is (often) much smaller compared to loopy code. However the vectorized code still has to do the loop (even if it's hidden in a library)!
Also, if you write a loop with numba, numba will compile it and create fast code that performs (generally) as fast as vectorized NumPy code. That means inside a numba function there's no significant performance difference between vectorized and non-vectorized code.
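To make that concrete, here is a toy example of my own (not taken from the question's code): inside an @njit function, the whole-array expression and the explicit loop compile to essentially the same machine loop, and both scale linearly with the array length.

import numpy as np
from numba import njit

@njit
def scale_vectorized(x):
    # whole-array expression; the loop is hidden in the compiled code
    return x * 2.0

@njit
def scale_looped(x):
    # explicit loop; numba compiles it to comparable machine code
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = x[i] * 2.0
    return out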
So that should answer your questions:
2) why is it that the computation time of the vectorized version of the function grows linearly with the size of the win_arr? I thought the vectorization should make it so that the operation speed is constant no matter how big/small the vector is, but apparently this does not hold true in this case.
It grows linearly because it still has to iterate. In vectorized code the loop is just hidden inside a library routine.
3) Are there any general conditions under which the computation time of the vectorized operation will still grow linearly with the input size?
No.
You also asked what could be done to make it faster.
The comments already mentioned that you could parallelize it:
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def func6(arr_in, win_arr):
    n = arr_in.shape[0]
    win_len = len(win_arr)
    result = np.full((n, win_len), np.nan)
    alpha_arr = 2 / (win_arr + 1)
    e = np.full(win_len, arr_in[0])
    w = np.ones(win_len)
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[0, :two_index] = arr_in[0]
    for i in range(1, n):
        for col in nb.prange(len(win_arr)):
            w[col] = w[col] + (1 - alpha_arr[col])**i
            e[col] = e[col] * (1 - alpha_arr[col]) + arr_in[i]
            result[i, col] = e[col] / w[col]
    return result
This makes the code a bit faster on my machine (4 cores).
However, there is also a problem: your algorithm may be numerically unstable. The (1-alpha_arr[col])**i will underflow at some point when you raise it to powers in the hundreds of thousands:
>>> alpha = 0.01
>>> for i in [1, 10, 100, 1_000, 10_000, 50_000, 100_000, 200_000]:
... print((1-alpha)**i)
0.99
0.9043820750088044
0.3660323412732292
4.317124741065786e-05
2.2487748498162805e-44
5.750821364590612e-219
0.0 # <-- underflow
0.0
Always think twice about complicated mathematical operations like pow and division. If you can replace them with simple operations like multiplications, additions and subtractions, it is always worth a try.
Please note that repeatedly multiplying alpha by itself is only algebraically the same as computing the power directly. Since this is numerical math, the results can differ.
Also avoid unnecessary temporary arrays.
First try
@nb.njit(error_model="numpy", parallel=True)
def func5(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            alpha_exp *= alpha
    return result.T
Second try (avoiding underflow)
@nb.njit(error_model="numpy", parallel=True)
def func7(arr_in, win_arr):
    # filling the whole array with NaNs isn't necessary
    result = np.empty((win_arr.shape[0], arr_in.shape[0]))
    for col in range(win_arr.shape[0]):
        result[col, 0] = np.nan
    two_index = np.nonzero(win_arr <= 2)[0][-1] + 1
    result[:two_index, 0] = arr_in[0]
    for col in nb.prange(win_arr.shape[0]):
        alpha = 1. - (2. / (win_arr[col] + 1.))
        alpha_exp = alpha
        w = 1.
        e = arr_in[0]
        for i in range(1, arr_in.shape[0]):
            w += alpha_exp
            e = e*alpha + arr_in[i]
            result[col, i] = e / w
            if np.abs(alpha_exp) >= 1e-308:
                alpha_exp *= alpha
            else:
                alpha_exp = 0.
    return result.T
Timings
%timeit abc_test3= func3(data, win_array)
7.17 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test4= func4(data, win_array)
7.13 s ± 13.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#from MSeifert answer (parallelized)
%timeit abc_test6= func6(data, win_array)
3.42 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test5= func5(data, win_array)
1.22 s ± 22.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit abc_test7= func7(data, win_array)
238 ms ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a faster way to write the compute_optimal_weights function in Python? I run it hundreds of millions of times, so any speed increase would help. The arguments of the function are different each time I run it.
c1 = 0.25
c2 = 0.67
def compute_optimal_weights(input_prices):
    input_weights_optimal = {}
    for i in input_prices:
        price = input_prices[i]
        input_weights_optimal[i] = c2 / sum([(price/n) ** c1 for n in input_prices.values()])
    return input_weights_optimal
input_sellers_ID = range(10)
input_prices = {}
for i in input_sellers_ID:
    input_prices[i] = random.uniform(0,1)

t0 = time.time()
for i in xrange(1000000):
    compute_optimal_weights(input_prices)
t1 = time.time()
print "old time", (t1 - t0)
The number of elements in the list and dictionary varies, but on average there are about 10 elements. The keys in input_prices are the same across all calls but the values change, so the same key will have different values over different runs.
Using a little bit of math, you can calculate part of your sum_price_ratio_scaled as a constant ahead of the loop and speed up your program by ~80% (for the average input size of 10).
Optimized Implementation (Python 3):
def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result
Edit, in response to this answer: While using numpy will prove more performant with massive data sets, given that "on average there are about 10 elements" in your input_sellers_ID list, I doubt that this approach is worth its own weight for your particular application.
Although it might be tempting to leverage the terseness of generator expressions and dictionary comprehensions, I noticed when running on my machine that the best performance was obtained by using regular for-in loops and avoiding function calls like sum(...). For the sake of completeness, though, here is what the above implementation would look like in a more 'pythonic' style:
def compute_optimal_weights(ids, prices):
    scaled_sum = sum(prices[i] ** -0.25 for i in ids)
    return {i: 0.67 * (prices[i] ** -0.25) / scaled_sum for i in ids}
Reasoning / Math:
Based on your posted algorithm, you are trying to create a dictionary with values represented by the function f(i) below, where i is one of the elements in your input_sellers_ID list.
When you initially write out the formula for f(i), it appears as though prices[i] must be recalculated for every step of the summation process, which is costly. Simplifying the expression using the rules of exponents, however, you can see that the simplest summation needed to determine f(i) is actually independent of i (only the index value of j is ever used), meaning that that term is a constant and can be calculated outside of the loop which sets the dictionary values.
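For reference, here is the algebra reconstructed from the posted code (the original post's formula rendering is not reproduced in this text), written in the same pseudo-Python notation used elsewhere on this page:

f(i) = c2 / sum((prices[i] / prices[j]) ** c1 for j in ids)
     = c2 / (prices[i] ** c1 * sum(prices[j] ** -c1 for j in ids))
     = c2 * prices[i] ** -c1 / sum(prices[j] ** -c1 for j in ids)

With c1 = 0.25, the term sum(prices[j] ** -0.25 for j in ids) does not depend on i and is exactly the scaled_sum computed once in the optimized implementation.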
Note that above I refer to input_prices as prices and input_sellers_ID as ids.
Performance Profile (~80% speed improvement on my machine, size 10):
import time
import random

def compute_optimal_weights(ids, prices):
    scaled_sum = 0
    for i in ids:
        scaled_sum += prices[i] ** -0.25
    result = {}
    for i in ids:
        result[i] = 0.67 * (prices[i] ** -0.25) / scaled_sum
    return result

def compute_optimal_weights_old(input_sellers_ID, input_prices):
    input_weights_optimal = {}
    for i in input_sellers_ID:
        sum_price_ratio_scaled = 0
        for j in input_sellers_ID:
            price_ratio = input_prices[i] / input_prices[j]
            scaled_price_ratio = price_ratio ** c1
            sum_price_ratio_scaled += scaled_price_ratio
        input_weights_optimal[i] = c2 / sum_price_ratio_scaled
    return input_weights_optimal
c1 = 0.25
c2 = 0.67
input_sellers_ID = range(10)
input_prices = {i: random.uniform(0,1) for i in input_sellers_ID}
start = time.clock()
for _ in range(1000000):
    compute_optimal_weights_old(input_sellers_ID, input_prices) and None
old_time = time.clock() - start

start = time.clock()
for _ in range(1000000):
    compute_optimal_weights(input_sellers_ID, input_prices) and None
new_time = time.clock() - start
print('Old:', compute_optimal_weights_old(input_sellers_ID, input_prices))
print('New:', compute_optimal_weights(input_sellers_ID, input_prices))
print('New algorithm is {:.2%} faster.'.format(1 - new_time / old_time))
I believe we could speed up the function by factoring the loop. Let a = price, b = n and c = c1; if my maths are not wrong (e.g. (5/6)**3 == 5**3 / 6**3):
(5./6.)**2 + (5./4.)**2
==
5**2 / 6.**2 + 5**2 / 4.**2
==
5**2 * (1/6.**2 + 1/4.**2)
With variables:
sum( (a / b) ** c for each b)
==
sum( a**c * (1/b) ** c for each b)
==
a**c * sum((1./b)**c for each b)
The second factor (the sum) is constant across keys, so it can be taken out and computed once, which leaves:
Faster implementation - Raw Python
Using generators and dict-comprehension:
def compute_optimal_weights(input_prices):
    sconst = sum(1/w**c1 for w in input_prices.values())
    return {k: c2 / (v**c1 * sconst) for k, v in input_prices.items()}
NOTE: if you are using Python 2, replace .values() and .items() with .itervalues() and .iteritems() for an extra speedup (a few ms with large lists).
Even Faster - Numpy
Additionally, if you don't care that much about the dictionary and just want the values, you could speed it up using numpy (for large inputs >100):
def compute_optimal_weights_np(input_prices):
    data = np.asarray(list(input_prices.values())) ** c1  # list() keeps this working on Python 3 as well
    return c2 / (data * np.sum(1./data))
A few timings for different input sizes:
N = 10 inputs:
MINE: 100000 loops, best of 3: 6.02 µs per loop
NUMPY: 100000 loops, best of 3: 10.6 µs per loop
YOURS: 10000 loops, best of 3: 23.8 µs per loop
N = 100 inputs:
MINE: 10000 loops, best of 3: 49.1 µs per loop
NUMPY: 10000 loops, best of 3: 22.6 µs per loop
YOURS: 1000 loops, best of 3: 1.86 ms per loop
N = 1000 inputs:
MINE: 1000 loops, best of 3: 458 µs per loop
NUMPY: 10000 loops, best of 3: 121 µs per loop
YOURS: 10 loops, best of 3: 173 ms per loop
N = 100000 inputs:
MINE: 10 loops, best of 3: 54.2 ms per loop
NUMPY: 100 loops, best of 3: 11.1 ms per loop
YOURS: didn't finish in a couple of minutes
Both options here are considerably faster than the one presented in the question. The benefit of using numpy, provided you can supply consistent input (an array instead of a dictionary), becomes apparent as the size grows.
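As a hypothetical illustration of "consistent input" (my sketch; the array of prices is assumed, not from the question): when the prices already live in a NumPy array rather than a dictionary, the whole computation collapses to two array expressions with no dict handling at all.

import numpy as np

c1, c2 = 0.25, 0.67
prices = np.random.uniform(0.01, 1.0, size=100000)   # assumed array input
data = prices ** c1
weights = c2 / (data * np.sum(1.0 / data))            # same formula as compute_optimal_weights_np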
I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant autojitted Python code. Unfortunately, I wasn't happy when I tried a first, quick benchmark. It seems like numba is not doing much better than ordinary Python here, though I would have expected nearly C-like performance:
from numba import jit, autojit, uint, int_, double
import numpy as np
import imp
import logging
logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)
def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res

autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]),
                      locals=dict(i=uint))(sum_accum)
accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)
ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))
%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)
accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))
%timeit accumarray.accum(accmap, a)
This gives on my machine:
10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop
I'm running the latest numba version from pypi, 0.11.0. Any suggestions, how to fix the code, so it runs reasonably fast with numba?
I figured it out myself. numba wasn't able to determine the type of the result of np.max(accmap), even though the type of accmap was set to int. This somehow slowed down everything, but the fix is easy:
@autojit(locals=dict(reslen=uint))
def sum_accum(accmap, a):
    reslen = np.max(accmap) + 1
    res = np.zeros(reslen, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res
The result is quite impressive, about 2/3 of the C version:
10000 loops, best of 3: 192 us per loop
Update 2022:
The work on this issue led to the python package numpy_groupies, which is available here:
https://github.com/ml31415/numpy-groupies
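A rough usage sketch from memory (check the project README for the exact current API):

import numpy as np
import numpy_groupies as npg

accmap = np.repeat(np.arange(1000), 20)
a = np.random.randn(accmap.size)
res = npg.aggregate(accmap, a, func='sum')   # grouped sum, equivalent to sum_accum(accmap, a)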
@autojit
def numbaMax(arr):
    MAX = arr[0]
    for i in arr:
        if i > MAX:
            MAX = i
    return MAX

@autojit
def autonumba_sum_accum2(accmap, a):
    res = np.zeros(numbaMax(accmap) + 1)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
10 loops, best of 3: 26.5 ms per loop <- original
100 loops, best of 3: 15.1 ms per loop <- with numba but the slow numpy max
10000 loops, best of 3: 47.9 µs per loop <- with numbamax
I wish to compute a simple checksum: just adding the values of all bytes.
The quickest way I found is:
checksum = sum([ord(c) for c in buf])
But for a 13 MB data buf, it takes 4.4 s: too long (in C, it takes 0.5 s).
If I use :
checksum = zlib.adler32(buf) & 0xffffffff
it takes 0.8 s, but the result is not the one I want.
So my question is: is there any function, library, or C extension usable from Python 2.6 to compute a simple checksum?
Thanks in advance,
Eric.
You could use sum(bytearray(buf)):
In [1]: buf = b'a'*(13*(1<<20))
In [2]: %timeit sum(ord(c) for c in buf)
1 loops, best of 3: 1.25 s per loop
In [3]: %timeit sum(imap(ord, buf))
1 loops, best of 3: 564 ms per loop
In [4]: %timeit b=bytearray(buf); sum(b)
10 loops, best of 3: 101 ms per loop
Here's a C extension for Python written in Cython, sumbytes.pyx file:
from libc.limits cimport ULLONG_MAX, UCHAR_MAX

def sumbytes(bytes buf not None):
    cdef:
        unsigned long long total = 0
        unsigned char c
    if len(buf) > (ULLONG_MAX // <size_t>UCHAR_MAX):
        raise NotImplementedError  # todo: implement for > 8 PiB available memory
    for c in buf:
        total += c
    return total
sumbytes is ~10 times faster than the bytearray variant:
name time ratio
sumbytes_sumbytes 12 msec 1.00
sumbytes_numpy 29.6 msec 2.48
sumbytes_bytearray 122 msec 10.19
To reproduce the time measurements, download reporttime.py and run:
#!/usr/bin/env python
# compile on-the-fly
import pyximport; pyximport.install()  # pip install cython
import numpy as np
from reporttime import get_functions_with_prefix, measure
from sumbytes import sumbytes  # from sumbytes.pyx

def sumbytes_sumbytes(input):
    return sumbytes(input)

def sumbytes_bytearray(input):
    return sum(bytearray(input))

def sumbytes_numpy(input):
    return np.frombuffer(input, 'uint8').sum()  # @root's answer

def main():
    funcs = get_functions_with_prefix('sumbytes_')
    buf = ''.join(map(unichr, range(256))).encode('latin1') * (1 << 16)
    measure(funcs, args=[buf])

main()
Use numpy.frombuffer(buf, "uint8").sum(), it seems to be about 70 times faster than your example:
In [9]: import numpy as np
In [10]: buf = b'a'*(13*(1<<20))
In [11]: sum(bytearray(buf))
Out[11]: 1322254336
In [12]: %timeit sum(bytearray(buf))
1 loops, best of 3: 253 ms per loop
In [13]: np.frombuffer(buf, "uint8").sum()
Out[13]: 1322254336
In [14]: %timeit np.frombuffer(buf, "uint8").sum()
10 loops, best of 3: 36.7 ms per loop
In [15]: %timeit sum([ord(c) for c in buf])
1 loops, best of 3: 2.65 s per loop