I recently stumbled upon numba and thought about replacing some homemade C extensions with more elegant autojitted Python code. Unfortunately I wasn't happy when I tried a first quick benchmark. It seems like numba is not doing much better than ordinary Python here, though I would have expected nearly C-like performance:
from numba import jit, autojit, uint, int_, double
import numpy as np
import imp
import logging
logging.getLogger('numba.codegen.debug').setLevel(logging.INFO)

def sum_accum(accmap, a):
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res

autonumba_sum_accum = autojit(sum_accum)
numba_sum_accum = jit(double[:](int_[:], double[:]),
                      locals=dict(i=uint))(sum_accum)

accmap = np.repeat(np.arange(1000), 2)
np.random.shuffle(accmap)
accmap = np.repeat(accmap, 10)
a = np.random.randn(accmap.size)

ref = sum_accum(accmap, a)
assert np.all(ref == numba_sum_accum(accmap, a))
assert np.all(ref == autonumba_sum_accum(accmap, a))

%timeit sum_accum(accmap, a)
%timeit autonumba_sum_accum(accmap, a)
%timeit numba_sum_accum(accmap, a)

accumarray = imp.load_source('accumarray', '/path/to/accumarray.py')
assert np.all(ref == accumarray.accum(accmap, a))
%timeit accumarray.accum(accmap, a)
This gives on my machine:
10 loops, best of 3: 52 ms per loop
10 loops, best of 3: 42.2 ms per loop
10 loops, best of 3: 43.5 ms per loop
1000 loops, best of 3: 321 us per loop
I'm running the latest numba version from PyPI, 0.11.0. Any suggestions on how to fix the code so it runs reasonably fast with numba?
I figured it out myself. numba wasn't able to determine the type of the result of np.max(accmap), even though the type of accmap was set to int. This somehow slowed everything down, but the fix is easy:
@autojit(locals=dict(reslen=uint))
def sum_accum(accmap, a):
    reslen = np.max(accmap) + 1
    res = np.zeros(reslen, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res
The result is quite impressive, about 2/3 of the C version:
10000 loops, best of 3: 192 us per loop
Update 2022:
The work on this issue led to the Python package numpy_groupies, which is available here:
https://github.com/ml31415/numpy-groupies
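For anyone landing here now, that package wraps this exact pattern; a minimal usage sketch (assuming the package's current aggregate API):

import numpy as np
import numpy_groupies as npg

accmap = np.repeat(np.arange(1000), 20)
a = np.random.randn(accmap.size)

# Sums a[i] into bin accmap[i], like sum_accum above; aggregate picks the
# fastest available backend (numba if installed, otherwise plain numpy).
res = npg.aggregate(accmap, a, func='sum')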
@autojit
def numbaMax(arr):
    MAX = arr[0]
    for i in arr:
        if i > MAX:
            MAX = i
    return MAX

@autojit
def autonumba_sum_accum2(accmap, a):
    res = np.zeros(numbaMax(accmap) + 1)
    for i in xrange(len(accmap)):
        res[accmap[i]] += a[i]
    return res
10 loops, best of 3: 26.5 ms per loop <- original
100 loops, best of 3: 15.1 ms per loop <- with numba but the slow numpy max
10000 loops, best of 3: 47.9 µs per loop <- with numbamax
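For what it's worth, recent Numba releases support np.max directly in nopython mode, so on a current version neither workaround should be needed. A minimal modern sketch (the exact Numba version where this support landed is an assumption on my part):

import numpy as np
from numba import njit

@njit
def sum_accum(accmap, a):
    # np.max compiles in nopython mode in modern Numba, so no handwritten max
    res = np.zeros(np.max(accmap) + 1, dtype=a.dtype)
    for i in range(len(accmap)):
        res[accmap[i]] += a[i]
    return res

accmap = np.repeat(np.arange(1000), 20)
a = np.random.randn(accmap.size)
res = sum_accum(accmap, a)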
I recently converted a MATLAB script to Python with Numpy, and found that it ran significantly slower. I expected similar performance, so I'm wondering if I'm doing something wrong.
As a stripped-down example, I manually sum a geometric series:
MATLAB version:
function s = array_sum(a, array_size, iterations)
    s = zeros(array_size);
    for m = 1:iterations
        s = a + 0.5*s;
    end
end

% benchmark code
array_size = 500;
iterations = 500;
a = randn(array_size);
f = @() array_sum(a, array_size, iterations);
fprintf('run time: %.2f ms\n', timeit(f)*1e3);
Python/Numpy version:
import numpy as np
import timeit

def array_sum(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s = a + 0.5*s
    return s

array_size = 500
iterations = 500
a = np.random.randn(array_size, array_size)

timeit_iterations = 10
t1 = timeit.timeit(lambda: array_sum(a, array_size, iterations),
                   number=timeit_iterations)
print("run time: {:.2f} ms".format(1e3*t1/timeit_iterations))
On my machine, MATLAB completes in 58 ms. The Python version runs in 292 ms, or 5X slower.
I also tried speeding up the Python code by adding the Numba JIT decorator @jit('f8[:,:](i8, i8)', nopython=True), but the time only dropped to 236 ms (4X slower).
This is slower than I expected. Am I using timeit improperly? Is there something wrong with my Python code?
EDIT: edited so that the random matrix is created outside of benchmarked function.
EDIT 2: I ran the benchmark using Torch instead of Numpy (calculating the sum as s = torch.add(s, 0.5, a)) and it runs in just 52 ms on my computer!
From my experience, when using numba's jit function it's usually faster to expand array operations into explicit loops. So I tried to rewrite your Python function as:
from numba import jit

@jit(nopython=True, cache=True)
def array_sum_numba(a, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        for i in range(array_size):
            for j in range(array_size):
                s[i,j] = a[i,j] + 0.5 * s[i,j]
    return s
And out of curiosity, I've also tested @percusse's version with a little modification of the parameter:
def array_sum2(r, array_size, iterations):
    s = np.zeros((array_size, array_size))
    for m in range(iterations):
        s /= 2
        s += r
    return s
The testing results on my machine are:
original version run time: 143.83 ms
numba jitted loop version run time: 26.99 ms
@percusse's version run time: 61.38 ms
This result is within my expectation. It's worth mentioning that I've increased the timeit iterations to 50, which results in a significant time reduction for the numba version.
In summary: the Python code can still be significantly accelerated if you use numba's jit and write the function with explicit loops. I don't have Matlab on my machine to test, but my guess is that with numba the Python version is faster.
Since you are updating the same variable, it is suitable for in-place operations. You can update your function as:
def array_sum2(array_size, iterations):
    s = np.zeros((array_size, array_size))
    r = np.random.randn(array_size, array_size)
    for m in range(iterations):
        s /= 2
        s += r
    return s
This has given the following speed benefit on my machine compared to array_sum
run time: 157.32 ms
run time2: 672.43 ms
Times include the randn call as well as the summation:
In [68]: timeit array_sum(array_size, 0)
16.6 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [69]: timeit array_sum(array_size, 1)
18.9 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [70]: timeit array_sum(array_size, 20)
55.5 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [71]: (55-16)/20
Out[71]: 1.95
So it's 16ms for the setup, and 2ms per iteration. Same pattern with 500 iterations.
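A small sketch of the same decomposition, fitting a line to timings at several iteration counts (this assumes the array_sum variant timed above, which creates the random matrix inside the function; the intercept is the setup cost, the slope the per-iteration cost):

import timeit
import numpy as np

def array_sum(array_size, iterations):
    s = np.zeros((array_size, array_size))
    a = np.random.randn(array_size, array_size)
    for m in range(iterations):
        s = a + 0.5*s
    return s

counts = [0, 1, 5, 10, 20]
times = [timeit.timeit(lambda n=n: array_sum(500, n), number=10) / 10
         for n in counts]
per_iter, setup = np.polyfit(counts, times, 1)  # slope, intercept
print("setup ~%.1f ms, ~%.2f ms per iteration" % (setup * 1e3, per_iter * 1e3))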
MATLAB does some JIT compilation; I don't know if that's a factor here, and I don't have MATLAB to test. In Octave (which has no timeit):
>> t = time(); array_sum(500,0); (time()-t)*1000
ans = 13.704
>> t = time(); array_sum(500,1); (time()-t)*1000
ans = 16.219
>> t = time(); array_sum(500,20); (time()-t)*1000
ans = 82.346
>> t = time(); array_sum(500,500); (time()-t)*1000
ans = 1610.6
Octave's random is faster, but the per-iteration sum is slower: roughly (1610.6 - 13.7)/500 ≈ 3.2 ms per iteration, versus about 2 ms with numpy.
Let's say you have one key in dictionary A vs. 1 billion keys in dictionary B.
Algorithmically, a lookup op is O(1).
However, does the actual time (program execution time) to look up a key differ based on the size of the dict?
onekey_stime = time.time()
print one_key_dict.get('firstkey')
onekey_dur = time.time() - onekey_stime

manykeys_stime = time.time()
print manykeys_dict.get('randomkey')
manykeys_dur = time.time() - manykeys_stime
Would I see any time difference between onekey_dur and manykeys_dur?
Pretty much identical in a test with a small and large dict:
In [31]: random_key = lambda: ''.join(np.random.choice(list(string.ascii_letters), 20))
In [32]: few_keys = {random_key(): np.random.random() for _ in xrange(100)}
In [33]: many_keys = {random_key(): np.random.random() for _ in xrange(1000000)}
In [34]: few_lookups = np.random.choice(few_keys.keys(), 50)
In [35]: many_lookups = np.random.choice(many_keys.keys(), 50)
In [36]: %timeit [few_keys[k] for k in few_lookups]
100000 loops, best of 3: 6.25 µs per loop
In [37]: %timeit [many_keys[k] for k in many_lookups]
100000 loops, best of 3: 7.01 µs per loop
EDIT: For you, @ShadowRanger -- missed lookups are pretty close too:
In [38]: %timeit [few_keys.get(k) for k in many_lookups]
100000 loops, best of 3: 7.99 µs per loop
In [39]: %timeit [many_keys.get(k) for k in few_lookups]
100000 loops, best of 3: 8.78 µs per loop
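The same kind of measurement as a standalone script (Python 3 here, with hypothetical dicts standing in for one_key_dict and manykeys_dict), using the timeit module so that per-call noise averages out over many repetitions:

import random
import string
import timeit

def random_key():
    return ''.join(random.choice(string.ascii_letters) for _ in range(20))

few_keys = {random_key(): random.random() for _ in range(100)}
many_keys = {random_key(): random.random() for _ in range(10**6)}

probe_few = random.choice(list(few_keys))
probe_many = random.choice(list(many_keys))

# Total seconds for a million lookups in each dict; expect near-identical numbers.
print(timeit.timeit(lambda: few_keys[probe_few], number=10**6))
print(timeit.timeit(lambda: many_keys[probe_many], number=10**6))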
I have the following snippet that extracts the indices of all unique values (hashable) in sequence-like data with canonical indices and stores them in a dictionary as lists:
from collections import defaultdict

idx_lists = defaultdict(list)
for idx, ele in enumerate(data):
    idx_lists[ele].append(idx)
This looks to me like a quite common use case, and it happens that 90% of the execution time of my code is spent in these few lines. This part is passed through over 10000 times during execution, and len(data) is around 50000 to 100000 each time. The number of unique elements ranges roughly from 50 to 150.
Is there a faster way, perhaps vectorized/c-extended (e.g. numpy or pandas methods), that achieves the same thing?
Many many thanks.
Not as impressive as I hoped for originally (there's still a fair bit of pure Python in the groupby code path), but you might be able to cut the time down by a factor of 2-4, depending on how much you care about the exact final types involved:
import numpy as np, pandas as pd
from collections import defaultdict

def by_dd(data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(data):
    return {k: v.tolist() for k, v in data.groupby(data.values).indices.items()}

def by_pand2(data):
    return data.groupby(data.values).indices

data = pd.Series(np.random.randint(0, 100, size=10**5))
gives me
>>> %timeit by_dd(data)
10 loops, best of 3: 42.9 ms per loop
>>> %timeit by_pand1(data)
100 loops, best of 3: 18.2 ms per loop
>>> %timeit by_pand2(data)
100 loops, best of 3: 11.5 ms per loop
Though it's not the perfect solution (it's O(N log N) instead of O(N)), a much faster, vectorized way to do it is:
def data_to_idxlists(data):
    sorting_ixs = np.argsort(data)
    uniques, unique_indices = np.unique(data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices,
                                      list(unique_indices[1:]) + [None])}
Another solution that is O(N*U) (where U is the number of unique groups):

def data_to_idxlists(data):
    u, ixs = np.unique(data, return_inverse=True)
    # np.nonzero returns a tuple of arrays, so take [0] to get the index array
    return {u: np.nonzero(ixs == i)[0] for i, u in enumerate(u)}
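A quick sanity check (my addition, not from the original answer) that the vectorized grouping reproduces the defaultdict result on a tiny example:

import numpy as np
from collections import defaultdict

data = np.array([3, 1, 3, 2, 1, 3])

ref = defaultdict(list)
for idx, ele in enumerate(data):
    ref[ele].append(idx)

u, ixs = np.unique(data, return_inverse=True)
grouped = {val: np.nonzero(ixs == i)[0] for i, val in enumerate(u)}

# Both should map each value to the positions where it occurs.
assert all(grouped[k].tolist() == v for k, v in ref.items())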
I found this question to be pretty interesting, and while I wasn't able to get a large improvement over the other proposed methods, I did find a pure numpy method that was slightly faster.
import numpy as np
import pandas as pd
from collections import defaultdict

data = np.random.randint(0, 10**2, size=10**5)
series = pd.Series(data)

def get_values_and_indicies(input_data):
    input_data = np.asarray(input_data)
    sorted_indices = input_data.argsort()  # Get the sorted indices
    # Get the sorted data so we can see where the values change
    sorted_data = input_data[sorted_indices]
    # Find the locations where the values change and include the first and last values
    run_endpoints = np.concatenate(([0], np.where(sorted_data[1:] != sorted_data[:-1])[0] + 1, [len(input_data)]))
    # Get the unique values themselves
    unique_vals = sorted_data[run_endpoints[:-1]]
    num_values = len(unique_vals)  # number of unique values
    # Return the unique values along with the indices associated with that value
    return {unique_vals[i]: sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()
            for i in range(num_values)}
def by_dd(input_data):
    idx_lists = defaultdict(list)
    for idx, ele in enumerate(input_data):
        idx_lists[ele].append(idx)
    return idx_lists

def by_pand1(input_data):
    return {k: v.tolist() for k, v in series.groupby(input_data).indices.items()}

def by_pand2(input_data):
    return series.groupby(input_data).indices

def data_to_idxlists(input_data):
    u, ixs = np.unique(input_data, return_inverse=True)
    return {u: np.nonzero(ixs == i)[0] for i, u in enumerate(u)}

def data_to_idxlists_unique(input_data):
    sorting_ixs = np.argsort(input_data)
    uniques, unique_indices = np.unique(input_data[sorting_ixs], return_index=True)
    return {u: sorting_ixs[start:stop]
            for u, start, stop in zip(uniques, unique_indices,
                                      list(unique_indices[1:]) + [None])}
The resulting timings were (from fastest to slowest):
>>> %timeit get_values_and_indicies(data)
100 loops, best of 3: 4.25 ms per loop
>>> %timeit by_pand2(series)
100 loops, best of 3: 5.22 ms per loop
>>> %timeit data_to_idxlists_unique(data)
100 loops, best of 3: 6.23 ms per loop
>>> %timeit by_pand1(series)
100 loops, best of 3: 10.2 ms per loop
>>> %timeit data_to_idxlists(data)
100 loops, best of 3: 15.5 ms per loop
>>> %timeit by_dd(data)
10 loops, best of 3: 21.4 ms per loop
and it should be noted that, unlike by_pand2, it returns a dict of lists as given in the example. If you would prefer to return a defaultdict, you can simply change the last line to return defaultdict(list, ((unique_vals[i], sorted_indices[run_endpoints[i]:run_endpoints[i + 1]].tolist()) for i in range(num_values))), which increased the overall timing in my tests to 4.4 ms.
Lastly, I should note that these timings are data sensitive. When I used only 10 different values I got:
get_values_and_indicies: 4.34 ms per loop
data_to_idxlists_unique: 4.42 ms per loop
by_pand2: 4.83 ms per loop
data_to_idxlists: 6.09 ms per loop
by_pand1: 9.39 ms per loop
by_dd: 22.4 ms per loop
while if I used 10,000 different values I got:
get_values_and_indicies: 7.00 ms per loop
data_to_idxlists_unique: 14.8 ms per loop
by_dd: 29.8 ms per loop
by_pand2: 47.7 ms per loop
by_pand1: 67.3 ms per loop
data_to_idxlists: 869 ms per loop
I've been developing a Fresnel-coefficient-based reflectivity solver in Python, and I've hit a bit of a roadblock: the performance of Python + Numpy is 2X slower than Matlab. I've distilled the problem code into a simple example to show the operation being performed in each case:
Python Code for test case:
import numpy as np
import time

def compare_fn(i):
    a = np.random.rand(400)
    vec = np.random.rand(400)
    t = time.time()
    for j in xrange(i):
        a = (2.3 + a * np.exp(2j*vec))/(1 + (2.3 * a * np.exp(2j*vec)))
    print (time.time()-t)
    return a

a = compare_fn(200000)
Output: 10.7989997864
Equivalent Matlab code:
function a = compare_fn(i)
    a = rand(1, 400);
    vec = rand(1, 400);
    tic
    for m = 1:i
        a = (2.3 + a .* exp(2j*vec))./(1 + (2.3 * a .* exp(2j*vec)));
    end
    toc
end

a = compare_fn(200000);
Elapsed time is 5.644673 seconds.
I'm stumped by this. I already have MKL installed (Anaconda Academic License). I would greatly appreciate any help in identifying the issue with my example if any and how I can achieve equivalent if not better performance using Numpy.
In general, I cannot parallelize the loop as solving the Fresnel coefficients for a multilayer involves a recursive calculation which can be expressed in the form of a loop as above.
The following is similar to unutbu's deleted answer, and for your sample input it runs 3x faster on my system. It would probably also run faster if you implemented it like this in Matlab, but that's a different story. To be able to use IPython's %timeit functionality, I have rewritten your original function as:
def fn(a, vec, i):
    for j in xrange(i):
        a = (2.3 + a * np.exp(2j*vec))/(1 + (2.3 * a * np.exp(2j*vec)))
    return a
And I have optimized it by removing the exponential calculation from the loop:
def fn_bis(a, vec, n):
    exp_vec = np.exp(2j*vec)
    for j in xrange(n):
        a = (2.3 + a * exp_vec) / (1 + 2.3 * a * exp_vec)
    return a
Taking both approaches for a test ride:
In [2]: a = np.random.rand(400)
In [3]: vec = np.random.rand(400)
In [9]: np.allclose(fn(a, vec, 100), fn_bis(a, vec, 100))
Out[9]: True
In [10]: %timeit fn(a, vec, 100)
100 loops, best of 3: 8.43 ms per loop
In [11]: %timeit fn_bis(a, vec, 100)
100 loops, best of 3: 2.57 ms per loop
In [12]: %timeit fn(a, vec, 200000)
1 loops, best of 3: 16.9 s per loop
In [13]: %timeit fn_bis(a, vec, 200000)
1 loops, best of 3: 5.25 s per loop
I've been doing a lot of experimenting to try and determine the source of the speed difference between Matlab and Python/Numpy for the example in my original question. Some of the key findings have been:
Matlab now has a JIT compiler that provides significant benefit in situations involving loops. Turning it off reduces performance by 2X, making it similar in speed to the native Python + Numpy code.
feature accel off
a = compare_fn(200000);
Elapsed time is 9.098062 seconds.
I then began exploring options for optimizing my example function using Numba and Cython to see how much better I could do. The one significant finding for me was that Numba JIT optimization of an explicitly looped calculation was faster than native vectorized math operations on Numpy arrays. I don't quite understand why this is the case, though most likely the vectorized version allocates several temporary arrays and makes multiple passes over memory on every iteration, while the fused loop does the whole update in one pass. I have included my sample code and timings for the tests below. I also played with Cython (I'm no expert) and although it was also quicker, Numba was still 2X faster than Cython, so I ended up sticking with Numba for the tests.
Here is the code for 3 equivalent functions. The first is a Numba-optimized function with an explicit loop to perform elementwise calculations. The second is a Python+Numpy function relying on Numpy vectorization. The third uses Numba on the vectorized Numpy code (and fails to improve it, as you can see in the results). Lastly, I've included the Cython code, though I only tested it for one case.
import numpy as np
import numba as nb

# int64 arguments: the benchmark loop counts below go up to 10**6
@nb.jit(nb.complex128[:](nb.int64, nb.int64))
def compare_fn_jit(i, j):
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.random.rand(j)
    exp_term = np.exp(2j*vec)
    for k in xrange(i):
        for l in xrange(j):
            a[l] = (2.3 + a[l] * exp_term[l])/(1 + (2.3 * a[l] * exp_term[l]))
    return a

def compare_fn(i, j):
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.random.rand(j)
    exp_term = np.exp(2j*vec)
    for k in xrange(i):
        a = (2.3 + a * exp_term)/(1 + (2.3 * a * exp_term))
    return a

compare_fn_jit2 = nb.jit(nb.complex128[:](nb.int64, nb.int64))(compare_fn)
And the Cython version (a .pyx file, compiled separately):

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def compare_fn_cython(int i, int j):
    cdef int k, l
    cdef np.ndarray[np.complex128_t, ndim=1] a, vec, exp_term
    a = np.asarray(np.random.rand(j), dtype=np.complex128)
    vec = np.asarray(np.random.rand(j), dtype=np.complex128)
    exp_term = np.exp(2j*vec)
    for k in xrange(i):
        for l in xrange(j):
            a[l] = (2.3 + a[l] * exp_term[l])/(1 + (2.3 * a[l] * exp_term[l]))
    return a
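One possible way to build and load the .pyx on the fly (the module filename compare_fn_cython.pyx is assumed here, and pyximport needs numpy's headers because of the cimport numpy; this matches the Python 2 era of the rest of the post):

import numpy as np
import pyximport
# Point the on-the-fly build at numpy's include directory.
pyximport.install(setup_args={"include_dirs": np.get_include()})
from compare_fn_cython import compare_fn_cython

a = compare_fn_cython(100, 400)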
Timing Results:
i. Timing for a single outer loop - Demonstrates efficiency of vectorized calculations
%timeit -n 1 -r 10 compare_fn_jit(1,1000000)
1 loops, best of 10: 352 ms per loop
%timeit -n 1 -r 10 compare_fn(1,1000000)
1 loops, best of 10: 498 ms per loop
%timeit -n 1 -r 10 compare_fn_jit2(1,1000000)
1 loops, best of 10: 497 ms per loop
%timeit -n 1 -r 10 compare_fn_cython(1,1000000)
1 loops, best of 10: 424 ms per loop
ii. Timing the extreme case of many loop iterations with calculations on short arrays (expect Numpy+Python to perform poorly)
%timeit -n 1 -r 5 compare_fn_jit(1000000,40)
1 loops, best of 5: 1.44 s per loop
%timeit -n 1 -r 5 compare_fn(1000000,40)
1 loops, best of 5: 28.2 s per loop
%timeit -n 1 -r 5 compare_fn_jit2(1000000,40)
1 loops, best of 5: 29 s per loop
iii. Test for a case midway between the two above
%timeit -n 1 -r 5 compare_fn_jit(100000,400)
1 loops, best of 5: 1.4 s per loop
%timeit -n 1 -r 5 compare_fn(100000,400)
1 loops, best of 5: 5.26 s per loop
%timeit -n 1 -r 5 compare_fn_jit2(100000,400)
1 loops, best of 5: 5.34 s per loop
As you can see, using Numba can improve efficiency by a factor ranging from 1.5X to 30X for this particular case. I am truly impressed with how efficient it is and how easy it is to use and implement when compared against Cython.
I don't know if numpypy is far enough along yet for what you're doing, but you might try it.
http://buildbot.pypy.org/numpy-status/latest.html
I wish to compute a simple checksum: just adding the values of all bytes.
The quickest way I found is:
checksum = sum([ord(c) for c in buf])
But for a 13 MB buf, it takes 4.4 s: too long (in C, it takes 0.5 s).
If I use:
checksum = zlib.adler32(buf) & 0xffffffff
it takes 0.8 s, but the result is not the one I want.
So my question is: is there any function, lib, or C extension usable from Python 2.6 to compute a simple checksum?
Thanks in advance,
Eric.
You could use sum(bytearray(buf)) (imap below comes from itertools):
In [1]: buf = b'a'*(13*(1<<20))
In [2]: %timeit sum(ord(c) for c in buf)
1 loops, best of 3: 1.25 s per loop
In [3]: %timeit sum(imap(ord, buf))
1 loops, best of 3: 564 ms per loop
In [4]: %timeit b=bytearray(buf); sum(b)
10 loops, best of 3: 101 ms per loop
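Worth noting for later readers: on Python 3, iterating bytes yields ints directly, so the bytearray step isn't even needed:

buf = b'a' * (13 * (1 << 20))
checksum = sum(buf)  # Python 3 only: bytes iterate as ints, same result as sum(bytearray(buf))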
Here's a C extension for Python written in Cython (file sumbytes.pyx):
from libc.limits cimport ULLONG_MAX, UCHAR_MAX

def sumbytes(bytes buf not None):
    cdef:
        unsigned long long total = 0
        unsigned char c
    if len(buf) > (ULLONG_MAX // <size_t>UCHAR_MAX):
        raise NotImplementedError  # todo: implement for > 8 PiB available memory
    for c in buf:
        total += c
    return total
sumbytes is ~10 times faster than the bytearray variant:
name time ratio
sumbytes_sumbytes 12 msec 1.00
sumbytes_numpy 29.6 msec 2.48
sumbytes_bytearray 122 msec 10.19
To reproduce the time measurements, download reporttime.py and run:
#!/usr/bin/env python
# compile on-the-fly
import pyximport; pyximport.install()  # pip install cython
import numpy as np
from reporttime import get_functions_with_prefix, measure
from sumbytes import sumbytes  # from sumbytes.pyx

def sumbytes_sumbytes(input):
    return sumbytes(input)

def sumbytes_bytearray(input):
    return sum(bytearray(input))

def sumbytes_numpy(input):
    return np.frombuffer(input, 'uint8').sum()  # @root's answer

def main():
    funcs = get_functions_with_prefix('sumbytes_')
    buf = ''.join(map(unichr, range(256))).encode('latin1') * (1 << 16)
    measure(funcs, args=[buf])

main()
Use numpy.frombuffer(buf, "uint8").sum(); it seems to be about 70 times faster than your example:
In [9]: import numpy as np
In [10]: buf = b'a'*(13*(1<<20))
In [11]: sum(bytearray(buf))
Out[11]: 1322254336
In [12]: %timeit sum(bytearray(buf))
1 loops, best of 3: 253 ms per loop
In [13]: np.frombuffer(buf, "uint8").sum()
Out[13]: 1322254336
In [14]: %timeit np.frombuffer(buf, "uint8").sum()
10 loops, best of 3: 36.7 ms per loop
In [15]: %timeit sum([ord(c) for c in buf])
1 loops, best of 3: 2.65 s per loop