Numba: Why is guvectorize so slow?

I am learning numba to try to optimize my code.
First, I calculated a*exp(b*x), where a and b are two parameters and x is a large numpy array. I used the @jit decorator. The output is a 1D array with len(x) elements.
This is the code:
import timeit
import numpy as np
from numba import jit, prange

def func_np(par, x):
    return par[0] * np.exp(par[1] * x)

@jit('float64[:](float64[:], float64[:])', nopython=True, parallel=True)
def func_parallel(par, x):
    length = len(x)
    result = np.empty(length, dtype=np.float64)
    for i in prange(length):
        result[i] = par[0] * np.exp(par[1] * x[i])
    return result
x=np.array(np.arange(0,100,0.0001))
par=np.array([10,0.1])
print("numpy only")
%timeit func_np(par,x)
print("")
print("numba")
%timeit func_parallel(par,x)
The output is:
numpy only
28.1 ms ± 27.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numba
2.05 ms ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, everything worked well: I get a large speedup with the numba optimization. (The machine where this code runs has 48 cores.)
Then, I decided to use an array of a and b parameters (par=np.array([[i*10,i*0.1] for i in range(10)])); x is unchanged.
This is where my problems began. Since the output is now a 2D array of shape (10, len(x)), it seems that I cannot use the @jit decorator and must use the @guvectorize decorator instead. I wrote this code (very similar to the one above), which works:
import timeit
import numpy as np
from numba import guvectorize, prange

def func_np(par, x):
    return par[:, 0:1] * np.exp(par[:, 1:2] * x)

@guvectorize(['float64[:,:], float64[:,:], float64[:]'], '(m,n),(m,p),(n)', nopython=True, target='parallel')
def func_parallel(result, par, x):
    length = len(x)
    for i in prange(length):
        result[:, i] = (par[:, 0:1] * np.exp(par[:, 1:2] * x[i])).T
x=np.array(np.arange(0,100,0.0001))
par=np.array([[i*10,i*0.1] for i in range(10)])
print("numpy only")
%timeit func_np(par,x)
print("")
print("numba")
result = np.empty([par.shape[0],x.shape[0]], dtype=np.float64)
%timeit func_parallel(result, par,x);
But the benchmark results are not good: the numba-optimized code is slower than the code written with numpy functions only.
numpy only
358 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
numba
563 ms ± 2.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As this is the first time I have used numba, I suppose there is something I have not understood. Any help would be appreciated.

I don't know exactly what guvectorize does to slow you down (most likely it creates extra, unnecessary threads), but it's simple to write a better-defined implementation.
Consider the following implementation, which is faster than the numpy one:
@jit('float64[:,:](float64[:,:], float64[:])', nopython=True, parallel=True)
def func_parallel(par, x):
    result = np.zeros((par.shape[0], len(x)), dtype=np.float64)
    length = len(x)
    for i in prange(length):
        for j in range(par.shape[0]):
            result[j, i] = par[j, 0] * np.exp(par[j, 1] * x[i])
    return result
It runs about 3x faster than the numpy version on my 2C/4T machine, so it should be much faster on yours.
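A minimal sketch of how this could be timed against the numpy version, using the same inputs as in the question (since the signature is given explicitly, numba compiles the function when the decorator runs, so compilation is not part of the timing):
x = np.arange(0, 100, 0.0001)
par = np.array([[i * 10, i * 0.1] for i in range(10)], dtype=np.float64)  # float64 to match the signature

print("numpy only")
%timeit func_np(par, x)
print("numba")
%timeit func_parallel(par, x)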

Related

Why is the function imported from a module significantly slower than a local function in Python?

I'm trying to calculate the standard deviation of a large set of data. At first I used the statistics module and imported the function:
from statistics import pstdev
But it is very slow, so I decided to write a local helper function that does exactly the same thing:
def get_std_dev(ls):
    n = len(ls)
    mean = sum(ls) / n
    var = sum((x - mean)**2 for x in ls) / n
    std_dev = var ** 0.5
    return std_dev
This runs significantly faster! Here is the runtime comparison:
Runtime with my written function: 0:00:00.532228
Runtime with imported module function: 0:00:17.605583
I am very confused about why the imported function is so slow compared to my locally written one. Does it have to do with memory location?
The only difference between the two versions is these two lines:
stdev = get_std_dev(close_price_list) # my written one
stdev = pstdev(close_price_list) # the imported function
This is not an answer
This is just to explain some of my comments more clearly. With x = np.random.random(10_000_000) I get these timings:
In [3]: %timeit statistics.pstdev(x)
39.4 s ± 428 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit get_std_dev(x)
4.53 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit x.std(ddof=0)
52.6 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
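For the asker's use case, a minimal sketch of that last (fastest) variant, assuming close_price_list is a plain list of floats as in the question:
import numpy as np

# convert once to an array, then use the vectorized standard deviation;
# ddof=0 gives the population standard deviation, which is what pstdev computes
close_prices = np.asarray(close_price_list, dtype=np.float64)
stdev = close_prices.std(ddof=0)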

Numba in nopython mode is much slower than pure Python (no print statements or unsupported numpy functions)

I have recently discovered that Numba may run much slower than pure Python, even in nopython mode with the parallel=True option enabled.
Important: if you don't deal with Voronoi diagrams, please keep reading; my question doesn't relate to them directly.
Currently, I am working on a problem where I have energies associated with the edges and cell areas of a Voronoi diagram. The scipy Voronoi returns an array containing the pairs of points (vor.ridge_points) associated with each Voronoi edge. For my code, I want to be able to get the index of an edge from the indices of its two associated points, so I define a kind of adjacency matrix, but instead of ones and zeros it contains zeros and the indexes of the edges.
It turns out that pure Python looping over numpy arrays is about 10 times faster than numba. Here is a toy example (I just randomly generated arrays with the same number of edges and points as in my simulation).
My guess is that it has something to do with memory allocation. Any take on the subject would be appreciated (either the reason why it is so much slower, or a better way to get the edge index from the point indices) :)
# %%
from numba.np.ufunc import parallel
import numpy as np
from numba import njit
from numba import prange

# %% generating an array that models the array of ridges
points_number = 8802
ridges_number = 26379
np.random.seed(123)
ridge_points = np.random.randint(points_number, size=(ridges_number, 2))

# %% symmetric matrix containing the indexes of all edges
# in space [original_point_1, original_point_2]
ridge_points = np.array(ridge_points, dtype=np.int32)

@njit(parallel=True, cache=True)
def jit_edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix

e_matrix_op = jit_edges_matrix_op(ridge_points, ridges_number)

# %% the same but not jitted
def edges_matrix_op(r_p, r_n):
    matrix = np.zeros((r_n, r_n), dtype=np.int32)
    for i in range(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix

e_matrix_op = edges_matrix_op(ridge_points, ridges_number)

# %%
%%timeit
jit_edges_matrix_op(ridge_points, ridges_number)

# %%
%%timeit
edges_matrix_op(ridge_points, ridges_number)
UPDATE
Indeed, parallelization is not working properly here, so I ran the tests with parallel=False as well. Here are the results:
630 ms ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=True
553 ms ± 4.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) - parallel=False
66.5 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) - pure python
UPDATE 2
Thanks to max9111 for sharing a link: https://github.com/numba/numba/issues/7259
There seems to be an issue with allocating large arrays of zeros (np.zeros). The issue was reported a couple of weeks ago, and the link contains some workaround examples.
I tried allocating with np.empty() instead:
29.7 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=True
44.7 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- numba parallel=False
60.4 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)- pure python
And as you can see, parallelized numba works best, so this task is parallelizable and the overhead is not that big.
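A minimal sketch of one possible form of this workaround (the function name is just for illustration): allocate with np.empty and zero the rows inside the parallel loop, so the initialization is parallelized as well and the result still matches the np.zeros version:
@njit(parallel=True, cache=True)
def jit_edges_matrix_empty(r_p, r_n):
    # allocate without the serial zero-fill, then clear the rows in parallel
    matrix = np.empty((r_n, r_n), dtype=np.int32)
    for i in prange(r_n):
        matrix[i, :] = 0
    # fill in the edge indexes as before
    for i in prange(r_n):
        e1 = r_p[i, 0]
        e2 = r_p[i, 1]
        matrix[e1, e2] = i
        matrix[e2, e1] = i
    return matrix

e_matrix_empty = jit_edges_matrix_empty(ridge_points, ridges_number)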
I think the crucial issue relates to the nature of the calculations you are performing in this part of the code:
for i in range(ridges_number):
    e1 = r_p[i, 0]
    e2 = r_p[i, 1]
    matrix[e1, e2] = i
    matrix[e2, e1] = i
Loop calculations perform best when they are cache-local and trivially parallelizable (i.e. the calculations in each iteration are independent).
In your case, both conditions are violated. e1 and e2 do not take consecutive values across the iterations. Similarly, the shared matrix likely prevents efficient parallelization, because it needs to be accessed by all of the threads in every iteration, and it is probably locked by one thread while being accessed by the others.
All in all, the function you chose to speed up may suffer the overhead of parallelization while the calculations are in effect executed sequentially. And the calculations, at least as they are at the moment, are inherently difficult to speed up by parallelization.

Is there a faster way to return multiple values when looping over a function?

positions = []; velocities = []
for _ in range(1000):
    position, velocity = generateRandomVectors()
    positions.append(position)
    velocities.append(velocity)
Can this be done faster?
For example, appending values at a different stage, or using a different kind of loop?
Current times:
n=100
times = timeit.repeat(lambda: test1(), number=n, repeat=10, timer=time.process_time)
print(min(times)/n)
--- 0.04439204000000018 ---
Fast, but I have a lot of vectors to generate.
You should use vectorization as much as possible
import numpy as np

def generateRandomVectors():
    return np.random.randn(3), np.random.randn(3)

positions = []; velocities = []
for _ in range(1000):
    position, velocity = generateRandomVectors()
    positions.append(position)
    velocities.append(velocity)
2.19 ms ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If, instead of generating one vector per call, I generate all 1000 in one call, it runs about 15x faster here:
def generateRandomVectors(n):
    return np.random.randn(n, 3), np.random.randn(n, 3)

positions, velocities = generateRandomVectors(1000)
145 µs ± 533 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
One difference is that positions and velocities are now arrays, not lists. That's good because you save memory and can do batched operations; it's great as long as you don't try to append more elements afterwards (if you do, see the sketch below).
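If more vectors are needed later, a small sketch of one way to grow the arrays without going back to per-row appends (using the batched generateRandomVectors above):
# generate another batch and concatenate once instead of appending row by row
more_positions, more_velocities = generateRandomVectors(500)
positions = np.concatenate([positions, more_positions])
velocities = np.concatenate([velocities, more_velocities])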
You can try using the split function from the numpy module:
from numpy import split, array
position, velocity = split(array(tuple(generateRandomVectors() for _ in range(1000))), 2, 1)
If you don't have the numpy module, you can install it via the command prompt command
pip install numpy

numpy elementwise outer product with sparse matrices

I want to do the element-wise outer product of three (or four) large 2D arrays in python (values are float32 rounded to 2 decimals). They all have the same number of rows "n", but different numbers of columns "i", "j", "k".
The resulting array should be of shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).
np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)
np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480)
Thanks to ruankesi and divakar, I got a piece of code that works:
# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])
# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])
# Get the result: sum columns and get a 1D array of length 10*28*66 = 18 480
result = np.sum(abc_fills, axis = 0)
Problem 1: Performance
This takes about 3 seconds, but I have to repeat this operation many times and some of the matrices are even larger (in number of rows). It is acceptable but making it faster would be nice.
Problem 2: My matrices are sparse
Indeed, for instance, "a" contains 70% zeros. I tried to play with scipy's csc_matrix, but really could not get a working version. (To get the element-wise outer product here I go via a conversion to a 3D array, which is not supported by scipy's sparse matrices.)
Problem 3: memory usage
If I try to also work with a 4th matrix, I run into memory issues.
I imagine that converting this code to sparse_matrix would save a lot of memory, and make the calculation faster by ignoring the numerous 0 values.
Is that true? If yes, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I've been stuck on this part of the code for weeks; I am going nuts!
Thank you!
Edit after Divakar's answer
Approach #1:
Very nice one liner but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize = True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop
Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop
About numexpr: I try to avoid requirements on external modules as much as possible, and I don't plan to use multiple cores/threads. This is an "embarrassingly" parallelizable task with hundreds of thousands of objects to analyze, so I'll just spread the list across the available CPUs during production. I will give it a try for memory optimization, though.
As a brief try of numexpr restricted to 1 thread, performing one multiplication, I get a runtime of 40 ms without numexpr and 52 ms with numexpr.
Thanks again!!
Approach #1
We can use np.einsum to do sum-reductions in one go -
result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
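For example, with the question's arrays (same call, only the flag added):
result = np.einsum('ij,ik,il->jkl', a, b, c, optimize=True).ravel()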
Approach #2
We can use broadcasting to do the first step, as also mentioned in the posted code, and then leverage tensor-matrix multiplication with np.tensordot -
def broadcast_dot(a, b, c):
    first_multi = a[..., None] * b[:, None]
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()
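Usage mirrors the decomposed code in the question (assuming a, b, c have the shapes listed there):
result = broadcast_dot(a, b, c)  # 1D array of length a.shape[1] * b.shape[1] * c.shape[1]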
We can also use the numexpr module, which supports multi-core processing and achieves better memory efficiency, to compute first_multi. This gives us a modified solution, like so -
import numexpr as ne

def numexpr_broadcast_dot(a, b, c):
    first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})
    return np.tensordot(first_multi, c, axes=(0, 0)).ravel()
Timings on random float data with given dataset sizes -
In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit broadcast_dot(a,b,c)
270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: %timeit numexpr_broadcast_dot(a,b,c)
172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Just to give a sense of improvement with numexpr -
In [7]: %timeit a[...,None] * b[:,None]
80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This should be substantial when extending this solution to a higher number of inputs.

How to speed up an algorithm involving booleans

I have a very large piece of data, and I want to find certain elements and convert them from bool to number. For example, I want to find whether each element is in the interval (0.3, 0.4), and convert True to 1 and False to 0.
import numpy as np

i = np.random.rand(1000, 1000, 1000)
j = ((0.3 < i) * (i < 0.4)) * 1
Does j=((0.3<i)&(i<0.4))*1 work the same as the expression above?
I know bool*bool is time-consuming and uses a huge amount of memory, and so does converting bool to number. So how can I speed up the algorithm and save memory? Is there a way to evaluate 0.3<i<0.4 quickly?
Yes, for boolean arrays & and * are identical because both are only True if both operands are True, otherwise False.
You already found out that each operation creates a temporary array (although newer NumPy versions might be optimized in that respect), so you have one temporary boolean array for each <, one for the * or the & and then you create an integer array with the * 1. Without using additional libraries you can't avoid that. NumPy is fast because it does the loops in C but that means you have to deal with temporary arrays.
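If you want to stay within plain NumPy, a small sketch of how to at least trim the peak memory: reuse one boolean temporary in place and reinterpret it instead of multiplying by 1 (note the result is int8 rather than the default integer type):
# i as defined in the question
mask = 0.3 < i           # one temporary boolean array
mask &= i < 0.4          # the second comparison is still a temporary, but it is combined in place
j = mask.view(np.int8)   # reinterpret the bool buffer as 0/1 int8 without a copy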
But with additional libraries you actually can speed that up and make it more memory-efficient.
Numba:
import numba as nb
import numpy as np

@nb.njit
def numba_func(arr, lower, upper):
    res = np.zeros(arr.size, dtype=np.int8)
    arr_raveled = arr.ravel()
    for idx in range(arr.size):
        res[idx] = lower < arr_raveled[idx] < upper
    return res.reshape(arr.shape)

>>> numba_func(i, 0.3, 0.4) # sample call
Numexpr
import numexpr as ne
ne.evaluate('((0.3<i)&(i<0.4))*1')
However, numexpr is more of a black box: you don't control how much memory it needs. But in most cases where you deal with multiple element-wise NumPy operations, it's very fast and much more memory-efficient than plain NumPy.
Cython
I'm using IPython magic here. If you don't use IPython or Jupyter you probably need to cythonize it yourself.
%load_ext cython

%%cython
import numpy as np
cimport numpy as cnp

cpdef cnp.int8_t[:] cython_func(double[:] arr, double lower, double upper):
    cdef Py_ssize_t idx
    cdef cnp.int8_t[:] res = np.empty(len(arr), dtype=np.int8)
    for idx in range(len(arr)):
        res[idx] = lower < arr[idx] < upper
    return res
Given that I used 1D-memoryviews here, you need to cast it to an array and reshape it afterwards:
np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape) # sample call
There are probably better ways to get around the ravel, asarray and reshape but those require that you know the dimension of your array.
Timing
I use a smaller array because I don't have much RAM but you can easily change the numbers:
i = np.random.random((1000, 1000, 10))
%timeit numba_func(i, 0.3, 0.4)
52.1 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit ne.evaluate('((0.3<i)&(i<0.4))*1')
77.1 ms ± 6.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.asarray(cython_func(i.ravel(), 0.3, 0.4)).reshape(i.shape)
146 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ((0.3<i)&(i<0.4))*1
180 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Yes, the expression works the same. You can check it with:
jmult = ((0.3 < i) * (i < 0.4)) * 1
jand = ((0.3 < i) & (i < 0.4)) * 1
(jand == jmult).all()  # True
