numba #jit slower that pure python? - python

so i need to improve the execution time for a script that i have been working on. I started working with numba jit decorator to try parallel computing however it throws me
KeyError: "Does not support option: 'parallel'"
so i decided to test the nogil if it unlocks the whole capabilities from my cpu but it was slower than pure python i dont understand why this happened, and if someone can help me or guide me i will be very grateful
import numpy as np
from numba import *
#jit(['float64[:,:],float64[:,:]'],'(n,m),(n,m)->(n,m)',nogil=True)
def asd(x,y):
return x+y
u=np.random.random(100)
w=np.random.random(100)
%timeit asd(u,w)
%timeit u+w
10000 loops, best of 3: 137 µs per loop
The slowest run took 7.13 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.75 µs per loop

You cannot expect numba to outperform numpy on such a simple vectorized operation. Also your comparison isn't exactly fair since the numba function includes the cost of the outside function call. If you sum a larger array, you'll see that the performance of the two converge and what you are seeing is just overhead on a very fast operation:
import numpy as np
import numba as nb
#nb.njit
def asd(x,y):
return x+y
def asd2(x, y):
return x + y
u=np.random.random(10000)
w=np.random.random(10000)
%timeit asd(u,w)
%timeit asd2(u,w)
The slowest run took 17796.43 times longer than the fastest. This could mean
that an intermediate result is being cached.
100000 loops, best of 3: 6.06 µs per loop
The slowest run took 29.94 times longer than the fastest. This could mean that
an intermediate result is being cached.
100000 loops, best of 3: 5.11 µs per loop
As far as parallel functionality, for this simple operation, you can use nb.vectorize:
#nb.vectorize([nb.float64(nb.float64, nb.float64)], target='parallel')
def asd3(x, y):
return x + y
u=np.random.random((100000, 10))
w=np.random.random((100000, 10))
%timeit asd(u,w)
%timeit asd2(u,w)
%timeit asd3(u,w)
But again, if you operate on small arrays, you are going to be seeing the overhead of thread dispatch. For the array sizes above, I see the parallel giving me a 2x speedup.
Where numba really shines is doing operations that are difficult to do in numpy using broadcasting, or when operations would result in a lot of temporary intermediate array allocations.

Related

How is numpy multi_dot slower than numpy.dot?

I'm trying to optimize some code that performs lots of sequential matrix operations.
I figured numpy.linalg.multi_dot (docs here) would perform all the operations in C or BLAS and thus it would be way faster than going something like arr1.dot(arr2).dot(arr3) and so on.
I was really surprised running this code on a notebook:
v1 = np.random.rand(2,2)
v2 = np.random.rand(2,2)
%%timeit
​
v1.dot(v2.dot(v1.dot(v2)))
The slowest run took 9.01 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.14 µs per loop
%%timeit ​
np.linalg.multi_dot([v1,v2,v1,v2])
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.9 µs per loop
To find out that the same operation is about 10x slower using multi_dot.
My questions are:
Am I missing something ? does it make any sense ?
Is there another way to optimize sequential matrix operations ?
Should I expect the same behavior using cython ?
It's because your test matrices are too small and too regular; the overhead in figuring out the fastest evaluation order may outweights the potential performance gain.
Using the example from the document:
import numpy as snp
from numpy.linalg import multi_dot
# Prepare some data
A = np.random.rand(10000, 100)
B = np.random.rand(100, 1000)
C = np.random.rand(1000, 5)
D = np.random.rand(5, 333)
%timeit -n 10 multi_dot([A, B, C, D])
%timeit -n 10 np.dot(np.dot(np.dot(A, B), C), D)
%timeit -n 10 A.dot(B).dot(C).dot(D)
Result:
10 loops, best of 3: 12 ms per loop
10 loops, best of 3: 62.7 ms per loop
10 loops, best of 3: 59 ms per loop
multi_dot improves performance by evaluating the fastest multiplication order in which there are least scalar multiplications.
In the above case, the default regular multiplication order ((AB)C)D is evaluated as A((BC)D)--so that a 1000x100 # 100x1000 multiplication is reduced to 1000x100 # 100x333, cutting down at least 2/3 scalar multiplications.
You can verify this by testing
%timeit -n 10 np.dot(A, np.dot(np.dot(B, C), D))
10 loops, best of 3: 19.2 ms per loop

numpy's random vs python's default random subsampling

I observed that python's default random.sample is much faster than numpy's random.choice. Taking a small sample from an array of length 1 million, random.sample is more than 1000x faster than its numpy's counterpart.
In [1]: import numpy as np
In [2]: import random
In [3]: arr = [x for x in range(1000000)]
In [4]: nparr = np.array(arr)
In [5]: %timeit random.sample(arr, 5)
The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 4.54 µs per loop
In [6]: %timeit np.random.choice(arr, 5)
10 loops, best of 3: 47.7 ms per loop
In [7]: %timeit np.random.choice(nparr, 5)
The slowest run took 6.79 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 7.79 µs per loop
Although numpy sampling from numpy array was decently fast yet it was slower than default random sampling.
Is the observation above correct, or am I missing the difference between what random.sample and np.random.choice compute?
What you're seeing in your first call of numpy.random.choice is simply the overhead of converting the list arr to a numpy array.
As for your second call, the slightly worse is probably due to the fact that numpy.random.choice offers the ability to sample non-uniformly, and can also sample with replacement as well as without.

Fast indexed dot-product for numpy/scipy

I'm using numpy to do linear algebra. I want to do fast subset-indexed dot and other linear operations.
When dealing with big matrices, slicing solution like A[:,subset].dot(x[subset]) may be longer than doing the multiplication on the full matrix.
A = np.random.randn(1000,10000)
x = np.random.randn(10000,1)
subset = np.sort(np.random.randint(0,10000,500))
Timings show that sub-indexing can be faster when columns are in one block.
%timeit A.dot(x)
100 loops, best of 3: 4.19 ms per loop
%timeit A[:,subset].dot(x[subset])
100 loops, best of 3: 7.36 ms per loop
%timeit A[:,:500].dot(x[:500])
1000 loops, best of 3: 1.75 ms per loop
Still the acceleration is not what I would expect (20x faster!).
Does anyone know an idea of a library/module that allow these kind of fast operation through numpy or scipy?
For now on I'm using cython to code a fast column-indexed dot product through the cblas library. But for more complex operation (pseudo-inverse, or subindexed least square solving) I'm not shure to reach good acceleration.
Thanks!
Well, this is faster.
%timeit A.dot(x)
#4.67 ms
%%timeit
y = numpy.zeros_like(x)
y[subset]=x[subset]
d = A.dot(y)
#4.77ms
%timeit c = A[:,subset].dot(x[subset])
#7.21ms
And you have all(d-ravel(c)==0) == True.
Notice that how fast this is depends on the input. With subset = array([1,2,3]) you have that the time of my solution is pretty much the same, while the timing of the last solution is 46micro seconds.
Basically this will be faster if the size ofsubset is not much smaller than the size of x

sign() much slower in python than matlab?

I have a function in python that basically takes the sign of an array (75,150), for example.
I'm coming from Matlab and the time execution looks more or less the same less this function.
I'm wondering if sign() works very slowly and you know an alternative to do the same.
Thx,
I can't tell you if this is faster or slower than Matlab, since I have no idea what numbers you're seeing there (you provided no quantitative data at all). However, as far as alternatives go:
import numpy as np
a = np.random.randn(75, 150)
aSign = np.sign(a)
Testing using %timeit in IPython:
In [15]: %timeit np.sign(a)
10000 loops, best of 3: 180 µs per loop
Because the loop over the array (and what happens inside it) is implemented in optimized C code rather than generic Python code, it tends to be about an order of magnitude faster—in the same ballpark as Matlab.
Comparing the exact same code as a numpy vectorized operation vs. a Python loop:
In [276]: %timeit [np.sign(x) for x in a]
1000 loops, best of 3: 276 us per loop
In [277]: %timeit np.sign(a)
10000 loops, best of 3: 63.1 us per loop
So, only 4x as fast here. (But then a is pretty small here.)

Are Numpy functions slow?

Numpy is supposed to be fast. However, when comparing Numpy ufuncs with standard Python functions I find that the latter are much faster.
For example,
aa = np.arange(1000000, dtype = float)
%timeit np.mean(aa) # 1000 loops, best of 3: 1.15 ms per loop
%timeit aa.mean # 10000000 loops, best of 3: 69.5 ns per loop
I got similar results with other Numpy functions like max, power. I was under the impression that Numpy has an overhead that makes it slower for small arrays but would be faster for large arrays. In the code above aa is not small: it has 1 million elements. Am I missing something?
Of course, Numpy is fast, only the functions seem to be slow:
bb = range(1000000)
%timeit mean(bb) # 1 loops, best of 3: 551 ms per loop
%timeit mean(list(bb)) # 10 loops, best of 3: 136 ms per loop
Others already pointed out that your comparison is not a real comparison (you are not calling the function + both are numpy).
But to give an answer to the question "Are numpy function slow?": generally speaking, no, numpy function are not slow (or not slower than plain python function). Off course there are some side notes to make:
'Slow' depends off course on what you compare with, and it can always faster. With things like cython, numexpr, numba, calling C-code, ... and others it is in many cases certainly possible to get faster results.
Numpy has a certain overhead, which can be significant in some cases. For example, as you already mentioned, numpy can be slower on small arrays and scalar math. For a comparison on this, see eg Are NumPy's math functions faster than Python's?
To make the comparison you wanted to make:
In [1]: import numpy as np
In [2]: aa = np.arange(1000000)
In [3]: bb = range(1000000)
For the mean (note, there is no mean function in python standard library: Calculating arithmetic mean (average) in Python):
In [4]: %timeit np.mean(aa)
100 loops, best of 3: 2.07 ms per loop
In [5]: %timeit float(sum(bb))/len(bb)
10 loops, best of 3: 69.5 ms per loop
For max, numpy vs plain python:
In [6]: %timeit np.max(aa)
1000 loops, best of 3: 1.52 ms per loop
In [7]: %timeit max(bb)
10 loops, best of 3: 31.2 ms per loop
As a final note, in the above comparison I used a numpy array (aa) for the numpy functions and a list (bb) for the plain python functions. If you would use a list with numpy functions, in this case it would again be slower:
In [10]: %timeit np.max(bb)
10 loops, best of 3: 115 ms per loop
because the list is first converted to an array (which consumes most of the time). So, if you want to rely on numpy in your application, it is important to make use of numpy arrays to store you data (or if you have a list, convert it to an array so this conversion has to be done only once).
You're not calling aa.mean. Put the function call parentheses on the end, to actually call it, and the speed difference will nearly vanish. (Both np.mean(aa) and aa.mean() are NumPy; neither uses Python builtins to do the math.)

Categories