Why is numpy.dot much faster than numpy.einsum? - python

I have numpy compiled with OpenBlas and I am wondering why einsum is much slower than dot (I understand in the 3 indices case, but I dont understand why it is also less performant in the two indices case)? Here an example:
import numpy as np
A = np.random.random([1000,1000])
B = np.random.random([1000,1000])
%timeit np.dot(A,B)
Out: 10 loops, best of 3: 26.3 ms per loop
%timeit np.einsum("ij,jk",A,B)
Out: 5 loops, best of 3: 477 ms per loop
Is there a way to let einsum use OpenBlas and parallelization like numpy.dot?
Why does np.einsum not just call np.dot if it notices a dot product?

einsum parses the index string, and then constructs an nditer object, and uses that to perform a sum-of-products iteration. It has special cases where the indexes just perform axis swaps, and sums ('ii->i'). It may also have special cases for 2 and 3 variables (as opposed to more). But it does not make any attempt to invoke external libraries.
I worked out a pure python work-a-like, but with more focus on the parsing than the calculation special cases.
tensordot reshapes and swaps, so it can then call dot to the actual calculations.

Related

Is there a way to speed up indexing a vector with JAX?

I am indexing vectors and using JAX, but I have noticed a considerable slow-down compared to numpy when simply indexing arrays. For example, consider making a basic array in JAX numpy and ordinary numpy:
import jax.numpy as jnp
import numpy as onp
jax_array = jnp.ones((1000,))
numpy_array = onp.ones(1000)
Then simply indexing between two integers, for JAX (on GPU) this gives a time of:
%timeit jax_array[435:852]
1000 loops, best of 5: 1.38 ms per loop
And for numpy this gives a time of:
%timeit numpy_array[435:852]
1000000 loops, best of 5: 271 ns per loop
So numpy is 5000 times faster than JAX. When JAX is on a CPU, then
%timeit jax_array[435:852]
1000 loops, best of 5: 577 µs per loop
So faster, but still 2000 times slower than numpy. I am using Google Colab notebooks for this, so there should not be a problem with the installation/CUDA.
Am I missing something? I realise that indexing is different for JAX and numpy, as given by the JAX 'sharp edges' documentation, but I cannot find any way to perform assignment such as
new_array = jax_array[435:852]
without a considerable slowdown. I cannot avoid indexing the arrays as it is necessary in my program.
The short answer: to speed things up in JAX, use jit.
The long answer:
You should generally expect single operations using JAX in op-by-op mode to be slower than similar operations in numpy. This is because JAX execution has some amount of fixed per-python-function-call overhead involved in pushing compilations down to XLA.
Even seemingly simple operations like indexing are implemented in terms of multiple XLA operations, which (outside JIT) will each add their own call overhead. You can see this sequence using the make_jaxpr transform to inspect how the function is expressed in terms of primitive operations:
from jax import make_jaxpr
f = lambda x: x[435:852]
make_jaxpr(f)(jax_array)
# { lambda ; a.
# let b = broadcast_in_dim[ broadcast_dimensions=( )
# shape=(1,) ] 435
# c = gather[ dimension_numbers=GatherDimensionNumbers(offset_dims=(0,), collapsed_slice_dims=(), start_index_map=(0,))
# indices_are_sorted=True
# slice_sizes=(417,)
# unique_indices=True ] a b
# d = broadcast_in_dim[ broadcast_dimensions=(0,)
# shape=(417,) ] c
# in (d,) }
(See Understanding Jaxprs for info on how to read this).
Where JAX outperforms numpy is not in single small operations (in which JAX dispatch overhead dominates), but rather in sequences of operations compiled via the jit transform. So, for example, compare the JIT-compiled versus not-JIT-compiled version of the indexing:
%timeit f(jax_array).block_until_ready()
# 1000 loops, best of 5: 612 µs per loop
f_jit = jit(f)
f_jit(jax_array) # trigger compilation
%timeit f_jit(jax_array).block_until_ready()
# 100000 loops, best of 5: 4.34 µs per loop
(note that block_until_ready() is required for accurate micro-benchmarks because of JAX's asynchronous dispatch)
JIT-compiling this code gives a 150x speedup. It's still not as fast as numpy because of JAX's few-millisecond dispatch overhead, but with JIT that overhead is incurred only once. And when you move past microbenchmarks to more complicated sequences of real-world computations, those few milliseconds will no longer dominate, and the optimization provided by the XLA compiler can make JAX far faster than the equivalent numpy computation.

Vectorisation of coordinate distances in Numpy

I'm trying to understand Numpy by applying vectorisation. I'm trying to find the fastest function to do it.
def get_distances3(coordinates):
return np.linalg.norm(
coordinates[:, None, :] - coordinates[None, :, :],
axis=-1)
coordinates = np.random.rand(1000, 3)
%timeit get_distances3(coordinates)
The function above took 10 loops, best of 3: 35.4 ms per loop. From numpy library there's also an np.vectorize option to do it.
def get_distances4(coordinates):
return np.vectorize(coordinates[:, None, :] - coordinates[None, :, :],axis=-1)
%timeit get_distances4(coordinates)
I tried with np.vectorize below, yet ended up with the following error.
TypeError: __init__() got an unexpected keyword argument 'axis'
How can I find vectorization in get_distances4? How should I edit the lsat code in order to avoid the error? I have never used np.vectorize, so I might be missing something.
You're not calling np.vectorize() correctly. I suggest referring to the documentation.
Vectorize takes as its argument a function that is written to operate on scalar values, and converts it into a function that can be vectorized over values in arrays according to the Numpy broadcasting rules. It's basically like a fancy map() for Numpy array.
i.e. as you know Numpy already has built-in vectorized versions of many common functions, but if you had some custom function like "my_special_function(x)" and you wanted to be able to call it on Numpy arrays, you could use my_special_function_ufunc = np.vectorize(my_special_function).
In your above example you might "vectorize" your distance function like:
>>> norm = np.linalg.norm
>>> get_distance4 = np.vectorize(lambda a, b: norm(a - b))
>>> get_distance4(coordinates[:, None, :], coordinates[None, :, :])
However, you will find that this is incredibly slow:
>>> %timeit get_distance4(coordinates[:, None, :], coordinates[None, :, :])
1 loop, best of 3: 10.8 s per loop
This is because your first example get_distance3 is already using Numpy's built-in fast implementations of these operations, whereas the np.vectorize version requires calling the Python function I defined some 3000 times.
In fact according to the docs:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
If you want a potentially faster function for converting distances between vectors you could use scipy.spacial.distance.pdist:
>>> %timeit get_distances3(coordinates)
10 loops, best of 3: 24.2 ms per loop
>>> %timeit distance.pdist(coordinates)
1000 loops, best of 3: 1.77 ms per loop
It's worth noting that this has a different return formation. Rather than a 1000x1000 array it uses a condensed format that excludes i = j entries and i > j entries. If you wish you can then use scipy.spatial.distance.squareform to convert back to the square matrix format.

How does numpy broadcasting perform faster?

In the following question,
https://stackoverflow.com/a/40056135/5714445
Numpy's broadcasting provides a solution that's almost 6x faster than using np.setdiff1d() paired with np.view(). How does it manage to do this?
And using A[~((A[:,None,:] == B).all(-1)).any(1)] speeds it up even more.
Interesting, but raises yet another question. How does this perform even better?
I would try to answer the second part of the question.
So, with it we are comparing :
A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] (I)
and
A[~((A[:,None,:] == B).all(-1)).any(1)]
To compare with a matching perspective against the first one, we could write down the second approach like this -
A[(((~(A[:,None,:] == B)).any(2))).all(1)] (II)
The major difference when considering performance, would be the fact that with the first one, we are getting non-matches with subtraction and then checking for non-zeros with .any(). Thus, any() is made to operate on an array of non-boolean dtype array. In the second approach, instead we are feeding it a boolean array obtained with A[:,None,:] == B.
Let's do a small runtime test to see how .any() performs on int dtype vs boolean array -
In [141]: A = np.random.randint(0,9,(1000,1000)) # An int array
In [142]: %timeit A.any(0)
1000 loops, best of 3: 1.43 ms per loop
In [143]: A = np.random.randint(0,9,(1000,1000))>5 # A boolean array
In [144]: %timeit A.any(0)
10000 loops, best of 3: 164 µs per loop
So, with close to 9x speedup on this part, we see a huge advantage to use any() with boolean arrays. This I think was the biggest reason to make the second approach faster.

Optimizing Many Matrix Operations in Python / Numpy

In writing some numerical analysis code, I have bottle-necked at a function that requires many Numpy calls. I am not entirely sure how to approach further performance optimization.
Problem:
The function determines error by calculating the following,
Code:
def foo(B_Mat, A_Mat):
Temp = np.absolute(B_Mat)
Temp /= np.amax(Temp)
return np.sqrt(np.sum(np.absolute(A_Mat - Temp*Temp))) / B_Mat.shape[0]
What would be the best way to squeeze some extra performance out of the code? Would my best course of action be performing the majority of the operations in a single for loop with Cython to cut down on the temporary arrays?
There are specific functions from the implementation that could be off-loaded to numexpr module which is known to be very efficient for arithmetic computations. For our case, specifically we could perform squaring, summation and absolute computations with it. Thus, a numexpr based solution to replace the last step in the original code, would be like so -
import numexpr as ne
out = np.sqrt(ne.evaluate('sum(abs(A_Mat - Temp**2))'))/B_Mat.shape[0]
A further performance boost could be achieved by embedding the normalization step into the numexpr's evaluate expression. Thus, the entire function modified to use numexpr would be -
def numexpr_app1(B_Mat, A_Mat):
Temp = np.absolute(B_Mat)
M = np.amax(Temp)
return np.sqrt(ne.evaluate('sum(abs(A_Mat*M**2-Temp**2))'))/(M*B_Mat.shape[0])
Runtime test -
In [198]: # Random arrays
...: A_Mat = np.random.randn(4000,5000)
...: B_Mat = np.random.randn(4000,5000)
...:
In [199]: np.allclose(foo(B_Mat, A_Mat),numexpr_app1(B_Mat, A_Mat))
Out[199]: True
In [200]: %timeit foo(B_Mat, A_Mat)
1 loops, best of 3: 891 ms per loop
In [201]: %timeit numexpr_app1(B_Mat, A_Mat)
1 loops, best of 3: 400 ms per loop

sign() much slower in python than matlab?

I have a function in python that basically takes the sign of an array (75,150), for example.
I'm coming from Matlab and the time execution looks more or less the same less this function.
I'm wondering if sign() works very slowly and you know an alternative to do the same.
Thx,
I can't tell you if this is faster or slower than Matlab, since I have no idea what numbers you're seeing there (you provided no quantitative data at all). However, as far as alternatives go:
import numpy as np
a = np.random.randn(75, 150)
aSign = np.sign(a)
Testing using %timeit in IPython:
In [15]: %timeit np.sign(a)
10000 loops, best of 3: 180 µs per loop
Because the loop over the array (and what happens inside it) is implemented in optimized C code rather than generic Python code, it tends to be about an order of magnitude faster—in the same ballpark as Matlab.
Comparing the exact same code as a numpy vectorized operation vs. a Python loop:
In [276]: %timeit [np.sign(x) for x in a]
1000 loops, best of 3: 276 us per loop
In [277]: %timeit np.sign(a)
10000 loops, best of 3: 63.1 us per loop
So, only 4x as fast here. (But then a is pretty small here.)

Categories