Vectorisation of coordinate distances in Numpy - python

I'm trying to understand NumPy vectorisation by computing pairwise distances between coordinates, and I'm trying to find the fastest function to do it.
import numpy as np

def get_distances3(coordinates):
    return np.linalg.norm(
        coordinates[:, None, :] - coordinates[None, :, :],
        axis=-1)
coordinates = np.random.rand(1000, 3)
%timeit get_distances3(coordinates)
The function above gives 10 loops, best of 3: 35.4 ms per loop. NumPy also offers np.vectorize, so I tried to use that as well.
def get_distances4(coordinates):
    return np.vectorize(coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
%timeit get_distances4(coordinates)
However, this attempt with np.vectorize ended up with the following error.
TypeError: __init__() got an unexpected keyword argument 'axis'
How can I use np.vectorize in get_distances4? How should I edit the last snippet in order to avoid the error? I have never used np.vectorize, so I might be missing something.

You're not calling np.vectorize() correctly. I suggest referring to the documentation.
np.vectorize takes as its argument a function written to operate on scalar values, and converts it into a function that can be applied elementwise over arrays according to the NumPy broadcasting rules. It's basically like a fancy map() for NumPy arrays.
That is, NumPy already has built-in vectorized versions of many common functions, but if you had some custom scalar function like my_special_function(x) and you wanted to be able to call it on NumPy arrays, you could use my_special_function_ufunc = np.vectorize(my_special_function).
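For instance, a minimal sketch of that pattern (my_special_function here is just a made-up scalar function for illustration, not anything from NumPy):
import numpy as np

def my_special_function(x):
    # plain Python logic that only knows how to handle a scalar
    return x**2 if x >= 0 else -x

my_special_function_ufunc = np.vectorize(my_special_function)
my_special_function_ufunc(np.array([-2.0, 0.0, 3.0]))  # array([2., 0., 9.])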
In your above example you might "vectorize" your distance function like:
>>> norm = np.linalg.norm
>>> get_distance4 = np.vectorize(lambda a, b: norm(a - b))
>>> get_distance4(coordinates[:, None, :], coordinates[None, :, :])
However, you will find that this is incredibly slow:
>>> %timeit get_distance4(coordinates[:, None, :], coordinates[None, :, :])
1 loop, best of 3: 10.8 s per loop
This is because your first example get_distances3 already uses NumPy's built-in fast implementations of these operations, whereas the np.vectorize version has to call the Python function I defined once per element of the broadcast result, i.e. roughly 3 million times.
In fact according to the docs:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
If you want a potentially faster function for computing pairwise distances between vectors, you could use scipy.spatial.distance.pdist:
>>> %timeit get_distances3(coordinates)
10 loops, best of 3: 24.2 ms per loop
>>> %timeit distance.pdist(coordinates)
1000 loops, best of 3: 1.77 ms per loop
It's worth noting that this has a different return format. Rather than a 1000x1000 array, it uses a condensed format that excludes the i = j and i > j entries. If you wish, you can then use scipy.spatial.distance.squareform to convert back to the square matrix format, as sketched below.
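A minimal sketch of that round trip (my own example, not part of the original answer):
import numpy as np
from scipy.spatial import distance

coordinates = np.random.rand(1000, 3)
condensed = distance.pdist(coordinates)   # shape (1000 * 999 / 2,) = (499500,)
square = distance.squareform(condensed)   # shape (1000, 1000), zeros on the diagonal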

Related

Is there a way to speed up indexing a vector with JAX?

I am indexing vectors and using JAX, but I have noticed a considerable slow-down compared to numpy when simply indexing arrays. For example, consider making a basic array in JAX numpy and ordinary numpy:
import jax.numpy as jnp
import numpy as onp
jax_array = jnp.ones((1000,))
numpy_array = onp.ones(1000)
Then, simply slicing between two integer indices, JAX (on GPU) gives a time of:
%timeit jax_array[435:852]
1000 loops, best of 5: 1.38 ms per loop
And for numpy this gives a time of:
%timeit numpy_array[435:852]
1000000 loops, best of 5: 271 ns per loop
So numpy is 5000 times faster than JAX. When JAX is on a CPU, then
%timeit jax_array[435:852]
1000 loops, best of 5: 577 µs per loop
So faster, but still 2000 times slower than numpy. I am using Google Colab notebooks for this, so there should not be a problem with the installation/CUDA.
Am I missing something? I realise that indexing is different for JAX and numpy, as described in the JAX 'sharp edges' documentation, but I cannot find any way to perform a slice such as
new_array = jax_array[435:852]
without a considerable slowdown. I cannot avoid indexing the arrays as it is necessary in my program.
The short answer: to speed things up in JAX, use jit.
The long answer:
You should generally expect single operations run in JAX's op-by-op (eager) mode to be slower than similar operations in numpy. This is because JAX has a fixed amount of per-operation overhead involved in dispatching each call down to XLA.
Even seemingly simple operations like indexing are implemented in terms of multiple XLA operations, which (outside JIT) will each add their own call overhead. You can see this sequence using the make_jaxpr transform to inspect how the function is expressed in terms of primitive operations:
from jax import make_jaxpr
f = lambda x: x[435:852]
make_jaxpr(f)(jax_array)
# { lambda ; a.
# let b = broadcast_in_dim[ broadcast_dimensions=( )
# shape=(1,) ] 435
# c = gather[ dimension_numbers=GatherDimensionNumbers(offset_dims=(0,), collapsed_slice_dims=(), start_index_map=(0,))
# indices_are_sorted=True
# slice_sizes=(417,)
# unique_indices=True ] a b
# d = broadcast_in_dim[ broadcast_dimensions=(0,)
# shape=(417,) ] c
# in (d,) }
(See Understanding Jaxprs for info on how to read this).
Where JAX outperforms numpy is not in single small operations (in which JAX dispatch overhead dominates), but rather in sequences of operations compiled via the jit transform. So, for example, compare the JIT-compiled versus not-JIT-compiled version of the indexing:
%timeit f(jax_array).block_until_ready()
# 1000 loops, best of 5: 612 µs per loop
from jax import jit
f_jit = jit(f)
f_jit(jax_array) # trigger compilation
%timeit f_jit(jax_array).block_until_ready()
# 100000 loops, best of 5: 4.34 µs per loop
(note that block_until_ready() is required for accurate micro-benchmarks because of JAX's asynchronous dispatch)
JIT-compiling this code gives a 150x speedup. It's still not as fast as numpy because of JAX's per-call dispatch overhead, but within a JIT-compiled function that overhead is paid once for the whole sequence of operations rather than once per operation. And when you move past microbenchmarks to more complicated sequences of real-world computations, that fixed overhead will no longer dominate, and the optimization provided by the XLA compiler can make JAX far faster than the equivalent numpy computation.
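As an illustration (my own sketch, not from the original answer; normalize_window is a made-up example function), a function that chains several operations benefits from jit because XLA compiles the whole sequence into a single call:
import jax
import jax.numpy as jnp

def normalize_window(x):
    # slice, subtract, divide: several primitives that XLA can fuse
    w = x[435:852]
    return (w - w.mean()) / w.std()

normalize_window_jit = jax.jit(normalize_window)

x = jnp.arange(1000.0)
normalize_window_jit(x).block_until_ready()   # first call triggers compilation
%timeit normalize_window_jit(x).block_until_ready()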

How does numpy broadcasting perform faster?

In the following question,
https://stackoverflow.com/a/40056135/5714445
Numpy's broadcasting provides a solution that's almost 6x faster than using np.setdiff1d() paired with np.view(). How does it manage to do this?
And using A[~((A[:,None,:] == B).all(-1)).any(1)] speeds it up even more.
Interesting, but raises yet another question. How does this perform even better?
I will try to answer the second part of the question.
So, we are comparing:
A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] (I)
and
A[~((A[:,None,:] == B).all(-1)).any(1)]
To put it on an equal footing with the first one, we could write the second approach like this -
A[(((~(A[:,None,:] == B)).any(2))).all(1)] (II)
The major difference, performance-wise, is that with the first one we are getting non-matches with subtraction and then checking for non-zeros with .any(). Thus, any() is made to operate on an array of non-boolean dtype. In the second approach, we are instead feeding it a boolean array obtained with A[:,None,:] == B.
Let's do a small runtime test to see how .any() performs on int dtype vs boolean array -
In [141]: A = np.random.randint(0,9,(1000,1000)) # An int array
In [142]: %timeit A.any(0)
1000 loops, best of 3: 1.43 ms per loop
In [143]: A = np.random.randint(0,9,(1000,1000))>5 # A boolean array
In [144]: %timeit A.any(0)
10000 loops, best of 3: 164 µs per loop
So, with close to a 9x speedup on this part, we see a huge advantage to using any() with boolean arrays. This, I think, is the biggest reason the second approach is faster.
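For completeness, a small sanity check (my own sketch, not part of the original answer) that the two expressions select the same rows of A:
import numpy as np

A = np.random.randint(0, 9, (500, 3))
B = np.random.randint(0, 9, (100, 3))

out_I = A[np.all(np.any(A - B[:, None], axis=2), axis=0)]   # approach (I)
out_II = A[(~(A[:, None, :] == B)).any(2).all(1)]           # approach (II)

assert np.array_equal(out_I, out_II)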

Optimizing Many Matrix Operations in Python / Numpy

While writing some numerical analysis code, I have hit a bottleneck in a function that requires many NumPy calls. I am not entirely sure how to approach further performance optimization.
Problem:
The function determines the error by calculating the following, where M = max(|B|) and n = B_Mat.shape[0]:
error = sqrt( sum( |A - (|B| / M)**2| ) ) / n
Code:
import numpy as np

def foo(B_Mat, A_Mat):
    Temp = np.absolute(B_Mat)
    Temp /= np.amax(Temp)
    return np.sqrt(np.sum(np.absolute(A_Mat - Temp*Temp))) / B_Mat.shape[0]
What would be the best way to squeeze some extra performance out of the code? Would my best course of action be performing the majority of the operations in a single for loop with Cython to cut down on the temporary arrays?
Specific parts of the implementation could be off-loaded to the numexpr module, which is known to be very efficient for elementwise arithmetic computations. In our case, we could perform the squaring, summation and absolute-value computations with it. Thus, a numexpr-based replacement for the last step in the original code would look like so -
import numexpr as ne
out = np.sqrt(ne.evaluate('sum(abs(A_Mat - Temp**2))'))/B_Mat.shape[0]
A further performance boost could be achieved by embedding the normalization step into numexpr's evaluate expression: since abs(A_Mat - (Temp/M)**2) equals abs(A_Mat*M**2 - Temp**2) / M**2, the M**2 factor can be pulled outside the sum and becomes a single division by M after the square root. Thus, the entire function modified to use numexpr would be -
def numexpr_app1(B_Mat, A_Mat):
    Temp = np.absolute(B_Mat)
    M = np.amax(Temp)
    return np.sqrt(ne.evaluate('sum(abs(A_Mat*M**2-Temp**2))'))/(M*B_Mat.shape[0])
Runtime test -
In [198]: # Random arrays
...: A_Mat = np.random.randn(4000,5000)
...: B_Mat = np.random.randn(4000,5000)
...:
In [199]: np.allclose(foo(B_Mat, A_Mat),numexpr_app1(B_Mat, A_Mat))
Out[199]: True
In [200]: %timeit foo(B_Mat, A_Mat)
1 loops, best of 3: 891 ms per loop
In [201]: %timeit numexpr_app1(B_Mat, A_Mat)
1 loops, best of 3: 400 ms per loop

Why is numpy.dot much faster than numpy.einsum?

I have numpy compiled with OpenBLAS and I am wondering why einsum is much slower than dot (I understand it in the three-index case, but I don't understand why it is also less performant in the two-index case). Here is an example:
import numpy as np
A = np.random.random([1000,1000])
B = np.random.random([1000,1000])
%timeit np.dot(A,B)
Out: 10 loops, best of 3: 26.3 ms per loop
%timeit np.einsum("ij,jk",A,B)
Out: 5 loops, best of 3: 477 ms per loop
Is there a way to let einsum use OpenBlas and parallelization like numpy.dot?
Why does np.einsum not just call np.dot if it notices a dot product?
einsum parses the index string, and then constructs an nditer object, and uses that to perform a sum-of-products iteration. It has special cases where the indexes just perform axis swaps, and sums ('ii->i'). It may also have special cases for 2 and 3 variables (as opposed to more). But it does not make any attempt to invoke external libraries.
I worked out a pure Python work-alike, but with more focus on the parsing than on the calculation special cases.
tensordot reshapes and swaps axes so that it can then call dot to do the actual calculations.
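For the two-index case above, a hedged sketch of routes that reach the BLAS-backed code (np.tensordot goes through dot; newer NumPy versions also accept an optimize flag for einsum that can pick a faster contraction path):
import numpy as np

A = np.random.random([1000, 1000])
B = np.random.random([1000, 1000])

C1 = np.dot(A, B)                             # BLAS-backed matrix product
C2 = np.tensordot(A, B, axes=([1], [0]))      # reshapes/transposes, then calls dot
C3 = np.einsum("ij,jk", A, B, optimize=True)  # lets einsum choose an optimized path

assert np.allclose(C1, C2) and np.allclose(C1, C3)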

Scipy sparse... arrays?

So, I'm doing some K-means classification using numpy arrays that are quite sparse -- lots and lots of zeroes. I figured that I'd use scipy's 'sparse' package to reduce the storage overhead, but I'm a little confused about how to create arrays, not matrices.
I've gone through this tutorial on how to create sparse matrices:
http://www.scipy.org/SciPy_Tutorial#head-c60163f2fd2bab79edd94be43682414f18b90df7
To mimic an array, I just create a 1xN matrix, but as you may guess, Asp.dot(Bsp) doesn't quite work because you can't multiply two 1xN matrices. I'd have to transpose each array to Nx1, and that's pretty lame, since I'd be doing it for every dot-product calculation.
Next up, I tried to create an NxN matrix where column 1 == row 1 (such that you can multiply two matrices and just take the top-left corner as the dot product), but that turned out to be really inefficient.
I'd love to use scipy's sparse package as a magic replacement for numpy's array(), but as yet, I'm not really sure what to do.
Any advice?
Use a scipy.sparse format that is row or column based: csc_matrix and csr_matrix.
These use efficient C implementations under the hood (including for multiplication), and transposition is a no-op (especially if you call transpose(copy=False)), just like with numpy arrays.
EDIT: some timings via ipython:
import numpy, scipy.sparse
n = 100000
x = (numpy.random.rand(n) * 2).astype(int).astype(float) # 50% sparse vector
x_csr = scipy.sparse.csr_matrix(x)
x_dok = scipy.sparse.dok_matrix(x.reshape(x_csr.shape))
Now x_csr and x_dok are 50% sparse:
print(repr(x_csr))
<1x100000 sparse matrix of type '<type 'numpy.float64'>'
with 49757 stored elements in Compressed Sparse Row format>
And the timings:
timeit numpy.dot(x, x)
10000 loops, best of 3: 123 us per loop
timeit x_dok * x_dok.T
1 loops, best of 3: 1.73 s per loop
timeit x_csr.multiply(x_csr).sum()
1000 loops, best of 3: 1.64 ms per loop
timeit x_csr * x_csr.T
100 loops, best of 3: 3.62 ms per loop
So it looks like I told a lie. Transposition is very cheap, but there is no efficient C implementation of csr * csc (as of scipy 0.9.0, the latest at the time of writing). A new csr object is constructed in each call :-(
As a hack (though scipy is relatively stable these days), you can do the dot product directly on the sparse data:
timeit numpy.dot(x_csr.data, x_csr.data)
10000 loops, best of 3: 62.9 us per loop
Note this last approach does a numpy dense multiplication again. The sparsity is 50%, so it's actually faster than dot(x, x) by a factor of 2.
You could create a subclass of one of the existing 2D sparse matrix classes:
from scipy.sparse import dok_matrix

class sparse1d(dok_matrix):
    def __init__(self, v):
        dok_matrix.__init__(self, (v,))
    def dot(self, other):
        return dok_matrix.dot(self, other.transpose())[0, 0]

a = sparse1d((1, 2, 3))
b = sparse1d((4, 5, 6))
print(a.dot(b))
I'm not sure that it is really much better or faster, but you could do this to avoid using the transpose:
Asp.multiply(Bsp).sum()
This just takes the element-by-element product of the two matrices and sums the products. You could make a subclass of whatever matrix format you are using that has the above statement as the dot product.
However, it is probably just easier to transpose them:
Asp*Bsp.T
That doesn't seem like too much extra work, but you could also make a subclass and override the __mul__() method, as sketched below.
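A minimal sketch of that idea (my own, assuming a csr_matrix base class; not a drop-in implementation tested against every scipy version):
from scipy.sparse import csr_matrix

class SparseVector(csr_matrix):
    # A 1xN sparse matrix whose * operator acts like a 1-D dot product.
    def __mul__(self, other):
        # elementwise product, then sum -- equivalent to the dot product for 1xN operands
        return self.multiply(other).sum()

a = SparseVector([[1.0, 0.0, 3.0]])
b = SparseVector([[4.0, 5.0, 6.0]])
print(a * b)  # 22.0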
