Is it possible to speed up this loop in Python?

The usual ways to map a function over a numpy.ndarray, such as np.array(list(map(some_func, x))) or np.vectorize(f)(x), don't give the function access to the element's index.
The following code is just a simple example that is commonly seen in many applications.
dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
for i in range(feature_mat.shape[0]):
    for j in range(i, feature_mat.shape[0]):
        dis_mat[i, j] = np.linalg.norm(
            feature_mat[i, :] - feature_mat[j, :]
        )
        dis_mat[j, i] = dis_mat[i, j]
Is there a way to speed it up?
Thank you for your help! The quickest way to speed up this code is the following, using the function that @user2357112 mentioned in a comment:
from scipy.spatial.distance import pdist,squareform
dis_mat = squareform(pdist(feature_mat))
@Julien's method is also good if feature_mat is small, but when feature_mat is 1000 by 2000 it needs nearly 40 GB of memory.
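A rough estimate of where that memory goes (my arithmetic, assuming float64): the broadcasted difference array alone holds 1000 x 1000 x 2000 doubles, and np.linalg.norm allocates comparable temporaries on top of it.

# broadcasted difference array: 1000 x 1000 x 2000 float64 values
print(1000 * 1000 * 2000 * 8 / 1e9)  # 16.0 GB before norm's own abs/square temporaries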

SciPy comes with a function specifically for the kind of pairwise distances you're computing: scipy.spatial.distance.pdist. It produces the distances in a condensed format that essentially stores only the upper triangle of the distance matrix; you can convert the result to square form with scipy.spatial.distance.squareform:
from scipy.spatial.distance import pdist, squareform
distance_matrix = squareform(pdist(feature_mat))
This has the benefit of avoiding the giant intermediate arrays required by a direct vectorized solution, so it's faster and works on larger inputs. It does lose out, timing-wise, to an approach that uses algebraic manipulation to let np.dot handle the heavy lifting, though.
pdist also supports a wide variety of alternate distance metrics, if you decide you want something other than Euclidean distance.
# Manhattan distance!
distance_matrix = squareform(pdist(feature_mat, 'cityblock'))
# Cosine distance!
distance_matrix = squareform(pdist(feature_mat, 'cosine'))
# Correlation distance!
distance_matrix = squareform(pdist(feature_mat, 'correlation'))
# And more! Check out the docs.
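To make the condensed format concrete, here is a minimal sketch (the array sizes are arbitrary, chosen for illustration):

import numpy as np
from scipy.spatial.distance import pdist, squareform

feature_mat = np.random.rand(5, 3)   # 5 points in 3 dimensions
condensed = pdist(feature_mat)       # shape (10,): the 5*4/2 upper-triangle distances
square = squareform(condensed)       # shape (5, 5), symmetric, zero diagonal
assert np.allclose(square, square.T)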

You can create a new axis and broadcast:
dis_mat = np.linalg.norm(feature_mat[:,None] - feature_mat, axis=-1)
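To spell out the shapes involved (my annotation, not part of the original answer):

import numpy as np

feature_mat = np.random.rand(100, 200)
diffs = feature_mat[:, None] - feature_mat  # (100, 1, 200) - (100, 200) -> (100, 100, 200)
assert diffs.shape == (100, 100, 200)
dis_mat = np.linalg.norm(diffs, axis=-1)    # collapse the feature axis
assert dis_mat.shape == (100, 100)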
Timing:
feature_mat = np.random.rand(100, 200)

def a():
    dis_mat = np.zeros([feature_mat.shape[0], feature_mat.shape[0]])
    for i in range(feature_mat.shape[0]):
        for j in range(i, feature_mat.shape[0]):
            dis_mat[i, j] = np.linalg.norm(
                feature_mat[i, :] - feature_mat[j, :]
            )
            dis_mat[j, i] = dis_mat[i, j]

def b():
    dis_mat = np.linalg.norm(feature_mat[:, None] - feature_mat, axis=-1)

%timeit a()
100 loops, best of 3: 20.5 ms per loop
%timeit b()
100 loops, best of 3: 11.8 ms per loop

Factor out what can be precomputed, and use np.dot's optimizations on the k x k matrix, keeping the memory footprint small (k x k). The trick is the identity ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j:
def c(m):
    xy = np.dot(m, m.T)                 # O(k^3): all pairwise dot products
    x2 = y2 = (m * m).sum(1)            # O(k^2): squared norms of the rows
    d2 = np.add.outer(x2, y2) - 2 * xy  # O(k^2): squared distances
    d2.flat[::len(m) + 1] = 0           # zero the diagonal (rounding issues)
    return np.sqrt(d2)                  # O(k^2)
And for comparison:
def d(m):
    return squareform(pdist(m))
Here are the timeit results for k x k input matrices: [timing table omitted]
Both algorithms are O(k^3), but c(m) does the O(k^3) part of the job through np.dot, the critical kernel of linear algebra, which benefits from all the usual optimizations (multiple cores and so on). pdist, as seen in its source, is just loops.
This explains the 15x factor for big arrays, even though pdist exploits the symmetry of the matrix by computing only half of the terms.
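As a quick sanity check that the algebraic route agrees with pdist (a sketch; the input size is arbitrary):

import numpy as np
from scipy.spatial.distance import pdist, squareform

m = np.random.rand(300, 50)
assert np.allclose(c(m), squareform(pdist(m)))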

One way I thought of to avoid mixing NumPy and for loops would be to create an index array using a version of this index creator that allows for replacement:
import numpy as np
from itertools import chain, combinations_with_replacement
from scipy.special import comb

def comb_index(n, k):
    # number of k-combinations of n items, with replacement
    count = comb(n, k, exact=True, repetition=True)
    index = np.fromiter(chain.from_iterable(
                combinations_with_replacement(range(n), k)),
            int, count=count*k)
    return index.reshape(-1, k)
Then, we simply take each of those pairs of rows, compute the difference between them, reshape the resulting array, and take the norm of each of its rows:
reshape_mat = np.diff(feature_mat[comb_index(feature_mat.shape[0], 2), :],
                      axis=1).reshape(-1, feature_mat.shape[1])
dis_list = np.linalg.norm(reshape_mat, axis=-1)
Note that dis_list is simply a list of all n*(n+1)/2 possible norms. This runs at close to the same speed as the other answer for the feature_mat the asker provided, and when comparing the byte sizes of our largest intermediate arrays,
(feature_mat[:,None] - feature_mat).nbytes == 16000000
while
np.diff(feature_mat[comb_index(feature_mat.shape[0], 2), :], axis=1).reshape(-1, feature_mat.shape[1]).nbytes == 8080000
For most inputs, mine uses only about half the storage: still suboptimal, but a marginal improvement.

Based on np.triu_indices, in case you really want to do this with pure NumPy:
s = feature_mat.shape[0]
i, j = np.triu_indices(s, 1)         # all index pairs with i < j
dist_mat = np.empty((s, s))          # don't waste time filling with zeros
np.einsum('ii->i', dist_mat)[:] = 0  # when you can just fill the diagonal
# vectorized version of your original process
dist_mat[i, j] = dist_mat[j, i] = \
    np.linalg.norm(feature_mat[i] - feature_mat[j], axis=-1)
The benefit of this method over broadcasting is that you can do it in chunks:
n = 10000000  # choose based on your available RAM
for k in range(0, i.size, n):
    i_ = i[k: k + n]
    j_ = j[k: k + n]
    dist_mat[i_, j_] = dist_mat[j_, i_] = \
        np.linalg.norm(feature_mat[i_] - feature_mat[j_], axis=-1)
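The per-chunk memory cost is easy to estimate (my arithmetic, assuming float64 and the 200-feature example used in the timings above): each chunk materializes an (n, f) difference array, where f is the feature width.

# with n = 10_000_000 index pairs per chunk and f = 200 features:
print(10_000_000 * 200 * 8 / 1e9, "GB per chunk")  # 16.0 GB; shrink n to fit your RAM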

Let's begin by rewriting this in terms of a function:
def dist(mat, i, j):
    return np.linalg.norm(mat[i, :] - mat[j, :])

size = feature_mat.shape[0]
for i in range(size):
    for j in range(size):
        dis_mat[i, j] = dist(feature_mat, i, j)
This can be rewritten in (a slightly more) vectorized form as:
v = [dist(feature_mat, i, j) for i in range(size) for j in range(size)]
dist_mat = np.array(v).reshape(size, size)
Notice that we're still relying on Python rather than NumPy for some of the computation, but it's a step towards vectorization. Also notice that dist(i, j) is symmetric, so we could further reduce computations by approximately half. Perhaps consider:
v = [dist(feature_mat, i, j) for i in range(size) for j in range(i + 1)]
Now the tricky bit is assigning these computed values to the correct elements in dist_mat.
How fast this performs depends on the cost of computing dist(i, j). For small feature_mats, the cost of recomputing is not high enough to worry about this. But for large matrices, you definitely do not want to recompute.
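To close the loop on that tricky bit, here is one sketch (the index bookkeeping is my own, not from the answer above): compute only the pairs with j <= i and scatter each value into both triangles.

import numpy as np

size = feature_mat.shape[0]
i, j = np.tril_indices(size)   # all index pairs with j <= i
v = [dist(feature_mat, a, b) for a, b in zip(i, j)]
dist_mat = np.zeros((size, size))
dist_mat[i, j] = dist_mat[j, i] = v  # fills both triangles; the diagonal is written twice, harmlessly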

Related

Calculate only lower triangular elements of a matrix OR calculation on all possible pairs of the elements of a vector with jax

Is it possible to efficiently run some calculation on all possible pairs of the elements of a vector? I.e. I want to fill the lower triangular elements of a matrix (possibly flattened).
I.e. I want to:
calculate do_my_calculation(input_vector[i], input_vector[j])
for all i, j in [1, length(input_vector)] with j < i
and save all the results.
The shape of the results is not terribly important. If I can choose, however, I would prefer a vector corresponding to the unrolled lower triangle of the (i, j) matrix.
To illustrate what I would like to do in pseudo-python:
input_vector = np.arange(100)
result_vector = []
for i in range(1, len(input_vector)):
    for j in range(0, i):
        result_vector.append(do_my_calculation(input_vector[i], input_vector[j]))
Note: For this question, the types of input_vector and result_vector in the above code are not pertinent. Equally, I am of course happy to preallocate result_vector if required. I am using a list for the sake of conciseness of the sample code.
Edit 1: concrete example as requested by @ddejohn
Note: The question is not whether I can get this to run in jax, but whether I can get it to run efficiently, i.e. vectorized.
# Set up the problem
import numpy as np
dim = 15
input_vector_x = np.random.rand(dim)
input_vector_y = np.random.rand(dim)
output_vector = np.empty(np.tril_indices(dim, k=-1)[0].size)
assert input_vector_x.size == input_vector_y.size

# alternative implementation 1
counter = 0
for i in range(1, input_vector_x.size):
    for j in range(0, i):
        output_vector[counter] = (input_vector_y[j] - input_vector_y[i]) / (input_vector_x[j] - input_vector_x[i])
        counter += 1

# alternative implementation 2
indices = np.tril_indices(dim, k=-1)
i = indices[0]
j = indices[1]
output_vector = (input_vector_y[j] - input_vector_y[i]) / (input_vector_x[j] - input_vector_x[i])
There are a few ways to approach this. If you want to compute the full matrix of pairwise results, you could use typical numpy-style broadcasting, assuming your function supports it. Similarly, you could use JAX's Automatic Vectorization (vmap) functionality whether or not your function is compatible with broadcasting.
If you really wish to only compute each value once, you can do this using the lower or upper triangular indices. Note that although this performs fewer operations, you may find that in practice it's faster, particularly on accelerators like GPU and TPU, to compute the full result. The reason for this is that multi-dimensional indexing (the gather operation) is generally relatively expensive on this kind of hardware, so the overhead of doubling the number of function calls may be preferable.
Here's a demonstration of these three approaches:
import jax
import jax.numpy as jnp
key = jax.random.PRNGKey(5748395)
dim = 3
x = jax.random.uniform(key, (dim,))
def f(x1, x2):
    return (x1 * x2) / (x1 + x2)
# Option 1: full result, broadcasted operations
print(f(x[:, None], x[None, :]))
# [[0.34950745 0.00658672 0.28704265]
# [0.00658672 0.00332469 0.00655982]
# [0.28704265 0.00655982 0.24352014]]
# Option 2: full result, via vmap
f_mapped = jax.vmap(jax.vmap(f, (None, 0)), (0, None))
print(f_mapped(x, x))
# [[0.34950745 0.00658672 0.28704265]
# [0.00658672 0.00332469 0.00655982]
# [0.28704265 0.00655982 0.24352014]]
# Option 3: explicitly computing at lower-triangular indices
i, j = jnp.tril_indices(dim)
out_tril = f(x[i], x[j])
print(out_tril)
# [0.34950745 0.00658672 0.00332469 0.28704265 0.00655982 0.24352014]
print(jnp.zeros((dim, dim)).at[i, j].set(out_tril))
# [[0.34950745 0. 0. ]
# [0.00658672 0.00332469 0. ]
# [0.28704265 0.00655982 0.24352014]]

Creating a symmetric array with power of an element

I am trying to create a symmetric array whose (i, j) element is a parameter raised to the power |i - j|, as the code below shows.
I have written the following code to get this form with parameter being 0.5 and dimension being 4-by-4.
import numpy as np
a = np.eye(4)
for i in range(4):
    for j in range(4):
        a[i, j] = (0.5) ** (np.abs(i - j))
This does what I need, but for large dimensions (in the thousands) it causes a lot of overhead. Is there any other low-complexity method to get this matrix? Thanks.
We can leverage broadcasting: create a ranged array to represent the iterator variable, then perform an outer subtraction to simulate the i-j part -
n = 4
p = 0.5
I = np.arange(n)
out = p ** (np.abs(I[:,None]-I))
Optimization #1
We can do a lookup-based one with indexing, so that we save on the expensive power computations, like so -
out = (p**np.arange(n))[(np.abs(I[:,None]-I))]
Optimization #2
We can optimize further to use multi-cores with numexpr -
import numexpr as ne
out = ne.evaluate('p**abs(I2D-I)',{'I2D':I[:,None],'I':I})
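A quick equivalence check of the three variants (the size is arbitrary; p is passed explicitly to numexpr here so the sketch is self-contained):

import numpy as np
import numexpr as ne

n, p = 1000, 0.5
I = np.arange(n)
base = p ** np.abs(I[:, None] - I)                    # plain broadcasting
lookup = (p ** np.arange(n))[np.abs(I[:, None] - I)]  # precomputed powers, then indexing
nex = ne.evaluate('p**abs(I2D-I)', {'I2D': I[:, None], 'I': I, 'p': p})
assert np.allclose(base, lookup) and np.allclose(base, nex)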

Is there a more efficient (i.e. faster) way to compute correlated random walks?

Is there any way to vectorize or otherwise speed this up? I'm already jitting it using numba, but it is still a major bottleneck. Using numba to jit my functions on 1-d numpy arrays leads to code that is orders of magnitude faster, but there is essentially a negligible improvement when using numba on the 2-d arrays below. decomposition is a numpy matrix representing the Cholesky decomposition of the correlation matrix, x and n are constants, and nrand is numpy.random.
@jit
def generate_random_correlated_walks(decomposition, x, n):
    uncorrelated_walks = np.empty((2*x, n))
    for i in range(x):
        # Generate the random uncorrelated walks
        wv = nrand.normal(loc=0, scale=1, size=n)
        ws = nrand.normal(loc=0, scale=1, size=n)
        uncorrelated_walks[2*i] = wv
        uncorrelated_walks[(2*i) + 1] = ws

    # Create a matrix out of these walks
    uncorrelated_walks = np.matrix(uncorrelated_walks)
    m, n = uncorrelated_walks.shape
    correlated_walks = np.empty((m, n))

    # Go through each column and correlate the walk values
    for i in range(n):
        correlated_timestep = np.transpose(uncorrelated_walks[:, i]) * decomposition
        correlated_walks[:, i] = correlated_timestep
    return correlated_walks
EDIT: I have made the suggested changes, and now my code is as below, but unfortunately still is a major bottleneck. Any ideas?
@jit
def generate_random_correlated_walks(self, decomposition, x, n):
    rows = 2*x
    uncorrelated_walks = np.random.normal(loc=0, scale=1, size=(rows, n))
    correlated_walks = np.empty((rows, n))
    for i in range(n):
        correlated_timestep = np.dot(np.transpose(uncorrelated_walks[:, i]), decomposition)
        correlated_walks[:, i] = correlated_timestep
    return correlated_walks
The first thing you can improve is to remove your for loop. There is no advantage to calling np.random.normal a bunch of times with the same input parameters if you believe that it really generates random numbers.
Instead of using np.matrix, use np.array. This will make your life easier when you consider that the previous item can be used to shorten the entire first portion of your function into one step.
You can of course completely remove the final loop with a simple matrix multiplication: uncorrelated_walks.T @ decomposition will give you the transpose of your current correlated_walks. You can avoid one of the transposes by changing the order of the arguments.
You end up with something like:
def generate_random_correlated_walks(decomposition, x, n):
    uncorrelated_walks = nrand.normal(loc=0, scale=1, size=(2*x, n))
    correlated_walks = np.dot(decomposition.T, uncorrelated_walks)
    return correlated_walks
Not sure how much this will help you, but removing the Python-level loops should be some kind of boost since it will reduce the overhead of multiple numpy calls.
You could sacrifice legibility to turn the whole thing into a one-liner:
def generate_random_correlated_walks(decomposition, x, n):
    return np.dot(decomposition.T, nrand.normal(loc=0, scale=1, size=(2*x, n)))
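As a sanity check of the collapsed version (a sketch; the 2x2 correlation matrix is my own example, and note that the original code multiplies by decomposition on the right, which implies an upper-triangular factor R with R.T @ R == corr):

import numpy as np
nrand = np.random

corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])
decomposition = np.linalg.cholesky(corr).T  # upper-triangular R, so R.T @ R == corr
walks = generate_random_correlated_walks(decomposition, x=1, n=200000)
print(np.corrcoef(walks))  # off-diagonal entries should be close to 0.8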

numpy row pair sum of squared row wise differences without for loops (only api calls)

For those who can read Latex, this is what I am trying to compute:
$$k_{xyi} = \sum_{j}\left ( \left ( x_{i}-x_{j} \right )^{2}+\left ( y_{i}-y_{j} \right )^{2} \right )$$
where x and y are rows of a matrix A.
For computer language only folk this would translate as:
k(x,y,i) = sum_j( (xi - xj)^2 + (yi - yj)^2 )
where x and y are rows of a matrix A.
So k is a 3d matrix.
Can this be done with API calls only? (no for loops)
Here is testing startup:
import numpy as np
A = np.random.rand(4,4)
k = np.empty((4,4,4))
for ix in range(4):
    for iy in range(4):
        x = A[ix,]
        y = A[iy,]
        sx = np.power(x - x[:, np.newaxis], 2)
        sy = np.power(y - y[:, np.newaxis], 2)
        k[ix, iy] = (sx + sy).sum(axis=1).T
And now for the master coders, please replace the two for loops with numpy API calls.
Update:
Forgot to mention that I need a method that saves RAM; my A matrices are usually 20-30 thousand squared. So it would be great if your answer does not create huge temporary multidimensional arrays.
I would change your LaTeX to something more like the following, which is much less confusing imo (reconstructed from your code, where x and y index rows of A):
$$k_{xyi} = \sum_{j}\left((A_{xj}-A_{xi})^{2}+(A_{yj}-A_{yi})^{2}\right)$$
From this I assume the last line in your expression should really be:
k[ix,iy] = (sx + sy).sum(axis=-1)
If so, you can compute the above expression as follows:
Axij = (A[:, None, :] - A[..., None])**2
k = np.sum(Axij[:, None, :, :] + Axij, axis=-1)
The above first expands out a memory intensive 4D array. You can skip this if you are worried about memory by introducing a new for loop:
k = np.empty((4, 4, 4))
Axij = (A[:, None, :] - A[..., None])**2
for xi in range(A.shape[0]):
    k[xi] = np.sum(Axij[xi, None, :, :] + Axij, axis=-1)
This will be slower, but not by as much as you would think since you still do a lot of the operations in numpy. You could probably skip the 3D Axij intermediate, but again you are going to take a performance penalty doing so.
If your matrices are really 20k on an edge, your 3D output will be 64 TB. You are not going to do this in numpy, or even in memory (unless you have a large-scale distributed memory system).
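The arithmetic behind that estimate (mine), assuming float64:

# a (20000, 20000, 20000) float64 array:
print(20000**3 * 8 / 1e12, "TB")  # 64.0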

Diagonal Matrix Exponential in Python

I'm writing a numerical algorithm with speed in mind. I've come across the two matrix exponential functions in scipy/numpy (scipy.linalg.expm2, scipy.linalg.expm). However I have a matrix that I know to be diagonal beforehand. Do these scipy functions check if the matrix is diagonal before they run? Obviously the exponentiation algorithm can be much faster for a diagonal matrix, and I just want to make sure that these are doing something smart with that - if they aren't, is there an easy way to do it?
If a matrix is diagonal, then its exponential can be obtained by just exponentiating every entry on the main diagonal, so you can calculate it by:
np.diag(np.exp(np.diag(a)))
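A quick check that this agrees with the general-purpose routine on a diagonal input (a sketch; I use scipy.linalg.expm, since expm2 has long since been removed):

import numpy as np
from scipy.linalg import expm

a = np.diag(np.random.rand(5))
assert np.allclose(np.diag(np.exp(np.diag(a))), expm(a))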
If you know A is diagonal and you want the k-th power:
def dpow(a, k):
    return np.diag(np.diag(a) ** k)
Check if a matrix is diagonal:
def isdiag(a):
    return np.all(a == np.diag(np.diag(a)))
so:
def pow(a, k):
    if isdiag(a):
        return dpow(a, k)
    else:
        return np.asmatrix(a) ** k
Similarly for the exponential (which you can derive mathematically from the power series expansion of pow), you can do:
import scipy.linalg

def dexp(a):
    return np.diag(np.exp(np.diag(a)))

def exp(a):
    if isdiag(a):
        return dexp(a)
    else:
        return scipy.linalg.expm(a)  # or expm2, or whatever
I've developed a small tool that does the same as HYRY's method but faster, by operating in-place:
def diagonal(array):
    """Return a **view** of the diagonal elements of 'array'."""
    from numpy.lib.stride_tricks import as_strided
    return as_strided(array, shape=(min(array.shape),),
                      strides=(sum(array.strides),))
# generate a random diagonal array
d = np.diag(np.random.random(4000))
# in-place exponent of the diagonal elements
ddiag = diagonal(d)
ddiag[:] = np.exp(ddiag)
# timeit comparison with HYRY's method
%timeit -n10 np.diag(np.exp(np.diag(d)))
# out> 10 loops, best of 3: 52.1 ms per loop
%timeit -n10 ddiag = diagonal(d); ddiag[:] = np.exp(ddiag)
# out> 10 loops, best of 3: 108 µs per loop
Now,
HYRY's method is quadratic w.r.t. the diagonal length (probably because of the new array memory allocation), so if your matrices are of small dimension, the difference might not be as big.
You need to be OK with in-place computation.
Finally, the off-diagonal elements are 0, so their exponential should be 1, shouldn't it? In both our methods the off-diagonal elements are 0.
For that last part, if you want all off-diagonal elements to be 1, then you can do:
d2 = np.ones_like(d)
diagonal(d2)[:] = np.exp(np.diag(d))
print((d2 == np.exp(d)).all())  # True
But this is linear w.r.t. the array size, so quadratic w.r.t. the diagonal length. The timeit gives ~90 ms for a 4000x4000 array and 22.3 ms for a 2000x2000 one.
Finally, you can also do it in-place to get a little speed up:
diag = np.diag(d)
d[:]=1
diagonal(d)[:] = np.exp(diag)
Timeit gives 66.1 ms for a 4000^2 array and 16.8 ms for a 2000^2 one.
