How I can improve a loop for in a Python code?

How I can improve a loop for in a Python code? - python

I am translating this code from Matlab to Python. The code function fine but it is painfully slow in python. In Matlab, the code runs in way less then a minute, in python it took 30 min!!! Someone with mode experience in python could help me?
# P({ai})
somai = 0
for i in range(1, n):
somaj = 0
for j in range(1, n):
exponencial = math.exp(-((a[i] - a[j]) * (a[i] - a[j])) / dev_a2 - ((b[i] - b[j]) * (b[i] - b[j])) / dev_b2)
somaj = somaj + exponencial
somai = somai + somaj

As with MATLAB, I'd recommend you vectorize your code. Iterating by for-loops can be much slower than the lower level implementation of MATLAB and numpy.
Your operations (a[i] - a[j])*(a[i] - a[j]) are pairwise squared-Euclidean distance for all N data points. You can calculate a pairwise distance matrix using scipy's pdist and squareform functions -- pdist, squareform.
Then you calculate the difference between pairwise distance matrices A and B, and sum the exponential decay. So you could get a vectorized code like:
import numpy as np
from scipy.spatial.distance import pdist
from scipy.spatial.distance import squareform
# Example data
N = 1000
a = np.random.rand(N,1)
b = np.random.rand(N,1)
dev_a2 = np.random.rand()
dev_b2 = np.random.rand()
# `a` is an [N,1] matrix (i.e. column vector)
A = pdist(a, 'sqeuclidean')
# Change to pairwise distance matrix
A = squareform(A)
# Divide all elements by same divisor
A = A / dev_a2
# Then do the same for `b`'s
# `b` is an [N,1] matrix (i.e. column vector)
B = pdist(b, 'sqeuclidean')
B = squareform(B)
B = B / dev_b2
# Calculate exponential decay
expo = np.exp(-(A-B))
# Sum all elements
total = np.sum(expo)
Here's a quick timing comparison between the iterative method and this vectorized code.
N: 1000 | Iter Output: 2729989.851117 | Vect Output: 2732194.924364
Iter time: 6.759 secs | Vect time: 0.031 secs
N: 5000 | Iter Output: 24855530.997400 | Vect Output: 24864471.007726
Iter time: 171.795 secs | Vect time: 0.784 secs
Note that the final results are not exactly the same. I'm not sure why this is, it might be rounding error or math error on my part, but I'll leave that to you.

TLDR
Use numpy
Why Numpy?
Python, by default, is slow. One of the powers of python is that it plays nicely with C and has tons of libraries. The one that will help you hear is numpy. Numpy is mostly implemented in C and, when used properly, is blazing fast. The trick is to phrase the code in such a way that you keep the execution inside numpy and outside of python proper.
Code and Results
import math
import numpy as np
n = 1000
np_a = np.random.rand(n)
a = list(np_a)
np_b = np.random.rand(n)
b = list(np_b)
dev_a2, dev_b2 = (1, 1)
def old():
somai = 0.0
for i in range(0, n):
somaj = 0.0
for j in range(0, n):
tmp_1 = -((a[i] - a[j]) * (a[i] - a[j])) / dev_a2
tmp_2 = -((b[i] - b[j]) * (b[i] - b[j])) / dev_b2
exponencial = math.exp(tmp_1 + tmp_2)
somaj += exponencial
somai += somaj
return somai
def new():
tmp_1 = -np.square(np.subtract.outer(np_a, np_a)) / dev_a2
tmp_2 = -np.square(np.subtract.outer(np_b, np_b)) / dev_a2
exponential = np.exp(tmp_1 + tmp_2)
somai = np.sum(exponential)
return somai
old = 1.76 s ± 48.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
new = 24.6 ms ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is about a 70x improvement
old yields 740919.6020840995
new yields 740919.602084099
Explanation
You'll notice I broke up your code with the tmp_1 and tmp_2 a bit for clarity.
np.random.rand(n): This creates an array of length n that has random floats going from 0 to 1 (excluding 1) (documented here).
np.subtract.outer(a, b): Numpy has modules for all the operators that allow you do various things with them. Lets say you had np_a = [1, 2, 3], np.subtract.outer(np_a, np_a) would yield
array([[ 0, -1, -2],
[ 1, 0, -1],
[ 2, 1, 0]])
Here's a stackoverflow link if you want to go deeper on this. (also the word "outer" comes from "outer product" like from linear algebra)
np.square: simply squares every element in the matrix.
/: In numpy when you do arithmetic operators between scalars and matrices it does the appropriate thing and applies that operation to every element in the matrix.
np.exp: like np.square
np.sum: sums every element together and returns a scalar.

Related

why does it run faster when it was converted from dense to sparse array?

To simplify, I have the test code below:
from scipy.sparse import csr_matrix, dok_array, issparse
import numpy as np
from tqdm import tqdm
X = np.load('dense.npy')
# convert it to csr sparse matrix
#X = csr_matrix(X)
print(repr(X))
n = X.shape[0]
with tqdm(total=n*(n-1)//2) as pbar:
cooccur = dok_array((n, n), dtype='float32')
for i in range(n):
for j in range(i+1, n):
u, v = X[i], X[j]
if issparse(u):
u = u.toarray()[0]
v = v.toarray()[0]
#import pdb; pdb.set_trace()
m = u - v
min_uv = u - np.maximum(m, 0)
val = np.sum(min_uv - np.abs(m) * min_uv)
pbar.update()
Case 1: Run as it is - the time usage is like this (2min 54sec):
Case 2: uncomment the line X=csr_matrix(X) (just for the sake of comparison), the running time is 1min 56sec:
It is so weird and I can't figure out why it is even slower to operate on dense array. I subsampled the array for this test; for the original array, the run time difference between sparse and dense array is big (due to the large number of iterations.)
I put the code into a function and used line_profiler to see the time usage. My findings are: 1. slice indeed is much slower for sparse matrix; 2. the 3 lines above last line are much faster in Case 2; 3. the total run time is smaller for Case 2 even it takes extra time for slicing and converting to dense vector.
I am so confused why these 3 lines costed different run time in Case 1 and Case 2 - they are exactly the same numpy vectors in the two cases. Any explanations?
The dense.npy file is uploaded to here to reproduce the observation.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse import issparse
n = 1_000
sparsity = 0.98
A = np.random.rand(n, n)
A[A < sparsity] = 0
As = csr_matrix(A)
def _test(X):
n = X.shape[0]
for i in range(n):
for j in range(i+1, n):
u, v = X[i], X[j]
if issparse(u):
u = u.toarray()[0]
v = v.toarray()[0]
m = u - v
min_uv = u - np.maximum(m, 0)
val = np.sum(min_uv - np.abs(m) * min_uv)
Running this on dense:
%timeit _test(A)
5.3 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Running this on sparse:
%timeit _test(As)
1min 10s ± 1.06 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
This makes sense as you're not actually using a sparse data structure for anything - you're just expensively and inefficiently converting it back to a dense data structure every time your inner loop iterates.
I don't know how you got the runtimes you got as the order of magnitude difference between dense and sparse is what I would expect for the code you have provided.

Discrete Fourier Transform, speed up calculation

Input: hermitian matrix \rho_{i,j} with i,j=0,1,..d-1
Output: neg=\sum |W(mu,m)|-W(mu,m), the sum over all mu,m=0,..d-1 and
W(mu,m)=\sum exp(-4i\pi mu n /d) \rho_{(m-n)%d,(m+n)%d}, where n=0,..d-1
Problem: 1) for large d (d>5 000) a direct method (see snippet 1) is rather slow.
2) making use of 'np.fft.fft()' is much faster, but in the definition exponent with 2 is used rather than 4
https://docs.scipy.org/doc/numpy/reference/routines.fft.html#module-numpy.fft
Is it possible to improve snippet 1 using snippet 2 to boost speed calculation? May be one can use 2D fft?
Snippet 1:
W=np.zeros([d,d])
neg=0
for mu in range(d):
for m in range(d):
x=0
for n in range(d):
x+=np.exp(-4*np.pi*1.0j*mu*n/N)*rho[(m-n)%d,(m+n)%d]
W[mu,m]=x.real
neg+=np.abs(W[mu,m])-W[mu,m]
Snippet 2:
# create matrix \rho
psi=np.random.rand(500)
psi=psi/np.linalg.norm(psi) # normalize it
d=len(psi)
rho=np.outer(psi,np.conj(psi))
#
start_time=time.time()
m=1 # for example, use particular m
a=np.array([rho[(m-nn)%d,(m+nn)%d] for nn in range(d)])
ft=np.fft.fft(a)
end_time=time.time()
print(end_time-start_time)

Remove the nested loops by exploiting numpys array arithmetics.
import numpy as np
def my_sins(x):
return np.sin(2*x) + np.cos(4*x) + np.sin(3*x)
def dft(x, n=None):
if n is None:
n = len(x)
k = len(x)
cn = np.sum(x*np.exp(-2*np.pi*1j*np.outer(np.arange(k),np.arange(n))/n),1)
return cn
For some sample data
x = np.linspace(0,2*np.pi,1000)
y = my_sins(x)
%timeit dft(y)
On my system this yields:
145 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

How to find the pairwise differences between rows of two very large matrices using numpy?

Given two matrices, I want to compute the pairwise differences between all rows. Each matrix has 1000 rows and 100 columns so they are fairly large. I tried using a for loop and pure broadcasting but the for loop seem to be working faster. Am I doing something wrong? Here is the code:
from numpy import *
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
The broadcasting method takes about 1 second longer and it's even longer for large matrices. Any idea how to speed this up purely using numpy?

Here's another way to perform :
(a-b)^2 = a^2 + b^2 - 2ab
with np.einsum for the first two terms and dot-product for the third one -
import numpy as np
np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - 2*np.dot(A,B.T)
Runtime test
Approaches -
def loopy_app(A,B):
m,n = A.shape[0], B.shape[0]
out = np.empty((m,n))
for i,a in enumerate(A):
out[i] = np.sum((a - B)**2,1)
return out
def broadcasting_app(A,B):
return ((A[:,np.newaxis,:] - B)**2).sum(-1)
# #Paul Panzer's soln
def outer_sum_dot_app(A,B):
return np.add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*np.dot(A,B.T)
# #Daniel Forsman's soln
def einsum_all_app(A,B):
return np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], \
A[:,None,:] - B[None,:,:])
# Proposed in this post
def outer_einsum_dot_app(A,B):
return np.einsum('ij,ij->i',A,A)[:,None] + np.einsum('ij,ij->i',B,B) - \
2*np.dot(A,B.T)
Timings -
In [51]: A = np.random.randn(1000,100)
...: B = np.random.randn(1000,100)
...:
In [52]: %timeit loopy_app(A,B)
...: %timeit broadcasting_app(A,B)
...: %timeit outer_sum_dot_app(A,B)
...: %timeit einsum_all_app(A,B)
...: %timeit outer_einsum_dot_app(A,B)
...:
10 loops, best of 3: 136 ms per loop
1 loops, best of 3: 302 ms per loop
100 loops, best of 3: 8.51 ms per loop
1 loops, best of 3: 341 ms per loop
100 loops, best of 3: 8.38 ms per loop

Here is a solution which avoids both the loop and the large intermediates:
from numpy import *
import time
A = random.randn(1000,100)
B = random.randn(1000,100)
start = time.time()
for a in A:
sum((a - B)**2,1)
print time.time() - start
# pure broadcasting
start = time.time()
((A[:,newaxis,:] - B)**2).sum(-1)
print time.time() - start
#matmul
start = time.time()
add.outer((A*A).sum(axis=-1), (B*B).sum(axis=-1)) - 2*dot(A,B.T)
print time.time() - start
Prints:
0.546781778336
0.674743175507
0.10723400116

Another job for np.einsum
np.einsum('ijk,ijk->ij', A[:,None,:] - B[None,:,:], A[:,None,:] - B[None,:,:])

Similar to #paul-panzer, a general way to compute pairwise differences of arrays of arbitrary dimension is broadcasting as follows:
Let v be a NumPy array of size (n, d):
import numpy as np
v_tiled_across = np.tile(v[:, np.newaxis, :], (1, v.shape[0], 1))
v_tiled_down = np.tile(v[np.newaxis, :, :], (v.shape[0], 1, 1))
result = v_tiled_across - v_tiled_down
For a better picture what's happening, imagine each d-dimensional row of v being propped up like a flagpole, and copied across and down. Now when you do component-wise subtraction, you're getting each pairwise combination.
__
There's also scipy.spatial.distance.pdist, which computes a metric in a pairwise fashion.
from scipy.spatial.distance import pdist, squareform
pairwise_L2_norms = squareform(pdist(v))

Compute Jaccard distances on sparse matrix

I have a large sparse matrix - using sparse.csr_matrix from scipy. The values are binary. For each row, I need to compute the Jaccard distance to every row in the same matrix. What's the most efficient way to do this? Even for a 10.000 x 10.000 matrix, my runtime takes minutes to finish.
Current solution:
def jaccard(a, b):
intersection = float(len(set(a) & set(b)))
union = float(len(set(a) | set(b)))
return 1.0 - (intersection/union)
def regions(csr, p, epsilon):
neighbors = []
for index in range(len(csr.indptr)-1):
if jaccard(p, csr.indices[csr.indptr[index]:csr.indptr[index+1]]) <= epsilon:
neighbors.append(index)
return neighbors
csr = scipy.sparse.csr_matrix("file")
regions(csr, 0.51) #this is called for every row

Vectorization is relatively easy if you use matrix multiplication to calculate the set intersections and then the rule |union(a, b)| == |a| + |b| - |intersection(a, b)| to determine the unions:
# Not actually necessary for sparse matrices, but it is for
# dense matrices and ndarrays, if X.dtype is integer.
from __future__ import division
def pairwise_jaccard(X):
"""Computes the Jaccard distance between the rows of `X`.
"""
X = X.astype(bool).astype(int)
intrsct = X.dot(X.T)
row_sums = intrsct.diagonal()
unions = row_sums[:,None] + row_sums - intrsct
dist = 1.0 - intrsct / unions
return dist
Note the cast to bool and then int, because the dtype of X must be large enough to accumulate twice the maximum row sum and that entries of X must be either zero or one. The downside of this code is that it's heavy on RAM, because unions and dists are dense matrices.
If you're only interested in distances smaller than some cut-off epsilon, the code can be tuned for sparse matrices:
from scipy.sparse import csr_matrix
def pairwise_jaccard_sparse(csr, epsilon):
"""Computes the Jaccard distance between the rows of `csr`,
smaller than the cut-off distance `epsilon`.
"""
assert(0 < epsilon < 1)
csr = csr_matrix(csr).astype(bool).astype(int)
csr_rownnz = csr.getnnz(axis=1)
intrsct = csr.dot(csr.T)
nnz_i = np.repeat(csr_rownnz, intrsct.getnnz(axis=1))
unions = nnz_i + csr_rownnz[intrsct.indices] - intrsct.data
dists = 1.0 - intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = np.add.reduceat(mask, intrsct.indptr[:-1])
indptr = np.r_[0, np.cumsum(rownnz)]
out = csr_matrix((data, indices, indptr), intrsct.shape)
return out
If this still takes to much RAM you could try to vectorize over one dimension and Python-loop over the other.

To add to the accepted answer: I had use for a weighted version of the above method which is simply implemented as:
def pairwise_jaccard_sparse_weighted(csr, epsilon, weight):
csr = scipy.sparse.csr_matrix(csr).astype(bool).astype(int)
csr_w = csr * scipy.sparse.diags(weight)
csr_rowsum = numpy.array(csr_w.sum(axis = 1)).flatten()
intrsct = csr.dot(csr_w.T)
rowsum_i = numpy.repeat(csr_rowsum, intrsct.getnnz(axis = 1))
unions = rowsum_i + csr_rowsum[intrsct.indices] - intrsct.data
dists = 1.0 - 1.0 * intrsct.data / unions
mask = (dists > 0) & (dists <= epsilon)
data = dists[mask]
indices = intrsct.indices[mask]
rownnz = numpy.add.reduceat(mask, intrsct.indptr[:-1])
indptr = numpy.r_[0, numpy.cumsum(rownnz)]
out = scipy.sparse.csr_matrix((data, indices, indptr), intrsct.shape)
return out
I doubt this is the most efficient implementation, but it's a damn sight quicker than the dense implementation in scipy.spatial.distance.jaccard.

Here a solution that has a scikit-learn-like API.
def pairwise_sparse_jaccard_distance(X, Y=None):
"""
Computes the Jaccard distance between two sparse matrices or between all pairs in
one sparse matrix.
Args:
X (scipy.sparse.csr_matrix): A sparse matrix.
Y (scipy.sparse.csr_matrix, optional): A sparse matrix.
Returns:
numpy.ndarray: A similarity matrix.
"""
if Y is None:
Y = X
assert X.shape[1] == Y.shape[1]
X = X.astype(bool).astype(int)
Y = Y.astype(bool).astype(int)
intersect = X.dot(Y.T)
x_sum = X.sum(axis=1).A1
y_sum = Y.sum(axis=1).A1
xx, yy = np.meshgrid(x_sum, y_sum)
union = ((xx + yy).T - intersect)
return (1 - intersect / union).A
Here some testing and benchmarking:
>>> import timeit
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn.metrics import pairwise_distances
>>> X = csr_matrix(np.random.choice(a=[False, True], size=(10000, 1000), p=[0.9, 0.1]))
>>> Y = csr_matrix(np.random.choice(a=[False, True], size=(1000, 1000), p=[0.9, 0.1]))
Asserting that all results are approximately equivalent
>>> custom_jaccard_distance = pairwise_sparse_jaccard_distance(X, Y)
>>> sklearn_jaccard_distance = pairwise_distances(X.todense(), Y.todense(), "jaccard")
>>> np.allclose(custom_jaccard_distance, sklearn_jaccard_distance)
True
Benchmarking runtime (from Jupyer Notebook)
>>> %timeit pairwise_jaccard_index(X, Y)
795 ms ± 58.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit 1 - pairwise_distances(X.todense(), Y.todense(), "jaccard")
14.7 s ± 694 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why doesn't Numba improve this iteration ...?

I am trying out Numba in speeding up a function that computes a minimum conditional probability of joint occurrence.
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,2))
def cooccurance_probability(X):
P = X.shape[1]
CS = np.sum(X, axis=0) #Column Sums
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
D[i, j] = (X[:,i] * X[:,j]).sum() / max(CS[i], CS[j])
return D
cooccurance_probability_numba = autojit(cooccurance_probability)
However I am finding that the performance of cooccurance_probability and cooccurance_probability_numba to be pretty much the same.
%timeit cooccurance_probability(X)
1 loops, best of 3: 302 ms per loop
%timeit cooccurance_probability_numba(X)
1 loops, best of 3: 307 ms per loop
Why is this? Could it be due to the numpy element by element operation?
I am following as an example:
http://nbviewer.ipython.org/github/ellisonbg/talk-sicm2-2013/blob/master/NumbaCython.ipynb
[Note: I could half the execution time due to the symmetric nature of the problem - but that isn't my main concern]

My guess would be that you're hitting the object layer instead of generating native code due to the calls to sum, which means that Numba isn't going to speed things up significantly. It just doesn't know how to optimize/translate sum (at this point). Additionally it's usually better to unroll vectorized operations into explicit loops with Numba. Notice that the ipynb that you link to only calls out to np.sqrt which I believe does get translated to machine code, and it operates on elements, not slices. I would try to expand out the sum in the inner loop as an explicit additional loop over elements, rather than taking slices and using the sum method.
My experience is that Numba can work wonders sometimes, but it doesn't speed-up arbitrary python code. You need to get a sense of the limitations and what it can optimize effectively. Also note that v0.11 is a bit different in this regard as compared to 0.12 and 0.13 due to the major refactoring that Numba went through between those versions.

Below is a solution using Josh's advice, which is spot on. It appears however the max() works fine in the below implementation. It would be great if there was a list of "safe" python / numpy functions.
Note: I reduced the dimensionality of the original matrix to 100 x 200]
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,200))
def cooccurance_probability_explicit(X):
C = X.shape[0]
P = X.shape[1]
# - Column Sums - #
CS = np.zeros((P,), dtype=np.float)
for p in range(P):
for c in range(C):
CS[p] += X[c,p]
D = np.empty((P, P), dtype=np.float) #Return Matrix
for i in range(P):
for j in range(P):
# - Compute Elemental Pairwise Sums over each Product Vector - #
pws = 0
for c in range(C):
pws += (X[c,i] * X[c,j])
D[i,j] = pws / max(CS[i], CS[j])
return D
cooccurance_probability_explicit_numba = autojit(cooccurance_probability_explicit)
%timeit results:
%timeit cooccurance_probability(X)
10 loops, best of 3: 83 ms per loop
%timeit cooccurance_probability_explicit(X)
1 loops, best of 3: 2.55s per loop
%timeit cooccurance_probability_explicit_numba(X)
100 loops, best of 3: 7.72 ms per loop
The interesting thing about the results is that the explicitly written version executed by python is very slow due to the large type checking overheads. But passing through Numba works it's magic. (Numba is ~11.5 times faster than the python solution using Numpy).
Update: Added a Cython Function for Comparison (thanks to moarningsun: Cython function with variable sized matrix input)
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
def cooccurance_probability_cy(double[:,:] X):
cdef int C, P, i, j, k
C = X.shape[0]
P = X.shape[1]
cdef double pws
cdef double [:] CS = np.sum(X, axis=0)
cdef double [:,:] D = np.empty((P,P), dtype=np.float)
for i in range(P):
for j in range(P):
pws = 0.0
for c in range(C):
pws += (X[c, i] * X[c, j])
D[i,j] = pws / max(CS[i], CS[j])
return D
%timeit results:
%timeit cooccurance_probability_cy(X)
100 loops, best of 3: 12 ms per loop

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How I can improve a loop for in a Python code? - python

Related

why does it run faster when it was converted from dense to sparse array?

Discrete Fourier Transform, speed up calculation

How to find the pairwise differences between rows of two very large matrices using numpy?

Compute Jaccard distances on sparse matrix

Why doesn't Numba improve this iteration ...?

Categories

Resources