I am trying to implement LDA using Gibbs sampling. In the step that updates each topic proportion, I have a four-layer loop that runs extremely slowly, and I am not sure how to improve the efficiency of this code. The code I have now is the following:
N_W is the number of words (vocabulary size), N_D is the number of documents, Z[i,j] is the topic assignment of the j-th word in the i-th document (K possible assignments), X[i,j] is the j-th word in the i-th document (an index into the vocabulary), and Beta is of dimension [K, N_W].
And the update is the following:
for k in range(K):  # iteratively update each topic
    n_k = np.zeros(N_W)  # vocabulary size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times word w is assigned to topic k
                n_k[w] += (X[i, j] == w) and (Z[i, j] == k)
    # update
    Beta[k, :] = np.random.dirichlet(gamma + n_k)
You could get rid of the last two for loops using NumPy's logical functions:
for k in range(K):  # iteratively update each topic
    n_k = np.zeros(N_W)  # vocabulary size
    for w in range(N_W):
        a = np.logical_not(X - w)  # all X[i, j] == w become True, others False
        b = np.logical_not(Z - k)  # all Z[i, j] == k become True, others False
        c = np.logical_and(a, b)   # all (i, j) where X[i, j] == w and Z[i, j] == k are True, others False
        n_k[w] = np.sum(c)         # sum all True values
Or even as a one-liner:
n_k = np.array([[np.sum(np.logical_and(np.logical_not(X[:N_D,:N_W]-w), np.logical_not(Z[:N_D,:N_W]-k))) for w in range(N_W)] for k in range(K)])
Each row of n_k can then be used for the Beta calculation. This version also uses N_D and N_W as explicit bounds, in case they are smaller than the dimensions of X and Z.
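As a small illustration (my own sketch, reusing gamma and K from the question, not code from the answer), the Beta update could then be written as:

# Each row of n_k (one per topic) feeds one Dirichlet draw.
Beta = np.array([np.random.dirichlet(gamma + n_k[k]) for k in range(K)])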
I did some testing with the following matrices:
import numpy as np
K = 90
N_D = 11
N_W = 12
Z = np.random.randint(0, K, size=(N_D, N_W))
X = np.random.randint(0, N_W, size=(N_D, N_W))
gamma = 1
The original code:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively update each topic
    n_k = np.zeros(N_W)  # vocabulary size
    for w in range(N_W):
        for i in range(N_D):
            for j in range(N_W):
                # counting number of times word w is assigned to topic k
                n_k[w] += (X[i, j] == w) and (Z[i, j] == k)
    # update
    Beta[k, :] = np.random.dirichlet(gamma + n_k)
865 ms ± 8.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then vectorising only the inner two loops:
%%timeit
Beta = np.zeros((K, N_W))
for k in range(K):  # iteratively update each topic
    n_k = np.zeros(N_W)  # vocabulary size
    for w in range(N_W):
        n_k[w] = np.sum((X == w) & (Z == k))
    # update
    Beta[k, :] = np.random.dirichlet(gamma + n_k)
21.6 ms ± 542 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Finally, with some creative application of broadcasting, and by hoisting the part that is common to all topics out of the loop:
%%timeit
Beta = np.zeros((K, N_W))
w = np.arange(N_W)
X_eq_w = np.equal.outer(X, w)  # shape (N_D, N_W, N_W): X_eq_w[i, j, w] = (X[i, j] == w)
for k in range(K):  # iteratively update each topic
    n_k = np.sum(X_eq_w & (Z == k)[:, :, None], axis=(0, 1))
    # update
    Beta[k, :] = np.random.dirichlet(gamma + n_k)
4.6 ms ± 92.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The trade-off here is between speed and memory. For the shapes I used this was not so memory-intensive, but the intermediate three-dimensional arrays I built in the last solution could get quite large.
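If memory is a concern, one further option (my own sketch, not one of the timed variants above) is to accumulate all K x N_W counts in a single pass with np.add.at, which avoids building the (N_D, N_W, N_W) intermediate entirely:

# Count every (topic, word) pair in one pass over Z and X.
n_k_all = np.zeros((K, N_W))
np.add.at(n_k_all, (Z.ravel(), X.ravel()), 1)  # n_k_all[k, w] = number of (i, j) with Z[i, j] == k and X[i, j] == w

Each row n_k_all[k] can then feed the Dirichlet update exactly as before.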
The task is to combine two arrays row by row (building all row-pair combinations), based on the product of the two corresponding weight vectors. For example, given rows:
Row1_A,
Row2_A,
Row3_A,
Row1_B,
Row2_B,
Row3_B,
The result should be: Row1_A_Row1_B, Row1_A_Row2_B, Row1_A_Row3_B, Row2_A_Row1_B, etc.
Given the following initial arrays:
n_rows = 1000
A = np.random.randint(10, size=(n_rows, 5))
B = np.random.randint(10, size=(n_rows, 5))
P_A = np.random.rand(n_rows, 1)
P_B = np.random.rand(n_rows, 1)
P_A and P_B are weight vectors corresponding to the rows of A and B, holding one float per row. A combined row should only appear in the final array if the product of the two weights surpasses a certain threshold, for example:
lim = 0.8
I have thought of the following functions or ways to solve this problem, but I would be interested in faster solutions. I am open to using numba or other libraries, but ideally I would like to improve the vectorized solution using numpy.
Method A
def concatenate_per_row(A, B):
    m1, n1 = A.shape
    m2, n2 = B.shape
    out = np.zeros((m1, m2, n1 + n2), dtype=A.dtype)
    out[:, :, :n1] = A[:, None, :]
    out[:, :, n1:] = B
    return out.reshape(m1 * m2, -1)
%%timeit
A_B = concatenate_per_row(A, B)
P_A_B = (P_A[:, None]*P_B[None, :])
P_A_B = P_A_B.flatten()
idx = P_A_B > lim
A_B = A_B[idx, :]
P_A_B = P_A_B[idx]
37.8 ms ± 660 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method B
%%timeit
A_B = []
P_A_B = []
for i in range(len(P_A)):
    P_A_B_i = P_A[i] * P_B
    idx = np.where(P_A_B_i > lim)[0]
    if len(idx) > 0:
        P_A_B.append(P_A_B_i[idx])
        A_B_i = np.zeros((len(idx), A.shape[1] + B.shape[1]), dtype='int')
        A_B_i[:, :A.shape[1]] = A[i]
        A_B_i[:, A.shape[1]:] = B[idx, :]
        A_B.append(A_B_i)
A_B = np.concatenate(A_B)
P_A_B = np.concatenate(P_A_B)
9.65 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
First of all, there is a more efficient algorithm. Indeed, you can pre-compute the size of the output array, so that values can be written directly into the final output arrays rather than stored temporarily in lists. To find the size efficiently, you can sort the array P_B and then use a binary search to find the number of values greater than lim/P_A[i,0] for every possible i (P_B*P_A[i,0] > lim is equivalent to P_B > lim/P_A[i,0]). The number of items kept for each i can be stored temporarily so the filtered items can be looped over quickly.
Moreover, you can use Numba to significantly speed up the computation of the main loop.
Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('(int_[:,::1], int_[:,::1], float64[:,::1], float64[:,::1])')
def compute(A, B, P_A, P_B):
    assert P_A.shape[1] == 1
    assert P_B.shape[1] == 1
    # lim is taken from the enclosing scope (treated as a compile-time constant by Numba)
    P_B_sorted = np.sort(P_B.reshape(P_B.size))
    counts = len(P_B) - np.searchsorted(P_B_sorted, lim / P_A[:, 0], side='right')
    n = np.sum(counts)
    mA, mB = A.shape[1], B.shape[1]
    m = mA + mB
    A_B = np.empty((n, m), dtype=np.int_)
    P_A_B = np.empty((n, 1), dtype=np.float64)
    k = 0
    for i in range(P_A.shape[0]):
        if counts[i] > 0:
            idx = np.where(P_B > lim / P_A[i, 0])[0]
            assert counts[i] == len(idx)
            start, end = k, k + counts[i]
            A_B[start:end, :mA] = A[i, :]
            A_B[start:end, mA:] = B[idx, :]
            P_A_B[start:end, :] = P_B[idx, :] * P_A[i, 0]
            k += counts[i]
    return A_B, P_A_B
Here are performance results on my machine:
Original: 35.6 ms
Optimized original: 18.2 ms
Proposed (with order): 0.9 ms
Proposed (no ordering): 0.3 ms
The algorithm proposed above is 20 times faster than the original optimized algorithm. It can be made even faster: if the order of the items does not matter, you can use an argsort to reorder both B and P_B. This lets you avoid computing idx every time in the hot loop and instead select the last elements of B and P_B directly (they are guaranteed to be above the threshold, although not in the same order as in the original code). Because the selected items are stored contiguously in memory, this implementation is much faster. In the end, this last implementation is about 60 times faster than the original optimized algorithm. Note that the proposed implementations are significantly faster than the original ones even without Numba.
Here is the implementation that does not care about the order of the items:
@nb.njit('(int_[:,::1], int_[:,::1], float64[:,::1], float64[:,::1])')
def compute(A, B, P_A, P_B):
    assert P_A.shape[1] == 1
    assert P_B.shape[1] == 1
    nA, mA = A.shape
    nB, mB = B.shape
    m = mA + mB
    order = np.argsort(P_B.reshape(nB))
    P_B_sorted = P_B[order, :]
    B_sorted = B[order, :]
    counts = nB - np.searchsorted(P_B_sorted.reshape(nB), lim / P_A[:, 0], side='right')
    nRes = np.sum(counts)
    A_B = np.empty((nRes, m), dtype=np.int_)
    P_A_B = np.empty((nRes, 1), dtype=np.float64)
    k = 0
    for i in range(P_A.shape[0]):
        if counts[i] > 0:
            start, end = k, k + counts[i]
            A_B[start:end, :mA] = A[i, :]
            A_B[start:end, mA:] = B_sorted[nB - counts[i]:, :]
            P_A_B[start:end, :] = P_B_sorted[nB - counts[i]:, :] * P_A[i, 0]
            k += counts[i]
    return A_B, P_A_B
Input: a Hermitian matrix \rho_{i,j} with i, j = 0, 1, ..., d-1.
Output: neg = \sum_{\mu,m} ( |W(\mu,m)| - W(\mu,m) ), the sum running over all \mu, m = 0, ..., d-1, where
W(\mu,m) = \sum_{n=0}^{d-1} \exp(-4 i \pi \mu n / d) \, \rho_{(m-n)\%d, (m+n)\%d}.
Problem: 1) for large d (d > 5000) the direct method (see Snippet 1) is rather slow.
2) Using np.fft.fft() is much faster, but its definition has a 2 in the exponent rather than the 4 used above.
https://docs.scipy.org/doc/numpy/reference/routines.fft.html#module-numpy.fft
Is it possible to improve Snippet 1 using Snippet 2 to boost the speed of the calculation? Maybe one can use a 2D FFT?
Snippet 1:
W = np.zeros([d, d])
neg = 0
for mu in range(d):
    for m in range(d):
        x = 0
        for n in range(d):
            x += np.exp(-4 * np.pi * 1.0j * mu * n / d) * rho[(m - n) % d, (m + n) % d]
        W[mu, m] = x.real
        neg += np.abs(W[mu, m]) - W[mu, m]
Snippet 2:
import numpy as np
import time

# create the matrix \rho
psi=np.random.rand(500)
psi=psi/np.linalg.norm(psi) # normalize it
d=len(psi)
rho=np.outer(psi,np.conj(psi))
#
start_time=time.time()
m=1 # for example, use particular m
a=np.array([rho[(m-nn)%d,(m+nn)%d] for nn in range(d)])
ft=np.fft.fft(a)
end_time=time.time()
print(end_time-start_time)
Remove the nested loops by exploiting NumPy's array arithmetic.
import numpy as np

def my_sins(x):
    return np.sin(2 * x) + np.cos(4 * x) + np.sin(3 * x)

def dft(x, n=None):
    if n is None:
        n = len(x)
    k = len(x)
    cn = np.sum(x * np.exp(-2 * np.pi * 1j * np.outer(np.arange(k), np.arange(n)) / n), 1)
    return cn
For some sample data
x = np.linspace(0,2*np.pi,1000)
y = my_sins(x)
%timeit dft(y)
On my system this yields:
145 ms ± 953 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
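The same idea also addresses the factor-of-4 concern directly: since exp(-4i*pi*mu*n/d) = exp(-2i*pi*(2*mu)*n/d), W(mu, m) is simply the (2*mu mod d)-th DFT coefficient of the sequence a_m[n] = rho[(m-n)%d, (m+n)%d], so np.fft.fft can still be used. Here is a possible sketch (my own illustration, reusing the rho construction from Snippet 2, not code from the answer above):

import numpy as np

d = 500
psi = np.random.rand(d)
psi /= np.linalg.norm(psi)
rho = np.outer(psi, np.conj(psi))

n = np.arange(d)
m = np.arange(d)[:, None]
a = rho[(m - n) % d, (m + n) % d]          # a[m, n] = rho[(m-n) % d, (m+n) % d]
F = np.fft.fft(a, axis=1)                  # F[m, k] = sum_n a[m, n] * exp(-2i*pi*k*n/d)
W = F[:, (2 * np.arange(d)) % d].T.real    # W[mu, m] = F[m, (2*mu) % d]
neg = np.sum(np.abs(W) - W)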
In scipy.spatial there is the Delaunay class. The documentation includes an example of how to calculate barycentric coordinates.
Following that example, the following code will calculate barycentric coordinates using a loop.
import numpy as np
from scipy.spatial import Delaunay

points = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
samples = np.array([(0.5, 0.5), (0, 0), (0.1, 0.1)])

dim = len(points[0])                # determine the dimension of the samples
simp = Delaunay(points)             # create simplexes for the defined points
s = simp.find_simplex(samples)      # for each sample, find the corresponding simplex
b0 = np.zeros((len(samples), dim))  # reserve space for each barycentric coordinate
for ii in range(len(samples)):
    b0[ii, :] = simp.transform[s[ii], :dim].dot((samples[ii] - simp.transform[s[ii], dim]).transpose())
coord = np.c_[b0, 1 - b0.sum(axis=1)]
This is OK for a short list of samples to convert to barycentric coordinates, but for very large lists of samples the performance is poor. How can this be modified to take advantage of vectorized math in numpy/scipy to improve performance?
Consider the following modification (for-loop replaced with numpy methods):
import numpy as np
import scipy.spatial as ssp

def f_1(points, samples):
    """ original """
    dim = len(points[0])
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = np.zeros((len(samples), dim))
    for ii in range(len(samples)):
        b0[ii, :] = simp.transform[s[ii], :dim].dot(
            (samples[ii] - simp.transform[s[ii], dim]).transpose())
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord

def f_2(points, samples):
    """ modified """
    simp = ssp.Delaunay(points)
    s = simp.find_simplex(samples)
    b0 = (simp.transform[s, :points.shape[1]].transpose([1, 0, 2]) *
          (samples - simp.transform[s, points.shape[1]])).sum(axis=2).T
    coord = np.c_[b0, 1 - b0.sum(axis=1)]
    return coord
Test case:
import itertools

N = 100
points = np.array(list(itertools.product(range(N), repeat=2)))
samples = np.random.rand(100_000, 2) * N
Result:
%timeit f_1(points, samples)
712 ms ± 2.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit f_2(points, samples)
422 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the modified version, the line simp.find_simplex(samples) accounts for about 95% of the running time, so I guess there is nothing more you can do with vectorization. To improve performance further you would need another implementation of the find_simplex method, or another approach to the problem.
I want to compute the GLCM (gray-level co-occurrence matrix) using Python (NumPy).
I've written the code below, and it gives correct results for the four angles, but it is very slow: processing 1000 pictures of dimension 128x128 takes about 35 minutes.
def getGLCM(image, distance, direction):
    npPixel = np.array(image)  # image as a numpy array
    glcm = np.zeros((255, 255), dtype=int)
    if direction == 1:    # direction 90° up ↑
        for i in range(distance, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1]):
                glcm[npPixel[i, j], npPixel[i - distance, j]] += 1
    elif direction == 2:  # direction 45° up-right ↗
        for i in range(distance, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i - distance, j + distance]] += 1
    elif direction == 3:  # direction 0° right →
        for i in range(0, npPixel.shape[0]):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i, j + distance]] += 1
    elif direction == 4:  # direction -45° down-right ↘
        for i in range(0, npPixel.shape[0] - distance):
            for j in range(0, npPixel.shape[1] - distance):
                glcm[npPixel[i, j], npPixel[i + distance, j + distance]] += 1
    return glcm
I need help to make this code faster
Thanks.
There is a bug in your code. You need to change the initialization of the gray level co-occurrence matrix to glcm = np.zeros((256, 256), dtype=int), otherwise if the image to process contains some pixels with the intensity level 255, the function getGLCM will throw an error.
Here's a pure NumPy implementation that improves performance through vectorization:
def vectorized_glcm(image, distance, direction):
    img = np.array(image)
    glcm = np.zeros((256, 256), dtype=int)
    if direction == 1:
        first = img[distance:, :]
        second = img[:-distance, :]
    elif direction == 2:
        first = img[distance:, :-distance]
        second = img[:-distance, distance:]
    elif direction == 3:
        first = img[:, :-distance]
        second = img[:, distance:]
    elif direction == 4:
        first = img[:-distance, :-distance]
        second = img[distance:, distance:]
    for i, j in zip(first.ravel(), second.ravel()):
        glcm[i, j] += 1
    return glcm
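Note that the zip loop above still visits every pixel pair in Python, which is why vectorized_glcm is only marginally faster than the original. As a further tweak of my own (not part of the answer above), the accumulation itself can be done in one call with np.add.at, which correctly handles repeated index pairs:

def glcm_add_at(image, distance, direction):
    # Same slicing as vectorized_glcm above, but the Python-level loop is
    # replaced by a single bulk accumulation (my sketch, not the original answer).
    img = np.array(image)
    glcm = np.zeros((256, 256), dtype=int)
    if direction == 1:    # 90° up
        first, second = img[distance:, :], img[:-distance, :]
    elif direction == 2:  # 45° up-right
        first, second = img[distance:, :-distance], img[:-distance, distance:]
    elif direction == 3:  # 0° right
        first, second = img[:, :-distance], img[:, distance:]
    else:                 # -45° down-right
        first, second = img[:-distance, :-distance], img[distance:, distance:]
    np.add.at(glcm, (first.ravel(), second.ravel()), 1)  # bulk-increment every co-occurrence pair
    return glcm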
If you are open to using other packages, I would strongly recommend scikit-image's greycomatrix. As shown below, it speeds up the calculation by two orders of magnitude.
Demo
In [93]: from skimage import data
In [94]: from skimage.feature import greycomatrix
In [95]: img = data.camera()
In [96]: a = getGLCM(img, 1, 1)
In [97]: b = vectorized_glcm(img, 1, 1)
In [98]: c = greycomatrix(img, distances=[1], angles=[-np.pi/2], levels=256)
In [99]: np.array_equal(a, b)
Out[99]: True
In [100]: np.array_equal(a, c[:, :, 0, 0])
Out[100]: True
In [101]: %timeit getGLCM(img, 1, 1)
240 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [102]: %timeit vectorized_glcm(img, 1, 1)
203 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [103]: %timeit greycomatrix(img, distances=[1], angles=[-np.pi/2], levels=256)
1.46 ms ± 15.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Per https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np

def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Calculate mean absolute deviation in two steps, for weights.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute deviation.
    return 0.5 * rmad
This is clean and works well for medium-sized arrays, but as warned in its initial suggestion (https://stackoverflow.com/a/39513799/1840471) it's O(n^2). On my computer that means it breaks after ~20k rows:
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
Can this be adjusted to work for larger datasets? Mine is ~150k rows.
Here is a version which is much faster than the one you provided above; it also uses a simplified formula for the unweighted case to get even faster results there.
def gini(x, w=None):
    # The rest of the code requires numpy arrays.
    x = np.asarray(x)
    if w is not None:
        w = np.asarray(w)
        sorted_indices = np.argsort(x)
        sorted_x = x[sorted_indices]
        sorted_w = w[sorted_indices]
        # Force float dtype to avoid overflows
        cumw = np.cumsum(sorted_w, dtype=float)
        cumxw = np.cumsum(sorted_x * sorted_w, dtype=float)
        return (np.sum(cumxw[1:] * cumw[:-1] - cumxw[:-1] * cumw[1:]) /
                (cumxw[-1] * cumw[-1]))
    else:
        sorted_x = np.sort(x)
        n = len(x)
        cumx = np.cumsum(sorted_x, dtype=float)
        # The above formula, with all weights equal to 1, simplifies to:
        return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n
Here is some test code to check we get (mostly) the same results:
>>> x = np.random.rand(1000000)
>>> w = np.random.rand(1000000)
>>> gini_max_ghenis(x, w)
0.33376310938610521
>>> gini(x, w)
0.33376310938610382
But the speed is very different:
%timeit gini(x, w)
203 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit gini_max_ghenis(x, w)
55.6 s ± 3.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you remove the pandas ops from the function, it is already much faster:
%timeit gini_max_ghenis_no_pandas_ops(x, w)
1.62 s ± 75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to get the last drop of performance you could use numba or cython but that would only gain a few percent because most of the time is spent in sorting.
%timeit ind = np.argsort(x); sx = x[ind]; sw = w[ind]
180 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
edit: gini_max_ghenis is the code used in Max Ghenis' answer
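For reference, "removing the pandas ops" presumably means something like the following plain-NumPy rewrite of the pandas-based function shown in the next answer; this is an assumed reconstruction, not necessarily the exact gini_max_ghenis_no_pandas_ops that produced the 1.62 s timing above:

import numpy as np

def gini_max_ghenis_no_pandas_ops(x, w=None):
    # Same algorithm as the pandas version below, but on plain ndarrays
    # (no pd.Series, no .iloc); the explicit loop is kept as in the original.
    x = np.asarray(x, dtype=float)
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    n = x.size
    wxsum = np.sum(w * x)
    wsum = np.sum(w)
    sxw = np.argsort(x)
    sx = x[sxw] * w[sxw]
    sw = w[sxw]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
    for i in range(1, n):
        g += pxi[i] * pci[i - 1] - pci[i] * pxi[i - 1]
    return g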
Adapting the StatsGini R function from here:
import numpy as np
import pandas as pd

def gini(x, w=None):
    # Array indexing requires reset indexes.
    x = pd.Series(x).reset_index(drop=True)
    if w is None:
        w = np.ones_like(x)
    w = pd.Series(w).reset_index(drop=True)
    n = x.size
    wxsum = sum(w * x)
    wsum = sum(w)
    sxw = np.argsort(x)
    sx = x[sxw] * w[sxw]
    sw = w[sxw]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
    for i in np.arange(1, n):
        g = g + pxi.iloc[i] * pci.iloc[i - 1] - pci.iloc[i] * pxi.iloc[i - 1]
    return g
This works for large vectors, at least up to 10M rows:
n = int(1e7)
gini(np.random.rand(n), np.random.rand(n))  # Takes ~15 s.
It also produces the same result as the function provided in the question, for example giving 0.2553 for this example:
gini(np.array([3, 1, 6, 2, 1]), np.array([4, 2, 2, 10, 1]))