Parallelize three nested loops - python

Context:
I have 3 3D arrays ("precursor arrays") that I am upsampling with an Inverse Distance Weighting method. To do that, I calculate a 3D weights array that I use in a for loop over each point of my precursor arrays.
Each 2D slice of my weights array is used to calculate a partial array. Once I have generated all 28 of them, they are summed to give one final host array.
I would like to parallelize this for loop in order to reduce my computing time. I tried to do it, but I cannot manage to update my host arrays correctly.
Question:
How could I parallelize my main function (the last section of my code)?
EDIT: Or is there a way I could "slice" my i for loop (for example, one core running on i = 0 to 5 and another on i = 6 to 9)?
Summary:
3 precursor arrays (temperatures, precipitations, snow): 10x4x7 (10 is a time dimension)
1 weight array (w): 28x1101x2101
28x3 partial arrays: 1101x2101
3 host arrays (temp, prec, Eprec): 1101x2101
Here is my code (runnable as it is, aside from the MAIN ALGORITHM PARALLEL section; see the MAIN ALGORITHM NOT PARALLEL section at the end for the non-parallelized version):
import numpy as np
import multiprocessing as mp
import time
#%% ------ Create data ------ ###
temperatures = np.random.rand(10,4,7)*100
precipitation = np.random.rand(10,4,7)
snow = np.random.rand(10,4,7)
# Array of altitudes to "adjust" the temperatures
alt = np.random.rand(4,7)*1000
#%% ------ Functions to run in parallel ------ ###
# This function upsamples the precursor arrays and creates the partial arrays
def interpolator(i, k, mx, my):
    T = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000)) * w[k,:,:]
    P = (precipitation[i,mx,my])*w[k,:,:]
    S = (snow[i,mx,my])*w[k,:,:]
    return (T, P, S)
# We add each partial array to each other to create the host array
def get_results(results):
    global temp, prec, Eprec
    temp += results[0]
    prec += results[1]
    Eprec += results[2]
#%% ------ IDW Interpolation ------ ###
# We create a weight matrix that we use to upsample our temperatures, precipitations and snow matrices
# This part is not that important, it works well as it is
MX,MY = np.shape(temperatures[0])
N = 300
T = np.zeros([N*MX+1, N*MY+1])
# create NxM inverse distance weight matrices based on Gaussian interpolation
x = np.arange(0,N*MX+1)
y = np.arange(0,N*MY+1)
X,Y = np.meshgrid(x,y)
k = 0
w = np.zeros([MX*MY,N*MX+1,N*MY+1])
for mx in range(MX):
    for my in range(MY):
        # Gaussian
        add_point = np.exp(-((mx*N-X.T)**2+(my*N-Y.T)**2)/N**2)
        w[k,:,:] += add_point
        k += 1
sum_weights = np.sum(w, axis=0)
for k in range(MX*MY):
    w[k,:,:] /= sum_weights
#%% ------ MAIN ALGORITHM PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    # Start a timer
    ts = time.time()
    # Iterate over the time dimension
    for i in range(temperatures.shape[0]):
        # Initialize the host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        # Create the pool based on my amount of cores
        pool = mp.Pool(mp.cpu_count())
        # Loop through every weight slice, for every cell of the temperatures, precipitations and snow arrays
        for k in range(0, w.shape[0]):
            for mx in range(MX):
                for my in range(MY):
                    # Upsample the temperatures, precipitations and snow arrays by adding the contribution of each weight slice
                    pool.apply_async(interpolator, args=(i, k, mx, my), callback=get_results)
        pool.close()
        pool.join()
    # Print the time spent on the loop
    print("Time spent: ", time.time()-ts)
#%% ------ MAIN ALGORITHM NOT PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    ts = time.time()
    for i in range(temperatures.shape[0]):
        # Create empty host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        k = 0
        for mx in range(MX):
            for my in range(MY):
                get_results(interpolator(i, k, mx, my))
                k += 1
    print("Time spent:", time.time()-ts)

The problem with multiprocessing is that it creates many new processes that execute the code before the main (i.e. before if __name__ == '__main__'). This causes a very slow initialization (since every process does it) and a huge amount of RAM being used for nothing. You should certainly move everything into the main or, if possible, into functions (which generally results in faster execution and is good software-engineering practice anyway, especially for parallel code). Even with this, there is another huge problem with multiprocessing: inter-process communication is slow. One solution is a multi-threaded approach made possible by Numba or Cython (you can disable the GIL with them, as opposed to basic CPython threads). In fact, they are often simpler to use than multiprocessing. However, you should be more careful with them, since parallel accesses are unprotected and data races can appear in buggy parallel code.
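For instance, here is a minimal sketch of that restructuring (the function names are illustrative; the data creation mirrors the question's setup):
import numpy as np

def build_data():
    # Module-level setup moved into a function: worker processes that import
    # this module no longer re-execute the expensive initialization.
    temperatures = np.random.rand(10, 4, 7) * 100
    precipitation = np.random.rand(10, 4, 7)
    snow = np.random.rand(10, 4, 7)
    alt = np.random.rand(4, 7) * 1000
    return temperatures, precipitation, snow, alt

def main():
    temperatures, precipitation, snow, alt = build_data()
    # ... build the weight matrix and run the interpolation here ...

if __name__ == '__main__':
    main()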
In your case, the computation is mostly memory-bound. This means multiprocessing is pretty useless. In fact, parallelism is barely useful here unless you are running this code on a computing server with high memory throughput. Indeed, memory is a shared resource, and using more computing cores does not help much, since one core can almost saturate the memory bandwidth on a regular PC (while a few cores are needed to do so on computing servers).
The key to speeding up memory-bound code is to avoid creating temporary arrays and to use cache-friendly algorithms. In your case, T, P and S are filled just to be read back to update the temp, prec and Eprec arrays. This temporary step is pretty expensive and not necessary here (especially the filling of the arrays). Removing it increases the arithmetic intensity, resulting in code that is certainly faster sequentially and that can scale better on multiple cores. This is the case on my machine.
Here is an example using Numba to parallelize the code:
import numba as nb

# get_results + interpolator merged into one kernel
@nb.njit('void(float64[:,::1], float64[:,::1], float64[:,::1], float64[:,:,::1], int_, int_, int_, int_)', parallel=True)
def interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my):
    factor1 = (temperatures[i, mx, my] - 272.15) + (-alt[mx, my] * -6 / 1000)
    factor2 = precipitation[i, mx, my]
    factor3 = snow[i, mx, my]
    # x/y are used for the loop indices so the time index i is not shadowed
    for x in nb.prange(w.shape[1]):
        for y in range(w.shape[2]):
            val = w[k, x, y]
            temp[x, y] += factor1 * val
            prec[x, y] += factor2 * val
            Eprec[x, y] += factor3 * val

# Example of usage:
interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
Note that the string in nb.njit is called a signature; it specifies the types to the JIT so it can compile the function eagerly.
This code is 4.6 times faster on my 6-core machine (while it was barely faster without the merge of get_results and interpolator). In fact, it is 3.8 times faster sequentially, so threads do not help much, since the computation is still memory-bound. Indeed, the cost of the multiply-add is negligible compared to the memory reads/writes.
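For reference, here is how the merged kernel can replace the sequential main loop from the question (a sketch; w, MX, MY and the precursor arrays are built exactly as before):
ts = time.time()
for i in range(temperatures.shape[0]):
    # The host arrays are accumulated in place by the Numba kernel
    temp = np.zeros((w.shape[1], w.shape[2]))
    prec = np.zeros((w.shape[1], w.shape[2]))
    Eprec = np.zeros((w.shape[1], w.shape[2]))
    k = 0
    for mx in range(MX):
        for my in range(MY):
            interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
            k += 1
print("Time spent:", time.time() - ts)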

Related

Simulation of Markov chain slower than in Matlab

I run the same test code in Python+Numpy and in Matlab and see that the Matlab code is faster by an order of magnitude. I want to know what the bottleneck of the Python code is and how to speed it up.
I run the following test code using Python+Numpy (the last part is the performance sensitive one):
# Packages
import numpy as np
import time
# Number of possible outcomes
num_outcomes = 20
# Dimension of the system
dim = 50
# Number of iterations
num_iterations = int(1e7)
# Possible outcomes
outcomes = np.arange(num_outcomes)
# Possible transition matrices
matrices = [np.random.rand(dim, dim) for k in outcomes]
matrices = [mat/np.sum(mat, axis=0) for mat in matrices]
# Initial state
state = np.random.rand(dim)
state = state/np.sum(state)
# List of samples
samples = np.random.choice(outcomes, size=(num_iterations,))
samples = samples.tolist()
# === PERFORMANCE-SENSITIVE PART OF THE CODE ===
# Update the state over all iterations
start_time = time.time()
for k in range(num_iterations):
    sample = samples[k]
    matrix = matrices[sample]
    state = np.matmul(matrix, state)
end_time = time.time()
# Print the execution time
print(end_time - start_time)
I then run an equivalent code using Matlab (the last part is the performance sensitive one):
% Number of possible outcomes
num_outcomes = 20;
% Number of dimensions
dim = 50;
% Number of iterations
num_iterations = 1e7;
% Possible outcomes
outcomes = 1:num_outcomes;
% Possible transition matrices
matrices = rand(num_outcomes, dim, dim);
matrices = matrices./sum(matrices,2);
matrices = num2cell(matrices,[2,3]);
matrices = cellfun(@shiftdim, matrices, 'UniformOutput', false);
% Initial state
state = rand(dim,1);
state = state./sum(state);
% List of samples
samples = datasample(outcomes, num_iterations);
% === PERFORMANCE-SENSITIVE PART OF THE CODE ===
% Update the state over all iterations
tic;
for k = 1:num_iterations
    sample = samples(k);
    matrix = matrices{sample};
    state = matrix * state;
end
toc;
The Python code is consistently slower than the Matlab code by an order of magnitude, and I am not sure why.
Any idea where to start?
I run the Python code with the Python 3.10 interpreter and Numpy 1.22.4. I run the Matlab code with Matlab R2022a. Both codes are run on Windows 11 Pro 64 bits on a Lenovo T14 ThinkPad with the following processors:
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 2803 Mhz, 4 Core(s), 8 Logical Processor(s)
EDIT 1: I made some additional tests, and it looks like the culprit is some type of Python-specific constant overhead at low matrix sizes (see the benchmark plot in the original post).
As hpaulj and MSS suggest, this might mean that a JIT compiler could solve some of these issues. I will do my best to try this in the near future.
EDIT 2: I ran the code under Pypy 3.9-v7.3.11-win64, and although it does change the scaling and even beats CPython at small matrix sizes, it generally incurs a big overhead for this particular code (see the plot in the original post).
So a JIT compiler could help if there are ways to mitigate this overhead. Otherwise a Cython implementation is probably the remaining way to go...
In the loop, the main hindrance is np.matmul(matrix,state).
If we unroll the loop:
st[1] = m[0] @ st[0]
st[2] = m[1] @ st[1] = m[1] @ m[0] @ st[0]
st[3] = m[2] @ m[1] @ m[0] @ st[0]
There is no obvious way to vectorize a chain of dependent np.matmul calls like this.
A better way is to reduce the chain pairwise, which takes log_2(n) passes.
import numpy as np
outcomes = 20
dim = 50
num_iter = int(1e7)
mat = np.random.rand(outcomes, dim, dim)
mat = mat/mat.sum(axis=1)[..., None]
state = np.random.rand(dim)
state = state/np.sum(state)
samples = np.random.choice(np.arange(outcomes), size=(num_iter,))
# Caution: this materializes num_iter dim x dim matrices at once (about 200 GB
# for num_iter = 1e7), so in practice the reduction must be done in chunks.
a = mat[samples, ...]
# This while loop takes log_2(num_iter) iterations
while len(a) > 1:
    if len(a) % 2:
        # Odd length: keep the last (leftmost) factor for a later pass
        a = np.concatenate([np.matmul(a[1:-1:2, ...], a[0:-1:2, ...]), a[-1:]])
    else:
        # Pair as a[2j+1] @ a[2j] so the product order matches the sequential loop
        a = np.matmul(a[1::2, ...], a[::2, ...])
state = np.matmul(a[0], state)
The time may be further reduced by using numba jit.
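For example, a minimal sketch of that JIT route (propagate is a hypothetical name; it assumes Numba is installed and reuses mat, samples and state from above):
import numba as nb

@nb.njit(cache=True)
def propagate(mat, samples, state):
    # The sequential chain of matrix-vector products, compiled to native code,
    # which removes the per-iteration interpreter and dispatch overhead.
    for k in range(samples.shape[0]):
        state = mat[samples[k]] @ state
    return state

state = propagate(mat, samples, state)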

Why are variable assignments faster than calling from arrays in python?

I've been working on optimizing some Euclidean distance transform calculations for a program that I'm building. To preface, I have little formal training in computer science other than some MOOCs I've been taking.
I've learned through empirical testing in Python that assigning values to individual variables and performing operations on them is faster than performing operations on arrays. Is this observation reproducible for others?
If so, could someone provide a deeper explanation as to why there are such speed differences between these two forms of syntax?
Please see some example code below.
import numpy as np
from math import sqrt
import time
# Numpy array math
def test1(coords):
    results = []
    for coord in coords:
        mins = np.array([1,1,1])
        # The three lines below seem faster than np.linalg.norm()
        mins = (coord - mins)**2
        mins = np.sum(mins)
        results.append(sqrt(mins))

# Individual variable assignment math
def test2(coords):
    results = []
    for point in coords:
        z, y, x = 1, 1, 1
        z = (point[0] - z)**2
        y = (point[1] - y)**2
        x = (point[2] - x)**2
        mins = sqrt(z + y + x)
        results.append(mins)
a = np.random.randint(0, 10, (500000,3))
t = time.perf_counter()
test1(a)
print ("Test 1 speed:", time.perf_counter() - t)
t = time.perf_counter()
test2(a)
print ("Test 2 speed:", time.perf_counter() - t)
Test 1 speed: 3.261552719 s
Test 2 speed: 0.716983475 s
Python operations and memory allocations are generally much slower than Numpy's highly optimized, vectorized array operations. Since you are looping over the array and allocating memory, you don't get any of the benefits that Numpy offers. It's especially bad in your first version because it incurs an undue number of allocations of small arrays.
Compare your code to one that offloads all the operations to Numpy instead of having Python do the operations one by one:
def test3(coords):
    mins = (coords - 1)**2
    results = np.sqrt(np.sum(mins, axis=1))
    return results
On my system, this results in:
Test 1 speed: 4.995761550962925
Test 2 speed: 1.3881473205983639
Test 3 speed: 0.05562112480401993
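As a quick sanity check (a hypothetical snippet reusing a and test3 from above), the vectorized version produces the same values as the per-point math:
from math import sqrt

expected = [sqrt(((p - 1)**2).sum()) for p in a]  # slow per-point reference
assert np.allclose(test3(a), expected)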

Numba function take long time for assign value to an array

I wrote a function to calculate the HOG of an image with Numba, and I ran it on 7000 images. It takes 10 seconds. But when I commented out the line that assigns a value into an array (hist[idx] += mag), the time decreased to 5 milliseconds. What is the problem, and what should I do about this?
import math
import numpy as np
import numba

@numba.jit( numba.uint64[:]( numba.uint8[:,:], numba.uint8 ), nopython=True )
def hog_numba(img, bins):
    h, w = img.shape
    hist = np.zeros(bins, dtype=np.uint64)
    for i in range(h-1):
        for j in range(w-1):
            cy = img[i-1,j-1]*1 + img[i-1,j]*2 + img[i-1,j+1]*1 + img[i+1,j-1]*-1 + img[i+1,j]*-2 + img[i+1,j+1]*-1
            cx = img[i-1,j-1]*1 + img[i,j-1]*2 + img[i+1,j-1]*1 + img[i-1,j+1]*-1 + img[i,j+1]*-2 + img[i+1,j+1]*-1
            mag = numba.uint32(math.sqrt(math.pow(cx, 2) + math.pow(cy, 2)))
            if cx != 0:
                ang = math.atan2(cy, cx)  # arc tangent
            else:
                if cy > 0:
                    ang = math.pi / 2
                else:
                    ang = -math.pi / 2
            if ang < 0:
                ang = abs(ang) + math.pi
            idx = (ang * bins) // (math.pi * 2)
            idx = int(idx)
            #hist[idx] += mag
    return hist
The code below is used for the benchmark:
for _ in range(20):
    print('start')
    t = time.time()
    hists = []
    for i in range(8000):
        hist = hog_numba(img, 10)
    t = time.time() - t
    print('time:', t)
The difference in speed is not due to assignment being slow, but to the optimizations of the JIT compiler. Indeed, if you comment out the line hist[idx] += mag, then Numba can see that mag and idx do not need to be computed and can simply remove the associated lines. Transitively, it can also remove the computation of ang, cx and cy, and finally the two nested loops entirely. Such code is much faster, but also useless. In practice, the JIT may not remove every operation inside the two nested loops, possibly due to Python transformations, guards and side effects. On my machine it does optimize the loop to a no-op: it takes less than 1 ms on average to process 8000 images of size (16_000,16_000), which is totally impossible on my machine (it should be at least 1000 times slower).
Thus, you cannot measure the time of an isolated instruction by just removing it and looking at the time difference with Numba (or any optimizing compiler). Modern compilers are very advanced, and trying to defeat them is not easy. If you still want to check whether the cost actually comes mainly from the assignment, you could perform a summation instead, like mag_sum += mag and idx_sum += idx, and return/print the summation variables (otherwise the compiler can see they are useless, as they cause no visible change). On my machine, the assignment version is only 9% slower than the summation version, showing that the assignment does not take most of the execution time (despite not being very fast, probably due to the random access pattern).
The main source of slowdown is the line (ang * bins) // (math.pi * 2), and more specifically the multiplication/division by a constant. Pre-computing bins / (math.pi * 2) in a temporary variable ahead of time results in 3.5 times faster code. The code is still far from optimized: further optimizations include vectorization, branch-less operations and parallelism (using single precision and trying to remove the math.atan2 call may also help).
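As an illustration, here is a sketch of that pre-computation applied to the kernel (hog_numba_opt is a hypothetical name; the gradient code mirrors the question, and the loops start at 1 so the i-1/j-1 reads no longer wrap around):
import math
import numpy as np
import numba

@numba.njit
def hog_numba_opt(img, bins):
    h, w = img.shape
    hist = np.zeros(bins, dtype=np.uint64)
    scale = bins / (math.pi * 2)  # hoisted: computed once instead of per pixel
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            cy = (img[i-1,j-1] + 2*img[i-1,j] + img[i-1,j+1]
                  - img[i+1,j-1] - 2*img[i+1,j] - img[i+1,j+1])
            cx = (img[i-1,j-1] + 2*img[i,j-1] + img[i+1,j-1]
                  - img[i-1,j+1] - 2*img[i,j+1] - img[i+1,j+1])
            mag = math.sqrt(cx*cx + cy*cy)
            if cx != 0:
                ang = math.atan2(cy, cx)
            elif cy > 0:
                ang = math.pi / 2
            else:
                ang = -math.pi / 2
            if ang < 0:
                ang = abs(ang) + math.pi
            hist[int(ang * scale)] += np.uint64(mag)
    return hist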

Sum of difference of squares between each combination of rows of 17,000 by 300 matrix

Ok, so I have a matrix with 17000 rows (examples) and 300 columns (features). I want to compute, basically, the Euclidean distance between each possible combination of rows, i.e. the sum of the squared differences for each possible pair of rows.
Obviously that's a lot, and iPython, while not completely crashing my laptop, says "(busy)" for a while, and then I can't run anything anymore; it certainly seems to have given up, even though I can move my mouse and everything.
Is there any way to make this work? Here's the function I wrote. I used numpy everywhere I could.
What I'm doing is storing the differences in a difference matrix for each possible combination. I'm aware that the lower diagonal part of the matrix = the upper diagonal, but that would only save 1/2 the computation time (better than nothing, but not a game changer, I think).
EDIT: I just tried using scipy.spatial.distance.pdist but it's been running for a good minute now with no end in sight. Is there a better way? I should also mention that I have NaN values in there... but that's not a problem for numpy, apparently.
features = np.array(dataframe)
distances = np.zeros((17000, 17000))

def sum_diff():
    for i in range(17000):
        for j in range(17000):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares
You could always divide your computation time by 2, noticing that d(i, i) = 0 and d(i, j) = d(j, i).
But have you had a look at sklearn.metrics.pairwise.pairwise_distances() (in v 0.18, see the doc here) ?
You would use it as:
from sklearn.metrics import pairwise
import numpy as np
a = np.array([[0, 0, 0], [1, 1, 1], [3, 3, 3]])
pairwise.pairwise_distances(a)
The big thing with numpy is to avoid using loops and to let it do its magic with the vectorised operations, so there are a few basic improvements that will save you some computation time:
import numpy as np
import timeit

# I reduced the problem size to 1000*300 to keep the timing in a reasonable range
n = 1000
features = np.random.rand(n, 300)
distances = np.zeros((n, n))

def sum_diff():
    for i in range(n):
        for j in range(n):
            diff = np.array(features[i] - features[j])
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Here I removed the unnecessary copy induced by calling np.array
# -> some improvement
def sum_diff_v0():
    for i in range(n):
        for j in range(n):
            diff = features[i] - features[j]
            diff = np.square(diff)
            sumsquares = np.sum(diff)
            distances[i][j] = sumsquares

# Collapsing of the statements -> no improvement
def sum_diff_v1():
    for i in range(n):
        for j in range(n):
            distances[i][j] = np.sum(np.square(features[i] - features[j]))

# Using broadcasting and vectorized operations -> big improvement
def sum_diff_v2():
    for i in range(n):
        distances[i] = np.sum(np.square(features[i] - features), axis=1)

# Computing only half the distances -> 1/2 computation time
def sum_diff_v3():
    for i in range(n):
        distances[i][i+1:] = np.sum(np.square(features[i] - features[i+1:]), axis=1)
    distances[:] = distances + distances.T

print("original :", timeit.timeit(sum_diff, number=10))
print("v0 :", timeit.timeit(sum_diff_v0, number=10))
print("v1 :", timeit.timeit(sum_diff_v1, number=10))
print("v2 :", timeit.timeit(sum_diff_v2, number=10))
print("v3 :", timeit.timeit(sum_diff_v3, number=10))
Edit: For completeness, I also timed Camilleri's solution, which is much faster:
from sklearn.metrics import pairwise

def Camilleri_solution():
    distances = pairwise.pairwise_distances(features)
Timing results (in seconds, function run 10 times with 1000*300 input):
original : 138.36921879299916
v0 : 111.39915344800102
v1 : 117.7582511530054
v2 : 23.702392491002684
v3 : 9.712442981006461
Camilleri's : 0.6131987979897531
So as you can see, we can easily gain an order of magnitude by using the proper numpy syntax. Note that with only 1/20th of the data, the function runs in about one second, so I would expect the whole thing to run in tens of minutes, since the script scales as N^2.
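For completeness, a fully vectorized variant is also possible via the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y (a sketch, not part of the timings above; it returns the same squared distances as sum_diff):
def sum_diff_gram(features):
    sq = np.einsum('ij,ij->i', features, features)  # per-row squared norms
    # All pairwise squared distances from a single matrix multiplication
    d2 = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    np.maximum(d2, 0, out=d2)  # clamp tiny negatives caused by round-off
    return d2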

Pearson correlation on big numpy matrices

I have a 24000 * 316 numpy matrix, each row represents a time series with 316 time points, and I am computing pearson correlation between each pair of these time series. Meaning as a result I would have a 24000 * 24000 numpy matrix having pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slow). I am wondering if it is expected to be this slow (takes more than a day!!!). And what I might be able to do about it...
If it helps this is my code... nothing special or hard..
def SimMat(mat, name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs
    SimM = numpy.zeros((mrange, mrange))
    for i in range(mrange):
        SimM[i][i] = 1
    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if pearV[1] <= 0.05:
                if pearV[0] >= 0.5:
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
            else:
                SimM[i][j] = SimM[j][i] = 0
    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results since they work nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr

# Create the dataset
h5 = h5py.File("data.h5", 'w')
h5["test"] = np.random.random(size=(24000, 316))
h5.close()

# Compute the correlation coefficients
h5 = h5py.File("data.h5", 'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N, N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i], A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will take only 8 hours on a single core. If you parallelize it, it should show linear speedup with the number of cores.
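If the rows are NaN-free, a vectorized alternative is worth considering (a sketch under that assumption, not part of the answer above): the Pearson coefficient is just a dot product of standardized rows, so each block of the output can come from one matrix multiplication.
import numpy as np

def pearson_rows(mat, chunk=1000):
    # Standardize the rows: zero mean, unit L2 norm (assumes no constant rows)
    Z = mat - mat.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    N = Z.shape[0]
    out = np.empty((N, N), dtype=Z.dtype)
    for i in range(0, N, chunk):
        out[i:i+chunk] = Z[i:i+chunk] @ Z.T  # one BLAS call per chunk
    return out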
