Multiprocessing a function's output to a single array

I have been trying this for quite some time now, but my array remains unchanged.
My array is TC_p_value, and the function I am trying to run is TC_stats. The code works when run serially, but it takes too long (about an hour). To reduce the processing time, I divided the original array (1000x100) into 10 smaller sets of 100x100. The code runs without an error, but I always get back the same array as originally defined. I tried declaring TC_p_value as global so that each run could assign values to its own part of the array. Is filling a single array from multiple processes simply not possible, or is there something wrong with my logic?
Any help is greatly appreciated.
The code is below.
import numpy as np
import pingouin as pg  # A package to do regression
from tqdm import tqdm

# Treecover (time x lat x lon, exposing .values) is assumed to be loaded elsewhere
TC_p_value = np.zeros((Treecover.shape[1],Treecover.shape[2]))  # let this array be of size 1000 x 100

def TC_stats(grid_start):
    global TC_p_value
    for lat in tqdm(range(grid_start, grid_start+100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat,lon] = pg.corr(y=Treecover[:, lat,lon].values,
                                          x=np.arange(1,16,1))['p-val'].values[0]

# Multiprocessing starts here
from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool()
    grid = np.arange(0,1000,100)  # Running it in a group of 100, 10 times
    pool.map(TC_stats, grid)
    pool.close()
    pool.join()

The problem is that a globally defined array is not shared across processes: each worker gets its own copy, so writes made in a worker never reach the parent. You need to use shared memory.
import ctypes
import multiprocessing as mp
import numpy as np
import pingouin as pg  # A package to do regression
from tqdm import tqdm

N, M = Treecover.shape[1], Treecover.shape[2]
mp_arr = mp.Array(ctypes.c_double, N * M)
TC_p_value = np.frombuffer(mp_arr.get_obj())
TC_p_value = TC_p_value.reshape((N, M))
# let this array be of size 1000 x 100

def TC_stats(grid_start):
    # Re-wrap the shared buffer inside the worker; no data is copied
    TC_p_value = np.frombuffer(mp_arr.get_obj())
    TC_p_value = TC_p_value.reshape((N, M))
    for lat in tqdm(range(grid_start, grid_start+100)):
        for lon in range(Treecover.shape[2]):
            TC_p_value[lat,lon] = pg.corr(y=Treecover[:, lat,lon].values,
                                          x=np.arange(1,16,1))['p-val'].values[0]

def init(shared_arr_):
    # Runs once in every worker: make the shared buffer visible as a global
    global mp_arr
    mp_arr = shared_arr_

# Multiprocessing starts here
from multiprocessing import Pool
if __name__ == '__main__':
    pool = Pool(initializer=init, initargs=(mp_arr,))
    grid = np.arange(0,1000,100)  # Running it in a group of 100, 10 times
    pool.map(TC_stats, grid)  # blocks until done and re-raises worker errors
    pool.close()
    pool.join()
I ran the code above on a modified toy example, and it worked.
Reference: Use numpy array in shared memory for multiprocessing
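For readers without the Treecover data, here is a minimal self-contained sketch of the same shared-memory pattern. The names as_numpy, fill_rows and run are hypothetical helpers, and the value written (row * 10 + col) is only a stand-in for the pg.corr p-value; it is not the original computation.

```python
import ctypes
import multiprocessing as mp
from functools import partial

import numpy as np

def as_numpy(shared, shape):
    """View a multiprocessing.Array as a NumPy array without copying."""
    return np.frombuffer(shared.get_obj(), dtype=np.float64).reshape(shape)

def init(shared, shape):
    """Pool initializer: remember the shared buffer in this worker."""
    global _shared, _shape
    _shared, _shape = shared, shape

def fill_rows(start, block):
    """Write a stand-in result (row * 10 + col) into rows [start, start + block)."""
    out = as_numpy(_shared, _shape)
    for row in range(start, min(start + block, _shape[0])):
        for col in range(_shape[1]):
            out[row, col] = row * 10 + col

def run(shape=(6, 4), block=2, workers=2):
    """Fill a shared array block by block from a pool of worker processes."""
    shared = mp.Array(ctypes.c_double, shape[0] * shape[1])
    with mp.Pool(workers, initializer=init, initargs=(shared, shape)) as pool:
        pool.map(partial(fill_rows, block=block), range(0, shape[0], block))
    return as_numpy(shared, shape).copy()

if __name__ == "__main__":
    print(run())
```

Each worker writes to a disjoint block of rows, so no locking is needed for the writes themselves.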

Related

Parallelize three nested loops

Context:
I have three 3D arrays ("precursor arrays") that I am upsampling with an inverse distance weighting method. To do that, I calculate a 3D weights array that I use in a for loop over each point of my precursor arrays.
Each 2D slice of my weights array is used to calculate a partial array. Once I generate all 28 of them, they are summed to give one final host array.
I would like to parallelize this for loop to reduce my computing time. I tried, but I cannot manage to update my host arrays correctly.
Question:
How could I parallelize my main function (last section of my code)?
EDIT: Or is there a way to "slice" my i for loop (for example, one core running i = 0 to 5 and another running i = 6 to 9)?
Summary:
3 precursor arrays (temperatures, precipitations, snow): 10x4x7 (10 is a time dimension)
1 weight array (w): 28x1101x2101
28x3 partial arrays: 1101x2101
3 host arrays (temp, prec, Eprec): 1101x2101
Here is my code (runnable as is, aside from the MAIN ALGORITHM PARALLEL section; see the MAIN ALGORITHM NOT PARALLEL section at the end for the non-parallelized version):
import numpy as np
import multiprocessing as mp
import time

#%% ------ Create data ------ ###
temperatures = np.random.rand(10,4,7)*100
precipitation = np.random.rand(10,4,7)
snow = np.random.rand(10,4,7)
# Array of altitudes to "adjust" the temperatures
alt = np.random.rand(4,7)*1000

#%% ------ Functions to run in parallel ------ ###
# This function upsamples the precursor arrays and creates the partial arrays
def interpolator(i, k, mx, my):
    T = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000)) * w[k,:,:]
    P = (precipitation[i,mx,my])*w[k,:,:]
    S = (snow[i,mx,my])*w[k,:,:]
    return(T, P, S)

# We add each partial array to each other to create the host array
def get_results(results):
    global temp, prec, Eprec
    temp += results[0]
    prec += results[1]
    Eprec += results[2]

#%% ------ IDW Interpolation ------ ###
# We create a weight matrix that we use to upsample our temperatures, precipitations and snow matrices
# This part is not that important, it works well as it is
MX,MY = np.shape(temperatures[0])
N = 300
T = np.zeros([N*MX+1, N*MY+1])
# create NxM inverse distance weight matrices based on Gaussian interpolation
x = np.arange(0,N*MX+1)
y = np.arange(0,N*MY+1)
X,Y = np.meshgrid(x,y)
k = 0
w = np.zeros([MX*MY,N*MX+1,N*MY+1])
for mx in range(MX):
    for my in range(MY):
        # Gaussian
        add_point = np.exp(-((mx*N-X.T)**2+(my*N-Y.T)**2)/N**2)
        w[k,:,:] += add_point
        k += 1
sum_weights = np.sum(w, axis=0)
for k in range(MX*MY):
    w[k,:,:] /= sum_weights

#%% ------ MAIN ALGORITHM PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    # Start a timer
    ts = time.time()
    # Iterate over the time dimension
    for i in range(temperatures.shape[0]):
        # Initialize the host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        # Create the pool based on my amount of cores
        pool = mp.Pool(mp.cpu_count())
        # Loop through every weight slice, for every cell of the temperatures, precipitations and snow arrays
        for k in range(0,w.shape[0]):
            for mx in range(MX):
                for my in range(MY):
                    # Upsample the temperatures, precipitations and snow arrays by adding the contribution of each weight slice
                    pool.apply_async(interpolator, args = (i, k, mx, my), callback = get_results)
        pool.close()
        pool.join()
    # Print the time spent on the loop
    print("Time spent: ", time.time()-ts)

#%% ------ MAIN ALGORITHM NOT PARALLEL ------ ###
if __name__ == '__main__':
    # Create an empty array to use as a template
    dummy = np.zeros((w.shape[1], w.shape[2]))
    ts = time.time()
    for i in range(temperatures.shape[0]):
        # Create empty host arrays
        temp = dummy.copy()
        prec = dummy.copy()
        Eprec = dummy.copy()
        k = 0
        for mx in range(MX):
            for my in range(MY):
                get_results(interpolator(i, k, mx, my))
                k += 1
    print("Time spent:", time.time()-ts)

The problem with multiprocessing is that it creates many new processes that execute the code before the main block (i.e. before if __name__ == '__main__'), at least with the spawn start method used by default on Windows and macOS. This causes a very slow initialization (since every process does it) and a huge amount of RAM used for nothing. You should move everything into the main block or, if possible, into functions (which generally results in faster execution and is good software engineering practice anyway, especially for parallel code). Even then, there is another big problem with multiprocessing: inter-process communication is slow. One solution is a multi-threaded approach made possible by Numba or Cython (both let you release the GIL, unlike basic CPython threads). In fact, they are often simpler to use than multiprocessing. However, you need to be more careful, since parallel accesses are unprotected and data races can appear in buggy parallel code.
In your case, the computation is mostly memory-bound, which means multiprocessing is of little use. In fact, parallelism barely helps here unless you run this code on a computing server with high memory throughput. Memory is a shared resource, and using more cores does not help much since one core can almost saturate the memory bandwidth on a regular PC (while a few cores are needed on computing servers).
The key to speeding up memory-bound code is to avoid creating temporary arrays and to use cache-friendly algorithms. In your case, T, P and S are filled only to be read back later when updating the temp, prec and Eprec arrays. This temporary step is pretty expensive and unnecessary here (especially filling the arrays). Removing it increases the arithmetic intensity, resulting in code that will certainly be faster sequentially and that scales better on multiple cores. This is the case on my machine.
Here is an example using Numba to parallelize the code:
import numba as nb

# get_results + interpolator
@nb.njit('void(float64[:,::1], float64[:,::1], float64[:,::1], float64[:,:,::1], int_, int_, int_, int_)', parallel=True)
def interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my):
    factor1 = ((temperatures[i,mx,my]-272.15) + (-alt[mx, my] * -6/1000))
    factor2 = precipitation[i,mx,my]
    factor3 = snow[i,mx,my]
    # Update the host arrays in place, without building T, P and S first
    for i in nb.prange(w.shape[1]):
        for j in range(w.shape[2]):
            val = w[k, i, j]
            temp[i, j] += factor1 * val
            prec[i, j] += factor2 * val
            Eprec[i, j] += factor3 * val

# Example of usage:
interpolate_and_get_results(temp, prec, Eprec, w, i, k, mx, my)
Note that the string passed to nb.njit is called a signature; it specifies the types to the JIT so it can compile the function eagerly.
This code is 4.6 times faster on my 6-core machine (while it was barely faster without merging get_results and interpolator). In fact, it is 3.8 times faster sequentially, so threads do not help much since the computation is still memory-bound. Indeed, the cost of the multiply-add is negligible compared to the memory reads/writes.
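As a side note, the same "no temporaries" idea can be sketched in plain NumPy. This is a different technique from the Numba version above: assuming that, for one time step, each host array is the sum over k of a scalar times w[k], with k running over (mx, my) in row-major order as in the non-parallel loop, the whole accumulation is a single contraction. The helper names and the small array sizes below are hypothetical, and no timing is claimed for this variant.

```python
import numpy as np

def upsample_loop(values, w):
    """Reference: accumulate one partial array per weight slice."""
    host = np.zeros(w.shape[1:])
    k = 0
    for mx in range(values.shape[0]):
        for my in range(values.shape[1]):
            host += values[mx, my] * w[k]   # allocates a temporary each time
            k += 1
    return host

def upsample_tensordot(values, w):
    """Same sum written as one contraction over the slice index k."""
    return np.tensordot(values.ravel(), w, axes=1)

rng = np.random.default_rng(0)
values = rng.random((4, 7))          # one time step of a precursor array
w = rng.random((28, 11, 21))         # small stand-in for the 28x1101x2101 weights
print(np.allclose(upsample_loop(values, w), upsample_tensordot(values, w)))  # True
```

For the temperatures, values would hold the altitude-adjusted factor computed per cell rather than the raw array.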

Are Scipy LAPACK functions parallel?

I am currently using the scipy.linalg.lapack.zheevd() function. It runs on all cores, and it hangs or runs out of memory if I try mapping the function over an array of arguments using ProcessPoolExecutor() or ThreadPoolExecutor() from concurrent.futures.
It utilizes as many cores as my test system has, but I was under the impression that things were not typically parallelized in Python due to the GIL. Is this a result of the underlying Fortran code running with OpenMP?
Is it safe to assume this is parallelized, and cannot be parallelized further? This is not a large bottleneck for my code (finding the eigensystems of 400 unique 1000x1000 matrices; although there may be need for this to be scaled up, e.g. 1000 2000x2000 matrices eventually), but I am in the optimization phase for it.
Here is a hopefully helpful code snippet for conceptualization; it does not represent the actual matrices:
import numpy as np
from scipy import linalg as la
import concurrent.futures

# In real code
# various parameters are used to build the matrix function,
# it is presumably not sparse
# Matrix with independent variable x
def matrix_function(x):
    # Define dimensions and pre-allocate space for matrix
    #dim = 100 # For quicker evaluation/testing
    dim = 1000 # For conveying the scale of the problem
    matrix_dimensions = [dim, dim]
    # The matrix is complex
    mat = np.zeros(matrix_dimensions, dtype=complex)
    for i in range(dim):
        for j in range(i,dim):
            mat[i,j] = x*np.random.rand(1) + np.random.rand(1)*1J
            # Making the matrix Hermitian
            mat[j,i] = np.conjugate( mat[i,j] )
    return mat

# 400 Arguments for the defined matrix function
args = np.arange(0,10,0.025)

# Parallelizing evaluation of 400 matrices
with concurrent.futures.ProcessPoolExecutor() as pool:
    evaluated_matrix_functions = pool.map( matrix_function, args )
    ''' This will hang,
        which is what tipped me off to the issue
        **not important to question
    eigsystem = pool.map( la.lapack.zheevd,
                          evaluated_matrix_functions
                          )
    '''
    pool.shutdown()

''' This will cause a memory overflow,
    depending on the size of the matrices
    and how many of them; even with 32GB memory
with concurrent.futures.ThreadPoolExecutor() as pool:
    eigsystem = pool.map( la.lapack.zheevd,
                          evaluated_matrix_functions
                          )
    pool.shutdown()
'''

# The code which I run, in serial,
# but still uses all cores/threads my 2700x provides at full load
eigensystem_list = []
for matrix in evaluated_matrix_functions:
    eigensystem_list.append( la.lapack.zheevd(matrix) )
# The eigensystem_list is then used in later calculations

This is all controlled by the LAPACK library you are using under the hood.
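One common way to act on this is to cap the library's own thread count and then parallelize over matrices yourself. The sketch below assumes the backend honours the usual environment variables of OpenMP, OpenBLAS or MKL (an assumption; check which library your NumPy/SciPy build links against). It uses np.linalg.eigh as a stand-in for zheevd, and hermitian_eigenvalues is a hypothetical helper.

```python
import os

# Assumption: the backend reads one of these variables (OpenMP, OpenBLAS, MKL).
# They must be set before NumPy/SciPy are imported for the first time.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def hermitian_eigenvalues(seed, dim=50):
    """Build a random Hermitian matrix and return its eigenvalues (ascending)."""
    rng = np.random.default_rng(seed)
    a = rng.random((dim, dim)) + 1j * rng.random((dim, dim))
    h = a + a.conj().T                 # Hermitian by construction
    return np.linalg.eigh(h)[0]        # eigenvalues only

if __name__ == "__main__":
    # One LAPACK call per worker process, each limited to a single thread
    with ProcessPoolExecutor() as pool:
        spectra = list(pool.map(hermitian_eigenvalues, range(8)))
    print(len(spectra), spectra[0].shape)
```

Whether this beats a single multi-threaded LAPACK call depends on matrix size and core count, so it is worth timing both on the real workload.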

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

I have a 3D data cube of values of size on the order of 10,000x512x512. I want to parse a window of vectors (say 6) along dim[0] repeatedly and generate the fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by another person with numpy.fft.rfft and accelerated with numba. But the implementation wasn't working on my workstation so I re-wrote everything and opted to go for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()

def runme():
    # normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
    # load the file
    dataChunk = np.random.random((1000,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    windowSize = int(simpledialog.askstring('Window Size',
        'How many frames to demodulate a single time point?'))
    numChannels = windowSize//2+1
    # create fftw arrays
    ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut)
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
        dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
        ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks

def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    for yy in range(data.shape[1]):
        for xx in range(data.shape[2]):
            index = 0
            for i in range(0,frameLen-windowSize+1,windowSize):
                ftwIn[:] = data[i:i+windowSize,yy,xx]
                fftObject()
                channelOut[:,index,yy,xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
                index+=1
    return channelOut

if __name__ == '__main__':
    runme()
What happens is I get a 4D array; the variable channelChunks. I am saving out each channel to a binary (not included in the code above, but the saving part works fine).
This process is for a demodulation project we have, the 4D data cube channelChunks is then parsed into eval(numChannel) 3D data cubes (movies) and from that we are able to separate a movie by color given our experimental set up. I was hoping I could circumvent writing a C++ function that calls the fft on the matrix via pyfftw.
Effectively, I am taking windowSize=6 elements along the 0 axis of dataChunk at a given index of 1 and 2 axis and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.

The FFTW advanced plans can be automatically built by pyfftw.
The code could be modified in the following way:
Real-to-complex transforms can be used instead of complex-to-complex transforms.
With pyfftw, this is typically written as:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified, which allows some implementation tricks.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale=1.0/windowSize
for ...
    for ...
        2*np.abs(ftwOut[:,:,:])*scale #instead of /windowSize
Avoid multiple for loops by making use of FFTW advanced plan through pyfftw.
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
...
for yy in range(data.shape[1]):
    ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
    fftObject()
    channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
Here is the modified code. I also decreased the number of frames to 100, set the seed of the random generator to check that the outcome is not modified, and commented out tkinter. The window size can be set to a power of two, or to a product of powers of 2, 3, 5 and 7, so that the Cooley-Tukey algorithm can be applied efficiently. Avoid large prime numbers.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
ftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

def runme():
    # normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
    # load the file
    np.random.seed(seed=42)
    dataChunk = np.random.random((100,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    #windowSize = int(simpledialog.askstring('Window Size',
    #    'How many frames to demodulate a single time point?'))
    windowSize=32
    numChannels = windowSize//2+1
    nbwindow=numFrames//windowSize
    # create fftw arrays
    ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
    ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
    #ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    #ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
        dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
        ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks

def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    printed=0
    nbwindow=data.shape[0]//windowSize
    scale=1.0/windowSize
    for yy in range(data.shape[1]):
        #for xx in range(data.shape[2]):
        index = 0
        # One planned FFT call transforms every window of the whole row yy
        ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
        fftObject()
        channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
        #for i in range(nbwindow):
            #channelOut[:,i,yy,xx] = 2*np.abs(ftwOut[i,:])*scale
        if printed==0:
            for j in range(channelOut.shape[0]):
                print(j,channelOut[j,0,yy,0])
            printed=1
    return channelOut

if __name__ == '__main__':
    seconds=time.time()
    runme()
    print("time: ", time.time()-seconds)
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...
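The batching idea used in this answer (reshape the frames into windows, then transform along the window axis in one call) can be checked without pyfftw. Here is a small NumPy-only sketch; demod_loop and demod_batched are hypothetical names, and np.fft.rfft is swapped in for the planned FFTW object, so it illustrates the reshaping rather than FFTW's speed.

```python
import numpy as np

def demod_loop(data, window):
    """Reference: one rfft per window and per pixel (slow)."""
    n_win = data.shape[0] // window
    out = np.zeros((window // 2 + 1, n_win) + data.shape[1:])
    for yy in range(data.shape[1]):
        for xx in range(data.shape[2]):
            for idx in range(n_win):
                seg = data[idx * window:(idx + 1) * window, yy, xx]
                out[:, idx, yy, xx] = 2 * np.abs(np.fft.rfft(seg)) / window
    return out

def demod_batched(data, window):
    """Same result with a single batched rfft along the window axis."""
    n_win = data.shape[0] // window
    # Split the time axis into (window index, position inside the window)
    blocks = data[:n_win * window].reshape((n_win, window) + data.shape[1:])
    spectrum = np.fft.rfft(blocks, axis=1)   # shape (n_win, window//2+1, Y, X)
    # Put the channel axis first to match the layout of channelChunks
    return np.moveaxis(2 * np.abs(spectrum) / window, 1, 0)

rng = np.random.default_rng(42)
cube = rng.random((20, 3, 4))                # small stand-in for the data cube
print(np.allclose(demod_loop(cube, 6), demod_batched(cube, 6)))  # True
```

Frames left over after the last full window (here, 2 of the 20) are dropped in both versions, as in the original code.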

scipy parallel cdist with multiprocessing

I have a big matrix with millions of rows and hundreds of columns.
The first n rows (about 100K) are reference rows, and for the others, I would like to find the k (about 10) closest neighbours in the reference vectors with scipy cdist.
I created a multiprocessing.sharedctypes.Array from the matrix, and use asarray and slicing to split up the matrix and compute distances with cdist.
My current code looks like this:
import numpy
from multiprocessing import Pool, sharedctypes
from scipy.spatial.distance import cdist

shared_m = None

def generate_sample():
    m = numpy.random.random((20000, 640))
    shape = m.shape
    global shared_m
    shared_m = sharedctypes.Array('d', m.flatten(), lock=False)
    return shape

def dist(A, B, metric):
    return cdist(A, B, metric=metric)

def get_closest(args):
    shape, lenA, start_B, end_B, metric, numres = args
    m = numpy.ctypeslib.as_array(shared_m)
    m.shape = shape
    A = numpy.asarray(m[:lenA,:], dtype=numpy.double)
    B = numpy.asarray(m[start_B:end_B,:], dtype=numpy.double)
    distances = dist(B, A, metric)
    # rest of code to find closests

def p_get_closest(shape, lenA, lenB, metric="cosine", sample_size=1000, numres=10):
    p = Pool(4)
    args = ((shape, lenA, i, i + sample_size, metric, numres)
            for i in range(lenB // sample_size))
    res = p.map_async(get_closest, args)
    return res.get()

def main():
    shape = generate_sample()
    p_get_closest(shape, 5000, shape[0] - 5000, "cosine", 3000, 10)

if __name__ == "__main__":
    main()
My problem right now is that the parallel calls of cdist somehow block each other. (Maybe I am using the term "block" incorrectly. The problem is that the cdist computations do not run in parallel.)
I tried to trace back the problem with printouts into scipy/spatial/distance.py and scipy/spatial/src/distance.c to understand where the run blocks. It looks like there is no copying of data, the dtypes argument took care of that.
When putting printf into distance.c:cdist_cosine(), it shows that all the processes reach the point where the actual computation starts (before the for loops), but the computations don't run in parallel.
I tried a lot of things, like using multiprocessing.sharedctypes.RawArray instead of Array, and using lock=True while creating the Array.
I have no other idea what I did wrong or how to investigate the problem further.

Print current residual from callback in scipy.sparse.linalg.cg

I am using scipy.sparse.linalg.cg to solve a large, sparse linear system, and it works fine, except that I would like to add a progress report, so that I can monitor the residual as the solver works. I've managed to set up a callback, but I can't figure out how to access the current residual from inside the callback. Calculating the residual myself is possible, of course, but that is a rather heavy operation, which I'd like to avoid. Have I missed something, or is there no efficient way of getting at the residual?

The callback is only sent xk, the current solution vector. So you don't have direct access to the residual. However, the source code shows resid is a local variable in the cg function.
So, with CPython, it is possible to use the inspect module to peek at the local variables in the caller's frame:
import inspect
import numpy as np
import scipy as sp
import scipy.sparse as sparse
import scipy.sparse.linalg as splinalg
import random

def report(xk):
    # f_back is the frame of the caller, i.e. the cg function itself
    frame = inspect.currentframe().f_back
    print(frame.f_locals['resid'])

N = 200
A = sparse.lil_matrix( (N, N) )
for _ in range(N):
    A[random.randint(0, N-1), random.randint(0, N-1)] = random.randint(1, 100)
b = np.random.randint(0, N-1, size = N)
x, info = splinalg.cg(A, b, callback = report)
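To see the frame-inspection trick in isolation, here is a minimal sketch with a hypothetical toy_solver standing in for cg. It only assumes that the solver keeps its residual in a local variable named resid and calls the callback directly; a real SciPy version may name or structure things differently, so treat the variable name as something to verify against the source you have installed.

```python
import inspect

def toy_solver(callback, steps=3):
    """Hypothetical stand-in for an iterative solver with a local named `resid`."""
    resid = 1.0
    xk = 0.0
    for _ in range(steps):
        xk += resid / 2
        resid = resid / 2
        callback(xk)          # only xk is passed, as with cg
    return xk

seen = []

def report(xk):
    # Look one frame up the call stack and read the caller's local variable
    frame = inspect.currentframe().f_back
    seen.append(frame.f_locals["resid"])

toy_solver(report)
print(seen)  # [0.5, 0.25, 0.125]
```

This relies on CPython frame objects and on the solver's internals, so it can break silently when the library changes.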
