scipy parallel cdist with multiprocessing - python

I have a big matrix with millions of rows and hundreds of columns.
The first n rows (about 100K) are reference rows, and for the others, I would like to find the k (about 10) closest neighbours in the reference vectors with scipy cdist
I created an multiprocessing.sharedctypes.Array from the matrix, and use asarray and slicing to split up the matrix and compute distances with cdist.
My current code looks like this:
import numpy
from multiprocessing import Pool, sharedctypes
from scipy.spatial.distance import cdist
shared_m = None
def generate_sample():
m = numpy.random.random((20000, 640))
shape = m.shape
global shared_m
shared_m = sharedctypes.Array('d', m.flatten(), lock=False)
return shape
def dist(A, B, metric):
return cdist(A, B, metric=metric)
def get_closest(args):
shape, lenA, start_B, end_B, metric, numres = args
m = numpy.ctypeslib.as_array(shared_m)
m.shape = shape
A = numpy.asarray(m[:lenA,:], dtype=numpy.double)
B = numpy.asarray(m[start_B:end_B,:], dtype=numpy.double)
distances = dist(B, A, metric)
# rest of code to find closests
def p_get_closest(shape, lenA, lenB, metric="cosine", sample_size=1000, numres=10):
p = Pool(4)
args = ((shape, lenA, i, i + sample_size, metric, numres)
for i in xrange(lenB / sample_size))
res = p.map_async(get_closest, args)
return res.get()
def main():
shape = generate_sample()
p_get_closest(shape, 5000, shape[0] - 5000, "cosine", 3000, 10)
if __name__ == "__main__":
main()
My problem right now is that the parallel calls of cdist are somehow block each other. (Maybe I use the block expression incorrectly. The problem is that there are no parallel cdist computations)
I tried to trace back the problem with printouts into scipy/spatial/distance.py and scipy/spatial/src/distance.c to understand where the run blocks. It looks like there is no copying of data, the dtypes argument took care of that.
When putting printf into distance.c:cdist_cosine(), it shows that all the processes reach the point where the actual computation starts (before the for loops), but the computations don't run in parallel.
I tried a lot of things like using multiprocessing.sharedctypes.RawArray instead of Array, using lock=True while creating the Array.
I have no other idea what I did wrong or how to investigate more the problem.

Related

Are Scipy LAPACK functions parallel?

I am currently using the scipy.linalg.lapack.zheevd() function and it runs on all cores, and produces hangs and memory overflows if I try mapping the function to an array of arguments using the ProcessPoolExecutor() or ThreadPoolExecutor() from concurrent.futures.
It utilizes as many cores as my test system has, but I was under the impression that things were not typically parallelized in Python due to the GIL. Is this a result of the underlying Fortran code running with OpenMP?
Is it safe to assume this is parallelized, and cannot be parallelized further? This is not a large bottleneck for my code (finding the eigensystems of 400 unique 1000x1000 matrices; although there may be need for this to be scaled up, e.g. 1000 2000x2000 matrices eventually), but I am in the optimization phase for it.
Here is a, hopefully, helpful code snippet for conceptualization, but does not represent the actual matrices:
import numpy as np
from scipy import linalg as la
import concurrent.futures
# In real code
# various parameters are used to build the matrix function,
# it is presumably not sparse
# Matrix with independent variable x
def matrix_function(x):
# Define dimensions and pre-allocate space for matrix
#dim = 100 # For quicker evaluation/testing
dim = 1000 # For conveying the scale of the problem
matrix_dimensions = [dim, dim]
# The matrix is complex
mat = np.zeros(matrix_dimensions, dtype=complex)
for i in range(dim):
for j in range(i,dim):
mat[i,j] = x*np.random.rand(1) + np.random.rand(1)*1J
# Making the matrix Hermitian
mat[j,i] = np.conjugate( mat[i,j] )
return mat
# 400 Arguments for the defined matrix function
args = np.arange(0,10,0.025)
# Parallelizing evaluation of 400 matrices
with concurrent.futures.ProcessPoolExecutor() as pool:
evaluated_matrix_functions = pool.map( matrix_function, args )
''' This will hang,
which is what tipped me off to the issue
**not important to question
eigsystem = pool.map( la.lapack.zheevd,
evaluated_matrix_functions
)
'''
pool.shutdown()
''' This will cause a memory overflow,
depending on the size of the matrices
and how many of them; even with 32GB memory
with concurrent.futures.ThreadPoolExecutor() as pool:
eigsystem = pool.map( la.lapack.zheevd,
evaluated_matrix_functions
)
pool.shutdown()
'''
# The code which I run, in serial,
# but still uses all cores/threads my 2700x provides at full load
eigensystem_list = []
for matrix in evaluated_matrix_functions:
eigensystem_list.append( la.lapack.zheevd(matrix) )
# The eigensystem_list is then used in later calculations
This is all controlled by the LAPACK library you are using under the hood.

Multiprocessing a function's output to a single array

I have been trying this for quite sometime now, but my array remains unchanged.
My array here is TC_p_value, and the function I am trying to simulate is TC_stats. The code runs fine if we run it normally, but takes too long to simulate (about an hour). Thus, to reduce the processing time, I divided the original array (1000x100) in 10 small sets of 100x100. Although, the code runs without an error, I somehow always get the same array (same as it is defined originally). I tried to define TC_p_value as global, so that each run can assign values to specific part of the array. However, it seems like I am doing something wrong here (as simulating a single array on multiple processors is not possible) or is there something wrong with my coding logic?
Any help is greatly appreciated.
Code for the same is written below.
import pingouin as pg # A package to do regression
TC_p_value = np.zeros((Treecover.shape[1],Treecover.shape[2])) #let this array be of size 1000 x 100
def TC_stats(grid_start):
global TC_p_value
for lat in tqdm(range(grid_start, grid_start+100)):
for lon in range(Treecover.shape[2]):
TC_p_value[lat,lon] = pg.corr(y=Treecover[:, lat,lon].values,
x=np.arange(1,16,1))['p-val'].values[0]
#Multiprocessing starts here
from multiprocessing import Pool
if __name__ == '__main__':
pool = Pool()
grid = np.arange(0,1000,100) #Running it in a group of 100, 10 times
pool.map(TC_stats, grid)
pool.close()
pool.join()
The problem is that an array defined globally is not shared across processes. Thus, you need to use shared memory.
import ctypes
import numpy as np
import pingouin as pg # A package to do regression
N, M = Treecover.shape[1], Treecover.shape[2]
mp_arr = mp.Array(ctypes.c_double, N * M)
TC_p_value = np.frombuffer(mp_arr.get_obj())
TC_p_value = TC_p_value.reshape((N, M))
#let this array be of size 1000 x 100
def TC_stats(grid_start):
TC_p_value = np.frombuffer(mp_arr.get_obj())
TC_p_value = TC_p_value.reshape((N, M))
for lat in tqdm(range(grid_start, grid_start+100)):
for lon in range(Treecover.shape[2]):
TC_p_value[lat,lon] = pg.corr(y=Treecover[:, lat,lon].values,
x=np.arange(1,16,1))['p-val'].values[0]
def init(shared_arr_):
global mp_arr
mp_arr = shared_arr_
#Multiprocessing starts here
from multiprocessing import Pool
if __name__ == '__main__':
pool = Pool(initializer=init, initargs=(mp_arr,))
grid = np.arange(0,1000,100) #Running it in a group of 100, 10 times
pool.map_async(TC_stats, grid)
pool.close()
pool.join()
I ran the code above with some modified toy example, and it worked.
Reference: Use numpy array in shared memory for multiprocessing

Efficiently using 1-D pyfftw on small slices of a 3-D numpy array

I have a 3D data cube of values of size on the order of 10,000x512x512. I want to parse a window of vectors (say 6) along dim[0] repeatedly and generate the fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by another person with numpy.fft.rfft and accelerated with numba. But the implementation wasn't working on my workstation so I re-wrote everything and opted to go for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
def runme():
# normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
# load the file
dataChunk = np.random.random((1000,512,512))
numFrames = dataChunk.shape[0]
# select the window size
windowSize = int(simpledialog.askstring('Window Size',
'How many frames to demodulate a single time point?'))
numChannels = windowSize//2+1
# create fftw arrays
ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
# perform DFT on the data chunk
demodFrames = dataChunk.shape[0]//windowSize
channelChunks = np.zeros([numChannels,demodFrames,
dataChunk.shape[1],dataChunk.shape[2]])
channelChunks = getDFT(dataChunk,channelChunks,
ftwIn,ftwOut,fftObject,windowSize,numChannels)
return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
windowSize,numChannels):
frameLen = data.shape[0]
demodFrames = frameLen//windowSize
for yy in range(data.shape[1]):
for xx in range(data.shape[2]):
index = 0
for i in range(0,frameLen-windowSize+1,windowSize):
ftwIn[:] = data[i:i+windowSize,yy,xx]
fftObject()
channelOut[:,index,yy,xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
index+=1
return channelOut
if __name__ == '__main__':
runme()
What happens is I get a 4D array; the variable channelChunks. I am saving out each channel to a binary (not included in the code above, but the saving part works fine).
This process is for a demodulation project we have, the 4D data cube channelChunks is then parsed into eval(numChannel) 3D data cubes (movies) and from that we are able to separate a movie by color given our experimental set up. I was hoping I could circumvent writing a C++ function that calls the fft on the matrix via pyfftw.
Effectively, I am taking windowSize=6 elements along the 0 axis of dataChunk at a given index of 1 and 2 axis and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.
The FFTW advanced plans can be automatically built by pyfftw.
The code could be modified in the following way:
Real to complex transforms can be used instead of complex to complex transform.
Using pyfftw, it typically writes:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified: some implementations tricks can be used.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale=1.0/windowSize
for ...
for ...
2*np.abs(ftwOut[:,:,:])*scale #instead of /windowSize
Avoid multiple for loops by making use of FFTW advanced plan through pyfftw.
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
...
for yy in range(data.shape[1]):
ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
fftObject()
channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
Here is the modifed code. I also, decreased the number of frame to 100, set the seed of the random generator to check that the outcome is not modifed and commented tkinter. The size of the window can be set to a power of two, or a number made by multiplying 2,3,5 or 7, so that the Cooley-Tuckey algorithm can be efficiently applied. Avoid large prime numbers.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time
ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
ftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'
def runme():
# normally I would load a file, but for Stack Overflow, I'm just going to generate a 3D data cube so I'll delete references to the binary saving/loading functions:
# load the file
np.random.seed(seed=42)
dataChunk = np.random.random((100,512,512))
numFrames = dataChunk.shape[0]
# select the window size
#windowSize = int(simpledialog.askstring('Window Size',
# 'How many frames to demodulate a single time point?'))
windowSize=32
numChannels = windowSize//2+1
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
#ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
#ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
# perform DFT on the data chunk
demodFrames = dataChunk.shape[0]//windowSize
channelChunks = np.zeros([numChannels,demodFrames,
dataChunk.shape[1],dataChunk.shape[2]])
channelChunks = getDFT(dataChunk,channelChunks,
ftwIn,ftwOut,fftObject,windowSize,numChannels)
return channelChunks
def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
windowSize,numChannels):
frameLen = data.shape[0]
demodFrames = frameLen//windowSize
printed=0
nbwindow=data.shape[0]//windowSize
scale=1.0/windowSize
for yy in range(data.shape[1]):
#for xx in range(data.shape[2]):
index = 0
ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
fftObject()
channelOut[:,:,yy,:]=np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
#for i in range(nbwindow):
#channelOut[:,i,yy,xx] = 2*np.abs(ftwOut[i,:])*scale
if printed==0:
for j in range(channelOut.shape[0]):
print j,channelOut[j,0,yy,0]
printed=1
return channelOut
if __name__ == '__main__':
seconds=time.time()
runme()
print "time: ", time.time()-seconds
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...

Python efficient summation in large 2D array

My task is fairly simple: I have a large 2D matrix, containing only zeros and ones. For each position in this matrix, I want to sum all pixels in a window around this position. The problem is that the matrix has the shape (166667, 17668) and window sizes range from (333, 333) to (5333, 5333). So far I have only tried on a subset of the data. The code I arrived at:
out_arr = np.array( in_arr.shape )
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
for y in range(out_arr.shape[0]):
for x in range(out_arr.shape[1]):
out_arr[y, x] = np.sum(in_arr[y:y+windowsize, x:x+windowsize])
Obviously, this takes a long time. But in my case it was faster than a rolling window approach using numpy.stride_tricks.as_strided, as described here. I tried compiling it using cython, without effect.
What would be your suggestions to speed this up, apart from parallelizing?
I have a Nvidia Titan X at hand. Is there a way to benefit from that?
(e.g. using cupy)
For windowed summation convolution is actually overkill since a simple O(n) solution exists:
import numpy as np
from scipy.signal import convolve
def winsum(in_arr, windowsize):
in_arr = np.pad(in_arr, windowsize//2+1, mode='reflect')[:-1, :-1]
in_arr[0] = 0
in_arr[:, 0] = 0
ps = in_arr.cumsum(0).cumsum(1)
return ps[windowsize:, windowsize:] + ps[:-windowsize, :-windowsize] \
- ps[windowsize:, :-windowsize] - ps[:-windowsize, windowsize:]
This is already fast but you can save even more because ps calculated once for the largest window size could be reused for all smaller window sizes.
However, there is one potential drawback, which are the very large numbers that may arise from summing everything like that. A numerically more sound version eliminates this problem by taking the differences first. Downside: the extra saving through sharing ps is no longer available.
def winsum_safe(in_arr, windowsize):
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
in_arr[windowsize:] -= in_arr[:-windowsize]
in_arr[:, windowsize:] -= in_arr[:, :-windowsize]
return in_arr.cumsum(0)[windowsize-1:].cumsum(1)[:, windowsize-1:]
For reference, here is the closest competitor which is fft based convolution. You need an up-to-date version of scipy for this to work efficiently. On older versions use fftconvolve instead of convolve.
def winsumc(in_arr, windowsize):
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
kernel = np.ones((windowsize, windowsize), in_arr.dtype)
return convolve(in_arr, kernel, 'valid')
The next one is to simulate scipy's old - and excruciatingly slow - behavior.
def winsum_nofft(in_arr, windowsize):
in_arr = np.pad(in_arr, windowsize//2, mode='reflect')
kernel = np.ones((windowsize, windowsize), in_arr.dtype)
return convolve(in_arr, kernel, 'valid', method='direct')
Testing and benchmarking:
data = np.random.random((1000, 1000))
assert np.allclose(winsum(data, 333), winsumc(data, 333))
assert np.allclose(winsum(data, 333), winsum_safe(data, 333))
kwds = dict(globals=globals(), number=10)
from timeit import timeit
from time import perf_counter
print('data 100x1000, window 333x333')
print('cumsum: ', timeit('winsum(data, 333)', **kwds)*100, 'ms')
print('cumsum safe: ', timeit('winsum_safe(data, 333)', **kwds)*100, 'ms')
print('fftconv: ', timeit('winsumc(data, 333)', **kwds)*100, 'ms')
t = perf_counter()
res = winsum_nofft(data, 99) # 333 just takes too long
t = perf_counter() - t
assert np.allclose(winsum(data, 99), res)
print('data 100x1000, window 99x99')
print('conv: ', t*1000, 'ms')
Sample output:
data 100x1000, window 333x333
cumsum: 70.33260859316215 ms
cumsum safe: 59.98647050000727 ms
fftconv: 298.60571819590405 ms
data 100x1000, window 99x99
conv: 135224.8261970235 ms
#Divakar pointed out in the comments that you can use conv2d and he is right. Here is an example:
import numpy as np
from scipy import signal
data = np.random.rand(5,5) # you original data that you want to sum
kernel = np.ones((2,2)) # square matrix of your dimensions, filled with ones
output = signal.convolve2d(data,kernel,mode='same') # the convolution

Joblib parallel write to "shared" numpy sparse matrix

Im trying to compute number of shared neighbors for each node of a very big graph (~1m nodes). Using Joblib Im trying to run it in parallel. But Im worrying about parallel writes to sparse matrix, which supposed to keep all data. Will this piece of code produce consistent results?
vNum = 1259084
NN_Matrix = csc_matrix((vNum, vNum), dtype=np.int8)
def nn_calc_parallel(node_id = None):
i, j = np.unravel_index(node_id, (1259084, 1259084))
NN_Matrix[i, j] = len(np.intersect1d(nx.neighbors(G, i), nx.neighbors(G,j)))
num_cores = multiprocessing.cpu_count()
result = Parallel(n_jobs=num_cores)(delayed(nn_calc_parallel)(i) for i in xrange(vNum**2))
If not, can you help me to solve this?
I needed to do the same work, in my case was just ok to merge the matrixes together into one matrix which you can do this way:
from scipy.sparse import vstack
matrixes = Parallel(n_jobs=-3)(delayed(nn_calc_parallel)(x) for x in documents)
matrix = vstack(matrixes)
Njob-3 means all CPUS except 2, otherwise it might throw some memory errors.

Categories