I am currently using the scipy.linalg.lapack.zheevd() function. It runs on all cores, and it produces hangs and memory overflows if I try mapping the function over an array of arguments using ProcessPoolExecutor() or ThreadPoolExecutor() from concurrent.futures.
It utilizes as many cores as my test system has, but I was under the impression that things were not typically parallelized in Python due to the GIL. Is this a result of the underlying Fortran code running with OpenMP?
Is it safe to assume this is parallelized, and cannot be parallelized further? This is not a large bottleneck for my code (finding the eigensystems of 400 unique 1000x1000 matrices, although this may eventually need to be scaled up, e.g. to 1000 2000x2000 matrices), but I am in the optimization phase for it.
Here is a hopefully helpful code snippet for conceptualization; it does not represent the actual matrices:
import numpy as np
from scipy import linalg as la
import concurrent.futures

# In the real code, various parameters are used to build the matrix function;
# it is presumably not sparse.

# Matrix with independent variable x
def matrix_function(x):
    # Define dimensions and pre-allocate space for the matrix
    #dim = 100  # For quicker evaluation/testing
    dim = 1000  # For conveying the scale of the problem
    matrix_dimensions = [dim, dim]
    # The matrix is complex
    mat = np.zeros(matrix_dimensions, dtype=complex)
    for i in range(dim):
        for j in range(i, dim):
            mat[i, j] = x*np.random.rand(1) + np.random.rand(1)*1J
            # Making the matrix Hermitian
            mat[j, i] = np.conjugate(mat[i, j])
    return mat

# 400 arguments for the defined matrix function
args = np.arange(0, 10, 0.025)

# Parallelizing evaluation of the 400 matrices
with concurrent.futures.ProcessPoolExecutor() as pool:
    evaluated_matrix_functions = pool.map(matrix_function, args)
    ''' This will hang,
    which is what tipped me off to the issue
    **not important to the question
    eigsystem = pool.map( la.lapack.zheevd,
                          evaluated_matrix_functions
                        )
    '''
    pool.shutdown()

''' This will cause a memory overflow,
depending on the size of the matrices
and how many of them; even with 32GB memory
with concurrent.futures.ThreadPoolExecutor() as pool:
    eigsystem = pool.map( la.lapack.zheevd,
                          evaluated_matrix_functions
                        )
    pool.shutdown()
'''

# The code which I run, in serial,
# but which still uses all cores/threads my 2700x provides at full load
eigensystem_list = []
for matrix in evaluated_matrix_functions:
    eigensystem_list.append(la.lapack.zheevd(matrix))
# The eigensystem_list is then used in later calculations
This is all controlled by the LAPACK library you are using under the hood.
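If you do want to combine zheevd with process-level parallelism, one option (my addition, not part of the answer above) is to pin each worker to a single BLAS/LAPACK thread so the workers do not oversubscribe the CPU. A minimal sketch, assuming the third-party threadpoolctl package is available; setting OMP_NUM_THREADS, MKL_NUM_THREADS or OPENBLAS_NUM_THREADS before importing numpy/scipy is an alternative:

import concurrent.futures
import numpy as np
from scipy import linalg as la
from threadpoolctl import threadpool_limits  # assumed installed; not in the original post

def solve_one(matrix):
    # Restrict BLAS/LAPACK to one thread inside this worker process only
    with threadpool_limits(limits=1):
        return la.lapack.zheevd(matrix)

if __name__ == '__main__':
    # Placeholder matrices; the real ones would come from matrix_function(x)
    matrices = [np.eye(200, dtype=complex) for _ in range(8)]
    with concurrent.futures.ProcessPoolExecutor() as pool:
        eigensystem_list = list(pool.map(solve_one, matrices))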
I have a 3D data cube of values of size on the order of 10,000x512x512. I want to repeatedly take a window of vectors (say 6) along dim[0] and generate the Fourier transforms efficiently. I think I'm doing an array copy into the pyfftw package and it's giving me massive overhead. I'm going over the documentation now since I think there is an option I need to set, but I could use some extra help on the syntax.
This code was originally written by another person with numpy.fft.rfft and accelerated with numba. But the implementation wasn't working on my workstation, so I rewrote everything and opted to go for pyfftw instead.
import numpy as np
import pyfftw as ftw
from tkinter import simpledialog
from math import ceil
import multiprocessing

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()

def runme():
    # normally I would load a file, but for Stack Overflow I'm just going to
    # generate a 3D data cube, so I'll delete references to the binary
    # saving/loading functions:
    # load the file
    dataChunk = np.random.random((1000,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    windowSize = int(simpledialog.askstring('Window Size',
        'How many frames to demodulate a single time point?'))
    numChannels = windowSize//2+1
    # create fftw arrays
    ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut)
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
                              dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
                           ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks

def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    for yy in range(data.shape[1]):
        for xx in range(data.shape[2]):
            index = 0
            for i in range(0,frameLen-windowSize+1,windowSize):
                ftwIn[:] = data[i:i+windowSize,yy,xx]
                fftObject()
                channelOut[:,index,yy,xx] = 2*np.abs(ftwOut[:numChannels])/windowSize
                index += 1
    return channelOut

if __name__ == '__main__':
    runme()
What happens is I get a 4D array, the variable channelChunks. I am saving out each channel to a binary file (not included in the code above, but the saving part works fine).
This is for a demodulation project we have: the 4D data cube channelChunks is then split into numChannels 3D data cubes (movies), and from those we are able to separate a movie by color given our experimental setup. I was hoping that by using pyfftw I could avoid writing a C++ function that calls the FFT on the matrix.
Effectively, I am taking windowSize=6 elements along axis 0 of dataChunk, at a given index along axes 1 and 2, and performing a 1D FFT. I need to do this throughout the entire 3D volume of dataChunk to generate the demodulated movies. Thanks.
FFTW advanced plans can be built automatically by pyfftw.
The code could be modified in the following ways:
Real-to-complex transforms can be used instead of complex-to-complex transforms.
With pyfftw, this typically reads:
ftwIn = ftw.empty_aligned(windowSize, dtype='float64')
ftwOut = ftw.empty_aligned(windowSize//2+1, dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut)
Add a few flags to the FFTW planner. For instance, FFTW_MEASURE will time different algorithms and pick the best. FFTW_DESTROY_INPUT signals that the input array can be modified: some implementation tricks can be used.
fftObject = ftw.FFTW(ftwIn,ftwOut, flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
Limit the number of divisions. A division costs more than a multiplication.
scale = 1.0/windowSize
for ...:
    for ...:
        2*np.abs(ftwOut[:,:,:])*scale   # instead of /windowSize
Avoid multiple for loops by making use of the FFTW advanced plan interface through pyfftw.
nbwindow=numFrames//windowSize
# create fftw arrays
ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
...
for yy in range(data.shape[1]):
    ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
    fftObject()
    channelOut[:,:,yy,:] = np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
Here is the modified code. I also decreased the number of frames to 100, set the seed of the random generator to check that the outcome is not modified, and commented out the tkinter dialog. The size of the window can be set to a power of two, or to a number made by multiplying 2, 3, 5 or 7, so that the Cooley-Tukey algorithm can be applied efficiently. Avoid large prime numbers.
import numpy as np
import pyfftw as ftw
#from tkinter import simpledialog
from math import ceil
import multiprocessing
import time

ftw.config.NUM_THREADS = multiprocessing.cpu_count()
ftw.interfaces.cache.enable()
ftw.config.PLANNER_EFFORT = 'FFTW_MEASURE'

def runme():
    # normally I would load a file, but for Stack Overflow I'm just going to
    # generate a 3D data cube, so I'll delete references to the binary
    # saving/loading functions:
    # load the file
    np.random.seed(seed=42)
    dataChunk = np.random.random((100,512,512))
    numFrames = dataChunk.shape[0]
    # select the window size
    #windowSize = int(simpledialog.askstring('Window Size',
    #    'How many frames to demodulate a single time point?'))
    windowSize = 32
    numChannels = windowSize//2+1
    nbwindow = numFrames//windowSize
    # create fftw arrays
    ftwIn = ftw.empty_aligned((nbwindow,windowSize,dataChunk.shape[2]), dtype='float64')
    ftwOut = ftw.empty_aligned((nbwindow,windowSize//2+1,dataChunk.shape[2]), dtype='complex128')
    #ftwIn = ftw.empty_aligned(windowSize, dtype='complex128')
    #ftwOut = ftw.empty_aligned(windowSize, dtype='complex128')
    fftObject = ftw.FFTW(ftwIn,ftwOut, axes=(1,), flags=('FFTW_MEASURE','FFTW_DESTROY_INPUT',))
    # perform DFT on the data chunk
    demodFrames = dataChunk.shape[0]//windowSize
    channelChunks = np.zeros([numChannels,demodFrames,
                              dataChunk.shape[1],dataChunk.shape[2]])
    channelChunks = getDFT(dataChunk,channelChunks,
                           ftwIn,ftwOut,fftObject,windowSize,numChannels)
    return channelChunks

def getDFT(data,channelOut,ftwIn,ftwOut,fftObject,
           windowSize,numChannels):
    frameLen = data.shape[0]
    demodFrames = frameLen//windowSize
    printed = 0
    nbwindow = data.shape[0]//windowSize
    scale = 1.0/windowSize
    for yy in range(data.shape[1]):
        #for xx in range(data.shape[2]):
        index = 0
        ftwIn[:] = np.reshape(data[0:nbwindow*windowSize,yy,:],(nbwindow,windowSize,data.shape[2]),order='C')
        fftObject()
        channelOut[:,:,yy,:] = np.transpose(2*np.abs(ftwOut[:,:,:])*scale, (1,0,2))
        #for i in range(nbwindow):
        #    channelOut[:,i,yy,xx] = 2*np.abs(ftwOut[i,:])*scale
        if printed == 0:
            for j in range(channelOut.shape[0]):
                print j,channelOut[j,0,yy,0]
            printed = 1
    return channelOut

if __name__ == '__main__':
    seconds = time.time()
    runme()
    print "time: ", time.time()-seconds
Let us know how much it speeds up your computations! I went from 24s to less than 2s on my computer...
''In general, you would get better performance creating batches of linear constraints rather than creating them one at a time. I am just wondering whether that still holds with a huge problem.'' - The wise programmer.
To be clear, I have a (35k x 40) dataset and I want to run an SVM on it. I need to produce the Gram matrix of this dataset; that part is fine, but passing the coefficients to CPLEX is a mess, it takes hours. Here is my code:
import numpy as np

nn = 35000
XXt = np.random.rand(nn,nn)  # the Gram matrix of the dataset
yy = np.random.rand(nn)      # the label vector of the dataset

temp = ((yy*XXt).T)*yy
xg, yg = np.meshgrid(range(nn), range(nn))
indici = np.dstack([yg,xg])

quadratic_part = []
for ii in xrange(nn):
    for indd in indici[ii][ii:]:
        quadratic_part.append([indd[0],indd[1],temp[indd[0],indd[1]]])
The 'quadratic_part' is a list of the form [i,j,c_ij] where c_ij is the coefficient stored in temp. It will be passed to the function 'objective.set_quadratic_coefficients()' of the CPLEX Python API.
Is there a wiser way to do this?
P.S. I may also have a memory problem, so instead of storing the whole list 'quadratic_part' it would be better to call the function 'objective.set_quadratic_coefficients()' several times... you know what I mean?!
Under the hood, objective.set_quadratic makes use of the CPXXcopyquad function in the C Callable Library, whereas objective.set_quadratic_coefficients uses CPXXcopyqpsep.
Here is an example (bear in mind that I am not a numpy expert; it's quite possible there's a better way to do that part):
import numpy as np
import cplex
nn = 5 # a small example size here
XXt = np.random.rand(nn,nn) # the gramm matrix of the dataset
yy = np.random.rand(nn) # the label vector of the dataset
temp = ((yy*XXt).T)*yy
# create symmetric matrix
tempu = np.triu(temp) # upper triangle
iu1 = np.triu_indices(nn, 1)
tempu.T[iu1] = tempu[iu1] # copy upper into lower
ind = np.array([[x for x in range(nn)] for x in range(nn)])
qmat = []
for i in range(nn):
    qmat.append([np.arange(nn), tempu[i]])
c = cplex.Cplex()
c.variables.add(lb=[0]*nn)
c.objective.set_quadratic(qmat)
c.write("test2.lp")
Your Q matrix is completely dense, so depending on the amount of memory you have, this technique may not scale. When it is possible, though, you should get better performance initializing your Q matrix with objective.set_quadratic. Perhaps you'll need to use some hybrid technique where you use both set_quadratic and set_quadratic_coefficients.
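For the P.S. about memory, here is a hedged sketch (my addition, not tested against a full-size model) of streaming the coefficients in chunks instead of materializing the whole [i, j, c_ij] list. It assumes set_quadratic_coefficients accepts a sequence of (i, j, value) triples; the chunk size and the helper generator are illustrative choices, not CPLEX requirements.

import itertools
import cplex
import numpy as np

nn = 5                                  # small example size
temp = np.random.rand(nn, nn)
temp = (temp + temp.T) / 2.0            # symmetric Q for the example

def upper_triples(mat):
    # Yield (i, j, coefficient) for the upper triangle, diagonal included
    n = mat.shape[0]
    for i in range(n):
        for j in range(i, n):
            yield (i, j, float(mat[i, j]))

c = cplex.Cplex()
c.variables.add(lb=[0.0]*nn)

chunk_size = 1000                       # arbitrary; tune to your memory budget
gen = upper_triples(temp)
while True:
    chunk = list(itertools.islice(gen, chunk_size))
    if not chunk:
        break
    c.objective.set_quadratic_coefficients(chunk)

c.write("test_chunked.lp")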
I have a 24000 * 316 numpy matrix in which each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix holding the Pearson values.
My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works (though still slowly). I am wondering whether it is expected to be this slow (it takes more than a day!!!), and what I might be able to do about it...
If it helps, this is my code... nothing special or hard..
import numpy
import scipy.stats

def SimMat(mat,name):
    mrange = mat.shape[0]
    print "mrange:", mrange
    nTRs = mat.shape[1]
    print "nTRs:", nTRs

    SimM = numpy.zeros((mrange,mrange))
    for i in range(mrange):
        SimM[i][i] = 1

    for i in range(mrange):
        for j in range(i+1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])
            if(pearV[1] <= 0.05):
                if(pearV[0] >= 0.5):
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = pearV[0]
                    SimM[j][i] = 0
                else:
                    SimM[i][j] = SimM[j][i] = 0

    numpy.savetxt(name, SimM)
    return SimM, nTRs
Thanks
The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5GB). There is no reason to keep the already computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:
import numpy as np
import h5py
from scipy.stats import pearsonr
# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()
# Compute dot products
h5 = h5py.File("data.h5",'r+')
A = h5["test"][:]
N = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)
for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]
Testing the first 100 rows suggests this will only take 8 hours on a single core. If you parallelize it, it should see a roughly linear speedup with the number of cores.
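A minimal sketch (my addition, not part of the answer above) of one way to parallelize that row loop: each worker computes one row of correlations and the parent process writes it into the hdf5 dataset, since writing to the same file from several processes is best avoided. The worker count and chunksize are illustrative.

import multiprocessing as mp
import h5py
import numpy as np
from scipy.stats import pearsonr

_A = None

def init_worker(data):
    # Give every worker its own copy of the data once, at start-up
    global _A
    _A = data

def one_row(i):
    # Correlate row i against every row; return the index along with the result
    return i, np.array([pearsonr(_A[i], _A[j])[0] for j in range(_A.shape[0])])

if __name__ == '__main__':
    with h5py.File("data.h5", 'r+') as h5:
        A = h5["test"][:]
        N = A.shape[0]
        out = h5.require_dataset("pearson", shape=(N, N), dtype=float)
        with mp.Pool(4, initializer=init_worker, initargs=(A,)) as pool:
            for i, row in pool.imap_unordered(one_row, range(N), chunksize=16):
                out[i] = row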
I have a big matrix with millions of rows and hundreds of columns.
The first n rows (about 100K) are reference rows, and for the others I would like to find the k (about 10) closest neighbours among the reference vectors with scipy's cdist.
I created a multiprocessing.sharedctypes.Array from the matrix, and use asarray and slicing to split the matrix up and compute the distances with cdist.
My current code looks like this:
import numpy
from multiprocessing import Pool, sharedctypes
from scipy.spatial.distance import cdist

shared_m = None

def generate_sample():
    m = numpy.random.random((20000, 640))
    shape = m.shape
    global shared_m
    shared_m = sharedctypes.Array('d', m.flatten(), lock=False)
    return shape

def dist(A, B, metric):
    return cdist(A, B, metric=metric)

def get_closest(args):
    shape, lenA, start_B, end_B, metric, numres = args
    m = numpy.ctypeslib.as_array(shared_m)
    m.shape = shape
    A = numpy.asarray(m[:lenA,:], dtype=numpy.double)
    B = numpy.asarray(m[start_B:end_B,:], dtype=numpy.double)
    distances = dist(B, A, metric)
    # rest of the code to find the closest neighbours

def p_get_closest(shape, lenA, lenB, metric="cosine", sample_size=1000, numres=10):
    p = Pool(4)
    args = ((shape, lenA, i, i + sample_size, metric, numres)
            for i in xrange(lenB / sample_size))
    res = p.map_async(get_closest, args)
    return res.get()

def main():
    shape = generate_sample()
    p_get_closest(shape, 5000, shape[0] - 5000, "cosine", 3000, 10)

if __name__ == "__main__":
    main()
My problem right now is that the parallel calls of cdist somehow block each other. (Maybe I am using the word 'block' incorrectly; the problem is that there are no parallel cdist computations.)
I tried to trace the problem with printouts in scipy/spatial/distance.py and scipy/spatial/src/distance.c to understand where the run blocks. It looks like there is no copying of data; the dtype arguments took care of that.
When I put a printf into distance.c:cdist_cosine(), it shows that all the processes reach the point where the actual computation starts (before the for loops), but the computations don't run in parallel.
I have tried a lot of things, like using multiprocessing.sharedctypes.RawArray instead of Array, and using lock=True while creating the Array.
I have no idea what else I might have done wrong, or how to investigate the problem further.