I want to build a dictionary of function evaluations in a parallel manner, but I am struggling to figure out how to do this efficiently.
Take the case of a randomly constructed matrix:
import functools
import multiprocessing
import numpy as np
import time
#generate random symmetric matrix
N = 500
b = np.random.randint(-2000, 2001, size=(N, N))  # inclusive upper bound, like the deprecated random_integers
b_symm = (b + b.T)/2
#identity matrix
ident = np.eye(N)
# define worker function:
def func(w, b_mat):
    if w != 0:
        L = np.linalg.inv(1j * w * ident - b_mat)
    else:
        L = np.linalg.pinv(-b_mat)
    return L
I now want to sample over many values of w, and construct a dictionary of outputs. This would be an embarrassingly parallel problem. I can do this by using a shared dictionary, using something like this:
def dict_builder(w, d):
    d[w] = func(w, b_symm)
manager = multiprocessing.Manager()
val_dict = manager.dict()
wrange = np.linspace(-10,10,200)
processors=2
pool = multiprocessing.Pool(processors)
st = time.time()
data = pool.map(functools.partial(dict_builder, d=val_dict), wrange, 2)
pool.close()
pool.join()
en = time.time()
print("parallel test took ",en - st," seconds.")
but this seems more complicated than necessary, since I am only evaluating the function at unique points, and it comes with the overhead of a manager-proxied shared object.
What I would like to do is split wrange into n chunks, where n is the number of processors, build n dictionaries independently, then combine them into a single dictionary. So two questions: 1) Would this be computationally advantageous? 2) What is the best way to implement this using the multiprocessing module?
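For reference, a minimal sketch (not from the original post) of the no-shared-state variant: each worker simply returns its result, and the parent zips inputs and outputs into one ordinary dictionary, so no Manager is needed. It assumes func and b_symm as defined above.

if __name__ == '__main__':
    wrange = np.linspace(-10, 10, 200)
    with multiprocessing.Pool(processes=2) as pool:
        results = pool.map(functools.partial(func, b_mat=b_symm), wrange)
    val_dict = dict(zip(wrange, results))  # one plain dict in the parent process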
I am trying to parallelize my code across multiple processes. The computation is mostly CPU-intensive, so I decided to go for multiprocessing rather than multithreading. I am given a matrix (~300/400 elements) and a vector (~10 elements).
First, I use the pool of processes to perform i distinct operations on a copy of the original matrix/vector, obtaining i distinct new matrix/vector pairs as a result. In the process, I also check whether a selection of columns is equal to another given matrix.
Then, for each pair obtained before, I perform other operations to check that the vector resulting from the bitwise sum of some columns of the matrix and the vector has a given weight. This last part is repeated x times, where x dictates how many columns of the matrix I should extract.
The main problem is that the computation at some point stops without any error message. For this reason, I don't know how to debug the issue at all.
Since numpy is already multithreaded, to avoid exhausting my hardware I also set all the thread counts to 1. However, even though I have 96 cores, spawning a pool of 48 workers still sometimes brings all of them to 100% usage.
I have tried to simplify the code to a minimum example to just show the idea.
What can be the problem related to this code? Also, do you think this is the right approach to the problem?
import operator
import os
from multiprocessing import Pool
import numpy as np
ENVIRONMENT = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
               "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS")
def _prepare_environment():
    """This is necessary since numpy already uses a lot of threads. See
    https://stackoverflow.com/a/58195413/2326627"""
    print("Setting threads to 1")
    for env in ENVIRONMENT:
        os.environ[env] = "1"
def _op1(matrix, vector, another_matrix, params):
    matrix2 = matrix.copy()
    vector2 = vector.copy()
    # ... conditional additions of matrix2's rows and vector2 based on params...
    # Check if some columns of the matrix are equal to the other matrix given
    is_eq = np.array_equal(matrix2[:, params['range']], another_matrix)
    return (matrix2, vector2, is_eq)
def _op2(matrix, is_eq, vector, params):
    # Check bitwise sum of some columns of matrix and vector
    sum_cols_v = (matrix[:, params['range2']].sum(axis=1) + vector) % 2
    sum_cols_v_w = np.sum(sum_cols_v)
    # Check if the weight is equal to the given one
    is_correct_w = sum_cols_v_w == params['w']
    return (is_eq, is_correct_w)
def go(matrix, vector, pool):
    op1_ress = []
    for i in range(1000):  # number depends on other params, not interesting
        params = {}
        params['i'] = i
        # ...create other params...
        other_matrix = None
        # ...generate other matrix...
        res = pool.apply_async(_op1, (matrix, vector, other_matrix, params))
        op1_ress.append(res)
    # At this point we have all the results for all possible RREF
    print('op1 done')
    # We want to count the number of matrices equal to another matrix
    n_idens = sum(i for _, _, i in map(operator.methodcaller('get'), op1_ress))
    print(n_idens)
    for x in range(5):  # number depends on other params, not interesting
        op2_ress = []
        for (matrix2, vector2, is_eq) in map(operator.methodcaller('get'),
                                             op1_ress):
            params2 = {}
            params2['x'] = x
            # ... create other params ...
            for y in range(1000):  # number depends on other params, not interesting
                params2['y'] = y
                res = pool.apply_async(_op2,
                                       (matrix2, is_eq, vector2, params2))
                op2_ress.append(res)
        n_weights = 0
        n_weights_given_eq = 0
        for is_eq, is_correct_w in map(operator.methodcaller('get'), op2_ress):
            if is_correct_w:
                n_weights += 1
                if is_eq:
                    n_weights_given_eq += 1
        print(n_weights)
        print(n_weights_given_eq)
def main():
    _prepare_environment()
    pool = Pool(48)
    rng = np.random.default_rng()
    matrix = rng.integers(2, size=(12, 28))
    vector = rng.integers(2, size=(12, 1))
    go(matrix, vector, pool)
    pool.close()
    pool.join()

if __name__ == '__main__':
    main()
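Not part of the original code, but one way to narrow down where things stop (a sketch; the collect helper and its 60-second timeout are placeholders): gather the AsyncResult objects with a timeout, so a stuck worker raises in the parent instead of get() blocking forever, and any exception raised inside a worker is re-raised here as well.

from multiprocessing import TimeoutError

def collect(async_results, timeout=60):
    # Gather AsyncResult values; raise instead of hanging silently.
    values = []
    for res in async_results:
        try:
            values.append(res.get(timeout=timeout))
        except TimeoutError:
            print("a worker did not return within", timeout, "seconds")
            raise
    return values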
I don't really understand how to handle multiprocessing in Python when mapping function results into multidimensional arrays. Below is a simple example of how I calculate the result serially; the parallel version does not work for me. I often pass a lot of arguments to a function, so this would be a very annoying way of doing it. Is there a "better way" than creating all i,j pairs with a reshaped meshgrid?
import concurrent.futures
import numpy as np
def complex_function(i, j):
    # this is a computationally intense function
    return i, j, i + j
all_i = np.arange(3)
all_j = np.arange(6)
#%% serial
solution = np.empty((len(all_i), len(all_j)), dtype=float)
for i in range(len(all_i)):
    for j in range(len(all_j)):
        solution[i, j] = complex_function(all_i[i], all_j[j])[2]
#%% parallel
solution = np.empty((len(all_i), len(all_j)), dtype=float)
I, J = np.meshgrid(all_i, all_j, sparse=False, indexing='ij')
I = I.reshape(-1)
J = J.reshape(-1)
with concurrent.futures.ProcessPoolExecutor() as executor:
    for i, j, result in executor.map(complex_function, I, J):
        solution[i, j] = result
Okay, now I want to know whether I can use nested functions like
def dummy_function(i, j):
    result = complex_function(i, j)
    return result
and then call dummy_function(i, j) with the executor.
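As an aside (not from the original post): a wrapper like this has to be defined at module level, because ProcessPoolExecutor pickles the target function. A sketch of how fixed extra arguments can then be bound with functools.partial; the offset parameter is purely hypothetical.

import concurrent.futures
import functools
import numpy as np

def complex_function(i, j, offset):
    # hypothetical extra parameter, only to illustrate partial()
    return i, j, i + j + offset

wrapped = functools.partial(complex_function, offset=10)  # picklable: partial of a module-level function

if __name__ == '__main__':
    all_i = np.arange(3)
    all_j = np.arange(6)
    I, J = np.meshgrid(all_i, all_j, indexing='ij')
    solution = np.empty((len(all_i), len(all_j)), dtype=float)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for i, j, result in executor.map(wrapped, I.ravel(), J.ravel()):
            solution[i, j] = result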
I am using something similar to the following to parallelize a for loop over two matrices
from joblib import Parallel, delayed
import numpy
def processInput(i, j):
    for k in range(len(i)):
        i[k] = 1
    for t in range(len(j)):
        j[t] = 0
    return i, j
a = numpy.eye(3)
b = numpy.eye(3)
num_cores = 2
(a,b) = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
but I'm getting the following error: Too many values to unpack (expected 2)
Is there a way to return 2 values with delayed? Or what solution would you propose?
Also, a bit off topic: is there a more compact way, like the following (which doesn't actually modify anything), to process the matrices?
from joblib import Parallel, delayed
def processInput(i, j):
    for k in i:
        k = 1
    for t in b:
        t = 0
    return i, j
I would like to avoid the use of has_shareable_memory anyway, to avoid possible bad interactions in the actual script and lower performance(?).
Probably too late, but as an answer to the first part of your question:
Just return a tuple in your delayed function.
return (i,j)
And for the variable holding the output of all your delayed functions
results = Parallel(n_jobs=num_cores)(delayed(processInput)(i,j) for i,j in zip(a,b))
Now results is a list of tuples each holding some (i,j) and you can just iterate through results.
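If the goal is to get the modified rows back into array form, one possible follow-up (not part of the original answer), assuming a, b, num_cores and processInput as defined in the question: unzip the list of tuples and stack the rows again.

from joblib import Parallel, delayed
import numpy

results = Parallel(n_jobs=num_cores)(delayed(processInput)(i, j) for i, j in zip(a, b))
a_rows, b_rows = zip(*results)  # split the (i, j) tuples into two sequences
a = numpy.vstack(a_rows)        # stack the rows back into matrices
b = numpy.vstack(b_rows)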
I understand that there is overhead when using the multiprocessing module, but this seems like a high amount, and the level of IPC should be fairly low from what I can gather.
Say I generate a largish list of random numbers between 1 and 1000 and want to obtain a list of only the prime numbers. This code is only meant to test multiprocessing on CPU-intensive tasks; ignore the overall inefficiency of the primality test.
The bulk of the code may look something like this:
from random import SystemRandom
from math import sqrt
from timeit import default_timer as time
from multiprocessing import Pool, Process, Manager, cpu_count
rdev = SystemRandom()
NUM_CNT = 0x5000
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
primes = []
def chunk(l, n):
    i = int(len(l) / float(n))
    for j in range(0, n - 1):
        yield l[j*i:j*i + i]
    yield l[n*i - i:]
def is_prime(n):
    if n < 2: return False  # 0 and 1 are not prime
    if n == 2: return True
    if not n % 2: return False
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0:
            return False
    return True
It seems to me that I should be able to split this up among multiple processes. I have 8 logical cores, so I should be able to use cpu_count() as the # of processes.
Serial:
def serial():
    global primes
    primes = []
    for num in nums:
        if is_prime(num):
            primes.append(num)  # primes now contain all the values
The following sizes of NUM_CNT correspond to the speed:
0x500 = 0.00100 sec.
0x5000 = 0.01723 sec.
0x50000 = 0.27573 sec.
0x500000 = 4.31746 sec.
This is the way I chose to do the multiprocessing. It uses the chunk() function to split nums into cpu_count() (roughly equal) parts. Each chunk is passed to a new process, which iterates through it and then assigns its list of primes to an entry of a shared dict variable. The IPC should really only occur when I assign the value to the shared variable. Why would it occur otherwise?
def loop(ret, id, numbers):
    l_primes = []
    for num in numbers:
        if is_prime(num):
            l_primes.append(num)
    ret[id] = l_primes
def parallel():
    man = Manager()
    ret = man.dict()
    num_procs = cpu_count()
    procs = []
    for i, l in enumerate(chunk(nums, num_procs)):
        p = Process(target=loop, args=(ret, i, l))
        p.daemon = True
        p.start()
        procs.append(p)
    [proc.join() for proc in procs]
    return sum(ret.values(), [])
Again, I expect some overhead, but the time seems to grow far faster than it does for the serial version.
0x500 = 0.37199 sec.
0x5000 = 0.91906 sec.
0x50000 = 8.38845 sec.
0x500000 = 119.37617 sec.
What is causing it to do this? Is it IPC? The initial setup makes me expect some overhead, but this is just an insane amount.
Edit:
Here's how I'm timing the execution of the functions:
if __name__ == '__main__':
    print(hex(NUM_CNT))
    for func in (serial, parallel):
        t1 = time()
        vals = func()
        t2 = time()
        if vals is None:  # serial has no return value
            print(len(primes))
        else:  # but parallel does
            print(len(vals))
        print("Took {:.05f} sec.".format(t2 - t1))
The same list of numbers is used each time.
Example output:
0x5000
3442
Took 0.01828 sec.
3442
Took 0.93016 sec.
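For comparison (not part of the original post), a sketch of the same split without the Manager dict, assuming nums, chunk() and is_prime() from above: each worker returns its local list once through the pool instead of writing to a proxied object.

from multiprocessing import Pool, cpu_count

def find_primes(numbers):
    # same work as loop(), but the result is returned instead of being
    # written into a Manager dict
    return [num for num in numbers if is_prime(num)]

def parallel_no_manager():
    with Pool(cpu_count()) as pool:
        chunks = list(chunk(nums, cpu_count()))
        return sum(pool.map(find_primes, chunks), [])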
Hmm. How do you measure time? On my computer, the parallel version is much faster than the serial one.
I'm measuring it using time.time() this way, where tt is an alias for time.time():
t1 = int(round(tt() * 1000))
serial()
t2 = int(round(tt() * 1000))
print(t2 - t1)
parallel()
t3 = int(round(tt() * 1000))
print(t3 - t2)
I get, with 0x500000 as input:
5519ms for the serial version
3351ms for the parallel version
I believe your mistake is caused by including the number-generation step inside the parallel timing but not inside the serial one.
On my computer, generating the random numbers takes about 45 seconds (it is a very slow process), so that can explain the difference between your two values, as I don't think my computer uses a very different architecture.
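A small sketch of that idea (not in the original comment), assuming rdev and NUM_CNT from the question: time the number generation on its own, so neither the serial nor the parallel measurement includes it.

from timeit import default_timer as time

gen_start = time()
nums = [rdev.randint(0, 1000) for _ in range(NUM_CNT)]
gen_end = time()
print("generation took {:.05f} sec.".format(gen_end - gen_start))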
I'm trying to evaluate a chi-squared function, i.e. compare an arbitrary (black-box) function to a numpy array of data. At the moment I'm looping over the array in Python, but something like this is very slow:
n = len(array)
sigma = 1.0
chisq = 0.0
for i in range(n):
    data = array[i]
    model = f(i, a, b, c)
    chisq += 0.5*((data - model)/sigma)**2.0
return chisq
array is a 1-d numpy array and a,b,c are scalars. Is there a way to speed this up by using numpy.sum() or some sort of lambda function etc.? I can see how to remove one loop (over chisq) like this:
numpy.sum(((array-model_vec)/sigma)**2.0)
but then I still need to explicitly populate the array model_vec, which will presumably be just as slow; how do I do that without an explicit loop like this:
model_vec = numpy.zeros(n)
for i in range(n):
    model_vec[i] = f(i, a, b, c)
return numpy.sum(((array - model_vec)/sigma)**2.0)
?
Thanks!
You can use np.vectorize to 'vectorize' your function f if you don't have control over its definition:
g = np.vectorize(f)
But this is not as good as vectorizing the function yourself manually to support arrays, as it doesn't really do much more than internalize the loop, and it might not work well with certain functions. In fact, from the documentation:
Notes: The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
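Under that caveat, a quick usage sketch (not part of the original answer), assuming the question's f(i, a, b, c), array, sigma and n:

g = np.vectorize(f)
model_vec = g(np.arange(n), a, b, c)  # still a Python-level loop internally
chisq = np.sum(0.5*((array - model_vec)/sigma)**2.0)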
You should instead focus on making f accept a vector instead of i:
def f(i, a, b, x):
    return a*x[i] + b

def g(a, b, x):
    x = np.asarray(x)
    return a*x + b
Then, instead of calling f(i, a, b, x), call g(a, b, x)[i] if you only want the ith value; but for operations on the entire array, use g(a, b, x) and it will be much faster.
model_vec = g(a, b, x)
return numpy.sum(((array-model_vec)/sigma)**2.0)
It seems that your code is slow because what executes in the loop is slow (your model generation). Turning this into a one-liner won't speed things up. If you have access to a modern computer with more than one CPU, you could try to run this loop in parallel, for example using the multiprocessing module:
from multiprocessing import Pool

if __name__ == '__main__':
    # snip set up code
    pool = Pool(processes=4)  # start 4 worker processes
    inputs = [(i, a, b, c) for i in range(n)]
    model_array = pool.starmap(f, inputs)  # starmap unpacks each (i, a, b, c) tuple
    chisq = 0.0
    for i in range(n):
        data = array[i]
        model = model_array[i]
        chisq += 0.5*((data - model)/sigma)**2.0
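Note that starmap, rather than map, is what unpacks each (i, a, b, c) tuple into separate arguments of f; with map each worker would receive the whole tuple as a single argument. The final loop could also be replaced by the vectorized numpy.sum expression shown earlier.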