I am trying to parallelize my code across multiple processes. The computation is mostly CPU intensive, so I decided to go for multiprocessing rather than multithreading. I am given a matrix (~300-400 elements) and a vector (~10 elements).
First, I use the pool of processes to perform i distinct operations on a copy of the original matrix/vector, obtaining as a result i distinct new matrix/vector pairs. In the process, I also check whether a selection of columns is equal to another given matrix.
Then, for each pair obtained before, I perform further operations to check that the vector resulting from the bitwise sum of some columns of the matrix and the vector has a given weight. This last part is repeated x times, where x dictates how many columns of the matrix I should extract.
The main problem is that at some point the computation stops without any error message, so I don't know how to debug the issue at all.
Since numpy already uses multiple threads internally, I also set all the thread-count environment variables to 1 to avoid exhausting my hardware. However, even though I have 96 cores, spawning a pool of 48 workers still sometimes brings all 96 cores to 100% usage.
I have tried to simplify the code to a minimal example that just shows the idea.
What can be the problem related to this code? Also, do you think this is the right approach to the problem?
import operator
import os
from multiprocessing import Pool

import numpy as np

ENVIRONMENT = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
               "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS")


def _prepare_environment():
    """This is necessary since numpy already uses a lot of threads. See
    https://stackoverflow.com/a/58195413/2326627"""
    print("Setting threads to 1")
    for env in ENVIRONMENT:
        os.environ[env] = "1"
def _op1(matrix, vector, another_matrix, params):
    matrix2 = matrix.copy()
    vector2 = vector.copy()
    # ...conditional additions of matrix2's rows and vector2 based on params...
    # Check if some columns of the matrix are equal to the other matrix given
    is_eq = np.array_equal(matrix2[:, params['range']], another_matrix)
    return (matrix2, vector2, is_eq)


def _op2(matrix, is_eq, vector, params):
    # Check bitwise sum of some columns of matrix and vector
    sum_cols_v = (matrix[:, params['range2']].sum(axis=1) + vector) % 2
    sum_cols_v_w = np.sum(sum_cols_v)
    # Check if the weight is equal to the given one
    is_correct_w = sum_cols_v_w == params['w']
    return (is_eq, is_correct_w)
def go(matrix, vector, pool):
    op1_ress = []
    for i in range(1000):  # number depends on other params, not interesting
        params = {}
        params['i'] = i
        # ...create other params...
        other_matrix = None
        # ...generate other matrix...
        res = pool.apply_async(_op1, (matrix, vector, other_matrix, params))
        op1_ress.append(res)
    # At this point we have all the results for all possible RREF
    print('op1 done')
    # We want to count the number of matrices equal to the other matrix
    n_idens = sum(i for _, _, i in map(operator.methodcaller('get'), op1_ress))
    print(n_idens)
    for x in range(5):  # number depends on other params, not interesting
        op2_ress = []
        for (matrix2, vector2, is_eq) in map(operator.methodcaller('get'),
                                             op1_ress):
            params2 = {}
            params2['x'] = x
            # ...create other params...
            for y in range(1000):  # number depends on other params, not interesting
                params2['y'] = y
                res = pool.apply_async(_op2,
                                       (matrix2, is_eq, vector2, params2))
                op2_ress.append(res)
        n_weights = 0
        n_weights_given_eq = 0
        for is_eq, is_correct_w in map(operator.methodcaller('get'), op2_ress):
            if is_correct_w:
                n_weights += 1
                if is_eq:
                    n_weights_given_eq += 1
        print(n_weights)
        print(n_weights_given_eq)
def main():
    _prepare_environment()
    pool = Pool(48)
    rng = np.random.default_rng()
    matrix = rng.integers(2, size=(12, 28))
    vector = rng.integers(2, size=(12, 1))
    go(matrix, vector, pool)
    pool.close()
    pool.join()


if __name__ == '__main__':
    main()
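As a side note on the debugging aspect: apply_async accepts an error_callback that runs in the parent process as soon as a task raises, which can make otherwise silent worker failures visible. A minimal standalone sketch (the names here are illustrative, not taken from the code above):

from multiprocessing import Pool

def work(x):
    if x == 3:
        raise ValueError("boom")  # simulate a worker-side failure
    return x * x

def log_error(exc):
    # runs in the parent process as soon as a task raises, instead of the
    # exception staying hidden until .get() is called on that result
    print("worker task failed:", repr(exc))

if __name__ == '__main__':
    pool = Pool(4)
    results = [pool.apply_async(work, (i,), error_callback=log_error)
               for i in range(5)]
    pool.close()
    pool.join()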
Related
I want to build a dictionary of function evaluations in a parallel manner, but I am struggling to figure out how to do this efficiently.
Take the case of a randomly constructed matrix:
import functools
import multiprocessing
import time

import numpy as np

# generate a random symmetric matrix
N = 500
b = np.random.random_integers(-2000, 2000, size=(N, N))
b_symm = (b + b.T) / 2

# identity matrix
ident = np.eye(N)

# define the worker function
def func(w, b_mat):
    if w != 0:
        L = np.linalg.inv(1j * w * ident - b_mat)
    else:
        L = np.linalg.pinv(-b_mat)
    return L
I now want to sample over many values of w and construct a dictionary of outputs. This would be an embarrassingly parallel problem. I can do this with a shared dictionary, using something like this:
def dict_builder(w, d):
    d[w] = func(w, b_symm)

manager = multiprocessing.Manager()
val_dict = manager.dict()
wrange = np.linspace(-10, 10, 200)
processors = 2
pool = multiprocessing.Pool(processors)

st = time.time()
data = pool.map(functools.partial(dict_builder, d=val_dict), wrange, 2)
pool.close()
pool.join()
en = time.time()
print("parallel test took ", en - st, " seconds.")
but this seems more complicated than necessary, since I am only evaluating the function at unique points, and it comes with the overhead of a shared memory object.
What I would like to do is split wrange into n chunks, where n is the number of processors, build n dictionaries independently, and then combine them into a single dictionary. So, two questions: 1) Would this be computationally advantageous? 2) What is the best way to implement this using the multiprocessing module?
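For comparison, a possible simplification (not necessarily faster than the chunked scheme described above): since pool.map preserves the order of its input, the worker can return just the value and the results can be zipped back onto wrange afterwards, avoiding the Manager dict entirely. A sketch reusing the names from the question:

import functools
import multiprocessing
import numpy as np

N = 500
ident = np.eye(N)
b = np.random.randint(-2000, 2001, size=(N, N))
b_symm = (b + b.T) / 2

def func(w, b_mat):
    if w != 0:
        return np.linalg.inv(1j * w * ident - b_mat)
    return np.linalg.pinv(-b_mat)

if __name__ == "__main__":
    wrange = np.linspace(-10, 10, 200)
    with multiprocessing.Pool(2) as pool:
        # pool.map returns results in the same order as wrange,
        # so they can simply be zipped back onto the w values
        results = pool.map(functools.partial(func, b_mat=b_symm), wrange)
    val_dict = dict(zip(wrange, results))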
I am running some simulations that take a numpy array as input, continue for several iterations until some condition is met (stochastically), and then repeat using the same starting array. Each complete simulation starts with the same array, and the number of steps to complete each simulation isn't known beforehand (but it's fine to put a limit on it, maxt).
I would like to save the array (X) after each step of each simulation, preferably in a large multi-dimensional array. Below I'm saving each simulation output in a list, and saving the array with copy.copy. I appreciate there are other methods I could use (tuples, for instance), so is there a more efficient way of doing this in Python?
Note: I appreciate this is a trivial example and that one could vectorize the code below. In the actual application being used, however, I have to use a loop as the stochasticity is introduced in a more complicated manner.
import copy
import numpy as np

N = 10
Xstart = np.zeros(N)
Xstart[0] = 1
num_sims = 3
prob = 0.2
maxt = 20

analysis_result = []
for i in range(num_sims):
    print("-------- Starting new simulation --------")
    t = 0
    X = copy.copy(Xstart)
    # Create a new array to store results, save the array
    sim_result = np.zeros((maxt, N))
    sim_result[t, :] = X
    while (np.count_nonzero(X) < N) & (t < maxt):
        print(X)
        # Increment elements of the array stochastically
        X[(np.random.rand(N) < prob)] += 1
        # Save the array for time t
        sim_result[t, :] = copy.copy(X)
        t += 1
    print(X)
    analysis_result.append(sim_result[:t, :])
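For comparison, one way to keep every step of every run in a single multi-dimensional array is to preallocate a (num_sims, maxt + 1, N) block up front and record how many steps each run actually used; a sketch along those lines, reusing the variable names from the question:

import numpy as np

N, num_sims, prob, maxt = 10, 3, 0.2, 20

# one preallocated block holding every step of every simulation,
# plus an array recording how many steps each run actually took
all_results = np.zeros((num_sims, maxt + 1, N))
steps_taken = np.zeros(num_sims, dtype=int)

Xstart = np.zeros(N)
Xstart[0] = 1

for i in range(num_sims):
    X = Xstart.copy()
    all_results[i, 0, :] = X
    t = 0
    while np.count_nonzero(X) < N and t < maxt:
        X[np.random.rand(N) < prob] += 1
        t += 1
        all_results[i, t, :] = X
    steps_taken[i] = t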
Consider the following example in Python 2.7. We have an arbitrary function f() that returns two 1-dimensional numpy arrays. Note that in general f() may return arrays of different sizes and that the size may depend on the input.
Now we would like to call map on f() and concatenate the results into two separate new arrays.
import numpy as np

def f(x):
    return np.arange(x), np.ones(x, dtype=int)

inputs = np.arange(1, 10)
result = map(f, inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
This gives the intended result. However, since result may take up much memory, it may be preferable to use a generator by calling imap instead of map.
from itertools import imap
result = imap(f,inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
However, this gives an error, because the generator is already exhausted by the time we compute y.
Is there a way to use the generator only once and still create these two concatenated arrays? I'm looking for a solution without a for loop, since it is rather inefficient to repeatedly concatenate/append arrays.
Thanks in advance.
Is there a way to use the generator only once and still create these two concatenated arrays?
Yes, a generator can be cloned with tee:
import itertools
a, b = itertools.tee(result)
x = np.concatenate([i[0] for i in a])
y = np.concatenate([i[1] for i in b])
However, using tee does not help with the memory usage in your case. The above solution would require 5 N memory to run:
N for caching the generator inside tee,
2 N for the list comprehensions inside np.concatenate calls,
2 N for the concatenated arrays.
Clearly, we could do better by dropping the tee:
x_acc = []
y_acc = []
for x_i, y_i in result:
    x_acc.append(x_i)
    y_acc.append(y_i)
x = np.concatenate(x_acc)
y = np.concatenate(y_acc)
This shaves off one more N, leaving 4 N. Going further means dropping the intermediate lists and preallocating x and y. Note that you needn't know the exact sizes of the arrays, only an upper bound:
x = np.empty(capacity)
y = np.empty(capacity)
right = 0
for x_i, y_i in result:
    left = right
    right += len(x_i)  # == len(y_i)
    x[left:right] = x_i
    y[left:right] = y_i
x = x[:right].copy()
y = y[:right].copy()
In fact, you don't even need an upper bound. Just ensure that x and y are big enough to accommodate the new item:
for x_i, y_i in result:
    # ...
    if right >= len(x):
        # It would be slightly trickier for >1D, but the idea
        # remains the same: alter the 0-th dimension to fit
        # the new item.
        new_capacity = int(max(right, len(x)) * 1.5)
        x = np.resize(x, new_capacity)
        y = np.resize(y, new_capacity)
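Putting those pieces together, a complete sketch of the growing-buffer variant (the initial capacity and the 1.5 growth factor are arbitrary choices):

import numpy as np

def f(x):
    return np.arange(x), np.ones(x, dtype=int)

inputs = np.arange(1, 10)
result = (f(x) for x in inputs)  # a generator, as with imap

capacity = 16
x = np.empty(capacity, dtype=int)
y = np.empty(capacity, dtype=int)
right = 0
for x_i, y_i in result:
    # grow the buffers geometrically whenever the next chunk will not fit;
    # np.resize returns a new, larger array whose leading elements are preserved
    while right + len(x_i) > len(x):
        x = np.resize(x, int(len(x) * 1.5) + 1)
        y = np.resize(y, int(len(y) * 1.5) + 1)
    left = right
    right += len(x_i)
    x[left:right] = x_i
    y[left:right] = y_i
x = x[:right].copy()
y = y[:right].copy()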
I'm a beginner. I recently came across the Mandelbrot set, which is fantastic, so I decided to draw it with Python.
But there is a problem: I get a MemoryError when I run this code.
The statement num_set = gen_num_set(10000) produces a large list of about 20000*20000*4 = 1,600,000,000 elements. When I use 1000 instead of 10000, the code runs successfully.
My computer has 4GB of memory and the operating system is Windows 7 32-bit. I want to know whether this problem is a limit of my computer or whether there is some way to optimize my code.
Thanks.
#!/usr/bin/env python3.4
import matplotlib.pyplot as plt
import numpy as np
import random
import time
from multiprocessing import *

def first_quadrant(n):
    start_point = 1 / n
    n = 2 * n
    return gen_complex_num(start_point, n, 1)

def second_quadrant(n):
    start_point = 1 / n
    n = 2 * n
    return gen_complex_num(start_point, n, 2)

def third_quadrant(n):
    start_point = 1 / n
    n = 2 * n
    return gen_complex_num(start_point, n, 3)

def four_quadrant(n):
    start_point = 1 / n
    n = 2 * n
    return gen_complex_num(start_point, n, 4)

def gen_complex_num(start_point, n, quadrant):
    complex_num = []
    if quadrant == 1:
        for i in range(n):
            real = i * start_point
            for j in range(n):
                imag = j * start_point
                complex_num.append(complex(real, imag))
        return complex_num
    elif quadrant == 2:
        for i in range(n):
            real = i * start_point * (-1)
            for j in range(n):
                imag = j * start_point
                complex_num.append(complex(real, imag))
        return complex_num
    elif quadrant == 3:
        for i in range(n):
            real = i * start_point * (-1)
            for j in range(n):
                imag = j * start_point * (-1)
                complex_num.append(complex(real, imag))
        return complex_num
    elif quadrant == 4:
        for i in range(n):
            real = i * start_point
            for j in range(n):
                imag = j * start_point * (-1)
                complex_num.append(complex(real, imag))
        return complex_num

def gen_num_set(n):
    return [first_quadrant(n), second_quadrant(n), third_quadrant(n), four_quadrant(n)]
def if_man_set(num_set):
    iteration_n = 10000
    man_set = []
    for c in num_set:
        z = complex(0, 0)  # reset z for every candidate point c
        if_man = 1
        for i in range(iteration_n):
            if abs(z) > 2:
                if_man = 0
                break
            z = z*z + c
        if if_man:
            man_set.append(c)
    return man_set

def plot_scatter(x, y):
    #plt.plot(x,y)
    color = ran_color()
    plt.scatter(x, y, c=color)
    plt.show()

def ran_num():
    return random.random()

def ran_color():
    return [ran_num() for i in range(3)]

def plot_man_set(man_set):
    z_real = []
    z_imag = []
    for z in man_set:
        z_real.append(z.real)
        z_imag.append(z.imag)
    plot_scatter(z_real, z_imag)

if __name__ == "__main__":
    start_time = time.time()
    num_set = gen_num_set(10000)
    with Pool(processes=4) as pool:
        # use multiprocessing
        set_part = pool.map(if_man_set, num_set)
    man_set = []
    for i in set_part:
        man_set += i
    plot_man_set(man_set)
    end_time = time.time()
    use_time = end_time - start_time
    print(use_time)
You say you are creating a list with 1.6 billion elements. Each of those is a complex number which contains 2 floats. A Python complex number takes 24 bytes (at least on my system: sys.getsizeof(complex(1.0,1.0)) gives 24), so you'll need over 38GB just to store the values, and that's before you even start looking at the list itself.
Your list with 1.6 billion elements won't fit at all on a 32-bit system (6.4GB with 4-byte pointers), so you would need to go to a 64-bit system with 8-byte pointers, and that will need 12.8GB just for the pointers.
So, no way you're going to do that unless you upgrade to a 64-bit OS with maybe 64GB RAM (though it might need more).
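For reference, the arithmetic behind those figures (the 24-byte object size is taken from the answer and varies by platform):

import sys

n = 4 * (2 * 10000) ** 2                 # 1.6 billion grid points in total
print(sys.getsizeof(complex(1.0, 1.0)))  # 24 on the answerer's system
print(n * 24 / 1e9)                      # ~38.4 GB for the complex objects alone
print(n * 8 / 1e9)                       # ~12.8 GB just for 64-bit list pointers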
When handling data this large you should prefer numpy arrays over Python lists. There is a nice post explaining why (What are the advantages of NumPy over regular Python lists?), but I will try to sum it up.
In Python, each complex number in your list is an object (with methods and attributes) and takes up some overhead space for that. That is why each one takes up 24 bytes (as Duncan pointed out) instead of just the raw 8 bytes (2 x 32-bit) that two single-precision floats would need.
Numpy arrays build on C-style arrays (basically all values written next to each other in memory as raw numbers, not objects). They don't provide some of the nice functionality of Python lists (like appending) and are restricted to a single data type. They save a lot of space, though, as you do not need to store the objects' overhead. This reduces the space needed for each complex number from 24 bytes to 8 bytes (two 32-bit floats, i.e. numpy's complex64).
While Duncan is right that the big instance you tried will not fit even with numpy, this might help you process bigger instances.
As you have already imported numpy, you could change your code to use numpy arrays instead. Please mind that I am not too proficient with numpy and there is most certainly a better way to do this, but here is an example with only small changes to your original code:
def gen_complex_num_np(start_point, n, quadrant):
    # create an n x n array of complex numbers
    complex_num = np.ndarray(shape=(n, n), dtype=np.complex64)
    if quadrant == 1:
        for i in range(n):
            real = i * start_point
            for j in range(n):
                imag = j * start_point
                # fill one entry of the array
                complex_num[i, j] = complex(real, imag)
        # concatenate the array rows to
        # get a list-like return value again
        return complex_num.flatten()
...
Here your Python list is replaced with a 2D numpy array of dtype complex64. After the array has been filled, it is flattened (all row vectors are concatenated) to mimic your return format.
Note that you would have to change the man_set lists in all other parts of your program accordingly.
I hope this helps.
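For reference, the same grid can also be built without any Python loops by letting numpy broadcasting fill whole rows and columns at once; a rough sketch (the function name and the sign-flag arguments here are illustrative, not part of the original code):

import numpy as np

def gen_complex_grid(n, sign_re=1, sign_im=1):
    # axis values 0, 1/n, 2/n, ..., (2n-1)/n for one quadrant
    axis = np.arange(2 * n, dtype=np.float32) / n
    grid = np.empty((2 * n, 2 * n), dtype=np.complex64)
    # broadcasting fills entire rows/columns at once, no inner loops needed
    grid.real = sign_re * axis[:, None]
    grid.imag = sign_im * axis[None, :]
    return grid.flatten()

# the four quadrants, mirroring gen_num_set(1000)
num_set = [gen_complex_grid(1000, re, im)
           for re, im in ((1, 1), (-1, 1), (-1, -1), (1, -1))]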
Here's the inefficient Python code I have:
import numpy as np
import random

matrix = np.empty((100, 100), dtype=bool)
matrix[:, :] = False
matrix[50, 50] = True

def propagate(matrix, i, j):
    for (di, dj) in [(1, 0), (0, 1), (-1, 0), (0, -1)]:
        (ni, nj) = (i + di, j + dj)
        if matrix[ni, nj] and flip_coin_is_face():
            matrix[i, j] = True

def flip_coin_is_face():
    return random.uniform(0, 1) < 0.5

for k in xrange(1000):
    for i in xrange(1, 99):
        for j in xrange(1, 99):
            propagate(matrix, i, j)
which basically propagates the True state from the center of the matrix. Since I'm coding the loops and the propagation rule in Python, this is of course very slow.
My question is, how can I use Numpy indexing to make this as fast as possible?
I can think of one approach, but it differs from your original code. Namely, on each step (the k loop) you can find the ones in the current array, propagate each value to its neighbours, i.e. roll the dice 4 times the number of ones, and evaluate the next step's array. Each operation can be done with a numpy one-liner (using where, reshape, + and * on matrices), so there will be no inner loops.
The difference is that we do not take into account values propagated within the same step, since all changes are evaluated at once. This will slow down the propagation, I suppose noticeably, in terms of the number of steps required to fill the whole matrix.
If this approach is ok, I can come up with some code.
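A rough sketch of what that could look like (illustrative, using the question's 100x100 grid, 0.5 probability and fixed border):

import numpy as np

matrix = np.zeros((100, 100), dtype=bool)
matrix[50, 50] = True

for k in range(1000):
    new = np.zeros_like(matrix)
    inner_shape = matrix[1:-1, 1:-1].shape
    # each of the four neighbours propagates independently with probability 0.5,
    # based only on the state at the start of the step
    for neighbour in (matrix[:-2, 1:-1], matrix[2:, 1:-1],
                      matrix[1:-1, :-2], matrix[1:-1, 2:]):
        new[1:-1, 1:-1] |= neighbour & (np.random.random(inner_shape) < 0.5)
    matrix |= new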
I'm not great with numpy, but to "propagate" through the matrix, you can use something like a breadth-first search. If you haven't used it before, it looks like this:
import Queue
import numpy as np

def neighbors(i, j, mat_shape):
    rows = mat_shape[0]
    cols = mat_shape[1]
    offsets = [(-1, 0), (1, 0), (0, 1), (0, -1)]
    neighbors = []
    for off in offsets:
        r = off[0] + i
        c = off[1] + j
        if 0 <= r < rows and 0 <= c < cols:
            neighbors.append((r, c))
    return neighbors

def propagate(matrix, i, j):
    # 'parents' is used in two ways. first, it tells us where we've already been;
    # second, it tells us which cell each visited cell was reached from
    parents = np.empty(matrix.shape, dtype=object)
    parents[:, :] = None
    # first-in-first-out queue. initially it just has the start point
    Q = Queue.Queue()
    # do the first step manually; start propagation with neighbors
    matrix[i, j] = True
    parents[i, j] = (i, j)  # mark the start point as visited so it is never re-enqueued
    for n in neighbors(i, j, matrix.shape):
        Q.put(n)
        parents[n[0], n[1]] = (i, j)
    # initialization done. on to the propagation
    while not Q.empty():
        current = Q.get()  # gets the front element and removes it
        parent = parents[current[0], current[1]]
        matrix[current[0], current[1]] = matrix[parent[0], parent[1]] and flip_coin_is_face()
        # propagate to neighbors, in order
        for next in neighbors(current[0], current[1], matrix.shape):
            # only propagate there if we haven't already
            if parents[next[0], next[1]] is None:
                parents[next[0], next[1]] = current
                Q.put(next)
    return matrix
You can probably be more clever and cut off the propagation early (since once it gets to False, it will never get True again). But for 100x100, this should be plenty fast.