Mismatch between parallelized and linear nested for loops - python

I want to parallelize a piece of code that resembles the following:
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
so I followed the advice here and I rewrote it as follows:
toavg=[]
gal=[]
p = mp.Pool()
def deltaz(params):
    j=params[0] # index of the galaxy
    m=params[1] # indices for which we have sampled redshifts
    gal.append(samples[m][j]-zt[j])
    return np.mean(gal)
j=(np.linspace(0,Ngal-1,Ngal).astype(int))
m=sampind
grid=[j,m]
input=itertools.product(*grid)
results = p.map(deltaz,input)
accuracy=np.mean(results)
p.close()
p.join()
but the results are not the same. In fact, sometimes they are, sometimes they're not; it doesn't seem very deterministic. Is my approach correct? If not, what should I fix? The modules that you will need to reproduce the above examples are:
import numpy as np
import multiprocess as mp
import itertools
Thank you!

The first issue I see is that you are creating a global variable gal that is accessed by the function deltaz. Globals are, however, not shared between the pool processes; each process gets its own copy. You would have to use shared memory if you want the processes to share this structure. This is probably why you see non-deterministic behavior.
The next issue is that the two versions are not actually doing the same work. In the first one you take the average of each per-galaxy set of differences (gal) and then average those averages. The parallel one takes an average of whichever elements happen to have ended up in that process's copy of gal. This is nondeterministic because items are handed out as workers become available, and that order is not necessarily predictable.
I would suggest parallelizing the outer loop, i.e. one task per galaxy, with the inner loop over sampind running inside each task. To do this, you need zt and samples to both be in shared memory because they are accessed by all of the processes. This can get dangerous if you are modifying data, but since you appear to only be reading, it should be fine.
import numpy as np
import multiprocessing as mp
import itertools
import ctypes
#Non-parallel code
Ngal=10
sampind=[7,16,22,31,45]
samples=0.3*np.ones((60,Ngal))
zt=[2.15,7.16,1.23,3.05,4.1,2.09,1.324,3.112,0.032,0.2356]
#Nonparallel
toavg=[]
for j in range(Ngal):
    gal=[]
    for m in sampind:
        gal.append(samples[m][j]-zt[j])
    toavg.append(np.mean(gal))
accuracy=np.mean(toavg)
print(toavg)
# Parallel function
def deltaz(j):
    sampind=[7,16,22,31,45]
    gal = []
    for m in sampind:
        gal.append(sampArr[m][j]-ztArr[j])  # read from the shared arrays created below
    return np.mean(gal)
# Shared array for zt
zt_base = mp.Array(ctypes.c_double, int(len(zt)),lock=False)
ztArr = np.ctypeslib.as_array(zt_base)
#Shared array for samples
sample_base = mp.Array(ctypes.c_double, int(np.product(samples.shape)),lock=False)
sampArr = np.ctypeslib.as_array(sample_base)
sampArr = sampArr.reshape(samples.shape)
#Copy arrays to shared
sampArr[:,:] = samples[:,:]
ztArr[:] = zt[:]
with mp.Pool() as p:
    result = p.map(deltaz,(np.linspace(0,Ngal-1,Ngal).astype(int)))
print(result)
Here is an example that produces the same results. You can add more complexity to this as you see fit but I would read about multiprocessing in general and memory types/scopes to get an idea of what will and won't work. You have to take more care when you get into the multiprocessing world. Let me know if this doesn't help and I will try to update it so that it does.
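If the arrays are small enough that copying them to each worker is acceptable, a simpler pattern is to avoid globals entirely and pass everything the worker needs through the map call itself. A minimal sketch (the helper name mean_deltaz and the use of functools.partial are illustrative additions, not part of the answer above):
import numpy as np
import multiprocessing as mp
from functools import partial

def mean_deltaz(j, samples, zt, sampind):
    # average of (sampled redshift - true redshift) for galaxy j
    return np.mean([samples[m][j] - zt[j] for m in sampind])

if __name__ == '__main__':
    Ngal = 10
    sampind = [7, 16, 22, 31, 45]
    samples = 0.3*np.ones((60, Ngal))
    zt = [2.15, 7.16, 1.23, 3.05, 4.1, 2.09, 1.324, 3.112, 0.032, 0.2356]

    worker = partial(mean_deltaz, samples=samples, zt=zt, sampind=sampind)
    with mp.Pool() as p:
        toavg = p.map(worker, range(Ngal))
    accuracy = np.mean(toavg)
    print(accuracy)
The trade-off is that samples and zt are pickled and sent to every worker, which is fine for small arrays but argues for the shared-memory approach above once the arrays get large.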

Related

Parallelize Image Processing Using Numpy

I'm trying to speed up a section of my code using parallel processing in Python, but I'm having trouble getting it to work right, or even finding examples that are relevant to me.
The code produces a low-polygon version of an image using Delaunay triangulation, and the part that's slowing me down is finding the mean values of each triangle.
I've been able to get a good speed increase by vectorizing my code, but hope to get more using parallelization:
The code I'm having trouble with is an extremely simple for loop:
for tri in tris:
    lopo[tridex==tri,:] = np.mean(hipo[tridex==tri,:],axis=0)
The variables referenced are as follows.
tris - a unique python list of all the indices of the triangles
lopo - a Numpy array of the final low-polygon version of the image
hipo - a Numpy array of the original image
tridex - a Numpy array the same size as the image. Each element represents a pixel and stores the triangle that the pixel lies within
I can't seem to find a good example that uses multiple numpy arrays as input, with one of them shared.
I've tried multiprocessing (with the above snippet wrapped in a function called colorImage):
p = Process(target=colorImage, args=(hipo,lopo,tridex,ppTris))
p.start()
p.join()
But I get a broken pipe error immediately.
So the way that parallelism in Python works (for the most part) is that you have to designate the individual threads or processes that you want to run. I made a brief introductory tutorial here: http://will-farmer.com/parallel-python.html
In your case, what I would recommend is split tris into a bunch of different parts, each equally sized, each that represents a "worker". You can split this list with numpy.split() (documentation here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html).
Then, for each list in tri_lists, we use the threading and queue modules to designate 8 workers.
import queue
import threading
import numpy as np

# split into 8 different lists
tri_lists = np.split(tris, 8)
# Queues are threadsafe
return_values = queue.Queue()
threads = []

def color_image(q, tris, hipo, tridex):
    """ This is the function we're parallelizing """
    for tri in tris:
        # store (triangle, mean colour) pairs so results can be matched up afterwards
        q.put((tri, np.mean(hipo[tridex==tri,:], axis=0)))

# Now we run the jobs
for i in range(8):
    threads.append(threading.Thread(
        target=color_image,
        args=(return_values, tri_lists[i], hipo, tridex)))
for t in threads:
    t.start()
for t in threads:
    t.join()

# Now we have to clean up our results
# First get the (triangle, mean) pairs from the queue
results = [item for item in return_values.queue]
# Now set values in lopo
for tri, mean_color in results:
    lopo[tridex==tri, :] = mean_color
This isn't the cleanest way to do it, and I'm not sure it works since I can't test it, but it is a decent approach. The parallelized part is now the np.mean() calls, while setting the values is not parallelized.
If you want to also parallelize the setting of the values, you'll have to have a shared variable, either using the Queue, or with a global variable.
See this post for a shared global variable: Python Global Variable with thread
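If the per-triangle work is heavy enough, a process-based variant of the same idea is also possible: compute each triangle's mean colour in worker processes and do all the assignments back in the parent, so no shared writable state is needed. A rough sketch with tiny synthetic stand-ins for the question's hipo, tridex, lopo and tris (the helper name mean_for_tri is illustrative):
from multiprocessing import Pool
from functools import partial
import numpy as np

def mean_for_tri(tri, hipo, tridex):
    # mean colour of all pixels assigned to triangle `tri`
    return tri, np.mean(hipo[tridex == tri, :], axis=0)

if __name__ == '__main__':
    # tiny synthetic stand-ins for the question's hipo/tridex/lopo/tris
    hipo = np.random.rand(100, 100, 3)
    tridex = np.random.randint(0, 8, (100, 100))
    lopo = np.zeros_like(hipo)
    tris = list(range(8))

    worker = partial(mean_for_tri, hipo=hipo, tridex=tridex)
    with Pool() as pool:
        for tri, mean_color in pool.imap_unordered(worker, tris):
            lopo[tridex == tri, :] = mean_color
Because the assignments happen in the parent, only read-only data crosses the process boundary.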

Python parallel programming issue

I need to do some intense numerical computations and fortunately Python offers very simple ways to implement parallelisation. However, the results I got were totally weird, and after some trial and error I stumbled upon the problem.
The following code simply calculates the mean of a random sample of numbers but illustrates my problem:
import multiprocessing
import numpy as np
from numpy.random import random
# Define function to generate random number
def get_random(seed):
    dummy = random(1000) * seed
    return np.mean(dummy)

# Input data
input_data = [100,100,100,100]

pool = multiprocessing.Pool(processes=4)
result = pool.map(get_random, input_data)
print(result)

for i in input_data:
    print(get_random(i))
Now the output looks like this:
[51.003368466729405, 51.003368466729405, 51.003368466729405, 51.003368466729405]
for the parallelisation, which is always the same
and like this for the normal not parallelised loop:
50.8581749381
49.2887091049
50.83585841
49.3067281055
As you can see, the parallelisation just returns the same results, even though it should have calculated different means, just as the loop does. Now, sometimes I get only 3 equal numbers, with one being different from the other 3.
I suspect that some memory is allocated to all sub processes...
I would love some hints on what is going on here and what a fix would look like. :)
thanks
When you use multiprocessing, you're talking about distinct processes. Distinct processes means distinct Python interpreters. Distinct interpreters means distinct random states. If you aren't seeding the random number generator uniquely on each process, then you're going to get the same starting random state from each process.
The answer was to put a new random seed into each process. Changing the function to
def get_random(seed):
    np.random.seed()
    dummy = random(1000) * seed
    return np.mean(dummy)
gives the wanted results. 😊
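Note that np.random.seed() with no argument reseeds each process from OS entropy, so the runs are varied but not reproducible. A sketch of an alternative that passes an explicit, distinct seed to each task (here the original seed argument, which really acts as a scale factor, is renamed scale for clarity):
import multiprocessing
import numpy as np

def get_random(args):
    seed, scale = args
    rng = np.random.default_rng(seed)  # independent generator per task
    return np.mean(rng.random(1000) * scale)

if __name__ == '__main__':
    input_data = [100, 100, 100, 100]
    tasks = list(enumerate(input_data))  # (seed, scale) pairs with seeds 0..3
    with multiprocessing.Pool(processes=4) as pool:
        print(pool.map(get_random, tasks))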

Python multiprocessing is taking much longer than single processing

I am performing some large computations on 3 different numpy 2D arrays sequentially. The arrays are huge, 25000x25000 each. Each computation takes significant time, so I decided to run 3 of them in parallel on 3 CPU cores on the server. I am following the standard multiprocessing guidelines and creating 2 processes and a worker function. Two computations run through the 2 processes and the third one runs locally without a separate process. I am passing the huge arrays as arguments of the processes like:
p1 = Process(target = Worker, args = (queue1, array1, ...)) # Some other params also going
p2 = Process(target = Worker, args = (queue2, array2, ...)) # Some other params also going
The Worker function sends back two numpy vectors (1D arrays) in a list appended to the queue, like:
queue.put([v1, v2])
I am not using multiprocessing.pool
but surprisingly I am not getting a speedup; it is actually running 3 times slower. Is passing the large arrays taking time? I am unable to figure out what is going on. Should I use shared memory objects instead of passing arrays?
I shall be thankful if anybody can help.
Thank you.
My problem appears to be resolved. I was using a Django module from inside which I was calling multiprocessing.pool.map_async, and my worker function was a method of the class itself. That was the problem: multiprocessing cannot call a method of the same class inside another process, because subprocesses do not share memory, so inside the subprocess there is no live instance of the class. That, as far as I understood, is why it was not getting called. I removed the function from the class and put it in the same file but outside the class, just before the class definition starts. It worked, and I also got a moderate speedup. One more thing for people facing the same problem: please do not read large arrays and pass them between processes. Pickling and unpickling them takes a lot of time, and instead of a speedup you will get a slowdown. Try to read the arrays inside the subprocess itself.
And if possible please use numpy.memmap arrays, they are quite fast.
Here is an example using np.memmap and Pool. Note that you can define both the number of chunks to process and the number of workers. In this case you don't have control over the queue; that can be achieved using multiprocessing.Queue:
from multiprocessing import Pool
import numpy as np

def mysum(array_file_name, col1, col2, shape):
    a = np.memmap(array_file_name, shape=shape, mode='r+')
    a[:, col1:col2] = np.random.random((shape[0], col2-col1))
    ans = a[:, col1:col2].sum()
    del a
    return ans

if __name__ == '__main__':
    nop = 1000  # number of column chunks (tasks submitted to the pool)
    now = 3     # number of worker processes
    p = Pool(now)
    array_file_name = 'test.array'
    shape = (250000, 250000)
    a = np.memmap(array_file_name, shape=shape, mode='w+')
    del a
    cols = [[shape[1]*i//nop, shape[1]*(i+1)//nop] for i in range(nop)]
    results = []
    for c1, c2 in cols:
        r = p.apply_async(mysum, args=(array_file_name, c1, c2, shape))
        results.append(r)
    p.close()
    p.join()
    final_result = sum([r.get() for r in results])
    print(final_result)
You can achieve better performance using shared-memory parallel processing, when possible. See this related question:
Shared-memory objects in python multiprocessing
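On Python 3.8+ another option in this spirit is multiprocessing.shared_memory, which lets workers attach to one array by name so only the name is pickled, not the data. A minimal sketch with a smaller array and an illustrative col_sum helper (not from the answer above):
from multiprocessing import Pool, shared_memory
import numpy as np

SHAPE, DTYPE = (5000, 5000), np.float64  # scaled-down stand-in for the 25000x25000 arrays

def col_sum(args):
    shm_name, c1, c2 = args
    shm = shared_memory.SharedMemory(name=shm_name)  # attach by name, no copy
    a = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    s = a[:, c1:c2].sum()
    shm.close()
    return s

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=int(np.prod(SHAPE)) * 8)
    a = np.ndarray(SHAPE, dtype=DTYPE, buffer=shm.buf)
    a[:] = np.random.random(SHAPE)
    jobs = [(shm.name, c, c + 500) for c in range(0, SHAPE[1], 500)]
    with Pool(3) as p:
        total = sum(p.map(col_sum, jobs))
    print(total, a.sum())  # the two sums should match up to float rounding
    shm.close()
    shm.unlink()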

Python: Splitting up a sum with threads

I have a costly calculation to do for fitting some experimental data. The fitting function is a sum over eigenmodes, each of them containing a specific surface integral. As it is rather slow if you do it the classical way, I thought about threading it. I'm using Python, btw.
The function I want to calculate is something like
def fit_func(params, Mmin, Mmax):
    values = np.zeros(1000)
    for m in range(Mmin, Mmax):
        # Fancy calculation for each mode
        # some calculation with all modes, adding them up in 'values'
    return values
How can I split this up? I did something like
data1 = thread.start_new_thread(fit_func, (params,0,13))
data2 = thread.start_new_thread(fit_func, (params,13,25))
but then the sum of data1 and data2 is not the same as fit_func(params, 0, 25)...
Try out multiprocessing. This will effectively create separate Python processes using a thread-like interface. However, make sure that you profile your computation and make sure that it is the problem, not something else like IO. Starting processes is very slow, so keep them around for a while if you are planning to use them.
You can also use numpy for those functions. They're written in C code, so they're stupid fast. Check them both out and see what fits best. I would go for the numpy solution myself...
use multiprocessing pool
import multiprocessing as mp
p = mp.Pool(10)
res = p.map(your_function, range(Mmin, Mmax))
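To connect that one-liner back to the question's fit_func: map over the modes, then sum the per-mode arrays afterwards. A sketch where single_mode is a placeholder for the per-mode "fancy calculation" (the name and the example parameters are illustrative):
import multiprocessing as mp
from functools import partial
import numpy as np

def single_mode(m, params):
    # placeholder for the "fancy calculation" of one eigenmode
    return np.full(1000, m * params)

if __name__ == '__main__':
    params, Mmin, Mmax = 0.1, 0, 25
    with mp.Pool(10) as p:
        per_mode = p.map(partial(single_mode, params=params), range(Mmin, Mmax))
    values = np.sum(per_mode, axis=0)  # same shape and meaning as fit_func(params, Mmin, Mmax)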

Parfor for Python

I am looking for a definitive answer to MATLAB's parfor for Python (Scipy, Numpy).
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical numerical computation code that I need speeding up
import numpy as np
N = 2000
output = np.zeros([N,N])
for i in range(N):
    for j in range(N):
        output[i,j] = HeavyComputationThatIsThreadSafe(i,j)
An example of a heavy computation function is:
import scipy.optimize
def HeavyComputationThatIsThreadSafe(i,j):
    n = i * j
    return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]
The one built into Python is multiprocessing; the docs are here. I always use multiprocessing.Pool with as many workers as processors. Then whenever I need a for-loop-like structure I use Pool.imap
As long as the body of your function does not depend on any previous iteration, you should get near-linear speed-up. This also requires that your inputs and outputs are pickle-able, but that is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product

def Fun(args):
    # unpack the (i, j) pair produced by itertools.product
    return HeavyComputationThatIsThreadSafe(*args)

output = np.zeros((N,N))
pool = Pool()  # defaults to number of available CPU's
chunksize = 20  # this may take some guessing ... take a look at the docs to decide
for ind, res in enumerate(pool.imap(Fun, product(range(N), range(N)), chunksize)):
    output.flat[ind] = res
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
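MultiEngineClient dates from old IPython versions; the same machinery now lives in the ipyparallel package. A rough sketch, assuming a set of engines has already been started (e.g. with ipcluster start -n 4); the function heavy is just a placeholder:
import ipyparallel as ipp

def heavy(i):
    return i * i

rc = ipp.Client()                          # connect to the running engines
view = rc.load_balanced_view()
results = view.map_sync(heavy, range(16))  # parfor-style map across the engines
print(results)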
Jupyter Notebook
To see an example, suppose you want to write the equivalent of this MATLAB code in Python:
matlabpool open 4
parfor n=0:9
    for i=1:10000
        for j=1:10000
            s=j*i
        end
    end
    n
end
disp('done')
Here is how one may write this in Python, particularly in a Jupyter notebook. You have to create a file in the working directory (I called it FunForParFor.py) containing the following function:
def func(n):
    for i in range(10000):
        for j in range(10000):
            s=j*i
    print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()
    print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray

ray.init()

# Define the function. Each remote function will be executed
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i*j
    time.sleep(0.5)  # Simulate some heavy computation.
    return n

N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e., an identifier to the
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous finished, which
        # leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))

# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)

# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, and thus the remote function's computation should take longer than the overhead of invoking it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize its scheduling and startup overhead.
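One way to follow that rule of thumb when individual cells are cheap is to batch them, e.g. one remote call per row instead of per element. A sketch (the name compute_row is illustrative, not part of the answer above):
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def compute_row(i, N):
    # one task computes an entire row instead of a single cheap cell
    return [i * j for j in range(N)]

N = 100
rows = ray.get([compute_row.remote(i, N) for i in range(N)])
outputs = np.array(rows)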
I've always used Parallel Python but it's not a complete analog since I believe it typically uses separate processes which can be expensive on certain operating systems. Still, if the body of your loops are chunky enough then this won't matter and can actually have some benefits.
I tried all solutions here, but found that the simplest way and closest equivalent to MATLAB's parfor is numba's prange.
Essentially you change a single letter in your loop, range to prange:
from numba import njit, prange

@njit(parallel=True)  # autojit is gone from current numba; parallel=True is what lets prange run in parallel
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
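For example, calling the compiled function on a plain NumPy array:
import numpy as np

A = np.arange(1_000_000, dtype=np.float64)
print(parallel_sum(A))  # 499999500000.0, the same as A.sum()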
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
instead of taking a for loop
from time import sleep
for _ in range(10):
    sleep(.2)
rewrite your operation into a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not directly evaluate the expression, but collect what should be done.
This is what the delayed method is for.
from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]
Next instantiate a Parallel object with n_jobs workers and process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
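Applied to the question's (i, j) grid, the same pattern might look like this (heavy is a stand-in for HeavyComputationThatIsThreadSafe; joblib preserves input order, so the flat result list can simply be reshaped):
from joblib import Parallel, delayed
from itertools import product
import numpy as np

def heavy(i, j):
    # stand-in for HeavyComputationThatIsThreadSafe(i, j)
    return float(i * j)

N = 100
flat = Parallel(n_jobs=-1)(delayed(heavy)(i, j) for i, j in product(range(N), range(N)))
output = np.array(flat).reshape(N, N)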
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool

def heavy_func(key):
    # do some heavy computation on each key
    output = key**2
    return key, output

output_data = {}  # <-- this dict will store the results
keys = [1,5,7,8,10]  # <-- compute heavy_func over all the values of keys

with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]
Now output_data is a dictionary that contains, for every key, the result of the computation on that key.
That is it.
