I have an understanding problem/question concerning the multiprocessing library of Python:
Why do different processes that are started (almost) simultaneously at least seem to execute serially instead of in parallel?
The task is to control a universe of a large number of particles (a particle being a set of x/y/z coordinates and a mass) and perform various analyses on them while taking advantage of a multi-processor environment. Specifically, in the example shown below I want to calculate the center of mass of all particles.
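(For reference, the center of mass is just the mass-weighted average position, R = (Σ mᵢ·rᵢ) / (Σ mᵢ), accumulated component-wise for x, y and z.)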
Because the task specifically says to use multiple processors, I didn't use the threading library, since the GIL (Global Interpreter Lock) restricts execution to one processor at a time.
Here's my code:
from multiprocessing import Process, Lock, Array, Value
from random import random
import math
from time import time
def exercise2(noOfParticles, noOfProcs):
    startingTime = time()
    particles = []
    processes = []
    centerCoords = Array('d', [0, 0, 0])
    totalMass = Value('d', 0)
    lock = Lock()

    # create all particles
    for i in range(noOfParticles):
        p = Particle()
        particles.append(p)

    for i in range(noOfProcs):
        # determine the number of particles every process needs to analyse
        particlesPerProcess = math.ceil(noOfParticles / noOfProcs)
        # create noOfProcs Processes, each with a different set of particles
        p = Process(target=processBatch, args=(
            particles[i*particlesPerProcess:(i+1)*particlesPerProcess],
            centerCoords,       # handle to shared memory
            totalMass,          # handle to shared memory
            lock,               # handle to lock
            'batch' + str(i)),  # also pass name of process for easier logging
            name='batch' + str(i))
        processes.append(p)
        print('created proc:', i)

    # start all processes
    for p in processes:
        p.start()  # here, the program waits for the started process to terminate. why?

    # wait for all processes to finish
    for p in processes:
        p.join()

    # normalize the coordinates
    centerCoords[0] /= totalMass.value
    centerCoords[1] /= totalMass.value
    centerCoords[2] /= totalMass.value

    print(centerCoords[:])
    print('total time used', time() - startingTime, ' seconds')
class Particle():
    """a particle is a very simple physical object, having a set of x/y/z coordinates and a mass.
    All values are randomly set at initialization of the object"""

    def __init__(self):
        self.x = random() * 1000
        self.y = random() * 1000
        self.z = random() * 1000
        self.m = random() * 10

    def printProperties(self):
        attrs = vars(self)
        print('\n'.join("%s: %s" % item for item in attrs.items()))
def processBatch(particles, centerCoords, totalMass, lock, name):
    """calculates the mass-weighted sum of all coordinates of all particles as well as the sum of all masses.
    Writes the results into the shared memory centerCoords and totalMass, using lock"""
    print(name, ' started')
    mass = 0
    centerX = 0
    centerY = 0
    centerZ = 0
    for p in particles:
        centerX += p.m * p.x
        centerY += p.m * p.y
        centerZ += p.m * p.z
        mass += p.m
    with lock:
        centerCoords[0] += centerX
        centerCoords[1] += centerY
        centerCoords[2] += centerZ
        totalMass.value += mass
    print(name, ' ended')
if __name__ == '__main__':
    exercise2(2**16, 6)
Now I'd expect all processes to start at about the same time and execute in parallel. But when I look at the output of the program, it looks as if the processes were executing serially:
created proc: 0
created proc: 1
created proc: 2
created proc: 3
created proc: 4
created proc: 5
batch0 started
batch0 ended
batch1 started
batch1 ended
batch2 started
batch2 ended
batch3 started
batch3 ended
batch4 started
batch4 ended
batch5 started
batch5 ended
[499.72234074100135, 497.26586187539453, 498.9208784328791]
total time used 4.7220001220703125 seconds
Also, when stepping through the program using the Eclipse debugger, I can see how the program always waits for one process to terminate before starting the next one, at the line marked with the comment ending in 'why?'. Of course this might just be the debugger, but the output produced in a normal run shows exactly the same picture.
Are those processes actually executing in parallel and I just can't see it due to some problem with sharing stdout?
If the processes are executing serially: why? And how can I make them run in parallel?
Any help on understanding this is greatly appreciated.
I executed the above code from PyDev and from the command line using Python 3.2.3 on a Windows 7 machine with a dual-core Intel processor.
Edit:
Due to the output of the program I misunderstood the problem: The processes are actually running in parallel, but the overhead of pickling large amounts of data and sending it to the subprocesses takes so long that it completely distorts the picture.
Moving the creation of the particles (i.e. the data) to the subprocesses so that they don't have to be pickled in the first place removed all problems and resulted in a useful, parallel execution of the program.
To solve the task, I will therefore have to keep the particles in shared memory so they don't have to be passed to subprocesses.
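For anyone curious, here is a minimal sketch of that change, under my own naming (createAndProcessBatch is not part of the original code): each subprocess generates its own share of the particles, so only a particle count and the shared handles cross the process boundary.

from multiprocessing import Process, Lock, Array, Value
from random import random

def createAndProcessBatch(noOfParticles, centerCoords, totalMass, lock):
    # create the particles locally instead of receiving them via pickling
    mass = 0
    centerX = centerY = centerZ = 0
    for _ in range(noOfParticles):
        x, y, z, m = random() * 1000, random() * 1000, random() * 1000, random() * 10
        centerX += m * x
        centerY += m * y
        centerZ += m * z
        mass += m
    with lock:
        centerCoords[0] += centerX
        centerCoords[1] += centerY
        centerCoords[2] += centerZ
        totalMass.value += mass

if __name__ == '__main__':
    centerCoords = Array('d', [0, 0, 0])
    totalMass = Value('d', 0)
    lock = Lock()
    processes = [Process(target=createAndProcessBatch,
                         args=(2**16 // 6, centerCoords, totalMass, lock))
                 for _ in range(6)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print([c / totalMass.value for c in centerCoords])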
I ran your code on my system (Python 2.6.5) and it returned almost instantly with results, which makes me think that perhaps your task size is so small that the processes finish before the next can begin (note that starting a process is slower than spinning up a thread). I question the total time used 4.7220001220703125 seconds in your results, because that's about 40x longer than it took my system to run the same code. I scaled up the number of particles to 2**20, and I got the following results:
('created proc:', 0)
('created proc:', 1)
('created proc:', 2)
('created proc:', 3)
('created proc:', 4)
('created proc:', 5)
('batch0', ' started')
('batch1', ' started')
('batch2', ' started')
('batch3', ' started')
('batch4', ' started')
('batch5', ' started')
('batch0', ' ended')
('batch1', ' ended')
('batch2', ' ended')
('batch3', ' ended')
('batch5', ' ended')
('batch4', ' ended')
[500.12090773656854, 499.92759577086059, 499.97075039983588]
('total time used', 5.1031057834625244, ' seconds')
That's more in line with what I would expect. What do you get if you increase the task size?
Related
I have a program I want to split into 10 parts with multiprocessing. Each worker will be searching for the same answer using different variables to look for it (in this case it's brute-forcing a password). How do I get the processes to communicate their status, and how do I terminate all processes once one process has found the answer? Thank you!
If you are going to split it into 10 parts then you should either have 10 cores, or your worker function should not be 100% CPU-bound.
The following code initializes each pool process with a multiprocessing.Queue instance to which the worker function will write its result. The main process waits for the first entry written to the queue and then terminates all pool processes. For this demo, the worker function is passed arguments 1, 2, 3, ... 10, sleeps for that amount of time and puts the argument it was passed on the queue. So we would expect the worker that was passed the value 1 to complete first, and the total running time of the program to be slightly more than 1 second (it takes some time to create the 10 processes):
import multiprocessing
import time
def init_pool(q):
    global queue
    queue = q

def worker(x):
    time.sleep(x)
    # write result to queue
    queue.put_nowait(x)

def main():
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(10, initializer=init_pool, initargs=(queue,))
    for i in range(1, 11):
        # non-blocking:
        pool.apply_async(worker, args=(i,))
    # wait for first result
    result = queue.get()
    pool.terminate()  # kill all tasks
    print('Result: ', result)

# required for Windows:
if __name__ == '__main__':
    t = time.time()
    main()
    print('total time =', time.time() - t)
Prints:
Result: 1
total time = 1.2548246383666992
I am playing around with multiprocessing in Python 3 to try and understand how it works and when it's good to use it.
I am basing my examples on this question, which is really old (2012).
My computer is a Windows machine with 4 physical cores and 8 logical cores.
First: not segmented data
First I try to brute-force compute numpy.sin for a million values. The million values are a single chunk, not segmented.
import time
import numpy
from multiprocessing import Pool
# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
def numpy_sin(value):
    return numpy.sin(value)

a = numpy.arange(1000000)

if __name__ == '__main__':
    pool = Pool(processes=8)

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
And I find that, no matter the number of processes, the 'multithreaded' version always takes about 10 times as long as the 'single threaded' one. In the Task Manager I see that not all the CPUs are maxed out, and the total CPU usage stays between 18% and 31%.
So I try something else.
Second: segmented data
I try to split up the original 1 million computations in 10 batches of 100,000 each. Then I try again for 10 million computations in 10 batches of 1 million each.
import time
import numpy
from multiprocessing import Pool
# so that iPython works
__spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"
def numpy_sin(value):
    return numpy.sin(value)

p = 3
s = 1000000
a = [numpy.arange(s) for _ in range(10)]

if __name__ == '__main__':
    print('processes = {}'.format(p))
    print('size = {}'.format(s))

    start = time.time()
    result = numpy.sin(a)
    end = time.time()
    print('Singled threaded {}'.format(end - start))

    pool = Pool(processes=p)
    start = time.time()
    result = pool.map(numpy_sin, a)
    pool.close()
    pool.join()
    end = time.time()
    print('Multithreaded {}'.format(end - start))
I ran this last piece of code for different numbers of processes p and different sizes s, 100,000 and 1,000,000.
At least now the Task Manager shows the CPU maxed out at 100% usage.
I get the following results for the elapsed times (ORANGE: multiprocess, BLUE: single):
So multiprocessing never wins over the single process.
Why??
Numpy changes the CPU affinity of the parent process so that it only runs on one core. You can call os.system("taskset -p 0xff %d" % os.getpid()) after you import numpy to reset the CPU affinity so that all cores are used.
See this question for more details
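For clarity, here is a minimal sketch of where that call would go (this only applies on Linux, where the taskset command exists; the 0xff mask covers the first 8 logical cores):

import os
import numpy  # importing numpy (with some BLAS builds) may reset the CPU affinity

# restore affinity to all cores so that worker processes are not pinned to one core
os.system("taskset -p 0xff %d" % os.getpid())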
A computer can really only do one thing at a time. When multi-threading or multi-processing, the computer is really only switching back and forth between tasks quickly. With the provided problem, the computer could either perform the calculation 1,000,000 times, or split the work up among 10 "workers" and perform 100,000 calculations in each.
Multi-processing shines not when computing something straight out, as the computer has to take time to create multiple processes, but while waiting for something. The main example I've heard is web scraping. If a program requested data from a list of websites and waited for each server to send data before requesting data from the next, it would have to sit idle for a couple of seconds each time. If instead the computer used multiprocessing/threading to make all the requests first and wait concurrently, the total running time is much shorter.
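A minimal sketch of that idea, assuming a list of placeholder URLs and using a process pool purely for illustration (threads would work just as well for this kind of I/O-bound waiting):

from multiprocessing import Pool
from urllib.request import urlopen
import time

URLS = ['https://example.com'] * 10  # placeholder URLs

def fetch(url):
    # each worker blocks on network I/O independently of the others
    with urlopen(url) as response:
        return len(response.read())

if __name__ == '__main__':
    start = time.time()
    with Pool(10) as pool:
        sizes = pool.map(fetch, URLS)
    print(sizes, 'fetched in', time.time() - start, 'seconds')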
I'm trying to learn how to implement multiprocessing for computing Monte Carlo simulations. I reproduced the code from this simple tutorial where the aim is to compute an integral. I also compare it to the answer from WolframAlpha and compute the error. The first part of my code has no problems and is just there to define the integral function and declare some constants:
import numpy as np
import multiprocessing as mp
import time
def integrate(iterations):
    np.random.seed()
    mc_sum = 0
    chunks = 10000
    chunk_size = int(iterations / chunks)
    for i in range(chunks):
        u = np.random.uniform(size=chunk_size)
        mc_sum += np.sum(np.exp(-u * u))
    normed = mc_sum / iterations
    return normed
wolfram_answer = 0.746824132812427
mc_iterations = 1000000000
But there's some very spooky stuff that happens in the next two parts (I've labelled them because it's important). First (labelled "BLOCK 1"), I do the simulation without any multiprocessing at all, just to get a benchmark. After this (labelled "BLOCK 2"), I do the same thing but with a multiprocessing step. If you're reproducing this, you may want to adjust the num_procs variable depending on how many cores your machine has:
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
The output is:
1000000000 iterations on single-thread: 37.448 seconds.
Estimation error: 1.17978774235e-05
8 threads with 125000000 iterations each: 54.697 seconds.
Estimation error: -5.88380936901e-06
So, the multiprocessing is slower. That's not at all unheard of; maybe the overhead from the multiprocessing is just more than the gains from the parallelization?
But, that is not what is happening. Watch what happens when I merely comment out the first block:
#### BLOCK 1
##single_before = time.time()
##single = integrate(mc_iterations)
##single_after = time.time()
##single_duration = np.round(single_after - single_before, 3)
##error_single = (wolfram_answer - single)/wolfram_answer
##
##print(mc_iterations, "iterations on single-thread:",
## single_duration, "seconds.")
##print("Estimation error:", error_single)
##print("")
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
The output is:
8 threads with 125000000 iterations each: 6.662 seconds.
Estimation error: 3.86063069069e-06
That's right -- the time to complete the multiprocessing goes down from 55 seconds to less than 7 seconds! And that's not even the weirdest part. Watch what happens when I move Block 1 to be after Block 2:
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
The output is:
8 threads with 125000000 iterations each: 54.938 seconds.
Estimation error: 7.42415402896e-06
1000000000 iterations on single-thread: 37.396 seconds.
Estimation error: 9.79800494235e-06
We're back to the slow output again, which is completely crazy! Isn't Python supposed to be interpreted? I know that statement comes with a hundred caveats, but I took for granted that the code gets executed line-by-line, so stuff that comes afterwards (outside of functions, classes, etc) can't affect the stuff from before, because it hasn't been "looked at" yet.
So, how can the stuff that gets executed after the multiprocessing step has concluded, retroactively slow down the multiprocessing code?
Finally, the fast behavior is restored merely by indenting Block 1 to be inside the if __name__ == "__main__" block, because of course it does:
#### BLOCK 2
if __name__ == "__main__":
num_procs = 8
multi_iterations = int(mc_iterations / num_procs)
multi_before = time.time()
pool = mp.Pool(processes = num_procs)
multi_result = pool.map(integrate, [multi_iterations]*num_procs)
multi_result = np.array(multi_result).mean()
multi_after = time.time()
multi_duration = np.round(multi_after - multi_before, 3)
error_multi = (wolfram_answer - multi_result)/wolfram_answer
print(num_procs, "threads with", multi_iterations, "iterations each:",
multi_duration, "seconds.")
print("Estimation error:", error_multi)
#### BLOCK 1
single_before = time.time()
single = integrate(mc_iterations)
single_after = time.time()
single_duration = np.round(single_after - single_before, 3)
error_single = (wolfram_answer - single)/wolfram_answer
print(mc_iterations, "iterations on single-thread:",
single_duration, "seconds.")
print("Estimation error:", error_single)
print("")
The output is:
8 threads with 125000000 iterations each: 7.293 seconds.
Estimation error: 1.10350027622e-05
1000000000 iterations on single-thread: 31.035 seconds.
Estimation error: 2.53582945763e-05
And the fast behavior is also restored if you keep Block 1 inside the if block, but move it to above where num_procs is defined (not shown here because this question is already getting long).
So, what on Earth is causing this behavior? I'm guessing it's some kind of race-condition to do with threading and process branching, but from my level of expertise it might as well be that my Python interpreter is haunted.
This is because you are using Windows. On Windows, each subprocess is generated using the 'spawn' method which essentially starts a new python interpreter and imports your module instead of forking the process.
This is a problem, because all the code outside if __name__ == '__main__' is executed again. This can lead to a multiprocessing bomb if you put the multiprocessing code at the top-level, because it will start spawning processes until you run out of memory.
This is actually warned about in the docs
Safe importing of main module
Make sure that the main module can be safely imported by a new Python
interpreter without causing unintended side effects (such as starting a
new process).
...
Instead one should protect the “entry point” of the program by using
if __name__ == '__main__'
...
This allows the newly spawned Python interpreter to safely import the
module...
That section used to be called "Windows" in the older docs on Python 2.
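As a minimal, purely illustrative sketch of the pattern the docs describe: everything that should run only once stays under the guard, so a spawned child can re-import the module without side effects.

import multiprocessing as mp

def work(x):
    # module-level definitions are fine: children need them after importing the module
    return x * x

if __name__ == '__main__':
    # this block runs only in the original process, never during a child's re-import
    with mp.Pool(4) as pool:
        print(pool.map(work, range(8)))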
Adding some detail, on Windows the module is imported "from scratch" in each worker process. That means everything in the module is executed by each worker. So, in your first example, each worker process first executes "BLOCK 1".
But your output doesn't reflect that. You should have gotten a line of output like
1000000000 iterations on single-thread: 37.448 seconds.
from each of your 8 worker processes. But your output doesn't show that. Perhaps you're using an IDE that suppresses output from spawned processes? If you run it in a "DOS box" (cmd.exe window) instead, that won't suppress output, and can make what's going on clearer.
I have written a simple piece of code using the multiprocessing library to create an extra process apart from the main code (2 processes in total). I wrote this code on W7 Professional x64 through Anaconda/Spyder v3.2.4 and it works almost as I want, except that when I run it, the memory consumption of my second process (not the main one) keeps increasing until it reaches the total capacity and the computer gets stuck and freezes (you can see this in the Windows Task Manager).
"""
Example to print data from a function using multiprocessing library
Created on Thu Jan 30 12:07:49 2018
author: Kevin Machado Gamboa
Contact: ing.kevin#hotmail.com
"""
from time import time
import numpy as np
from multiprocessing import Process, Queue, Event
t0=time()
def ppg_parameters(hr, minR, ampR, minIR, ampIR, t):
    HR = float(hr)
    f = HR * (1/60)
    # Spo2 Red signal function
    sR = minR + ampR * (0.05*np.sin(2*np.pi*t*3*f)
                        + 0.4*np.sin(2*np.pi*t*f) + 0.25*np.sin(2*np.pi*t*2*f+45))
    # Spo2 InfraRed signal function
    sIR = minIR + ampIR * (0.05*np.sin(2*np.pi*t*3*f)
                           + 0.4*np.sin(2*np.pi*t*f) + 0.25*np.sin(2*np.pi*t*2*f+45))
    return sR, sIR

def loop(q):
    """
    generates the values of the function ppg_parameters
    """
    hr = 60
    ampR = 1.0814  # amplitude for Red signal
    minR = 0.0     # displacement from zero for Red signal
    ampIR = 1.12   # amplitude for InfraRed signal
    minIR = 0.7    # displacement from zero for InfraRed signal
    # infinite loop to generate the signal
    while True:
        t = time() - t0
        y = ppg_parameters(hr, minR, ampR, minIR, ampIR, t)
        q.put([t, y[0], y[1]])

if __name__ == "__main__":
    _exit = Event()
    q = Queue()
    p = Process(target=loop, args=(q,))
    p.start()
    # starts the main process
    while q.qsize() != 1:
        try:
            data = q.get(True, 2)  # takes each data item from the queue
            print(data[0], data[1], data[2])
        except KeyboardInterrupt:
            p.terminate()
            p.join()
            print('supposed to stop')
            break
Why is this happening? Perhaps it is the while loop in my second process? I don't know; I haven't seen this issue anywhere.
Moreover, if I run the same code on my Rpi 3 Model B, at some point it raises an error saying the queue is empty, as if the main process were running faster than process two.
Any guess as to why this is happening, or any suggestion or link, would be helpful.
Thanks
It looks like inside your infinite loop you are adding to the queue and I'm guessing that you are adding data faster than it can be taken off of the queue by the other process.
You could check the queue size periodically from inside the infinite loop and if it is over a certain amount (say 500 items), then you could sleep for a few seconds and then check again.
https://docs.python.org/2/library/queue.html#Queue.Queue.qsize
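A minimal sketch of that throttling idea, applied to the loop function from the question (the 500-item threshold and the 2-second pause are arbitrary; ppg_parameters and t0 are reused from the question's code):

from time import time, sleep

def loop(q):
    """same signal generator as in the question, but throttled by queue size"""
    hr = 60
    ampR, minR = 1.0814, 0.0
    ampIR, minIR = 1.12, 0.7
    while True:
        if q.qsize() > 500:  # the consumer is falling behind
            sleep(2)         # give it time to catch up before producing more
            continue
        t = time() - t0
        y = ppg_parameters(hr, minR, ampR, minIR, ampIR, t)
        q.put([t, y[0], y[1]])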
I've found that numpy.fft.fft (and its variants) is very slow when run in the background. Here is an example of what I'm talking about:
import numpy as np
import multiprocessing as mproc
import time
import sys
# the producer function, which will run in the background and produce data
def Producer(dataQ):
    numFrames = 5
    n = 0
    while n < numFrames:
        data = np.random.rand(3000, 200)
        dataQ.put(data)    # send the data to the consumer
        time.sleep(0.1)    # sleep for 0.1 second, so we don't overload the CPU
        n += 1

# the consumer function, which will run in the background and consume data from the producer
def Consumer(dataQ):
    while True:
        data = dataQ.get()
        t1 = time.time()
        fftdata = np.fft.rfft(data, n=3000*5)
        tDiff = time.time() - t1
        print("Elapsed time is %0.3f" % tDiff)
        time.sleep(0.01)
        sys.stdout.flush()

# the if __name__ == '__main__' guard is necessary so that this code only runs
# when the program is started by the user, not when the module is re-imported by a child process
if __name__ == '__main__':
    data = np.random.rand(3000, 200)
    t1 = time.time()
    fftdata = np.fft.rfft(data, n=3000*5, axis=0)
    tDiff = time.time() - t1
    print("Elapsed time is %0.3f" % tDiff)

    # generate a queue for transferring data between the producer and the consumer
    dataQ = mproc.Queue(4)

    # start up the processes
    producerProcess = mproc.Process(target=Producer, args=[dataQ], daemon=False)
    consumerProcess = mproc.Process(target=Consumer, args=[dataQ], daemon=False)

    print("starting up processes")
    producerProcess.start()
    consumerProcess.start()

    time.sleep(10)  # let the program run for 10 seconds

    producerProcess.terminate()
    consumerProcess.terminate()
The output it produces on my machine:
Elapsed time is 0.079
starting up processes
Elapsed time is 0.859
Elapsed time is 0.861
Elapsed time is 0.878
Elapsed time is 0.863
Elapsed time is 0.758
As you can see, it is roughly 10x slower when run in the background, and I can't figure out why this would be the case. The time.sleep() calls should ensure that the other processes (the main process and the producer process) aren't doing anything when the FFT is being computed, so it should use all the cores. I've checked CPU utilization through the Windows Task Manager and it seems to use about 25% when numpy.fft.fft is called heavily, in both the single-process and multiprocess cases.
Anyone have an idea whats going on?
The main problem is that your fft call in the background process is:
fftdata = np.fft.rfft(data, n=3000*5)
rather than:
fftdata = np.fft.rfft(data, n=3000*5, axis=0)
which for me made all the difference.
There are a few other things worth noting. Rather than having the time.sleep() calls everywhere, why not just let the processor take care of this itself? Furthermore, rather than suspending the main process for a fixed time, you can use
consumerProcess.join()
and then have the producer process run dataQ.put(None) once it has finished loading the data, and break out of the loop in the consumer process, i.e.:
def Consumer(dataQ):
    while True:
        data = dataQ.get()
        if data is None:
            break
        ...
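For completeness, a sketch of the matching producer side and main block under that scheme (reusing np, mproc and Consumer from the code above; the None sentinel and the join calls replace the fixed time.sleep(10)):

def Producer(dataQ):
    for _ in range(5):
        dataQ.put(np.random.rand(3000, 200))
    dataQ.put(None)  # sentinel: tells the consumer there is no more data

if __name__ == '__main__':
    dataQ = mproc.Queue(4)
    producerProcess = mproc.Process(target=Producer, args=[dataQ])
    consumerProcess = mproc.Process(target=Consumer, args=[dataQ])
    producerProcess.start()
    consumerProcess.start()
    producerProcess.join()   # wait for the work to finish instead of sleeping
    consumerProcess.join()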