Python, Multiprocessing: what does process.join() do?

import time
from multiprocessing import Process

def loop(limit):
    for i in xrange(limit):
        pass
    print i

limit = 100000000  # 100 million
start = time.time()

for i in xrange(5):
    p = Process(target=loop, args=(limit,))
    p.start()
p.join()

end = time.time()
print end - start
I tried running this code; this is the output I get:
99999999
99999999
2.73401999474
99999999
99999999
99999999
and sometimes
99999999
99999999
3.72434902191
99999999
99999999
99999999
99999999
99999999
In this case the loop function is called 7 times instead of 5. Why does this strange behaviour happen?
I am also confused about the role of the p.join() statement. Is it ending any one process or all of them at the same time?

The join call here will only wait for the last process you started to finish before moving on to the next section of code. If you walk through what you have done, you should see why you get the "strange" output.
for i in xrange(5):
    p = Process(target=loop, args=(limit,))
    p.start()
This starts 5 new processes one after the other, and they are all running at (roughly) the same time; it is down to the scheduler to decide which process gets CPU time at any given moment.
This means you have 5 processes running now:
Process 1
Process 2
Process 3
Process 4
Process 5
p.join()
This is going to wait only for p to finish, and p is Process 5, since that was the last process assigned to p.
Let's now say that Process 2 finishes first, followed by Process 5, which is perfectly feasible as the scheduler could give those processes more time on the CPU.
Process 1
Process 2 prints 99999999
Process 3
Process 4
Process 5 prints 99999999
The p.join() line now returns and the script moves on to the next part, because p (Process 5) has finished.
end = time.time()
print end - start
This section prints its output, and at this point there are still 3 processes running.
The other processes then finish and print their 99999999.
To fix this behaviour you will need to .join() all the processes. To do this you could alter your code to this...
processes = []
for i in xrange(5):
    p = Process(target=loop, args=(limit,))
    p.start()
    processes.append(p)

for process in processes:
    process.join()
This will wait for the first process, then the second, and so on. It won't matter if one process finishes before another, because every process in the list must be waited on before the script continues.
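For contrast, if you called join() inside the start loop, each join would block before the next process was even created, so the five loops would run one after another instead of overlapping. A minimal sketch of that (not what you want here, just to show the difference):

for i in xrange(5):
    p = Process(target=loop, args=(limit,))
    p.start()
    p.join()  # blocks until this process finishes, so the work is fully serialized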

There are some problems with the way you are doing things; try this:
start = time.time()
procs = []
for i in xrange(5):
    p = Process(target=loop, args=(limit,))
    p.start()
    procs.append(p)

[p.join() for p in procs]
The problem is that you are not keeping track of the individual processes (the p variable inside the loop). You need to keep them around so you can interact with them. This version keeps them in a list and then joins all of them at the end.
Output looks like this:
99999999
99999999
99999999
99999999
99999999
6.29328012466
Note that now the time it took to run is also printed at the end of the execution.
Also, I ran your code and was not able to reproduce the loop function being called more than 5 times.

Related

Multiprocessing queue termination

I have a program I want to split into 10 parts with multiprocessing. Each worker will be searching for the same answer using different variables to look for it (in this case it's brute forcing a password). How do I get the processes to communicate their status, and how do I terminate all processes once one process has found the answer? Thank you!
If you are going to split it into 10 parts then you should either have 10 cores, or at least your worker function should not be 100% CPU-bound.
The following code initializes each pool process with a multiprocessing.Queue instance to which the worker function will write its result. The main process waits for the first entry written to the queue and then terminates all pool processes. For this demo the worker function is passed the arguments 1, 2, 3, ... 10, sleeps for that amount of time, and returns the argument it was passed. So we would expect the worker that was passed the value 1 to complete first, and the total running time of the program to be slightly more than 1 second (it takes some time to create the 10 processes):
import multiprocessing
import time

def init_pool(q):
    global queue
    queue = q

def worker(x):
    time.sleep(x)
    # write result to queue
    queue.put_nowait(x)

def main():
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(10, initializer=init_pool, initargs=(queue,))
    for i in range(1, 11):
        # non-blocking:
        pool.apply_async(worker, args=(i,))
    # wait for first result
    result = queue.get()
    pool.terminate()  # kill all tasks
    print('Result: ', result)

# required for Windows:
if __name__ == '__main__':
    t = time.time()
    main()
    print('total time =', time.time() - t)
Prints:
Result: 1
total time = 1.2548246383666992
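An alternative sketch with the same idea but no explicit queue: Pool.imap_unordered hands results back in completion order, so the main process can react to the first one and terminate the rest. This assumes a worker that simply returns its result rather than writing it to a queue:

import multiprocessing
import time

def worker(x):
    time.sleep(x)  # stand-in for the real search work
    return x

if __name__ == '__main__':
    t = time.time()
    pool = multiprocessing.Pool(10)
    # next() blocks only until the first worker returns a result
    first = next(pool.imap_unordered(worker, range(1, 11)))
    pool.terminate()  # stop the remaining workers
    print('Result:', first)
    print('total time =', time.time() - t)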

Using Python's multiprocessing.Pool(), am I really doing multi-processing?

I am currently studying the multiprocessing package. Here is some simple code I tried with multiprocessing.Process and multiprocessing.Pool.
import random
import multiprocessing
import time

def list_append(count, id, out_list):
    """
    Creates an empty list and then appends a
    random number to the list 'count' number
    of times. A CPU-heavy operation!
    """
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000  # Number of random numbers to add
    procs = 8  # Number of processes to create

    # Create a list of jobs and then iterate through
    # the number of processes appending each process to
    # the job list
    print('number of CPU: ', multiprocessing.cpu_count())
    starting = time.time()
    jobs = []
    for i in range(procs):
        out_list = list()
        process = multiprocessing.Process(target=list_append,
                                          args=(size, i, out_list))
        jobs.append(process)

    # Start the processes (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
    print("jobs one done in {}".format(time.time() - starting))

    starting = time.time()
    for i in range(procs):
        p = multiprocessing.Pool(8)
        p.starmap(list_append, [(size, i, list())])
    print('jobs two done in {}'.format(time.time() - starting))
My laptop has 12 CPU cores, so I expected job one and job two to finish in a similar time. However, job one finishes in 3 seconds, but job two finishes in 12 seconds. It looks to me like multiprocessing.Pool() does not actually do multiprocessing... Is there something I did wrong?
In your jobs two, you are not using multiprocessing. starmap() distributes the specified function (list_append) across the argument tuples provided in the second argument, but you only provide a list with one element, so each iteration of your for loop executes a single task. I think you meant to do:
p = multiprocessing.Pool(8)
p.starmap(list_append, [(size, i, list()) for i in range(procs)])
without the containing for loop.
Note also that starmap waits for its results, so inside the for loop it waits for each single task to finish before starting the next.
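Putting that together, the second timing block could look like the sketch below (reusing size, procs and list_append from the question). The single starmap call fans the eight argument tuples out across the pool and returns once they are all done:

starting = time.time()
pool = multiprocessing.Pool(8)
# one call, eight argument tuples -> eight tasks running in parallel
pool.starmap(list_append, [(size, i, list()) for i in range(procs)])
pool.close()
pool.join()
print('jobs two done in {}'.format(time.time() - starting))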

How to use Pool.join() in Python multiprocessing module?

Here is the code:
import time
from multiprocessing import Pool

def function(index):
    print('start process ' + str(index))
    time.sleep(1)
    print('end process ' + str(index))
    return str(index)

if __name__ == '__main__':
    pool = Pool(processes=3)
    for i in range(4):
        res = pool.apply_async(function, args=(i,))
        print(res.get())
    pool.close()
    print('done')
and the output:
start process 0
end process 0
0
start process 1
end process 1
1
start process 2
end process 2
2
start process 3
end process 3
3
done
In my opinion, if I don't use pool.join(), the code should only print 'done' and that's it, because the purpose of pool.join() is to 'wait for the worker processes to exit', but even without pool.join() I get the same result.
I really don't understand.
In your code, the call to get() has a similar effect to join(): it also waits for the task to finish, because that is the only way it can give you the result.
If you remove it from your code, you will see the 'done' being printed first:
done
start process 0
res.get waits for the process to finish (how else would it get the return value?) which means that process 0 must finish before process 1 can start, and so on.
Remove res.get and you won't see the processes finish. Move res.get to a separate loop after the first one and you'll see they all start before any of them finish.
Also check out Pool.map.
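A minimal sketch of that second suggestion: collect the AsyncResult objects while submitting, and only call get() once all tasks have been handed to the pool, so the three workers can overlap:

import time
from multiprocessing import Pool

def function(index):
    print('start process ' + str(index))
    time.sleep(1)
    print('end process ' + str(index))
    return str(index)

if __name__ == '__main__':
    pool = Pool(processes=3)
    # submit everything first; nothing blocks here
    results = [pool.apply_async(function, args=(i,)) for i in range(4)]
    pool.close()
    pool.join()  # wait for all workers to finish
    print([res.get() for res in results])
    print('done')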

Run parallel Stata do files in python using multiprocess and subprocess

I have a Stata do file, pyexample3.do, which uses its argument as a regressor to run a regression. The F-statistic from the regression is saved in a text file. The code is as follows:
clear all
set more off
local y `1'
display `"first parameter: `y'"'
sysuse auto
regress price `y'
local f=e(F)
display "`f'"
file open myhandle using test_result.txt, write append
file write myhandle "`f'" _n
file close myhandle
exit, STATA clear
Now I am trying to run the Stata do file in parallel from Python and write all the F-statistics to one text file. My CPU has 4 cores.
import multiprocessing
import subprocess

def work(staname):
    dofile = "pyexample3.do"
    cmd = ["StataMP-64.exe", "/e", "do", dofile, staname]
    return subprocess.call(cmd, shell=False)

if __name__ == '__main__':
    my_list = ["mpg", "rep78", "headroom", "trunk", "weight",
               "length", "turn", "displacement", "gear_ratio"]
    my_list.sort()
    print my_list

    # Get the number of processors available
    num_processes = multiprocessing.cpu_count()
    threads = []
    len_stas = len(my_list)
    print "+++ Number of stations to process: %s" % (len_stas)

    # run until all the threads are done, and there is no data left
    for list_item in my_list:
        # if we aren't using all the processors AND there is still data left to
        # compute, then spawn another thread
        if len(threads) < num_processes:
            p = multiprocessing.Process(target=work, args=[list_item])
            p.start()
            print p, p.is_alive()
            threads.append(p)
        else:
            for thread in threads:
                if not thread.is_alive():
                    threads.remove(thread)
Although the do file is supposed to run 9 times, as there are 9 strings in my_list, it was only run 4 times. So where did it go wrong?
In your for list_item in my_list loop, after the first 4 processes have been started, each remaining iteration goes into the else branch:
for thread in threads:
    if not thread.is_alive():
        threads.remove(thread)
Since thread.is_alive() doesn't block, this loop is executed immediately, before any of those 4 processes have finished their task, so nothing is removed, no new process is started, and the remaining list items are simply passed over. Therefore only the first 4 processes get executed in total.
You could simply use a while loop to keep checking the process status at a small interval:
keep_checking = True
while keep_checking:
    for thread in threads:
        if not thread.is_alive():
            threads.remove(thread)
            keep_checking = False
    time.sleep(0.5)  # wait 0.5 s
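Alternatively, the whole spawn-and-track loop can be replaced by a multiprocessing.Pool, which queues all 9 items and keeps one worker per core busy until everything has run. A rough sketch reusing the work function from the question (the pool must be created under the if __name__ == '__main__' guard, which matters on Windows, where StataMP-64.exe suggests you are running):

import multiprocessing

if __name__ == '__main__':
    my_list = ["mpg", "rep78", "headroom", "trunk", "weight",
               "length", "turn", "displacement", "gear_ratio"]
    # one worker per core; the pool hands out the 9 items as workers free up
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    return_codes = pool.map(work, my_list)  # blocks until all do files have run
    pool.close()
    pool.join()
    print return_codes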

Python Thread Pool - process never ends

Below is the test code - I'm playing around with the Thread Pool found in the standard library. The problem is that the final process never ends. It just hangs.
I should mention what I'm after here: func, depending on its input, can take from a few seconds to a few minutes, and I want the results as soon as possible, in order of whatever finishes first. Ideally the number of func calls I execute will be around four or five at the same time.
>>> from multiprocessing.pool import ThreadPool
>>>
>>> def func(i):
...     import time
...     if i % 2 == 0: time.sleep(5)
...     print i
...
>>> t = ThreadPool(5)
>>>
>>> for i in range(10):
...     z = t.Process(target=func, args=(i,))
...     z.start()
...
1
3
5
7
9
>>> 0
2
4
6
8
In other words, after printing "8" the code just waits here until I force a KeyboardInterrupt. I've tried setting the process as a daemon but no luck. Any advice/better documentation?
From the documentation
Worker processes within a Pool typically live for the complete duration of the Pool’s work queue.
You should probably use it for tasks like this:
t.imap(func,xrange(10))
t.close()
t.join()
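If you want the results back in whatever order the tasks finish, as described in the question, imap_unordered may be a better fit than imap. A minimal sketch, assuming func returns its value instead of printing it:

from multiprocessing.pool import ThreadPool
import time

def func(i):
    if i % 2 == 0:
        time.sleep(5)  # the even inputs simulate the slow jobs
    return i

t = ThreadPool(5)
# results are yielded as soon as each task finishes,
# so the fast (odd) jobs come back before the slow ones
for result in t.imap_unordered(func, xrange(10)):
    print result
t.close()
t.join()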
