Purpose of multiprocessing.Pool.apply and multiprocessing.Pool.apply_async

Purpose of multiprocessing.Pool.apply and multiprocessing.Pool.apply_async - python

See example and execution result below:
#!/usr/bin/env python3.4
from multiprocessing import Pool
import time
import os
def initializer():
print("In initializer pid is {} ppid is {}".format(os.getpid(),os.getppid()))
def f(x):
print("In f pid is {} ppid is {}".format(os.getpid(),os.getppid()))
return x*x
if __name__ == '__main__':
print("In main pid is {} ppid is {}".format(os.getpid(), os.getppid()))
with Pool(processes=4, initializer=initializer) as pool: # start 4 worker processes
result = pool.apply(f, (10,)) # evaluate "f(10)" in a single process
print(result)
#result = pool.apply_async(f, (10,)) # evaluate "f(10)" in a single process
#print(result.get())
Gives:
$ ./pooleg.py
In main pid is 22783 ppid is 19542
In initializer pid is 22784 ppid is 22783
In initializer pid is 22785 ppid is 22783
In initializer pid is 22787 ppid is 22783
In f pid is 22784 ppid is 22783
In initializer pid is 22786 ppid is 22783
100
As is clear from the output: 4 processes were created but only one of them actually did the work (called f).
Question: Why would I create a pool of > 1 workers and call apply() when the work f is done only by one process ? And same thing for apply_async() because in that case also the work is only done by one worker.
I don't understand the use cases in which these functions are useful.

First off, both are meant to operate on argument-tuples (single function calls), contrary to the Pool.map variants which operate on iterables. So it's not an error when you observe only one process used when you call these functions only once.
You would use Pool.apply_async instead of one of the Pool.map versions, where you need more fine grained control over the single tasks you want to distribute.
The Pool.map versions take an iterable and chunk them into tasks, where every task has the same (mapped) target function.
Pool.apply_async typically isn't called only once with a pool of >1 workers. Since it's asynchronous, you can iterate over manually pre-bundled tasks and submit them to several
worker-processes before any of them has completed. Your task-list here can consist of different target functions like you can see in this answer here. It also allows registering callbacks for results and errors like in this example.
These properties make Pool.apply_async pretty versatile and a first-choice tool for unusual problem scenarios you cannot get done with one of the Pool.map versions.
Pool.apply indeed is not widely usefull at first sight (and second). You could use it to synchronize control flow in a scenario where you start up multiple tasks with apply_async first and then have a task which has to be completed before you fire up another round of tasks with apply_async.
Using Pool.apply could also just mean sparing you to create a single extra Process for an in-between task, when you already have a pool which is currently idling.

This line in your code:
Pool(processes=4, initializer=initializer) as pool: # start 4 worker processes
doesn't start 4 worker processes. It just create a pool of them than can support running that many of them at a time running concurrently. It's methods like apply() that actually start separate processes running.
The difference is that apply() and apply_async() is that the former blocks until a result is ready but the latter returns a "result" object right away. This doesn't make much difference unless you want to submit more than one task to the Pool at a time (which of course is the whole point of using the multiprocessing module).
Here's are some modifications to your code showing how to actually do some concurrent processing with the Pool:
from multiprocessing import Pool
import time
import os
def initializer():
print("In initializer pid is {} ppid is {}".format(os.getpid(),os.getppid()))
def f(x):
print("In f pid is {} ppid is {}".format(os.getpid(),os.getppid()))
return x*x
if __name__ == '__main__':
print("In main pid is {} ppid is {}".format(os.getpid(), os.getppid()))
with Pool(processes=4, initializer=initializer) as pool: # Create 4 worker Pool.
# result = pool.apply(f, (10,)) # evaluate "f(10)" in a single process
# print(result)
# Start multiple tasks.
tasks = [pool.apply_async(f, (val,)) for val in range(10, 20)]
pool.close() # No more tasks.
pool.join() # Wait for all tasks to finish.
results = [result.get() for result in tasks] # Get the result of each.
print(results)
map_sync() would be better suited for processing something like this (a sequence of values) as it will handle some of the details shown in the above code automatically.

Related

Python Subprocess Popen Parallelization

Objective
a process (.exe) with multiple input arguments
Multiple files. For each the above mentioned process shall be executed
I want to use python to parallelize the process
I am using subprocess.Popen to create the processes and afterwards keep a maximum of N parallel processes.
For testing purposes, I want to parallelize a simple script like "cmd timeout 5".
State of work
import subprocess
count = 10
parallel = 2
processes = []
for i in range(0,count):
while (len(processes) >= parallel):
for process in processes:
if (process.poll() is None):
processes.remove(process)
break
process = subprocess.Popen(["cmd", "/c timeout 5"])
processes.append(process)
[...]
I read somewhere that a good approach for checking if a process is running would be is not None like shown in the code.
Question
I am somehow struggling to set it up correctly, especially the Popen([...]) part. In some cases, all processes are executed without considering the maximum parallel count and in other cases, it doesnt work at all.
I guess that there has to be a part where the process is closed if finished.
Thanks!

You will probably have a better time using the built-in multiprocessing module to manage the subprocesses running your tasks.
The reason I've wrapped the command in a dict is that imap_unordered (which is faster than imap but doesn't guarantee ordered execution since any worker process can grab any job – whether that's okay for you is your business problem) doesn't have a starmap alternative, so it's easier to unpack a single "job" within the callable.
import multiprocessing
import subprocess
def run_command(job):
# TODO: add other things here?
subprocess.check_call(job["command"])
def main():
with multiprocessing.Pool(2) as p:
jobs = [{"command": ["cmd", "/c timeout 5"]} for x in range(10)]
for result in p.imap_unordered(run_command, jobs):
pass
if __name__ == "__main__":
main()

multiprocessing.Pool: How to start new processes as old ones finish?

I'm using multiprocessing Pool to manage tesseract processes (OCRing pages of microfilm). Very often in a Pool of say 20 tesseract processes a few pages will be more difficult to OCR, and thus these processes are taking much much longer than the other ones. In the mean time, the pool is just hanging and most of the CPUs are not being leveraged. I want these stragglers to be left to continue, but I also want to start up more processes to fill up the many other CPUs that are now lying idle while these few sticky pages are finishing up. My question: is there a way to load up new processes to leverage those idle CPUs. In other words, can the empty spots in the Pool be filled before waiting for the whole pool to complete?
I could use the async version of starmap and then load up a new pool when the current pool has gone down to a certain number of living processes. But this seems inelegant. It would be more elegant to automagically keep slotting in processes as needed.
Here's what my code looks like right now:
def getMpBatchMap(fileList, commandTemplate, concurrentProcesses):
mpBatchMap = []
for i in range(concurrentProcesses):
fileName = fileList.readline()
if fileName:
mpBatchMap.append((fileName, commandTemplate))
return mpBatchMap
def executeSystemProcesses(objFileName, commandTemplate):
objFileName = objFileName.strip()
logging.debug(objFileName)
objDirName = os.path.dirname(objFileName)
command = commandTemplate.substitute(objFileName=objFileName, objDirName=objDirName)
logging.debug(command)
subprocess.call(command, shell=True)
def process(FILE_LIST_FILENAME, commandTemplateString, concurrentProcesses=3):
"""Go through the list of files and run the provided command against them,
one at a time. Template string maps the terms $objFileName and $objDirName.
Example:
>>> runBatchProcess('convert -scale 256 "$objFileName" "$objDirName/TN.jpg"')
"""
commandTemplate = Template(commandTemplateString)
with open(FILE_LIST_FILENAME) as fileList:
while 1:
# Get a batch of x files to process
mpBatchMap = getMpBatchMap(fileList, commandTemplate, concurrentProcesses)
# Process them
logging.debug('Starting MP batch of %i' % len(mpBatchMap))
if mpBatchMap:
with Pool(concurrentProcesses) as p:
poolResult = p.starmap(executeSystemProcesses, mpBatchMap)
logging.debug('Pool result: %s' % str(poolResult))
else:
break

You're mixing something up here. The pool always keeps a number of specified processes alive. As long as you don't close the pool, either manually or by leaving the with-block of the context-manager, there is no need for you to refill the pool with processes, because they're not going anywhere.
What you probably meant to say is 'tasks', tasks these processes can work on. A task is a per-process-chunk of the iterable you pass to the pool-methods. And yes, there's a way to use idle processes in the pool for new tasks before all previously enqueued tasks have been processed. You already picked the right tool for this, the async-versions of the pool-methods. All you have to do, is to reapply some sort of async pool-method.
from multiprocessing import Pool
import os
def busy_foo(x):
x = int(x)
for _ in range(x):
x - 1
print(os.getpid(), ' returning: ', x)
return x
if __name__ == '__main__':
arguments1 = zip([222e6, 22e6] * 2)
arguments2 = zip([111e6, 11e6] * 2)
with Pool(4) as pool:
results = pool.starmap_async(busy_foo, arguments1)
results2 = pool.starmap_async(busy_foo, arguments2)
print(results.get())
print(results2.get())
Example Output:
3182 returning: 22000000
3185 returning: 22000000
3185 returning: 11000000
3182 returning: 111000000
3182 returning: 11000000
3185 returning: 111000000
3181 returning: 222000000
3184 returning: 222000000
[222000000, 22000000, 222000000, 22000000]
[111000000, 11000000, 111000000, 11000000]
Process finished with exit code 0
Note above, processes 3182 and 3185 which ended up with the easier task, immediately start with tasks from the second argument-list, without waiting for 3181 and 3184 to complete first.
If you, for some reason, really would like to use fresh processes after some amount of processed tasks per process, there's the maxtasksperchild parameter for Pool. There you can specify after how many tasks the pool should replace the old processes with new ones. The default for this argument is None, so the Pool does not replace processes by default.

Spawning multiple processes with Python

Earlier I tried to use the threading module in python to create multiple threads. Then I learned about the GIL and how it does not allow taking advantage of multiple CPU cores on a single machine. So now I'm trying to do multiprocessing (I don't strictly need seperate threads).
Here is a sample code I wrote to see if distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID everytime. So multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os
def pri():
print(os.getpid())
if __name__=='__main__':
# Checking number of CPU cores
print(mp.cpu_count())
processes=[mp.Process(target=pri()) for x in range(1,4)]
for p in processes:
p.start()
for p in processes:
p.join()
Output:
4
12554
12554
12554

The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)

Since the subprocesses runs on a different process, you won't see their print statements. They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see are part of your main thread executions. Because you pass pri(), the code is actually executed. You need to change your code so the pri function returns value, rather than prints it.
Then you need to implement a queue, where all your threads write to it and when they're done, your main thread reads the queue.
A nice feature of the multiprocessing module is the Pool object. It allows you to create a thread pool, and then just use it. It's more convenient.
I have tried your code, the thing is the command executes too quick, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it would work as you expect.
That is True only for Windows. The example below is made on Windows platform. On Unix like machines, you won't need the sleep.
The more convenience solution is like this:
from multiprocessing import Pool
from time import sleep
import os
def pri(x):
sleep(1)
return os.getpid()
def use_procs():
p_pool = Pool(4)
p_results = p_pool.map(pri, [_ for _ in range(1,4)])
p_pool.close()
p_pool.join()
return p_results
if __name__ == '__main__':
res = use_procs()
for r in res:
print r
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
with the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000

Simple and safe way to wait for a Python Process to complete when using a Queue

I'm using the Process class to create and manage subprocesses, which may return non-trival quantities of data. The documentation states that join() is the correct way to wait for a Process to complete (https://docs.python.org/2/library/multiprocessing.html#the-process-class).
However, when using multiprocessing.Queue this can cause a hang after joining the process, as described here: https://bugs.python.org/issue8426 and here https://docs.python.org/2/library/multiprocessing.html#multiprocessing-programming (not a bug).
These docs suggest removing p.join() - but surely this will remove the guarantee that all processes have completed, as Queue.get() only waits for a single item to become available?
How can I wait for completion of all Processes in this case, and ensure I'm collecting output from them all?
A simple example of the hang I'd like to deal with:
from multiprocessing import Process, Queue
class MyClass:
def __init__(self):
pass
def example_run(output):
output.put([MyClass() for i in range(1000)])
print("Bottom of example_run() - note hangs after this is printed")
if __name__ == '__main__':
output = Queue()
processes = [Process(target=example_run, args=(output,)) for x in range(5)]
for p in processes:
p.start()
for p in processes:
p.join()
print("Processes completed")

https://bugs.python.org/issue8426
This means that whenever you use a queue you need to make sure that
all items which have been put on the queue will eventually be removed
before the process is joined. Otherwise you cannot be sure that
processes which have put items on the queue will terminate.
In your example I just added output.get() before calling to join() and every thing worked fine. We put data in queue to be used some where, so just make sure that.
for p in processes:
p.start()
print output.get()
for p in processes:
p.join()
print("Processes completed")

An inelegant solution is to add
output_final = []
for i in range(5): # we have 5 processes
output_final.append(output.get())
before attempting to join any of the processes. This simply tries to get the appropriate number of outputs for the number of processes we've started.
Turns out a much better, wider solution is not to use Process at all; use Pool instead. This way the hassles of starting worker processes and collecting the results is handled for you:
import multiprocessing
class MyClass:
def __init__(self):
pass
def example_run(someArbitraryInput):
foo = [MyClass() for i in range(10000)]
return foo
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=5)
output = pool.map(example_run, range(5))
pool.close(); pool.join() # make sure the processes are complete and tidy
print("Processes completed")

Subprocess management with python

I have a data analysis script that takes an argument specifying the segments of the analysis to perform. I want to run up to 'n' instances of the script at a time where 'n' is the number of cores on the machine. The complication is that there are more segments of the analysis than there are cores so I want to run at most, 'n' processes at once, and one of them finishes, kick off another one. Has anyone done something like this before using the subprocess module?

I do think that multiprocessing module will help you achieve what you need.
Take look at the example technique.
import multiprocessing
def do_calculation(data):
"""
#note: you can define your calculation code
"""
return data * 2
def start_process():
print 'Starting', multiprocessing.current_process().name
if __name__ == '__main__':
analsys_jobs = list(range(10)) # could be your analysis work
print 'analsys_jobs :', analsys_jobs
pool_size = multiprocessing.cpu_count() * 2
pool = multiprocessing.Pool(processes=pool_size,
initializer=start_process,
maxtasksperchild=2, )
#maxtasksperchild = tells the pool to restart a worker process \
# after it has finished a few tasks. This can be used to avoid \
# having long-running workers consume ever more system resources
pool_outputs = pool.map(do_calculation, analsys_jobs)
#The result of the map() method is functionally equivalent to the \
# built-in map(), except that individual tasks run in parallel. \
# Since the pool is processing its inputs in parallel, close() and join()\
# can be used to synchronize the main process with the \
# task processes to ensure proper cleanup.
pool.close() # no more tasks
pool.join() # wrap up current tasks
print 'Pool :', pool_outputs
You could find good multiprocessing techniques here to start with

Use the multiprocessing module, specifically the Pool class. Pool creates a pool of processes (by default, as many processes as you have CPUs), and allows you to submit jobs to the pool which are executed on the next free process. It takes care of all the subprocess management and the details of passing data between tasks, so you can write code in a very straightforward manner. See the documentation for some examples on usage.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.