Python multiprocessing - rerun initializer, or run function on all workers?

With multiprocessing, it is possible to execute the same function on all workers at creation time of the pool with the initializer and initargs options in the Pool factory function.
Is it possible to run something, with a guarantee, on all workers? I would like to do this periodically, but realize it is not a very popular use case and may not be possible without re-implementing Pool based on multiprocessing primitives...

You can simply use a Timer to ensure the initializer is called once again after a given interval.
This is a pretty standard way to re-schedule an action periodically.
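    from threading import Timer
    from multiprocessing import Pool

    def initializer(interval: float):
        update_global_state()
        # re-arm the timer so the state is refreshed again after `interval` seconds
        timer = Timer(interval, initializer, args=[interval])
        timer.daemon = True  # a pending timer should not keep the worker alive at shutdown
        timer.start()

    pool = Pool(initializer=initializer, initargs=[STATE_UPDATE_INTERVAL])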
Each worker of the Pool will update its state periodically and independently.
EDIT:
The goal of the Pool design is to abstract the management of tasks and workers, decoupling the main loop from the execution of the jobs. In doing so, it restricts access to the workers in order to protect the overall logic.
If what you need is to share and update the state among the workers, the only feasible approach is to let the workers poll for state updates. You can either use the above approach or let the worker check the state at any new job.
As a Pool is asynchronous by design, there is no way to synchronously provide information to the workers.
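The second option (checking the state at each new job) could look roughly like the sketch below. It assumes the parent signals a change by bumping a shared integer counter; the names init_worker, refresh_if_stale, worker_state and demo are made up for illustration and are not part of multiprocessing.

```python
import multiprocessing as mp

# Per-process bookkeeping; every worker ends up with its own copy of this dict.
worker_state = {"shared": None, "version": -1, "refreshes": 0}


def init_worker(shared_version):
    """Pool initializer: remember the shared counter inside this worker."""
    worker_state["shared"] = shared_version


def refresh_if_stale():
    """Reload worker-local state when the parent has bumped the counter."""
    current = worker_state["shared"].value
    if current != worker_state["version"]:
        # A real worker would reload its configuration or caches here.
        worker_state["version"] = current
        worker_state["refreshes"] += 1


def task(x):
    refresh_if_stale()  # cheap check at the start of every job
    return x * 2, worker_state["version"]


def demo():
    """Hypothetical driver; call it under `if __name__ == "__main__":`."""
    version = mp.Value("i", 0)
    with mp.Pool(2, initializer=init_worker, initargs=(version,)) as pool:
        first = pool.map(task, range(4))
        with version.get_lock():
            version.value += 1  # every worker refreshes on its next job
        second = pool.map(task, range(4))
    return first, second
```

A worker that receives no job after the bump never refreshes, so this gives "up to date before the next job" rather than "run on all workers now".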

If what you want to run has nothing to do with the tasks submitted to the pool, you can simply replace Pool.Process with your own version, which runs your code first.
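    from multiprocessing.pool import Pool as PoolCls
    from multiprocessing import Pool, Process

    class MyProcess(Process):
        def start(self):
            # runs in the parent process, right before the worker is launched
            print('do something you want...')
            super(MyProcess, self).start()

    # Use our own Process class instead; this must happen before creating the pool.
    # Note: newer Python 3 releases pass an extra context argument to
    # Pool.Process, so this assignment may need adapting there.
    PoolCls.Process = MyProcess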
Note that the statements added in start() run in the current (parent) process, not in the worker process. Override run() if you want the code to run in each worker process.
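A minimal sketch of the run() variant follows. It uses a plain Process rather than wiring the class into Pool, because the Pool.Process hook is not the same across Python versions; per_process_setup, work, events and demo are hypothetical names used only for this illustration.

```python
import os
from multiprocessing import Process

events = []  # filled in by whichever process executes run()


def per_process_setup():
    """Hypothetical hook; replace with the real per-process work."""
    events.append(("setup", os.getpid()))


def work(x):
    events.append(("work", x))


class SetupProcess(Process):
    def run(self):
        # start() arranges for run() to be called inside the child process,
        # so the setup below happens in the worker, not in the parent.
        per_process_setup()
        super().run()  # then call the target as usual


def demo():
    """Hypothetical driver; call it under `if __name__ == "__main__":`."""
    p = SetupProcess(target=work, args=(1,))
    p.start()
    p.join()
    return p.exitcode
```

Since both calls happen in the child, the parent's copy of events is not modified by demo(); anything the parent needs back has to travel through a Queue or Pipe.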

Related

How to specify a part of code to run in a particular thread in a multithreaded environment in python?

How to achieve something like:
    def call_me():
        # doing some stuff which requires distributed locking
        pass

    def i_am_calling():
        # other logic
        call_me()
        # other logic
This code runs in a multithreaded environment. How can I ensure that only a single thread from the thread pool is responsible for running the call_me() part of i_am_calling()?
It depends on the exact requirements and on the system architecture. One approach is based on a lock, which ensures that only one thread or process runs the protected section at a time.
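As a rough sketch of the lock-based approach, assuming all callers live in one process: a threading.Lock serialises call_me() while the rest of i_am_calling() stays concurrent. The stats dictionary and run_demo are only there to observe the behaviour, and a genuinely distributed lock (across machines) would need an external service instead.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

call_me_lock = threading.Lock()
stats = {"inside": 0, "max_inside": 0, "calls": 0}


def call_me():
    # Bookkeeping only: track how many threads are in here at once.
    stats["inside"] += 1
    stats["max_inside"] = max(stats["max_inside"], stats["inside"])
    time.sleep(0.01)  # stand-in for the real critical work
    stats["calls"] += 1
    stats["inside"] -= 1


def i_am_calling(n):
    before = n + 1  # "other logic" may run in several threads at once
    with call_me_lock:  # only one thread at a time gets past this line
        call_me()
    return before


def run_demo(workers=4, jobs=8):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(i_am_calling, range(jobs)))
```

After run_demo(), stats["max_inside"] shows the highest number of threads that were inside call_me() simultaneously.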
Another option is apply_async from the multiprocessing module. With pool.apply_async you can submit different functions (not only many calls of the same function) to the pool. A function submitted once runs in only one worker process, but you can also prepare a batch of tasks and submit them to the various worker processes. There is also pool.apply, which submits a task to the pool but blocks until the function has completed and its result is available. It is equivalent to pool.apply_async(func, args, kwargs).get(); to avoid blocking, call pool.apply_async without get() and pass a callback function instead. Note that pool.apply(f, args) guarantees that only one of the pool's workers executes f(*args).
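The difference between the call styles can be sketched as follows. To keep the snippet free of pickling concerns it uses multiprocessing.pool.ThreadPool, which exposes the same apply/apply_async interface as Pool; square, add and compare_apply_styles are illustrative names.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


def add(a, b):
    return a + b


def compare_apply_styles():
    collected = []
    with ThreadPool(2) as pool:
        blocking = pool.apply(square, (3,))  # waits for the result
        via_get = pool.apply_async(square, (3,)).get()  # same effect
        # Non-blocking submission; the callback receives the result.
        pending = pool.apply_async(add, (1, 2), callback=collected.append)
        pending.wait()
        # Different functions can be submitted side by side.
        mixed = [pool.apply_async(square, (4,)), pool.apply_async(add, (5, 6))]
        mixed_results = [r.get() for r in mixed]
    return blocking, via_get, collected, mixed_results
```

Note the trailing comma in (3,): the arguments must be passed as a tuple (or another iterable), not as a bare value.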
You can also make the call in its own thread with executor.submit from concurrent.futures, which is part of the Python standard library. asyncio can be combined with concurrent.futures so that it can await functions executed in the thread or process pools that concurrent.futures provides, as highlighted in this example.
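A small sketch of both variants, assuming a dedicated single-thread executor so that every call_me() invocation ends up on the same thread. The function bodies are placeholders, and run_with_submit and run_with_asyncio are names invented for this illustration.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def call_me(x):
    return x * 10  # stand-in for the blocking work


def run_with_submit(x):
    # Plain concurrent.futures: submit() returns a Future.
    with ThreadPoolExecutor(max_workers=1) as executor:
        return executor.submit(call_me, x).result()


async def i_am_calling(executor, x):
    loop = asyncio.get_running_loop()
    # other logic ...
    result = await loop.run_in_executor(executor, call_me, x)
    # other logic ...
    return result + 1


async def run_with_asyncio(values):
    # max_workers=1: all call_me() calls are executed by one thread.
    with ThreadPoolExecutor(max_workers=1) as executor:
        return await asyncio.gather(*(i_am_calling(executor, v) for v in values))
```

asyncio.run(run_with_asyncio(range(3))) drives the coroutine version from synchronous code.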
If you would like to run a routine at regular intervals, you can build the logic on threading.Timer.

Can I map a subprocess to the same multiprocessing.Pool where the main process is running?

I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal, in fact I want to increase the multiprocessing efficiency. I use map() to run each process into a Pool() which can contain as many processes as the user specifies via command line arguments.
Here is what the code looks like:
    from multiprocessing import Pool

    max_processes = 7
    # it is passed by command line actually but not relevant here

    def main_function( ... ):
        res_1 = sub_function_1( ... )
        res_2 = sub_function_2( ... )

    if __name__ == '__main__':
        p = Pool(max_processes)
        Arguments = []
        for x in Paths.keys():
            # generation of the arguments
            ...
            Arguments.append( Tup_of_arguments )
        p.map(main_function, Arguments)
        p.close()
        p.join()
As you see my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions is multiprocessable. Can I map processes from those subfunctions, which map to the same pool where the main process runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. In any case, threads are not carried over into child processes by fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
forkserver
I could not test this start method.
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
You'll very probably end up with a lot more maintainable code if you make the effort to adapt it for processing in a pool.
As an aside: I am not sure that allowing users to pass the number of workers via the command line is a good idea. I recommend giving that value an upper bound via os.cpu_count() at the very least.
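Capping the user-supplied value could be done along these lines. This is only a sketch: the --processes flag name and both function names are made up for the example.

```python
import argparse
import os


def bounded_worker_count(requested):
    """Clamp the requested worker count to the range 1..number of CPUs."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(requested, available))


def parse_workers(argv):
    parser = argparse.ArgumentParser()
    # Hypothetical flag name, chosen for this example only.
    parser.add_argument("--processes", type=int, default=1)
    args = parser.parse_args(argv)
    return bounded_worker_count(args.processes)
```

The result can then be passed straight to Pool, for example Pool(parse_workers(sys.argv[1:])).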

Python multiprocessing pool: dynamically set number of processes during execution of tasks

We submit large CPU-intensive jobs in Python 2.7 (that consist of many independent parallel processes) on our development machine, and they last for days at a time. The responsiveness of the machine slows down a lot when these jobs are running with a large number of processes. Ideally, I would like to limit the number of CPUs available during the day when we're developing code, and overnight run as many processes as efficiently as possible.
The Python multiprocessing library allows you to specify the number of processes when you create a Pool. Is there a way to dynamically change this number each time a new task is started?
For instance, allow 20 processes to run during the hours 19-07 and 10 processes from hours 07-19.
One way would be to check the number of active processes using significant CPU. This is how I would like it to work:
    from multiprocessing import Pool
    import time

    def big_task(x):
        while check_n_process(processes=10) is False:
            time.sleep(60*60)
        x += 1
        return x

    # create the pool after big_task is defined so the workers can find it
    pool = Pool(processes=20)
    x = 1
    # the arguments must be a tuple, hence (x,)
    multiple_results = [pool.apply_async(big_task, (x,)) for i in range(1000)]
    print([res.get() for res in multiple_results])
But I would need to write the 'check_n_process' function.
Any other ideas how this problem could be solved?
(The code needs to run in Python 2.7 - a bash implementation is not feasible).
Python's multiprocessing.Pool does not provide a way to change the number of workers of a running Pool. A simple solution would be to rely on third-party tools.
The Pool provided by billiard used to provide such a feature.
Task queue frameworks like Celery or Luigi surely allow a flexible workload but are way more complex.
If the use of external dependencies is not feasible, you can try the following approach. Elaborating from this answer, you could set a throttling mechanism based on a Semaphore.
    from threading import Semaphore, Lock
    from multiprocessing import Pool

    class TaskManager(object):
        def __init__(self, pool_size):
            self.pool = Pool(processes=pool_size)
            self.workers = Semaphore(pool_size)
            # ensures the semaphore is not replaced while it is being read
            self.workers_mutex = Lock()

        def change_pool_size(self, new_size):
            """Set the number of tasks allowed to run concurrently."""
            # tasks already running keep their old semaphore, so the new
            # limit only becomes exact once they have finished
            with self.workers_mutex:
                self.workers = Semaphore(new_size)

        def new_task(self, task):
            """Start a new task, blocks if all slots are taken."""
            with self.workers_mutex:
                workers = self.workers
            # acquire outside the mutex, otherwise a blocked new_task()
            # could never be woken up by a release
            workers.acquire()
            # the callback only fires on success; a failing task keeps its slot
            self.pool.apply_async(big_task, args=[task],
                                  callback=lambda result: self.task_done(workers))

        def task_done(self, workers):
            """Called once a task is done, frees a slot for a blocked new_task()."""
            workers.release()
new_task() blocks further attempts to schedule your big_tasks while the allowed number of workers are busy. By controlling this mechanism you can throttle the number of processes running concurrently. Of course, this means that you give up the Pool's queueing mechanism.
    task_manager = TaskManager(20)
    while True:
        if seven_in_the_morning():
            task_manager.change_pool_size(10)
        if seven_in_the_evening():
            task_manager.change_pool_size(20)
        task = get_new_task()
        task_manager.new_task(task)  # blocks here if all workers are busy
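A variation on the same throttling idea, written for Python 3 and therefore only a sketch for the Python 2.7 setup in the question: rather than swapping semaphores, keep one counter guarded by a threading.Condition so the limit can be raised or lowered in place. ResizableLimiter is a made-up name, not a standard library class.

```python
import threading


class ResizableLimiter:
    """Counting limiter whose limit can be changed while it is in use."""

    def __init__(self, limit):
        self._limit = limit
        self._active = 0
        self._cond = threading.Condition()

    def set_limit(self, limit):
        with self._cond:
            self._limit = limit
            self._cond.notify_all()  # wake waiters in case the limit grew

    def acquire(self, timeout=None):
        """Take a slot; returns False if none became free within timeout."""
        with self._cond:
            if not self._cond.wait_for(
                lambda: self._active < self._limit, timeout
            ):
                return False
            self._active += 1
            return True

    def release(self):
        with self._cond:
            self._active -= 1
            self._cond.notify_all()
```

Used like the semaphore above, acquire() would be called before pool.apply_async(...) and release() from the task's callback. Lowering the limit does not interrupt running tasks; it only delays new ones until enough of them have finished.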
This is woefully incomplete (and an old question), but you can manage the load by keeping track of the running processes and only calling apply_async() when it's favorable; if each job runs for less than forever, you can drop the load by dispatching fewer jobs during working hours, or when os.getloadavg() is too high.
I do this to manage network load when running multiple "scp"s to evade traffic shaping on our internal network (don't tell anyone!)
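The dispatch decision described above might be sketched like this. The hours and limits come from the question (20 processes from 19 to 07, 10 from 07 to 19); the function names and the max_load threshold are assumptions for the example.

```python
import os


def allowed_processes(hour, day_limit=10, night_limit=20):
    """Limit for the given hour (0-23): day limit from 07:00 up to 19:00."""
    return day_limit if 7 <= hour < 19 else night_limit


def may_dispatch(running, hour, load=None, max_load=None):
    """Decide whether one more job should be handed to apply_async()."""
    if running >= allowed_processes(hour):
        return False
    if max_load is not None:
        if load is None:
            load = os.getloadavg()[0]  # 1-minute load average, Unix only
        return load <= max_load
    return True
```

The main loop would keep its own count of running jobs (incremented on dispatch, decremented in the callback) and sleep briefly whenever may_dispatch() returns False.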

Dynamically reordering jobs in a multiprocessing pool in Python

I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.
    import multiprocessing
    import subprocess

    class Job(object):
        def __init__(self, popenArgs, runTime, children):
            self.popenArgs = popenArgs  # list to be fed to popen
            self.runTime = runTime      # approximate runTime for the job
            self.children = children    # jobs that require this job to run first

    def runJob(job):
        subprocess.Popen(job.popenArgs).wait()
        ####################################################
        # I want to remove this, and instead kick these back to the pool
        for j in job.children:
            runJob(j)
        ####################################################

    def main(jobs):
        # This jobs argument contains only jobs which are ready to be run
        # ie no children, only parent-less jobs
        jobs.sort(key=lambda job: job.runTime, reverse=True)
        multiprocessing.Pool(4).map(runJob, jobs)
First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the Queue module (named queue in Python 3) provides a thread-safe LifoQueue class that you can use.
    import threading
    import subprocess
    from queue import LifoQueue  # Python 2: from Queue import LifoQueue

    class Job(object):
        def __init__(self, popenArgs, runTime, children):
            self.popenArgs = popenArgs
            self.runTime = runTime
            self.children = children

    def run_jobs(queue):
        while True:
            job = queue.get()
            subprocess.Popen(job.popenArgs).wait()
            # enqueue the children before marking the parent as done
            for child in job.children:
                queue.put(child)
            queue.task_done()

    # Parameter 'jobs' contains the jobs that have no parent.
    def main(jobs):
        job_queue = LifoQueue()
        num_workers = 4
        # ascending sort: the longest job is pushed last and popped first
        jobs.sort(key=lambda job: job.runTime)
        for job in jobs:
            job_queue.put(job)
        for i in range(num_workers):
            t = threading.Thread(target=run_jobs, args=(job_queue,))
            t.daemon = True
            t.start()
        job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.
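To see the ordering effect without launching real subprocesses, here is a reduced sketch of the same pattern in which "running" a job just records its name. Job is simplified and run_all is a made-up helper.

```python
import threading
from queue import LifoQueue


class Job:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def worker(queue, order):
    while True:
        job = queue.get()
        order.append(job.name)  # stand-in for running the subprocess
        for child in job.children:
            queue.put(child)  # enqueue children before task_done()
        queue.task_done()


def run_all(jobs, num_workers=1):
    queue = LifoQueue()
    order = []
    for job in jobs:
        queue.put(job)
    for _ in range(num_workers):
        threading.Thread(target=worker, args=(queue, order), daemon=True).start()
    queue.join()  # returns once every enqueued job is marked done
    return order
```

With a single worker the result is deterministic: for jobs A (children A1, A2) and B (child B1), enqueued in that order, run_all returns B, B1, A, A2, A1, so each job's children run immediately after their parent.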

Python queues - have at most n threads running

The scenario:
I have a really large DB model migration going on for a new build, and I'm working on boilerplating how we will go about migrating current live data from a webapp into the local test databases.
I'd like to set up a Python script that will concurrently process the migration of my models. I have from_legacy and to_legacy methods for my model instances. What I have so far loads all my instances and creates a thread for each, with each thread subclassed from the core threading module with a run method that just does the conversion and saves the result.
I'd like to make the main loop in the program build a big stack of instances of these threads, and start to process them one by one, running only at most 10 concurrently as it does its work, and feeding the next in to be processed as others finish migrating.
What I can't figure out is how to utilize the queue correctly to do this? If each thread represents the full task of migration, should I load all the instances first and then create a Queue with maxsize set to 10, and have that only track currently running queues? Something like this perhaps?
    currently_running = Queue(maxsize=10)
    for model in models:
        task = Migrate(model)  # this is a subclassed thread
        currently_running.put(task)
        task.start()
In this case relying on the put call to block while it is at capacity? If I were to go this route, how would I call task_done?
Or rather, should the Queue include all the tasks (not just the started ones) and use join to block to completion? Does calling join on a queue of threads start the included threads?
What is the best methodology to approach the "at most have N running threads" problem and what role should the Queue play?
Although it was undocumented for a long time, the multiprocessing module has a ThreadPool class which, as its name implies, creates a pool of threads. It shares the same API as the multiprocessing.Pool class.
You can then send tasks to the thread pool using pool.apply_async:
    import multiprocessing.pool as mpool

    def worker(task):
        # work on task
        print(task)  # substitute your migration code here.

    # create a pool of 10 threads
    pool = mpool.ThreadPool(10)
    N = 100
    for task in range(N):
        pool.apply_async(worker, args=(task,))
    pool.close()
    pool.join()
This should probably be done using semaphores; the example in the documentation hints at what you're trying to accomplish.
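A semaphore-based version could look roughly like this sketch, in which the main loop takes a slot before starting each thread and the thread gives the slot back when it finishes. run_limited is a made-up helper, and the peak counter exists only to make the limit observable.

```python
import threading


def run_limited(tasks, limit):
    """Run each callable in its own thread, at most `limit` at a time."""
    slots = threading.BoundedSemaphore(limit)
    lock = threading.Lock()
    stats = {"running": 0, "peak": 0}
    results = [None] * len(tasks)

    def runner(index, func):
        try:
            with lock:
                stats["running"] += 1
                stats["peak"] = max(stats["peak"], stats["running"])
            results[index] = func()
        finally:
            with lock:
                stats["running"] -= 1
            slots.release()  # free the slot for the next thread

    threads = []
    for index, func in enumerate(tasks):
        slots.acquire()  # blocks while `limit` threads are still running
        t = threading.Thread(target=runner, args=(index, func))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return results, stats["peak"]
```

For the migration case, each callable would wrap one model's conversion, for example run_limited([m.from_legacy for m in models], limit=10), assuming from_legacy takes no arguments.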
