Dynamically reordering jobs in a multiprocessing pool in Python - python

I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.
import multiprocessing
import subprocess
class Job(object):
def __init__(self, popenArgs, runTime, children)
self.popenArgs = popenArgs #list to be fed to popen
self.runTime = runTime #Approximate runTime for the job
self.children = children #Jobs that require this job to run first
def runJob(job):
subprocess.Popen(job.popenArgs).wait()
####################################################
#I want to remove this, and instead kick these back to the pool
for j in job.children:
runJob(j)
####################################################
def main(jobs):
# This jobs argument contains only jobs which are ready to be run
# ie no children, only parent-less jobs
jobs.sort(key=lambda job: job.runTime, reverse=True)
multiprocessing.Pool(4).map(runJob, jobs)

First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished a stack-like container would seem to fit your needs; the Queue module provides a thread-safe LifoQueue class that you can use.
import threading
import subprocess
from Queue import LifoQueue
class Job(object):
def __init__(self, popenArgs, runTime, children):
self.popenArgs = popenArgs
self.runTime = runTime
self.children = children
def run_jobs(queue):
while True:
job = queue.get()
subprocess.Popen(job.popenArgs).wait()
for child in job.children:
queue.put(child)
queue.task_done()
# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
job_queue = LifoQueue()
num_workers = 4
jobs.sort(key=lambda job: job.runTime)
for job in jobs:
job_queue.put(job)
for i in range(num_workers):
t = threading.Thread(target=run_jobs, args=(job_queue,))
t.daemon = True
t.start()
job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.

Related

Can I map a subprocess to the same multiprocessing.Pool where the main process is running?

I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal, in fact I want to increase the multiprocessing efficiency. I use map() to run each process into a Pool() which can contain as many processes as the user specifies via command line arguments.
Here is how the code looks like:
max_processes = 7
# it is passed by command line actually but not relevant here
def main_function( ... ):
res_1 = sub_function_1( ... )
res_2 = sub_function_2( ... )
if __name__ == '__main__':
p = Pool(max_processes)
Arguments = []
for x in Paths.keys():
# generation of the arguments
...
Arguments.append( Tup_of_arguments )
p.map(main_function, Arguments)
p.close()
p.join()
As you see my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions is multiprocessable. Can I map processes from those subfunctions, which map to the same pool where the main process runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. Threads anyway don't make it into child processes via fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
forkserver
I could not test this start method
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
You'll very probaly end up with a lot more maintainable code if you make the effort to adopt it for processing in a pool.
As an aside: I am not sure if allowing users to pass the number of workers via commandline is a good idea. I recommend to to give that value an upper boundary via os.cpu_count() at the very least.

Sending completed jobs back to correct process in python

I'd like to create a set of processes with the following structure:
main, which dequeues requests from an external source. main generates a variable number of worker processes.
worker which does some preliminary processing on job requests, then sends data to gpuProc.
gpuProc, which accepts job requests from worker processes. When it has received enough requests, it sends the batch to a process that runs on the GPU. After getting the results back, it has to then send back the completed batch of requests back to the worker processes such that the worker that requested it receives it back
One could envision doing this with a number of queues. Since the number of worker processes is variable, it would be ideal if gpuProc had a single input queue into which workers put their job request and their specific return queue as a tuple. However, this isn't possible--you can only share vanilla queues in python via inheritance, and manager.Queues() fail with:
RemoteError:
---------------------------------------------------------------------------
Unserializable message: ('#RETURN', ('Worker 1 asked proc to do some work.', <Queue.Queue instance at 0x7fa0ba14d908>))
---------------------------------------------------------------------------
Is there a pythonic way to do this without invoking some external library?
multiprocessing.Queue is implemented with a pipe, a deque and a thread.
When you call queue.put() the objects ends up in the deque and the thread takes care of pushing it into the pipe.
You cannot share threads within processes for obvious reasons. Therefore you need to use something else.
Regular pipes and sockets can be easily shared.
Nevertheless I'd rather use a different architecture for your program.
The main process would act as an orchestrator routing the tasks to two different Pools of processes, one for CPU bound jobs and the other to GPU bound ones. This would imply you need to share more information within the workers but it's way more robust and scalable.
Here you get a draft:
from multiprocessing import Pool
def cpu_worker(job_type, data):
if job_type == "first_computation":
results do_cpu_work()
elif job_type == "compute_gpu_results":
results = do_post_gpu_work()
return results
def gpu_worker(data):
return do_gpu_work()
class Orchestrator:
def __init__(self):
self.cpu_pool = Pool()
self.gpu_pool = Pool()
def new_task(self, task):
"""Entry point for a new task. The task will be run by the CPU workers and the results handled by the cpu_job_done method."""
self.cpu_pool.apply_async(cpu_worker, args=["first_computation", results], callback=self.cpu_job_done)
def cpu_job_done(self, results):
"""Once the first CPU computation is done, send its results to a GPU worker. Its results will be handled by the gpu_job_done method."""
self.gpu_pool.apply_async(gpu_worker, args=[results], callback=self.gpu_job_done)
def gpu_job_done(self, results):
"""GPU computation done, send the data back for the last CPU computation phase. Results will be handled by the task_done method."""
self.cpu_pool.apply_async(cpu_worker, args=["compute_gpu_results", results], callback=self.task_done)
def task_done(self, results):
"""Here you get your final results for the task."""
print(results)

parallel processing of DAG

I'm trying hard to figure out how I can process a directed acyclic graph in parallel. Each node should only be able to "execute" when all its input nodes have been processed beforehand. Imagine a class Task with the following interface:
class Task(object):
result = None
def inputs(self):
''' List all requirements of the task. '''
return ()
def run(self):
pass
I can not think of a way to process the graph that could be represented
by this structure asynchronously with a maximum number of workers at the
same time, except for one method.
I think the optimal processing would be achieved by creating a thread
for each task, waiting for all inputs to be processed. But, spawning
a thread for each task immediately instead of consecutively (i.e. when the
task is ready to be processed) does not sound like a good idea to me.
import threading
class Runner(threading.Thread):
def __init__(self, task):
super(Runner, self).__init__()
self.task = task
self.start()
def run(self):
threads = [Runner(r) for r in self.task.inputs()]
[t.join() for t in threads]
self.task.run()
Is there a way to mimic this behaviour more ideally? Also, this approach
does currently not implement a way to limit the number of running tasks at
a time.
Have one master thread push items to a queue once they are ready for being processsed. Then have a pool of workers listen on the queue for tasks to work on. (Python provides a synchronized queue in the Queue module, renamed to lower-case queue in Python 3).
The master first creates a map from dependencies to dependent tasks. Every task that doesn't have any dependcies can go into the queue. Everytime a task is completed, the master uses the dictionary to figure out which dependent tasks there are, and puts them into the queue if all their depndencies are met now.
Celery (http://www.celeryproject.org/) is the leading task management tool for Python. It should be able to help you with this.

Python queues - have at most n threads running

The scenario:
I have a really large DB model migration going on for a new build, and I'm working on boilerplating how we will go about migration current live data from a webapp into the local test databases.
I'd like to setup in python a script that will concurrently process the migration of my models. I have from_legacy and to_legacy methods for my model instances. What I have so far loads all my instances and creates threads for each, with each thread subclassed from the core threading modules with a run method that just does the conversion and saves the result.
I'd like to make the main loop in the program build a big stack of instances of these threads, and start to process them one by one, running only at most 10 concurrently as it does its work, and feeding the next in to be processed as others finish migrating.
What I can't figure out is how to utilize the queue correctly to do this? If each thread represents the full task of migration, should I load all the instances first and then create a Queue with maxsize set to 10, and have that only track currently running queues? Something like this perhaps?
currently_running = Queue()
for model in models:
task = Migrate(models) #this is subclassed thread
currently_running.put(task)
task.start()
In this case relying on the put call to block while it is at capacity? If I were to go this route, how would I call task_done?
Or rather, should the Queue include all the tasks (not just the started ones) and use join to block to completion? Does calling join on a queue of threads start the included threads?
What is the best methodology to approach the "at most have N running threads" problem and what role should the Queue play?
Although not documented, the multiprocessing module has a ThreadPool class which, as its name implies, creates a pool of threads. It shares the same API as the multiprocessing.Pool class.
You can then send tasks to the thread pool using pool.apply_async:
import multiprocessing.pool as mpool
def worker(task):
# work on task
print(task) # substitute your migration code here.
# create a pool of 10 threads
pool = mpool.ThreadPool(10)
N = 100
for task in range(N):
pool.apply_async(worker, args = (task, ))
pool.close()
pool.join()
This should probably be done using semaphores the example in the documentation is a hint of what you're try to accomplish.

producer/consumer-like multithreading in python

A typical producer-consumer problem is solved in python like below:
from queue import Queue
job_queue = Queue(maxsize=10)
def manager():
while i_have_some_job_do:
job = get_data_from_somewhere()
job_queue.put(job) #blocks only if queue is currently full
def worker():
while True:
data = job_queue.get() # blocks until data available
#get things done
But I have a variant of producer/consumer problem (not one strictly speaking, so let me call it manager-worker):
The manager puts some job in a Queue, and the worker should keep getting the jobs and doing them. But when the worker get a job, it does not remove the job from the Queue(unlike Queue.get()). And it is the manager which is able to remove a job from the Queue.
So how does the worker get the job while not removing the job from the queue? Maybe get and put is OK?
How does the manager remove a particular job from the queue?
Perhaps your works can't remove jobs completely, but consider letting them move them from the original queue to a different "job done" queue. The move itself should be cheap and fast, and the manager can then process the "job done" queue, removing elements it agrees are done, and moving others back to the worker queue.

Categories