Sending completed jobs back to the correct process in Python

I'd like to create a set of processes with the following structure:
main, which dequeues requests from an external source. main generates a variable number of worker processes.
worker which does some preliminary processing on job requests, then sends data to gpuProc.
gpuProc, which accepts job requests from worker processes. When it has received enough requests, it sends the batch to a process that runs on the GPU. After getting the results back, it then has to send the completed batch back to the worker processes so that each worker receives the results for the jobs it submitted.
One could envision doing this with a number of queues. Since the number of worker processes is variable, it would be ideal if gpuProc had a single input queue into which workers put their job request and their specific return queue as a tuple. However, this isn't possible--you can only share vanilla queues in Python via inheritance, and manager.Queue() objects fail with:
RemoteError:
---------------------------------------------------------------------------
Unserializable message: ('#RETURN', ('Worker 1 asked proc to do some work.', <Queue.Queue instance at 0x7fa0ba14d908>))
---------------------------------------------------------------------------
Is there a pythonic way to do this without invoking some external library?

multiprocessing.Queue is implemented with a pipe, a deque and a thread.
When you call queue.put(), the object ends up in the deque and the thread takes care of pushing it into the pipe.
You cannot share threads across processes for obvious reasons, so you need to use something else.
Regular pipes and sockets can be easily shared.
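To show what sharing pipes via inheritance could look like for your layout, here is a minimal sketch (worker, gpu_proc and the fake "GPU work" are placeholders, not your actual code): each worker gets one end of its own Pipe, gpuProc gets a map of the other ends keyed by worker id, and requests are tagged with that id so results can be routed back.

from multiprocessing import Process, Queue, Pipe

def worker(worker_id, job_queue, result_conn):
    # Tag each request with our id so gpuProc knows where to send the result.
    for job in ["job-a", "job-b"]:
        job_queue.put((worker_id, job))
        print(worker_id, "got back:", result_conn.recv())

def gpu_proc(job_queue, result_conns, n_requests):
    for _ in range(n_requests):
        worker_id, job = job_queue.get()
        result = job.upper()                  # stand-in for the batched GPU work
        result_conns[worker_id].send(result)  # route back to the requesting worker

if __name__ == '__main__':
    job_queue = Queue()
    result_conns = {}
    workers = []
    for wid in range(3):
        gpu_end, worker_end = Pipe()
        result_conns[wid] = gpu_end
        workers.append(Process(target=worker, args=(wid, job_queue, worker_end)))
    gpu = Process(target=gpu_proc, args=(job_queue, result_conns, 3 * 2))
    gpu.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    gpu.join()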
Nevertheless I'd rather use a different architecture for your program.
The main process would act as an orchestrator, routing tasks to two different Pools of processes: one for CPU-bound jobs and the other for GPU-bound ones. This implies sharing a bit more information with the workers, but it's far more robust and scalable.
Here is a draft:
from multiprocessing import Pool

def cpu_worker(job_type, data):
    if job_type == "first_computation":
        results = do_cpu_work(data)       # placeholder for the CPU pre-processing
    elif job_type == "compute_gpu_results":
        results = do_post_gpu_work(data)  # placeholder for the post-GPU processing
    return results

def gpu_worker(data):
    return do_gpu_work(data)              # placeholder for the GPU computation

class Orchestrator:
    def __init__(self):
        self.cpu_pool = Pool()
        self.gpu_pool = Pool()

    def new_task(self, task):
        """Entry point for a new task. The task is run by the CPU workers
        and the results are handled by the cpu_job_done method."""
        self.cpu_pool.apply_async(cpu_worker, args=["first_computation", task],
                                  callback=self.cpu_job_done)

    def cpu_job_done(self, results):
        """Once the first CPU computation is done, send its results to a GPU
        worker. Its results are handled by the gpu_job_done method."""
        self.gpu_pool.apply_async(gpu_worker, args=[results],
                                  callback=self.gpu_job_done)

    def gpu_job_done(self, results):
        """GPU computation done, send the data back for the last CPU
        computation phase. Results are handled by the task_done method."""
        self.cpu_pool.apply_async(cpu_worker, args=["compute_gpu_results", results],
                                  callback=self.task_done)

    def task_done(self, results):
        """Here you get your final results for the task."""
        print(results)
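For completeness, a hypothetical way main could drive this draft (get_next_request stands in for however main dequeues work from the external source):

if __name__ == '__main__':
    orchestrator = Orchestrator()
    while True:
        task = get_next_request()  # placeholder for the external job source
        if task is None:           # assume None signals shutdown
            break
        orchestrator.new_task(task)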

Related

Can I map a subprocess to the same multiprocessing.Pool where the main process is running?

I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal, in fact I want to increase the multiprocessing efficiency. I use map() to run each process into a Pool() which can contain as many processes as the user specifies via command line arguments.
Here is what the code looks like:
from multiprocessing import Pool

max_processes = 7
# it is actually passed via the command line, but that is not relevant here

def main_function( ... ):
    res_1 = sub_function_1( ... )
    res_2 = sub_function_2( ... )

if __name__ == '__main__':
    p = Pool(max_processes)
    Arguments = []
    for x in Paths.keys():
        # generation of the arguments
        ...
        Arguments.append( Tup_of_arguments )
    p.map(main_function, Arguments)
    p.close()
    p.join()
As you can see, my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions can itself be multiprocessed. Can I map processes from those sub-functions onto the same pool in which the main function runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. Threads anyway don't make it into child processes via fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
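If you want to see this for yourself, a small probe along these lines (assuming the fork start method, so Unix only; the exact output may differ between Python versions) shows that a Pool copy exists in the worker but is not reachable through globals():

import gc
import multiprocessing
import multiprocessing.pool

def probe(_):
    # Look for Pool instances lurking in the worker's memory space.
    pools = [o for o in gc.get_objects()
             if isinstance(o, multiprocessing.pool.Pool)]
    return len(pools), 'pool' in globals()

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')
    pool = multiprocessing.Pool(2)
    print(pool.map(probe, range(2)))  # e.g. [(1, False), (1, False)]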
forkserver
I could not test this start method
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
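For what it's worth, a rough sketch of that queue-plus-handler-thread idea might look like this (sub_function_1/sub_function_2 are stand-ins for your real sub-functions; the dispatch table and the None sentinel are inventions for the example):

import threading
from multiprocessing import Manager, Pool

def sub_function_1(x):
    return x * 2        # stand-in for a real analysis step

def sub_function_2(x):
    return x + 1        # stand-in for a real analysis step

DISPATCH = {"sub_function_1": sub_function_1,
            "sub_function_2": sub_function_2}

def main_function(task_queue, x):
    # Instead of calling the sub-functions itself, the worker sends
    # descriptions of the sub-tasks back to the main process.
    task_queue.put(("sub_function_1", x))
    task_queue.put(("sub_function_2", x))

def delegate(pool, task_queue):
    # Handler thread in the main process: forwards sub-tasks to the pool.
    while True:
        item = task_queue.get()
        if item is None:      # sentinel: no more sub-tasks will arrive
            break
        name, arg = item
        pool.apply_async(DISPATCH[name], args=(arg,))

if __name__ == '__main__':
    manager = Manager()
    task_queue = manager.Queue()
    pool = Pool(4)
    handler = threading.Thread(target=delegate, args=(pool, task_queue))
    handler.start()
    # The top-level jobs run in the pool and enqueue their sub-tasks.
    pool.starmap(main_function, [(task_queue, x) for x in range(10)])
    task_queue.put(None)      # all top-level jobs done, stop the handler
    handler.join()
    pool.close()
    pool.join()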
You'll very probably end up with much more maintainable code if you make the effort to adapt it for processing in a pool.
As an aside: I am not sure letting users pass the number of workers via the command line is a good idea. I recommend giving that value an upper bound via os.cpu_count() at the very least.

Python multiprocessing pool: dynamically set number of processes during execution of tasks

We submit large CPU-intensive jobs in Python 2.7 (consisting of many independent parallel processes) on our development machine, and they last for days at a time. The responsiveness of the machine slows down a lot when these jobs are running with a large number of processes. Ideally, I would like to limit the number of CPUs used during the day while we're developing code, and run as many processes as efficiently as possible overnight.
The Python multiprocessing library allows you to specify the number of processes when you initiate a Pool. Is there a way to dynamically change this number each time a new task is initiated?
For instance, allow 20 processes to run during the hours 19-07 and 10 processes from hours 07-19.
One way would be to check the number of active processes using significant CPU. This is how I would like it to work:
from multiprocessing import Pool
import time

pool = Pool(processes=20)

def big_task(x):
    while check_n_process(processes=10) is False:
        time.sleep(60*60)
    x += 1
    return x

x = 1
multiple_results = [pool.apply_async(big_task, (x,)) for i in range(1000)]
print([res.get() for res in multiple_results])
But I would need to write the 'check_n_process' function.
Any other ideas how this problem could be solved?
(The code needs to run in Python 2.7 - a bash implementation is not feasible).
Python's multiprocessing.Pool does not provide a way to change the number of workers of a running Pool. A simple solution would be to rely on third-party tools.
The Pool provided by billiard used to provide such a feature.
Task queue frameworks like Celery or Luigi surely allow a flexible workload but are way more complex.
If the use of external dependencies is not feasible, you can try the following approach. Elaborating on this answer, you could set up a throttling mechanism based on a Semaphore.
from threading import Semaphore, Lock
from multiprocessing import Pool

class TaskManager(object):
    def __init__(self, pool_size):
        self.pool = Pool(processes=pool_size)
        self.workers = Semaphore(pool_size)
        # ensures the semaphore is not replaced while in use
        self.workers_mutex = Lock()

    def change_pool_size(self, new_size):
        """Set the Pool to a new size."""
        with self.workers_mutex:
            self.workers = Semaphore(new_size)

    def new_task(self, task):
        """Start a new task; blocks here if all workers are busy."""
        with self.workers_mutex:
            workers = self.workers
        workers.acquire()  # block outside the lock so task_done can still release slots
        self.pool.apply_async(big_task, args=[task], callback=self.task_done)

    def task_done(self, result):
        """Called once a task is done; releases a slot so blocked callers can proceed."""
        with self.workers_mutex:
            self.workers.release()
Scheduling further big_tasks will block if more than X workers are already busy. By controlling this mechanism you can throttle the number of processes running concurrently. Of course, this means you give up the Pool's queueing mechanism.
task_manager = TaskManager(20)

while True:
    if seven_in_the_morning():
        task_manager.change_pool_size(10)
    if seven_in_the_evening():
        task_manager.change_pool_size(20)
    task = get_new_task()
    task_manager.new_task(task)  # blocks here if all workers are busy
This is woefully incomplete (and it's an old question), but you can manage the load by keeping track of the running processes and only calling apply_async() when it's favorable: if each job runs for less than forever, you can drop the load by dispatching fewer jobs during working hours or when os.getloadavg() is too high.
I do this to manage network load when running multiple scp transfers, to evade traffic shaping on our internal network (don't tell anyone!).
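Sketching that idea (not the original poster's code; big_task, the limits and the thresholds are made up, multiprocessing.cpu_count() is used so it also works on 2.7, and os.getloadavg() is Unix-only):

import os
import time
from datetime import datetime
from multiprocessing import Pool, cpu_count

MAX_DAY, MAX_NIGHT = 10, 20

def big_task(x):
    time.sleep(1)        # stand-in for the real work
    return x + 1

def allowed_workers():
    # Fewer concurrent jobs during office hours, and back off further
    # when the one-minute load average is already above the core count.
    hour = datetime.now().hour
    limit = MAX_DAY if 7 <= hour < 19 else MAX_NIGHT
    if os.getloadavg()[0] > cpu_count():
        limit = max(1, limit // 2)
    return limit

if __name__ == '__main__':
    pool = Pool(processes=MAX_NIGHT)
    results = []
    for x in range(1000):
        # Only dispatch a new job when the number of unfinished ones
        # is below the current limit.
        while sum(1 for r in results if not r.ready()) >= allowed_workers():
            time.sleep(5)
        results.append(pool.apply_async(big_task, (x,)))
    print([r.get() for r in results])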

parallel processing of DAG

I'm trying hard to figure out how I can process a directed acyclic graph in parallel. Each node should only be able to "execute" when all its input nodes have been processed beforehand. Imagine a class Task with the following interface:
class Task(object):
    result = None

    def inputs(self):
        ''' List all requirements of the task. '''
        return ()

    def run(self):
        pass
I cannot think of a way to process a graph represented by this structure asynchronously with a maximum number of concurrent workers, except for one method.
I think the optimal processing would be achieved by creating a thread for each task, each waiting for all of its inputs to be processed. But spawning a thread for every task immediately, instead of only when the task is ready to be processed, does not sound like a good idea to me.
import threading

class Runner(threading.Thread):
    def __init__(self, task):
        super(Runner, self).__init__()
        self.task = task
        self.start()

    def run(self):
        threads = [Runner(r) for r in self.task.inputs()]
        [t.join() for t in threads]
        self.task.run()
Is there a way to mimic this behaviour more ideally? Also, this approach does not currently implement a way to limit the number of running tasks at a time.
Have one master thread push items to a queue once they are ready to be processed. Then have a pool of workers listening on the queue for tasks to work on. (Python provides a synchronized queue in the Queue module, renamed to lower-case queue in Python 3.)
The master first creates a map from dependencies to dependent tasks. Every task that doesn't have any dependencies can go into the queue. Every time a task is completed, the master uses the dictionary to figure out which dependent tasks there are, and puts them into the queue if all their dependencies are now met.
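A rough sketch of that scheme, using the Task interface from the question (Python 3 queue spelling; it assumes tasks contains every task in the graph):

import queue
import threading

def run_dag(tasks, num_workers=4):
    # For every task, track which of its inputs are still unfinished,
    # and which tasks are waiting on it.
    remaining = {task: set(task.inputs()) for task in tasks}
    dependents = {task: [] for task in tasks}
    for task in tasks:
        for dep in task.inputs():
            dependents[dep].append(task)

    ready = queue.Queue()
    lock = threading.Lock()
    for task in tasks:
        if not remaining[task]:
            ready.put(task)          # no dependencies: runnable immediately

    def worker():
        while True:
            task = ready.get()
            task.run()
            with lock:
                for child in dependents[task]:
                    remaining[child].discard(task)
                    if not remaining[child]:
                        ready.put(child)   # last dependency done: enqueue child
            ready.task_done()

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    ready.join()                     # returns once every task has run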
Celery (http://www.celeryproject.org/) is the leading task management tool for Python. It should be able to help you with this.

Can celery cooperatively run coroutines as stateful/resumable tasks?

I'm currently investigating Celery for use in a video-processing backend. Essentially my problem is as follows:
1. I have a frontend web server that concurrently processes a large number of video streams (on the order of thousands).
2. Each stream must be processed independently and in parallel.
3. Stream processing can be divided into two types of operations:
   - Frame-by-frame operations (computations that do not need information about the preceding or following frame(s))
   - Stream-level operations (computations that work on a subset of ordered, adjacent frames)
Given point 3, I need to maintain and update an ordered structure of frames throughout the process and farm computations on subsections of this structure to Celery workers. Initially, I thought about organizing things as follows:
[frontend server] -stream-> [celery worker 1 (greenlet)] --> [celery worker 2 (prefork)]
The idea is that celery worker 1 executes long-running tasks that are primarily I/O-bound. In essence, these tasks would only do the following:
Read a frame from the frontend server
Decode the frame from its base64 representation
Enqueue it in the aforementioned ordered data structure (a collections.deque object, as it currently stands).
Any CPU-bound operations (i.e. image analysis) are shipped off to celery worker 2.
My problem is as follows:
I would like to execute a coroutine as a task so that I have a long-running task from which I can yield, so as not to block celery worker 1's operations. In other words, I'd like to be able to do something akin to:
from collections import deque
from functools import wraps

def coroutine(func):
    @wraps(func)
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        cr.next()
        return cr
    return start

@coroutine
def my_task():
    stream = deque()                      # collections.deque
    source = MyAsynchronousInputThingy()  # something I'll make myself, probably using select
    while source.open:
        if source.has_data:
            stream.append(Frame(source.readline()))  # read data, build frame and enqueue to persistent structure
        yield  # cooperatively interrupt so that other tasks can execute
Is there a way to make a coroutine-based task run indefinitely, ideally producing results as they are yielded?
The primary idea behind Eventlet is that you write synchronous code, as with threads: socket.recv() should block the current thread until the next statement. This style is very easy to read, maintain and reason about while debugging. To make things effective and scalable, behind the scenes Eventlet does the magic of replacing seemingly blocking code with green threads and epoll/kqueue/etc. mechanisms that wake those green threads up at the proper times.
So all you need to do is execute eventlet.monkey_patch() as soon as possible (e.g. on the second line of your module) and make sure you use pure Python socket operations in MyInputThingy. Forget about asynchronous code; just write normal blocking code as you would with threads.
Eventlet makes synchronous code good again.
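As a hedged illustration of that style (the host/port and the frame-per-line protocol are made up; the point is only that the blocking-looking code yields whenever it waits on the socket):

import eventlet
eventlet.monkey_patch()   # patch sockets etc. as early as possible

import socket

def read_frames(host, port):
    # Plain blocking socket code; under Eventlet only this green thread waits.
    sock = socket.create_connection((host, port))
    for line in sock.makefile('rb'):
        yield line        # hand one "frame" at a time to the caller

def consume(host, port):
    for frame in read_frames(host, port):
        print('got frame of %d bytes' % len(frame))

pool = eventlet.GreenPool()
pool.spawn(consume, 'localhost', 9999)   # one green thread per stream
pool.waitall()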

Dynamically reordering jobs in a multiprocessing pool in Python

I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.
import multiprocessing
import subprocess

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs  # list to be fed to popen
        self.runTime = runTime      # approximate runtime for the job
        self.children = children    # jobs that require this job to run first

def runJob(job):
    subprocess.Popen(job.popenArgs).wait()
    ####################################################
    # I want to remove this, and instead kick these back to the pool
    for j in job.children:
        runJob(j)
    ####################################################

def main(jobs):
    # This jobs argument contains only jobs which are ready to be run,
    # i.e. no children, only parent-less jobs
    jobs.sort(key=lambda job: job.runTime, reverse=True)
    multiprocessing.Pool(4).map(runJob, jobs)
First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the Queue module provides a thread-safe LifoQueue class that you can use.
import threading
import subprocess
from Queue import LifoQueue

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs
        self.runTime = runTime
        self.children = children

def run_jobs(queue):
    while True:
        job = queue.get()
        subprocess.Popen(job.popenArgs).wait()
        for child in job.children:
            queue.put(child)
        queue.task_done()

# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
    job_queue = LifoQueue()
    num_workers = 4
    jobs.sort(key=lambda job: job.runTime)
    for job in jobs:
        job_queue.put(job)
    for i in range(num_workers):
        t = threading.Thread(target=run_jobs, args=(job_queue,))
        t.daemon = True
        t.start()
    job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.
