The scenario:
I have a really large DB model migration going on for a new build, and I'm working on boilerplate for how we will migrate current live data from a webapp into the local test databases.
I'd like to set up a Python script that will concurrently process the migration of my models. I have from_legacy and to_legacy methods for my model instances. What I have so far loads all my instances and creates a thread for each, with each thread subclassed from threading.Thread with a run method that just does the conversion and saves the result.
I'd like the main loop in the program to build a big stack of these thread instances and process them one by one, running at most 10 concurrently, and feeding the next one in as others finish migrating.
What I can't figure out is how to use the queue correctly to do this. If each thread represents the full task of migration, should I load all the instances first and then create a Queue with maxsize set to 10, and have that only track currently running threads? Something like this perhaps?
currently_running = Queue(maxsize=10)
for model in models:
    task = Migrate(model)  # this is a subclassed thread
    currently_running.put(task)
    task.start()
In this case relying on the put call to block while it is at capacity? If I were to go this route, how would I call task_done?
Or rather, should the Queue include all the tasks (not just the started ones) and use join to block to completion? Does calling join on a queue of threads start the included threads?
What is the best methodology to approach the "at most have N running threads" problem and what role should the Queue play?
The multiprocessing.pool module has a ThreadPool class (undocumented for a long time) which, as its name implies, creates a pool of threads. It shares the same API as the multiprocessing.Pool class.
You can then send tasks to the thread pool using pool.apply_async:
import multiprocessing.pool as mpool

def worker(task):
    # work on task
    print(task)  # substitute your migration code here.

# create a pool of 10 threads
pool = mpool.ThreadPool(10)
N = 100
for task in range(N):
    pool.apply_async(worker, args=(task,))
pool.close()
pool.join()
This should probably be done using semaphores; the example in the documentation hints at what you're trying to accomplish.
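One way to read the semaphore suggestion is sketched below. It is a minimal illustration, not the asker's code: migrate() is a hypothetical stand-in for the real conversion, MAX_CONCURRENT is an assumed name for the cap of 10, and the running/peak counters exist only to show that the cap holds.

```python
import threading
import time

MAX_CONCURRENT = 10  # assumption: the cap of 10 mentioned in the question

slots = threading.BoundedSemaphore(MAX_CONCURRENT)
stats_lock = threading.Lock()
running = 0  # migrations currently inside the semaphore
peak = 0     # highest value 'running' has reached


def migrate(model):
    """Hypothetical stand-in for the real conversion; only tracks concurrency."""
    global running, peak
    with slots:  # blocks while MAX_CONCURRENT migrations are already in progress
        with stats_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # placeholder for from_legacy() and saving the result
        with stats_lock:
            running -= 1


def run_all(models):
    """Start one thread per model, wait for all of them, return the observed peak."""
    threads = [threading.Thread(target=migrate, args=(m,)) for m in models]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

Note that this still creates one thread per model up front; the semaphore only limits how many do work at once. With a very large number of models, a fixed pool of workers is likely the lighter option.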
I recently started putting together a webapp with Plotly Dash. I have a callback function that updates a DataTable with data that are fetched from a Redis server. The code that connects to Redis and downloads the data was originally developed to be used elsewhere - in scripts that run standalone either from the command line or through scheduling systems. The scripts run fine. The code that fetches the data can be run either sequentially or in parallel via multiprocessing. The multiprocessing related code is typical for the use case, it creates two queues, one with tasks pending and one for the completed tasks. An infinite while loop listens on the completed tasks queue and picks up the completed tasks until all of the tasks are finished. The reason why multiprocessing is used is because for each key/value pair fetched from Redis, the value is a big object that needs unpickling which is relatively time consuming.
To cut a long story short, when the code gets executed via the Dash callback function, the tasks are inserted in the pending queue and the infinite while loop listens on the completed tasks queue, but no tasks are getting executed. For some reason, in the example below the function do_work never gets executed by any worker at all.
# Set-up and start the workers
for c in range(num_workers):
    p = mp.Process(target=do_work, args=(tasks_pending, tasks_completed, verbose))
    p.name = 'worker' + str(c)
    processes.append(p)
    p.start()
I did have a look around multiprocessing context managers and Flask etc but I didn't manage to make it work. Any idea what is going on and why Dash (or Flask) is a special case? Any hints or pointers to the right direction would be great.
Many thanks!
You can define a multiprocessing Queue and then pass it to the callback via app:
events_messages = multiprocessing.Queue()
app.queue = events_messages
Then you can add messages or read them in the callback function:
app.queue.put('your item for the Queue')
I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal; in fact, I want to increase the multiprocessing efficiency. I use map() to run each task in a Pool(), which can contain as many processes as the user specifies via command line arguments.
Here is what the code looks like:
max_processes = 7
# it is passed by command line actually but not relevant here

def main_function( ... ):
    res_1 = sub_function_1( ... )
    res_2 = sub_function_2( ... )

if __name__ == '__main__':
    p = Pool(max_processes)
    Arguments = []
    for x in Paths.keys():
        # generation of the arguments
        ...
        Arguments.append( Tup_of_arguments )
    p.map(main_function, Arguments)
    p.close()
    p.join()
As you see my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions is multiprocessable. Can I map processes from those subfunctions, which map to the same pool where the main process runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. Threads anyway don't make it into child processes via fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
forkserver
I could not test this start method.
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
You'll very probably end up with much more maintainable code if you make the effort to adapt it for processing in a pool.
As an aside: I am not sure that allowing users to pass the number of workers via the command line is a good idea. I recommend giving that value an upper bound of os.cpu_count() at the very least.
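A minimal sketch of that upper bound follows. The helper name clamp_workers and its limit parameter are illustrative, not part of the original script.

```python
import os


def clamp_workers(requested, limit=None):
    """Return a worker count between 1 and the number of CPUs (or 'limit')."""
    if limit is None:
        # os.cpu_count() may return None when the count cannot be determined.
        limit = os.cpu_count() or 1
    return max(1, min(requested, limit))
```

The pool would then be created with something like Pool(clamp_workers(max_processes)) instead of the raw command line value.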
With multiprocessing, it is possible to execute the same function on all workers at creation time of the pool with the initializer and initargs options in the Pool factory function.
Is it possible to run something, with a guarantee, on all workers? I would like to do this periodically, but realize it is not a very popular use case and may not be possible without re-implementing Pool based on multiprocessing primitives...
You can simply use a Timer to ensure the initializer is called once again after a given interval.
This is a pretty standard way to re-schedule an action periodically.
from threading import Timer
from multiprocessing import Pool

def initializer(interval: float):
    update_global_state()
    timer = Timer(interval, initializer, args=[interval])
    timer.start()

pool = Pool(initializer=initializer, initargs=[STATE_UPDATE_INTERVAL])
Each worker of the Pool will update its state periodically and independently.
EDIT:
The goal of the Pool design is to abstract the management of tasks and workers, decoupling the main loop from the execution of the jobs. In doing so, it restricts access to the workers to protect the overall logic.
If what you need is to share and update state among the workers, the only feasible approach is to let the workers poll for state updates. You can either use the above approach or let each worker check the state whenever it picks up a new job.
As a Pool is asynchronous by design, there is no way to synchronously provide information to the workers.
If what you want to run has nothing to do with the tasks submitted to the pool, you can simply replace Pool.Process with your own version, which runs your extra code first.
from multiprocessing.pool import Pool as PoolCls
from multiprocessing import Pool, Process

class MyProcess(Process):
    def start(self):
        print('do something you want...')
        super(MyProcess, self).start()

# use our own Process instead; this must happen before creating the pool
PoolCls.Process = MyProcess
Note that the statements added in start() run in the current process, not the worker process. Override run() if you want them to run in each worker process.
I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple, long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all are based on networking it seems, and not particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like spawned off on other processors.
import multiprocessing
import subprocess

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs  # list to be fed to popen
        self.runTime = runTime      # approximate runTime for the job
        self.children = children    # jobs that require this job to run first

def runJob(job):
    subprocess.Popen(job.popenArgs).wait()
    ####################################################
    # I want to remove this, and instead kick these back to the pool
    for j in job.children:
        runJob(j)
    ####################################################

def main(jobs):
    # This jobs argument contains only jobs which are ready to be run
    # ie no children, only parent-less jobs
    jobs.sort(key=lambda job: job.runTime, reverse=True)
    multiprocessing.Pool(4).map(runJob, jobs)
First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the queue module (named Queue in Python 2) provides a thread-safe LifoQueue class that you can use.
import threading
import subprocess
from queue import LifoQueue  # Python 2: from Queue import LifoQueue

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs
        self.runTime = runTime
        self.children = children

def run_jobs(queue):
    while True:
        job = queue.get()
        subprocess.Popen(job.popenArgs).wait()
        # Enqueue the children before task_done() so join() keeps waiting for them.
        for child in job.children:
            queue.put(child)
        queue.task_done()

# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
    job_queue = LifoQueue()
    num_workers = 4
    # Ascending sort: the LIFO queue then hands out the longest job first.
    jobs.sort(key=lambda job: job.runTime)
    for job in jobs:
        job_queue.put(job)
    for i in range(num_workers):
        t = threading.Thread(target=run_jobs, args=(job_queue,))
        t.daemon = True
        t.start()
    job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.
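The two notes above can be checked with a small, subprocess-free variant of the same pattern. This is only a sketch under stated assumptions: the name-only Job class, the finished list, and the run() helper are invented for the demonstration, and appending a name to the list stands in for actually running the job.

```python
import threading
from queue import LifoQueue


class Job:
    """Simplified job for the demo: a name plus the jobs that depend on it."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def worker(job_queue, finished, lock):
    while True:
        job = job_queue.get()
        with lock:
            finished.append(job.name)  # stands in for running the subprocess
        for child in job.children:
            job_queue.put(child)  # enqueue children before task_done()
        job_queue.task_done()


def run(jobs, num_workers=4):
    """Run parent-less jobs and their descendants; return completion order."""
    job_queue = LifoQueue()
    finished, lock = [], threading.Lock()
    for job in jobs:
        job_queue.put(job)
    for _ in range(num_workers):
        threading.Thread(
            target=worker, args=(job_queue, finished, lock), daemon=True
        ).start()
    job_queue.join()  # returns only once every job, children included, is done
    return finished
```

Because each child is enqueued before its parent calls task_done(), the count of unfinished tasks never drops to zero while work remains, so join() cannot return early even if the queue is momentarily empty.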
I have a Python script that is doing some map-reduce-ish ETL. I am not the originator of the code, but I am working to analyze/diagnose its runtime for some improvements.
In the package, it uses a "Process":
worker = Process(target=grab_worker)
worker.start()
That does a perpetual FTP loop to extract new files from our CDN. I can't include the FTP code, but it shouldn't be relevant to the question.
Later in the code, we create an instance of a worker Pool which runs some async functions:
workerpool = multiprocessing.Pool(processes=4)
# ...
resultobjs[k] = workerpool.apply_async(func, args=fargs)
Again, the underlying code should be irrelevant to the question, I think, so I'm not including it yet.
My question is, in Python, once I create the worker Pool will the workers there be "shared" with the Process?
In other words, if I first create 1 worker with process doing something, later in execution when I create workers with a pool class, when the loop returns and tries to run the function registered with the process, will it then use the previously created workers?
Or, instead, does it keep the "hot side hot and the cold side cold" by allowing each class instance to reference only the workers which it has spawned ( Process reuses its one worker previously created, and Pool continues to use its designated workers vs. Process using the workers generated by Pool ).
The mp.Process knows nothing about the mp.Pool.
So the call to mp.Process will not somehow use a process spawned by mp.Pool.