We submit large, CPU-intensive jobs in Python 2.7 (consisting of many independent parallel processes) on our development machine, and they last for days at a time. The machine's responsiveness slows down a lot when these jobs are running with a large number of processes. Ideally, I'd like to limit the number of CPUs in use during the day while we're developing code, and at night run as many processes as efficiently as possible.
The Python multiprocessing library allows you to specify the number of processes when you initiate a Pool. Is there a way to dynamically change this number each time a new task is initiated?
For instance, allow 20 processes to run during the hours 19:00-07:00 and 10 processes during 07:00-19:00.
One way would be to check the number of active processes using significant CPU. This is how I would like it to work:
from multiprocessing import Pool
import time

pool = Pool(processes=20)

def big_task(x):
    while check_n_process(processes=10) is False:
        time.sleep(60 * 60)
    x += 1
    return x

x = 1
multiple_results = [pool.apply_async(big_task, (x,)) for i in range(1000)]
print([res.get() for res in multiple_results])
But I would need to write the 'check_n_process' function.
Any other ideas for how this problem could be solved?
(The code needs to run in Python 2.7 - a bash implementation is not feasible).
Python's multiprocessing.Pool does not provide a way to change the number of workers of a running Pool. A simple solution would be to rely on third-party tools.
The Pool provided by billiard used to provide such a feature.
Task queue frameworks like Celery or Luigi certainly allow a flexible workload, but are far more complex.
If the use of external dependencies is not feasible, you can try the following approach. Elaborating on this answer, you could set up a throttling mechanism based on a Semaphore.
from threading import Semaphore, Lock
from multiprocessing import Pool

class TaskManager(object):
    def __init__(self, pool_size):
        self.pool = Pool(processes=pool_size)
        self.workers = Semaphore(pool_size)
        # ensures the semaphore is not replaced while in use
        self.workers_mutex = Lock()

    def change_pool_size(self, new_size):
        """Set the throttle to a new size."""
        with self.workers_mutex:
            self.workers = Semaphore(new_size)

    def new_task(self, task):
        """Start a new task; blocks if all workers are busy."""
        # take a reference under the lock, then block outside it so that
        # task_done can still acquire the mutex and release a slot
        with self.workers_mutex:
            workers = self.workers
        workers.acquire()
        self.pool.apply_async(big_task, args=[task], callback=self.task_done)

    def task_done(self, result):
        """Called once a task is done; releases a worker slot."""
        with self.workers_mutex:
            self.workers.release()
The pool would block further attempts to schedule your big_tasks if more than X workers are busy. By controlling this mechanism, you can throttle the number of processes running concurrently. Of course, this means you give up the Pool's internal queueing mechanism.
task_manager = TaskManager(20)

while True:
    if seven_in_the_morning():
        task_manager.change_pool_size(10)
    if seven_in_the_evening():
        task_manager.change_pool_size(20)

    task = get_new_task()
    task_manager.new_task(task)  # blocks here if all workers are busy
This is woefully incomplete (and an old question), but you can manage the load by keeping track of the running processes and only calling apply_async() when it's favorable: if each job runs for less than forever, you can drop the load by dispatching fewer jobs during working hours, or whenever os.getloadavg() is too high.
I do this to manage network load when running multiple "scp"s to evade traffic shaping on our internal network (don't tell anyone!)
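For example, here is a rough sketch of that idea (the load threshold, batch size, and big_task body are assumptions you would tune for your own machine): it dispatches a small batch of jobs, then waits until the one-minute load average drops below the threshold before dispatching more.
import os
import time
from multiprocessing import Pool

MAX_LOAD = 8       # assumed load-average threshold; tune for your machine
BATCH_SIZE = 5     # how many tasks to submit between load checks

def big_task(x):   # placeholder job
    return x + 1

if __name__ == '__main__':
    pool = Pool(processes=20)
    results = []
    pending = list(range(1000))
    while pending:
        # only dispatch more work while the 1-minute load average is acceptable
        if os.getloadavg()[0] < MAX_LOAD:
            batch, pending = pending[:BATCH_SIZE], pending[BATCH_SIZE:]
            results.extend(pool.apply_async(big_task, (x,)) for x in batch)
        else:
            time.sleep(60)
    print([r.get() for r in results])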
Related
The following is my multiprocessing code. regressTuple has around 2000 items, so the code below creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't the Python multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to get the parallel thread count (hard-coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
from multiprocessing import Process

regressTuple = [(x,) for x in regressList]
processes = []

for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spin up is not limited by the number of cores on your system, but by the ulimit for your user id, which controls the total number of processes that can be launched by your user id (see the sketch below for a way to check it from Python).
The number of cores determines how many of those launched processes can actually run in parallel at any one time.
Your system can crash either because the target function these processes run is doing something heavy and resource-intensive that the system cannot handle when many processes run simultaneously, or because the nproc limit on the system has been exhausted and the kernel can no longer spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine, because creating a new process is not a lightweight task: the kernel has to generate a pid, allocate memory, set up an address space, schedule the process, handle context switching, and manage the entire life cycle. Spawning a new process is therefore a heavy operation for the kernel.
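If you want to check that per-user process limit from Python, the resource module exposes it (a small sketch; RLIMIT_NPROC is only available on Unix-like systems such as Linux):
import resource

# Soft and hard limits on the number of processes this user may create.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("nproc soft limit: %s, hard limit: %s" % (soft, hard))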
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spinning up more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process the input data
    pass

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't the Python multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency; adding system performance tests to the run queue would be an overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable to get the parallel thread count (hard-coded) and loop through threading several times till completion? In this way, after a few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
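For instance, here is a rough, Linux-only sketch of that arithmetic (the ~3 MB per-task figure is just the estimate above, and the MemAvailable field requires a reasonably recent kernel):
def estimate_max_tasks(per_task_mb=3):
    # Read available memory from /proc/meminfo and divide by the per-task estimate.
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                available_kb = int(line.split()[1])
                return available_kb // (per_task_mb * 1024)
    return 1  # fall back to a single task if the field is missing

print(estimate_max_tasks())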
This is probably more of a sysadmin task of balancing the bottlenecks against the queue length, but generally, if your tasks are IO-bound, there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn onto the road. The tasks will then fight with each other for the next block of IO.
I am using Python's multiprocessing.Pool class to distribute tasks among processes.
The simple case works as expected:
from multiprocessing import Pool

def evaluate:
    do_something()

pool = Pool(processes=N)

for task in tasks:
    pool.apply_async(evaluate, (data,))
N processes are spawned, and they continually work through the tasks that I pass into apply_async. Now, I have another case where I have many different very complex objects which each need to do computationally heavy activity. I initially let each object create its own multiprocessing.Pool on demand at the time it was completing work, but I eventually ran into OSError for having too many files open, even though I would have assumed that the pools would get garbage collected after use.
At any rate, I decided it would be preferable anyway for each of these complex objects to share the same Pool for computations:
from multiprocessing import Pool

def evaluate:
    do_something()

pool = Pool(processes=N)

class ComplexClass:
    def work:
        for task in tasks:
            self.pool.apply_async(evaluate, (data,))

objects = [ComplexClass() for i in range(50)]

for complex in objects:
    complex.pool = pool

while True:
    for complex in objects:
        complex.work()
Now, when I run this on one of my computers (OS X, Python=3.4), it works just as expected. N processes are spawned, and each complex object distributes their tasks among each of them. However, when I ran it on another machine (Google Cloud instance running Ubuntu, Python=3.5), it spawns an enormous number of processes (>> N) and the entire program grinds to a halt due to contention.
If I check the pool for more information:
import random
random_object = random.sample(objects, 1)
print (random_object.pool.processes)
>>> N
Everything looks correct. But it's clearly not. Any ideas what may be going on?
UPDATE
I added some additional logging. I set the pool size to 1 for simplicity. Within the pool, as a task is being completed, I print the current_process() from the multiprocessing module, as well as the pid of the task using os.getpid(). It results in something like this:
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
<ForkProcess(ForkPoolWorker-1, started daemon)>, PID: 5122
...
Again, looking at the actual activity using htop, I'm seeing many processes (one per object sharing the multiprocessing pool), all consuming CPU cycles as this is happening, resulting in so much OS contention that progress is very slow. 5122 appears to be the parent process.
1. Infinite Loop implemented
If you implement an infinite loop, then it will run like an infinite loop.
Your example (which does not work at all due to other reasons) ...
while True:
    for complex in objects:
        complex.work()
2. Spawn or Fork Processes?
Even though your code above shows only some snippets, you cannot expect the same results on Windows/macOS on the one hand and Linux on the other. The former spawns processes, the latter forks them. If you use global variables that can hold state, you will run into trouble when developing in one environment and running in the other.
Make sure not to use global stateful variables in your processes. Either pass them in explicitly or get rid of them in another way.
3. Use a Program, not a Script
Write a program with the minimal requirement of having a __main__ guard. Especially when you use multiprocessing, you need this. Instantiate your Pool in that namespace.
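A minimal sketch of that structure (evaluate and the config dict are placeholders): the Pool is created under the __main__ guard, and state reaches the workers as explicit arguments, so it behaves the same whether processes are spawned or forked.
from functools import partial
from multiprocessing import Pool

def evaluate(config, task):
    # all state arrives as arguments; nothing is inherited implicitly
    return config['scale'] * task

def main():
    config = {'scale': 2}      # hypothetical settings
    pool = Pool(processes=4)   # created only when the script runs as __main__
    try:
        print(pool.map(partial(evaluate, config), range(10)))
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    main()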
1) Your question contains code which is different from what you actually run (the code in the question has incorrect syntax and cannot be run at all).
2) The multiprocessing module is extremely bad at error handling/reporting for errors that happen in workers.
The problem is very likely in code that you don't show. The code you do show (if fixed) will just work forever and eat CPU, but it will not cause errors with too many open files or processes.
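As a side note on point 2, one way to actually see worker errors is to keep the AsyncResult objects returned by apply_async and call .get() on them, which re-raises the worker's exception in the parent (a minimal sketch with a deliberately failing placeholder task):
from multiprocessing import Pool

def evaluate(task):
    if task == 3:
        raise ValueError("boom")  # placeholder failure
    return task * task

if __name__ == '__main__':
    pool = Pool(processes=2)
    async_results = [pool.apply_async(evaluate, (t,)) for t in range(5)]
    for res in async_results:
        try:
            print(res.get(timeout=30))
        except Exception as exc:
            # without .get(), this failure would never surface in the parent
            print("worker failed: %r" % (exc,))
    pool.close()
    pool.join()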
I use a simple RabbitMQ queue to distribute tasks to worker processes. Each worker process uses a pool of multiprocessing instances to work on multiple tasks at the same time, to use the memory and the CPU as much as possible.
The problem is that some of the tasks take much more RAM than others, so that the worker process would crash if it started more than one instance. But while the worker is working on a RAM-intense task, I'd like it to work on other, less RAM-intense tasks to use the rest of the CPUs.
One idea would be to use multiple queues or topics, but I am wondering what the recommended approach is. Can I catch out-of-memory errors before they crash the process?
What would be the right approach to solve this?
[Update]
The whole system will consist of multiple multi-core machines, but on each machine there is only one worker program running, which creates as many multiprocessing instances as there are cores. The different machines should be independent of each other, except that they get their tasks from the same queue.
I think trying to catch and recover from OOM errors will be very difficult, if not impossible. You would need a thread or process running that constantly monitors memory usage, and when it detects it's too high, does... what exactly? Kill a process that's processing a task? Try to pause it (if that's even possible; it may not be, depending on what your tasks are doing)? Even then, pausing it isn't going to release any memory. You'd have to release the memory and restart the task when it's safe, which means you'd have to requeue it, decide when it's safe, etc.
Instead of trying to detect and recover from the problem, I would recommend trying to avoid it altogether. Create two queues, and two pools. One queue/pool for high-memory tasks, and another queue/pool for low-memory tasks. The high-memory pool would only have a single process in it, so it would be limited to running one task concurrently, which saves your memory. The low-memory queue would have multiprocessing.cpu_count() - 1 processes, allowing you to keep your CPUs saturated across the two pools.
One potential issue with this approach is that if you exhaust the high-memory queue while still having low-memory tasks pending, you'll be wasting one of your CPUs. You could handle this by consuming from the high-memory queue in a non-blocking way (or with a timeout), so that if the high-memory queue is empty when you're ready to consume a task, you can grab a low-memory task instead. Then, when you're done processing it, check the high-memory queue again.
Something like this:
import multiprocessing

# hi_q and lo_q are placeholders for whatever library you're using to consume from RabbitMQ

def high_mem_consume():
    while True:
        task = hi_q.consume(timeout=2)
        if not task:
            task = lo_q.consume(timeout=2)
        if task:
            process_task(task)

def low_mem_consume():
    while True:
        task = lo_q.consume()  # blocks forever
        process_task(task)

if __name__ == "__main__":
    hi_pool = multiprocessing.Pool(1)
    lo_pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)

    hi_pool.apply_async(high_mem_consume)
    # start one low-memory consumer per worker in the low-memory pool
    for _ in range(multiprocessing.cpu_count() - 1):
        lo_pool.apply_async(low_mem_consume)

    # keep the parent alive while the consumers run
    hi_pool.close()
    lo_pool.close()
    hi_pool.join()
    lo_pool.join()
I'm writing a python script (for cygwin and linux environments) to run regression testing on a program that is run from the command line using subprocess.Popen(). Basically, I have a set of jobs, a subset of which need to be run depending on the needs of the developer (on the order of 10 to 1000). Each job can take anywhere from a few seconds to 20 minutes to complete.
I have my jobs running successfully across multiple processors, but I'm trying to eke out some time savings by intelligently ordering the jobs (based on past performance) to run the longer jobs first. The complication is that some jobs (steady state calculations) need to be run before others (the transients based on the initial conditions determined by the steady state).
My current method of handling this is to run the parent job and all child jobs recursively on the same process, but some jobs have multiple long-running children. Once the parent job is complete, I'd like to add the children back to the pool to farm out to other processes, but they would need to be added to the head of the queue. I'm not sure I can do this with multiprocessing.Pool. I looked for examples with Manager, but they all seem to be based on networking and aren't particularly applicable. Any help in the form of code or links to a good tutorial on multiprocessing (I've googled...) would be much appreciated. Here's a skeleton of the code for what I've got so far, commented to point out the child jobs that I would like to spawn off on other processors.
import multiprocessing
import subprocess

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs  # list to be fed to popen
        self.runTime = runTime      # approximate run time for the job
        self.children = children    # jobs that require this job to run first

def runJob(job):
    subprocess.Popen(job.popenArgs).wait()
    ####################################################
    # I want to remove this, and instead kick these back to the pool
    for j in job.children:
        runJob(j)
    ####################################################

def main(jobs):
    # This jobs argument contains only jobs which are ready to be run,
    # i.e. no children, only parent-less jobs
    jobs.sort(key=lambda job: job.runTime, reverse=True)
    multiprocessing.Pool(4).map(runJob, jobs)
First, let me second Armin Rigo's comment: There's no reason to use multiple processes here instead of multiple threads. In the controlling process you're spending most of your time waiting on subprocesses to finish; you don't have CPU-intensive work to parallelize.
Using threads will also make it easier to solve your main problem. Right now you're storing the jobs in attributes of other jobs, an implicit dependency graph. You need a separate data structure that orders the jobs in terms of scheduling. Also, each tree of jobs is currently tied to one worker process. You want to decouple your workers from the data structure you use to hold the jobs. Then the workers each draw jobs from the same queue of tasks; after a worker finishes its job, it enqueues the job's children, which can then be handled by any available worker.
Since you want the child jobs to be inserted at the front of the line when their parent is finished, a stack-like container would seem to fit your needs; the Queue module provides a thread-safe LifoQueue class that you can use.
import threading
import subprocess
from Queue import LifoQueue

class Job(object):
    def __init__(self, popenArgs, runTime, children):
        self.popenArgs = popenArgs
        self.runTime = runTime
        self.children = children

def run_jobs(queue):
    while True:
        job = queue.get()
        subprocess.Popen(job.popenArgs).wait()
        for child in job.children:
            queue.put(child)
        queue.task_done()

# Parameter 'jobs' contains the jobs that have no parent.
def main(jobs):
    job_queue = LifoQueue()
    num_workers = 4
    jobs.sort(key=lambda job: job.runTime)
    for job in jobs:
        job_queue.put(job)
    for i in range(num_workers):
        t = threading.Thread(target=run_jobs, args=(job_queue,))
        t.daemon = True
        t.start()
    job_queue.join()
A couple of notes: (1) We can't know when all the work is done by monitoring the worker threads, since they don't keep track of the work to be done. That's the queue's job. So the main thread monitors the queue object to know when all the work is complete (job_queue.join()). We can thus mark the worker threads as daemon threads, so the process will exit whenever the main thread does without waiting on the workers. We thereby avoid the need for communication between the main thread and the worker threads in order to tell the latter when to break out of their loops and stop.
(2) We know all the work is done when all tasks that have been enqueued have been marked as done (specifically, when task_done() has been called a number of times equal to the number of items that have been enqueued). It wouldn't be reliable to use the queue's being empty as the condition that all work is done; the queue might be momentarily and misleadingly empty between popping a job from it and enqueuing that job's children.
The scenario:
I have a really large DB model migration going on for a new build, and I'm working on boilerplating how we will go about migrating current live data from a webapp into the local test databases.
I'd like to set up a Python script that will concurrently process the migration of my models. I have from_legacy and to_legacy methods for my model instances. What I have so far loads all my instances and creates a thread for each, with each thread subclassed from the core threading module, with a run method that just does the conversion and saves the result.
I'd like to make the main loop in the program build a big stack of instances of these threads and process them one by one, running at most 10 concurrently, feeding the next one in as others finish migrating.
What I can't figure out is how to utilize the queue correctly to do this. If each thread represents the full task of migration, should I load all the instances first and then create a Queue with maxsize set to 10, and have that only track currently running tasks? Something like this, perhaps:
currently_running = Queue(maxsize=10)

for model in models:
    task = Migrate(model)  # this is a subclassed Thread
    currently_running.put(task)
    task.start()
In this case, I'd be relying on the put call to block while the queue is at capacity. If I were to go this route, how would I call task_done?
Or rather, should the Queue include all the tasks (not just the started ones) and use join to block until completion? Does calling join on a queue of threads start the included threads?
What is the best methodology to approach the "at most have N running threads" problem and what role should the Queue play?
Although not documented, the multiprocessing module has a ThreadPool class which, as its name implies, creates a pool of threads. It shares the same API as the multiprocessing.Pool class.
You can then send tasks to the thread pool using pool.apply_async:
import multiprocessing.pool as mpool

def worker(task):
    # work on task
    print(task)  # substitute your migration code here.

# create a pool of 10 threads
pool = mpool.ThreadPool(10)
N = 100

for task in range(N):
    pool.apply_async(worker, args=(task,))

pool.close()
pool.join()
This should probably be done using semaphores; the example in the documentation is a hint of what you're trying to accomplish.
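A minimal sketch of that semaphore idea (migrate and the model list are placeholders): all threads are created up front, but each one has to acquire one of 10 slots before doing its work, so at most 10 migrations run concurrently.
import threading

MAX_CONCURRENT = 10
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def migrate(model):
    pass  # placeholder for the actual from_legacy/to_legacy conversion

def run_with_slot(model):
    with slots:  # blocks until one of the 10 slots is free
        migrate(model)

models = range(100)  # placeholder for the real model instances
threads = [threading.Thread(target=run_with_slot, args=(m,)) for m in models]
for t in threads:
    t.start()
for t in threads:
    t.join()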