Keep using the same threads in a long-running process - python

I have a model that I want to keep running for over 6 hours paralleled:
pool = multiprocessing.Pool(10)
for subModel in self.model:
    pool.apply_async(self.compute, (subModel, features))
pool.close()
pool.join()
The problem is that this is too slow, because I have to call pool = multiprocessing.Pool(10) and pool.join() each time to construct and tear down 10 workers, while the model is run a huge number of times over the 6 hours.
The solution, I believe, is to keep 10 workers running in the background, so that whenever new data comes in it goes into the model right away, without the repeated creation and destruction of workers that wastes a lot of time.
Is there a way in Python to keep a long-running pool of workers so that I don't need to start and stop them over and over again?

You don't need to close() and join() (and destroy) the Pool after submitting one set of tasks to it. If you want to make sure that an apply_async() call has completed before going on, just call apply() instead, and reuse the same Pool next time. Or, if you can do other work while waiting for the tasks, save the AsyncResult object returned by apply_async() and call wait() on it once you can't go on without the task being complete.
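Here is a minimal runnable sketch of that pattern; the compute function, the list of models, and the batch loop below are stand-ins for the question's code, not part of it. The Pool is created once and reused for every batch, and only the per-batch AsyncResult objects are waited on.
import multiprocessing
import time

def compute(sub_model, features):
    # Stand-in for the question's self.compute: do some work and return a value.
    time.sleep(0.1)
    return sub_model, sum(features)

if __name__ == "__main__":
    models = range(4)                    # stand-in for self.model
    pool = multiprocessing.Pool(10)      # created once, reused for every batch

    for batch_id in range(3):            # each iteration simulates new data arriving
        features = [batch_id, batch_id + 1]
        async_results = [pool.apply_async(compute, (m, features)) for m in models]
        # Wait only for this batch; the 10 workers stay alive for the next one.
        print([r.get() for r in async_results])

    pool.close()                         # no more tasks will be submitted
    pool.join()                          # wait for the workers to exit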

Related

How to run a python multiprocessing pool without closing

I am trying to run multiple copies of a Bert model simultaneously.
I have a python object which holds a pool:
self.tokenizer = BertTokenizer.from_pretrained(BERT_LARGE)
self.model = BertForQuestionAnswering.from_pretrained(BERT_LARGE)
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
Each process in the pool copies across a Bert tokenizer and model:
process_model = None
process_tokenizer = None
def pool_init(m: BertForQuestionAnswering, t: BertTokenizer):
    global process_model, process_tokenizer
    process_model, process_tokenizer = m, t
To use the pool, I then run
while condition:
    answers = self.pool.map(answer_func, questions)
    condition = check_condition(answers)
This design is intended to avoid the large overhead of reloading the Bert model into each process every time the pool is initialized (which takes about 1.5-2 seconds per process).
Question 1. Is this the best way of doing this?
Question 2. If so, when am I supposed to call self.pool.close() and self.pool.join()? I want to join() before the check_condition() function, but I don't ever really want to close() the pool (except perhaps in the object's __del__()). However, calling join() before calling close() gives me errors, and calling close() makes the pool unusable for any future calls. Is Pool just not meant for this kind of job, and should I manage an array of processes instead? Help...?
Thanks!!
You said, "This design is in order to avoid the large overhead of reloading the Bert model into each process each time the pool is initialized (which takes about 1.5-2 seconds per process)." Your statement and the small amount of code you showed does not quite make perfect sense to me. I think it's a question of terminology.
First, I don't see where the pool is being initialized multiple times; I only see one instance of creating the pool:
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
But if you are creating the pool multiple times, then with your current design you are in fact using the pool_init function to reload the Bert model into each process of the pool every time the pool is created, and you are not avoiding what you say you are avoiding. That can still be a reasonable thing to do, so I suspect we are talking about two different things. I can only explain what is actually going on:
You are invoking the pool.map function potentially multiple times because of your while condition: loop. In general, though, you do want to avoid creating a pool multiple times if you can. There are two reasons I can think of for using the initializer and initargs arguments to the Pool constructor as you are doing:
1. If you have read-only data items that your worker function (answer_func in your case) needs to access, rather than passing these items on each call to that function, it is generally cheaper to initialize global variables of each process in the pool with these data items and have your worker function just access the global variables.
2. Certain data types, for example a multiprocessing.Lock instance, cannot be passed as an argument using any of the multiprocessing.Pool methods and need to be "passed" by using a pool initialization function.
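For completeness, here is a minimal sketch of the case 2 pattern (the worker function and its print statement are illustrative, not from the question): a Lock is handed to every pool process once, through the initializer, because it cannot travel through the pool's task queue.
import multiprocessing

lock = None  # set in each worker process by the initializer

def init_worker(the_lock):
    # A multiprocessing.Lock cannot be passed as a task argument,
    # so it is handed to each worker once, at pool start-up.
    global lock
    lock = the_lock

def worker(i):
    with lock:
        print(f"task {i} holds the lock")
    return i

if __name__ == "__main__":
    shared_lock = multiprocessing.Lock()
    with multiprocessing.Pool(4, initializer=init_worker,
                              initargs=(shared_lock,)) as pool:
        pool.map(worker, range(8))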
Case 2 does not seem to apply. So if you have 100 questions and a pool size of 8, it is better to pass the model and tokenizer 8 times, once for each process in the pool, rather than 100 times, once for each question.
If you are using the method Pool.map, which blocks until all submitted tasks are complete, you can be sure that there are no processes in the pool running any tasks when that method returns. If you will be re-executing the pool-creation code, then when you terminate the while condition: loop you should free resources either by calling pool.close() followed by pool.join(), which waits for the processes in the pool to terminate, or by just calling pool.terminate(), which terminates all the pool processes immediately (and we know they are idle at this point). If you are only creating the pool once, you really do not have to call anything; the processes in the pool are daemon processes, which will terminate when your main process terminates. But if you will be doing further processing after you have no further need for the pool, then to free up resources sooner rather than later you should do the cleanup described above.
Does this make sense?
Additional Note
Since pool.map blocks until all submitted tasks complete, there is no need to call pool.join() just to be sure that the tasks are completed; pool.map returns a list of all the values returned by your worker function, answer_func.
Where pool.join() can be useful, aside from the freeing of resources already mentioned, is when you are issuing one or more pool.apply_async calls. That method is non-blocking and returns an AsyncResult instance on which you can later call get() to block until the task completes and retrieve its return value. But if you are not interested in the return value(s) and just need to wait for the completion of the task(s), then, as long as you will not need to submit any more tasks to the pool, you can simply call pool.close() followed by pool.join(); when those two calls return, you can be sure that all of the submitted tasks have completed (possibly with exceptions).
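A short sketch of that fire-and-forget pattern (the task function below is illustrative, not from the question):
import multiprocessing

def log_answer(question):
    # Illustrative task whose return value we do not need.
    print(f"processed: {question}")

if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    for q in ["q1", "q2", "q3"]:
        pool.apply_async(log_answer, (q,))  # non-blocking submissions
    pool.close()  # no further tasks will be submitted
    pool.join()   # blocks until every submitted task has completed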
So putting a call to pool.terminate() in the class's __del__ method is a good idea for general usage.
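Applied to the class in the question, that might look like the following sketch; the class name, the constructor shape, and the model-name argument are assumptions pieced together from the snippets shown above.
from multiprocessing import Pool
from transformers import BertForQuestionAnswering, BertTokenizer

class BertAnswerer:  # hypothetical name for the asker's class
    def __init__(self, max_processes, bert_name="bert-large-uncased"):  # stand-in for BERT_LARGE
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.model = BertForQuestionAnswering.from_pretrained(bert_name)
        # pool_init is the initializer function shown in the question above.
        self.pool = Pool(processes=max_processes,
                         initializer=pool_init,
                         initargs=(self.model, self.tokenizer))

    def __del__(self):
        # The pool is idle by the time the object goes away (pool.map has
        # returned), so terminate() simply tears down the worker processes.
        self.pool.terminate()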

Unexpected behavior with multiprocessing pool imap_unordered method

I have a process that I'm trying to parallelize with the multiprocessing library. Each process does a few things, but one of the things each process does is to call a function that runs an optimization. Now, 90% of the time each optimization will complete in less than 1 minute, but if it doesn't, it might never converge. There is an internal mechanism that terminates the optimization function if it doesn't converge after say, 20,000 iterations, but that can take a long time (~1hr).
Right now my code looks something like this:
pool = multiprocessing.Pool()
imap_it = pool.imap_unordered(synth_worker, arg_tuples)
while 1:
    try:
        result = imap_it.next(timeout=120)
        process_result(result)
    except StopIteration:
        break
    except multiprocessing.TimeoutError:
        pass
pool.close()
pool.join()
This seems to work fairly well: I am able to process the first 90% of results quickly, but towards the end the longer-running tasks start to bog things down and only one completes every ~10 minutes. The weird thing is that once it gets to this point and I run top at the terminal, only 5-6 parallel processes seem to be running, even though I have 24 CPUs in my Pool; I see all 24 CPUs engaged at the beginning of the run. Meanwhile there are still over 100 remaining tasks that haven't finished (or even started).
I'm not using any of the built-in chunking options of imap_unordered, so my understanding is that as soon as one task finishes, the worker that ran it should pick up a new one. Why would there be fewer parallel processes running than there are remaining tasks, unless all remaining jobs were already allocated to the 5-6 jammed workers that are waiting for their non-convergent optimizations to hit the 20,000-iteration cutoff? Any ideas? Should I try another method of iterating over the results of my Pool as they come in?
(The question linked screenshots of top at the beginning of the run, of top once the jobs seem to get stuck, and of the stdout output when the jobs start to get stuck.)

How do I time out a job submitted to Dask?

I am using Dask to run a pool of tasks, retrieving results in the order they complete by the as_completed method, and potentially submitting new tasks to the pool each time one returns:
# Initial set of jobs
futures = [client.submit(job.run_simulation) for job in jobs]
pool = as_completed(futures, with_results=True)

while True:
    # Wait for a job to finish
    f, result = next(pool)

    # Exit condition
    if result == 'STOP':
        break

    # Do processing and maybe submit more jobs
    more_jobs = process_result(f, result)
    more_futures = [client.submit(job.run_simulation) for job in more_jobs]
    pool.update(more_futures)
Here's my problem: The function job.run_simulation that I am submitting can sometimes hang for a long time, and I want to time out this function - kill the task and move on if the run time exceeds a certain time limit.
Ideally, I'd like to do something like client.submit(job.run_simulation, timeout=10), and have next(pool) return None if the task ran longer than the timeout.
Is there any way that Dask can help me time out jobs like this?
What I've tried so far
My first instinct was to handle the timeout independently of Dask within the job.run_simulation function itself. I've seen two types of suggestions (e.g. here) for generic Python timeouts.
1) Use two threads, one for the function itself and one for a timer. My impression is this doesn't actually work because you can't kill threads. Even if the timer runs out, both threads have to finish before the task is completed.
2) Use two separate processes (with the multiprocessing module), one for the function and one for the timer. This would work, but since I'm already in a daemon subprocess spawned by Dask, I'm not allowed to create new subprocesses.
A third possibility is to move the code block to a separate script that I run with subprocess.run and use the subprocess.run built in timeout. I could do this, but it feels like a worst-case fallback scenario because it would take a lot of cumbersome passing of data to and from the subprocess.
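For reference, that subprocess-based fallback would look roughly like this sketch (the script name and the 10-second limit are placeholders):
import subprocess

try:
    # Run the simulation in a separate script; subprocess.run kills it
    # and raises TimeoutExpired if it exceeds the timeout.
    subprocess.run(["python", "run_simulation.py"], check=True, timeout=10)
except subprocess.TimeoutExpired:
    print("simulation timed out")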
So it feels like I have to accomplish the timeout at the level of Dask. My one idea here is to create a timer as a subprocess at the same time as I submit the task to Dask. Then if the timer runs out, use Client.cancel() to stop the task. The problem with this plan is that Dask might wait for workers to free up before starting the task, and I don't want the timer running before the task is actually running.
Your assessment of the problem seems correct to me and the solutions you went through are the same that I would consider. Some notes:
Client.cancel is unable to stop a function from running if it has already started. These functions are running in a thread pool and so you run into the "can't stop threads" limitation. Dask workers are just Python processes and have the same abilities and limitations.
You say that you can't use processes from within a daemon process. One solution to this would be to change how you're using processes in one of the following ways:
If you're using dask.distributed on a single machine then just don't use processes
client = Client(processes=False)
Don't use Dask's default nanny processes; then your dask worker will be a normal process capable of using multiprocessing
Set dask's multiprocessing-context config to "spawn" rather than fork or forkserver
The clean way to solve this problem though is to solve it inside of your function job.run_simulation. Ideally you would be able to push this timeout logic down to that code and have it raise cleanly.
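One possible shape for that cooperative timeout, where everything below is a stand-in rather than the asker's actual solver: the simulation's own loop checks a deadline and raises cleanly instead of relying on Dask to kill the task.
import time

class SimulationTimeout(Exception):
    """Raised when the simulation exceeds its time budget."""

def run_simulation(steps=1_000_000, timeout=10.0):
    # Cooperative timeout: check a deadline inside the solver loop and
    # raise cleanly rather than letting the task hang indefinitely.
    deadline = time.monotonic() + timeout
    state = 0.0
    for _ in range(steps):      # stand-in for the real solver loop
        state += 1.0            # stand-in for one solver iteration
        if time.monotonic() > deadline:
            raise SimulationTimeout(f"simulation exceeded {timeout} s")
    return state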

Retrieve exit code of processes launched with multiprocessing.Pool.map

I'm using python multiprocessing module to parallelize some computationally heavy tasks.
The obvious choice is to use a Pool of workers and then use the map method.
However, processes can fail. For instance, they may be silently killed by the oom-killer. Therefore I would like to be able to retrieve the exit code of the processes launched with map.
Additionally, for logging purposes, I would like to be able to know the PID of the process launched to execute each value in the iterable.
If you're using multiprocessing.Pool.map you're generally not interested in the exit code of the sub-processes in the pool, you're interested in what value they returned from their work item. This is because under normal conditions, the processes in a Pool won't exit until you close/join the pool, so there's no exit codes to retrieve until all work is complete, and the Pool is about to be destroyed. Because of this, there is no public API to get the exit codes of those sub-processes.
Now, you're worried about exceptional conditions, where something out-of-band kills one of the sub-processes while it's doing work. If you hit an issue like this, you're probably going to run into some strange behavior. In fact, in my tests where I killed a process in a Pool while it was doing work as part of a map call, map never completed, because the killed process didn't complete. Python did, however, immediately launch a new process to replace the one I killed.
That said, you can get the pid of each process in your pool by accessing the multiprocessing.Process objects inside the pool directly, using the private _pool attribute:
pool = multiprocessing.Pool()
for proc in pool._pool:
    print(proc.pid)
So, one thing you could do is try to detect when a process has died unexpectedly (assuming you don't get stuck in a blocking call as a result). You can do this by examining the list of processes in the pool before and after making a call to map_async:
before = pool._pool[:]  # Make a copy of the list of Process objects in our pool
result = pool.map_async(func, iterable)  # Use map_async so we don't get stuck.
while not result.ready():  # Wait for the call to complete
    if any(proc.exitcode for proc in before):  # Abort if one of our original processes is dead.
        print("One of our processes has exited. Something probably went horribly wrong.")
        break
    result.wait(timeout=1)
else:  # We'll enter this block if we don't reach `break` above.
    print(result.get())  # Actually fetch the result list here.
We have to make a copy of the list because when a process in the Pool dies, Python immediately replaces it with a new process, and removes the dead one from the list.
This worked for me in my tests, but because it's relying on a private attribute of the Pool object (_pool) it's risky to use in production code. I would also suggest that it may be overkill to worry too much about this scenario, since it's very unlikely to occur and complicates the implementation significantly.
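For the logging requirement in the question (knowing which PID handled each value), a simpler and fully supported alternative to poking at _pool is to have the worker report its own PID alongside its result; the work function below is a stand-in.
import multiprocessing
import os

def work(x):
    # Return the worker's PID together with the result so the parent
    # can log which process handled which input value.
    return os.getpid(), x * x

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        for pid, value in pool.map(work, range(8)):
            print(f"value {value} computed by PID {pid}")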

Persistent Processes Post Python Pool

I have a Python program that takes around 10 minutes to execute. So I use Pool from multiprocessing to speed things up:
from multiprocessing import Pool
p = Pool(processes = 6) # I have an 8 thread processor
results = p.map( function, argument_list ) # distributes work over 6 processes!
It runs much quicker, just from that. God bless Python! And so I thought that would be it.
However I've noticed that each time I do this, the processes and their considerably sized state remain, even when p has gone out of scope; effectively, I've created a memory leak. The processes show up in my System Monitor application as Python processes, which use no CPU at this point, but considerable memory to maintain their state.
Pool has functions close, terminate, and join, and I'd assume one of these will kill the processes. Does anyone know which is the best way to tell my pool p that I am finished with it?
Thanks a lot for your help!
From the Python docs, it looks like you need to do:
p.close()
p.join()
after the map() to indicate that the workers should terminate and then wait for them to do so.
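Equivalently, since Python 3.3 a Pool can be used as a context manager; its __exit__ calls terminate(), so the worker processes and their memory are released as soon as the with-block ends. A small sketch with a stand-in work function:
from multiprocessing import Pool

def function(x):
    # Stand-in for the real work; must be defined at module level so it can be pickled.
    return x * x

if __name__ == "__main__":
    # terminate() runs automatically on exit from the with-block,
    # so no lingering Python processes keep their state in memory.
    with Pool(processes=6) as p:
        results = p.map(function, range(100))
    print(results[:5])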
