How to run a python multiprocessing pool without closing

I am trying to run multiple copies of a Bert model simultaneously.
I have a python object which holds a pool:
self.tokenizer = BertTokenizer.from_pretrained(BERT_LARGE)
self.model = BertForQuestionAnswering.from_pretrained(BERT_LARGE)
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
Each process in the pool copies across a Bert tokenizer and model:
process_model = None
process_tokenizer = None
def pool_init(m: BertForQuestionAnswering, t: BertTokenizer):
    global process_model, process_tokenizer
    process_model, process_tokenizer = m, t
To use the pool, I then run
while condition:
    answers = self.pool.map(answer_func, questions)
    condition = check_condition(answers)
This design is in order to avoid the large overhead of reloading the Bert model into each process each time the pool is initialized (which takes about 1.5-2 seconds per process).
Question 1. Is this the best way of doing this?
Question 2. If so, when am I supposed to call self.pool.close() and self.pool.join()? I want to join() before the check_condition() function, but I don't ever really want to close() the pool (except perhaps in the object's __del__()). Calling join() before calling close() gives me errors, and calling close() makes the pool unusable afterwards. Is pool just not meant for this kind of job? Should I manage an array of processes instead? Help...?
Thanks!!

You said, "This design is in order to avoid the large overhead of reloading the Bert model into each process each time the pool is initialized (which takes about 1.5-2 seconds per process)." Your statement and the small amount of code you showed does not quite make perfect sense to me. I think it's a question of terminology.
First, I don't see where the pool is being initialized multiple times; I only see one instance of creating the pool:
self.pool = Pool(processes=max_processes,
                 initializer=pool_init,
                 initargs=(self.model, self.tokenizer))
But if you are creating the pool multiple times, then with your current design the pool_init function is in fact reloading the Bert model into each process of the pool each time the pool is created, so you are not avoiding what you say you are avoiding (although this can be a good thing). I suspect we are talking about two different things, so I can only explain what is actually going on:
You are invoking the pool.map function potentially multiple times because of your while condition: loop. But, in general, you do want to avoid creating a pool multiple times if you can. Now, there are two reasons I can think of for using the initializer and initargs arguments to the Pool constructor as you are doing:
1. If you have read-only data items that your worker function (answer_func in your case) needs to access, rather than passing these items on each call to that function, it is generally cheaper to initialize global variables of each process in the pool with these data items and have your worker function just access the global variables.
2. Certain data types, for example a multiprocessing.Lock instance, cannot be passed as an argument using any of the multiprocessing.Pool methods and need to be "passed" by using a pool initialization function.
Case 2 does not seem to apply. So if you have 100 questions and a pool size of 8, it is better to pass the model and tokenizer 8 times, once for each process in the pool, rather than 100 times, once for each question.
If you are using the method Pool.map, which blocks until all submitted tasks are complete, you can be sure that there are no processes in the pool running any tasks when that method returns. If you will be re-executing the pool creation code, then when you terminate the while condition: loop you should free resources by either calling pool.close() followed by pool.join(), which waits for the processes in the pool to terminate, or by just calling pool.terminate(), which terminates all the pool processes (which we know are idle at this point) immediately. If you are only creating the pool once, you do not really have to call anything; the processes in the pool are daemon processes, which will terminate when your main process terminates. But if you will be doing further processing after you have no further need for the pool, then to free up resources sooner rather than later you should do the "cleanup" described above.
Does this make sense?
Additional Note
Since pool.map blocks until all submitted tasks complete, there is no need to call pool.join() just to be sure that the tasks are completed; pool.map will return a list of all the values that were returned by your worker function, answer_func.
Where pool.join() can be useful, aside from the freeing of resources I have already mentioned, is when you are issuing one or more pool.apply_async calls. This method is non-blocking and returns an AsyncResult instance on which you can later issue a get call to block until the task completes and to get the return value. But if you are not interested in the return value(s) and just need to wait for the completion of the task(s), then as long as you will not need to submit any more tasks to the pool, you can simply issue pool.close() followed by pool.join(); at the completion of those two calls you can be sure that all of the submitted tasks have completed (possibly with exceptions).
So putting a call to pool.terminate() in the class's __del__ method is a good idea for general usage.
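For what it's worth, here is a minimal sketch of that overall pattern (the answer_func and check_condition names are placeholders standing in for your real functions): the pool is created once, reused across map calls, and terminated only in __del__:

from multiprocessing import Pool

process_model = None
process_tokenizer = None

def pool_init(model, tokenizer):
    # Runs once per worker process; stores the read-only objects in globals.
    global process_model, process_tokenizer
    process_model, process_tokenizer = model, tokenizer

class Answerer:
    def __init__(self, model, tokenizer, max_processes=4):
        # Created exactly once; loading the model into each worker happens only here.
        self.pool = Pool(processes=max_processes,
                         initializer=pool_init,
                         initargs=(model, tokenizer))

    def run(self, questions, answer_func, check_condition):
        answers, condition = None, True
        while condition:
            # map blocks until all submitted tasks complete, so no join() is needed here.
            answers = self.pool.map(answer_func, questions)
            condition = check_condition(answers)
        return answers

    def __del__(self):
        # The workers are idle whenever map has returned, so terminate() is safe here.
        self.pool.terminate()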

Related

Can I map a subprocess to the same multiprocessing.Pool where the main process is running?

I am relatively new to the multiprocessing world in python3 and I am therefore sorry if this question has been asked before. I have a script which, from a list of N elements, runs the entire analysis on each element, mapping each onto a different process.
I am aware that this is suboptimal; in fact, I want to increase the multiprocessing efficiency. I use map() to run each process in a Pool() which can contain as many processes as the user specifies via command line arguments.
Here is what the code looks like:
max_processes = 7
# it is passed by command line actually but not relevant here

def main_function( ... ):
    res_1 = sub_function_1( ... )
    res_2 = sub_function_2( ... )

if __name__ == '__main__':
    p = Pool(max_processes)
    Arguments = []
    for x in Paths.keys():
        # generation of the arguments
        ...
        Arguments.append( Tup_of_arguments )
    p.map(main_function, Arguments)
    p.close()
    p.join()
As you can see, my process calls a main function which in turn calls many other functions one after the other. Now, each of the sub_functions is itself multiprocessable. Can I map processes from those subfunctions onto the same pool in which the main function runs?
No, you can't.
The pool is (pretty much) not available in the worker processes. It depends a bit on the start method used for the pool.
spawn
A new Python interpreter process is started and imports the module. Since in that process __name__ is '__mp_main__', the code in the __name__ == '__main__' block is not executed and no pool object exists in the workers.
fork
The memory space of the parent process is copied into the memory space of the child process. That effectively leads to an existing Pool object in the memory space of each worker.
However, that pool is unusable. The workers are created during the execution of the pool's __init__, hence the pool's initialization is incomplete when the workers are forked. The pool's copies in the worker processes have none of the threads running that manage workers, tasks and results. In any case, threads don't make it into child processes via fork.
Additionally, since the workers are created during the initialization, the pool object has not yet been assigned to any name at that point. While it does lurk in the worker's memory space, there is no handle to it. It does not show up via globals(); I only found it via gc.get_objects(): <multiprocessing.pool.Pool object at 0x7f75d8e50048>
Anyway, that pool object is a copy of the one in the main process.
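If you want to see that for yourself, here is a small sketch (assuming a platform where the fork start method is available, such as Linux) that searches each worker's memory for Pool objects copied in by fork:

import gc
import multiprocessing
from multiprocessing.pool import Pool

def find_pools(_):
    # Report any Pool instances the garbage collector can see in this worker;
    # under fork these are the unusable copies described above.
    return [repr(obj) for obj in gc.get_objects() if isinstance(obj, Pool)]

if __name__ == '__main__':
    multiprocessing.set_start_method('fork')   # fork is not available on Windows
    with multiprocessing.Pool(2) as p:
        for found in p.map(find_pools, range(2)):
            print(found)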
forkserver
I could not test this start method.
To solve your problem, you could fiddle around with queues and a queue handler thread in the main process to send back tasks from workers and delegate them to the pool, but all approaches I can think of seem rather clumsy.
You'll very probably end up with much more maintainable code if you make the effort to adapt it for processing in a pool.
As an aside: I am not sure that allowing users to pass the number of workers via the command line is a good idea. I recommend at the very least giving that value an upper bound via os.cpu_count().
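A minimal sketch of that cap (the function name clamp_workers is just made up for illustration):

import os

def clamp_workers(requested):
    # Never use more workers than there are CPU cores, and always use at least one.
    return max(1, min(requested, os.cpu_count() or 1))

print(clamp_workers(7))   # e.g. 7 on an 8-core machine, 4 on a 4-core machine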

Should I preserve pool object (and its workers) throughout the whole program in this case?

I am currently modifying an existing program to contain multi-processing features so that it can be used more efficiently on multi-core systems. I am using Python3's multiprocessing module to implement this. I'm fairly new to multiprocessing and I was wondering whether my design is very efficient or not.
The general execution steps of my program is as following:
Main process
call function1() -> create pool of workers and carry out certain operations in parallel. close pool.
call function2() -> create pool of workers and carry out certain operations in parallel. close pool.
call function3() -> create pool of workers and carry out certain operations in parallel. close pool.
and repeat until end.
Now you may ask why I would create pool of workers and close it in each function. The reason is that after completion of one function, I need to combine all the results that were processed in parallel and output some statistical values needed for the next function. So for example, function1() might get the mean which is needed by function2().
Now I realize that creating a pool of workers repeatedly has its costs in Python. I was wondering if there was a way of preserving the workers between function1 and function2, because the nature of the parallelization is exactly the same in both functions.
One way I was thinking of was to create the mp.Pool object in the main process and pass it as an argument to each function, but I'm not sure whether that would be a valid way of doing so. Also, as a side note, I am concerned about the memory consumption of the program.
I am hoping someone can validate my idea or suggest a better way of achieving the same thing.
*edit: I thought it would be more helpful if I included some code.
pool = mp.Pool(processes=min(args.cpu, len(chroms)))
find_and_filter_reads_partial = partial(find_and_filter_reads, path_to_file, cutoff)
filtered_result = pool.map(find_and_filter_reads_partial, chroms)
pool.close()
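For what it's worth, here is a minimal, self-contained sketch of the idea described above, i.e. creating the pool once in the main process and passing it to each function (the function and data names are made up for illustration):

import multiprocessing as mp
from functools import partial

def square(x):
    return x * x

def squared_deviation(x, mean):
    return (x - mean) ** 2

def function1(pool, data):
    # Uses the shared pool; does not close it.
    results = pool.map(square, data)
    return sum(results) / len(results)          # a statistic needed by function2

def function2(pool, data, mean):
    # Reuses the same pool, so the workers are only started once.
    return pool.map(partial(squared_deviation, mean=mean), data)

if __name__ == '__main__':
    data = list(range(10))
    with mp.Pool(processes=4) as pool:          # created once, reused by both functions
        mean = function1(pool, data)
        deviations = function2(pool, data, mean)
    print(mean, deviations)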

Execute Python threads in small groups

I am trying to insert some number (100) of data sets into SQL Server using Python. I am using multi-threading to create 100 threads in a loop. They all start at the same time and this is bogging down the database. I want to group my threads into sets of 5, and once one group is done I would like to start the next group of threads, and so on. As I am new to Python and multi-threading, any help would be highly appreciated. Please find my code below.
for row in datasets:
    argument1 = row[0]
    argument2 = row[1]
    jobs = []
    t = Thread(target=insertDataIntoSQLServer, args=(argument1, argument2,))
    jobs.append(t)
    t.start()

for t in jobs:
    t.join()
On Python 2 and 3 you could use a multiprocessing.pool.ThreadPool. This works like a multiprocessing.Pool, but uses threads instead of processes.
from multiprocessing.pool import ThreadPool

datasets = [(1, 2, 3), (4, 5, 6)]  # Iterable of datasets.

def insertfn(data):
    pass  # shove data to SQL server

pool = ThreadPool()
pool.map(insertfn, datasets)
By default, a Pool will create as many worker threads as your CPU has cores. Using more threads will probably not help, because they will be fighting for CPU time.
Note that I've grouped data into tuples. That is one way to get around the one argument restriction for pool workers.
On Python 3 you can also use a ThreadPoolExecutor.
Note however that on Python implementations (like the "standard" CPython) that have a Global Interpreter Lock, only one thread at a time can be executing Python bytecode. So using large numbers of threads will not automatically increase performance. Threads might help with operations that are I/O bound. If one thread is waiting for I/O, another thread can run.
First note that your code doesn't work as you intended: it sets jobs to an empty list every time through the loop, so after the loop is over you only join() the last thread created.
So repair that, by moving jobs=[] out of the loop. After that, you can get exactly what you asked for by adding this after t.start():
    if len(jobs) == 5:
        for t in jobs:
            t.join()
        jobs = []
I'd personally use some kind of pool (as other answers suggest), but it's easy to directly get what you had in mind.
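Assembled, the repaired version might look like this sketch (the insert function and data are placeholders; note that a final join is still needed because the last group may hold fewer than 5 threads):

from threading import Thread

def insertDataIntoSQLServer(argument1, argument2):
    pass   # placeholder for the real database insert

datasets = [(i, i * 2) for i in range(12)]   # hypothetical sample data

jobs = []                        # moved out of the loop
for row in datasets:
    argument1, argument2 = row[0], row[1]
    t = Thread(target=insertDataIntoSQLServer, args=(argument1, argument2))
    jobs.append(t)
    t.start()
    if len(jobs) == 5:           # wait for each group of 5 before starting more
        for t in jobs:
            t.join()
        jobs = []

for t in jobs:                   # join whatever remains in the final, smaller group
    t.join()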
You can create a ThreadPoolExecutor and specify max_workers=5.
And you can use functools.partial to turn your functions into the required 0-argument functions.
EDIT: You can pass the args in with the function name when you submit to the executor. Thanks, Roland Smith, for reminding me that partial is a bad idea. There was a better way.
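A minimal sketch of that executor approach, with the args passed directly to submit (the insert function and data are placeholders):

from concurrent.futures import ThreadPoolExecutor

def insertDataIntoSQLServer(argument1, argument2):
    pass   # placeholder for the real database insert

datasets = [(i, i * 2) for i in range(100)]   # hypothetical sample data

# At most 5 inserts run at any one time; the executor queues the rest.
with ThreadPoolExecutor(max_workers=5) as executor:
    for argument1, argument2 in datasets:
        executor.submit(insertDataIntoSQLServer, argument1, argument2)
# Leaving the with-block waits for all submitted tasks to finish.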

python keep using same threads in long duration as long running process

I have a model that I want to keep running, parallelized, for over 6 hours:
pool = multiprocessing.Pool(10)
for subModel in self.model:
    pool.apply_async(self.compute, (subModel, features))
pool.close()
pool.join()
The problem right now is that this is too slow, as I have to call pool = multiprocessing.Pool(10) and pool.join() each time to construct and tear down 10 threads, while the model is run tons of times over the 6 hours.
The solution, I believe, is to keep 10 threads running in the background, so that whenever new data comes in it goes into the model right away, without worrying about creating new threads and tearing them down, which wastes a lot of time.
Is there a way in Python that allows you to have a long-running process so that you don't need to start and stop it over and over again?
You don't need to close() and join() (and destroy) the Pool after submitting one set of tasks to it. If you want to make sure that your apply_async() has completed before going on, just call apply() instead, and use the same Pool next time. Or, if you can do other stuff while waiting for the tasks, save the result object returned by apply_async and call wait() on it once you can't go on without it being complete.
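A minimal sketch of that reuse pattern (the compute function and the data here are stand-ins for the real model):

import multiprocessing

def compute(sub_model, features):
    return sub_model * features      # stand-in for the real computation

if __name__ == '__main__':
    pool = multiprocessing.Pool(10)  # created once, kept alive for the whole run
    sub_models = list(range(5))

    for batch in range(3):           # e.g. new data arriving repeatedly over 6 hours
        features = batch + 1
        results = [pool.apply_async(compute, (m, features)) for m in sub_models]
        # Wait for this batch without closing the pool; the workers stay alive.
        print([r.get() for r in results])

    pool.close()                     # only once, when the program is finally done
    pool.join()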

Is this multi-threaded function asynchronous

I'm afraid I'm still a bit confused (despite checking other threads) whether:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
My initial guess is no to both, and that proper asynchronous code should be able to run in one thread; however, it can be improved by adding threads, for example like so:
So I constructed this toy example:
from threading import *
from queue import Queue
import time

def do_something_with_io_lag(in_work):
    out = in_work
    # Imagine we do some work that involves sending
    # something over the internet and processing the output
    # once it arrives
    time.sleep(0.5)  # simulate IO lag
    print("Hello, bee number: ",
          str(current_thread().name).replace("Thread-", ""))

class WorkerBee(Thread):
    def __init__(self, q):
        Thread.__init__(self)
        self.q = q

    def run(self):
        while True:
            # Get some work from the queue
            work_todo = self.q.get()
            # This function will simulate I/O lag
            do_something_with_io_lag(work_todo)
            # Remove task from the queue
            self.q.task_done()

if __name__ == '__main__':
    def time_me(nmbr):
        number_of_worker_bees = nmbr
        worktodo = ['some input for work'] * 50
        # Create a queue
        q = Queue()
        # Fill with work
        [q.put(onework) for onework in worktodo]
        # Launch processes
        for _ in range(number_of_worker_bees):
            t = WorkerBee(q)
            t.start()
        # Block until queue is empty
        q.join()

    # Run this code in serial mode (just one worker)
    %time time_me(nmbr=1)
    # Wall time: 25 s
    # Basically 50 requests * 0.5 seconds IO lag
    # For me everything gets processed by bee number: 59

    # Run this code using multi-tasking (launch 50 workers)
    %time time_me(nmbr=50)
    # Wall time: 507 ms
    # Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
    # Now everything gets processed by different bees
Is it asynchronous?
To me this code does not seem asynchronous because it is Figure 3 in my example diagram. The I/O call blocks the thread (although we don't feel it because they are blocked in parallel).
However, if this is the case I am confused why requests-futures is considered asynchronous since it is a wrapper around ThreadPoolExecutor:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    future_to_url = {executor.submit(load_url, url, 10): url for url in get_urls()}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
Can this function run on just one thread?
Especially when compared to asyncio, which means it can run single-threaded:
There are only two ways to have a program on a single processor do “more than one thing at a time.” Multi-threaded programming is the simplest and most popular way to do it, but there is another very different technique, that lets you have nearly all the advantages of multi-threading, without actually using multiple threads. It’s really only practical if your program is largely I/O bound. If your program is processor bound, then pre-emptive scheduled threads are probably what you really need. Network servers are rarely processor bound, however.
First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction: an object that allows you to refer to a job's result (or exception, which is also a result) in your program after you have assigned the job but before it has completed. It's similar to assigning a common function's result to some variable.
Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous", as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see this in the timing results. And you're right, your function is blocking due to sleep: it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you had one job with sleep and another one without, and ran multiple threads, the one without sleep would perform calculations while the other slept. With a single thread, the jobs are performed in a serial manner one after the other, so when one job sleeps the other jobs wait for it; actually they just don't exist until it's their turn. All this is pretty much proven by your time tests. The thing that happened with print has to do with "thread safety": print uses standard output, which is a single shared resource, so when your multiple threads tried to print at the same time the switching happened inside and you got your strange output. (This also shows the "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.
Asyncio: To better understand the purpose, note the "IO" part: it's not 'async computation' but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about the event loop and generators (coroutines). The event loop is the arbiter that governs the execution of the coroutines (and their callbacks) registered with the loop. Coroutines are implemented as generators, i.e. functions that perform some actions iteratively, saving state at each iteration, 'returning', and on the next call continuing with the saved state. So basically the event loop is a while True: loop that calls all the coroutines/generators assigned to it, one after another, and each one provides a result or no result on each such call; this is what makes "asynchronicity" possible. (This is a simplification, as there are scheduling mechanisms that optimize this behavior.) The event loop in this situation can run in a single thread, and if the coroutines are non-blocking it will give you true "asynchronicity"; but if they are blocking, then it's basically linear execution.
You can achieve the same thing with explicit multithreading, but threads are costly: they require memory to be assigned, switching them takes time, etc. On the other hand, the asyncio API allows you to abstract away the actual implementation and just consider your jobs to be performed asynchronously. Its implementation may differ; it includes calling the OS API, and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is, it works well for IO due to lower-level mechanisms and hardware. On the other hand, performing computation requires explicitly breaking the computation algorithm into pieces to use as asyncio coroutines, so a separate thread might be the better decision, as you can launch the whole computation as one unit there. (I'm not talking about algorithms that are designed for parallel computing.) But the asyncio event loop can be explicitly set to use separate threads for coroutines, so that would be asyncio with multithreading.
Regarding your example: if you implement your function with sleep as an asyncio coroutine, then schedule and run 50 of them single-threaded, you'll get a time similar to your first test, i.e. around 25 s, as it is blocking. If you change it to something like yield from asyncio.sleep(0.5) (which is a coroutine itself), then schedule and run 50 of them single-threaded, they will be called asynchronously: while one coroutine sleeps, another is started, and so on. The jobs will complete in a time similar to your second, multithreaded test, i.e. close to 0.5 s. If you add a print here you'll get well-formed output, as it will be used by a single thread in a serial manner, but the output might come in a different order than the order in which the coroutines were assigned to the loop, as the coroutines could be run in a different order. If you use multiple threads, then the result will obviously be close to the last one anyway.
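A minimal sketch of that coroutine version, written with the modern async/await syntax rather than yield from (the timing comment is a rough expectation, not a guarantee):

import asyncio
import time

async def do_something_with_io_lag(in_work):
    # Non-blocking: the event loop runs other coroutines during the 0.5 s wait.
    await asyncio.sleep(0.5)
    print("Hello from the coroutine handling:", in_work)

async def main():
    worktodo = ['some input for work'] * 50
    start = time.perf_counter()
    # Schedule all 50 coroutines on the single-threaded event loop.
    await asyncio.gather(*(do_something_with_io_lag(w) for w in worktodo))
    print("Wall time:", round(time.perf_counter() - start, 3), "s")   # roughly 0.5 s

asyncio.run(main())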
Simplification: the difference between multithreading and asyncio is in blocking/non-blocking, so basically blocking multithreading will somewhat come close to non-blocking asyncio, but there are a lot of differences.
Multithreading for computations (i.e. CPU bound code)
Asyncio for input/output (i.e. I/O bound code)
Regarding your original statement:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
I hope that I was able to show, that:
asynchronous code might be both single threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"
I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:
Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.
Sleep. Each bee does so independently of all the others, i.e. the sleep durations overlap across all of them; otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.
Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.
Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.
