Closing the multiprocessing after each parallel job - python

I am using Python's multiprocessing library together with emcee, which also uses multiple processes to run MCMC. The problem is that even after I close the pool, Python still seems to be using the processors and slowing down the cores, and I have no idea what the efficient way is to release the cores after the job is done. Could anybody give me an idea of what I should do?
Update:
My code has already been posted here.

Closing a Pool doesn't stop it from doing work; it just prevents new work items from being added to it:
close()
Prevents any more tasks from being submitted to the pool. Once all the
tasks have been completed the worker processes will exit.
So if you have a lot of queued tasks, closing the Pool won't make any difference in resource usage - all the workers will keep consuming those queued tasks until they're gone. If you want to basically abort all the work items immediately, you have to use pool.terminate:
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected terminate() will be called immediately.

Related

Subprocess gone sleeping with python multiprocessing.Pool

I wrote a data analysis program with Python's multiprocessing library for parallelism. As I don't need to control the subprocesses in detail, I used multiprocessing.Pool for simplicity.
However, when running the program, I find that all the sub-processes fall into status S (SLEEPING) after a short period in the active (Running) state.
I investigated the wchan of the processes. The parent process and all but one sub-processes are waiting for _futex, the other one is waiting for pipe_wait.
Some information about my program:
I used multiprocessing.Pool#map to distribute the tasks.
The sub-process tasks involve disk IO and high memory usage. During the course of the program, the total sub-process memory cost may exceed the memory capacity (32 sub-processes, each taking at most 5% of memory). The disk space is ample.
The arguments and return values of the mapped function are not very large in size (just the filenames of the files to be processed, to be specific).
I didn't explicitly create any pipe in my code.
This is the code skeleton of my program.
# imports omitted
def subprocess_task(filename):
    read_the_file(filename)       # Large disk IO
    process_the_data()            # High memory cost
    write_the_file(new_filename)  # Large disk IO
    return new_filename

if __name__ == "__main__":
    files = ["", "", ...]  # The filenames of the files to process; len(files) == 32.
    p = multiprocessing.Pool(32)  # There are more than 32 cores on the computer.
    res = p.map(subprocess_task, files)
    p.close()
    # Do something with res.
So I want to know why the processes get stuck in such a state (especially the pipe_wait one)? Does it have anything to do with the high memory usage, and how do I solve it?
Many thanks!
OK, after some effort digging into pipe(7), the multiprocessing source code, and the log of my troublesome program, I finally identified the problem.
The sole child process in pipe_wait seemed suspicious, and because of it I wasted hours trying to find the blocking pipe. However, the key problem has nothing to do with pipes.
The problem became clear when I put some prints reporting the pid at a few checkpoints in my program. The processes are not the same when the tasks are submitted (which I will refer to as the original processes) and when the program gets stuck (referred to as the stuck processes). One of the original 32 child processes is missing among the stuck processes, and the only stuck process which is in pipe_wait was not present when the tasks were submitted.
So I can guess the reason now. And the multiprocessing source code corresponds with my guess.
As I said, the program costs a lot of memory. At some point, when the system is out of memory, the OOM killer kills one of the child processes, selected by a certain algorithm. The OOM killer is forcible, so the process exits with all of its finishing work undone, which includes the communication with the multiprocessing.Pool.
As the source code indicates, the pool uses one thread to collect the task results, and another to manage the workers. The collector thread passively waits for the result to be sent by the child process, while the worker manager thread actively detects process exit by polling all processes.
Therefore, after the process is killed, the worker manager thread detects it and repopulates the pool by spawning a new process. As no more tasks are submitted, the new process sits in pipe_wait waiting for a new task. That's the sole pipe_wait child process in my problem. Meanwhile, the result collector thread keeps waiting for the result from the killed process, which will never arrive, so the other threads and processes are also left sleeping.
I have no root access to the environment, otherwise this could be further verified by investigating the OOM killer log.
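As an aside (my suggestion, not part of the original answer): one way to make this kind of failure visible instead of hanging is to use concurrent.futures.ProcessPoolExecutor, which raises BrokenProcessPool when a worker exits abruptly. A rough sketch, reusing the subprocess_task and files from the question above:

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

if __name__ == "__main__":
    try:
        with ProcessPoolExecutor(max_workers=32) as executor:
            res = list(executor.map(subprocess_task, files))
    except BrokenProcessPool:
        # A worker died unexpectedly (e.g. killed by the OOM killer); the pool
        # reports the failure instead of waiting forever for a result.
        print("A worker process died unexpectedly; reduce workers or memory use.")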

How does Waitress handle concurrent tasks?

I'm trying to build a python webserver using Django and Waitress, but I'd like to know how Waitress handles concurrent requests, and when blocking may occur.
While the Waitress documentation mentions that multiple worker threads are available, it doesn't provide a lot of information on how they are implemented and how the python GIL affects them (emphasis my own):
When a channel determines the client has sent at least one full valid HTTP request, it schedules a "task" with a "thread dispatcher". The thread dispatcher maintains a fixed pool of worker threads available to do client work (by default, 4 threads). If a worker thread is available when a task is scheduled, the worker thread runs the task. The task has access to the channel, and can write back to the channel's output buffer. When all worker threads are in use, scheduled tasks will wait in a queue for a worker thread to become available.
There doesn't seem to be much information on Stack Overflow either. From the question "Is Gunicorn's gthread async worker analogous to Waitress?":
Waitress has a master async thread that buffers requests, and enqueues each request to one of its sync worker threads when the request I/O is finished.
These statements don't address the GIL (at least from my understanding) and it'd be great if someone could elaborate more on how worker threads work for Waitress. Thanks!
Here's how event-driven asynchronous servers generally work:
Start a process and listen for incoming requests. Utilizing the event notification API of the operating system makes it very easy to serve thousands of clients from a single thread/process.
Since there's only one process managing all the connections, you don't want to perform any slow (or blocking) tasks in this process, because that would block the program for every client.
To perform blocking tasks, the server delegates the tasks to "workers". Workers can be threads (running in the same process) or separate processes (or subprocesses). Now the main process can keep on serving clients while workers perform the blocking tasks.
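A generic sketch of that pattern (not Waitress's actual implementation) using asyncio for the event loop and a small pool of worker threads for the blocking part:

import asyncio
from concurrent.futures import ThreadPoolExecutor

workers = ThreadPoolExecutor(max_workers=4)  # the "worker threads"

def handle_request(raw_request):
    # Blocking work (parsing, database calls, rendering) happens here,
    # off the event loop, in a worker thread.
    return b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"

async def serve_client(reader, writer):
    raw_request = await reader.read(65536)  # non-blocking read in the main loop
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(workers, handle_request, raw_request)
    writer.write(response)
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(serve_client, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())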
How does Waitress handle concurrent tasks?
Pretty much the same way I just described above. And for workers it creates threads, not processes.
how the python GIL affects them
Waitress uses threads for workers. So, yes, they are affected by the GIL in that they aren't truly parallel, even though they run concurrently.
Threads in Python run inside a single process, and because of the GIL only one of them executes Python bytecode at any given moment, so they don't run Python code in parallel. A thread acquires the GIL for a very small amount of time, executes its code, and then the GIL is acquired by another thread.
But since the GIL is released on network I/O, the main async thread can acquire the GIL whenever there's a network event (such as an incoming request), and this way you can stay assured that the GIL will not affect network-bound operations (like receiving requests or sending responses).
On the other hand, Python processes can actually run in parallel on multiple cores. But Waitress doesn't use processes.
Should you be worried?
If you're just doing small blocking tasks like database read/writes and serving only a few hundred users per second, then using threads isn't really that bad.
For serving a large volume of users or doing long running blocking tasks, you can look into using external task queues like Celery. This will be much better than spawning and managing processes yourself.
Hint: Those were my comments to the accepted answer and the conversation below, moved to a separate answer for space reasons.
Wait... The 5th request will stay in the queue until one of the 4 threads is done with its previous handling and has therefore gone back to the pool. One thread will only ever serve one request at a time. "IO bound" tasks only help in that a thread waiting for IO will implicitly (e.g. by calling time.sleep) tell the scheduler (Python's internal one) that it can pass the GIL along to another thread, since there's currently nothing to do, so that the others get more CPU time for their stuff. On the thread level this is fully sequential, which is still concurrent and asynchronous on the process level, just not parallel. Just to get some wording straight.
Also, Python threads are "standard" OS threads (like those in C). So they will use all CPU cores and make full use of them. The only thing restricting them is that they need to hold the GIL when calling Python C-API functions, because the whole API in general is not thread-safe. On the other hand, calls to non-Python functions, i.e. functions in C extensions like numpy for example, but also many database APIs, including anything loaded via ctypes, do not hold the GIL while running. Why should they? They are running external C binaries which don't know anything of the Python interpreter running in the parent process. Therefore, such tasks will run truly in parallel when called from a WSGI app hosted by Waitress. And if you've got more cores available, turn the thread number up to that amount (threads=X kwarg on waitress.create_server).
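For example, a minimal way to raise the thread count (myproject.wsgi is a placeholder for your own WSGI module):

from waitress import serve
from myproject.wsgi import application  # placeholder: your Django WSGI app

# More worker threads help when the work is I/O or GIL-releasing C code.
serve(application, host="0.0.0.0", port=8080, threads=8)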

When should I be using asyncio over regular threads, and why? Does it provide performance increases?

I have a pretty basic understanding of multithreading in Python and an even basic-er understanding of asyncio.
I'm currently writing a small Curses-based program (eventually going to be using a full GUI, but that's another story) that handles the UI and user IO in the main thread, and then has two other daemon threads (each with their own queue/worker-method-that-gets-things-from-a-queue):
a watcher thread that watches for time-based and conditional (e.g. posts to a message board, received messages, etc.) events to occur and then puts required tasks into...
the other (worker) daemon thread's queue which then completes them.
All three threads are continuously running concurrently, which leads me to some questions:
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Thanks!
When the worker thread's queue (or, more generally, any thread's queue) is empty, should it be stopped until it has something to do again, or is it okay to leave it continuously running? Do concurrent threads take up a lot of processing power when they aren't doing anything other than watching their queues?
You should just use a blocking call to queue.get(). That will leave the thread blocked on I/O, which means the GIL will be released, and no processing power (or at least a very minimal amount) will be used. Don't use non-blocking gets in a while loop, since that's going to require a lot more CPU wakeups.
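A minimal sketch of that pattern (the names here are made up; your watcher thread would be the one calling task_queue.put):

import queue
import threading

task_queue = queue.Queue()
STOP = object()  # sentinel telling the worker to exit

def worker():
    while True:
        task = task_queue.get()  # blocks (releasing the GIL) until an item arrives
        if task is STOP:
            break
        task()                   # each task is assumed to be a callable
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()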
Should the two threads' queues be combined? Since the watcher thread is continuously running a single method, I guess the worker thread would be able to just pull tasks from the single queue that the watcher thread puts in.
If all the watcher is doing is pulling things off a queue and immediately putting them into another queue, where they get consumed by a single worker, it sounds like unnecessary overhead - you may as well just consume them directly in the worker. It's not exactly clear to me if that's the case, though - is the watcher consuming from a queue, or just putting items into one? If it is consuming from a queue, who is putting stuff into it?
I don't think it'll matter since I'm not multiprocessing, but is this setup affected by Python's GIL (which I believe still exists in 3.4) in any way?
Yes, this is affected by the GIL. Only one of your threads can run Python bytecode at a time, so you won't get true parallelism, except when threads are doing I/O (which releases the GIL). If your worker thread is doing CPU-bound work, you should seriously consider running it in a separate process via multiprocessing, if possible.
Should the watcher thread be running continuously like that? From what I understand, and please correct me if I'm wrong, asyncio is supposed to be used for event-based multithreading, which seems relevant to what I'm trying to do.
It's hard to say, because I don't know exactly what "running continuously" means. What is it doing continuously? If it spends most of its time sleeping or blocking on a queue, it's fine - both of those things release the GIL. If it's constantly doing actual work, that will require the GIL, and therefore degrade the performance of the other threads in your app (assuming they're trying to do work at the same time). asyncio is designed for programs that are I/O-bound, and can therefore be run in a single thread, using asynchronous I/O. It sounds like your program may be a good fit for that depending on what your worker is doing.
The main thread is basically always just waiting for the user to press a key to access a different part of the menu. This seems like a situation asyncio would be perfect for, but, again, I'm not sure.
Any program where you're mostly waiting for I/O is potentially a good fit for asyncio - but only if you can find a library that makes curses (or whatever other GUI library you eventually choose) play nicely with it. Most GUI frameworks come with their own event loop, which will conflict with asyncio's. You would need a library that can make the GUI's event loop play nicely with asyncio's event loop. You'd also need to make sure that you can find asyncio-compatible versions of any other synchronous-I/O based libraries your application uses (e.g. a database driver).
That said, you're not likely to see any kind of performance improvement by switching from your thread-based program to something asyncio-based. It'll likely perform about the same. Since you're only dealing with 3 threads, the overhead of context switching between them isn't very significant, so switching from that to a single-threaded, asynchronous I/O approach isn't going to make a very big difference. asyncio will help you avoid thread synchronization complexity (if that's an issue with your app - it's not clear that it is), and at least theoretically it would scale better if your app potentially needed lots of threads, but it doesn't seem like that's the case. I think for you it basically comes down to which style you prefer to code in (assuming you can find all the asyncio-compatible libraries you need).

Managing Python multiprocessing processes with different memory usage

I use a simple RabbitMQ queue to distribute tasks to worker processes. Each worker process uses a pool of multiprocessing instances to work on multiple tasks at the same time, to use the memory and the CPU as much as possible.
The problem is that some of the tasks take much more RAM than the others, so that the worker process would crash if it started more than one instance of them. But while the worker is working on a RAM-intensive task, I'd like it to work on other, less RAM-intensive tasks to use the rest of the CPUs.
One idea would be to use multiple queues or topics but I am wondering what the recommended approach is. Can I catch out of memory errors before they crash the process?
What would be the right approach to solve this?
Update:
The whole system will consist of multiple multi-core machines, but on each machine there is only one worker program running, which creates as many multiprocessing instances as there are cores. The different machines should be independent of each other, except that they get their tasks from the same queue.
I think trying to catch and recover from OOM errors will be very difficult, if not impossible. You would need a thread or process running that constantly monitors memory usage, and when it detects that it's too high, does... what exactly? Kill a process that's processing a task? Try to pause it (if that's even possible; it may not be, depending on what your tasks are doing)? Even then, pausing it isn't going to release any memory. You'd have to release the memory and restart the task when it's safe, which means you'd have to requeue it, decide when it's safe, etc.
Instead of trying to detect and recover from the problem, I would recommend trying to avoid it altogether. Create two queues and two pools: one queue/pool for high-memory tasks, and another queue/pool for low-memory tasks. The high-memory pool would only have a single process in it, so it would be limited to running one task at a time, which saves your memory. The low-memory pool would have multiprocessing.cpu_count() - 1 processes, allowing you to keep your CPUs saturated across the two pools.
One potential issue with this approach is that if you exhaust the high-memory queue while still having low-memory tasks pending, you'll be wasting one of your CPUs. You could handle this by consuming from the high-memory queue in a non-blocking way (or with a timeout), so that if the high-memory queue is empty when you're ready to consume a task, you can grab a low-memory task instead. Then, when you're done processing it, check the high-memory queue again.
Something like this:
import multiprocessing

# hi_q and lo_q are placeholders for whatever library you're using to consume from RabbitMQ

def high_mem_consume():
    while True:
        task = hi_q.consume(timeout=2)
        if not task:
            # No high-memory task available; fall back to a low-memory one.
            task = lo_q.consume(timeout=2)
        if task:
            process_task(task)

def low_mem_consume():
    while True:
        task = lo_q.consume()  # Blocks forever
        process_task(task)

if __name__ == "__main__":
    num_low = multiprocessing.cpu_count() - 1
    hi_pool = multiprocessing.Pool(1)
    lo_pool = multiprocessing.Pool(num_low)
    hi_pool.apply_async(high_mem_consume)
    # Start one consumer loop per low-memory worker.
    for _ in range(num_low):
        lo_pool.apply_async(low_mem_consume)

Threads vs. Async

I have been reading up on the threaded model of programming versus the asynchronous model from this really good article: http://krondo.com/blog/?p=1209
However, the article mentions the following points.
An async program will simply outperform a sync program by switching between tasks whenever there is I/O.
Threads are managed by the operating system.
I remember reading that threads are managed by the operating system by moving TCBs around between the ready queue and the waiting queue (amongst other queues). In this case, threads don't waste time waiting either, do they?
In light of the above mentioned, what are the advantages of async programs over threaded programs?
It is very difficult to write code that is thread safe. With asynchronous code, you know exactly where the code will shift from one task to the next, and race conditions are therefore much harder to come by.
Threads consume a fair amount of memory, since each thread needs to have its own stack. With async code, all the code shares the same stack, and the stack is kept small by continuously unwinding it between tasks.
Threads are OS structures and therefore require more memory for the platform to support. There is no such problem with asynchronous tasks.
Update 2022:
Many languages now support stackless coroutines (async/await). This allows us to write a task almost synchronously, while yielding to other tasks (awaiting) at set places (sleeping or waiting for networking or other threads).
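For instance, a tiny Python illustration of such coroutines (the task names and delays are invented):

import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)  # the coroutine yields here so other tasks can run
    return f"{name} done"

async def main():
    # Both coroutines wait concurrently, so this takes ~2s rather than ~3s.
    print(await asyncio.gather(fetch("a", 1), fetch("b", 2)))

asyncio.run(main())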
There are two ways to create threads:
synchronous threading - the parent creates one (or more) child threads and then must wait for each child to terminate. Synchronous threading is often referred to as the fork-join model.
asynchronous threading - the parent and child run concurrently/independently of one another. Multithreaded servers typically follow this model.
resource - http://www.amazon.com/Operating-System-Concepts-Abraham-Silberschatz/dp/0470128720
Assume you have 2 tasks which do not involve any IO (on a multiprocessor machine). In this case, threads outperform async, because an async program, like a single-threaded one, executes your tasks in order, whereas threads can execute both tasks simultaneously.
Assume you have 2 tasks which involve IO (on a multiprocessor machine). In this case, async and threads perform more or less the same (performance might vary based on the number of cores, scheduling, how CPU-intensive the tasks are, etc.). Async also takes fewer resources, has lower overhead, and is less complex to program than a multithreaded program.
How does it work?
Thread 1 executes Task 1; since it is waiting for IO, it is moved to the IO waiting queue. Similarly, Thread 2 executes Task 2; since it also involves IO, it is moved to the IO waiting queue. As soon as its IO request is resolved, a thread is moved to the ready queue so the scheduler can schedule it for execution.
Async executes Task 1 and, without waiting for its IO to complete, continues with Task 2; then it waits for the IO of both tasks to complete. It completes the tasks in the order of IO completion.
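A minimal Python sketch of the threaded IO case described above (time.sleep stands in for the IO wait; the task names and delays are invented):

import threading
import time

def io_task(name, delay):
    time.sleep(delay)  # stands in for a blocking IO wait, during which the GIL is released
    print(f"{name} done")

# Both threads wait on their "IO" at the same time,
# so the total wall-clock time is about 2 seconds, not 3.
t1 = threading.Thread(target=io_task, args=("Task 1", 1))
t2 = threading.Thread(target=io_task, args=("Task 2", 2))
t1.start(); t2.start()
t1.join(); t2.join()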
Async is best suited for tasks which involve web service calls, database queries, etc.; threads are better for CPU-intensive tasks.
The video below explains the async vs. threaded model and also when to use each:
https://www.youtube.com/watch?v=kdzL3r-yJZY
Hope this is helpful.
First of all, note that a lot of the details of how threads are implemented and scheduled are very OS-specific. In general, you shouldn't need to worry about threads waiting on each other, since the OS and the hardware will attempt to arrange for them to run efficiently, whether asynchronously on a single-processor system or in parallel on multiprocessors.
Once a thread has finished waiting for something, say I/O, it can be thought of as runnable. Threads that are runnable will be scheduled for execution at some point soon. Whether this is implemented as a simple queue or something more sophisticated is, again, OS- and hardware-specific. You can think of the set of blocked threads as a set rather than as a strictly ordered queue.
Note that on a single-processor system, asynchronous programs as defined here are equivalent to threaded programs.
see http://en.wikipedia.org/wiki/Thread_(computing)#I.2FO_and_scheduling
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.
According to this, your whole process might be blocked, and no thread will be scheduled when one thread is blocked on IO. I think this is OS-specific and will not always hold.
To be fair, let's point out the benefits of threads under the CPython GIL compared to the async approach:
it's easier first to write typical code that has one flow of events (no parallel execution) and then to run multiple copies of it in separate threads: it will keep each copy responsive, while the benefit of executing all I/O operations in parallel will be achieved automatically;
many time-proven libraries are sync and are therefore easy to include in the threaded version, but not in the async one;
some sync libraries actually release the GIL at the C level, which allows parallel execution for tasks beyond I/O-bound ones: e.g. NumPy (see the sketch at the end of this answer);
it's harder to write async code in general: the inclusion of a heavy CPU-bound section will make concurrent tasks unresponsive, or one may forget to await a result and finish execution earlier.
So if there are no immediate plans to scale your services beyond ~100 concurrent connections it may be easier to start with a threaded version and then rewrite it... using some other more performant language like Go.
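To illustrate the third point above (array sizes and worker counts are arbitrary): because NumPy releases the GIL inside its compiled routines, plain threads can get real parallel speedups for this kind of work.

from concurrent.futures import ThreadPoolExecutor
import numpy as np

matrices = [np.random.rand(2000, 2000) for _ in range(4)]

def square(m):
    # The matrix multiplication runs in compiled BLAS code that releases the GIL,
    # so several of these calls can execute on different cores at the same time.
    return m @ m

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, matrices))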
Async I/O means there is already a thread in the driver that does the job, so you are duplicating functionality and incurring some overhead. On the other hand, often it is not documented how exactly the driver thread behaves, and in complex scenarios, when you want to control timeout/cancellation/start/stop behaviour, synchronization with other threads, it makes sense to implement your own thread. It is also sometimes easier to reason in sync terms.
