I use a simple RabbitMQ queue to distribute tasks to worker processes. Each worker process uses a pool of multiprocessing instances to work on multiple tasks at the same time, to use the memory and the CPU as much as possible.
The problem is that some of the tasks take much more RAM than others, so the worker process would crash if it started more than one instance. But while the worker is working on a RAM-intensive task, I'd like it to work on other, less RAM-intensive tasks to use the rest of the CPUs.
One idea would be to use multiple queues or topics, but I am wondering what the recommended approach is. Can I catch out-of-memory errors before they crash the process?
What would be the right approach to solve this?
[Update]
The whole system will consist of multiple multi-core machines, but on each machine there is only one worker program running, which creates as many multiprocessing instances as there are cores. The different machines should be independent of each other, except that they get their tasks from the same queue.
I think trying to catch and recover from OOM errors will be very difficult, if not impossible. You would need a thread or process running that constantly monitors memory usage, and when it detects it's too high, does...what exactly? Kill a process that's processing a task? Try to pause it (if that's even possible; it may not be, depending on what your tasks are doing)? Even then, pausing it isn't going to release any memory. You'd have to release the memory and restart the task when it's safe, which means you'd have to requeue it, decide when it's safe, etc.
Instead of trying to detect and recover from the problem, I would recommend trying to avoid it altogether. Create two queues and two pools: one queue/pool for high-memory tasks, and another queue/pool for low-memory tasks. The high-memory pool would only have a single process in it, so it would be limited to running one task concurrently, which saves your memory. The low-memory pool would have multiprocessing.cpu_count() - 1 processes, allowing you to keep your CPUs saturated across the two pools.
One potential issue with this approach is that if you exhaust the high-memory queue while still having low-memory tasks pending, you'll be wasting one of your CPUs. You could handle this by consuming from the high-memory queue in a non-blocking way (or with a timeout), so that if the high-memory queue is empty when you're ready to consume a task, you can grab a low-memory task instead. Then, when you're done processing it, check the high-memory queue again.
Something like this:
import multiprocessing

# hi_q and lo_q are placeholders for whatever library you're using to consume from RabbitMQ

def high_mem_consume():
    while True:
        task = hi_q.consume(timeout=2)
        if not task:
            task = lo_q.consume(timeout=2)
        if task:
            process_task(task)

def low_mem_consume():
    while True:
        task = lo_q.consume()  # Blocks forever
        process_task(task)

if __name__ == "__main__":
    hi_pool = multiprocessing.Pool(1)
    lo_pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)
    hi_pool.apply_async(high_mem_consume)
    lo_pool.apply_async(low_mem_consume)
I have a question about multithreading in Python.
I have already tried multithreading and multiprocessing in Python.
What I get is:
With multithreading, I get duplicate results when running it in parallel. After some research, I found that in multithreading, the threads can update the same variable (a race condition).
Meanwhile, with multiprocessing, it runs smoothly, without the problem I see with multithreading.
The question:
Can I use multithreading, but with behaviour like multiprocessing? I need to migrate more than 2 million records, and I need to run that function concurrently as much as possible (my laptop only has 4 cores), which is why I want to use multithreading.
Can someone explain the question above to me?
In multithreading, each thread shares the same memory space as the parent process that spawned it. In multiprocessing, each process has its own memory space.
Because of this, in multithreading you need to use a lock (a semaphore/mutex, e.g. threading.Lock()) to prevent race conditions. That is not to say that multiprocessing cannot have race conditions; it can, if you explicitly share the same object rather than a copy of it. But by default it will copy the object.
Multithreading is also limited by Python's GIL (Global Interpreter Lock), which ensures that only one thread runs Python code at a time. So if you have a computationally intensive task running on two threads, it doesn't really get faster, as only one can be active at a time.
Multiprocessing overcomes this easily, as it runs multiple processes; each process is handled by the OS's scheduler and runs in parallel.
General rule of thumb:
if your task is computationally intensive, use processes
if your task is I/O intensive, use threads
if your threads need concurrent access to the same variable/object, etc., you need to use a lock (see the sketch below)
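For that last rule, here is a minimal sketch of protecting shared state with threading.Lock; the migrate_batch function and the shared counter are purely illustrative, not code from the question:

import threading

migrated = 0
migrated_lock = threading.Lock()

def migrate_batch(records):
    global migrated
    for record in records:
        # ... do the real per-record migration work here ...
        with migrated_lock:   # only one thread may update the shared counter at a time
            migrated += 1

threads = [threading.Thread(target=migrate_batch, args=(chunk,))
           for chunk in ([1, 2, 3], [4, 5])]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(migrated)  # always 5; without the lock, updates could be lost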
The following is my multiprocessing code. regressTuple has around 2000 items, so the code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing, in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable for the parallel thread count (hard-coded) and loop through threading several times until completion? That way, after a few experiments, I would be able to determine the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multithreading in my implementation.)
Here is my code:
from multiprocessing import Process

regressTuple = [(x,) for x in regressList]
processes = []
for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))
for process in processes:
    process.start()
for process in processes:
    process.join()
There are multiple things we need to keep in mind:
The number of processes you can spawn is not limited by the number of cores on your system, but by the ulimit for your user ID, which controls the total number of processes that can be launched by your user ID.
The number of cores determines how many of those launched processes can actually run in parallel at one time.
Your system may be crashing because the target function these processes run is doing something heavy and resource-intensive that the system cannot handle when many processes run simultaneously, or because the nproc limit on the system has been exhausted and the kernel can no longer spawn new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even if you have a 16-core Intel Skylake machine. Creating a new process is not a lightweight task: a number of things such as generating the PID, allocating memory, creating the address space, scheduling the process, context switching, and managing its entire life cycle happen in the background. So creating a new process is a heavy operation for the kernel.
Unfortunately, I guess what you are trying to do is a CPU-bound task and is hence limited by the hardware you have on the machine. Spawning more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process the input data
    ...

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing, in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency; adding system performance tests to the run queue would be an overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable for the parallel thread count (hard-coded) and loop through threading several times until completion? That way, after a few experiments, I would be able to determine the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multithreading in my implementation.)
Use a pool, or take the available system memory, divide by ~3 MB, and see how many tasks you can run at once.
This is probably more of a sysadmin task, balancing the bottlenecks against the queue length; but generally, if your tasks are I/O-bound, there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn onto the road. The tasks will then fight with each other for the next block of I/O.
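If you want to size the pool from available memory rather than just core count, here is a rough sketch; it assumes the psutil package is installed and uses the ~3 MB-per-task figure above purely as a placeholder estimate:

import multiprocessing
import psutil  # assumed available: pip install psutil

PER_TASK_BYTES = 3 * 1024 * 1024  # rough per-task memory estimate; tune for your workload

by_memory = psutil.virtual_memory().available // PER_TASK_BYTES
by_cpu = multiprocessing.cpu_count()
pool_size = max(1, min(by_cpu, by_memory))  # respect both the CPU and the memory limit

pool = multiprocessing.Pool(processes=pool_size)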
I have a lot of tasks that I'd like to execute a few at a time. The normal solution for this is a thread pool. However, my tasks need resources that only certain threads have. So I can't just farm a task out to any old thread; the thread has to have the resource the task needs.
It seems like there should be a concurrency pattern for this, but I can't seem to find it. I'm implementing this in Python 2 with multiprocessing, so answers in those terms would be great, but a generic solution is fine. In my case the "threads" are actually separate OS processes and the resources are network connections (and no, it's not a server, so (e)poll/select is not going to help). In general, a thread/process can hold several resources.
Here is a naive solution: put the tasks in a work queue and turn my thread pool loose on it. Have each thread check, "Can I do this task?" If yes, do it; if no, put it back in the queue. However, if each task can only be done by one of N threads, then I'm doing ~2N expensive, wasted accesses to a shared queue just to get one unit of work.
Here is my current thought: have a shared work queue for each resource. Farm out tasks to the matching queue. Each thread checks the queue(s) it can handle.
Ideas?
A common approach to this is to not allocate resources to threads at all, but to queue the appropriate resource in with the data, though I appreciate that this is not always possible if a resource is bound to a particular thread.
The idea of using a queue per resource, with each thread only popping objects from the queues it can handle, may work.
It may be possible to use a semaphore + concurrent-queue array, indexed by resource, for signalling such threads and also providing a priority system, eliminating most of the polling and wasteful requeueing. I will have to think a bit more about that; it somewhat depends on how the resources map to the threads.
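A rough Python 3 sketch of the queue-per-resource idea with multiprocessing; the resource names and the handle_task function are placeholders, and a real dispatcher would route each task to the queue for the resource it needs:

import multiprocessing
from queue import Empty

def worker(my_resources, queues):
    # A worker only consumes from the queues for the resources it actually holds.
    while True:
        for name in my_resources:
            try:
                task = queues[name].get(timeout=1)
            except Empty:
                continue  # nothing pending for this resource right now
            handle_task(name, task)  # placeholder for the real work

if __name__ == "__main__":
    queues = {name: multiprocessing.Queue() for name in ("conn_a", "conn_b")}
    queues["conn_a"].put("task needing conn_a")  # dispatcher routes by required resource
    workers = [
        multiprocessing.Process(target=worker, args=(["conn_a"], queues)),
        multiprocessing.Process(target=worker, args=(["conn_a", "conn_b"], queues)),
    ]
    for w in workers:
        w.start()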
I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serially), but is embarrassingly parallel. For this reason, I was using Pool.map to split it up and do the work in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-core chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have child processes, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command line program. That program can use Pool as it pleases, and then saves the result figures & data to a file (just as it does now). Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: first of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the threads are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken on another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high-level view, a WFMS sits on top of a task pool like Celery and submits the tasks that are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in that nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I use multiprocess daemons based on Twisted with forking, and query Gearman jobs normally.
Take a look at Gearman.
I have been reading up on the threaded model of programming versus the asynchronous model from this really good article. http://krondo.com/blog/?p=1209
However, the article mentions the following points.
An async program will simply outperform a sync program by switching between tasks whenever there is I/O.
Threads are managed by the operating system.
I remember reading that threads are managed by the operating system by moving TCBs between the ready queue and the waiting queue (amongst other queues). In this case, threads don't waste time waiting either, do they?
In light of the above, what are the advantages of async programs over threaded programs?
It is very difficult to write code that is thread-safe. With asynchronous code, you know exactly where the code will shift from one task to the next, so race conditions are much harder to come by.
Threads consume a fair amount of memory, since each thread needs its own stack. With async code, all the code shares the same stack, and the stack is kept small because it is continuously unwound between tasks.
Threads are OS structures and therefore require more memory for the platform to support. There is no such problem with asynchronous tasks.
Update 2022:
Many languages now support stackless coroutines (async/await). This allows us to write a task almost synchronously while yielding to other tasks (awaiting) at set places (sleeping, or waiting for networking or other threads).
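As a minimal illustration, an asyncio sketch: each coroutine reads almost like synchronous code and only yields control at the await points (asyncio.sleep stands in for real I/O here):

import asyncio

async def fetch(name, delay):
    print(name, "started")
    await asyncio.sleep(delay)  # control yields to other tasks here
    print(name, "finished")

async def main():
    # Both coroutines run concurrently on a single thread;
    # total runtime is ~1 second, not 1.5.
    await asyncio.gather(fetch("a", 1.0), fetch("b", 0.5))

asyncio.run(main())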
There are two ways to create threads:
synchronous threading - the parent creates one (or more) child threads and then must wait for each child to terminate. Synchronous threading is often referred to as the fork-join model (see the sketch below).
asynchronous threading - the parent and child run concurrently/independently of one another. Multithreaded servers typically follow this model.
resource - http://www.amazon.com/Operating-System-Concepts-Abraham-Silberschatz/dp/0470128720
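A minimal fork-join sketch with Python's threading module, where the parent waits for every child to terminate (the child function is just a placeholder):

import threading

def child(n):
    print("child", n, "working")

threads = [threading.Thread(target=child, args=(i,)) for i in range(3)]
for t in threads:
    t.start()   # fork: the children now run concurrently with the parent
for t in threads:
    t.join()    # join: the parent waits for each child to terminate
print("parent resumes only after all children have finished")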
Assume you have 2 tasks which do not involve any I/O (on a multiprocessor machine).
In this case threads outperform async, because an async program, like a single-threaded program, executes the tasks in order, whereas threads can execute both tasks simultaneously.
Assume you have 2 tasks which involve I/O (on a multiprocessor machine).
In this case async and threads perform more or less the same (performance may vary depending on the number of cores, scheduling, how CPU-intensive the tasks are, etc.). Async also takes fewer resources, has lower overhead, and is less complex to program than a multithreaded program.
How does it work?
Threads: Thread 1 executes Task 1; since it is waiting for I/O, it is moved to the I/O waiting queue. Similarly, Thread 2 executes Task 2; since it also involves I/O, it is moved to the I/O waiting queue. As soon as its I/O request is resolved, a thread is moved back to the ready queue so the scheduler can schedule it for execution.
Async: it executes Task 1 and, without waiting for its I/O to complete, continues with Task 2; it then waits for the I/O of both tasks to complete. It completes the tasks in the order of I/O completion.
Async is best suited for tasks that involve web service calls, database queries, and the like; threads are better for CPU-intensive tasks.
The video below explains the async vs. threaded model and when to use each:
https://www.youtube.com/watch?v=kdzL3r-yJZY
Hope this is helpful.
First of all, note that a lot of the details of how threads are implemented and scheduled are very OS-specific. In general, you shouldn't need to worry about threads waiting on each other, since the OS and the hardware will attempt to arrange for them to run efficiently, whether asynchronously on a single-processor system or in parallel on multi-processors.
Once a thread has finished waiting for something, say I/O, it can be thought of as runnable. Threads that are runnable will be scheduled for execution at some point soon. Whether this is implemented as a simple queue or something more sophisticated is, again, OS- and hardware-specific. You can think of the set of blocked threads as a set rather than as a strictly ordered queue.
Note that on a single-processor system, asynchronous programs as defined here are equivalent to threaded programs.
see http://en.wikipedia.org/wiki/Thread_(computing)#I.2FO_and_scheduling
However, the use of blocking system calls in user threads (as opposed to kernel threads) or fibers can be problematic. If a user thread or a fiber performs a system call that blocks, the other user threads and fibers in the process are unable to run until the system call returns. A typical example of this problem is when performing I/O: most programs are written to perform I/O synchronously. When an I/O operation is initiated, a system call is made, and does not return until the I/O operation has been completed. In the intervening period, the entire process is "blocked" by the kernel and cannot run, which starves other user threads and fibers in the same process from executing.
According to this, your whole process might be blocked, and no thread will be scheduled, when one thread is blocked in I/O. I think this is OS-specific and will not always hold.
To be fair, let's point out the benefits of threads under the CPython GIL compared to the async approach:
it's easier to first write typical code that has one flow of events (no parallel execution) and then run multiple copies of it in separate threads: each copy stays responsive, while the benefit of executing all I/O operations in parallel is achieved automatically (see the sketch after this list);
many time-proven libraries are synchronous and are therefore easy to include in a threaded version, but not in an async one;
some synchronous libraries actually release the GIL at the C level, which allows parallel execution even for tasks beyond I/O-bound ones, e.g. NumPy;
it's harder to write async code in general: including a heavy CPU-bound section will make concurrent tasks unresponsive, and it's easy to forget to await a result and finish execution early.
So if there are no immediate plans to scale your services beyond ~100 concurrent connections it may be easier to start with a threaded version and then rewrite it... using some other more performant language like Go.
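A minimal sketch of that first point: an ordinary synchronous function run in several threads with concurrent.futures, so the I/O waits overlap automatically. The fetch function and example.com URLs are placeholders, not anything from the answers above:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    # Plain blocking code; the GIL is released while waiting on the socket,
    # so several of these calls overlap in time when run in threads.
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

urls = ["https://example.com"] * 3  # placeholder URLs
with ThreadPoolExecutor(max_workers=3) as pool:
    sizes = list(pool.map(fetch, urls))
print(sizes)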
Async I/O means there is already a thread in the driver that does the job, so you are duplicating functionality and incurring some overhead. On the other hand, it is often not documented how exactly the driver thread behaves, and in complex scenarios, when you want to control timeout/cancellation/start/stop behaviour or synchronization with other threads, it makes sense to implement your own thread. It is also sometimes easier to reason in sync terms.