I wrote a data analysis program using Python's multiprocessing library for parallelism. As I don't need to control the subprocesses in detail, I used multiprocessing.Pool for simplicity.
However, when running the program, I find that all the sub-processes fall into status S (SLEEPING) after a short period in the active (Running) state.
I investigated the wchan of the processes. The parent process and all but one of the sub-processes are waiting on _futex; the remaining one is waiting on pipe_wait.
Some information about my program:
I used multiprocessing.Pool#map to distribute the tasks.
The sub-process tasks involve heavy disk IO and high memory usage. Over the course of the program, the combined memory usage of the sub-processes may exceed the memory capacity (32 sub-processes, each taking at most 5% of memory). Disk space is ample.
The arguments and return values of the mapped function are small (specifically, just the filenames of the files to be processed).
I didn't explicitly create any pipe in my code.
This is the code skeleton of my program.
# imports omitted

def subprocess_task(filename):
    read_the_file(filename)       # Large disk IO
    process_the_data()            # High memory cost
    write_the_file(new_filename)  # Large disk IO
    return new_filename

if __name__ == "__main__":
    files = ["", "", ...]  # The filenames of the files to process, len(files) == 32.
    p = multiprocessing.Pool(32)  # There are more than 32 cores on the computer.
    res = p.map(subprocess_task, files)
    p.close()
    # Do something with res.
So I want to know why the processes get stuck in such a state (especially the pipe_wait one). Does it have anything to do with the high memory usage, and how do I solve it?
Many thanks!
OK, after some effort digging into pipe(7), the multiprocessing source code, and the logs of my troublesome program, I finally identified the problem.
The sole child process in pipe_wait seemed suspicious, and because of it I wasted hours trying to find the blocking pipe. However, the key problem has nothing to do with pipes.
The problem became clear once I put some prints reporting the pid at a few checkpoints in my program. The processes are not the same when the tasks are submitted (which I will refer to as the original processes) and when the program gets stuck (referred to as the stuck processes). One of the original 32 child processes is missing among the stuck processes, and the only stuck process in pipe_wait is not present when the tasks are submitted.
So I can guess the reason now, and the multiprocessing source code supports my guess.
As I said, the program uses a lot of memory. At some point, when the system is out of memory, the OOM killer kills one of the child processes, selected by a certain algorithm. The OOM kill is forcible, so the process exits with all its cleanup undone, including the communication with the multiprocessing.Pool.
As the source code indicates, the pool uses one thread to collect the task results and another to manage the workers. The collector thread passively waits for each result to be sent by a child process, while the worker-manager thread actively detects process exits by polling all the processes.
Therefore, after the process is killed, the worker-manager thread detects it and repopulates the pool by spawning a new process. As no more tasks are submitted, the new process sits in pipe_wait waiting for a new task; that's the sole pipe_wait child process in my problem. Meanwhile, the result-collector thread keeps waiting for the result from the killed process, which will never arrive, so the other processes also stay sleeping.
I have no root access to the environment, or this could be verified further by investigating the OOM killer's log.
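For what it's worth, the hang can be reproduced without an actual OOM event by SIGKILLing a worker by hand, which is effectively what the OOM killer does. A minimal sketch (the task function and pool size are made up for illustration); with multiprocessing.Pool, map never returns because the killed worker's result never arrives:

import multiprocessing
import os
import signal
import time

def task(i):
    if i == 0:
        # Simulate the OOM killer: SIGKILL gives the worker no chance to
        # send its result back through the pool's result pipe.
        os.kill(os.getpid(), signal.SIGKILL)
    time.sleep(1)
    return i

if __name__ == "__main__":
    with multiprocessing.Pool(2) as p:
        print(p.map(task, range(2)))  # hangs: task 0's result never arrives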
Related
Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I would be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
regressTuple = [(x,) for x in regressList]
processes = []

for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spawn is not limited by the number of cores on your system but by the ulimit for your user id, which controls the total number of processes that can be launched under your user id (a quick way to check this from Python is sketched after these points).
The number of cores determines how many of those launched processes can actually be running in parallel at one time.
Your system may be crashing because the target function that these processes run is doing something heavy and resource-intensive that the system cannot handle when multiple processes run simultaneously, or because the nproc limit on the system is exhausted and the kernel cannot spawn new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even if you have a 16-core Intel Skylake machine, because creating a new process on the system is not a lightweight task. There are a number of things, like generating the pid, allocating memory, address-space generation, scheduling the process, context switching, and managing its entire life cycle, that happen in the background, so it is a heavy operation for the kernel to create a new process.
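As a quick check of the per-user process limit mentioned above, a small sketch using the standard resource module (Linux/Unix-specific):

import resource

# Soft and hard per-user limits on the number of processes/threads,
# the same numbers `ulimit -u` reports in a shell.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("per-user process limit:", soft, hard)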
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spawning more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
def target_func(data):
    ...  # process the input data

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency, and adding system performance tests to the run queue would be overhead.
Is there an API call in Python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion? In this way, after a few experiments, I would be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3MB, and see how many tasks you can run at once (a rough sketch follows below).
This is probably more of a sysadmin task to balance the bottlenecks against the queue length, but generally, if your tasks are IO-bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn onto the road. The tasks will then fight with each other for the next block of IO.
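For that memory-based estimate, a rough Linux-specific sketch (the ~3 MB per-task figure is just the guess above, read against MemAvailable from /proc/meminfo):

PER_TASK_BYTES = 3 * 1024 * 1024  # the rough per-task figure guessed above

def max_concurrent_tasks():
    # MemAvailable is the kernel's estimate of memory usable without swapping.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                available_kb = int(line.split()[1])
                return max(1, (available_kb * 1024) // PER_TASK_BYTES)
    return 1  # fall back to a single task if the field is missing

print(max_concurrent_tasks())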
I'm trying to simulate a network (TCP only) using Python's multiprocessing, multi-threading, and raw sockets.
What am I doing?
Create a child process which sniffs the network for new connections.
A child process handles 1000 connections and terminates itself after spawning another child process (via the main process, obviously) which will do the same job.
All the connection information is stored in a dictionary that is process-specific.
Per connection, I'm creating a Timer thread which deletes the connection from the dictionary if the connection is idle for 5 seconds; see the sketch after this list. (And I'm returning from the thread as well.)
While deleting the connection, I de-reference the timer thread and all other dictionary parameters for that connection.
After deleting the last element from the dictionary, I call gc.collect() to collect the garbage and os._exit(0) so that the child process terminates. (Keep in mind that another sibling process has already taken over.)
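A minimal sketch of the per-connection Timer pattern described above (all names are hypothetical stand-ins for the real code; resetting the timer on activity is omitted):

import threading

connections = {}  # connection id -> (state, timer)
lock = threading.Lock()

def expire(conn_id):
    # Runs on the Timer thread after 5 idle seconds: drop the connection
    # state so it can be collected, then simply return (ending the thread).
    with lock:
        connections.pop(conn_id, None)

def register(conn_id, state):
    # One Timer thread per connection, as described above.
    with lock:
        timer = threading.Timer(5.0, expire, args=(conn_id,))
        connections[conn_id] = (state, timer)
        timer.start()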
Why am I making it so complicated?
Each connection needs to have its own Timer as it has to die after 5 secs of inactivity.
I've tried a single process handling all the connections, but it kept eating up memory (even though I followed the above method, the memory was not released). In the end, the machine, with 4 GB of memory, used to hang, leaving me unable to use it (keyboard and mouse interrupts were very, very slow).
So I made each child process handle only 1000 connections and then terminate, which releases the memory (I've observed this, as the memory used by Python mostly remains constant).
Since I'm terminating the child process, all threads associated with it should be deleted. (I've read that a thread is not removed in Python unless its parent process dies, which is not the case here.)
What is my problem?
Many times I see an error (multiple times per child process): Can't start new thread, even though I'm explicitly terminating the child process using os._exit(). I know there may be a limit on creating new threads, but I wouldn't expect it to be as low as 2000 or 3000. Since I'm terminating the child processes, I'd expect their threads to be deleted as well. (Correct me if I'm wrong.)
Occasionally, this Can't start new thread error comes up while spawning a new child process. I couldn't understand why it shows a thread error while creating a child process.
Rarely, I've seen the above error appear at the very beginning of the first child process (no doubt the previous instance of Python was killed just a few seconds before). Since no instance of Python appears in the (Linux) process list, all threads from the previous instance should have been cleared, which is not reflected in this rare scenario.
For now I'm ignoring these errors with try/except, but it would be good to know:
Why does that error occur with so few threads?
On which parameters (of the OS and Python) does the creation of a new thread depend? (A sketch of the usual limits follows below.)
How can I avoid the above error? (I have one workaround, which creates just one daemon background thread instead of a Timer thread for each connection. Other than that, is there any better solution?)
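As context for the limits question above, a Linux-specific sketch of the knobs that usually bound thread creation (these are general assumptions about a typical setup, not a diagnosis of this exact error):

import resource

# Per-user cap on processes + threads combined (`ulimit -u`).
print("RLIMIT_NPROC:", resource.getrlimit(resource.RLIMIT_NPROC))

# Per-thread stack reservation (`ulimit -s`); each new thread consumes
# this much virtual address space, so thousands of threads can exhaust it.
print("RLIMIT_STACK:", resource.getrlimit(resource.RLIMIT_STACK))

# System-wide cap on the number of threads.
with open("/proc/sys/kernel/threads-max") as f:
    print("threads-max:", f.read().strip())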
I am using Python's multiprocessing library together with emcee, which also uses multiple threads to implement MCMC. The problem is that even when I close the pool, Python still seems to be using the processors and slowing down the cores, and I have no idea how to properly release the cores after the job is done. Could anybody give me an idea of what I should do?
Update:
My code has already been posted here.
Closing a Pool doesn't stop it from doing work; it just prevents new work items from being added to it:
close()
Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.
So if you have a lot of queued tasks, closing the Pool won't make any difference in resource usage; all the workers will keep consuming those queued tasks until they're gone. If you want to abort all the work items immediately, you have to use pool.terminate:
terminate()
Stops the worker processes immediately without completing outstanding work. When the pool object is garbage collected terminate() will be called immediately.
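A minimal sketch of the difference (the work function is a hypothetical stand-in):

import multiprocessing
import time

def work(i):
    time.sleep(1)  # stand-in for a real task
    return i

if __name__ == "__main__":
    pool = multiprocessing.Pool(4)
    pool.map_async(work, range(100))

    # pool.close()    # would let workers drain all 100 queued tasks first
    pool.terminate()  # stops workers now; queued tasks are simply dropped
    pool.join()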
I use a simple RabbitMQ queue to distribute tasks to worker processes. Each worker process uses a pool of multiprocessing instances to work on multiple tasks at the same time, to use the memory and the CPU as much as possible.
The problem is that some of the tasks take much more RAM than the others, so the worker process would crash if it started more than one such instance. But while the worker is working on a RAM-intensive task, I'd like it to work on other, less RAM-intensive tasks to use the rest of the CPUs.
One idea would be to use multiple queues or topics, but I am wondering what the recommended approach is. Can I catch out-of-memory errors before they crash the process?
What would be the right approach to solve this?
[Update]
The whole system will consist of multiple multi-core machines, but on each machine there is only one worker program running, which creates as many multiprocessing instances as there are cores. The different machines should be independent of each other, except that they get their tasks from the same queue.
I think trying to catch and recover from OOM errors will be very difficult, if not impossible. You would need a thread or process running that constantly monitors memory usage, and when it detects that it's too high, does... what, exactly? Kill a process that's processing a task? Try to pause it (if that's possible; it may not be, depending on what your tasks are doing)? Even then, pausing it isn't going to release any memory. You'd have to release the memory and restart the task when it's safe, which means you'd have to requeue it, decide when it's safe, and so on.
Instead of trying to detect and recover from the problem, I would recommend trying to avoid it altogether. Create two queues and two pools: one queue/pool for high-memory tasks, and another queue/pool for low-memory tasks. The high-memory pool would have only a single process in it, so it would be limited to running one task at a time, which saves your memory. The low-memory pool would have multiprocessing.cpu_count() - 1 processes, allowing you to keep your CPUs saturated across the two pools.
One potential issue with this approach is that if you exhaust the high-memory queue while still having low-memory tasks pending, you'll be wasting one of your CPUs. You could handle this by consuming from the high-memory queue in a non-blocking way (or with a timeout), so that if the high-memory queue is empty when you're ready to consume a task, you can grab a low-memory task instead. Then, when you're done processing it, check the high-memory queue again.
Something like this:
import multiprocessing

# hi_q and lo_q are placeholders for whatever library you're using to consume from RabbitMQ

def high_mem_consume():
    while True:
        # Prefer a high-memory task, but don't block forever waiting for one
        task = hi_q.consume(timeout=2)
        if not task:
            # The high-memory queue is empty; grab a low-memory task instead
            task = lo_q.consume(timeout=2)
        if task:
            process_task(task)

def low_mem_consume():
    while True:
        task = lo_q.consume()  # Blocks forever
        process_task(task)

if __name__ == "__main__":
    hi_pool = multiprocessing.Pool(1)
    lo_pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)
    hi_pool.apply_async(high_mem_consume)
    # Submit one consumer loop per low-memory worker so every process stays busy
    for _ in range(multiprocessing.cpu_count() - 1):
        lo_pool.apply_async(low_mem_consume)
I have a Python program which needs to scan some large log files to extract useful information.
In this program, to better utilize the computing resources of the server (which runs Ubuntu 12.04 LTS and has 64 cores and 96 GB of memory), I create a process pool of size 10 and submit several jobs to these pool workers. Each job reads from several large files (about 50 GB each, 20 files in total) using file.readlines(), then analyzes them line by line to find useful information and saves the results in a dictionary. After all files are scanned and analyzed, the result dictionary is written to disk. Besides, there is no explicit call to gc.collect() anywhere in the script.
I started this program on the server using the root account, and these processes work fine at first: each process of this program occupies about 3.8 GB of memory, so about 40 GB in total.
After a few hours, another user started a different memory-consuming program (also under the root account), which aggressively used almost all the memory (99% of the total); that program was later interrupted with CTRL-Z and killed using killall -9 process_name.
However, after this, I found that the process state of most of my pool workers has changed to S, and the CPU usage of these sleeping processes has dropped to 0. According to man top:
The status of the task which can be one of:
'D' = uninterruptible sleep,
'R' = running,
'S' = sleeping,
'T' = traced or stopped,
'Z' = zombie
I used the ps -axl command to check the name of the kernel function where each process is sleeping, and it turns out that these pool worker processes are sleeping on _fastMutex.
This situation has lasted for a long time (the process state is still S now), and I don't want to restart my program and scan all the files again. How can I change these processes from the Sleeping state back to Running?
The Sleeping state indicates that they are waiting for something; the way to wake them up is to satisfy whatever condition it is they wait for (the mutex is probably the mechanism of waiting, not the condition itself). The references to memory consumption suggest the possibility that some processes are at least partially paged out, in which case they would be waiting for the swapper; however, that results in uninterruptible sleep, D, not S.
System calls in interruptible sleep can also be interrupted by signals, such as alarm, terminate, stop, or continue. Most signals cause the program to abort, however. The two that are (usually) safe, continue and ignore, don't change program flow, so the process would just go back to sleep on the same condition again.
Most likely, the reason your processes are in S is that they're genuinely waiting for outside input. Since all we know of your program is that it loads a lot of data, I can't tell you where that happens.
As for how you've described your program ("each job reads from several large files ... using file.readlines(), and then analyzes them line by line"): it's highly unlikely that this is an efficient way to do it. If you're only scanning line by line in one sweep, it's better to iterate over the file objects in the first place (getting one line at a time). If you're reading text lines in a random order, linecache is your friend. Using mmap you could avoid copying the data from the disk buffers. Which is the best fit depends on the structure of your data and algorithm.
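A sketch of the one-sweep approach, for illustration (analyze is a hypothetical stand-in for the real per-line analysis): iterating over the file object streams one line at a time, instead of file.readlines() materializing all ~50 GB as a list.

def analyze(line, results):
    # Hypothetical stand-in for the real analysis:
    # count occurrences of each line's first token.
    parts = line.split(None, 1)
    if parts:
        results[parts[0]] = results.get(parts[0], 0) + 1

def scan(path):
    results = {}
    with open(path) as f:
        for line in f:  # streams the file; only one line in memory at a time
            analyze(line, results)
    return results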
By "state of most of my poolworkers have been changed to S" I suspect that the other workers are what's interesting. Perhaps the sleeping ones are just waiting for the ones that are paged out to return.