I have a Python program which needs to scan some large log files to extract useful information.
To better utilize the computing resources of the server (which runs Ubuntu 12.04 LTS and has 64 cores and 96 GB of memory), the program creates a process pool of size 10 and applies several jobs to these pool workers. Each job reads from several large files (about 50 GB each, 20 files in total) using file.readlines(), then analyzes them line by line to find useful information and saves the results in a dictionary. After all the files are scanned and analyzed, the result dictionary is written to disk. There is no explicit call to gc.collect() anywhere in the script.
I started this program on the server using the root account, and the processes worked fine at first: each process occupied about 3.8 GB of memory, so roughly 38 GB in total.
After a few hours, another user started a memory-hungry program (also under the root account) which aggressively used almost all the memory (99% of the total); that program was later interrupted with CTRL-Z and killed with killall -9 process_name.
However, after this I found that the state of most of my pool workers had changed to S, and the CPU usage of these sleeping processes dropped to 0. According to man top:
The status of the task which can be one of:
'D' = uninterruptible sleep,
'R' = running,
'S' = sleeping,
'T' = traced or stopped,
'Z' = zombie
I used the ps -axl command to check the name of the kernel function in which each process is sleeping, and it turns out that these pool worker processes are sleeping on _fastMutex.
This situation has lasted for a long time (the process state is still S now), and I don't want to restart my process and scan all the files again. How can I change these processes from the sleeping state back to running?
The Sleeping state indicates that they are waiting for something; the way to wake them up is to satisfy whatever condition it is they wait for (the mutex is probably the mechanism of waiting, not the condition itself). The references to memory consumption suggest the possibility that some processes are at least partially paged out, in which case they would be waiting for the swapper; however, that results in uninterruptible sleep D, not S.
System calls that are in interruptible sleep can also be interrupted by signals, such as alarm, terminate, stop, or continue. Most signals cause the program to abort, however. The two that are (usually) safe, continue and ignore, don't change program flow, so the process would just go back to sleep on the same condition again.
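For example, sending the harmless "continue" signal from Python looks like this (a sketch; worker_pid is a hypothetical pid of one of the sleeping workers):

import os, signal

worker_pid = 12345  # hypothetical pid of one of the sleeping workers
# SIGCONT resumes a stopped process and is (usually) harmless for a sleeping
# one; either way it does not change the program's flow.
os.kill(worker_pid, signal.SIGCONT)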
Most likely, the reason your processes are in S is that they're genuinely waiting for outside input. Since all we know of your program is that it loads a lot of data, I can't tell you where that happens.
As for how you've described your program: "Each job reads from several large files ... using file.readlines(), and then analyze them line by line". It's highly unlikely this is an efficient way to do it; if you're only scanning line by line in one sweep, it's better to iterate over the file objects in the first place (getting one line at a time), as sketched below. If you're reading lines in a random order, linecache is your friend. Using mmap you could avoid copying the data from the disk buffers. Which is the best fit depends on the structure of your data and algorithm.
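A minimal sketch of the one-sweep case (the filename, the match test, and the result keys are placeholders, not your actual logic):

results = {}
with open("big.log") as f:       # placeholder filename
    for line in f:               # the file object yields one line at a time
        if "ERROR" in line:      # stand-in for your real matching logic
            results[line.split()[0]] = line

Unlike readlines(), this never materializes a 50 GB file as a list of strings, so each worker's memory stays flat.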
By "state of most of my poolworkers have been changed to S" I suspect that the other workers are what's interesting. Perhaps the sleeping ones are just waiting for the ones that are paged out to return.
Related
I wrote a data analysis program using Python's multiprocessing library for parallelism. As I don't need to control the subprocesses in detail, I used multiprocessing.Pool for simplicity.
However, when running the program, I found that all the sub-processes fall into the S (sleeping) state after a short period in the active (running) state.
I investigated the wchan of the processes. The parent process and all but one of the sub-processes are waiting on _futex; the remaining one is waiting on pipe_wait.
Some information about my program:
I used multiprocessing.Pool#map to distribute the tasks.
The sub-process task involves disk IO and high memory usage. During the course of the program, the sub-processes' memory cost may exceed the memory capacity (32 sub-processes, each taking at most 5% of memory). Disk space is ample.
The arguments and return values of the mapped function are small (just the filenames of the files to be processed, to be specific).
I didn't explicitly create any pipe in my code.
This is the code skeleton of my program.
# imports omitted

def subprocess_task(filename):
    read_the_file(filename)       # Large disk IO
    process_the_data()            # High memory cost
    write_the_file(new_filename)  # Large disk IO
    return new_filename

if __name__ == "__main__":
    files = ["", "", ...]  # The filenames of the files to process, len(files) == 32.
    p = multiprocessing.Pool(32)  # There are more than 32 cores on the computer.
    res = p.map(subprocess_task, files)
    p.close()
    # Do something with res.
So I want to know why the processes get stuck in such a state (especially the pipe_wait one)? Does it have anything to do with the high memory usage, and how do I solve it?
Many thanks!
OK, after some effort digging into pipe(7), the multiprocessing source code, and the logs of my troublesome program, I finally identified the problem.
The sole child process in pipe_wait seemed suspicious, and because of it I wasted hours trying to find the blocking pipe. However, the key problem has nothing to do with pipes.
The problem became clear when I added some prints reporting the pid at a few checkpoints in my program. The processes are not the same when the tasks are submitted (which I will refer to as the original processes) and when the program gets stuck (the stuck processes). One of the original 32 child processes is missing among the stuck processes, and the only stuck process in pipe_wait was not present when the tasks were submitted.
So I can guess the reason now, and the multiprocessing source code corresponds with my guess.
As I said, the program uses lots of memory. At some point, when the system ran out of memory, the OOM killer killed one of the child processes, selected by its scoring algorithm. The OOM killer kills forcibly, so the process exited with all its cleanup undone, including the communication with the multiprocessing.Pool.
As the source code indicates, the pool uses one thread to collect the task results and another to manage the workers. The collector thread passively waits for results to be sent by the child processes, while the worker-manager thread actively detects process exits by polling all the processes.
Therefore, after the process is killed, the worker-manager thread detects it and repopulates the pool by spawning a new process. As no more tasks are submitted, this new process sits in pipe_wait waiting for a task; that's the sole pipe_wait child process in my problem. Meanwhile, the result-collector thread keeps waiting for the result from the killed process, which will never arrive, so everything else stays asleep as well.
I have no root access in this environment, otherwise this could be further verified by investigating the OOM killer's log.
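As a further note (my suggestion, not something verified against the original program): switching from multiprocessing.Pool to concurrent.futures.ProcessPoolExecutor turns this failure mode into an explicit error instead of a silent hang, because the executor notices abrupt worker death and raises BrokenProcessPool:

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

if __name__ == "__main__":
    files = ["", "", ...]  # placeholder filenames, as in the skeleton above
    try:
        with ProcessPoolExecutor(max_workers=32) as ex:
            res = list(ex.map(subprocess_task, files))
    except BrokenProcessPool:
        # raised when a worker dies abruptly, e.g. killed by the OOM killer
        print("a worker was killed; results are incomplete")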
I used a regex to check the sequence records of paired-end fastq files and write the matched sequences into the same files. I used multiprocessing to speed it up, but when I ran it with 20 processes, all 20 CPU cores sat at about 2% usage and the total time was the same as running on a single core. Does this mean the regex search is faster than writing the output to file, so the processes were waiting? Can you suggest how I can improve the multiprocessing? The code is attached.
def mycallback(x):
    SeqIO.write(x[0], outfile1, result.form)
    SeqIO.write(x[1], outfile2, result.form)
    SeqIO.write(x[2], outfile3, result.form)
    SeqIO.write(x[3], outfile4, result.form)

def check(x):
    if regex.search(r'^.{0,20}(?:' + fp + '){e<=' + str(result.mm) + '}', str(x[0].seq), flags=regex.I) and \
       regex.search(r'^.{0,20}(?:' + rp + '){e<=' + str(result.mm) + '}', str(x[1].seq), flags=regex.I):
        return (x[0], x[1], '', '')
    elif regex.search(r'^.{0,20}(?:' + fp + '){e<=' + str(result.mm) + '}', str(x[1].seq), flags=regex.I) and \
         regex.search(r'^.{0,20}(?:' + rp + '){e<=' + str(result.mm) + '}', str(x[0].seq), flags=regex.I):
        return (x[1], x[0], '', '')
    else:
        return ('', '', x[0], x[1])

p = Pool(int(result.n))
for i in izip(SeqIO.parse(result.fseq, result.form), SeqIO.parse(result.rseq, result.form)):
    p.apply_async(check, args=(i,), callback=mycallback)
p.close()
p.join()
Python's implementation of pool.apply_async calls the callback function on a thread inside the main process and is thus limited by the GIL. You are therefore waiting on all your file writes sequentially.
Callbacks should complete immediately since otherwise the thread which handles the results will get blocked.
I would imagine your regex executes faster than the file writing, so you would benefit the most from handing the writes to their own threads (so multiple file writes can be queued at once). Python threads release the GIL when waiting on IO (file writes), and are much lighter (faster to start up) than processes.
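One way to apply that idea (a sketch reusing the question's SeqIO, outfile1..outfile4 and result.form; the queue and writer thread are my additions): keep mycallback trivial by having it enqueue the result tuple, and let a dedicated writer thread do the actual writes, so the pool's result-handler thread never blocks.

import threading
from Queue import Queue  # "queue" on Python 3

write_q = Queue()

def writer():
    while True:
        x = write_q.get()
        if x is None:  # sentinel: no more results
            return
        SeqIO.write(x[0], outfile1, result.form)
        SeqIO.write(x[1], outfile2, result.form)
        SeqIO.write(x[2], outfile3, result.form)
        SeqIO.write(x[3], outfile4, result.form)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

def mycallback(x):
    write_q.put(x)  # returns immediately; never blocks the result handler

# ... submit jobs with apply_async as before, then:
# p.close(); p.join(); write_q.put(None); writer_thread.join()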
I am running a Python (python3) script that spawns (using fork, not spawn) lots of processes through multiprocessing.Process (e.g. 20-30 of them) at the same time. I make sure all of these processes finish (.join()) and don't become zombies. However, even though I am running the same code with the same random seed, my job crashes due to a huge spike in memory usage at completely random times (memory usage suddenly jumps from the requested 14 GB to a random value between 30 GB and 200 GB). Sometimes my job/script crashes 10 minutes in, sometimes right at the beginning, and sometimes 10 hours into the run. Note that the process is deterministic and I can repeat it, but I cannot reproduce the crash, which is very weird. What I do in each of those processes is load an image from disk using cv2.imread (each might take about 0.5 MB of memory) and store it into shared memory (mp.RawArray('f', 3*224*224) or mp.Array('f', 3*224*224)) that I created before starting the process! My code creates and processes around 1500-2000 of these images per minute on the server I'm running it on. It's very annoying that sometimes only 100-150 of those images have been read from disk but the job crashes at the very beginning, even though I request 25 GB of memory when I submit my job to our servers, which use CentOS.
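For concreteness, the pattern described above looks roughly like this (a sketch; the resize step, the float32 dtype, and all names are my assumptions):

import multiprocessing as mp
import numpy as np
import cv2

shared = mp.RawArray('f', 3 * 224 * 224)  # one float32 per channel/pixel

def load_into_shared(path):
    img = cv2.imread(path)                          # HxWx3 uint8 (BGR)
    img = cv2.resize(img, (224, 224))               # assumed preprocessing
    view = np.frombuffer(shared, dtype=np.float32)  # zero-copy view of the RawArray
    view[:] = img.transpose(2, 0, 1).astype(np.float32).ravel()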
I've tried increasing the requested memory (from 25 GB to 115 GB) on our servers, but my script still crashes sooner or later, at completely random times. Another thing I noticed is that although I spawn lots of processes and call .start() on them at the same time, most of those processes do not start running until the ones spawned earlier are done. This is because I do not request many cores (e.g. 30) when running my job and use 8 cores.
I wonder if people have had similar experiences? I would appreciate your comments/suggestions.
The following is my multiprocessing code. regressTuple has around 2000 items, so the code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in Python to get the possible hardware process count?
How can I refactor the code to use an input variable for the parallel thread count (hard-coded) and loop through the threading several times until completion? That way, after a few experiments, I would be able to find the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Here is my code:
regressTuple = [(x,) for x in regressList]
processes = []

for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things to keep in mind:
The number of processes you can spin up is not limited by the number of cores on your system but by the ulimit for your user id, which controls the total number of processes that can be launched under your user id.
The number of cores determines how many of those launched processes can actually run in parallel at any one time.
Your system may be crashing because the target function these processes run is doing something heavy and resource-intensive that the system cannot handle when many processes run simultaneously, or because the nprocs limit on the system is exhausted and the kernel cannot spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine, because creating a new process is not a lightweight task: generating the pid, allocating memory, setting up the address space, scheduling the process, context switching, and managing its entire life cycle all happen in the background. So it is a heavy operation for the kernel to generate a new process.
Unfortunately, I guess what you are trying to do is a CPU-bound task, and hence limited by the hardware you have on the machine. Spinning up more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to it. Something like this (target_func below stands in for your actual task):
import multiprocessing

def target_func(data):
    # process the input data
    ...

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't Python's multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's Python's responsibility to manage the queue length. When people reach for multiprocessing they tend to want efficiency, and adding system performance tests to the run queue would be overhead.
Is there an API call in Python to get the possible hardware process count?
There is, for the core count: multiprocessing.cpu_count(). But would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable for the parallel thread count (hard-coded) and loop through the threading several times until completion? That way, after a few experiments, I would be able to find the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3 MB, and see how many tasks you can run at once.
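On Linux that estimate can be computed directly (a sketch; the ~3 MB-per-task figure is the guess from above, not a measured value):

import os
import multiprocessing

page_size = os.sysconf("SC_PAGE_SIZE")
avail_pages = os.sysconf("SC_AVPHYS_PAGES")  # available physical pages (Linux)
by_memory = (page_size * avail_pages) // (3 * 1024 * 1024)
n_workers = max(1, min(by_memory, multiprocessing.cpu_count()))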
This is probably more of a sysadmin task, balancing the bottlenecks against the queue length; but generally, if your tasks are IO-bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn onto the road. The tasks will then fight with each other for the next block of IO.
I have written a data-munging script that is very CPU-intensive. It has been running for a few days now, but (thanks to trace messages sent to the console) I can see that it is not working (actually, it has not been working for the last 10 hours or so).
When I run top, I notice that the process is either sleeping (S) or in uninterruptible sleep (D). This is wasting a lot of time.
I used sudo renice -10 PID to change the process's nice value, and after it ran for a short while, I noticed that the process had gone back to sleep again.
My question(s):
Is there anything I can do to FORCE the script to run until it finishes, even if it means the machine is unusable until the end of the script?
Is there a yield command I can use in Python which allows me to periodically pass control to other processes/threads, to stop the scheduler from trying to put my script to sleep?
I am using Python 2.7.x on Ubuntu 10.04.
The scheduler will only put your process on hold if there is another process ready to run. If no other processes are hogging the CPU, your process will be running most of the time. The scheduler does not put your process to sleep just because it feels like it.
My guess is that there is some reason your process is not runnable, e.g. it is blocking while waiting for I/O or data.
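If you want to see exactly what it is waiting on, you can read the wait channel from procfs, much as the earlier answers did with ps/wchan (a sketch; pid is whatever top reports for your script):

def wait_channel(pid):
    # /proc/<pid>/wchan names the kernel function the task sleeps in,
    # e.g. "pipe_wait"; it is empty or "0" when the task is runnable.
    with open("/proc/%d/wchan" % pid) as f:
        return f.read() or "(runnable)"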