Python multiprocessing isn't using full CPU cores

I used regex to check the sequence records of paired-end fastq files and write the matched sequences out to files. I used multiprocessing to speed it up, but when I ran it with 20 processes, all 20 CPU cores sat at about 2% usage and the total time was the same as running on a single core. Does this mean the regex search is faster than writing output to file, so the processes were waiting? Can you suggest how I can improve the multiprocessing? Attached is the code.
def mycallback(x):
    SeqIO.write(x[0], outfile1, result.form)
    SeqIO.write(x[1], outfile2, result.form)
    SeqIO.write(x[2], outfile3, result.form)
    SeqIO.write(x[3], outfile4, result.form)

def check(x):
    if regex.search(r'^.{0,20}(?:'+fp+'){e<='+str(result.mm)+'}', str(x[0].seq), flags=regex.I) and regex.search(r'^.{0,20}(?:'+rp+'){e<='+str(result.mm)+'}', str(x[1].seq), flags=regex.I):
        return (x[0], x[1], '', '')
    elif regex.search(r'^.{0,20}(?:'+fp+'){e<='+str(result.mm)+'}', str(x[1].seq), flags=regex.I) and regex.search(r'^.{0,20}(?:'+rp+'){e<='+str(result.mm)+'}', str(x[0].seq), flags=regex.I):
        return (x[1], x[0], '', '')
    else:
        return ('', '', x[0], x[1])

p = Pool(int(result.n))
for i in izip(SeqIO.parse(result.fseq, result.form), SeqIO.parse(result.rseq, result.form)):
    p.apply_async(check, args=(i,), callback=mycallback)
p.close()
p.join()

Python's implementation of Pool.apply_async calls the callback function in a thread inside the main process, so it is limited by the GIL. You are therefore waiting on all your file writes sequentially.
Callbacks should complete immediately, since otherwise the thread that handles the results will get blocked.
I would imagine your regex executes faster than the file writing, so you would benefit the most from sending the callbacks to their own threads (so multiple file writes can be queued at once). Python threads release the GIL when waiting on I/O (file writes), and are much lighter (faster to start up) than processes.
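One way to put that into practice is a single dedicated writer thread fed by a queue: mycallback only enqueues the result, so the Pool's result-handler thread is never blocked on file I/O. A minimal sketch, assuming Python 3 (on Python 2 the module is Queue) and the same outfile handles, result.form and check function from the question:

import threading
import queue

write_q = queue.Queue(maxsize=1000)     # bounded so pending results cannot pile up in memory

def writer():
    # A single thread owns the output handles, so no locking is needed.
    while True:
        item = write_q.get()
        if item is None:                # sentinel: all results have been queued
            break
        SeqIO.write(item[0], outfile1, result.form)
        SeqIO.write(item[1], outfile2, result.form)
        SeqIO.write(item[2], outfile3, result.form)
        SeqIO.write(item[3], outfile4, result.form)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

def mycallback(x):
    write_q.put(x)                      # returns immediately; the writing happens in writer_thread

# submit jobs to the Pool exactly as before, then shut down in this order:
# p.close(); p.join(); write_q.put(None); writer_thread.join()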

Python multiprocessing: dealing with 2000 processes

Following is my multiprocessing code. regressTuple has around 2000 items, so the following code creates around 2000 parallel processes. My Dell XPS 15 laptop crashes when this is run.
Can't the python multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in python to get the possible hardware process count?
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion - in this way, after a few experiments, I will be able to get the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation)
Hereby my code:
regressTuple = [(x,) for x in regressList]
processes = []

for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus, args=regressTuple[i]))

for process in processes:
    process.start()

for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spawn is not limited by the number of cores on your system but by the ulimit for your user id, which controls the total number of processes that can be launched under that user id.
The number of cores determines how many of those launched processes can actually run in parallel at any one time.
Your system may be crashing because the target function these processes run is doing something heavy and resource intensive that the system cannot handle when many processes run simultaneously, or because the nproc limit on the system has been exhausted and the kernel can no longer spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine, because creating a new process is not a lightweight task: generating the PID, allocating memory, setting up the address space, scheduling the process, context switching and managing its entire life cycle all happen in the background, so it is a heavy operation for the kernel.
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spinning up more processes than the number of cores on your system is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores on the system and then pass the input to the pool. Something like this:
import multiprocessing

def target_func(data):
    # process the input data
    ...

with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
    res = po.map(target_func, regressTuple)
Can't the python multiprocessing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's python's responsibility to manage the queue length. When people reach out for multiprocessing they tend to want efficiency; adding system performance tests to the run queue would be an overhead.
Is there an API call in python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
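For what it's worth, the core count itself is available from the standard library (though, as noted, that says nothing about how much memory each task will need):

import os
import multiprocessing

print(multiprocessing.cpu_count())        # logical CPU count
print(os.cpu_count())                     # same figure, available since Python 3.4
print(len(os.sched_getaffinity(0)))       # CPUs this process may actually use (Linux only)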
How can I refactor the code to use an input variable to get the parallel thread count (hard coded) and loop through threading several times till completion - in this way, after a few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
This is probably more of a sysadmin task of balancing the bottlenecks against the queue length, but generally, if your tasks are IO bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn into the road. The tasks will then fight with each other for the next block of IO.
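A rough sketch of that back-of-the-envelope sizing, assuming the third-party psutil package is installed and that ~3 MB per task is the right estimate for this particular workload:

import os
import psutil

per_task_bytes = 3 * 1024 ** 2                        # assumed ~3 MB per task
available = psutil.virtual_memory().available         # bytes of memory currently available
max_by_memory = max(1, available // per_task_bytes)
n_workers = min(os.cpu_count() or 1, max_by_memory)   # don't exceed the core count either
print('running at most %d tasks at once' % n_workers)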

Python processing items from list/queue and saving progress

If I have about 10+ million little tasks to process in python (convert images or so), how can I create a queue and save progress in case of a crash during processing? To be clear, how can I save progress, or stop the process whenever I want, and continue processing from the last point?
Also, how do I deal with multiple threads in that case?
In general the question is how to save progress on processed data to a file. The issue is that it's a huge amount of very small files, so saving a file after each iteration would take longer than the processing itself...
Thanks!
(sorry for my English if it's not clear)
First of all, I would suggest not going for multi-threading. Use multi-processing instead. Multiple threads do not work simultaneously in python due to the GIL when it comes to computation-intensive tasks.
To solve the problem of saving results, use the following sequence:
1. Get the names of all the files in a list and divide the list into chunks.
2. Now assign each process one chunk.
3. Append the names of processed files every 1000 steps to some file (say monitor.txt) on the system (assuming that you can afford to process 1000 files again in case of failure).
4. In case of failure, skip all the files which are saved in monitor.txt for each process.
You can have monitor_1.txt, monitor_2.txt ... for each process, so you will not have to read the whole file for each process.
The following gist might help you. You just need to add code for the 4th point; a rough sketch of the whole scheme also follows below.
https://gist.github.com/rishibarve/ccab04b9d53c0106c6c3f690089d0229
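A rough sketch of that checkpointing scheme (the convert_image function and the 'images' input directory are hypothetical placeholders for the real per-file work):

import multiprocessing as mp
import os

def convert_image(name):
    # hypothetical per-file work; the real image conversion would go here
    pass

def process_chunk(worker_id, file_chunk):
    # Process one chunk of files, checkpointing finished names to monitor_<id>.txt.
    monitor_path = 'monitor_%d.txt' % worker_id
    done = set()
    if os.path.exists(monitor_path):
        with open(monitor_path) as fh:
            done = set(line.strip() for line in fh)
    buffer = []
    with open(monitor_path, 'a') as monitor:
        for name in file_chunk:
            if name in done:
                continue                       # already processed before the crash
            convert_image(name)
            buffer.append(name)
            if len(buffer) >= 1000:            # flush the checkpoint every 1000 files
                monitor.write('\n'.join(buffer) + '\n')
                monitor.flush()
                buffer = []
        if buffer:
            monitor.write('\n'.join(buffer) + '\n')

if __name__ == '__main__':
    file_names = sorted(os.listdir('images'))          # assumed input directory
    n_workers = 4
    chunks = [file_names[i::n_workers] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        pool.starmap(process_chunk, enumerate(chunks))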
I/O operations like saving files are always relatively slow. If you have to process a large batch of files, you will be stuck with a long I/O time regardless of the number of threads you use.
The easiest approach is to use multithreading rather than multiprocessing, and to let the OS's scheduler figure it all out. The docs have a good explanation of how to set up threads. A simple example would be:
from threading import Thread

def process_data(file_name):
    # does the processing
    print(f'processed {file_name}')

if __name__ == '__main__':
    file_names = ['file_1', 'file_2']
    processes = [Thread(target=process_data, args=(file_name,)) for file_name in file_names]
    # here you start all the processes
    for proc in processes:
        proc.start()
    # here you wait for all processes to finish
    for proc in processes:
        proc.join()
One solution that might be faster is to create a separate process that does the I/O. You then use a multiprocessing.Queue to queue the files from the data-processing processes, and let the I/O process pick these up and write them one after the other.
This way the I/O never has to rest, which will be close to optimal. I don't know whether this yields a big advantage over the threading-based solution, but as is generally the case with concurrency, the best way to find out is to do some benchmarks with your own application.
One issue to watch out for is that if the data processing is much faster, the Queue can grow very big. This might have a performance impact, depending on your system among other things. A quick workaround is to pause the data processing if the queue gets too large.
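A minimal sketch of that layout, with a bounded multiprocessing.Queue and a hypothetical process_file function standing in for the real conversion work:

import multiprocessing as mp

def io_worker(q):
    # Dedicated writer process: drains the queue so disk I/O never has to rest.
    while True:
        item = q.get()
        if item is None:                       # sentinel sent by the main process
            break
        path, data = item
        with open(path, 'wb') as fh:
            fh.write(data)

def process_file(name):
    # hypothetical per-file conversion; returns (output path, output bytes)
    return name + '.out', b'converted bytes'

if __name__ == '__main__':
    q = mp.Queue(maxsize=10000)                # bounded: fast producers pause instead of growing the queue
    writer = mp.Process(target=io_worker, args=(q,))
    writer.start()
    with mp.Pool() as pool:
        for out in pool.imap_unordered(process_file, ['file_1', 'file_2']):
            q.put(out)
    q.put(None)
    writer.join()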
Remember to write all multiprocessing code in Python in a script with the
if __name__ == '__main__':
    # mp code
guard, and be aware that some IDEs don't play nice with concurrent Python code. The safe bet is to test your code by executing it from a terminal.

Multithreaded parsing is slower than sequential

I am parsing 4 large XML files through threads, and somehow the multithreaded code is slower than the sequential code?
Here is my multithreaded code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
and that's the "faster" code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)
    for t in thread_list:
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
The sequential code is faster by 10 whole minutes, how is this possible?
Python uses the GIL (Global Interpreter Lock) to ensure that only one thread executes Python code at a time. This is done to prevent data races, among other reasons. That, however, means that multithreading in the default CPython will barely give you any speedup (if it doesn't slow things down, as it did in your case).
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL.
Here's a SO question on that topic
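A minimal sketch of that suggestion for the code in the question, with a hypothetical parse_file function standing in for whatever CVEParser does internally:

import multiprocessing as mp
import xml.etree.ElementTree as ET

def parse_file(file_name):
    # hypothetical stand-in for the parsing work done inside CVEParser
    tree = ET.parse(file_name)
    # ... extract whatever entries are needed ...
    return [(file_name, len(tree.getroot()))]

if __name__ == '__main__':
    cve_files = ['a.xml', 'b.xml', 'c.xml', 'd.xml']    # placeholder file names
    with mp.Pool(processes=4) as pool:                  # one worker per file / core
        for result in pool.imap_unordered(parse_file, cve_files):
            for res in result:
                print(res)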
Where did you read that multi-threading or even multi-processing should always be faster than sequential? That is simply wrong. Which of the 3 modes is fastest depends heavily on the problem to solve, and on where the bottleneck is.
If the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet.
If the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for IO completion during that time and you will get much better throughput - even if the really faster way is normally polling IO with select when possible.
Only if the bottleneck is CPU processing - which IMHO is not the most common use case - is parallelization over different cores the winner. In Python that means multi-processing (*). That mainly concerns heavy computations.
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 finishes long before thread0, you will only process its results after thread0 has ended, which is suboptimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
import time

active_list = thread_list[:]
while len(active_list) > 0:
    for t in active_list[:]:          # iterate over a copy so removal is safe
        if not t.is_alive():
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Technically this means releasing the Global Interpreter Lock.
If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

Unexpected parallelism behaviours with threading in python 2.7 on Ubuntu [duplicate]

I've written a working program in Python that basically parses a batch of binary files, extracting data into a data structure. Each file takes around a second to parse, which translates to hours for thousands of files. I've successfully implemented a threaded version of the batch parsing method with an adjustable number of threads. I tested the method on 100 files with a varying number of threads, timing each run. Here are the results (0 threads refers to my original, pre-threading code, 1 threads to the new version run with a single thread spawned).
0 threads: 83.842 seconds
1 threads: 78.777 seconds
2 threads: 105.032 seconds
3 threads: 109.965 seconds
4 threads: 108.956 seconds
5 threads: 109.646 seconds
6 threads: 109.520 seconds
7 threads: 110.457 seconds
8 threads: 111.658 seconds
Though spawning a thread confers a small performance increase over having the main thread do all the work, increasing the number of threads actually decreases performance. I would have expected to see performance increases, at least up to four threads (one for each of my machine's cores). I know threads have associated overhead, but I didn't think this would matter so much with single-digit numbers of threads.
I've heard of the "global interpreter lock", but as I move up to four threads I do see the corresponding number of cores at work--with two threads two cores show activity during parsing, and so on.
I also tested some different versions of the parsing code to see if my program is IO bound. It doesn't seem to be; just reading in the file takes a relatively small proportion of the time, and processing the file is almost all of it. If I don't do the IO and process an already-read version of a file, adding a second thread damages performance and a third thread improves it slightly. I'm just wondering why I can't take advantage of my computer's multiple cores to speed things up. Please post any questions or ways I could clarify.
This is sadly how things are in CPython, mainly due to the Global Interpreter Lock (GIL). Python code that's CPU-bound simply doesn't scale across threads (I/O-bound code, on the other hand, might scale to some extent).
There is a highly informative presentation by David Beazley where he discusses some of the issues surrounding the GIL. The video can be found here (thanks #Ikke!)
My recommendation would be to use the multiprocessing module instead of multiple threads.
The threading library does not actually utilize multiple cores simultaneously for computation. You should use the multiprocessing library instead for computational threading.
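A minimal sketch of that recommendation for the batch parser described in the question (parse_one is a hypothetical stand-in for the real extraction logic, and the file list is made up):

from multiprocessing import Pool, cpu_count

def parse_one(path):
    # hypothetical per-file parser: read the binary file and extract a record
    with open(path, 'rb') as fh:
        data = fh.read()
    return path, len(data)                  # placeholder for the real extracted data structure

if __name__ == '__main__':
    paths = ['file_%03d.bin' % i for i in range(100)]   # assumed batch of files
    with Pool(processes=cpu_count()) as pool:           # one worker per core
        results = dict(pool.map(parse_one, paths, chunksize=10))

Unlike threads, each pool worker is a separate interpreter with its own GIL, so CPU-bound parsing can use all cores; the trade-off is the cost of pickling results back to the parent process.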

How to change process state from sleep to running in linux?

I have a python program which needs to scan some large log files to extract useful information.
In this program, to better utilize the computing resources of the server (which runs Ubuntu 12.04 LTS and has 64 cores and 96 GB memory), I create a process pool with size = 10 and submit several jobs to these pool workers. Each job reads from several large files (about 50 GB each, 20 files in total) using file.readlines(), then analyzes them line by line to find useful information and saves the results in a dictionary. After all files are scanned and analyzed, the result dictionary is written to disk. Besides, there is no explicit call to gc.collect() in the whole script.
I started this program on the server using the root account, and the processes work fine at first: each process occupies about 3.8 GB of memory, so about 40 GB in total.
After a few hours, another user starts a memory-consuming program of their own (also under the root account), which aggressively uses almost all the memory (99% of total memory); that program is later interrupted with CTRL-Z and killed using killall -9 process_name.
However, after this, I found that the process state of most of my pool workers has changed to S, and the CPU usage of these sleeping processes has dropped to 0. According to man top:
The status of the task which can be one of:
'D' = uninterruptible sleep,
'R' = running,
'S' = sleeping,
'T' = traced or stopped,
'Z' = zombie
I used the ps -axl command to check the name of the kernel function where each process is sleeping, and it turns out that these pool worker processes sleep on _fastMutex.
This situation has lasted for a long time (the process state is still S now) and I don't want to restart my process and scan all the files again. How can I change these processes from state Sleeping to Running?
The Sleeping state indicates that they are waiting for something; the way to wake them up is to satisfy whatever condition it is they wait for (the mutex is probably the mechanism of waiting, not the condition itself). The references to memory consumption suggest the possibility that some processes are at least partially paged out, in which case they would be waiting for the swapper; however, that results in uninterruptible sleep D, not S.
System calls that are in interruptible sleep can also be interrupted by signals, such as alarm, terminate, stop, or continue. Most signals cause the program to abort, however. The two that are (usually) safe, continue and ignore, don't change program flow; so it would just go back to sleep on the same condition again.
Most likely, the reason your processes are in S is that they're genuinely waiting for outside input. Since all we know of your program is that it loads a lot of data, I can't tell you where that happens.
As for how you've described your program: "Each job reads from several large files ... using file.readlines(), and then analyzes them line by line". It's highly unlikely this is an efficient way to do it; if you're only scanning line by line in one sweep, it's better to iterate over the file objects in the first place (getting one line at a time). If you're reading text lines in a random order, linecache is your friend. Using mmap you could avoid copying the data from the disk buffers. Which is the best fit depends on the structure of your data and algorithm.
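To illustrate the first point, a small sketch (the log path and pattern are made up) contrasting readlines() with iterating over the file object:

def scan(log_path, pattern):
    # Iterating over the file object reads lazily, one line at a time,
    # instead of loading the whole ~50 GB file as readlines() would.
    hits = 0
    with open(log_path) as fh:
        for line in fh:
            if pattern in line:
                hits += 1
    return hits

# total = scan('/var/log/big.log', 'ERROR')    # hypothetical usage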
By "state of most of my poolworkers have been changed to S" I suspect that the other workers are what's interesting. Perhaps the sleeping ones are just waiting for the ones that are paged out to return.
