How to call a Linux command-line program in parallel with Python

I have a command-line program which runs on single core. It takes an input file, does some calculations, and returns several files which I need to parse to store the produced output.
I have to call the program several times, changing the input file each time. To speed things up, I think parallelization would be useful.
Until now I have performed this task by calling each run separately in a loop with the subprocess module.
I wrote a script which creates a new working folder on every run and then calls the program, whose output is directed to that folder and which returns some data that I need to store. My question is: how can I adapt the following code, found here, to execute my script so that it always uses the indicated number of CPUs and stores the output?
Note that each run has a different running time.
Here is the code in question:
import subprocess
import multiprocessing as mp
from tqdm import tqdm
NUMBER_OF_TASKS = 4
progress_bar = tqdm(total=NUMBER_OF_TASKS)
def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)
def update_progress_bar(_):
    progress_bar.update()
if __name__ == '__main__':
    pool = mp.Pool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()

I am not entirely clear what your issue is. I have some recommendations for improvement below, but you seem to claim on the page that you link to that everything works as expected and I don't see anything very wrong with the code as long as you are running on Linux.
Since subprocess.call already creates a new process, you should just use multithreading to invoke your worker function, work. Had you kept multiprocessing on a platform that uses the spawn method to create new processes (such as Windows), creating the progress bar outside the if __name__ == '__main__': block would have produced 4 additional progress bars that did nothing. Not good! So for portability it is best to create the progress bar inside the if __name__ == '__main__': block.
import subprocess
from multiprocessing.pool import ThreadPool
from tqdm import tqdm
def work(sec_sleep):
    command = ['python', 'worker.py', sec_sleep]
    subprocess.call(command)
def update_progress_bar(_):
    progress_bar.update()
if __name__ == '__main__':
    NUMBER_OF_TASKS = 4
    progress_bar = tqdm(total=NUMBER_OF_TASKS)
    pool = ThreadPool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()
Note: If your worker.py program prints to the console, it will mess up the progress bar (the progress bar will be re-written repeatedly on multiple lines).
Have you considered importing worker.py (some refactoring of that code might be necessary) instead of invoking a new Python interpreter to execute it? In that case you would want to use multiprocessing explicitly. On Windows this might not save you anything, since a new Python interpreter is started for each new process anyway, but it could save you some startup overhead on Linux:
import subprocess
from multiprocessing.pool import Pool
from worker import do_work
from tqdm import tqdm
def update_progress_bar(_):
    progress_bar.update()
if __name__ == '__main__':
    NUMBER_OF_TASKS = 4
    progress_bar = tqdm(total=NUMBER_OF_TASKS)
    pool = Pool(NUMBER_OF_TASKS)
    for seconds in [str(x) for x in range(1, NUMBER_OF_TASKS + 1)]:
        pool.apply_async(do_work, (seconds,), callback=update_progress_bar)
    pool.close()
    pool.join()
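To connect this back to the original question (one working folder per run, a fixed number of simultaneous runs, and output that has to be stored), here is a minimal sketch rather than a definitive implementation. FAKE_PROGRAM, run_one, run_all and result.txt are hypothetical names; the one-line Python command only stands in for the real single-core program so that the example can be run anywhere.

```python
import subprocess
import sys
import tempfile
from multiprocessing.pool import ThreadPool
from pathlib import Path

# Hypothetical stand-in for the real program: a one-liner that writes
# "result.txt" into whatever directory it is started in.
FAKE_PROGRAM = [sys.executable, "-c",
                "import sys; open('result.txt', 'w').write(sys.argv[1].upper())"]

def run_one(job):
    """Run the program for one input in its own working folder and parse its output."""
    base_dir, name = job
    workdir = Path(base_dir) / name
    workdir.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(FAKE_PROGRAM + [name], cwd=str(workdir),
                               capture_output=True, text=True)
    parsed = (workdir / "result.txt").read_text()
    return name, completed.returncode, parsed

def run_all(names, n_workers=2):
    """Keep at most n_workers programs running at once; return results keyed by input."""
    with tempfile.TemporaryDirectory() as base_dir:
        jobs = [(base_dir, name) for name in names]
        with ThreadPool(n_workers) as pool:
            # imap_unordered hands back each result as soon as its run finishes,
            # which suits runs with very different durations.
            return {name: (code, out)
                    for name, code, out in pool.imap_unordered(run_one, jobs)}

if __name__ == "__main__":
    print(run_all(["a", "b", "c"]))
```

Threads are enough here because each thread only waits for its external process, so the number of threads caps the number of cores in use.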

Related

Python running subprocesses without waiting whilst still receiving return codes

I have found questions related to mine but none that solves my problem.
The problem
I am building a program that monitors several directories, then spawns a subprocess based on directory or particular filename.
These subprocesses can often take several hours to complete (for example, when rendering thousands of PDFs). Because of this, I would like to know the best way for the program to keep monitoring the folders in parallel with the subprocess that is still running, and to spawn additional subprocesses as long as they are of a different type from the one currently running.
Once a subprocess has completed, the program should receive its return code, and that type of subprocess should become available to run again.
Code as it stands
This is the simple code that runs the program currently, calling functions when a file is found:
while 1:
    paths_to_watch = ['/dir1','/dir2','/dir3','/dir4']
    after = {}
    for x in paths_to_watch:
        key = x
        after.update({key :[f for f in os.listdir(x)]})
    for key, files in after.items():
        if(key == '/dir1'):
            function1(files)
        elif(key == '/dir2'):
            function2(files)
        elif(key == '/dir3'):
            function3(files)
        elif(key == '/dir4'):
            function3(files)
    time.sleep(10)
Of course, this means that the program waits for the process to finish before it continues to check for files in paths_to_watch.
From other questions, it looks like this is something that could be handled with process pools, however my lack of knowledge in this area means I do not know where to start.
I am assuming that you can use threads rather than processes, an assumption that will hold up if your functions function1 through function4 are predominantly I/O bound. Otherwise you should substitute ProcessPoolExecutor for ThreadPoolExecutor in the code below. Right now your program loops indefinitely, so the threads will never terminate either. I am also assuming that function1 through function4 each have their own implementation.
import os
import time
from concurrent.futures import ThreadPoolExecutor
def function1(files):
    pass
def function2(files):
    pass
def function3(files):
    pass
def function4(files):
    pass
def process_path(path, function):
    while True:
        files = os.listdir(path)
        function(files)
        time.sleep(10)
def main():
    paths_to_watch = ['/dir1','/dir2','/dir3','/dir4']
    functions = [function1, function2, function3, function4]
    with ThreadPoolExecutor(max_workers=len(paths_to_watch)) as executor:
        results = executor.map(process_path, paths_to_watch, functions)
        for result in results:
            # threads never return so we never get a result
            print(result)
if __name__ == '__main__':
    main()
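If the long-running work is an external program, another option is to start it with subprocess.Popen, which returns immediately, and to check for the return code later with poll(). The following is only a sketch under that assumption: the running dictionary, the helper names and the "pdf" job type are made up for illustration.

```python
import subprocess
import sys
import time

running = {}  # job type -> Popen handle of the job currently running

def start_if_idle(job_type, command):
    """Start command unless a job of the same type has not been reaped yet."""
    if job_type in running:
        return False
    running[job_type] = subprocess.Popen(command)  # does not wait
    return True

def reap_finished():
    """Return {job_type: return code} for finished jobs and free their slots."""
    finished = {}
    for job_type, proc in list(running.items()):
        code = proc.poll()  # None while the process is still running
        if code is not None:
            finished[job_type] = code
            del running[job_type]
    return finished

if __name__ == "__main__":
    start_if_idle("pdf", [sys.executable, "-c", "import time; time.sleep(1)"])
    while running:
        time.sleep(0.2)  # the real program would scan its folders here
        for job_type, code in reap_finished().items():
            print(job_type, "exited with", code)
```

The monitoring loop stays responsive because neither helper blocks: a job type is simply skipped while its previous run is still in the dictionary.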

Multiprocessing never executing function keeps repeating code before function

I have a multiprocessing pool that runs with 1 thread, and it keeps repeating the code before my function. I have tried different numbers of threads. I write programs like this quite often, so I think I know what is causing the problem, but I don't understand why: usually I use argparse to get the file paths from the user, but this time I wanted to use input instead. No errors are thrown, so I honestly have no clue.
from colorama import Fore
import colorama
import os
import ctypes
import multiprocessing
from multiprocessing import Pool
import random
colorama.init(autoreset=False)
print("headerhere")
#as you can see i used input instead of argparse
g = open(input(Fore.RED + " File Path?: " + Fore.RESET))
gg = open(input(Fore.RED + "File Path?: " + Fore.RESET))
#I messed around with this to see if it was the problem, ultimately disabling it until i fixed it, i just use 1 thread
threads = int(input(Fore.RED + "Amount of Threads?: " + Fore.RESET))
arrange = [lines.replace("\n", "")for lines in g]
good = [items.replace("\n", "") for items in gg]
#this is all of the code before the function that Pool calls
def che(line):
    print("f")
    #i would show my code but as i said this isnt the problem since ive made programs like this before, the only thing i changed is how i take file inputs from the user
def main():
    pool = Pool(1)
    pool.daemon = True
    result = pool.map(che, arrange)
if __name__ == "__main__":
    main()
if __name__ == "__main__":
    main()
Here's a minimal, reproducible example of your issue:
from multiprocessing import Pool
print('header')
def func(n):
    print(f'func {n}')
def main():
    pool = Pool(3)
    pool.map(func,[1,2,3])
if __name__ == '__main__':
    main()
On OSes where "spawn" (Windows and macOS) or "forkserver" (some Unix) is the default start method, each sub-process imports your script. Since print('header') is at global scope, it runs every time the script is imported into a process, so the output is:
header
header
header
header
func 1
func 2
func 3
A multiprocessing script should keep everything meant to run once inside functions, and those functions should be called once by the main script under if __name__ == '__main__':, so the solution is to move the print into your def main():
from multiprocessing import Pool
def func(n):
    print(f'func {n}')
def main():
    print('header')
    pool = Pool(3)
    pool.map(func,[1,2,3])
if __name__ == '__main__':
    main()
Output:
header
func 1
func 2
func 3
If you want the top-level code before the definition of che to be executed only in the master process, place it in a function and call that function in main.
In multiprocessing, the top-level statements are executed by both the master process and every child process. So, if some code should be executed only by the master and not by the children, it should not be placed at the top level. Instead, put such code in functions and invoke those functions in the main scope, i.e., inside the if block controlled by __main__ (or in the main function of your code snippet).
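Applied to the question's script, that advice could look roughly like the sketch below. It is an illustration only: load_lines and check are hypothetical names, check is a placeholder for the real che function, and a temporary file replaces the path that would normally come from input() or argparse.

```python
import multiprocessing
import os
import tempfile

def load_lines(path):
    """Read a file into a list of lines without trailing newlines."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle]

def check(line):
    """Placeholder worker (hypothetical); this is what runs in the child processes."""
    return len(line)

def main(path, n_workers=2):
    # Everything that must happen once (printing the header, reading the
    # input) lives here, so a child that re-imports this module never repeats it.
    print("header")
    lines = load_lines(path)
    with multiprocessing.Pool(n_workers) as pool:
        return pool.map(check, lines)

if __name__ == "__main__":
    # In the real script the path would come from input() or argparse.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("alpha\nbeta\n")
    try:
        print(main(tmp.name))
    finally:
        os.remove(tmp.name)
```

Only function definitions and imports remain at the top level, which is safe to execute any number of times.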

Python multiprocessing loses activity without exiting file

I have a problem where my .py file, which uses maximum CPU through multiprocessing, stops operating without exiting the .py file.
I am running a heavy task that uses all cores on an old MacBook Pro (2012). The task runs fine at first, and I can see four python3.7 tasks in the Activity Monitor window. However, after about 20 minutes, those four python3.7 tasks disappear from the Activity Monitor.
The strangest part is the multiprocessing .py file is still operating, i.e. it never threw an uncaught exception nor exited the file.
Would you guys/gals have any ideas as to what's going on? My guess is 1) it's most likely an error in the script, and 2) the old computer is overheating.
Thanks!
Edit: Below is the multiprocess code, where the multiprocess function to execute is func with a list as its argument. I hope this helps!
import multiprocessing
def main():
    pool = multiprocessing.Pool()
    for i in range(24):
        pool.apply_async(func, args = ([], ))
    pool.close()
    pool.join()
if __name__ == '__main__':
    main()
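Use a context manager to handle closing the pool properly. Note that leaving the with block terminates the pool, so wait for the result with get() while still inside it.
from multiprocessing import Pool
def main():
    with Pool() as p:
        result = p.apply_async(func, args=([],))
        # get() blocks until the task finishes and re-raises any exception from the worker
        print(result.get())
if __name__ == '__main__':
    main()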
I wasn't sure what you were doing with the for i in range() part.
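One thing worth checking in a case like this is whether the tasks are failing silently: an exception raised inside an apply_async task is stored in the returned AsyncResult and only surfaces when get() is called on it. The sketch below keeps every handle and collects results and errors separately. The body of func is invented for illustration (it fails on purpose for one input), and run_all is a hypothetical helper name.

```python
import multiprocessing

def func(items):
    """Hypothetical task: fails on purpose for one input to show error handling."""
    if len(items) == 2:
        raise ValueError("bad input")
    return sum(items)

def run_all(pool, arg_list):
    """Submit every task, then wait for each one and separate results from errors."""
    pending = [pool.apply_async(func, args=(arg,)) for arg in arg_list]
    results, errors = [], []
    for handle in pending:
        try:
            results.append(handle.get())  # re-raises the worker's exception here
        except Exception as exc:
            errors.append(repr(exc))
    return results, errors

if __name__ == "__main__":
    with multiprocessing.Pool() as pool:
        print(run_all(pool, [[1], [1, 2], [1, 2, 3]]))
```

Without the get() calls, the failing task would leave no trace and the script would appear to finish normally.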

Spawning multiple processes with Python

Earlier I tried to use the threading module in Python to create multiple threads. Then I learned about the GIL and how it prevents threads from taking advantage of multiple CPU cores on a single machine. So now I'm trying multiprocessing instead (I don't strictly need separate threads).
Here is some sample code I wrote to see whether distinct processes are being created. But as can be seen in the output below, I'm getting the same process ID every time, so multiple processes are not being created. What am I missing?
import multiprocessing as mp
import os
def pri():
    print(os.getpid())
if __name__=='__main__':
    # Checking number of CPU cores
    print(mp.cpu_count())
    processes=[mp.Process(target=pri()) for x in range(1,4)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Output:
4
12554
12554
12554
The Process class requires a callable as its target.
Instead of running the function in the separate process, you are calling it and passing its result (None in this case) to the Process class.
Just change the following:
mp.Process(target=pri())
with:
mp.Process(target=pri)
Since the subprocesses run in different processes, you may not see their print output in some environments (an IDE console, for example). They also don't share the same memory space. You pass pri() to target, where it needs to be pri. You need to pass a callable object, not execute it.
The prints you see are part of your main thread's execution. Because you pass pri(), the code is actually executed there. You need to change your code so that the pri function returns a value rather than printing it.
Then you need to implement a queue that all your processes write to; when they're done, your main process reads the queue.
A nice feature of the multiprocessing module is the Pool object. It lets you create a pool of worker processes and then just use it. It's more convenient.
I have tried your code; the thing is that the command executes too quickly, so the OS reuses the PIDs. If you add a time.sleep(1) in your pri function, it will work as you expect.
That is true only for Windows. The example below was made on a Windows platform. On Unix-like machines, you won't need the sleep.
The more convenient solution looks like this:
from multiprocessing import Pool
from time import sleep
import os
def pri(x):
    sleep(1)
    return os.getpid()
def use_procs():
    p_pool = Pool(4)
    p_results = p_pool.map(pri, range(1, 4))
    p_pool.close()
    p_pool.join()
    return p_results
if __name__ == '__main__':
    res = use_procs()
    for r in res:
        print(r)
Without the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
6576
6576
6576
>>>
With the sleep:
==================== RESTART: C:/Python27/tests/test2.py ====================
10396
10944
9000
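For the plain Process approach, the queue idea mentioned above might be sketched as follows. The names report_pid and collect_pids are made up for this example; the points to notice are that target receives the function object itself (no parentheses) and that arguments are passed through args.

```python
import multiprocessing as mp
import os

def report_pid(queue):
    """Send this process's ID back to the parent instead of printing it."""
    queue.put(os.getpid())

def collect_pids(n_procs=3):
    """Start n_procs child processes and return the PIDs they report."""
    queue = mp.Queue()
    # Pass the function object itself; its arguments go in args.
    procs = [mp.Process(target=report_pid, args=(queue,)) for _ in range(n_procs)]
    for p in procs:
        p.start()
    pids = [queue.get() for _ in procs]  # read the queue before joining
    for p in procs:
        p.join()
    return pids

if __name__ == "__main__":
    print("parent:", os.getpid())
    print("children:", collect_pids())
```

Each child is a separate process that is alive at the same time as the others, so the reported IDs should differ from each other and from the parent's.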

Python multiprocess profiling

I want to profile a simple multi-process Python script. I tried this code:
import multiprocessing
import cProfile
import time
def worker(num):
    time.sleep(3)
    print('Worker:', num)
if __name__ == '__main__':
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        cProfile.run('p.start()', 'prof%d.prof' %i)
I'm starting 5 processes and therefore cProfile generates 5 different files. Each log only shows what happens inside the start method. How can I get logs that profile the worker function (and show that it took approximately 3 seconds in each case)?
You're profiling the process startup, which is why you're only seeing what happens in p.start() as you say—and p.start() returns once the subprocess is kicked off. You need to profile inside the worker method, which will get called in the subprocesses.
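One way to do that is to wrap the worker so that each child process runs it under its own cProfile.Profile and writes its own stats file. This is a sketch, not the only way to do it: profiled_worker, the delay parameter and the temporary output directory are assumptions added for the example, and the sleep is shortened so it runs quickly.

```python
import cProfile
import multiprocessing
import os
import pstats
import tempfile
import time

def worker(num, delay=0.2):
    time.sleep(delay)
    return num

def profiled_worker(num, out_dir, delay=0.2):
    """Run worker under cProfile inside the child and dump the stats to a file."""
    profiler = cProfile.Profile()
    try:
        return profiler.runcall(worker, num, delay)
    finally:
        profiler.dump_stats(os.path.join(out_dir, "prof%d.prof" % num))

if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    procs = [multiprocessing.Process(target=profiled_worker, args=(i, out_dir))
             for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Inspect one of the per-process reports from the parent.
    stats = pstats.Stats(os.path.join(out_dir, "prof0.prof"))
    stats.sort_stats("cumulative").print_stats(5)
```

Each report now contains the time spent inside worker, including the sleep, rather than the cost of starting the process.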
Having to change your source code just for profiling is not ideal. Let's first see what your code should look like:
import multiprocessing
import time
def worker(num):
    time.sleep(3)
    print('Worker:', num)
if __name__ == '__main__':
    processes = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
I added join here so your main process will wait for your workers before quitting.
Instead of cProfile, try viztracer.
Install it with pip install viztracer, then use its multiprocess feature:
viztracer --log_multiprocess your_script.py
It will generate an HTML file showing every process on a timeline (use WASD to zoom/navigate).
Of course, this includes some information that you are not interested in (such as the internals of the multiprocessing library itself). If you are already satisfied with this, you are good to go. However, if you want a clearer graph for only your worker() function, try the log_sparse feature.
First, decorate the function you want to log with @log_sparse:
import time
from viztracer import log_sparse
@log_sparse
def worker(num):
    time.sleep(3)
    print('Worker:', num)
Then run viztracer --log_multiprocess --log_sparse your_script.py
Only your worker function, taking 3s, will be displayed on the timeline.
If you have a complex process structure and want to profile a particular portion of the code, or perhaps the core work of one process, you can tell the profiler to collect stats only there (see the enable and disable methods: https://docs.python.org/3.6/library/profile.html#module-cProfile). This is what you can do:
import cProfile
def my_particular_worker_code():
    pr = cProfile.Profile()
    pr.enable()
    # Process stuff to be profiled
    pr.disable()
    pr.print_stats(sort='tottime')  # sort as you wish
You can drop the reports to a file as well.
