I'm writing a simple script that executes a system command on a sequence of files.
To speed things up, I'd like to run them in parallel, but not all at once: I need to control the maximum number of simultaneously running commands.
What would be the easiest way to approach this?
If you are calling subprocesses anyway, I don't see the need to use a thread pool. A basic implementation using the subprocess module would be
import subprocess
import os
import time

files = <list of file names>
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update([
            p for p in processes if p.poll() is not None])
On Windows, os.wait() is not available (nor any other method of waiting for any child process to terminate). You can work around this by polling at regular intervals:
for name in files:
    processes.add(subprocess.Popen([command, name]))
    while len(processes) >= max_processes:
        time.sleep(.1)
        processes.difference_update([
            p for p in processes if p.poll() is not None])
The time to sleep for depends on the expected execution time of the subprocesses.
The answer from Sven Marnach is almost right, but there is a problem: once the last file has been submitted, the for loop ends while some of the final max_processes child processes may still be running. The main program then exits, which can in turn terminate those child processes. For me, this behavior happened with the screen command.
The code for Linux looks like this (I have only tested it on Python 2.7):
import subprocess
import os
import time

files = <list of file names>
command = "/bin/touch"
processes = set()
max_processes = 5

for name in files:
    processes.add(subprocess.Popen([command, name]))
    if len(processes) >= max_processes:
        os.wait()
        processes.difference_update(
            [p for p in processes if p.poll() is not None])

# Check if all the child processes were closed
for p in processes:
    if p.poll() is None:
        p.wait()
You need to combine a Semaphore object with threads. A Semaphore is an object that lets you limit the number of threads that are running in a given section of code. In this case we'll use a semaphore to limit the number of threads that can run the os.system call.
First we import the modules we need:
#!/usr/bin/python
import threading
import os
Next we create a Semaphore object. The number four here is the number of threads that can acquire the semaphore at one time. This limits the number of subprocesses that can be run at once.
semaphore = threading.Semaphore(4)
This function simply wraps the call to the subprocess in calls to the Semaphore.
def run_command(cmd):
    semaphore.acquire()
    try:
        os.system(cmd)
    finally:
        semaphore.release()
If you're using Python 2.6+ this can become even simpler as you can use the 'with' statement to perform both the acquire and release calls.
def run_command(cmd):
    with semaphore:
        os.system(cmd)
Finally, to show that this works as expected we'll call the "sleep 10" command eight times.
for i in range(8):
    threading.Thread(target=run_command, args=("sleep 10", )).start()
Running the script using the 'time' program shows that it only takes 20 seconds as two lots of four sleeps are run in parallel.
aw@aw-laptop:~/personal/stackoverflow$ time python 4992400.py
real 0m20.032s
user 0m0.020s
sys 0m0.008s
I merged the solutions by Sven and Thuener into one that waits for trailing processes and also stops if one of the processes crashes:
import logging
import shlex
import subprocess
import time

def removeFinishedProcesses(processes):
    """ given a list of (commandString, process),
        remove those that have completed and return the result
    """
    newProcs = []
    for pollCmd, pollProc in processes:
        retCode = pollProc.poll()
        if retCode is None:
            # still running
            newProcs.append((pollCmd, pollProc))
        elif retCode != 0:
            # failed
            raise Exception("Command %s failed" % pollCmd)
        else:
            logging.info("Command %s completed successfully" % pollCmd)
    return newProcs

def runCommands(commands, maxCpu):
    processes = []
    for command in commands:
        logging.info("Starting process %s" % command)
        proc = subprocess.Popen(shlex.split(command))
        procTuple = (command, proc)
        processes.append(procTuple)
        while len(processes) >= maxCpu:
            time.sleep(.2)
            processes = removeFinishedProcesses(processes)

    # wait for all processes
    while len(processes) > 0:
        time.sleep(0.5)
        processes = removeFinishedProcesses(processes)
    logging.info("All processes completed")
What you are asking for is a thread pool: a fixed number of threads that can be used to execute tasks. When a thread is not running a task, it waits on a task queue in order to get a new piece of code to execute.
There is this thread pool module, but there is a comment saying it is not considered complete yet. There may be other packages out there, but this was the first one I found.
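For what it's worth, newer Python versions ship a thread pool in the standard library. A minimal sketch using concurrent.futures.ThreadPoolExecutor (the file names and command are placeholders, not from the question):

# Standard-library thread pool: each worker runs one external command,
# so at most max_workers commands execute at the same time.
import subprocess
from concurrent.futures import ThreadPoolExecutor

files = ["a.txt", "b.txt", "c.txt"]  # placeholder file names

def touch(name):
    return subprocess.run(["/bin/touch", name]).returncode

with ThreadPoolExecutor(max_workers=5) as pool:
    return_codes = list(pool.map(touch, files))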
If you're running system commands, you can just create the process instances with the subprocess module and call them as you want. There shouldn't be any need to use threads (it's unpythonic), and multiprocessing seems a tad overkill for this task.
This answer is very similar to the other answers present here, but it uses a list instead of a set.
For some reason, when using those answers I was getting a runtime error regarding the size of the set changing.
from subprocess import PIPE
import subprocess
import time

def submit_job_max_len(job_list, max_processes):
    sleep_time = 0.1
    processes = list()
    for command in job_list:
        print 'running {n} processes. Submitting {proc}.'.format(n=len(processes),
                                                                 proc=str(command))
        processes.append(subprocess.Popen(command, shell=False, stdout=None,
                                          stdin=PIPE))
        while len(processes) >= max_processes:
            time.sleep(sleep_time)
            processes = [proc for proc in processes if proc.poll() is None]

    while len(processes) > 0:
        time.sleep(sleep_time)
        processes = [proc for proc in processes if proc.poll() is None]

cmd = '/bin/bash run_what.sh {n}'
job_list = ((cmd.format(n=i)).split() for i in range(100))
submit_job_max_len(job_list, max_processes=50)
I'm writing a small application with a Tkinter GUI to interact with an existing executable that does not have a GUI. The executable can export Solid Edge files to different formats (to PDF, for example; see Solid Edge Translation Services on the web). The goal is to export files to PDF in batch.
The part of the code that calls the executable is below. I need multiprocessing because running the executable takes a while and would otherwise make my app unresponsive.
for cmd in commands:
    print(f'running cmd {cmd}')
    p = Process(target=exportSingleFile, args=(cmd,))
    p.start()
(commands is a list of command strings with arguments for the input file, the output file, and the output file type (pdf).) Something like this:
"C:/Program Files/Solid Edge ST9/Program/SolidEdgeTranslationServices.exe" -i="input file" -o="output file" -t=pdf"
But when I try to replace it with this, it seems my app becomes unresponsive and nothing really happens. I guess it's better to use a pool when exporting potentially dozens of files.
exportResult = []
with Pool() as pool:
    exportResult = pool.imap_unordered(exportSingleFile, commands)
    for r in exportResult:
        print(r)
This is what "exportsinglefile" does
def exportSingleFile(cmd):
    return subprocess.run(cmd, shell=True)
The multiprocessing module is mostly for running multiple parallel Python processes. Since your commands are already running as separate processes, it's redundant to use multiprocessing on top of that.
Instead, consider using the subprocess.Popen constructor directly, which starts a subprocess but does not wait for it to complete. Store these process objects in a list. You can then regularly poll() every process in the list to see if it completed. To schedule such a poll, use Tkinter's after function.
Rough sketch of such an implementation — you will need to adapt this to your situation, and I didn't test it:
import subprocess

class ParallelCommands:
    def __init__(self, commands, num_parallel):
        self.commands = commands[::-1]
        self.num_parallel = num_parallel
        self.processes = []
        self.poll()

    def poll(self):
        # Poll processes for completion, and raise on errors.
        for process in self.processes:
            process.poll()
            if process.returncode is not None and process.returncode != 0:
                raise RuntimeError("Process finished with nonzero exit code")
        # Remove completed processes.
        self.processes = [
            p for p in self.processes
            if p.returncode is None
        ]
        # Start new processes up to the maximum amount.
        while self.commands and len(self.processes) < self.num_parallel:
            command = self.commands.pop()
            process = subprocess.Popen(command, shell=True)
            self.processes.append(process)

    def is_done(self):
        return not self.processes and not self.commands
To start a bunch of commands, running at most 10 at the same time:
commands = ParallelCommands(["ls /bin", "ls /lib"], 10)
To wait for completion synchronously, blocking the UI (just for demonstration purposes):
while not commands.is_done():
    commands.poll()
    time.sleep(0.1)
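To keep the GUI responsive, the same polling can instead be scheduled on Tkinter's event loop with after, as suggested above. A rough sketch, assuming an existing tkinter.Tk() window named root (the 200 ms interval is an arbitrary choice):

# Re-schedule the poll via Tkinter's after() instead of blocking in a loop.
# `root` is assumed to be an existing tkinter.Tk() instance; `commands` is the
# ParallelCommands object created above.
def poll_commands():
    commands.poll()
    if not commands.is_done():
        root.after(200, poll_commands)  # check again in 200 ms
    else:
        print("All exports finished")

root.after(200, poll_commands)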
I'm trying to make a non-blocking subprocess call to run a slave.py script from my main.py program. I need to pass args from main.py to slave.py once, when slave.py is first started via subprocess.call; after that, slave.py runs for a period of time and then exits.
main.py
for insert, (list) in enumerate(list, start =1):
    sys.args = [list]
    subprocess.call(["python", "slave.py", sys.args], shell = True)

{loop through program and do more stuff..}
And my slave script
slave.py
print sys.args
while True:
    {do stuff with args in loop till finished}
    time.sleep(30)
Currently, slave.py blocks main.py from running the rest of its tasks; I simply want slave.py to be independent of main.py once I've passed args to it. The two scripts no longer need to communicate.
I've found a few posts on the net about non-blocking subprocess.call, but most of them are centered on requiring communication with slave.py at some point, which I currently do not need. Would anyone know how to implement this in a simple fashion?
You should use subprocess.Popen instead of subprocess.call.
Something like:
subprocess.Popen(["python", "slave.py"] + sys.argv[1:])
From the docs on subprocess.call:
Run the command described by args. Wait for command to complete, then return the returncode attribute.
(Also, don't use a list to pass in the arguments if you're going to use shell=True.)
Here's an MCVE[1] that demonstrates a non-blocking subprocess call:
import subprocess
import time

p = subprocess.Popen(['sleep', '5'])
while p.poll() is None:
    print('Still sleeping')
    time.sleep(1)

print('Not sleeping any longer. Exited with returncode %d' % p.returncode)
An alternative approach relies on more recent changes to the Python language that allow for coroutine-based parallelism:
# python3.5 required but could be modified to work with python3.4.
import asyncio

async def do_subprocess():
    print('Subprocess sleeping')
    proc = await asyncio.create_subprocess_exec('sleep', '5')
    returncode = await proc.wait()
    print('Subprocess done sleeping. Return code = %d' % returncode)

async def sleep_report(number):
    for i in range(number + 1):
        print('Slept for %d seconds' % i)
        await asyncio.sleep(1)

loop = asyncio.get_event_loop()
tasks = [
    asyncio.ensure_future(do_subprocess()),
    asyncio.ensure_future(sleep_report(5)),
]
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
[1] Tested on OS X using Python 2.7 and Python 3.6.
There are three levels of thoroughness here.

As mgilson says, if you just swap out subprocess.call for subprocess.Popen, keeping everything else the same, then main.py will not wait for slave.py to finish before it continues. That may be enough by itself.

If you care about zombie processes hanging around, you should save the object returned by subprocess.Popen and at some later point call its wait method. (The zombies go away automatically when main.py exits, so this is only a serious problem if main.py runs for a very long time and/or might create many subprocesses.)

Finally, if you don't want a zombie but you also don't want to decide where to do the waiting (this might be appropriate if both processes run for a long and unpredictable time afterward), use the python-daemon library to have the slave disassociate itself from the master; in that case you can continue using subprocess.call in the master.
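A minimal sketch of the middle option, saving each Popen object and reaping it later (the arguments passed to slave.py are illustrative):

# Save each Popen object and wait on it later so no zombies are left behind.
import subprocess

children = []
for arg in ["job1", "job2", "job3"]:  # illustrative arguments
    # Popen returns immediately, so main.py keeps running.
    children.append(subprocess.Popen(["python", "slave.py", str(arg)]))

# ... do the rest of main.py's work here ...

# Reap the children at a convenient point.
for child in children:
    child.wait()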
For Python 3.8.x
import shlex
import subprocess
cmd = "<full filepath plus arguments of child process>"
cmds = shlex.split(cmd)
p = subprocess.Popen(cmds, start_new_session=True)
This will allow the parent process to exit while the child process continues to run. Not sure about zombies.
Tested on Python 3.8.1 on macOS 10.15.5
The easiest solution for your non-blocking situation would be to add & at the end of the Popen like this:
subprocess.Popen(["python", "slave.py", " &"])
This does not block the execution of the rest of the program.
If you want to start a function several times with different arguments in a non-blocking way, you can use a ThreadPoolExecutor.
You submit your function calls to the executor like this:
from concurrent.futures import ThreadPoolExecutor

def threadmap(fun, xs):
    with ThreadPoolExecutor(max_workers=8) as executer:
        return list(executer.map(fun, xs))
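For example, a hypothetical run_slave wrapper (not part of the original question) could be mapped over several argument sets; each call then runs in its own worker thread, so up to eight slave.py processes run at the same time:

# Hedged usage sketch for threadmap; run_slave and the argument list are made up.
import subprocess

def run_slave(arg):
    return subprocess.call(["python", "slave.py", str(arg)])

return_codes = threadmap(run_slave, ["job1", "job2", "job3"])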
I want to launch an external command from Python for about 8000 files. Every file is processed independently from the others. The only constraint is to continue execution once all files have been processed. I have 4 physical cores, each one with 2 logical cores (multiprocessing.cpu_count() returns 8). My idea was to use a pool of four parallel independent processes that are to be run on 4 of the 8 cores. This way my machine should be usable in the meantime.
Here's what I've been doing:
import multiprocessing
import subprocess
import os
from multiprocessing.pool import ThreadPool

def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count()/2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        pool.apply_async(subprocess.Popen, (cmd,))
    pool.close()
    pool.join()

def main():
    process_files('dir1', 'dir2', 'mol:H')
    do_some_stuff('dir2')
    process_files('dir2', 'dir3', 'mol:a')
    do_more_stuff('dir3')
A sequential treatment takes 120 s for a batch of 100 files. The multiprocessing version outlined above (the process_files function) takes only 20 s for the batch. However, when I run process_files on the whole set of 8000 files, my PC hangs and does not unfreeze even after an hour.
My questions are:
1) I thought ThreadPool is supposed to initialize a pool of processes (of multiprocessing.cpu_count()/2 processes here, to be exact). However, my computer hanging on 8000 files but not on 100 suggests that maybe the size of the pool is not being taken into account. Either that, or I'm doing something wrong. Could you explain?
2) Is this the right way to launch independent processes under Python when each of them must launch an external command, and in such a way that all the resources are not taken up by the processing?
I think your basic problem is the use of subprocess.Popen. That method does not wait for a command to complete before returning. Since the function returns immediately (even though the command is still running), the function is finished as far as your ThreadPool is concerned and it can spawn another, which means that you end up spawning 8000 or so processes.
You would probably have better luck using subprocess.check_call:
Run command with arguments. Wait for command to complete. If the exit code was zero then return, otherwise raise CalledProcessError. The CalledProcessError object will have the return code in the returncode attribute.
So:
def process_files(input_dir, output_dir, option):
    pool = ThreadPool(multiprocessing.cpu_count()/2)
    for filename in os.listdir(input_dir):  # about 8000 files
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        cmd = ['molconvert', option, f_in, '-o', f_out]
        pool.apply_async(subprocess.check_call, (cmd,))
    pool.close()
    pool.join()
If you really don't care about the exit code, then you may want subprocess.call, which will not raise an exception in the event of a non-zero exit code from the process.
If you are using Python 3, I would consider using the map method of concurrent.futures.ThreadPoolExecutor.
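A minimal sketch of that suggestion, reusing the layout of the question's code (the integer division of cpu_count(), with a floor of one worker, is my assumption so that max_workers is a valid int; I have not run this against molconvert):

# Each worker thread blocks in check_call, so at most max_workers
# molconvert processes run at the same time.
import multiprocessing
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def process_files(input_dir, output_dir, option):
    def convert(filename):
        f_in = os.path.join(input_dir, filename)
        f_out = os.path.join(output_dir, filename)
        return subprocess.check_call(['molconvert', option, f_in, '-o', f_out])

    workers = max(1, multiprocessing.cpu_count() // 2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map blocks until every file has been processed.
        list(pool.map(convert, os.listdir(input_dir)))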
Alternatively, you can manage a list of subprocesses yourself.
The following example defines a function to start ffmpeg to convert a video file to Theora/Vorbis format. It returns a Popen object for each started subprocess.
def startencoder(iname, oname, offs=None):
    args = ['ffmpeg']
    if offs is not None and offs > 0:
        args += ['-ss', str(offs)]
    args += ['-i', iname, '-c:v', 'libtheora', '-q:v', '6', '-c:a',
             'libvorbis', '-q:a', '3', '-sn', oname]
    with open(os.devnull, 'w') as bb:
        p = subprocess.Popen(args, stdout=bb, stderr=bb)
    return p
In the main program, a list of Popen objects representing running subprocesses is maintained like this.
outbase = tempname()
ogvlist = []
procs = []
maxprocs = cpu_count()
for n, ifile in enumerate(argv):
    # Wait while the list of processes is full.
    while len(procs) == maxprocs:
        manageprocs(procs)
    # Add a new process
    ogvname = outbase + '-{:03d}.ogv'.format(n + 1)
    procs.append(startencoder(ifile, ogvname, offset))
    ogvlist.append(ogvname)
# All jobs have been submitted, wait for them to finish.
while len(procs) > 0:
    manageprocs(procs)
So a new process is only started when there are fewer running subprocesses than cores. Code that is used multiple times is separated out into the manageprocs function.
def manageprocs(proclist):
    # Iterate over a copy: removing items from the list we are looping over
    # would skip entries.
    for pr in proclist[:]:
        if pr.poll() is not None:
            proclist.remove(pr)
    sleep(0.5)
The call to sleep is used to prevent the program from spinning in the loop.