Multiprocessing with subprocess - Python

I'm new to Python's subprocess module; currently my implementation is not multiprocessed.
import subprocess, shlex

def forcedParsing(fname):
    cmd = 'strings "%s"' % (fname)
    #print cmd
    args = shlex.split(cmd)
    try:
        sp = subprocess.Popen(args, shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = sp.communicate()
    except OSError as e:
        print "Error no %s Message %s" % (e.errno, e.strerror)
        return None
    if sp.returncode == 0:
        #print "Processed %s" % fname
        return out

res = []
for f in file_list: res.append(forcedParsing(f))
My questions:
1. Is sp.communicate() a good way to go, or should I use poll()? If I use poll() I need a separate process that monitors whether the child has finished, right?
2. Should I fork at the for loop?

1) subprocess.communicate() seems the right option for what you are trying to do, and you don't need to poll the process: communicate() returns only when it has finished.
2) You mean forking to parallelize work? Take a look at multiprocessing (Python >= 2.6). Running parallel processes using subprocess is of course possible, but it's quite a bit of work: you cannot just call communicate(), which is blocking.
About your code:
cmd = 'strings "%s"' % (fname)
args = shlex.split(cmd)
Why not simply?
args = ["strings", fname]
As for this ugly pattern:
res = []
for f in file_list: res.append(forcedParsing(f))
You should use list comprehensions whenever possible:
res = [forcedParsing(f) for f in file_list]
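Regarding point 2 above, a minimal sketch of what parallelizing the loop with multiprocessing could look like, assuming forcedParsing and file_list are defined as in the question:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool()  # one worker process per CPU core by default
    try:
        # blocks until every file has been processed, but the files
        # are handled by the worker processes in parallel
        res = pool.map(forcedParsing, file_list)
    finally:
        pool.close()
        pool.join()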

About question 2: forking at the for loop will mostly speed things up if the script is supposed to run on a system with multiple cores/processors. It will consume more memory, though, and will stress I/O harder. There will be a sweet spot somewhere that depends on the number of files in file_list, but only benchmarking on a realistic target system can tell you where it is. If you find that number, you could add an if len(file_list) > <your number>: with optional fork()'ing [Edit: rather, as @tokland says, via multiprocessing if it's available on your Python version (2.6+)] that chooses the most efficient strategy on a per-job basis.
Read about Python profiling here: http://docs.python.org/library/profile.html
If you're on Linux, you can also run time: http://linuxmanpages.com/man1/time.1.php

There are several warnings in the subprocess documentation that advise you to use communicate() to avoid problems with a process blocking, so it would be a good idea to use that.
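For example, a minimal sketch (the strings /bin/ls command is only a stand-in for the real one):

import subprocess

sp = subprocess.Popen(["strings", "/bin/ls"],
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# Reading one pipe with read() while the other fills up can deadlock;
# communicate() drains both streams while waiting, so it cannot.
out, err = sp.communicate()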

Related

Python subprocess always waits for program [duplicate]

I'm trying to port a shell script to a much more readable Python version. The original shell script starts several processes (utilities, monitors, etc.) in the background with "&". How can I achieve the same effect in Python? I'd like these processes not to die when the Python script completes. I am sure it's related to the concept of a daemon somehow, but I couldn't find how to do this easily.
While jkp's solution works, the newer way of doing things (and the way the documentation recommends) is to use the subprocess module. For simple commands it's equivalent, but it offers more options if you want to do something complicated.
Example for your case:
import subprocess
subprocess.Popen(["rm","-r","some.file"])
This will run rm -r some.file in the background. Note that calling .communicate() on the object returned from Popen will block until it completes, so don't do that if you want it to run in the background:
import subprocess
ls_output = subprocess.Popen(["sleep", "30"])
ls_output.communicate() # Will block for 30 seconds
See the documentation here.
Also, a point of clarification: "Background" as you use it here is purely a shell concept; technically, what you mean is that you want to spawn a process without blocking while you wait for it to complete. However, I've used "background" here to refer to shell-background-like behavior.
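If you later want to check from the parent whether such a background process has finished, without blocking, Popen.poll() does that; a small sketch (the sleep command is just a stand-in):

import subprocess

p = subprocess.Popen(["sleep", "30"])
# ... do other work ...
if p.poll() is None:
    print("still running")
else:
    print("finished with return code %s" % p.returncode)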
Note: This answer is less current than it was when posted in 2009. Using the subprocess module shown in other answers is now recommended in the docs
(Note that the subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using these functions.)
If you want your process to start in the background you can either use system() and call it in the same way your shell script did, or you can spawn it:
import os
os.spawnl(os.P_DETACH, 'some_long_running_command')
(or, alternatively, you may try the os.P_NOWAIT flag, which is available on more platforms).
See the documentation here.
You probably want the answer to "How to call an external command in Python".
The simplest approach is to use the os.system function, e.g.:
import os
os.system("some_command &")
Basically, whatever you pass to the system function will be executed the same as if you'd passed it to the shell in a script.
I found this here:
On Windows (Win XP), the parent process will not finish until longtask.py has finished its work. That is not what you want in a CGI script. The problem is not specific to Python; the PHP community has the same problem.
The solution is to pass the DETACHED_PROCESS process creation flag to the underlying CreateProcess function in the Win32 API. If you happen to have pywin32 installed, you can import the flag from the win32process module; otherwise you should define it yourself:
DETACHED_PROCESS = 0x00000008
pid = subprocess.Popen([sys.executable, "longtask.py"],
                       creationflags=DETACHED_PROCESS).pid
Use subprocess.Popen() with the close_fds=True parameter, which will allow the spawned subprocess to be detached from the Python process itself and continue running even after Python exits.
https://gist.github.com/yinjimmy/d6ad0742d03d54518e9f
import os, time, sys, subprocess

if len(sys.argv) == 2:
    time.sleep(5)
    print 'track end'
    if sys.platform == 'darwin':
        subprocess.Popen(['say', 'hello'])
else:
    print 'main begin'
    subprocess.Popen(['python', os.path.realpath(__file__), '0'], close_fds=True)
    print 'main end'
Capture output and run in the background at the same time with threading
As mentioned in this answer, if you capture the output with stdout= and then try to read(), the process can block.
However, there are cases where you need this. For example, I wanted to launch two processes that talk over a port between them, and save their stdout both to a log file and to stdout.
The threading module allows us to do that.
First, have a look at how to do the output redirection part alone in this question: Python Popen: Write to stdout AND log file simultaneously
Then:
main.py
#!/usr/bin/env python3
import os
import subprocess
import sys
import threading

def output_reader(proc, file):
    while True:
        byte = proc.stdout.read(1)
        if byte:
            sys.stdout.buffer.write(byte)
            sys.stdout.flush()
            file.buffer.write(byte)
        else:
            break

with subprocess.Popen(['./sleep.py', '0'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc1, \
     subprocess.Popen(['./sleep.py', '10'], stdout=subprocess.PIPE, stderr=subprocess.PIPE) as proc2, \
     open('log1.log', 'w') as file1, \
     open('log2.log', 'w') as file2:
    t1 = threading.Thread(target=output_reader, args=(proc1, file1))
    t2 = threading.Thread(target=output_reader, args=(proc2, file2))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
sleep.py
#!/usr/bin/env python3
import sys
import time

for i in range(4):
    print(i + int(sys.argv[1]))
    sys.stdout.flush()
    time.sleep(0.5)
After running:
./main.py
stdout gets updated every 0.5 seconds, two lines at a time, to contain:
0
10
1
11
2
12
3
13
and each log file contains the respective log for a given process.
Inspired by: https://eli.thegreenplace.net/2017/interacting-with-a-long-running-child-process-in-python/
Tested on Ubuntu 18.04, Python 3.6.7.
You probably want to start investigating the os module for forking off separate processes (open an interactive session and issue help(os)). The relevant functions are fork and any of the exec family. To give you an idea of how to start, put something like this in a function that performs the fork (the function needs to take a list or tuple args as an argument that contains the program's name and its parameters; you may also want to define stdin, stdout and stderr for the new process):
try:
    pid = os.fork()
except OSError as e:
    ## some debug output
    sys.exit(1)
if pid == 0:
    ## eventually use os.putenv(..) to set environment variables
    ## os.execv expects args to include the program name as args[0]
    os.execv(args[0], args)
You can use
import os

pid = os.fork()
if pid == 0:
    # continue with other code ...
This will make the Python process run in the background.
I haven't tried this yet, but using .pyw files instead of .py files should help. .pyw files don't have a console, so in theory the script should not show a window and should behave like a background process.

Handling interactive shells with Python subprocess

I am trying to run multiple instances of a console-based game (Dungeon Crawl Stone Soup -- for research purposes, naturally) using a multiprocessing pool to evaluate each run.
In the past, when I've used a pool to evaluate similar code (genetic algorithms), I've used subprocess.call to split off each process. However, with dcss being quite interactive, having a shared subshell seems to be problematic.
I have the code I normally use for this kind of thing, with crawl replacing other applications I've thrown a GA at. Is there a better way to handle highly interactive shells than this? I'd considered kicking off a screen for each instance, but thought there was a cleaner way. My understanding was that shell=True should spawn a subshell, but I guess it is spawning one in a way that is shared between each call.
I should mention I have a bot running the game, so I don't want any actual interaction from the user's end to occur.
# Kick off the GA execution
pool_args = zip(trial_ids, run_types, self.__population)
pool.map(self._GAExecute, pool_args)

---

# called by pool.map
def _GAExecute(self, pool_args):
    trial_id = pool_args[0]
    run_type = pool_args[1]
    genome = pool_args[2]
    self._RunSimulation(trial_id)

# Call the actual binary
def _RunSimulation(self, trial_id):
    command = "./%s" % self.__crawl_binary
    name = "-name %s" % trial_id
    rc = "-rc %s" % os.path.join(self.__output_dir, 'qw-%s' % trial_id, "qw -%s.rc" % trial_id)
    seed = "-seed %d" % self.__seed
    cdir = "-dir %s" % os.path.join(self.__output_dir, 'qw-%s' % trial_id)
    shell_command = "%s %s %s %s %s" % (command, name, rc, seed, cdir)
    call(shell_command, shell=True)
You can indeed associate stdout and stderr with files, as in the answer from @napuzba:
fout = open('stdout.txt','w')
ferr = open('stderr.txt','w')
subprocess.call(cmd, stdout=fout, stderr=ferr)
Another option is to use Popen instead of call. The difference is that call waits for completion (it is blocking) while Popen does not; see What's the difference between subprocess Popen and call (how can I use them)?
Using Popen, you can then keep stdout and stderr inside your object, and then use them later, without having to rely on a file:
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()
stderr = p.stderr.read()
stdout = p.stdout.read()
Another potential advantage of this method is that you could run multiple instances of Popen without waiting for completion instead of having a thread pool:
processes = [
    subprocess.Popen(cmd1, stdout=subprocess.PIPE, stderr=subprocess.PIPE),
    subprocess.Popen(cmd2, stdout=subprocess.PIPE, stderr=subprocess.PIPE),
    subprocess.Popen(cmd3, stdout=subprocess.PIPE, stderr=subprocess.PIPE),
]
for p in processes:
    if p.poll() is not None:   # poll() returns None while the process is still running
        pass  # process completed
    else:
        pass  # no completion yet
On a side note, you should avoid shell=True if you can, and if you do not use it, Popen expects a list as the command instead of a string. Do not build this list manually; use shlex, which will take care of all the corner cases for you, e.g.:
Popen(shlex.split(cmd), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
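For instance, shlex correctly keeps a quoted argument together where a naive str.split() would break it apart (the command string here is purely illustrative):

import shlex

cmd = 'crawl -name "trial 1" -rc /tmp/qw-1.rc'
print(shlex.split(cmd))
# ['crawl', '-name', 'trial 1', '-rc', '/tmp/qw-1.rc']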
Specify the standard input, standard output and standard error with unique file handles for each call:
import subprocess

cmd = ""
fout = open('stdout.txt', 'w')
fin = open('stdin.txt', 'r')
ferr = open('stderr.txt', 'w')
subprocess.call(cmd, stdout=fout, stdin=fin, stderr=ferr)

Subprocess doesn't respect arguments when using multiprocessing

The main objective here is to create a daemon-spawning function. The daemons need to run arbitrary programs (i.e. use subprocess).
What I have so far in my daemonizer.py module is:
import os
from multiprocessing import Process
from time import sleep
from subprocess import call, STDOUT

def _daemon_process(path_to_exec, std_out_path, args, shell):
    with open(std_out_path, 'w') as fh:
        args = (str(a) for a in args)
        if shell:
            fh.write("*** LAUNCHING IN SHELL: {0} ***\n\n".format(" ".join([path_to_exec] + list(args))))
            retcode = call(" ".join([path_to_exec] + list(args)), stderr=STDOUT, stdout=fh, shell=True)
        else:
            fh.write("*** LAUNCHING WITHOUT SHELL: {0} ***\n\n".format([path_to_exec] + list(args)))
            retcode = call([path_to_exec] + list(args), stderr=STDOUT, stdout=fh, shell=False)
        if retcode:
            fh.write("\n*** DAEMON EXITED WITH CODE {0} ***\n".format(retcode))
        else:
            fh.write("\n*** DAEMON DONE ***\n")

def daemon(path_to_executable, std_out=os.devnull, daemon_args=tuple(), shell=True):
    d = Process(name='daemon', target=_daemon_process, args=(path_to_executable, std_out, daemon_args, shell))
    d.daemon = True
    d.start()
    sleep(1)
When trying to run this in bash (this will create a file called test.log in your current directory):
python -c"import daemonizer;daemonizer.daemon('ping', std_out='test.log', daemon_args=('-c', '5', '192.168.1.1'), shell=True)"
It correctly spawns a daemon that launches ping, but it doesn't respect the arguments passed. This is true when shell is set to False as well. The log file clearly states that it attempted to launch it with the arguments passed.
As a proof of concept creating the following executable:
echo "ping -c 5 192.168.1.1" > ping_test
chmod +x ping_test
The following works as intended:
python -c"import daemonizer;daemonizer.daemon('./ping_test', std_out='test.log', shell=True)"
If I test the same call code outside of the multiprocessing.Process target, it does work as expected.
So how do I fix this mess so that I can spawn processes with arguments?
I'm open to entirely different structures and modules, but they should be included among the standard ones and be compatible with Python 2.7.x. The requirement is that the daemon function should be callable several times asynchronously within a script and produce a daemon each time, and their target processes should be able to end up on different CPUs. Also, the scripts need to be able to end without affecting the spawned daemons, of course.
As a bonus, I noticed I needed to have a sleep for the spawning to work at all, or else the script terminates too fast. Any way to get around that arbitrary hack, and/or how long do I really need to have it wait to be safe?
Your arguments are being "used up" by the printing of them!
First, you do this:
args = (str(a) for a in args)
That creates a generator, not a list or tuple. So when you later do this:
list(args)
That consumes the arguments, and they will not be seen a second time. So you do this again:
list(args)
And get an empty list!
You could fix this by removing the fh.write() calls that log the arguments, but much better would be to simply create a list in the first place:
args = [str(a) for a in args]
Then you can use args directly and not list(args). And it will always have the arguments inside.
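The effect is easy to reproduce in isolation:

args = (str(a) for a in ('-c', 5, '192.168.1.1'))
print(list(args))   # ['-c', '5', '192.168.1.1'] - the generator is consumed here
print(list(args))   # [] - nothing is left for the actual call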

Best way to fork multiple shell commands/processes in Python?

Most of the examples I've seen with os.fork and the subprocess/multiprocessing modules show how to fork a new instance of the calling Python script or a chunk of Python code. What would be the best way to spawn a set of arbitrary shell commands concurrently?
I suppose I could just use subprocess.call or one of the Popen commands and pipe the output to a file, which I believe will return immediately, at least to the caller. I know this is not that hard to do; I'm just trying to figure out the simplest, most Pythonic way to do it.
Thanks in advance
All calls to subprocess.Popen return immediately to the caller. It's the calls to wait and communicate which block. So all you need to do is spin up a number of processes using subprocess.Popen (set stdin to /dev/null for safety), and then one by one call communicate until they're all complete.
Naturally I'm assuming you're just trying to start a bunch of unrelated (i.e. not piped together) commands.
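A minimal sketch of that pattern (the commands are only placeholders):

import os
import subprocess

commands = [["sleep", "2"], ["uname", "-a"], ["ls", "-l"]]  # placeholder commands

procs = [
    subprocess.Popen(cmd, stdin=open(os.devnull, 'rb'),
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    for cmd in commands
]

# All of them are now running concurrently; communicate() blocks, so after
# this loop every process has finished and its output has been collected.
results = [p.communicate() for p in procs]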
I like to use PTYs instead of pipes. For a bunch of processes where I only wanted to capture error messages, I did this:
RNULL = open('/dev/null', 'r')
WNULL = open('/dev/null', 'w')
logfile = open("myprocess.log", "a", 1)
REALSTDERR = sys.stderr
sys.stderr = logfile
This next part was in a loop spawning about 30 processes.
sys.stderr = REALSTDERR
master, slave = pty.openpty()
self.subp = Popen(self.parsed, shell=False, stdin=RNULL, stdout=WNULL, stderr=slave)
sys.stderr = logfile
After this I had a select loop which collected any error messages and sent them to the single log file. Using PTYs meant that I never had to worry about partial lines getting mixed up because the line discipline provides simple framing.
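A much-simplified Python 3 sketch of that idea (not the original code), spawning a couple of placeholder commands and multiplexing their PTY output into one log file with select:

import os
import pty
import select
import subprocess

logfile = open("myprocess.log", "a", 1)
masters, procs = [], []
for cmd in (["ls", "/nonexistent"], ["uname", "-a"]):   # placeholder commands
    master, slave = pty.openpty()
    procs.append(subprocess.Popen(cmd, stdin=subprocess.DEVNULL,
                                  stdout=subprocess.DEVNULL, stderr=slave))
    os.close(slave)                  # the child keeps its own copy of the slave end
    masters.append(master)

while masters:
    ready, _, _ = select.select(masters, [], [])
    for fd in ready:
        try:
            data = os.read(fd, 1024)
        except OSError:              # Linux raises EIO once the child has exited
            data = b''
        if data:
            logfile.write(data.decode("utf-8", "replace"))
        else:
            os.close(fd)
            masters.remove(fd)

for p in procs:
    p.wait()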
There is no single best approach for all possible circumstances; the best depends on the problem at hand.
Here's how to spawn a process and save its output to a file, combining stdout and stderr:
import os
import subprocess
import sys

def spawn(cmd, output_file):
    on_posix = 'posix' in sys.builtin_module_names
    return subprocess.Popen(cmd, close_fds=on_posix, bufsize=-1,
                            stdin=open(os.devnull, 'rb'),
                            stdout=output_file,
                            stderr=subprocess.STDOUT)
To spawn multiple processes that can run in parallel with your script and each other:
processes, files = [], []
try:
    for i, cmd in enumerate(commands):
        files.append(open('out%d' % i, 'wb'))
        processes.append(spawn(cmd, files[-1]))
finally:
    for p in processes:
        p.wait()
    for f in files:
        f.close()
Note: cmd is a list everywhere.
I suppose I could just use subprocess.call or one of the Popen commands and pipe the output to a file, which I believe will return immediately, at least to the caller.
That's not a good way to do it if you want to process the data.
In this case, it is better to do
sp = subprocess.Popen(['ls', '-l'], stdout=subprocess.PIPE)
and then call sp.communicate(), or read directly from sp.stdout.
If the data is to be processed in the calling program at a later time, there are two ways to go:
1. You can retrieve the data as soon as possible, maybe via a separate thread, reading it and storing it somewhere the consumer can get it.
2. You can let the producing subprocess block and retrieve the data from it when you need it. The subprocess produces as much data as fits into the pipe buffer (usually 64 KiB) and then blocks on further writes. As soon as you need the data, you read() from the subprocess object's stdout (maybe stderr as well) and use it - or, again, you call sp.communicate() at that later time.
Way 1 is the way to go if producing the data takes a lot of time, so that your program would otherwise have to wait.
Way 2 is preferable if the amount of data is quite large and/or the data is produced so fast that intermediate buffering would make no sense.
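A small illustration of way 2 (the command is only a placeholder that produces plenty of output):

import subprocess

sp = subprocess.Popen(['ls', '-lR', '/usr'], stdout=subprocess.PIPE)
# ... do other work; the child fills the pipe buffer and then blocks ...
out, _ = sp.communicate()   # drain the pipe once the data is actually needed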
See an older answer of mine including code snippets that does the following:
- Uses processes, not threads, for blocking I/O because they can more reliably be p.terminate()'d
- Implements a retriggerable timeout watchdog that restarts counting whenever some output happens
- Implements a long-term timeout watchdog to limit overall runtime
- Can feed in stdin (although I only need to feed in one-time short strings)
- Can capture stdout/stderr in the usual Popen manner (only stdout is coded, with stderr redirected to stdout, but they can easily be separated)
- It's almost real-time because it only checks for output every 0.2 seconds. But you could decrease this or remove the waiting interval easily
- Lots of debugging printouts are still enabled to see what's happening when.
For spawning multiple concurrent commands, you would need to alter the class RunCmd to instantiate multiple read-output/write-input queues and to spawn multiple Popen subprocesses.
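The linked code is not reproduced here, but a much-simplified sketch of the retriggerable-timeout idea (using a reader thread and a queue rather than helper processes, so it is only an approximation of the approach described above):

import subprocess
import threading
import time
try:
    from queue import Queue, Empty    # Python 3
except ImportError:
    from Queue import Queue, Empty    # Python 2

def run_with_idle_timeout(cmd, idle_timeout=5.0, total_timeout=60.0):
    """Run cmd, killing it if it stays silent for idle_timeout seconds or
    runs longer than total_timeout seconds; return the captured output."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    lines = Queue()

    def _reader():
        for line in iter(proc.stdout.readline, b''):
            lines.put(line)

    t = threading.Thread(target=_reader)
    t.daemon = True
    t.start()

    output = []
    start = last_output = time.time()
    while proc.poll() is None:
        try:
            output.append(lines.get(timeout=0.2))   # check for output every 0.2 s
            last_output = time.time()               # output seen: restart the idle clock
        except Empty:
            pass
        now = time.time()
        if now - last_output > idle_timeout or now - start > total_timeout:
            proc.kill()                             # a watchdog fired
            break
    while not lines.empty():                        # drain whatever is left
        output.append(lines.get())
    return b''.join(output)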

subprocess replacement of popen2 with Python

I tried to run this code from the book 'Python Standard Library' by Fredrik Lundh.
import popen2, string
fin, fout = popen2.popen2("sort")
fout.write("foo\n")
fout.write("bar\n")
fout.close()
print fin.readline(),
print fin.readline(),
fin.close()
It runs well, with a warning:
~/python_standard_library_oreilly_lunde/scripts/popen2-example-1.py:1:
DeprecationWarning: The popen2 module is deprecated. Use the subprocess module.
How can I translate this to use subprocess? I tried the following, but it doesn't work:
from subprocess import *
p = Popen("sort", shell=True, stdin=PIPE, stdout=PIPE, close_fds=True)
p.stdin("foo\n") #p.stdin("bar\n")
import subprocess
proc = subprocess.Popen(['sort'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
proc.stdin.write('foo\n')
proc.stdin.write('bar\n')
out, err = proc.communicate()
print(out)
Within the multiprocessing module there is a class called Pool which might be perfect for your needs, considering you are planning to do a sort (not sure how huge the data is, but...).
It optimizes itself to the number of cores your system has, i.e. only as many processes are spawned as there are cores. Of course this is customizable.
from multiprocessing import Pool

def main():
    po = Pool()
    po.apply_async(sort_fn, (any_args,), callback=save_data)
    po.close()
    po.join()
    return

def sort_fn(any_args):
    # do whatever it is that you want to do in a separate process
    return data

def save_data(data):
    # data is an object; store it in a file, MySQL, or ...
    return
