Subprocess started from Python runs slower than from bash

I am using the following Python code to run a subprocess and collect its output as a string:
import shlex
import subprocess

def run(command):
    ''' Run a command and return the output as a string '''
    args = shlex.split(command)
    out = subprocess.Popen(args,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE).communicate()[0]
    # Save to log file
    with open("log_file.txt", "a") as log_file:
        log_file.write("$ " + command + "\n")
        log_file.write(out)
    return out
My goal is to run a benchmark application (openssl speed) multiple times and use Python to parse the output and calculate the average results.
However, I have noticed that the results are consistently slower (about 10%) than when I run the same command directly from the command line in bash.
How would you explain this?
EDIT
The output of the command is quite short: about 10 lines.
Also, note that the benchmark does not print any output while the performance test is running. It only prints the results outside the critical loop.
In the script I only run the benchmark of a particular cipher at a time, so for example I use the following arguments:
openssl speed -elapsed -engine my_engine rsa2048
Note that I am using a custom engine (target of the benchmark) and not the standard software implementation.
My engine spawns another pthread, but I would not expect that to make a big difference, since the Python script is not supposed to interact with it in any way.
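A quick way to check whether the capture itself matters is to run the same argv once without pipes and compare the numbers openssl reports; a rough diagnostic sketch, not from the original post, assuming Python 3:

# Diagnostic sketch: run the identical argv once with inherited stdout/stderr
# and once captured, then compare what openssl itself reports. If only the
# captured run is slower, the pipes are the suspect; if both are slower than
# bash, look at the environment, the CPU governor, or the engine's extra thread.
import shlex
import subprocess

cmd = shlex.split("openssl speed -elapsed -engine my_engine rsa2048")

subprocess.call(cmd)  # output goes straight to the terminal, nothing captured
captured = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
print(captured.decode())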

Related

How to Parallelize a Python program on Linux

I have a script that takes as input a list of filenames and loops over them to generate one output file per input file, so I think this is a case which can easily be parallelized.
I have an 8-core machine.
I tried using parallel on this command:
python perfile_code.py list_of_files.txt
But I can't make it work; the specific question is: how do I use parallel in bash with a Python command on Linux, along with the arguments, for the specific case mentioned above?
There is a Linux parallel command (sudo apt-get install parallel), which I read somewhere can do this job, but I don't know how to use it.
Most internet resources explain how to do it in Python, but can it be done in bash?
Please help, thanks.
Based on an answer, here is an example I put together that is still not working; please suggest how to make it work.
I have a folder with 2 files; in this example I just want to create duplicates of them with different names, in parallel.
# "filelist" is the directory containing the two files, a.txt and b.txt.
# I pass a .txt file with both names to the main program.
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import sys

def translate(filename):
    print(filename)
    f = open(filename, "r")
    g = open(filename + ".x", "w")
    for line in f:
        g.write(line)

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            executor.submit(translate, "filelist/" + filename)
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    main(sys.argv[1])
Based on your comment,
@Ouroborus no, consider this: opensource.com/article/18/5/gnu-parallel. I want to run a python program along with this parallel, for a very specific case. If an arbitrary convert program can be piped to parallel, why couldn't a python program?
I think this might help:
convert wasn't chosen arbitrarily. It was chosen because it is a better known program that (roughly) maps a single input file, provided via the command line, to a single output file, also provided via the command line.
The typical shell for loop can be used to iterate over a list. In the article you linked, they show an example
for i in *jpeg; do convert $i $i.png ; done
This (again, roughly) takes a list of file names and applies them, one by one, to a command template and then runs that command.
The issue here is that for would necessarily wait until a command is finished before running the next one and so may under-utilize today's multi-core processors.
parallel acts as a kind of replacement for for. It makes the assumption that a command can be executed multiple times simultaneously, each time with different arguments, without the instances interfering with each other.
In the article, they show a command using parallel
find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png
that is equivalent to the previous for command. The difference (still roughly) is that parallel runs several variants of the templated command simultaneously without necessarily waiting for each to complete.
For your specific situation, in order to be able to use parallel, you would need to:
Adjust your python script so that it takes one input (such as a file name) and one output (also possibly a file name), both via the command line.
Figure out how to set up parallel so that it can receive a list of those file names for insertion into a command template, to run your python script on each of those files individually. A sketch of both pieces follows.
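For the first point, a per-file script might be as small as this (a hypothetical sketch; the real processing step is whatever perfile_code.py already does per file):

# perfile_code.py (sketch): one input file and one output file, both from argv.
import sys

def process(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line)  # the real per-file work would go here

if __name__ == "__main__":
    process(sys.argv[1], sys.argv[2])

For the second point, that shape slots directly into a parallel template, for example:
parallel python perfile_code.py {} {}.out :::: list_of_files.txt
where {} is parallel's default placeholder for each name read from the list.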
You can just use an ordinary shell for command, and append the & background indicator to the python command inside the for:
for file in `cat list_of_files.txt`;
do python perfile_code.py $file &
done
Of course, assuming your python code will generate separate outputs by itself.
It is just this simple.
Although this is not the usual approach: in general, people will favor using Python itself to control the parallel execution of the loop, if you can edit the program. One nice way to do that is to use concurrent.futures in Python to create a worker pool with 8 workers; the shell approach above launches all instances in parallel at once.
Assuming your code has a translate function that takes a filename, your Python code could be written as:
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def translate(filename):
    ...

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            futures.append(executor.submit(translate, filename))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
This won't depend on special shell syntax, and it takes care of corner cases and of number-of-workers handling, which could be hard to do properly from bash.
It is unclear from your question how you run your tasks in serial. But if we assume you run:
python perfile_code.py file1
python perfile_code.py file2
python perfile_code.py file3
:
python perfile_code.py fileN
then the simple way to parallelize this would be:
parallel python perfile_code.py ::: file*
If you have a list of files with one line per file then use:
parallel python perfile_code.py :::: filelist.txt
It will run one job per CPU thread in parallel. So if filelist.txt contains 1,000,000 names, it will not run them all at the same time, but will only start a new job when one finishes.
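If you want to cap or watch the concurrency explicitly, GNU parallel has switches for that (usage examples, not from the original answer):
parallel -j 8 python perfile_code.py :::: filelist.txt    # at most 8 jobs at a time
parallel --eta python perfile_code.py :::: filelist.txt   # default one job per CPU thread, with a progress estimate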

Proper use of os.wait()?

I am trying to solve an issue with automating a series of scripts used in my workplace. I am a beginner, so I apologise for what will most likely be an easy question (hopefully); I have read the literature, but it didn't quite make sense to me.
Essentially, I have a bash script that runs a Python script and an R script, and they need to run in order. Currently, the R script begins before the Python script is finished, and I have been told here that I cannot use the shell wait function, as my Python script launches child processes and shell wait cannot be used to wait on grandchild processes.
That's fine, so the solution offered was to make the Python and R scripts wait on their own child processes so that when they exit, the bash script can properly run in order. Unfortunately, I cannot figure out the proper way to do this in my Python script.
Here's what I have:
cmd = "python %s/create_keyfile.py %s %s %s %s" %(input, input, input,
input, input)
print cmd
os.system(cmd)
cmd = "python %s/uneak_name_plus_barcode_v2.py %s %s %s %s" %(input,
input, input, input, input)
print cmd
os.system(cmd)
cmd = "python %s/run_production_mode.py %s %s %s %s %s" %(input, input,
input, input, input, input)
print cmd
os.system(cmd)
Where 'input' stands for the actual inputs in my code; I probably just can't share exactly what we are doing :)
So essentially I am trying to figure out the best way of having the whole script wait on these three scripts before exiting.
Use subprocess.check_call(), not os.system().
subprocess.check_call() will block your main Python script's execution until the called command has finished.
See the documentation for subprocess.check_call().
The subprocess module should always be used instead of os.system() for subprocess management and execution.
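Applied to the snippet in the question, each call could look like this (a sketch only, keeping the question's Python 2 style and format strings):

import subprocess

cmd = "python %s/create_keyfile.py %s %s %s %s" % (input, input, input, input, input)
print cmd
# Blocks until the command exits; raises CalledProcessError on a non-zero status.
subprocess.check_call(cmd, shell=True)

Note that, as the comment below points out, this still cannot wait on anything the child itself backgrounds with &.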
Thank you to all who helped; here is what caused my dilemma, for anyone searching for this. By inserting python -c 'from time import sleep; sleep(30)' into my code, I determined that the first two Python scripts were waiting as expected but the final one was not (the timer would trigger immediately after that script ran). It turns out the third Python script also called another small Python script that had a "&" at the end of it, which made it ignore any commands to wait on it. Simply removing this & allowed all the code to run sequentially. – Michael Bates

Python subprocess.popen fails when interacting with the subprocess

I have a python build script for a Xamarin application that I need to compile into different ipa's and apk's based on locale.
The script manipulates the necessary values in info.plist and the Android manifest and then builds each of the versions using subprocess.Popen to call xbuild. Or at least that's how it's supposed to be.
The problem is that the build fails whenever I interact with the subprocess in any way (basically, I need to wait until it's done before I start changing values for the next version).
This works:
build_path = os.path.dirname(os.path.realpath(__file__))
ipa_path = "/path/to/my.ipa"
cmd = '/Library/Frameworks/Mono.framework/Versions/4.6.2/Commands/xbuild /p:Configuration="Release" /p:Platform="iPhone" /p:IpaPackageDir="%s" /t:Build %s/MyApp/iOS/MyApp.iOS.csproj' % (ipa_path, build_path)
subprocess.Popen(cmd, env=os.environ, shell=True)
However, this results in the Python script continuing in parallel with the build.
If I do this:
subprocess.Popen(cmd, env=os.environ, shell=True).wait()
Xbuild fail with the following error message:
Build FAILED.
Errors:
/Users/sune/dev/MyApp/iOS/MyApp.iOS.csproj: error :
/Users/sune/dev/MyApp/iOS/MyApp.iOS.csproj: There is an unclosed literal string.
Line 2434, position 56.
It fails within milliseconds of being called, whereas normally the build process takes several minutes.
The other shorthand forms of subprocess, such as .call and .check_call, as well as the underlying subprocess.poll and subprocess.communicate operations, cause the same error to happen.
What's really strange is that even calling time.sleep can provoke the same error:
subprocess.Popen(cmd, env=os.environ, shell=True)
time.sleep(2)
Which I don't get because as I understand it I should also be able to do something like this:
shell = subprocess.Popen(cmd, env=os.environ, shell=True)
while shell.poll() is None:
    time.sleep(2)
print "done"
To essentially achieve the same as calling shell.wait()
Edit: Using command list instead of string
If I use a command list and shell=False like this
cmd = [
'/Library/Frameworks/Mono.framework/Versions/4.6.2/Commands/xbuild',
'/p:Configuration="Release"',
'/p:Platform="iPhone"',
'/p:IpaPackageDir="%s' % ipa_path,
'/t:Build %s/MyApp/iOS/MyApp.iOS.csproj' % build_path
]
subprocess.Popen(cmd, env=os.environ, shell=False)
Then this is the result:
MSBUILD: error MSBUILD0003: Please specify the project or solution file to build, as none was found in the current directory.
Any input is much appreciated. I'm banging my head against the wall here.
I firmly believe that this is not possible. It must be a shortcoming of the way the subprocess module is implemented.
xbuild spawns multiple subprocesses during the build, and if polled for status, the subprocess machinery in Python will discover that one of these had a non-zero return status and stop the execution of one or more of the xbuild subprocesses, causing the build to fail as described.
I ended up using a bash script to do the compiling and using Python to manipulate the XML files, etc.
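As an aside on the list-based attempt: with shell=False every element of the list reaches xbuild verbatim, so '/t:Build %s/MyApp/iOS/MyApp.iOS.csproj' arrives as a single argument and the embedded quotes stay literal, which matches the MSBUILD0003 message above. A fully split list would look roughly like this (an untested sketch; the paths are taken from the question, and whether this also avoids the original wait() problem is a separate question):

import os
import subprocess

cmd = [
    '/Library/Frameworks/Mono.framework/Versions/4.6.2/Commands/xbuild',
    '/p:Configuration=Release',
    '/p:Platform=iPhone',
    '/p:IpaPackageDir=%s' % ipa_path,   # ipa_path as defined earlier in the question
    '/t:Build',
    '%s/MyApp/iOS/MyApp.iOS.csproj' % build_path,
]
subprocess.check_call(cmd, env=os.environ)  # blocks and raises on a non-zero exit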

Python's check_output method doesn't return output sometimes

I have a Python script which is supposed to run a large number of other scripts, each located within a subdirectory of the script's working directory. Each of these other scripts is supposed to connect to a game client and run an AI for that game. To make this run, I had to run each script over two separate threads (one for each player). The problem I'm having is that sometimes the scripts' output isn't captured. My run-code looks like this:
from os import chdir
from subprocess import check_output, STDOUT

def run(command, name, count):
    chdir(name)
    output = check_output(" ".join(command), stderr=STDOUT, shell=True).split('\r')
    chdir('..')
    with open("results_" + str(count) + ".txt", "w") as f:
        for line in output:
            f.write(line)
The strange part is that it does manage to capture longer streams, but the short ones go unnoticed. How can I change my code to fix this problem?
UPDATE: I don't think it's a buffering issue because check_output("ls ..", shell = True).split('\n')[:-1] returns the expected result and that command should take much less time than the scripts I'm trying to run.
UPDATE 2: I have discovered that the output is being cut short for the longer runs. It turns out that the end of the output is being missed for every process I run, for some reason. This also explains why the shorter runs don't produce any output at all.
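One way to make the capture independent of how the string is split afterwards is to hand the child process a file handle directly; a sketch that keeps the layout from the question but is not the original code, with cwd= replacing the chdir calls:

# Sketch: the child writes straight into the results file, so nothing depends
# on splitting a captured string afterwards.
from subprocess import check_call, STDOUT

def run(command, name, count):
    with open("results_" + str(count) + ".txt", "w") as f:
        check_call(" ".join(command), stderr=STDOUT, stdout=f,
                   shell=True, cwd=name)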

OS system call to Sicstus hangs indefinitely using Python

I'm trying to write a proofchecking application that receives proofs from a user on a website and sends them through to a Prolog script to check their validity.
I'm using Django, Python 2.7 and Sicstus. In my server "view.py" file, I call a python script "checkProof.py", passing it the raw text form of the proof the user submits. Inside of that file I have the following function:
import subprocess

def checkProof(pFile, fFile):
    p = subprocess.Popen(['/bin/bash', '-i', '-c',
                          'sicstus -l ProofServer/server/proofChecker.pl -- %s %s' % (pFile, fFile)],
                         stdout=subprocess.PIPE)
    p.communicate()  # Hangs here.
proofChecker.pl receives a modified version of the proof (pFile), analyses it and outputs feedback into a feedback file (fFile). The Python script loops until the feedback file is generated, and returns this to the rest of the server.
The first time I call this function, everything works fine and I get the expected output. The second time I call this function, the program hangs indefinitely at "p.communicate()".
This means that, currently, only one proof can be checked using the application between server restarts. The server should be able to check an indefinite number of proofs between restarts.
Does anyone know why this is happening? I'd be happy to include additional information if necessary.
Update
Based on advice given below, I tried three different kinds of calls to try to determine where the problem lies. The first is what I'm trying to do already - calling Sicstus on my real proofchecking code. The second was calling a very simple Prolog script that writes a hardcoded output. The third was a simple Python script that does the same:
def checkProof(pFile, fFile):
    cmd1 = 'sicstus -l ProofServer/server/proofChecker.pl -- %s %s' % (pFile, fFile)
    cmd2 = 'sicstus -l ProofServer/server/tempFeedback.pl -- %s %s' % (pFile, fFile)
    cmd3 = 'python ProofServer/server/tempFeedback.py %s %s' % (pFile, fFile)
    p = subprocess.Popen(['/bin/bash', '-i', '-c', cmd3],
                         stdout=subprocess.PIPE)
    p.communicate()  # Hangs here.
In all three cases, the application continues to hang on the second attempted call. This implies that the problem is not with calling Sicstus, but just with the way I'm calling programs in general. This is a bit reassuring but I'm still not sure what I'm doing wrong.
I managed to fix this issue, eventually.
I think the issue was that appending the -i (interactive) flag to bash meant that it expected input, and when it didn't get that input it suspended the process on the second call. This is what was happening when trying to replicate the process with something simpler.
I got rid of the -i flag, and found that it now raised the error "/bin/bash: sicstus: command not found", even though sicstus is on my server's PATH and I can call it fine if I ssh into the server and call it directly. I fixed this by specifying the full path. I can now check proofs an indefinite number of times between server restarts, which is great. My code is now:
def checkProof(pFile, fFile):
    cmd = '/usr/local/sicstus4.2.3/bin/sicstus -l ProofServer/server/proofChecker.pl -- %s %s' % (pFile, fFile)
    p = subprocess.Popen(['/bin/bash', '-c', cmd])
    p.communicate()
