I have a script that takes as input a list of filenames and loops over them to generate one output file per input file, so I think this is a case that can easily be parallelized.
I have an 8-core machine.
I tried using a -parallel flag on this command:
python perfile_code.py list_of_files.txt
But I couldn't make it work. The specific question is: how do I use parallel in bash with a Python command on Linux, with the right arguments for the case described above?
There is a Linux parallel command (sudo apt-get install parallel), which I read somewhere can do this job, but I don't know how to use it.
Most of the resources on the internet explain how to do it in Python, but can it be done in bash?
Please help, thanks.
Based on an answer, here is an example that is still not working; please suggest how to make it work.
I have a folder with 2 files, and in this example I just want to create a duplicate of each, with a different name, in parallel.
# filelist is the directory containing the two files, a.txt and b.txt.
# a.txt is the first file, b.txt is the second file.
# I pass a .txt file with both names to the main program.
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import sys

def translate(filename):
    print(filename)
    # copy the file line by line into "<filename>.x"
    with open(filename, "r") as f, open(filename + ".x", "w") as g:
        for line in f:
            g.write(line)

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            # strip the trailing newline, otherwise the path does not exist
            futures.append(executor.submit(translate, "filelist/" + filename.strip()))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    main(sys.argv[1])
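For reference, a minimal sketch of how the two-file test case described above could be set up and run; the list-file name files.txt and the file contents are assumptions, since the question does not name them:
# Hypothetical setup for the test case: a folder "filelist" containing a.txt and
# b.txt, plus a list file naming them, one per line.
from pathlib import Path

Path("filelist").mkdir(exist_ok=True)
Path("filelist/a.txt").write_text("contents of a\n")
Path("filelist/b.txt").write_text("contents of b\n")
Path("files.txt").write_text("a.txt\nb.txt\n")

# then run:  python the_script_above.py files.txt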
Based on your comment,
#Ouroborus no, no consider this opensource.com/article/18/5/gnu-parallel i want to run a python program along with this parallel..for a very specific case..if an arbitrary convert program can be piped to parallel ..why wouldn't a python program?
I think this might help:
convert wasn't chosen arbitrarily. It was chosen because it is a better known program that (roughly) maps a single input file, provided via the command line, to a single output file, also provided via the command line.
The typical shell for loop can be used to iterate over a list. In the article you linked, they show an example
for i in *jpeg; do convert $i $i.png ; done
This (again, roughly) takes a list of file names and applies them, one by one, to a command template and then runs that command.
The issue here is that for would necessarily wait until a command is finished before running the next one and so may under-utilize today's multi-core processors.
parallel acts as a kind of replacement for for. It assumes that a command can be executed multiple times simultaneously, each time with different arguments, without the instances interfering with each other.
In the article, they show a command using parallel
find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png
that is equivalent to the previous for command. The difference (still roughly) is that parallel runs several variants of the templated command simultaneously without necessarily waiting for each to complete.
For your specific situation, in order to be able to use parallel, you would need to:
Adjust your Python script so that it takes one input (such as a file name) and one output (also possibly a file name), both via the command line (a sketch of such a script follows below).
Figure out how to set up parallel so that it can receive a list of those file names and insert them into a command template that runs your Python script on each of those files individually.
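For illustration only, here is a minimal sketch of what such a per-file script could look like; the name perfile_code.py, the output naming, and the copy-style body are assumptions, since the real script's logic is not shown in the question:
# perfile_code.py (hypothetical sketch): one input file and one output file,
# both taken from the command line, so that parallel can fill in a template such as
#   parallel --max-args 1 python perfile_code.py {} {}.out :::: list_of_files.txt
import sys

def process(in_path, out_path):
    # placeholder per-file work: copy each line; replace with the real logic
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line)

if __name__ == "__main__":
    process(sys.argv[1], sys.argv[2])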
You can just use an ordinary shell for command, and append the & background indicator to the python command inside the for:
for file in `cat list_of_files.txt`;
do python perfile_code.py $file &
done
Of course, assuming your python code will generate separate outputs by itself.
It is just this simple.
This is not the usual approach, though: in general, people will favor using Python itself to control the parallel execution of the loop, if you can edit the program. One nice way to do that is to use concurrent.futures in Python to create a worker pool with 8 workers; the shell approach above launches all instances in parallel at once, while the pool only runs 8 at a time.
Assuming your code has a translate function that takes in a filename, your Python code could be written as:
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def translate(filename):
    ...

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            # strip the newline that each line of the list file ends with
            futures.append(executor.submit(translate, filename.strip()))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
This won't depend on special shell syntax, and it takes care of corner cases and of number-of-workers handling, which can be hard to get right from bash.
It is unclear from your question how you run your tasks in serial. But if we assume you run:
python perfile_code.py file1
python perfile_code.py file2
python perfile_code.py file3
:
python perfile_code.py fileN
then the simple way to parallelize this would be:
parallel python perfile_code.py ::: file*
If you have a list of files with one line per file then use:
parallel python perfile_code.py :::: filelist.txt
It will run one job per CPU thread in parallel, so if filelist.txt contains 1,000,000 names, it will not run them all at the same time, but will only start a new job when a previous one finishes.
Related
I want to be able to start two concurrent processes in Python from the terminal and have one's output fed to the other's input.
I am trying the following code:
File 1 (f1.py):
import sys
while True:
    sys.stdout.write("a")
File 2 (f2.py):
import sys
while True:
    x = sys.stdin.readline()
    print(x)
Running the files:
$ python f1.py | python f2.py
The problem is that the readline call blocks the reading program from running until the writing one finishes. Is there a way to achieve concurrency?
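For what it's worth, a common cause of this behaviour is that the writer's stdout is block-buffered when it goes into a pipe, so nothing reaches the reader until the buffer fills or the writer exits. A minimal sketch of the writer side, assuming buffering is indeed the issue here:
# f1.py (sketch): flush after each write so the data crosses the pipe immediately
import sys

while True:
    sys.stdout.write("a")
    sys.stdout.flush()  # without this, the output can sit in the buffer indefinitely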
I am using the following Python code to run a subprocess and collect its output as a string:
import shlex
import subprocess

def run(command):
    ''' Run a command and return the output as a string '''
    args = shlex.split(command)
    out = subprocess.Popen(args,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE).communicate()[0]
    # Save to log file
    with open("log_file.txt", "a") as log_file:
        log_file.write("$ " + command + "\n")
        log_file.write(out)
    return out
My goal is to run a benchmark application (openssl speed) multiple times and use Python to parse the output and calculate the average results.
However, I have noticed that the results are consistently slower (about 10%) than when I run the same command directly from the command line in bash.
How would you explain this?
EDIT
The output of the command is quite short: about 10 lines.
Also, note that the benchmark does not print any output while the performance test is running. It only prints the results outside the critical loop.
In the script I only run the benchmark of a particular cipher at a time, so for example I use the following arguments:
openssl speed -elapsed -engine my_engine rsa2048
Note that I am using a custom engine (target of the benchmark) and not the standard software implementation.
My engine spawns another pthread, but I would not expect that to make a big difference, since the Python script is not supposed to interact with it in any way.
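One way to narrow this down (a suggestion, not part of the original post) is to take the pipes out of the picture and let the benchmark write straight to a file, then compare the timings with the plain bash run; if the gap disappears, the overhead comes from how the output is collected rather than from Python itself:
import shlex
import subprocess

def run_to_file(command, log_path="raw_output.txt"):
    # Same command, but stdout/stderr go directly to a file instead of pipes,
    # so Python does not touch the output while the benchmark is running.
    args = shlex.split(command)
    with open(log_path, "a") as log_file:
        log_file.write("$ " + command + "\n")
        log_file.flush()
        subprocess.call(args, stdout=log_file, stderr=log_file)

run_to_file("openssl speed -elapsed -engine my_engine rsa2048")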
I have a Python script which is supposed to run a large number of other scripts, each located within a subdirectory of the script's working directory. Each of these other scripts is supposed to connect to a game client and run an AI for that game. To make this run, I had to run each script over two separate threads (one for each player). The problem I'm having is that sometimes the scripts' output isn't captured. My run-code looks like this:
from os import chdir
from subprocess import check_output, STDOUT

def run(command, name, count):
    chdir(name)
    output = check_output(" ".join(command), stderr=STDOUT, shell=True).split('\r')
    chdir('..')
    with open("results_" + str(count) + ".txt", "w") as f:
        for line in output:
            f.write(line)
The strange part is that it does manage to capture longer streams, but the short ones go unnoticed. How can I change my code to fix this problem?
UPDATE: I don't think it's a buffering issue because check_output("ls ..", shell = True).split('\n')[:-1] returns the expected result and that command should take much less time than the scripts I'm trying to run.
UPDATE 2: I have discovered that output is being cut for the longer runs. It turns out that the end of output is being missed for all processes that I run for some reason. This also explains why the shorter runs don't produce any output at all.
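Not an explanation of the missing output, but two details in the run function above are worth noting: os.chdir changes the working directory for the whole process (so two threads calling it at once can race), and splitting on '\r' drops the carriage returns from what gets written. A sketch of the same function using the cwd argument instead, under the assumption that the scripts only need to be launched from their own directory:
from subprocess import check_output, STDOUT

def run(command, name, count):
    # cwd= runs the command in the subdirectory without touching the
    # process-wide working directory shared by both threads
    output = check_output(" ".join(command), stderr=STDOUT, shell=True, cwd=name)
    with open("results_" + str(count) + ".txt", "w") as f:
        f.write(output)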
I want to create a script that will run two Linux-based tools from the shell at the same time and write their outputs into a single results file.
In all honesty I am pretty clueless; I have done some research into things such as os.fork, and I was really just looking for some guidance.
I am currently using subprocess.call([command here]) to run one command and send its output to a file, but I was wondering how I could run two tools simultaneously, such as:
subprocess.call([command 1 >> results.txt])
subprocess.call([command 2 >> results.txt])
Both of these happening at the same time.
Firstly, you'll want to call Popen rather than call if you want to run these simultaneously, since call blocks until the process finishes. You can also use the stdout parameter to send the output to a file-like object.
with open("results.txt", "w") as results:
p1 = subprocess.Popen(["command1"], stdout=results)
p2 = subprocess.Popen(["command2"], stdout=results)
p1.wait()
p2.wait()
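A general caveat, not raised in the original answer: both processes write to the same file handle, so their lines can interleave arbitrarily. If the two outputs need to stay separated, one variation is to capture each command's output first and write both afterwards; a sketch (fine for modest amounts of output):
import subprocess

# capture each command's output separately, then write both to one file in order
out1 = subprocess.Popen(["command1"], stdout=subprocess.PIPE)
out2 = subprocess.Popen(["command2"], stdout=subprocess.PIPE)
data1, _ = out1.communicate()
data2, _ = out2.communicate()

with open("results.txt", "wb") as results:
    results.write(data1)
    results.write(data2)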
I have the following command on the build-server as a part of the build process:
os.system ('signtool sign /a /t http://timestamp.verisign.com/scripts/timstamp.dll "%s\\*.exe"' % (dir) )
This command signs each executable file in the specified directory. Is there a way to run this command in parallel for each executable file using Python? Is there something like OpenMP for Python?
You could use threads. This tutorial shows how to do something similar to what you're asking for using threads.
Perhaps multiprocessing could be of help here?
Specifically, multiprocessing.Pool.map() might be relevant to your needs.
The above answers are perfectly sensible ways of approaching this from the Python side, e.g.:
from multiprocessing import Pool
import os

def processFile(x):
    return os.system('ls ' + x)

if __name__ == '__main__':
    pool = Pool(processes=2)
    files = ['foo', 'foo.py', 'foo.cpp', 'foo.txt', 'foo.bar']
    result = pool.map(processFile, files)
    print 'Results are', result
But if you're using the shell anyway, you might want to consider using GNU Parallel on the shell side; it runs like xargs but executes the individual tasks in parallel, with options to control how many jobs can run simultaneously, etc.