Multi-thread Linux tool commands? - python

I want to create a script that will run two Linux-based tools from the shell at the same time and write their outputs into a single results file.
I am pretty clueless in all honesty; I have done some research into things such as os.fork, and I am really just looking for some guidance.
I am currently using subprocess.call([command here]) to run one command and output that into a file, but I was wondering how I could run two tools simultaneously, something like:
subprocess.call([command 1 >> results.txt])
subprocess.call([command 2 >> results.txt])
Both of these happening at the same time.

Firstly, you'll want to use Popen rather than call if you want to run these simultaneously, as call blocks until the process finishes. Also, you can use the stdout parameter to redirect the output to a file-like object.
import subprocess

with open("results.txt", "w") as results:
    p1 = subprocess.Popen(["command1"], stdout=results)
    p2 = subprocess.Popen(["command2"], stdout=results)
    p1.wait()
    p2.wait()
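Note that both processes share the same file handle, so their lines may interleave in results.txt. If that is a problem, a simple alternative (a sketch, with placeholder command and file names) is to capture each command's output in its own file and concatenate afterwards:

import subprocess

# run both commands at once, each writing to its own capture file
with open("out1.txt", "w") as o1, open("out2.txt", "w") as o2:
    p1 = subprocess.Popen(["command1"], stdout=o1)
    p2 = subprocess.Popen(["command2"], stdout=o2)
    p1.wait()
    p2.wait()

# merge the two captures into the single results file
with open("results.txt", "w") as results:
    for name in ("out1.txt", "out2.txt"):
        with open(name) as part:
            results.write(part.read())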

Related

How to Parallelize a Python program on Linux

I have a script that takes as input a list of filenames and loops over them to generate an output file per input file, so this is a case which can be easily parallelized, I think.
I have an 8-core machine.
I tried using the -parallel flag on this command:
python perfile_code.py list_of_files.txt
But I can't make it work. The specific question is: how do I use parallel in bash with a Python command on Linux, along with the arguments for the specific case mentioned above?
There is a Linux parallel command (sudo apt-get install parallel), which I read somewhere can do this job, but I don't know how to use it.
Most of the internet resources explain how to do it in Python, but can it be done in bash?
Please help, thanks.
Based on an answer, here is an example that is still not working; please suggest how to make it work.
I have a folder with 2 files; in this example I just want to create duplicates of them, under different names, in parallel.
# filelist is the directory containing the two files, a.txt and b.txt.
# a.txt is the first file, b.txt is the second file
# I pass a .txt file with both the names to the main program
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
import sys

def translate(filename):
    print(filename)
    f = open(filename, "r")
    g = open(filename + ".x", "w")
    for line in f:
        g.write(line)

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            executor.submit(translate, "filelist/" + filename)
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    main(sys.argv[1])
Based on your comment,
@Ouroborus no, no, consider this: opensource.com/article/18/5/gnu-parallel. I want to run a python program along with this parallel, for a very specific case. If an arbitrary convert program can be piped to parallel, why wouldn't a python program?
I think this might help:
convert wasn't chosen arbitrarily. It was chosen because it is a better known program that (roughly) maps a single input file, provided via the command line, to a single output file, also provided via the command line.
The typical shell for loop can be used to iterate over a list. In the article you linked, they show an example
for i in *jpeg; do convert $i $i.png ; done
This (again, roughly) takes a list of file names and applies them, one by one, to a command template and then runs that command.
The issue here is that for would necessarily wait until a command is finished before running the next one and so may under-utilize today's multi-core processors.
parallel acts as a kind of replacement for for. It makes the assumption that a command can be executed multiple times simultaneously, each with different arguments, without the instances interfering with each other.
In the article, they show a command using parallel
find . -name "*jpeg" | parallel -I% --max-args 1 convert % %.png
that is equivalent to the previous for command. The difference (still roughly) is that parallel runs several variants of the templated command simultaneously without necessarily waiting for each to complete.
For your specific situation, in order to be able to use parallel, you would need to:
Adjust your python script so that it takes one input (such as a file name) and one output (also possibly a file name), both via the command line (a sketch of such a script follows this list).
Figure out how to set up parallel so that it can receive a list of those file names for insertion into a command template, to run your python script on each of those files individually.
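For the first point, a minimal sketch of such a per-file script might look like this (the file name perfile_sketch.py and the line-copying body are illustrative assumptions, not the asker's actual code):

# perfile_sketch.py -- hypothetical per-file script: one input file and one
# output file, both taken from the command line, so parallel can drive it
import sys

def translate(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line)  # replace with the real per-line processing

if __name__ == "__main__":
    translate(sys.argv[1], sys.argv[2])

With a script shaped like that, parallel only has to substitute each file name (and a derived output name) into the command template.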
You can just use an ordinary shell for command, and append the & background indicator to the python command inside the for:
for file in `cat list_of_files.txt`;
do python perfile_code.py $file &
done
Of course, assuming your python code will generate separate outputs by itself.
It is just this simple.
This is not the usual approach, though; in general people will favor using Python itself to control the parallel execution of the loop, if you can edit the program. One nice way to do that is to use concurrent.futures in Python to create a worker pool with 8 workers; the shell approach above launches all instances in parallel at once.
Assuming your code has a translate function that takes a filename, your Python code could be written as:
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def translate(filename):
    ...

def main(path_to_file_with_list):
    futures = []
    with ProcessPoolExecutor(max_workers=8) as executor:
        for filename in Path(path_to_file_with_list).open():
            # keep the future, and strip the trailing newline from each line
            futures.append(executor.submit(translate, filename.strip()))
        for future in as_completed(futures):
            future.result()

if __name__ == "__main__":
    import sys
    main(sys.argv[1])
This won't depend on special shell syntax, and it takes care of corner cases and number-of-workers handling, which could be hard to do properly from bash.
It is unclear from your question how you run your tasks in serial. But if we assume you run:
python perfile_code.py file1
python perfile_code.py file2
python perfile_code.py file3
:
python perfile_code.py fileN
then the simple way to parallelize this would be:
parallel python perfile_code.py ::: file*
If you have a list of files with one line per file then use:
parallel python perfile_code.py :::: filelist.txt
It will run one job per CPU thread in parallel. So if filelist.txt contains 1000000 names, it will not run them all at the same time, but will only start a new job when one finishes.

Communicating between two python files using unix pipe and filter (concurrently)

I want to be able to start two concurrent processes in python from the terminal and have one's output fed to the next one's input.
I am trying the following code:
File 1 (f1.py):
import sys

while True:
    sys.stdout.write("a")
File 2 (f2.py):
import sys

x = sys.stdin.readline()
while x:
    print x
    x = sys.stdin.readline()
Running the files:
$ python f1.py | python f2.py
The problem is that the readline call blocks the reading program from running until the writing one finishes. Is there a way to achieve concurrency?
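A likely culprit here is buffering: f1.py never writes a newline, and its stdout is block-buffered when it is piped, so readline in f2.py has no complete line to return until the buffer is flushed. A minimal sketch of both files with an explicit newline and flush, assuming that buffering is indeed the cause:

# f1.py (sketch): emit complete lines and flush them immediately
import sys

while True:
    sys.stdout.write("a\n")
    sys.stdout.flush()

# f2.py (sketch): read lines as they arrive
import sys

while True:
    line = sys.stdin.readline()
    if not line:  # EOF: the writer has exited
        break
    print(line.rstrip())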

Python's check_output method doesn't return output sometimes

I have a Python script which is supposed to run a large number of other scripts, each located within a subdirectory of the script's working directory. Each of these other scripts is supposed to connect to a game client and run an AI for that game. To make this run, I had to run each script over two separate threads (one for each player). The problem I'm having is that sometimes the scripts' output isn't captured. My run-code looks like this:
from os import chdir
from subprocess import check_output, STDOUT

def run(command, name, count):
    chdir(name)
    output = check_output(" ".join(command), stderr=STDOUT, shell=True).split('\r')
    chdir('..')
    with open("results_" + str(count) + ".txt", "w") as f:
        for line in output:
            f.write(line)
The strange part is that it does manage to capture longer streams, but the short ones go unnoticed. How can I change my code to fix this problem?
UPDATE: I don't think it's a buffering issue because check_output("ls ..", shell = True).split('\n')[:-1] returns the expected result and that command should take much less time than the scripts I'm trying to run.
UPDATE 2: I have discovered that output is being cut for the longer runs. It turns out that the end of output is being missed for all processes that I run for some reason. This also explains why the shorter runs don't produce any output at all.
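One detail worth noting about the code above: os.chdir changes the working directory of the whole process, so two threads calling run at the same time can race on it and end up running, or writing results, in the wrong directory. A sketch that sidesteps this by using check_output's cwd parameter instead (command, name and count are the names from the question; the rest is an assumption about what the surrounding code expects):

from subprocess import check_output, STDOUT

def run(command, name, count):
    # cwd= runs the command inside the subdirectory without touching the
    # process-wide working directory, so concurrent threads cannot race on it
    output = check_output(" ".join(command), stderr=STDOUT, shell=True,
                          cwd=name, universal_newlines=True)
    with open("results_" + str(count) + ".txt", "w") as f:
        f.write(output)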

How can I read the memory of a process in python in linux?

I'm trying to use Python and python-ptrace to read the memory of an external process. I need to work entirely in Python, and I've been trying to read and print out the memory of a process in Linux.
So for example I've tried the following code, which keeps giving me IO errors:
proc_mem = open("/proc/%i/mem" % process.pid, "r")
print proc_mem.read()
proc_mem.close()
Mostly I just want to repeatedly dump the memory of a process and look for changes over time. If this is the correct way to do this, then what is my problem? OR is there a more appropriate way to do this?
Call a shell command from python - subprocess module
import subprocess
# ps -ux | grep 1842 (Assuming 1842 is the process id. replace with process id you get)
p1 = subprocess.Popen(["ps", "-ux"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["grep", "1842"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close() # Allow p1 to receive a SIGPIPE if p2 exits.
output = p2.communicate()[0]
print output
and parse through output to see its memory utilization
Mostly I just want to repeatedly dump the memory of a process and look for changes over time. If this is the correct way to do this, then what is my problem? OR is there a more appropriate way to do this?
You may be interested in gdb's reverse debugging, which records all changes to process memory. Here is the tutorial (google cache).
There is also Robert O'Callahan's Chronicle/Chronomancer work, if you want to play with the raw recording tools.

writing output to a file with subprocess in python

I have some code which spawns at most 4 processes at a go. It looks for any new jobs submitted, and if any exist, it runs the Python code:
for index, row in enumerate(rows):
    if index < 4:
        dirs = row[0]
        dirName = os.path.join(homeFolder, dirs)
        logFile = os.path.join(dirName, (dirs + ".log"))
        proc = subprocess.Popen(["python", "test.py", dirs], stdout=open(logFile, 'w'))
I have few questions:
When I try to write the output or errors to the log file, nothing is written to the file until the process finishes. Is it possible to write the output to the file as the process runs, as this would help to show what stage it has reached?
When one process finishes, I want the next job in the queue to be run, rather than waiting for all child processes to finish before the daemon starts any new ones.
Any help will be appreciated. Thanks!
For 2. you can take a look at http://docs.python.org/library/multiprocessing.html
Concerning point 1, try to adjust the buffering used for the log file:
open(logFile,'w', 1) # line-buffered (writes to the file after each logged line)
open(logFile,'w', 0) # unbuffered (should immediately write to the file)
If it suits your need, you should choose line-buffered instead of unbuffered.
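Applied to the loop in the question, that might look like the lines below (the -u flag asks the child interpreter not to buffer its own stdout; whether it is needed depends on how test.py writes its output):

logHandle = open(logFile, 'w', 1)  # line-buffered on the parent's side
proc = subprocess.Popen(["python", "-u", "test.py", dirs], stdout=logHandle)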
Concerning your general problem, as #Tichodroma suggests, you should have a try with Python's multiprocessing module.
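For example, here is a minimal sketch with multiprocessing.Pool, reusing the names rows, homeFolder and test.py from the question (everything else is an assumption): the pool keeps at most 4 jobs running and starts the next one as soon as a worker becomes free.

import os
import subprocess
from multiprocessing import Pool

def run_job(dirs):
    dirName = os.path.join(homeFolder, dirs)
    logFile = os.path.join(dirName, dirs + ".log")
    with open(logFile, 'w', 1) as log:  # line-buffered log file
        subprocess.call(["python", "test.py", dirs], stdout=log)

if __name__ == "__main__":
    pool = Pool(processes=4)                     # at most 4 concurrent jobs
    pool.map(run_job, [row[0] for row in rows])  # next job starts when one finishes
    pool.close()
    pool.join()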
