Run external application on multiple processors - python

Currently I have a program that uses subprocess.Popen to open an executable file and pass an argument - equivalent to running ./path/to/file args on Linux.
This works very well but I have to execute this file over 1000 times and currently it is done one at a time, on a single processor. I want to be able to execute this file in sets of 8 for example, as I have an 8-core PC.
I have tried the following:
bolsig = ("/home/rdoyle/TEST_PROC/BOLSIG/bolsigminus")
infile_list = glob.glob(str(cwd)+"/BOLSIG Run Files/run*")
cmds_list = [[bolsig, infile] for infile in infile_list]
procs_list = [Popen(cmd) for cmd in cmds_list]
for proc in procs_list:
    proc.wait()
But this tries to execute all 1000 commands at the same time.
Anyone have any suggestions?

I like concurrent.futures for simple cases like this; it's simple and yet effective.
import os
import glob
from concurrent import futures
from subprocess import Popen
optim = ("/usr/bin/jpegoptim")
img_path = os.path.join(os.path.abspath(os.path.curdir), 'images')
file_list = glob.glob(img_path+'/*jpg')
def compress(fname):
    # .wait() keeps each worker thread busy until its jpegoptim process
    # finishes, so no more than 8 processes run at once
    Popen([optim, fname, '-d', 'out/', '-f']).wait()

ex = futures.ThreadPoolExecutor(max_workers=8)
ex.map(compress, file_list)
There is a great intro at Doug Hellmann's PyMOTW: https://pymotw.com/3/concurrent.futures/

You can use the Python multiprocessing module and its multiprocessing.pool.Pool: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing
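For instance, here is a minimal sketch of that approach, reusing the bolsig/cmds_list setup from the question and assuming each run can simply block in subprocess.call until it finishes:

from multiprocessing import Pool
from subprocess import call

def run_one(cmd):
    # call() blocks until the external program exits, so each pool
    # worker handles exactly one run at a time
    return call(cmd)

if __name__ == '__main__':
    with Pool(processes=8) as pool:   # at most 8 runs in flight
        return_codes = pool.map(run_one, cmds_list)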

Related

Python shutil module move exception handling

The destination path /tmp/abc is used by both Process 1 and Process 2.
Say there are N processes running; we need to retain the file generated by the latest one.
Process1
import shutil
shutil.move(src_path, destination_path)
Process 2
import os
os.remove(destination_path)
Solution
1. Handle the case where the move fails with [Errno 2] No such file or directory.
Is this the correct solution? Is there a better way to handle this?
Useful link: A safe, atomic file-copy operation
You can use the FileNotFoundError exception:
try:
    shutil.move(src_path, destination_path)
except FileNotFoundError:
    print('File Not Found')
    # Add whatever logic you want to execute
except:
    print('Some Other error')
The primary solutions that come to mind are either:
1. Keep partial information in a per-process staging file and rely on the OS for atomic moves, or
2. Keep partial information in per-process memory and rely on interprocess communication / locks for atomic writes.
For the first, e.g.:
import tempfile
import os

FINAL = '/tmp/something'

def do_stuff():
    # write into a private temp file tagged with this process's pid
    # (keep_doing_stuff() and get_output() are placeholders for the real work)
    fd, name = tempfile.mkstemp(suffix="-%s" % os.getpid())
    while keep_doing_stuff():
        os.write(fd, get_output())
    os.close(fd)
    # publish atomically; the last process to rename wins
    os.rename(name, FINAL)

if __name__ == '__main__':
    do_stuff()
You can choose to invoke individually from a shell (as shown above) or with some process wrappers (subprocess or multiprocessing would be fine), and either way will work.
For the interprocess approach you would probably want to spawn everything from a parent process:
from multiprocessing import Process, Lock
from io import StringIO   # cStringIO on Python 2

FINAL = '/tmp/something'

def do_stuff(lock):
    # accumulate everything privately, in memory
    output = StringIO()
    while keep_doing_stuff():
        output.write(get_output())
    # then take the lock and write the whole result in one go
    with lock:
        with open(FINAL, 'w') as f:
            f.write(output.getvalue())
    output.close()

if __name__ == '__main__':
    lock = Lock()
    for num in range(2):
        Process(target=do_stuff, args=(lock,)).start()

Subprocess.Popen only runs second time

I have a boot controller which runs a boot.py file contained in each folder of each tool I am trying to deploy. I want my boot controller to run all of these boot files simultaneously. The config file has the tool names and the versions desired, which help to generate the path to the boot.py.
def run_boot():
    config_file = get_config_file()
    parse_config_file.init(config_file)
    tools = parse_config_file.get_tools_to_deploy()
    # tools is now a list of tool names
    top_dir = os.getcwd()
    for tool in tools:
        ver = parse_config_file.get_tool_version(tool).strip()
        boot_file_path = "{0}\\Deploy\\{1}\\{2}".format(os.getcwd(), tool, ver)
        try:
            subprocess.Popen('boot.py', shell=True, cwd=boot_file_path)
        except:
            print("{0} failed to open".format(tool))
        print(tool, boot_file_path)
    os.chdir(top_dir)
The first time I run this, the print(tool, boot_file_path) executes but the processes do not start. The second time it is run, the processes do open. I cannot find a reason for this.
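One thing worth checking in a setup like this is whether the shell/file-association launch is failing silently (the bare except also hides any error). Below is a sketch of a more explicit variant, reusing tools and parse_config_file from the snippet above; running boot.py via sys.executable is an assumption about how the tools expect to be started:

import os
import sys
import subprocess

procs = []
for tool in tools:
    ver = parse_config_file.get_tool_version(tool).strip()
    boot_file_path = os.path.join(os.getcwd(), "Deploy", tool, ver)
    # launch boot.py with a known interpreter instead of relying on the
    # Windows file association that shell=True uses
    procs.append(subprocess.Popen([sys.executable, "boot.py"], cwd=boot_file_path))

for proc in procs:
    proc.wait()   # non-zero exit codes show up in proc.returncode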

Sharing a path string between modules in python

I am trying to make a GUI that displays a path to a file, which the user can change at any time. I have my defaults, which are in my first script. The following is a simplified version without any of the GUI stuff. When the user pushes a button, it runs a different script (script2), and in that script the information in the file is read.
script1:
import os
import sys
import subprocess
import multiprocessing as mp
import script2
specsfile = mp.Array('c',1000, lock=True)
path_save = mp.Array('c',1000, lock=True)
p = mp.Process(target=script2, args=(specsfile,path_save))
p.start()
specsfile = '//_an_excel_sheet_directory.xlsx'
path_save = '//path/to/my/directory/'
subprocess.call([sys.executable, 'script2.py'])
script2:
import multiprocessing as mp
import pandas as pd
from script1 import specsfile
from script1 import path_save
print(specsfile)
spec= pd.read_excel(specsfile)
When I run it, it gives me this error: PermissionError: [WinError 5] Access is denied
I'm not sure if I'm initializing this wrong or not. I've never used multiprocessing, but I was reading some recommendations about it for sharing data. So basically I want to initialize a specsfile string and a path_save string, but when either changes I want the change to be reflected and passed to script2.
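For reference, here is a minimal sketch of how mp.Array('c', ...) is usually wired up: fill the shared buffers in the parent, then pass the arrays to the child as arguments instead of importing them back from script1. The script2.main function below is an assumption, not code from the question:

# script1.py
import multiprocessing as mp
import script2

if __name__ == '__main__':
    # 'c' arrays are fixed-size shared byte buffers; .value reads/writes them as bytes
    specsfile = mp.Array('c', 1000, lock=True)
    path_save = mp.Array('c', 1000, lock=True)
    specsfile.value = b'//_an_excel_sheet_directory.xlsx'
    path_save.value = b'//path/to/my/directory/'
    p = mp.Process(target=script2.main, args=(specsfile, path_save))
    p.start()
    p.join()

# script2.py
def main(specsfile, path_save):
    print(specsfile.value.decode())   # decode the shared bytes back into str
    print(path_save.value.decode())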

Parallel processing in Python: options and alternative

I tried joblib; however, I got stuck at setting the processor affinity as explained here (the error is shown below along with my script).
Now I want to know if there are other options or alternatives that would allow me to accomplish the same goal, which is to run the same script in parallel, using my 8 cores (in a fashion that resembles GNU parallel).
Error:
AttributeError: 'Process' object has no attribute 'set_cpu_affinity'
My script:
from datetime import datetime
from subprocess import call
from joblib import Parallel, delayed
import multiprocessing
import psutil
import os

startTime = datetime.now()

pdb_name_list = []
for filename in os.listdir('/home/labusr/Documents/python_scripts/Spyder/Refinement'):
    if filename.endswith(".pdb"):
        pdb_name_list.append(filename)

num_cores = multiprocessing.cpu_count()
p = psutil.Process(os.getpid())
p.set_cpu_affinity(range(num_cores))
print(p.get_cpu_affinity())

inputs = range(2)

def strcuture_refine(file_name, i):
    print('Refining strcuture %s round %s......\n' % (file_name, i))
    call(['/home/labusr/rosetta/main/source/bin/rosetta_scripts.linuxgccrelease',
          '-in::file::s', '/home/labusr/Documents/python_scripts/Spyder/_Refinement/%s' % file_name,
          '-parser::protocol', '/home/labusr/Documents/A_asymm_refine.xml',
          '-parser::script_vars', 'denswt=35', 'rms=1.5', 'reso=4.3', 'map=/home/labusr/Documents/tubulin_exercise/masked_map_center.mrc',
          'testmap=/home/labusr/Documents/tubulin_exercise/mmasked_map_top_centered_resampled.mrc',
          '-in:ignore_unrecognized_res',
          '-edensity::mapreso', '4.3',
          '-default_max_cycles', '200',
          '-edensity::cryoem_scatterers',
          '-beta',
          '-out::suffix', '_%s' % i,
          '-crystal_refine'])
    print('Time for refining %s round %s is: \n' % (file_name, i), datetime.now() - startTime)

for file_name in pdb_name_list:
    Parallel(n_jobs=num_cores)(delayed(strcuture_refine)(file_name, i) for i in inputs)
The simplest thing to do is to just launch multiple Python processes, e.g. from a command line. To make each process handle its own file, you can pass the filename when invoking Python:
python myscript.py filename
The passed filename is then available in Python via
import sys
filename = sys.argv[1]
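A small driver along those lines could run the invocations eight at a time by launching them in batches; a sketch, assuming the per-file script above is saved as myscript.py:

# driver sketch: run the per-file invocations in batches of eight
import glob
import sys
from subprocess import Popen

pdb_files = glob.glob('/home/labusr/Documents/python_scripts/Spyder/Refinement/*.pdb')

for start in range(0, len(pdb_files), 8):
    batch = [Popen([sys.executable, 'myscript.py', name]) for name in pdb_files[start:start + 8]]
    for proc in batch:
        proc.wait()   # finish the whole batch before launching the next one

This trades a little efficiency (each batch waits for its slowest job) for not needing any pool machinery.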

What File Descriptor object does Python AsyncIO's loop.add_reader() expect?

I'm trying to understand how to use the new AsyncIO functionality in Python 3.4 and I'm struggling with how to use event_loop.add_reader(). From the limited discussions that I've found, it looks like it's for reading the standard output of a separate process, as opposed to the contents of an open file. Is that true? If so, it appears that there's no AsyncIO-specific way to integrate standard file IO; is this also true?
I've been playing with the following code. Running it raises PermissionError: [Errno 1] Operation not permitted from line 399 of /python3.4/selectors.py (self._epoll.register(key.fd, epoll_events)), which is triggered by the add_reader() line below.
import asyncio
import urllib.parse
import sys
import pdb
import os

def fileCallback(*args):
    pdb.set_trace()

path = sys.argv[1]
loop = asyncio.get_event_loop()
#fd = os.open(path, os.O_RDONLY)
fd = open(path, 'r')
#data = fd.read()
#print(data)
#fd.close()
pdb.set_trace()
task = loop.add_reader(fd, fileCallback, fd)
loop.run_until_complete(task)
loop.close()
EDIT
For those looking for an example of how to use AsyncIO to read more than one file at a time like I was curious about, here's an example of how it can be accomplished. The secret is in the line yield from asyncio.sleep(0). This essentially pauses the current function, putting it back in the event loop queue, to be called after all other ready functions are executed. Functions are determined to be ready based on how they were scheduled.
import asyncio

@asyncio.coroutine
def read_section(file, length):
    yield from asyncio.sleep(0)
    return file.read(length)

@asyncio.coroutine
def read_file(path):
    fd = open(path, 'r')
    retVal = []
    cnt = 0
    while True:
        cnt = cnt + 1
        data = yield from read_section(fd, 102400)
        print(path + ': ' + str(cnt) + ' - ' + str(len(data)))
        if len(data) == 0:
            break
    fd.close()

paths = ["loadme.txt", "loadme also.txt"]
loop = asyncio.get_event_loop()
tasks = []
for path in paths:
    tasks.append(asyncio.async(read_file(path)))   # asyncio.ensure_future() on Python 3.4.4+
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
These functions expect a file descriptor, that is, the underlying integer the operating system uses, not Python's file objects. File objects that are based on file descriptors return that descriptor from the fileno() method, so for example:
>>> sys.stderr.fileno()
2
In Unix, file descriptors can be attached to files or a lot of other things, including other processes.
Edit for the OP's edit:
As Max in the comments says, you can not use epoll on local files (and asyncio uses epoll). Yes, that's kind of weird. You can use it on pipes, though, for example:
import asyncio
import urllib.parse
import sys
import pdb
import os

def fileCallback(*args):
    print("Received: " + sys.stdin.readline())

loop = asyncio.get_event_loop()
task = loop.add_reader(sys.stdin.fileno(), fileCallback)
loop.run_forever()
This will echo stuff you write on stdin.
You cannot use add_reader on local files, because:
It cannot be done using select/poll/epoll
It depends on the operating system
It cannot be fully asynchronous because of os limitations (linux does not support async fs metadata read/write)
But technically, yes, you should be able to do async filesystem reads/writes; (almost) all systems have a DMA mechanism for doing I/O "in the background". And no, local I/O is not so fast that nobody would want it asynchronous; the CPU is orders of magnitude faster than disk I/O.
Look at aiofile or aiofiles if you want to try async file I/O.
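A minimal sketch of what that looks like with aiofiles (this assumes the third-party package is installed and Python 3.7+ syntax; the filename is just the one from the example above):

import asyncio
import aiofiles   # third-party package: pip install aiofiles

async def read_file(path):
    # aiofiles hands the blocking read to a thread pool,
    # so the event loop stays free to run other tasks
    async with aiofiles.open(path, 'r') as f:
        return await f.read()

data = asyncio.run(read_file('loadme.txt'))
print(len(data))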
