Parallel processing in Python: options and alternatives

I tried joblib; however, I got stuck setting the processor affinity as explained here (the error is shown below, along with my script).
Now I want to know if there are other options or alternatives that would allow me to accomplish the same goal, which is to run the same script in parallel, using my 8 cores (in a fashion that resembles GNU parallel).
Error:
AttributeError: 'Process' object has no attribute 'set_cpu_affinity'
My script:
from datetime import datetime
from subprocess import call
from joblib import Parallel, delayed
import multiprocessing
import psutil
import os

startTime = datetime.now()

pdb_name_list = []
for filename in os.listdir('/home/labusr/Documents/python_scripts/Spyder/Refinement'):
    if filename.endswith(".pdb"):
        pdb_name_list.append(filename)

num_cores = multiprocessing.cpu_count()
p = psutil.Process(os.getpid())
p.set_cpu_affinity(range(num_cores))
print(p.get_cpu_affinity())

inputs = range(2)

def strcuture_refine(file_name, i):
    print('Refining strcuture %s round %s......\n' % (file_name, i))
    call(['/home/labusr/rosetta/main/source/bin/rosetta_scripts.linuxgccrelease',
          '-in::file::s', '/home/labusr/Documents/python_scripts/Spyder/_Refinement/%s' % file_name,
          '-parser::protocol', '/home/labusr/Documents/A_asymm_refine.xml',
          '-parser::script_vars', 'denswt=35', 'rms=1.5', 'reso=4.3', 'map=/home/labusr/Documents/tubulin_exercise/masked_map_center.mrc',
          'testmap=/home/labusr/Documents/tubulin_exercise/mmasked_map_top_centered_resampled.mrc',
          '-in:ignore_unrecognized_res',
          '-edensity::mapreso', '4.3',
          '-default_max_cycles', '200',
          '-edensity::cryoem_scatterers',
          '-beta',
          '-out::suffix', '_%s' % i,
          '-crystal_refine'])
    print('Time for refining %s round %s is: \n' % (file_name, i), datetime.now() - startTime)

for file_name in pdb_name_list:
    Parallel(n_jobs=num_cores)(delayed(strcuture_refine)(file_name, i) for i in inputs)
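
As a side note on the error itself: it is most likely a psutil API change. In psutil 2.0 the old get_cpu_affinity()/set_cpu_affinity() pair was merged into a single cpu_affinity() method, so on a recent psutil the affinity lines would look roughly like this (a minimal sketch):

import os
import multiprocessing
import psutil

num_cores = multiprocessing.cpu_count()
p = psutil.Process(os.getpid())
p.cpu_affinity(list(range(num_cores)))  # with an argument: set the affinity to all cores
print(p.cpu_affinity())                 # without an argument: return the current affinity list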

The simplest thing to do is to launch multiple Python processes yourself, e.g. from the command line. To make each process handle its own file, you can pass the file name when invoking Python:
python myscript.py filename
The passed filename is then available in Python via
import sys
filename = sys.argv[1]
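
If you would rather keep everything inside one Python script, a process pool gives similar GNU-parallel-like behaviour: a fixed number of workers, each picking up the next file as soon as it is free. A minimal sketch, assuming the rosetta call from the question is wrapped in a run_one(filename) helper (the helper name, the echo placeholder, and the file list are illustrative, not from the original post):

import multiprocessing
from subprocess import call

def run_one(filename):
    # placeholder for the real rosetta_scripts invocation from the question
    return call(['echo', 'refining', filename])

if __name__ == '__main__':
    pdb_files = ['a.pdb', 'b.pdb', 'c.pdb']          # hypothetical file list
    with multiprocessing.Pool(processes=8) as pool:  # one worker per core
        pool.map(run_one, pdb_files)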

Related

Concurrent.futures Problems

So I have this code that needs to use the concurrent.futures module, and for some reason it is telling me it does not exist. I have looked it up and I cannot find what the problem is. I tried installing the tools I need from it, thinking that was the case, but I can only get one of them to download.
Error message:
from concurrent.futures import ProcessPoolExecutor
ModuleNotFoundError: No module named 'concurrent.futures';
'concurrent' is not a package
my code:
import requests, time
from concurrent.futures import ProcessPoolExecutor

sites = ["http://www.youtube.com"]

def get_one(site):
    resp = requests.get(site)
    size = len(resp.content)
    print(f"download {site} bytes from {site}")
    return size

def main():
    total_size = 0
    start = time.perf_counter()
    with ProcessPoolExecutor as exec:
        total_size = sum(exec.map(get_one, sites))
    end = time.perf_counter()
    for site in sites:
        total_size += size
        #print(f"downlded {size} bytes from {site}")
    #end = time.perf_counter()
    print(f"elapsed time: {end - start} seconds")
    print(f"downloaded a totla of {total_size} bytes")

if __name__ == "__main__":
    main()
I know that normally there should be a file when I say "from", but everything I look up says concurrent.futures is part of Python, yet for some reason mine will not work properly. If it is out there, do I have to install it?
I found that I had a file named concurrent.py in my folder that was messing everything up!
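For reference, once the shadowing concurrent.py is renamed, a cleaned-up version of the script might look roughly like this; note that the executor also has to be instantiated with parentheses, and the leftover serial loop can be dropped (this is a sketch, not the original poster's final code):

import time
import requests
from concurrent.futures import ProcessPoolExecutor

sites = ["http://www.youtube.com"]

def get_one(site):
    resp = requests.get(site)
    size = len(resp.content)
    print(f"downloaded {size} bytes from {site}")
    return size

def main():
    start = time.perf_counter()
    # ProcessPoolExecutor() must be called, not used as a bare class
    with ProcessPoolExecutor() as executor:
        total_size = sum(executor.map(get_one, sites))
    end = time.perf_counter()
    print(f"elapsed time: {end - start} seconds")
    print(f"downloaded a total of {total_size} bytes")

if __name__ == "__main__":
    main()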

Run external application on multiple processors

Currently I have a program that uses subprocess.Popen to open an executable file and pass an argument - equivalent to running ./path/to/file args on Linux.
This works very well but I have to execute this file over 1000 times and currently it is done one at a time, on a single processor. I want to be able to execute this file in sets of 8 for example, as I have an 8-core PC.
I have tried the following:
bolsig = ("/home/rdoyle/TEST_PROC/BOLSIG/bolsigminus")
infile_list = glob.glob(str(cwd)+"/BOLSIG Run Files/run*")
cmds_list = [[bolsig, infile] for infile in infile_list]
procs_list = [Popen(cmd) for cmd in cmds_list]
for proc in procs_list:
    proc.wait()
But this tries to execute all 1000 commands at the same time.
Anyone have any suggestions?
I like concurrent.futures for simple cases like this; it's so simple and yet so effective.
import os
import glob
from concurrent import futures
from subprocess import Popen
optim = ("/usr/bin/jpegoptim")
img_path = os.path.join(os.path.abspath(os.path.curdir), 'images')
file_list = glob.glob(img_path+'/*jpg')
def compress(fname):
    Popen([optim, fname, '-d', 'out/', '-f'])
ex = futures.ThreadPoolExecutor(max_workers=8)
ex.map(compress, file_list)
A great intro is Doug Hellmann's PyMOTW: https://pymotw.com/3/concurrent.futures/
You can use the Python multiprocessing module and its multiprocessing.pool.Pool -> https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing
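A sketch of the Pool approach applied to this case, assuming the same bolsig path and infile_list glob as in the question; because each worker blocks on its own call(), at most eight external processes run at once:

import glob
import os
from multiprocessing import Pool
from subprocess import call

bolsig = "/home/rdoyle/TEST_PROC/BOLSIG/bolsigminus"
infile_list = glob.glob(os.path.join(os.getcwd(), "BOLSIG Run Files", "run*"))

def run_one(infile):
    # call() blocks until the external program finishes,
    # so each worker handles one file at a time
    return call([bolsig, infile])

if __name__ == "__main__":
    with Pool(processes=8) as pool:   # cap at 8 concurrent external processes
        pool.map(run_one, infile_list)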

getting svchost path by PID using python 2.7

I am trying to get the svchost.exe path by PID using Python. I tried it using psutil but I got an access denied error.
Here is my code:
import psutil
p = psutil.Process(1832)
print p.exe()
You can get the process path using the wmi module.
Here is some sample code for you. It finds all processes named svchost and prints the path to each; the WMI query returns full information on the process.
import wmi
import psutil

c = wmi.WMI()
process = psutil.Process(2276)
process_name = process.name()
for process in c.Win32_Process(name=process_name):
    if process.ExecutablePath:
        print (process.ExecutablePath)
Output
c:\windows\system32\svchost.exe
c:\windows\system32\svchost.exe
c:\windows\system32\svchost.exe
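As an aside, the original psutil approach can also be kept and the failure handled explicitly. A minimal sketch (querying PID 1832 as in the question) that catches the access error raised for system processes when the script is not elevated:

import psutil

try:
    print(psutil.Process(1832).exe())
except psutil.AccessDenied:
    # system processes such as svchost.exe need elevated rights;
    # run the script as Administrator, or fall back to the wmi module as above
    print("access denied - run elevated or use the wmi module")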

Sharing a path string between modules in python

I am trying to make a GUI that displays a path to a file, which the user can change at any time. I have my defaults, which are in my first script. The following is a simplified version without any of the GUI stuff. When the user pushes a button, it runs a different script (script2), in which the information on the file is read.
script1:
import os
import sys
import subprocess
import multiprocessing as mp
import script2
specsfile = mp.Array('c',1000, lock=True)
path_save = mp.Array('c',1000, lock=True)
p = mp.Process(target=script2, args=(specsfile,path_save))
p.start()
specsfile = '//_an_excel_sheet_directory.xlsx'
path_save = '//path/to/my/directory/'
subprocess.call([sys.executable, 'script2.py'])
script2:
import multiprocessing as mp
import pandas as pd
from script1 import specsfile
from script1 import path_save
print(specsfile)
spec= pd.read_excel(specsfile)
When I run it, it gives me this error: PermissionError: [WinError 5] Access is denied
I'm not sure if I'm initializing this wrong or not. I've never used multiprocessing, but I was reading about some of its recommendations for sharing data. Basically, I want to initialize a specsfile string and a path_save string, and when they change I want the change to be reflected and passed on to script2.
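
One way to hand the shared strings to the second script is to pass the Array objects as arguments to the worker function instead of importing them back from script1. A minimal sketch of that pattern (not from the original thread; the worker function is a placeholder):

import multiprocessing as mp

def worker(specsfile, path_save):
    # mp.Array('c', ...) exposes its bytes through .value
    print(specsfile.value.decode(), path_save.value.decode())

if __name__ == '__main__':
    specsfile = mp.Array('c', 1000, lock=True)
    path_save = mp.Array('c', 1000, lock=True)
    specsfile.value = b'//_an_excel_sheet_directory.xlsx'
    path_save.value = b'//path/to/my/directory/'
    p = mp.Process(target=worker, args=(specsfile, path_save))
    p.start()
    p.join()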

What File Descriptor object does Python AsyncIO's loop.add_reader() expect?

I'm trying to understand how to use the new AsyncIO functionality in Python 3.4, and I'm struggling with how to use event_loop.add_reader(). From the limited discussions I've found, it looks like it's for reading the standard output of a separate process rather than the contents of an open file. Is that true? If so, it appears there's no AsyncIO-specific way to integrate standard file IO; is this also true?
I've been playing with the following code. Running it gives the exception PermissionError: [Errno 1] Operation not permitted from line 399 of /python3.4/selectors.py (self._epoll.register(key.fd, epoll_events)), triggered by the add_reader() line below.
import asyncio
import urllib.parse
import sys
import pdb
import os
def fileCallback(*args):
    pdb.set_trace()
path = sys.argv[1]
loop = asyncio.get_event_loop()
#fd = os.open(path, os.O_RDONLY)
fd = open(path, 'r')
#data = fd.read()
#print(data)
#fd.close()
pdb.set_trace()
task = loop.add_reader(fd, fileCallback, fd)
loop.run_until_complete(task)
loop.close()
EDIT
For those looking for an example of how to use AsyncIO to read more than one file at a time like I was curious about, here's an example of how it can be accomplished. The secret is in the line yield from asyncio.sleep(0). This essentially pauses the current function, putting it back in the event loop queue, to be called after all other ready functions are executed. Functions are determined to be ready based on how they were scheduled.
import asyncio
@asyncio.coroutine
def read_section(file, length):
    yield from asyncio.sleep(0)
    return file.read(length)

@asyncio.coroutine
def read_file(path):
    fd = open(path, 'r')
    retVal = []
    cnt = 0
    while True:
        cnt = cnt + 1
        data = yield from read_section(fd, 102400)
        print(path + ': ' + str(cnt) + ' - ' + str(len(data)))
        if len(data) == 0:
            break
    fd.close()

paths = ["loadme.txt", "loadme also.txt"]
loop = asyncio.get_event_loop()
tasks = []
for path in paths:
    tasks.append(asyncio.async(read_file(path)))
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
These functions expect a file descriptor, that is, the underlying integers the operating system uses, not Python's file objects. File objects that are based on file descriptors return that descriptor on the fileno() method, so for example:
>>> sys.stderr.fileno()
2
In Unix, file descriptors can be attached to files or a lot of other things, including other processes.
Edit for the OP's edit:
As Max says in the comments, you cannot use epoll on local files (and asyncio uses epoll). Yes, that's kind of weird. You can use it on pipes, though, for example:
import asyncio
import urllib.parse
import sys
import pdb
import os
def fileCallback(*args):
    print("Received: " + sys.stdin.readline())
loop = asyncio.get_event_loop()
task = loop.add_reader(sys.stdin.fileno(), fileCallback)
loop.run_forever()
This will echo stuff you write on stdin.
You cannot use add_reader on local files, because:
- it cannot be done using select/poll/epoll;
- it depends on the operating system;
- it cannot be fully asynchronous because of OS limitations (Linux does not support async filesystem metadata reads/writes).
But technically, yes, you should be able to do async filesystem reads and writes: (almost) all systems have a DMA mechanism for doing I/O "in the background". And no, local I/O is not so fast that nobody would want it asynchronous; the CPU is on the order of millions of times faster than disk I/O.
Look into aiofile or aiofiles if you want to try async I/O.
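A minimal sketch with aiofiles (assuming a Python version with async/await syntax and aiofiles installed via pip; this is not from the original answer, and the file names are taken from the example above):

import asyncio
import aiofiles

async def read_file(path):
    # aiofiles delegates the blocking read to a thread pool,
    # so the event loop itself is not blocked
    async with aiofiles.open(path, 'r') as f:
        return await f.read()

async def main():
    texts = await asyncio.gather(read_file("loadme.txt"),
                                 read_file("loadme also.txt"))
    for text in texts:
        print(len(text))

asyncio.run(main())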
