Python pint module with multiprocessing - python

The python pint module implements physical quantities. I would like to use it together with multiprocessing. However, I don't know how to handle creating a UnitRegistry in the new process. If I do the intuitive:
from multiprocessing import Process
from pint import UnitRegistry, set_application_registry
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
def f(one, two):
print(one / two)
if __name__ == '__main__':
p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
p.start()
p.join()
Then I get an the following exception:
Traceback (most recent call last):
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\pmaunz\PyCharmProjects\IonControl34\tests\pintmultiprocessing.py", line 12, in f
print(one / two)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 738, in __truediv__
return self._mul_div(other, operator.truediv)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 675, in _mul_div
offset_units_self = self._get_non_multiplicative_units()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1312, in _get_non_multiplicative_units
offset_units = [unit for unit in self._units.keys()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1313, in <listcomp>
if not self._REGISTRY._units[unit].is_multiplicative]
KeyError: 'millisecond'
Which I assume originates from the lack of initializing the UnitRegistry on the child process before unpickling the arguments. (Initializing the UnitRegistry in the function f does not work, as the variables have already been unpickled).
How would I go about sending a pint Quantity to a child process?
Edit after Tim Peter's answer:
The problem is not tied to multiprocessing. Simply pickling quantities
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'wb') as f:
pickle.dump(Q(50, 'ms'), f)
pickle.dump(Q(50, 'ns'), f)
and then unpickling in a new script leads to the same problem:
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'rb') as f:
t1 = pickle.load(f)
t2 = pickle.load(f)
print(t1 / t2)
results in the same exception. As Tim points out, it is sufficient to add a line Q(50, 'ns'); Q(50, 'ms') before unpickling. When digging into the source code for pint, upon creation of a quantity with unit ms this unit is added to an internal registry. Pickling uses a UnitContainer instance to save the units. When creating a Quantity via unpickling the unit is not added to the registry.
A simple fix (in pint source code) is to change the function Quantity.__reduce__ to return a string.
diff --git a/pint/quantity.py b/pint/quantity.py
index 3f30a25..695866a 100644
--- a/pint/quantity.py
+++ b/pint/quantity.py
## -57,7 +57,7 ## class _Quantity(SharedRegistryObject):
def __reduce__(self):
from . import _build_quantity
- return _build_quantity, (self.magnitude, self._units)
+ return _build_quantity, (self.magnitude, str(self._units))
def __new__(cls, value, units=None):
if units is None:
I have opened an issue on pint's github site.

I never used pint before, but this looked interesting ;-) First thing I noted is that I have no problem if I stick to units explicitly listed by this line:
print(dir(ureg.sys.mks))
For example, "hour" and "second" are both in the output of that, and your program runs fine if the Process line is changed to:
p = Process(target=f, args=(Q(50, 'hour'), Q(50, 'second')))
You're on Windows, so multiprocessing is using the "spawn" method: the entire program is imported fresh by the worker process, so in particular the:
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
lines were executed in the worker process too. So the unit registry is initialized in the worker, but it's not the same (identical) registry used in the main program - no memory is shared between processes.
To get much deeper we really need an expert in how pint is implemented. My guess is that for units "made up" (not in the output produced by the dir() line above) by parsing strings, new stuff is added to the registry at some level, which is needed later to reconstruct the values. "ns" and "ms" are of this nature: they are not in the dir() output.
Your program works fine as-is if I add a line like this immediately after your Q=ureg.Quantity line:
Q(1, 'ms'); Q(1, 'ns')
That was a shot in the dark (an "educated guess") that worked: it just forced the worker process to parse the same "made up" units used in the main process, to try to force its unit registry into a similar state.
I hope there's a cleaner way to get it to work, but can't help more. I'd ask the pint authors about it.

Related

Why am I getting an error from the following usage of multiprocessing.Manager().Queue() [duplicate]

I am trying to:
share a dataframe between processes
update a shared dict based on calculations performed on (but not changing) that dataframe
I am using a multiprocessing.Manager() to create a dict in shared memory (to store results) and a Namespace to store/share my dataframe that I want to read from.
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
row_to_insert = namespace.df.loc[ind]
df = shared_dict[ind]
df[ind] = row_to_insert
shared_dict[ind] = df
if __name__ == '__main__':
manager = multiprocessing.Manager()
shared_dict = manager.dict()
namespace = manager.Namespace()
n = 100
dataframe_to_be_shared = pd.DataFrame({
'player_id': list(range(n)),
'data': np.random.random(n),
}).set_index('player_id')
namespace.df = dataframe_to_be_shared
for i in range(n):
add_empty_dfs_to_shared_dict(shared_dict, i)
jobs = []
for i in range(n):
p = multiprocessing.Process(
target=edit_df_in_shared_dict,
args=(shared_dict, namespace, i)
)
jobs.append(p)
p.start()
for p in jobs:
p.join()
print(shared_dict[1])
When running the above, it writes to shared_dict correctly as my print statement executes with some data. I also get an error regarding the manager:
Process Process-88:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 788, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/henrysorsky/Library/Preferences/PyCharm2019.2/scratches/scratch_13.py", line 34, in edit_df_in_shared_dict
row_to_insert = namespace.df.loc[ind]
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 1099, in __getattr__
return callmethod('__getattribute__', (key,))
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 792, in _callmethod
self._connect()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 779, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
I understand this is coming from the manager and seems to be due to it not shutting down properly. The only similar issue I can find online:
Share list between process in python server
suggests joining all the child processes, which I am already doing.
So after a full nights sleep I realised it was actually the reading of the dataframe in shared memory that was causing issues and that at around the 20th child process, some of them were failing this read. I added a max number of processes to run at once and this solved it.
For anyone wondering, the code I used is:
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
row_to_insert = namespace.df.loc[ind]
df = shared_dict[ind]
df[ind] = row_to_insert
shared_dict[ind] = df
if __name__ == '__main__':
# region define inputs
max_jobs_running = 4
n = 100
# endregion
manager = multiprocessing.Manager()
shared_dict = manager.dict()
namespace = manager.Namespace()
dataframe_to_be_shared = pd.DataFrame({
'player_id': list(range(n)),
'data': np.random.random(n),
}).set_index('player_id')
namespace.df = dataframe_to_be_shared
for i in range(n):
add_empty_dfs_to_shared_dict(shared_dict, i)
jobs = []
jobs_running = 0
for i in range(n):
p = multiprocessing.Process(
target=edit_df_in_shared_dict,
args=(shared_dict, namespace, i)
)
jobs.append(p)
p.start()
jobs_running += 1
if jobs_running >= max_jobs_running:
while jobs_running >= max_jobs_running:
jobs_running = 0
for p in jobs:
jobs_running += p.is_alive()
for p in jobs:
p.join()
for key, value in shared_dict.items():
print(f"key: {key}")
print(f"value: {value}")
print("-" * 50)
This would probably be better handled by a Queue and Pool setup rather than my hacky fix.
The problem is probably in your main process, which created the shared dict. If you forgot to use process.join() (or an infinite loop) in your main process, then the main process may finish before the other processes using the dict. This way the dict gets destroyed, and the processes cannot connect to it.
The number of processes should not be a problem. You should be able to use the dict with as many as you wish.
TL;DR This error might happen if you initiate too many new connections to multiprocessing.Manager() objects in parallel due to hard-coded backlog limit (16 at the time of writing) in multiprocessing/managers.py:
# do authentication later
self.listener = Listener(address=address, backlog=16)
self.address = self.listener.address
Details: I was starting a few hundreds subprocesses trying to get a value from multiprocessing.Manager().dict object at the very start of my program (basically instantly parallel). First few worked fine, but then they started to fail sporadically.
Interestingly, in my case, this only happened under VSCode debugger. I have found a mailing list discussion mentioning this issue more than 10 years ago. Looking at the source code of multiprocessing I found out that the backlog limit is still hard-coded (seems to get increased from 5 to 16 in modern versions). I increased it to 64 and all errors were gone.
So if the pending connections queue reaches the limit, all new connections will be refused. Especially when you run your code under debugger, connections are getting served a tick slower and the backlog buffer may get full when hundreds of them are flowing fast in parallel.

multiprocessing exits before doing any job

I have inherited a certain parser that is supposed to parse 10 files with ~4m lines each.
The code was written in Python 2, which I updated.
There is a multiprocessing logic which i just don't seem to be able to get to work.
from multiprocessing.pool import ThreadPool
import glob
DATADIR = 'home/my_dir/where/all/my/files/are'
def process_file(filepath):
# read line by line, parse and insert to postgres database.
def process_directory(dirpath):
pattern = f'{dirpath}/*dat' # files have .dat extension.
tp = ThreadPool(10)
for filepath in glob.glob(pattern):
print(filepath)
tp.apply_async(process_file, filepath)
tp.close()
tp.join()
if __name__ == '__main__':
process_directory(DATADIR)
I have gone through a lot of the documentation and some similar questions but it just doesn't seem to work.
With the parser code what happens is that I do get printed on the console all the paths of the file that I need parsed, but then that's it the program doesn't do anything else.
The problem is in the way you're calling apply_async. I made a simple reproducer of your problem, but with a slight tweak to get the result from each call:
from multiprocessing.pool import ThreadPool
def func(f):
print("hey " + f)
return f + "1"
l = ["name", "name2", "name3"]
pool = ThreadPool(3)
out = []
for a in l:
print(a)
out.append(pool.apply_async(func, a))
# Check the response from each `apply_async` call
for a in out:
a.get()
pool.close()
pool.join()
This returns an error:
Traceback (most recent call last):
File "a.py", line 16, in <module>
a.get()
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 599, in get
raise self._value
File "/usr/lib64/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
TypeError: func() takes 1 positional argument but 4 were given
It thinks you're passing four positional arguments, instead of one. This is because apply_async wants all the arguments passed in a tuple, like this:
pool.apply_async(func, (a,))
If you put filepath in a tuple when you call apply_async, I think you'll get the behavior you expect.
It's also worth noting that your usecase is well-suited to using pool.map instead of apply_async, which is a little more succinct:
pool.map(process_file, glob.glob(pattern))

How to fix pickle.unpickling error caused by calls to subprocess.Popen in parallel script that uses mpi4py

Repeated serial calls to subprocess.Popen() in a script parallelized with mpi4py eventually cause what seems to be data corruption during communication, manifesting as a pickle.unpickling error of various types (I have seen the unpickling errors: EOF, invalid unicode character, invalid load key, unpickling stack underflow). It seems to only happen when the data being communicated is large, the number of serial calls to subprocess is large, or the number of mpi processes is large.
I can reproduce the error with python>=2.7, mpi4py>=3.0.1, and openmpi>=3.0.0. I would ultimately like to communicate python objects so I am using the lowercase mpi4py methods. Here is a minimum code which reproduces the error:
#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess
nr_calcs = 4
tasks_per_calc = 44
data_size = 55000
# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):
# Init MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
comm_size = comm.Get_size()
# Run Moc Calcs
icalc = 0
while True:
if icalc > nr_calcs - 1: break
index = icalc
icalc += 1
# Init Moc Tasks
task_list = []
moc_task = data_size*"x"
if rank==0:
task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
task_list = comm.bcast(task_list)
# Moc Run Tasks
itmp = rank
while True:
if itmp > len(task_list)-1: break
itmp += comm_size
proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
out,err = proc.communicate()
print("Rank {:3d} Finished Calc {:3d}".format(rank, index))
# --------------------------------------------------------------------
if __name__ == '__main__':
run_test(nr_calcs, tasks_per_calc, data_size)
Running this on one 44 core node with 44 mpi processes completes the first 3 "calculations" successfully, but on the final loop some processes raise:
Traceback (most recent call last):
File "./run_test.py", line 54, in <module>
run_test(nr_calcs, tasks_per_calc, data_size)
File "./run_test.py", line 39, in run_test
task_list = comm.bcast(task_list)
File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
_pickle.UnpicklingError
Sometimes the UnpicklingError has a descriptor, such as invalid load key "x", or EOF Error, invalid unicode character, or unpickling stack underflow.
Edit: It appears the problem disappears with openmpi<3.0.0, and using mvapich2, but it would still be good to understand what's going on.
I had the same problem. In my case, I got my code to work by installing mpi4py in a Python virtual environment, and by setting mpi4py.rc.recv_mprobe = False as suggested by Intel:
https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters
However, in the end I just switched to using the capital letter methods Recv and Send with NumPy arrays. They work fine with subprocess and they don't need any additional tricks.

Writing to file and running the files sometimes works, mostly only first one

This code produces WindowsError most times, rarely (like often first time it is run) not.
""" Running hexlified codes from codefiles module prepared previously """
import tempfile
import subprocess
import threading
import os
import codefiles
if __name__ == '__main__':
for ind, c in enumerate(codefiles.exes):
fn = tempfile.mktemp() + '.exe'
# for some reason hexlified code is sometimes odd length and one nibble
if len(c) & 1:
c += '0'
c = c.decode('hex')
with open(fn, 'wb') as f:
f.write(c)
threading.Thread(target=lambda:subprocess.Popen("", executable=fn)).start()
""" One time works, one time WindowsError 32
>>>
132096 c:\docume~1\admin\locals~1\temp\tmpkhhxxo.exe
991232 c:\docume~1\admin\locals~1\temp\tmp9ow6zz.exe
>>> ================================ RESTART ================================
>>>
132096 c:\docume~1\admin\locals~1\temp\tmp3hb0cf.exe
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Documents and Settings\Admin\My Documents\Google Drive\Python\Tools\runner.pyw", line 18, in <lambda>
threading.Thread(target=lambda:subprocess.Popen("", executable=fn)).start()
File "C:\Python27\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Python27\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process
991232 c:\docume~1\admin\locals~1\temp\tmpnkfuon.exe
>>>
"""
Hexlification is done with this script, and it sometimes seems to produce odd number of nibbles which seems also odd.
# bootstrapper for running hexlified executables,
# hexlification can be done by running this file directly or calling function make
# running can be done by run function in runner module, which imports the code,
# runs them from temporary files
import os
modulename = 'codefiles.py'
def code_file_names(d='.', ext='.exe'):
return [n for n in os.listdir(d)
if n.lower().endswith(ext)]
def make():
codes = code_file_names(os.curdir)
with open(modulename, 'a') as f:
f.write('exes = (')
hex_codes = [open(n, 'rb').read().encode('hex') for n in codes]
assert all(len(c) & 1 == 0 for c in hex_codes)
print len(hex_codes),map(len, hex_codes)
hex_codes = [repr(c) for c in hex_codes]
f.write(',\n'.join(hex_codes))
f.write(')\n')
if __name__ == '__main__':
import make_exe
# prepare hexlified exes for exes in directory if codefiles not prepared
if modulename not in os.listdir('.'):
try:
os.remove(modulename)
except:
pass
make()
# prepare script for py2_exe to execute the scripts by run from runner
make_exe.py2_exe('runner.pyw', 'tools/dvd.ico')
After Mr. Martelli's suggestion, I got the error disappear, but still not expected result.
I pass in new version of code the creation of exe file if it exists (saved file names in new creation routine). It launches both codes (two hexlified codes) when creating the files, but multiple copies of first code afterwards.
tempfile.mktemp(), besides being deprecated since many years because of its security problems, guarantees unique names only within a single run of a program, since, per https://docs.python.org/2/library/tempfile.html#tempfile.mktemp , "The module uses a global variable that tell it how to construct a temporary name". So on the second run, as the very clear error message tells you, "The process cannot access the file because it is being used by another process" (specifically, the process started by the previous run, i.e you cannot re-write an .exe that's currently running in some process).
The fix is to make sure each run uses its own unique directory for temporary files. See mkdtemp at https://docs.python.org/2/library/tempfile.html#tempfile.mkdtemp . How to eventually clean up those temporary directories is a different issue since that can't be done as long as any process is running an .exe file within such a directory -- you'll probably need a separate "clean-up script" that does what it can for the purpose, run periodically (e.g in Unix I'd use cron) and a repository (e.g a small sqlite DB, or even just a file) to record which of the temporary directories created in previous runs still exist and need to be cleared up (catching the exceptions seen when they can't get cleaned up yet, so as to retry in the future).

multiprocessing to a python function

How I can implement the multiprocessing to my function.I tried like this but did not work.
def steric_clashes_parallel(system):
rna_st = system[MolWithResID("G")].molecule()
for i in system.molNums():
peg_st = system[i].molecule()
if rna_st != peg_st:
print(peg_st)
for i in rna_st.atoms(AtomIdx()):
for j in peg_st.atoms(AtomIdx()):
# print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
if dist<2:
return print("there is a steric clash")
return print("there is no steric clashes")
mix = PDB().read("clash_1.pdb")
system = System()
system.add(mix)
from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_parallel,system)
I've thousand of pdb or system files to test through this function. It took 2 h for one file on a single core without multiprocessing module. Any suggestion would be great help.
My traceback looks something like this:
self.run()
File "/home/sajid/sire.app/bundled/lib/python3.3/threading.py", line 858,
in run self._target(*self._args, **self._kwargs)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/pool.py", line 351,
in _handle_tasks put(task)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/connection.py", line 206,
in send ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled
(boost.org/libs/python/doc/v2/pickle.html)
The problem is that Sire.System._System.System can't be serialized so it can't be sent to the child process. Multiprocessing uses the pickle module for serialization and you can frequently do a sanity check in the main program with pickle.dumps(my_mp_object) to verify.
You have another problem, though (or I think you do, based on variable names). the map method takes an iterable and fans its iterated objects out to pool members, but it appears that you want to process system itself, not something at it iterates.
One trick to multiprocessing is to keep the payload that you send from the parent to the child simple and let the child do the heavy lifting of creating its objects. Here, you might be better off just sending down filenames and letting the children do most of the work.
def steric_clashes_from_file(filename):
mix = PDB().read(filename)
system = System()
system.add(mix)
steric_clashes_parallel(system)
def steric_clashes_parallel(system):
rna_st = system[MolWithResID("G")].molecule()
for i in system.molNums():
peg_st = system[i].molecule()
if rna_st != peg_st:
print(peg_st)
for i in rna_st.atoms(AtomIdx()):
for j in peg_st.atoms(AtomIdx()):
# print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
if dist<2:
return print("there is a steric clash")
return print("there is no steric clashes")
filenames = ["clash_1.pdb",]
from multiprocessing import Pool
p = Pool(4, chunksize=1)
p.map(steric_clashes_from_file,filenames)
# martineau:
I tested pickle command and it gave me;
----> 1 pickle.dumps(clash_1.pdb)
RuntimeError: Pickling of "Sire.Mol._Mol.MoleculeGroup" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
----> 1 pickle.dumps(system)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
With your script it took the same time and using a single core only. dist line is iterable though. Can i run this single line over multicores ? I modify the line as;
for i in rna_st.atoms(AtomIdx()):
icent = i.evaluate().center()
for j in peg_st.atoms(AtomIdx()):
dist = Vector.distance(icent, j.evaluate().center())
There is one trick you can do to get a faster computation for each file -- processing each file sequentially, but processing the contents of the file in parallel. This relies on a number of caveats:
Your are running on a system that can fork processes (such as Linux).
The computations you are doing do not have side effects that effect the result of future computations.
It seems like this is the case in your situation, but I can't be 100% sure.
When a process is forked, all the memory in the child process is duplicated from the parent process (what's more it is duplicated in an efficient manner -- bits of memory that are only read from aren't duplicated). This makes it easy to share big, complex initial states between processes. However, once the child processes have started they will not see any changes to objects made in the parent process though (and vice versa).
Sample code:
import multiprocessing
system = None
rna_st = None
class StericClash(Exception):
"""Exception used to halt processing of a file. Could be modified to
include information about what caused the clash if this is useful."""
pass
def steric_clashes_parallel(system_index):
peg_st = system[system_index].molecule()
if rna_st != peg_st:
for i in rna_st.atoms(AtomIdx()):
for j in peg_st.atoms(AtomIdx()):
dist = Vector.distance(i.evaluate().center(),
j.evaluate().center())
if dist < 2:
raise StericClash()
def process_file(filename):
global system, rna_st
# initialise global values before creating pool
mix = PDB().read(filename)
system = System()
system.add(mix)
rna_st = system[MolWithResID("G")].molecule()
with multiprocessing.Pool() as pool:
# contents of file processed in parallel
try:
pool.map(steric_clashes_parallel, range(system.molNums()))
except StericClash:
# terminate called to halt current jobs and further processing
# of file
pool.terminate()
# wait for pool processes to terminate before returning
pool.join()
return False
else:
pool.close()
pool.join()
return True
finally:
# reset globals
system = rna_st = None
if __name__ == "__main__":
for filename in get_files_to_be_processed():
# files are being processed in serial
result = process_file(filename)
save_result_to_disk(filename, result)

Categories