Train two models concurrently - python

All I need to do is train two regression models (using scikit-learn) on the same data at the same time, using different cores. I've tried to figure it out by myself using Process, without success.
from multiprocessing import Process
from sklearn.ensemble import GradientBoostingRegressor
gb1 = GradientBoostingRegressor(n_estimators=10)
gb2 = GradientBoostingRegressor(n_estimators=100)
def train_model(model, data, target):
    model.fit(data, target)
live_data  # Pandas DataFrame object
target  # Numpy array object
p1 = Process(target=train_model, args=(gb1, live_data, target))  # same data
p2 = Process(target=train_model, args=(gb2, live_data, target))  # same data
p1.start()
p2.start()
If I run the code above I get the following error while trying to start the p1 process.
Traceback (most recent call last):
File "<pyshell#28>", line 1, in <module>
p1.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 274, in __init__
to_child.close()
IOError: [Errno 22] Invalid argument
I'm running all this as a script (in IDLE) on Windows. Any suggestions on how I should proceed?

OK, after hours spent trying to get this working, I'll post my solution.
First thing: if you're on Windows and you're using the interactive interpreter, you need to put all your code under the if __name__ == '__main__': condition, with the exception of function definitions and imports. This is because Windows starts each new process by importing the main module again, so unguarded code would run again in every child and keep spawning processes in a loop.
My solution is below:
from sklearn.ensemble import GradientBoostingRegressor
from multiprocessing import Pool
from itertools import repeat
def train_model(params):
    # Pool.map passes a single argument to the worker, so the three
    # values are packed into one tuple and unpacked here
    model, data, target = params
    model.fit(data, target)
    return model
if __name__ == '__main__':
    gb1 = GradientBoostingRegressor(n_estimators=10)
    gb2 = GradientBoostingRegressor(n_estimators=100)
    live_data  # Pandas DataFrame object
    target  # Numpy array object
    po = Pool(2)  # 2 is the number of processes we want to spawn
    gb1, gb2 = po.map_async(
        train_model,
        # zip the models with the repeated data into one iterable of tuples
        zip([gb1, gb2], repeat(live_data), repeat(target))
    ).get()
    # get() blocks until both workers have finished and returns the fitted models
    po.terminate()
    # kill the spawned processes
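On Python 3.3 or later, Pool.starmap unpacks each argument tuple for you, so the worker can keep its natural three-argument signature. What follows is a minimal sketch, assuming Python 3. MeanRegressor is a hypothetical stand-in so that the snippet runs without scikit-learn; any estimator with a fit method (such as GradientBoostingRegressor) could take its place.

```python
from itertools import repeat
from multiprocessing import Pool


class MeanRegressor:
    """Hypothetical stand-in for a scikit-learn estimator (assumption)."""

    def __init__(self, shrink=1.0):
        self.shrink = shrink
        self.mean_ = None

    def fit(self, data, target):
        # "Training" here just stores a scaled mean of the targets.
        self.mean_ = self.shrink * sum(target) / len(target)
        return self


def train_model(model, data, target):
    # Runs in a worker process; the fitted copy is pickled back to the parent.
    model.fit(data, target)
    return model


if __name__ == "__main__":
    data = [[0.0], [1.0], [2.0], [3.0]]
    target = [1.0, 2.0, 3.0, 4.0]
    models = [MeanRegressor(shrink=1.0), MeanRegressor(shrink=0.5)]

    # starmap blocks until all tasks are done; leaving the with-block
    # terminates the pool.
    with Pool(processes=2) as pool:
        fitted = pool.starmap(
            train_model, zip(models, repeat(data), repeat(target))
        )

    for model in fitted:
        print(model.shrink, model.mean_)
```

Note that the objects in fitted are copies returned from the workers; the model instances created in the parent are not modified in place, which is why the worker has to return the model.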

Related

Why am I getting an error from the following usage of multiprocessing.Manager().Queue() [duplicate]

I am trying to:
share a dataframe between processes
update a shared dict based on calculations performed on (but not changing) that dataframe
I am using a multiprocessing.Manager() to create a dict in shared memory (to store results) and a Namespace to store/share my dataframe that I want to read from.
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df
if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()
    n = 100
    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')
    namespace.df = dataframe_to_be_shared
    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)
    jobs = []
    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()
    print(shared_dict[1])
When running the above, it writes to shared_dict correctly as my print statement executes with some data. I also get an error regarding the manager:
Process Process-88:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 788, in _callmethod
conn = self._tls.connection
AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/Users/henrysorsky/Library/Preferences/PyCharm2019.2/scratches/scratch_13.py", line 34, in edit_df_in_shared_dict
row_to_insert = namespace.df.loc[ind]
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 1099, in __getattr__
return callmethod('__getattribute__', (key,))
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 792, in _callmethod
self._connect()
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/managers.py", line 779, in _connect
conn = self._Client(self._token.address, authkey=self._authkey)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 492, in Client
c = SocketClient(address)
File "/Users/henrysorsky/.pyenv/versions/3.7.3/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 61] Connection refused
I understand this is coming from the manager and seems to be due to it not shutting down properly. The only similar issue I can find online:
Share list between process in python server
suggests joining all the child processes, which I am already doing.
So after a full night's sleep I realised it was actually the reading of the dataframe in shared memory that was causing issues, and that at around the 20th child process some of them were failing this read. I added a maximum number of processes to run at once, and this solved it.
For anyone wondering, the code I used is:
import multiprocessing
import pandas as pd
import numpy as np
def add_empty_dfs_to_shared_dict(shared_dict, key):
    shared_dict[key] = pd.DataFrame()
def edit_df_in_shared_dict(shared_dict, namespace, ind):
    row_to_insert = namespace.df.loc[ind]
    df = shared_dict[ind]
    df[ind] = row_to_insert
    shared_dict[ind] = df
if __name__ == '__main__':
    # region define inputs
    max_jobs_running = 4
    n = 100
    # endregion
    manager = multiprocessing.Manager()
    shared_dict = manager.dict()
    namespace = manager.Namespace()
    dataframe_to_be_shared = pd.DataFrame({
        'player_id': list(range(n)),
        'data': np.random.random(n),
    }).set_index('player_id')
    namespace.df = dataframe_to_be_shared
    for i in range(n):
        add_empty_dfs_to_shared_dict(shared_dict, i)
    jobs = []
    jobs_running = 0
    for i in range(n):
        p = multiprocessing.Process(
            target=edit_df_in_shared_dict,
            args=(shared_dict, namespace, i)
        )
        jobs.append(p)
        p.start()
        jobs_running += 1
        if jobs_running >= max_jobs_running:
            # busy-wait until fewer than max_jobs_running children are alive
            while jobs_running >= max_jobs_running:
                jobs_running = 0
                for p in jobs:
                    jobs_running += p.is_alive()
    for p in jobs:
        p.join()
    for key, value in shared_dict.items():
        print(f"key: {key}")
        print(f"value: {value}")
        print("-" * 50)
This would probably be better handled by a Queue and Pool setup rather than my hacky fix.
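The Pool setup mentioned above could look roughly like this. It is a sketch under assumptions, not the poster's code: a plain dict stands in for the DataFrame so that the snippet runs without pandas, and init_worker and lookup_row are hypothetical names. The pool caps the number of live processes, the table is handed to each worker once through initializer, and results travel back as return values, so no Manager connection is involved.

```python
import random
from multiprocessing import Pool

_table = None  # per-worker, read-only copy of the shared data


def init_worker(table):
    # Called once in each worker process when the pool starts.
    global _table
    _table = table


def lookup_row(ind):
    # Read-only access to the worker's copy; the result goes back
    # to the parent as a return value.
    return ind, _table[ind]


if __name__ == "__main__":
    n = 100
    table = {player_id: random.random() for player_id in range(n)}

    # At most 4 worker processes exist at any time, however large n is.
    with Pool(processes=4, initializer=init_worker, initargs=(table,)) as pool:
        results = dict(pool.map(lookup_row, range(n)))

    print(len(results), results[1] == table[1])
```

Because each worker receives the table only once, the per-task payload is just an integer index, which keeps the pickling cost per task small.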
The problem is probably in your main process, which created the shared dict. If you forgot to use process.join() (or an infinite loop) in your main process, then the main process may finish before the other processes using the dict. This way the dict gets destroyed, and the processes cannot connect to it.
The number of processes should not be a problem. You should be able to use the dict with as many as you wish.
TL;DR: This error might happen if you initiate too many new connections to multiprocessing.Manager() objects in parallel, due to a hard-coded backlog limit (16 at the time of writing) in multiprocessing/managers.py:
# do authentication later
self.listener = Listener(address=address, backlog=16)
self.address = self.listener.address
Details: I was starting a few hundred subprocesses, each trying to get a value from a multiprocessing.Manager().dict object at the very start of my program (basically instantly, in parallel). The first few worked fine, but then they started to fail sporadically.
Interestingly, in my case, this only happened under the VSCode debugger. I found a mailing list discussion mentioning this issue more than 10 years ago. Looking at the source code of multiprocessing, I found that the backlog limit is still hard-coded (it seems to have been increased from 5 to 16 in modern versions). I increased it to 64 and all errors were gone.
So if the queue of pending connections reaches the limit, all new connections are refused. Especially when you run your code under a debugger, connections are served a tick slower, and the backlog buffer may fill up when hundreds of them arrive quickly in parallel.
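If patching the standard library is not an option, one possible workaround is to retry a refused connection after a short pause, giving the manager time to drain its backlog. This is a sketch of that idea rather than something taken from the answers above; call_with_retry and its parameters are hypothetical names.

```python
import time


def call_with_retry(func, attempts=5, delay=0.05):
    """Call func(), retrying with a growing pause on ConnectionRefusedError."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionRefusedError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay * (2 ** attempt))  # back off before retrying
```

In a child process this would wrap the first access to a proxy, for example call_with_retry(lambda: namespace.df), since that first access is where the connection to the manager is opened.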

Second `ParallelRunStep` in pipeline times out at start

I'm trying to run a sequence of more than one ParallelRunStep in an AzureML pipeline. To do so, I create a step with the following helper:
def create_step(name, script, inp, inp_ds):
    out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
    out_ds = out.as_dataset()
    out_ds_named = out_ds.as_named_input(f"{name}_out")
    config = cont_steps.ParallelRunConfig(
        source_directory="src",
        entry_script=script,
        mini_batch_size="1",
        error_threshold=0,
        output_action="summary_only",
        compute_target=compute_target,
        environment=component_env,
        node_count=2,
        logging_level="DEBUG"
    )
    step = cont_steps.ParallelRunStep(
        name=name,
        parallel_run_config=config,
        inputs=[inp_ds],
        output=out,
        arguments=[],
        allow_reuse=False,
    )
    return step, out, out_ds_named
As an example I create two steps like this
step1, out1, out1_ds_named = create_step("step1", "demo_s1.py", input_ds, named_input_ds)
step2, out2, out2_ds_named = create_step("step2", "demo_s2.py", out1, out1_ds_named)
Creating an experiment and submitting it to an existing workspace and Azure ML compute cluster works. The first step, step1, uses input_ds, runs its script demo_s1.py (which produces its output files), and finishes successfully.
However, the second step, step2, never gets started.
And there is a final exception:
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.16968441009521484 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 394
Traceback (most recent call last):
File "driver/amlbi_main.py", line 52, in <module>
main()
File "driver/amlbi_main.py", line 44, in main
JobStarter().start_job()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job_starter.py", line 48, in start_job
job.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job.py", line 70, in start
master.start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 174, in start
self._start()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 149, in _start
self.wait_for_input_init()
File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 124, in wait_for_input_init
raise exc
exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
Load the datasource and read the first row locally to see how long it will take.
Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
I have the impression that the second step is waiting for some data. However, the first step creates the supplied output directory and also a file:
import argparse
import os
def init():
    pass
def run(parallel_input):
    print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")
    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument('--output', type=str, required=True)
    args, unknown_args = parser.parse_known_args()
    out_path = os.path.join(args.output, "1.data")
    os.makedirs(args.output, exist_ok=True)
    open(out_path, "a").close()
    return [out_path]
I have no idea how to debug further. Has anybody an idea?
You can check this notebook for parallel run and make sure that you are using the same packages.
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb

How to fix pickle.unpickling error caused by calls to subprocess.Popen in parallel script that uses mpi4py

Repeated serial calls to subprocess.Popen() in a script parallelized with mpi4py eventually cause what seems to be data corruption during communication, manifesting as a pickle.unpickling error of various types (I have seen the unpickling errors: EOF, invalid unicode character, invalid load key, unpickling stack underflow). It seems to only happen when the data being communicated is large, the number of serial calls to subprocess is large, or the number of mpi processes is large.
I can reproduce the error with python>=2.7, mpi4py>=3.0.1, and openmpi>=3.0.0. I would ultimately like to communicate python objects so I am using the lowercase mpi4py methods. Here is a minimum code which reproduces the error:
#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess
nr_calcs = 4
tasks_per_calc = 44
data_size = 55000
# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):
    # Init MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm_size = comm.Get_size()
    # Run Moc Calcs
    icalc = 0
    while True:
        if icalc > nr_calcs - 1: break
        index = icalc
        icalc += 1
        # Init Moc Tasks
        task_list = []
        moc_task = data_size*"x"
        if rank==0:
            task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
        task_list = comm.bcast(task_list)
        # Moc Run Tasks
        itmp = rank
        while True:
            if itmp > len(task_list)-1: break
            itmp += comm_size
            proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
            out,err = proc.communicate()
        print("Rank {:3d} Finished Calc {:3d}".format(rank, index))
# --------------------------------------------------------------------
if __name__ == '__main__':
    run_test(nr_calcs, tasks_per_calc, data_size)
Running this on one 44 core node with 44 mpi processes completes the first 3 "calculations" successfully, but on the final loop some processes raise:
Traceback (most recent call last):
File "./run_test.py", line 54, in <module>
run_test(nr_calcs, tasks_per_calc, data_size)
File "./run_test.py", line 39, in run_test
task_list = comm.bcast(task_list)
File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
_pickle.UnpicklingError
Sometimes the UnpicklingError has a descriptor, such as invalid load key "x", or EOF Error, invalid unicode character, or unpickling stack underflow.
Edit: It appears the problem disappears with openmpi<3.0.0, and using mvapich2, but it would still be good to understand what's going on.
I had the same problem. In my case, I got my code to work by installing mpi4py in a Python virtual environment, and by setting mpi4py.rc.recv_mprobe = False as suggested by Intel:
https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters
However, in the end I just switched to using the capital letter methods Recv and Send with NumPy arrays. They work fine with subprocess and they don't need any additional tricks.
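The buffer-based route can also carry arbitrary Python objects if they are pickled by hand into a byte buffer first. The following is a sketch under assumptions rather than the answerer's code: pack_obj, unpack_obj and bcast_obj are hypothetical helpers, and it assumes that Bcast on an explicit MPI.BYTE buffer is as well behaved as the NumPy-array calls mentioned above.

```python
import pickle
import struct


def pack_obj(obj):
    """Pickle obj into a mutable byte buffer suitable for Bcast."""
    return bytearray(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def unpack_obj(buf):
    """Inverse of pack_obj."""
    return pickle.loads(bytes(buf))


def bcast_obj(comm, obj=None, root=0):
    """Broadcast obj from root using only uppercase (buffer) MPI calls."""
    from mpi4py import MPI  # imported lazily so the helpers work without MPI

    payload = pack_obj(obj) if comm.Get_rank() == root else None

    # Step 1: broadcast the payload length as a fixed 8-byte header.
    length = len(payload) if payload is not None else 0
    header = bytearray(struct.pack("<q", length))
    comm.Bcast([header, MPI.BYTE], root=root)
    (size,) = struct.unpack("<q", bytes(header))

    # Step 2: receivers allocate a buffer of that size, then get the bytes.
    if payload is None:
        payload = bytearray(size)
    comm.Bcast([payload, MPI.BYTE], root=root)
    return unpack_obj(payload)
```

In the reproduction script above, task_list = bcast_obj(comm, task_list) would take the place of task_list = comm.bcast(task_list). Whether this avoids the corruption on a given MPI build would have to be verified on that system.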

Python pint module with multiprocessing

The python pint module implements physical quantities. I would like to use it together with multiprocessing. However, I don't know how to handle creating a UnitRegistry in the new process. If I do the intuitive:
from multiprocessing import Process
from pint import UnitRegistry, set_application_registry
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
def f(one, two):
    print(one / two)
if __name__ == '__main__':
    p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
    p.start()
    p.join()
Then I get the following exception:
Traceback (most recent call last):
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 254, in _bootstrap
self.run()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\pmaunz\PyCharmProjects\IonControl34\tests\pintmultiprocessing.py", line 12, in f
print(one / two)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 738, in __truediv__
return self._mul_div(other, operator.truediv)
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 675, in _mul_div
offset_units_self = self._get_non_multiplicative_units()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1312, in _get_non_multiplicative_units
offset_units = [unit for unit in self._units.keys()
File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1313, in <listcomp>
if not self._REGISTRY._units[unit].is_multiplicative]
KeyError: 'millisecond'
Which I assume originates from the lack of initializing the UnitRegistry on the child process before unpickling the arguments. (Initializing the UnitRegistry in the function f does not work, as the variables have already been unpickled).
How would I go about sending a pint Quantity to a child process?
Edit after Tim Peters' answer:
The problem is not tied to multiprocessing. Simply pickling quantities
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'wb') as f:
    pickle.dump(Q(50, 'ms'), f)
    pickle.dump(Q(50, 'ns'), f)
and then unpickling in a new script leads to the same problem:
from pint import UnitRegistry, set_application_registry
import pickle
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
with open("pint.pkl", 'rb') as f:
    t1 = pickle.load(f)
    t2 = pickle.load(f)
print(t1 / t2)
results in the same exception. As Tim points out, it is sufficient to add a line Q(50, 'ns'); Q(50, 'ms') before unpickling. When digging into the source code for pint, upon creation of a quantity with unit ms this unit is added to an internal registry. Pickling uses a UnitContainer instance to save the units. When creating a Quantity via unpickling the unit is not added to the registry.
A simple fix (in pint source code) is to change the function Quantity.__reduce__ to return a string.
diff --git a/pint/quantity.py b/pint/quantity.py
index 3f30a25..695866a 100644
--- a/pint/quantity.py
+++ b/pint/quantity.py
@@ -57,7 +57,7 @@ class _Quantity(SharedRegistryObject):
     def __reduce__(self):
         from . import _build_quantity
-        return _build_quantity, (self.magnitude, self._units)
+        return _build_quantity, (self.magnitude, str(self._units))
     def __new__(cls, value, units=None):
         if units is None:
I have opened an issue on pint's github site.
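Without touching pint's source, the same idea can be applied from the outside: send the magnitude and the unit as a string, and rebuild the quantity in the child with the child's own registry. This is a sketch; to_wire and from_wire are hypothetical helper names, and it relies only on Quantity.magnitude, Quantity.units and the Quantity(value, units) constructor already used above.

```python
def to_wire(quantity):
    """Reduce a quantity to plain picklable data: (magnitude, unit string)."""
    return quantity.magnitude, str(quantity.units)


def from_wire(wire, quantity_cls):
    """Rebuild a quantity; parsing the string registers the unit locally."""
    magnitude, units = wire
    return quantity_cls(magnitude, units)


def f(one, two):
    # Runs in the child process: build a registry there, then parse
    # the unit strings with it.
    from pint import UnitRegistry  # lazy import, requires pint

    Q = UnitRegistry().Quantity
    print(from_wire(one, Q) / from_wire(two, Q))
```

The parent would then start the child with Process(target=f, args=(to_wire(Q(50, 'ms')), to_wire(Q(50, 'ns')))), so that only a number and a string cross the process boundary.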
I never used pint before, but this looked interesting ;-) First thing I noted is that I have no problem if I stick to units explicitly listed by this line:
print(dir(ureg.sys.mks))
For example, "hour" and "second" are both in the output of that, and your program runs fine if the Process line is changed to:
p = Process(target=f, args=(Q(50, 'hour'), Q(50, 'second')))
You're on Windows, so multiprocessing is using the "spawn" method: the entire program is imported fresh by the worker process, so in particular the:
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
lines were executed in the worker process too. So the unit registry is initialized in the worker, but it's not the same (identical) registry used in the main program - no memory is shared between processes.
To get much deeper we really need an expert in how pint is implemented. My guess is that for units "made up" (not in the output produced by the dir() line above) by parsing strings, new stuff is added to the registry at some level, which is needed later to reconstruct the values. "ns" and "ms" are of this nature: they are not in the dir() output.
Your program works fine as-is if I add a line like this immediately after your Q=ureg.Quantity line:
Q(1, 'ms'); Q(1, 'ns')
That was a shot in the dark (an "educated guess") that worked: it just forced the worker process to parse the same "made up" units used in the main process, to try to force its unit registry into a similar state.
I hope there's a cleaner way to get it to work, but can't help more. I'd ask the pint authors about it.

multiprocessing to a python function

How can I apply multiprocessing to my function? I tried it like this, but it did not work.
def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist<2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")
mix = PDB().read("clash_1.pdb")
system = System()
system.add(mix)
from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_parallel,system)
I have thousands of pdb or system files to test with this function. It took 2 h for one file on a single core without the multiprocessing module. Any suggestion would be a great help.
My traceback looks something like this:
self.run()
File "/home/sajid/sire.app/bundled/lib/python3.3/threading.py", line 858,
in run self._target(*self._args, **self._kwargs)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/pool.py", line 351,
in _handle_tasks put(task)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/connection.py", line 206,
in send ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled
(boost.org/libs/python/doc/v2/pickle.html)
The problem is that Sire.System._System.System can't be serialized so it can't be sent to the child process. Multiprocessing uses the pickle module for serialization and you can frequently do a sanity check in the main program with pickle.dumps(my_mp_object) to verify.
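The sanity check described above can be wrapped in a small helper. This is a sketch; is_picklable is a hypothetical name, and the set of caught exceptions is an assumption chosen to cover the usual failures, including the RuntimeError that Boost.Python wrappers raise in the traceback above.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, as multiprocessing requires."""
    try:
        pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except (pickle.PicklingError, AttributeError, TypeError, RuntimeError):
        # Boost.Python wrappers without pickle support raise RuntimeError.
        return False
    return True


if __name__ == "__main__":
    print(is_picklable("clash_1.pdb"))     # a plain filename string
    print(is_picklable(threading.Lock()))  # an OS-level lock object
```

Running this check on each argument before calling Pool.map shows early which objects have to be rebuilt inside the child instead of being sent to it.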
You have another problem, though (or I think you do, based on the variable names). The map method takes an iterable and fans its items out to the pool workers, but it appears that you want to process system itself, not the things it iterates over.
One trick to multiprocessing is to keep the payload that you send from the parent to the child simple and let the child do the heavy lifting of creating its objects. Here, you might be better off just sending down filenames and letting the children do most of the work.
def steric_clashes_from_file(filename):
    # build the (unpicklable) System inside the child process
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    steric_clashes_parallel(system)
def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist<2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")
filenames = ["clash_1.pdb",]
from multiprocessing import Pool
p = Pool(4)
# chunksize is an argument of map(), not of the Pool constructor
p.map(steric_clashes_from_file, filenames, chunksize=1)
@martineau:
I tested the pickle command and it gave me:
----> 1 pickle.dumps(clash_1.pdb)
RuntimeError: Pickling of "Sire.Mol._Mol.MoleculeGroup" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
----> 1 pickle.dumps(system)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
With your script it took the same time and used only a single core. The dist line runs inside a loop, though. Can I run this single line over multiple cores? I modified the loop as follows:
for i in rna_st.atoms(AtomIdx()):
    icent = i.evaluate().center()
    for j in peg_st.atoms(AtomIdx()):
        dist = Vector.distance(icent, j.evaluate().center())
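Independently of multiprocessing, the inner loop itself can be made cheaper by comparing squared distances (no square root) and stopping at the first hit. This is a sketch with plain (x, y, z) tuples standing in for Sire's vectors, which is an assumption; has_clash is a hypothetical name, and the cutoff of 2 is taken from the code above.

```python
def has_clash(coords_a, coords_b, cutoff=2.0):
    """Return True as soon as any pair of points is closer than cutoff."""
    cutoff_sq = cutoff * cutoff
    for ax, ay, az in coords_a:
        for bx, by, bz in coords_b:
            dist_sq = (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
            if dist_sq < cutoff_sq:
                return True  # early exit: one clash is enough
    return False


if __name__ == "__main__":
    rna = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    peg = [(10.0, 0.0, 0.0), (5.0, 1.0, 0.0)]
    print(has_clash(rna, peg))
```

Extracting the atom centres into plain tuples once per molecule also makes the data picklable, so it could be sent to worker processes if the loop still needs to be parallelised.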
There is one trick you can use to get a faster computation for each file: process the files sequentially, but process the contents of each file in parallel. This relies on a couple of assumptions:
You are running on a system that can fork processes (such as Linux).
The computations you are doing do not have side effects that affect the result of future computations.
It seems like this is the case in your situation, but I can't be 100% sure.
When a process is forked, all the memory in the child process is duplicated from the parent process (what's more it is duplicated in an efficient manner -- bits of memory that are only read from aren't duplicated). This makes it easy to share big, complex initial states between processes. However, once the child processes have started they will not see any changes to objects made in the parent process though (and vice versa).
Sample code:
import multiprocessing
system = None
rna_st = None
class StericClash(Exception):
    """Exception used to halt processing of a file. Could be modified to
    include information about what caused the clash if this is useful."""
    pass
def steric_clashes_parallel(system_index):
    # system and rna_st are inherited from the parent through fork
    peg_st = system[system_index].molecule()
    if rna_st != peg_st:
        for i in rna_st.atoms(AtomIdx()):
            for j in peg_st.atoms(AtomIdx()):
                dist = Vector.distance(i.evaluate().center(),
                                       j.evaluate().center())
                if dist < 2:
                    raise StericClash()
def process_file(filename):
    global system, rna_st
    # initialise global values before creating pool
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    rna_st = system[MolWithResID("G")].molecule()
    with multiprocessing.Pool() as pool:
        # contents of file processed in parallel
        try:
            # note: range() assumes molNums() returns a count; the question
            # iterates over system.molNums() directly, in which case its
            # items should be passed here instead
            pool.map(steric_clashes_parallel, range(system.molNums()))
        except StericClash:
            # terminate called to halt current jobs and further processing
            # of file
            pool.terminate()
            # wait for pool processes to terminate before returning
            pool.join()
            return False
        else:
            pool.close()
            pool.join()
            return True
        finally:
            # reset globals
            system = rna_st = None
if __name__ == "__main__":
    for filename in get_files_to_be_processed():
        # files are being processed in serial
        result = process_file(filename)
        save_result_to_disk(filename, result)
