multiprocessing exits before doing any job - python

I have inherited a parser that is supposed to parse 10 files with ~4m lines each.
The code was written in Python 2, which I updated.
There is some multiprocessing logic which I just don't seem to be able to get to work.
from multiprocessing.pool import ThreadPool
import glob

DATADIR = 'home/my_dir/where/all/my/files/are'

def process_file(filepath):
    # read line by line, parse and insert to postgres database.

def process_directory(dirpath):
    pattern = f'{dirpath}/*dat'  # files have .dat extension.
    tp = ThreadPool(10)
    for filepath in glob.glob(pattern):
        print(filepath)
        tp.apply_async(process_file, filepath)
    tp.close()
    tp.join()

if __name__ == '__main__':
    process_directory(DATADIR)
I have gone through a lot of the documentation and some similar questions, but I just can't seem to get it to work.
When I run the parser, all the paths of the files I need parsed are printed to the console, but then nothing else happens; the program simply exits.

The problem is in the way you're calling apply_async. I made a simple reproducer of your problem, but with a slight tweak to get the result from each call:
from multiprocessing.pool import ThreadPool

def func(f):
    print("hey " + f)
    return f + "1"

l = ["name", "name2", "name3"]

pool = ThreadPool(3)
out = []
for a in l:
    print(a)
    out.append(pool.apply_async(func, a))

# Check the response from each `apply_async` call
for a in out:
    a.get()

pool.close()
pool.join()
This returns an error:
Traceback (most recent call last):
  File "a.py", line 16, in <module>
    a.get()
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 599, in get
    raise self._value
  File "/usr/lib64/python3.4/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
TypeError: func() takes 1 positional argument but 4 were given
It thinks you're passing four positional arguments instead of one: the string is treated as the argument sequence, so each of its characters becomes a separate argument. apply_async wants all the arguments passed in a tuple, like this:
pool.apply_async(func, (a,))
If you put filepath in a tuple when you call apply_async, I think you'll get the behavior you expect.
It's also worth noting that your use case is well suited to pool.map instead of apply_async, which is a little more succinct:
pool.map(process_file, glob.glob(pattern))
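For reference, here is a minimal sketch of the corrected process_directory from the question (process_file as defined there); the only change is the one-element tuple, plus collecting the async results so worker errors actually surface:
from multiprocessing.pool import ThreadPool
import glob

def process_directory(dirpath):
    pattern = f'{dirpath}/*dat'  # files have .dat extension.
    tp = ThreadPool(10)
    results = []
    for filepath in glob.glob(pattern):
        print(filepath)
        # args must be a tuple, even for a single argument
        results.append(tp.apply_async(process_file, (filepath,)))
    tp.close()
    tp.join()
    # get() re-raises any exception from a worker, so failures are not silent
    for r in results:
        r.get()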

Related

Second `ParallelRunStep` in pipeline times out at start

I'm trying to run a sequence of more than one ParallelRunStep in an AzureML pipeline. To do so, I create each step with the following helper:
def create_step(name, script, inp, inp_ds):
    out = pip_core.PipelineData(name=f"{name}_out", datastore=dstore, is_directory=True)
    out_ds = out.as_dataset()
    out_ds_named = out_ds.as_named_input(f"{name}_out")

    config = cont_steps.ParallelRunConfig(
        source_directory="src",
        entry_script=script,
        mini_batch_size="1",
        error_threshold=0,
        output_action="summary_only",
        compute_target=compute_target,
        environment=component_env,
        node_count=2,
        logging_level="DEBUG"
    )

    step = cont_steps.ParallelRunStep(
        name=name,
        parallel_run_config=config,
        inputs=[inp_ds],
        output=out,
        arguments=[],
        allow_reuse=False,
    )
    return step, out, out_ds_named
As an example, I create two steps like this:
step1, out1, out1_ds_named = create_step("step1", "demo_s1.py", input_ds, named_input_ds)
step2, out2, out2_ds_named = create_step("step2", "demo_s2.py", out1, out1_ds_named)
Creating an experiment and submitting it to an existing workspace and Azure ML compute cluster works. The first step step1 also consumes input_ds, runs its script demo_s1.py (which produces its output files), and finishes successfully.
However, the second step step2 never gets started, and there is a final exception:
The experiment failed. Finalizing run...
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.16968441009521484 seconds
Starting the daemon thread to refresh tokens in background for process with pid = 394
Traceback (most recent call last):
  File "driver/amlbi_main.py", line 52, in <module>
    main()
  File "driver/amlbi_main.py", line 44, in main
    JobStarter().start_job()
  File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job_starter.py", line 48, in start_job
    job.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/job.py", line 70, in start
    master.start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 174, in start
    self._start()
  File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 149, in _start
    self.wait_for_input_init()
  File "/mnt/batch/tasks/shared/LS_root/jobs/pipeline/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/mounts/workspaceblobstore/azureml/08a1e1e1-7c3f-4c5a-84ad-ca99b8a6cb31/driver/master.py", line 124, in wait_for_input_init
    raise exc
exception.FirstTaskCreationTimeout: Unable to create any task within 600 seconds.
Load the datasource and read the first row locally to see how long it will take.
Set the advanced argument '--first_task_creation_timeout' to a larger value in arguments in ParallelRunStep.
I have the impression that the second step is waiting for some data. However, the first step creates the supplied output directory and also writes a file. Its entry script looks like this:
import argparse
import os

def init():
    pass

def run(parallel_input):
    print(f"*** Running {os.path.basename(__file__)} with input {parallel_input}")

    parser = argparse.ArgumentParser(description="Data Preparation")
    parser.add_argument('--output', type=str, required=True)
    args, unknown_args = parser.parse_known_args()

    out_path = os.path.join(args.output, "1.data")
    os.makedirs(args.output, exist_ok=True)
    open(out_path, "a").close()
    return [out_path]
I have no idea how to debug this further. Does anybody have an idea?
You can check this notebook for a working ParallelRunStep example and make sure that you are using the same packages:
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/parallel-run/tabular-dataset-inference-iris.ipynb
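As for the timeout itself, the error message points at the --first_task_creation_timeout argument. A hedged sketch of how it could be passed through the helper above; the value 1200 is only an illustrative guess, not a recommendation:
step = cont_steps.ParallelRunStep(
    name=name,
    parallel_run_config=config,
    inputs=[inp_ds],
    output=out,
    # advanced argument named in the FirstTaskCreationTimeout message;
    # 1200 seconds is an arbitrary example value
    arguments=["--first_task_creation_timeout", "1200"],
    allow_reuse=False,
)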

How to fix pickle.unpickling error caused by calls to subprocess.Popen in parallel script that uses mpi4py

Repeated serial calls to subprocess.Popen() in a script parallelized with mpi4py eventually cause what seems to be data corruption during communication, manifesting as a pickle.UnpicklingError of various types (I have seen EOF, invalid unicode character, invalid load key, and unpickling stack underflow errors). It seems to happen only when the data being communicated is large, the number of serial calls to subprocess is large, or the number of MPI processes is large.
I can reproduce the error with python>=2.7, mpi4py>=3.0.1, and openmpi>=3.0.0. I would ultimately like to communicate Python objects, so I am using the lowercase mpi4py methods. Here is a minimal code which reproduces the error:
#!/usr/bin/env python
from mpi4py import MPI
from copy import deepcopy
import subprocess

nr_calcs = 4
tasks_per_calc = 44
data_size = 55000

# --------------------------------------------------------------------
def run_test(nr_calcs, tasks_per_calc, data_size):

    # Init MPI
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm_size = comm.Get_size()

    # Run Moc Calcs
    icalc = 0
    while True:
        if icalc > nr_calcs - 1: break
        index = icalc
        icalc += 1

        # Init Moc Tasks
        task_list = []
        moc_task = data_size*"x"
        if rank==0:
            task_list = [deepcopy(moc_task) for i in range(tasks_per_calc)]
        task_list = comm.bcast(task_list)

        # Moc Run Tasks
        itmp = rank
        while True:
            if itmp > len(task_list)-1: break
            itmp += comm_size
            proc = subprocess.Popen(["echo", "TEST CALL TO SUBPROCESS"],
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=False)
            out,err = proc.communicate()

        print("Rank {:3d} Finished Calc {:3d}".format(rank, index))

# --------------------------------------------------------------------
if __name__ == '__main__':
    run_test(nr_calcs, tasks_per_calc, data_size)
Running this on one 44-core node with 44 MPI processes completes the first 3 "calculations" successfully, but on the final loop some processes raise:
Traceback (most recent call last):
  File "./run_test.py", line 54, in <module>
    run_test(nr_calcs, tasks_per_calc, data_size)
  File "./run_test.py", line 39, in run_test
    task_list = comm.bcast(task_list)
  File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 639, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 111, in mpi4py.MPI.Pickle.load
  File "mpi4py/MPI/msgpickle.pxi", line 101, in mpi4py.MPI.Pickle.cloads
_pickle.UnpicklingError
Sometimes the UnpicklingError has a description, such as invalid load key "x", EOFError, invalid unicode character, or unpickling stack underflow.
Edit: It appears the problem disappears with openmpi<3.0.0 and when using mvapich2, but it would still be good to understand what's going on.
I had the same problem. In my case, I got my code to work by installing mpi4py in a Python virtual environment, and by setting mpi4py.rc.recv_mprobe = False as suggested by Intel:
https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters
However, in the end I just switched to using the capital letter methods Recv and Send with NumPy arrays. They work fine with subprocess and they don't need any additional tricks.
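A minimal sketch of both workarounds, assuming the rest of the script from the question stays the same: mpi4py.rc.recv_mprobe must be set before MPI is imported, and the uppercase Bcast communicates a raw NumPy buffer instead of a pickled Python object.
import mpi4py
# must be set before "from mpi4py import MPI"
mpi4py.rc.recv_mprobe = False

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# uppercase Bcast: sends the buffer directly, no pickling involved
data = np.zeros(55000, dtype=np.uint8)
if rank == 0:
    data[:] = ord("x")
comm.Bcast(data, root=0)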

Python pint module with multiprocessing

The Python pint module implements physical quantities. I would like to use it together with multiprocessing. However, I don't know how to handle creating a UnitRegistry in the new process. If I do the intuitive:
from multiprocessing import Process
from pint import UnitRegistry, set_application_registry

ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity

def f(one, two):
    print(one / two)

if __name__ == '__main__':
    p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
    p.start()
    p.join()
Then I get the following exception:
Traceback (most recent call last):
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 254, in _bootstrap
    self.run()
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\multiprocessing\process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\pmaunz\PyCharmProjects\IonControl34\tests\pintmultiprocessing.py", line 12, in f
    print(one / two)
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 738, in __truediv__
    return self._mul_div(other, operator.truediv)
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 675, in _mul_div
    offset_units_self = self._get_non_multiplicative_units()
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1312, in _get_non_multiplicative_units
    offset_units = [unit for unit in self._units.keys()
  File "C:\WinPython-64bit-3.4.4.2Qt5\python-3.4.4.amd64\lib\site-packages\pint\quantity.py", line 1313, in <listcomp>
    if not self._REGISTRY._units[unit].is_multiplicative]
KeyError: 'millisecond'
Which I assume originates from the lack of initializing the UnitRegistry on the child process before unpickling the arguments. (Initializing the UnitRegistry in the function f does not work, as the variables have already been unpickled).
How would I go about sending a pint Quantity to a child process?
Edit after Tim Peters' answer:
The problem is not tied to multiprocessing. Simply pickling quantities
from pint import UnitRegistry, set_application_registry
import pickle

ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity

with open("pint.pkl", 'wb') as f:
    pickle.dump(Q(50, 'ms'), f)
    pickle.dump(Q(50, 'ns'), f)
and then unpickling in a new script leads to the same problem:
from pint import UnitRegistry, set_application_registry
import pickle

ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity

with open("pint.pkl", 'rb') as f:
    t1 = pickle.load(f)
    t2 = pickle.load(f)

print(t1 / t2)
results in the same exception. As Tim points out, it is sufficient to add a line Q(50, 'ns'); Q(50, 'ms') before unpickling. Digging into the pint source code shows that, upon creation of a quantity with unit ms, this unit is added to an internal registry. Pickling uses a UnitContainer instance to save the units, but when a Quantity is created via unpickling, the unit is not added to the registry.
A simple fix (in the pint source code) is to change the function Quantity.__reduce__ to return a string:
diff --git a/pint/quantity.py b/pint/quantity.py
index 3f30a25..695866a 100644
--- a/pint/quantity.py
+++ b/pint/quantity.py
@@ -57,7 +57,7 @@ class _Quantity(SharedRegistryObject):

     def __reduce__(self):
         from . import _build_quantity
-        return _build_quantity, (self.magnitude, self._units)
+        return _build_quantity, (self.magnitude, str(self._units))

     def __new__(cls, value, units=None):
         if units is None:
I have opened an issue on pint's github site.
I never used pint before, but this looked interesting ;-) First thing I noted is that I have no problem if I stick to units explicitly listed by this line:
print(dir(ureg.sys.mks))
For example, "hour" and "second" are both in the output of that, and your program runs fine if the Process line is changed to:
p = Process(target=f, args=(Q(50, 'hour'), Q(50, 'second')))
You're on Windows, so multiprocessing is using the "spawn" method: the entire program is imported fresh by the worker process, so in particular the:
ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
lines were executed in the worker process too. So the unit registry is initialized in the worker, but it's not the same (identical) registry used in the main program - no memory is shared between processes.
To get much deeper we really need an expert in how pint is implemented. My guess is that for units "made up" (not in the output produced by the dir() line above) by parsing strings, new stuff is added to the registry at some level, which is needed later to reconstruct the values. "ns" and "ms" are of this nature: they are not in the dir() output.
Your program works fine as-is if I add a line like this immediately after your Q=ureg.Quantity line:
Q(1, 'ms'); Q(1, 'ns')
That was a shot in the dark (an "educated guess") that worked: it just forced the worker process to parse the same "made up" units used in the main process, to try to force its unit registry into a similar state.
I hope there's a cleaner way to get it to work, but can't help more. I'd ask the pint authors about it.
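For completeness, here is a sketch of the original program with that one-line workaround added right after Q = ureg.Quantity (everything else unchanged):
from multiprocessing import Process
from pint import UnitRegistry, set_application_registry

ureg = UnitRegistry()
set_application_registry(ureg)
Q = ureg.Quantity
# Force 'ms' and 'ns' to be parsed (and so registered) in every process,
# including the spawned worker, before any quantities are unpickled.
Q(1, 'ms'); Q(1, 'ns')

def f(one, two):
    print(one / two)

if __name__ == '__main__':
    p = Process(target=f, args=(Q(50, 'ms'), Q(50, 'ns')))
    p.start()
    p.join()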

multiprocessing to a python function

How can I apply multiprocessing to my function? I tried it like this, but it did not work.
def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist<2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

mix = PDB().read("clash_1.pdb")
system = System()
system.add(mix)

from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_parallel, system)
I have thousands of pdb/system files to test through this function. It took 2 h for one file on a single core without the multiprocessing module. Any suggestion would be a great help.
My traceback looks something like this:
    self.run()
  File "/home/sajid/sire.app/bundled/lib/python3.3/threading.py", line 858, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/pool.py", line 351, in _handle_tasks
    put(task)
  File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/connection.py", line 206, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled
(boost.org/libs/python/doc/v2/pickle.html)
The problem is that Sire.System._System.System can't be serialized so it can't be sent to the child process. Multiprocessing uses the pickle module for serialization and you can frequently do a sanity check in the main program with pickle.dumps(my_mp_object) to verify.
You have another problem, though (or I think you do, based on variable names). The map method takes an iterable and fans its iterated objects out to pool members, but it appears that you want to process system itself, not something it iterates over.
One trick to multiprocessing is to keep the payload that you send from the parent to the child simple and let the child do the heavy lifting of creating its objects. Here, you might be better off just sending down filenames and letting the children do most of the work.
def steric_clashes_from_file(filename):
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    steric_clashes_parallel(system)

def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist<2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

filenames = ["clash_1.pdb",]

from multiprocessing import Pool
p = Pool(4)
# chunksize is an argument of map(), not of the Pool constructor
p.map(steric_clashes_from_file, filenames, chunksize=1)
# martineau:
I tested the pickle command and it gave me:
----> 1 pickle.dumps(clash_1.pdb)
RuntimeError: Pickling of "Sire.Mol._Mol.MoleculeGroup" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
----> 1 pickle.dumps(system)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
With your script it took the same time, using a single core only. The dist loop is iterable, though. Can I run this single loop over multiple cores? I modified the lines as:
for i in rna_st.atoms(AtomIdx()):
    icent = i.evaluate().center()
    for j in peg_st.atoms(AtomIdx()):
        dist = Vector.distance(icent, j.evaluate().center())
There is one trick you can do to get a faster computation for each file -- processing each file sequentially, but processing the contents of the file in parallel. This relies on a number of caveats:
You are running on a system that can fork processes (such as Linux).
The computations you are doing do not have side effects that affect the result of future computations.
It seems like this is the case in your situation, but I can't be 100% sure.
When a process is forked, all the memory in the child process is duplicated from the parent process (what's more, it is duplicated in an efficient manner: bits of memory that are only read from aren't duplicated). This makes it easy to share big, complex initial states between processes. However, once the child processes have started, they will not see any changes made to objects in the parent process (and vice versa).
Sample code:
import multiprocessing

system = None
rna_st = None

class StericClash(Exception):
    """Exception used to halt processing of a file. Could be modified to
    include information about what caused the clash if this is useful."""
    pass

def steric_clashes_parallel(system_index):
    peg_st = system[system_index].molecule()
    if rna_st != peg_st:
        for i in rna_st.atoms(AtomIdx()):
            for j in peg_st.atoms(AtomIdx()):
                dist = Vector.distance(i.evaluate().center(),
                                       j.evaluate().center())
                if dist < 2:
                    raise StericClash()

def process_file(filename):
    global system, rna_st

    # initialise global values before creating pool
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    rna_st = system[MolWithResID("G")].molecule()

    with multiprocessing.Pool() as pool:
        # contents of file processed in parallel
        try:
            pool.map(steric_clashes_parallel, range(system.molNums()))
        except StericClash:
            # terminate called to halt current jobs and further processing
            # of file
            pool.terminate()
            # wait for pool processes to terminate before returning
            pool.join()
            return False
        else:
            pool.close()
            pool.join()
            return True
        finally:
            # reset globals
            system = rna_st = None

if __name__ == "__main__":
    for filename in get_files_to_be_processed():
        # files are being processed in serial
        result = process_file(filename)
        save_result_to_disk(filename, result)

How do I subclass pyCLI's cli.app.CommandLineApp?

The documentation is really vague about subclassing the CommandLineApp, only mentioning one example:
class YourApp(cli.app.CommandLineApp):
    def main(self):
        do_stuff()
So with the information I've found I've pieced together this code:
#!/usr/bin/env python
import os
import sys

from cli.app import CommandLineApp

# Append the parent folder to the python path
sys.path.append(os.path.join(os.path.dirname(__file__), '../'))

import tabulardata
from addrtools import extract_address

class SplitAddressApp(CommandLineApp):
    def main(self):
        """
        Split an address from one column to separate columns.
        """
        table = tabulardata.from_file(self.params.file)

        def for_each_row(i, item):
            addr = extract_address(item['Address'])
            print '%-3d %-75s %s' % (i, item['Address'], repr(addr))

        table.each(for_each_row)

    def setup(self):
        self.add_param('file', metavar='FILE', help='The data file.')
        self.add_param(
            'cols', metavar='ADDRESS_COLUMN', nargs='+',
            help='The name of the address column. If multiple names are ' + \
                 'passed, each column will be checked for an address in order'
        )

if __name__ == '__main__':
    SplitAddressApp().run()
This seems correct to me. The documentation gives no examples of how to handle the setup method or how to run the application when subclassing. I get this error:
Traceback (most recent call last):
  File "bin/split_address_column", line 36, in <module>
    SplitAddressApp().run()
  File "/Users/tomas/.pythonbrew/venvs/Python-2.7.3/address_cleaner/lib/python2.7/site-packages/cli/app.py", line 440, in __init__
    Application.__init__(self, main, **kwargs)
  File "/Users/tomas/.pythonbrew/venvs/Python-2.7.3/address_cleaner/lib/python2.7/site-packages/cli/app.py", line 129, in __init__
    self.setup()
  File "bin/split_address_column", line 28, in setup
    self.add_param('file', metavar='FILE', help='The data file.')
  File "/Users/tomas/.pythonbrew/venvs/Python-2.7.3/address_cleaner/lib/python2.7/site-packages/cli/app.py", line 385, in add_param
    action = self.argparser.add_argument(*args, **kwargs)
AttributeError: 'SplitAddressApp' object has no attribute 'argparser'
So presumably I'm doing something wrong, but what?
I figured it out. Reading the source of pyCLI, it turns out that the setup method is quite important to the functioning of the whole library, whereas I thought it was just a hook where I could put my own setup code. argparser is created in cli.app.CommandLineApp.setup, which means I at least have to call
cli.app.CommandLineApp.setup(self)
inside my setup method for it to even work. And now the code works perfectly!
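In other words, the subclass's setup has to delegate to the parent before adding its own parameters; a minimal sketch based on the code from the question:
class SplitAddressApp(CommandLineApp):
    def setup(self):
        # let pyCLI create self.argparser and its default params first
        CommandLineApp.setup(self)
        self.add_param('file', metavar='FILE', help='The data file.')
        self.add_param(
            'cols', metavar='ADDRESS_COLUMN', nargs='+',
            help='The name of the address column. If multiple names are '
                 'passed, each column will be checked for an address in order'
        )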
