Using Multiprocessing with Modules - python

I am writing a module in which one function uses Pool from the multiprocessing library in Python 3.6. I have done some research on the problem, and it seems that you cannot use if __name__ == "__main__" because the code is not being run from main. I have also noticed that the Python pool processes get initialized in my task manager but essentially are stuck.
So for example:
class myClass()
    ...
    lots of different functions here
    ...
    def multiprocessFunc()
        do stuff in here

    def funcThatCallsMultiprocessFunc()
        array = [array of filenames to be called]
        if __name__ == "__main__":
            p = Pool(processes=20)
            p.map_async(multiprocessFunc, array)
I tried to remove the if __name__ == "__main__" part, but still no dice. Any help would be appreciated.

It seems to me that you have just missed out a self. in your code. I should think this will work:
class myClass():
    ...
    # lots of different functions here
    ...
    def multiprocessFunc(self, file):
        # do stuff in here

    def funcThatCallsMultiprocessFunc(self):
        array = [array of filenames to be called]
        p = Pool(processes=20)
        p.map_async(self.multiprocessFunc, array)  # added self. here
Now, having done some experiments, I see that map_async can take quite some time to start up (I think because multiprocessing has to create the worker processes), and any test code might call funcThatCallsMultiprocessFunc and then quit before the Pool has actually got going.
In my tests I had to wait over 10 seconds after funcThatCallsMultiprocessFunc before the calls to multiprocessFunc started. But once started, they seemed to run just fine.
This is the actual code I've used:
MyClass.py
from multiprocessing import Pool
import time
import string

class myClass():
    def __init__(self):
        self.result = None

    def multiprocessFunc(self, f):
        time.sleep(1)
        print(f)
        return f

    def funcThatCallsMultiprocessFunc(self):
        array = [c for c in string.ascii_lowercase]
        print(array)
        p = Pool(processes=20)
        p.map_async(self.multiprocessFunc, array, callback=self.done)
        p.close()

    def done(self, arg):
        self.result = 'Done'
        print('done', arg)
Run.py
from MyClass import myClass
import time

def main():
    c = myClass()
    c.funcThatCallsMultiprocessFunc()
    for i in range(30):
        print(i, c.result)
        time.sleep(1)

if __name__ == "__main__":
    main()
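If the caller would rather block until the work is finished instead of polling c.result, a variant can use the blocking Pool.map inside a context manager. This is only a minimal sketch, not the original poster's code; the class name, filenames and the per-file work are placeholders:

from multiprocessing import Pool

class myClassSync():
    def multiprocessFunc(self, f):
        return f.upper()                      # placeholder for the real per-file work

    def funcThatCallsMultiprocessFunc(self):
        array = ['a.txt', 'b.txt', 'c.txt']   # hypothetical filenames
        with Pool(processes=4) as p:          # the with-block closes the pool for us
            return p.map(self.multiprocessFunc, array)  # blocks until every item is done

if __name__ == "__main__":
    print(myClassSync().funcThatCallsMultiprocessFunc())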

The if __name__ == '__main__' construct is an import guard. You want to use it to stop multiprocessing from running your setup code again when the module is imported.
In your case, you can leave this guard out of the class definition itself. Just be sure to protect the execution entry point in the calling file, like this:
import multiprocessing as mp

# parallel_function, callback_function, z, x and y are placeholders for your own code
def apply_async_with_callback():
    pool = mp.Pool(processes=30)
    for i in range(z):
        pool.apply_async(parallel_function, args=(i, x, y), callback=callback_function)
    pool.close()
    pool.join()
    print("Multiprocessing done!")

if __name__ == '__main__':
    apply_async_with_callback()

Related

Sharing a Pool object across processes in python multiprocessing

I am trying to share a pool object between multiple processes using the following code
from multiprocessing import Process, Pool
import time

pool = Pool(5)

def print_hello():
    time.sleep(1)
    return "hello"

def pipeline():
    print("In pipeline")
    msg = pool.apply_async(print_hello()).get(timeout=1.5)
    print("In pipeline")
    print(msg)

def post():
    p = Process(target=pipeline)
    p.start()
    return

if __name__ == '__main__':
    post()
    print("Returned from post")
However, the code exits with the timeout since get() does not return. I believe this has to do with pool being a globally accessible variable, because it works just fine when I move pool so that it is local to the pipeline function. Can anyone suggest a workaround for this problem?
Edit: finally got it working with a thread instead of a process for running the pipeline function.
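For reference, a minimal sketch of that thread-based workaround (not the original code): the Pool stays in the main process and pipeline() runs in a Thread, so the workers remain reachable. The names are illustrative, and the function object is passed to apply_async rather than being called first.

from multiprocessing import Pool
from threading import Thread
import time

def print_hello():
    time.sleep(1)
    return "hello"

def pipeline(pool):
    # pass the function itself, don't call it
    msg = pool.apply_async(print_hello).get(timeout=1.5)
    print(msg)

if __name__ == '__main__':
    with Pool(5) as pool:
        t = Thread(target=pipeline, args=(pool,))
        t.start()
        t.join()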

my python Ray script runs on a single worker only

I am new to Ray and, after having read the documentation, I came up with a script that mimics what I want to do further with Ray. Here is my script:
import ray
import time
import h5py

@ray.remote
class Analysis:

    def __init__(self):
        self._file = h5py.File('./Data/Trajectories/MDANSE/apoferritin.h5')

    def __getstate__(self):
        print('I dump')
        d = self.__dict__.copy()
        del d['_file']
        return d

    def __setstate__(self, state):
        self.__dict__ = state
        self._file = h5py.File('./Data/Trajectories/MDANSE/apoferritin.h5')

    def run_step(self, index):
        time.sleep(5)
        print('I run a step', index)

    def combine(self, index):
        print('I combine', index)

ray.init(num_cpus=4)

a = Analysis.remote()
obj_id = ray.put(a)

for i in range(100):
    output = ray.get(a.run_step.remote(i))
My problem is that when I run this script it runs on a single worker only, as indicated by the Ray output, whereas I would expect 4 workers to be fired up. Do you know what is wrong with my script?
Quoting from the Ray docs on actors:
Methods called on different actors can execute in parallel, and methods called on the same actor are executed serially in the order that they are called.
Another issue with the above code is that ray.get is a blocking call.
I would suggest instantiating multiple actors and spreading the jobs across them, like:
actors = [Analysis.remote() for i in range(num_cpus)]
outputs = []
for i in range(100):
    outputs.append(actors[i % num_cpus].run_step.remote(i))
output = ray.get(outputs)
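For a self-contained illustration of that pattern (a sketch only: the h5py file is dropped and the workload is a dummy sleep, so the numbers are arbitrary), all the remote calls are submitted first and ray.get blocks once at the end, letting the four actors work concurrently:

import ray
import time

@ray.remote
class Analysis:
    def run_step(self, index):
        time.sleep(1)          # stand-in for the real per-step work
        return index

if __name__ == '__main__':
    ray.init(num_cpus=4)
    num_cpus = 4
    actors = [Analysis.remote() for _ in range(num_cpus)]
    # Submit everything, then block once; the steps spread across the actors.
    outputs = [actors[i % num_cpus].run_step.remote(i) for i in range(16)]
    print(ray.get(outputs))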

Basic multiprocessing with infinite loop and queue

import random
import queue as Queue
import _thread as Thread

a = Queue.Queue()

def af():
    while True:
        a.put(random.randint(0, 1000))

def bf():
    while True:
        if (not a.empty()): print(a.get())

def main():
    Thread.start_new_thread(af, ())
    Thread.start_new_thread(bf, ())
    return

if __name__ == "__main__":
    main()
The above code works, but with extremely high CPU usage. I tried to use multiprocessing to no avail. I have tried
def main():
    multiprocessing.Process(target=af).run()
    multiprocessing.Process(target=bf).run()
and
def main():
    manager = multiprocessing.Manager()
    a = manager.Queue()
    pool = multiprocessing.Pool()
    pool.apply_async(af)
    pool.apply_async(bf)
Neither is working. Can anyone please help me? Thanks a bunch ^_^
def main():
    multiprocessing.Process(target=af).run()  # will not return
    multiprocessing.Process(target=bf).run()
The above code does not work because af never returns, so there is no chance to call bf. You need to replace the run() calls with start()/join() so that both processes can run in parallel (and make them share a manager.Queue).
To make the second version work, you need to pass a (the manager.Queue object) to the functions. Otherwise they will use the global Queue.Queue object, which is not shared between processes; modify af and bf to accept a, and main to pass it:
def af(a):
    while True:
        a.put(random.randint(0, 1000))

def bf(a):
    while True:
        print(a.get())

def main():
    manager = multiprocessing.Manager()
    a = manager.Queue()
    pool = multiprocessing.Pool()
    proc1 = pool.apply_async(af, [a])
    proc2 = pool.apply_async(bf, [a])
    # Wait until the processes end. Uncomment the following lines if there's no other waiting code.
    # proc1.get()
    # proc2.get()
In the first alternative main you use Process, but the method you should call to start the activity is not run(), as one might think, but rather start(). You will want to follow that up with appropriate join() calls. Following the information in the multiprocessing documentation (available here: https://docs.python.org/2/library/multiprocessing.html), here is a working sample:
import random
from multiprocessing import Process, Queue

def af(q):
    while True:
        q.put(random.randint(0, 1000))

def bf(q):
    while True:
        if not q.empty():
            print(q.get())

def main():
    a = Queue()
    p = Process(target=af, args=(a,))
    c = Process(target=bf, args=(a,))
    p.start()
    c.start()
    p.join()
    c.join()

if __name__ == "__main__":
    main()
To add to the accepted answer: in the original code,
    while True:
        if not q.empty():
            print(q.get())
q.empty() is called on every iteration, which is unnecessary, because if the queue is empty q.get() will simply wait until something is available (see the documentation).
Similar answer here.
I assume this could affect performance, since calling .empty() on every iteration consumes extra resources (it would be more noticeable if Thread were used instead of Process, because of the Python Global Interpreter Lock (GIL)).
I know it's an old question, but I hope it helps!
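If you do want a bound on how long get() blocks, a small sketch (assuming the same multiprocessing Queue as in the accepted answer; not part of the original posts):

import queue  # only needed for the Empty exception

def bf(q):
    while True:
        try:
            item = q.get(timeout=1)   # blocks for up to 1 s instead of spinning
        except queue.Empty:
            continue                   # nothing arrived yet; try again
        print(item)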

Extending mp.Process in Python 3

import multiprocessing as mp
import time as t

class MyProcess(mp.Process):
    def __init__(self, target, args, name):
        mp.Process.__init__(self, target=target, args=args)
        self.exit = mp.Event()
        self.name = name
        print("{0} initiated".format(self.name))

    def run(self):
        while not self.exit.is_set():
            pass
        print("Process {0} exited.".format(self.name))

    def shutdown(self):
        print("Shutdown initiated for {0}.".format(self.name))
        self.exit.set()

def f(x):
    while True:
        print(x)
        x = x+1

if __name__ == "__main__":
    p = MyProcess(target=f, args=[3], name="function")
    p.start()
    #p.join()
    t.wait(2)
    p.shutdown()
I'm trying to extend the multiprocessing.Process class to add a shutdown method, in order to be able to exit a function which could potentially have to run for an undefined amount of time. Following the instructions from Python Multiprocessing Exit Elegantly How? and adding the argument passing I came up with myself, I only get this output:
function initiated
Shutdown initiated for function.
Process function exited.
But no actual output from the target method f(x). It seems that the actual process target doesn't get started. I'm obviously doing something wrong, but I just can't figure out what. Any ideas?
Thanks!
The sane way to handle this situation is, where possible, to have the background task cooperate in the exit mechanism by periodically checking the exit event. For that, there's no need to subclass Process: you can rewrite your background task to include that check. For example, here's your code rewritten using that approach:
import multiprocessing as mp
import time as t

def f(x, exit_event):
    while not exit_event.is_set():
        print(x)
        x = x+1
    print("Exiting")

if __name__ == "__main__":
    exit_event = mp.Event()
    p = mp.Process(target=f, args=(3, exit_event), name="function")
    p.start()
    t.sleep(2)
    exit_event.set()
    p.join()
If that's not an option (for example because you can't modify the code that's being run in the background job), then you can use the Process.terminate method. But you should be aware that using it is dangerous: the child process won't have an opportunity to clean up properly, so for example if it's shut down while holding a multiprocessing lock, no other process will be able to acquire that lock, giving a risk of deadlock. It's far better to have the child cooperate in the shutdown if possible.
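If you do end up falling back to terminate, a common pattern (a sketch, not part of the answer above) is to give the child a grace period first:

p.join(timeout=5)      # give the child a few seconds to exit on its own
if p.is_alive():
    p.terminate()      # forceful stop: no cleanup code runs in the child
    p.join()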
The solution to this problem is to call super().run() in your class's run method.
Of course, this means your function will then run forever because of the while True loop, and the exit event will no longer end it.
You can use the Process.terminate() method to end your process instead.
import multiprocessing as mp
import time as t

class MyProcess(mp.Process):
    def __init__(self, target, args, name):
        mp.Process.__init__(self, target=target, args=args)
        self.name = name
        print("{0} initiated".format(self.name))

    def run(self):
        print("Process {0} started.".format(self.name))
        super().run()

    def shutdown(self):
        print("Shutdown initiated for {0}.".format(self.name))
        self.terminate()

def f(x):
    while True:
        print(x)
        t.sleep(1)
        x += 1

if __name__ == "__main__":
    p = MyProcess(target=f, args=(3,), name="function")
    p.start()
    # p.join()
    t.sleep(5)
    p.shutdown()

Python Process Pool non-daemonic?

Would it be possible to create a python Pool that is non-daemonic? I want a pool to be able to call a function that has another pool inside.
I want this because daemon processes cannot create processes. Specifically, it will cause the error:
AssertionError: daemonic processes are not allowed to have children
For example, consider the scenario where function_a has a pool which runs function_b which has a pool which runs function_c. This function chain will fail, because function_b is being run in a daemon process, and daemon processes cannot create processes.
The multiprocessing.pool.Pool class creates the worker processes in its __init__ method, makes them daemonic and starts them, and it is not possible to re-set their daemon attribute to False before they are started (and afterwards it's not allowed anymore). But you can create your own sub-class of multiprocessing.pool.Pool (multiprocessing.Pool is just a wrapper function) and substitute your own multiprocessing.Process sub-class, which is always non-daemonic, to be used for the worker processes.
Here's a full example of how to do this. The important parts are the two classes NoDaemonProcess and MyPool at the top and to call pool.close() and pool.join() on your MyPool instance at the end.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import multiprocessing
# We must import this explicitly, it is not imported by the top-level
# multiprocessing module.
import multiprocessing.pool
import time

from random import randint


class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess

def sleepawhile(t):
    print("Sleeping %i seconds..." % t)
    time.sleep(t)
    return t

def work(num_procs):
    print("Creating %i (daemon) workers and jobs in child." % num_procs)
    pool = multiprocessing.Pool(num_procs)

    result = pool.map(sleepawhile,
                      [randint(1, 5) for x in range(num_procs)])

    # The following is not really needed, since the (daemon) workers of the
    # child's pool are killed when the child is terminated, but it's good
    # practice to cleanup after ourselves anyway.
    pool.close()
    pool.join()
    return result

def test():
    print("Creating 5 (non-daemon) workers and jobs in main process.")
    pool = MyPool(5)

    result = pool.map(work, [randint(1, 5) for x in range(5)])

    pool.close()
    pool.join()
    print(result)

if __name__ == '__main__':
    test()
I had the necessity to employ a non-daemonic pool in Python 3.7 and ended up adapting the code posted in the accepted answer. Below is the snippet that creates the non-daemonic pool:
import multiprocessing.pool

class NoDaemonProcess(multiprocessing.Process):
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        pass

class NoDaemonContext(type(multiprocessing.get_context())):
    Process = NoDaemonProcess

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class NestablePool(multiprocessing.pool.Pool):
    def __init__(self, *args, **kwargs):
        kwargs['context'] = NoDaemonContext()
        super(NestablePool, self).__init__(*args, **kwargs)
As the current implementation of multiprocessing has been extensively refactored to be based on contexts, we need to provide a NoDaemonContext class that has our NoDaemonProcess as an attribute. NestablePool will then use that context instead of the default one.
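For illustration, a minimal usage sketch (assuming the NestablePool defined above; the worker functions are placeholders and not part of the original answer):

import multiprocessing

def inner(x):
    return x * x

def outer(n):
    # Each (non-daemonic) outer worker may open its own ordinary inner pool.
    with multiprocessing.Pool(2) as inner_pool:
        return sum(inner_pool.map(inner, range(n)))

if __name__ == '__main__':
    with NestablePool(3) as outer_pool:
        print(outer_pool.map(outer, [2, 3, 4]))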
That said, I should warn that there are at least two caveats to this approach:
It still depends on implementation details of the multiprocessing package, and could therefore break at any time.
There are valid reasons why multiprocessing made it so hard to use non-daemonic processes, many of which are explained here. The most compelling in my opinion is:
As for allowing children threads to spawn off children of its own using
subprocess runs the risk of creating a little army of zombie
'grandchildren' if either the parent or child threads terminate before
the subprocess completes and returns.
As of Python 3.8, concurrent.futures.ProcessPoolExecutor doesn't have this limitation. It can have a nested process pool with no problem at all:
from concurrent.futures import ProcessPoolExecutor as Pool
from itertools import repeat
from multiprocessing import current_process
import time

def pid():
    return current_process().pid

def _square(i):  # Runs in inner_pool
    square = i ** 2
    time.sleep(i / 10)
    print(f'{pid()=} {i=} {square=}')
    return square

def _sum_squares(i, j):  # Runs in outer_pool
    with Pool(max_workers=2) as inner_pool:
        squares = inner_pool.map(_square, (i, j))
    sum_squares = sum(squares)
    time.sleep(sum_squares ** .5)
    print(f'{pid()=}, {i=}, {j=} {sum_squares=}')
    return sum_squares

def main():
    with Pool(max_workers=3) as outer_pool:
        for sum_squares in outer_pool.map(_sum_squares, range(5), repeat(3)):
            print(f'{pid()=} {sum_squares=}')

if __name__ == "__main__":
    main()
The above demonstration code was tested with Python 3.8.
A limitation of ProcessPoolExecutor, however, is that it doesn't have maxtasksperchild. If you need this, consider the answer by Massimiliano instead.
Credit: answer by jfs
The multiprocessing module has a nice interface to use pools with processes or threads. Depending on your use case, you might consider using multiprocessing.pool.ThreadPool for your outer Pool, which will result in threads (which are allowed to spawn processes from within) as opposed to processes.
It might be limited by the GIL, but in my particular case (I tested both), the startup time of the processes from the outer Pool created as above far outweighed the cost of the ThreadPool solution.
It's really easy to swap Processes for Threads. Read more about how to use a ThreadPool solution here or here.
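A sketch of that swap (illustrative names only, not from the original answer): the outer pool is a multiprocessing.pool.ThreadPool, so its workers are threads and may freely start an inner process Pool.

from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def inner(x):
    return x * x

def outer(n):
    # Runs in a thread, so it may create a process Pool without the daemon error.
    with Pool(2) as p:
        return sum(p.map(inner, range(n)))

if __name__ == '__main__':
    with ThreadPool(3) as tp:
        print(tp.map(outer, [2, 3, 4]))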
On some Python versions, replacing the standard Pool with a custom one can raise the error: AssertionError: group argument must be None for now.
Here I found a solution that can help:
class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, val):
        pass

class NoDaemonProcessPool(multiprocessing.pool.Pool):
    def Process(self, *args, **kwds):
        proc = super(NoDaemonProcessPool, self).Process(*args, **kwds)
        proc.__class__ = NoDaemonProcess
        return proc
I have seen people dealing with this issue by using celery's fork of multiprocessing called billiard (multiprocessing pool extensions), which allows daemonic processes to spawn children. The workaround is to simply replace the multiprocessing module with:
import billiard as multiprocessing
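An untested sketch of the nested-pool layout under that drop-in replacement, assuming billiard's Pool mirrors the stdlib Pool API as its docs claim (the worker functions are placeholders):

import billiard as multiprocessing

def square(x):
    return x * x

def nested(n):
    # Allowed here: billiard lets daemonic workers spawn children.
    pool = multiprocessing.Pool(2)
    result = sum(pool.map(square, range(n)))
    pool.close()
    pool.join()
    return result

if __name__ == '__main__':
    pool = multiprocessing.Pool(3)
    print(pool.map(nested, [2, 3, 4]))
    pool.close()
    pool.join()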
The issue I encountered was in trying to import globals between modules, causing the ProcessPool() line to get evaluated multiple times.
globals.py
from processing import Manager, Lock
from pathos.multiprocessing import ProcessPool
from pathos.threading import ThreadPool

class SingletonMeta(type):
    def __new__(cls, name, bases, dict):
        dict['__deepcopy__'] = dict['__copy__'] = lambda self, *args: self
        return super(SingletonMeta, cls).__new__(cls, name, bases, dict)

    def __init__(cls, name, bases, dict):
        super(SingletonMeta, cls).__init__(name, bases, dict)
        cls.instance = None

    def __call__(cls, *args, **kw):
        if cls.instance is None:
            cls.instance = super(SingletonMeta, cls).__call__(*args, **kw)
        return cls.instance

    def __deepcopy__(self, item):
        return item.__class__.instance

class Globals(object):
    __metaclass__ = SingletonMeta
    """
    This class is a workaround to the bug: AssertionError: daemonic processes are not allowed to have children
    The root cause is that importing this file from different modules causes this file to be reevaluated each time,
    thus ProcessPool() gets reexecuted inside that child thread, thus causing the daemonic processes bug
    """
    def __init__(self):
        print "%s::__init__()" % (self.__class__.__name__)
        self.shared_manager      = Manager()
        self.shared_process_pool = ProcessPool()
        self.shared_thread_pool  = ThreadPool()
        self.shared_lock         = Lock()  # BUG: Windows: global name 'lock' is not defined | doesn't affect cygwin
Then import safely from elsewhere in your code
from globals import Globals
Globals().shared_manager
Globals().shared_process_pool
Globals().shared_thread_pool
Globals().shared_lock
I have written a more expanded wrapper class around pathos.multiprocessing here:
https://github.com/JamesMcGuigan/python2-timeseries-datapipeline/blob/master/src/util/MultiProcessing.py
As a side note: if your use case just requires an async multiprocess map as a performance optimization, then joblib will manage all your process pools behind the scenes and allows this very simple syntax:
squares = Parallel(-1)( delayed(lambda num: num**2)(x) for x in range(100) )
https://joblib.readthedocs.io/
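For completeness, a runnable version of that one-liner (a sketch using joblib's Parallel/delayed API; n_jobs=-1 uses all available cores):

from joblib import Parallel, delayed

def square(num):
    return num ** 2

if __name__ == '__main__':
    squares = Parallel(n_jobs=-1)(delayed(square)(x) for x in range(100))
    print(squares[:10])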
This presents a workaround for when the error is seemingly a false positive. As also noted by James, this can happen due to an unintentional import from a daemonic process.
For example, if you have the following simple code, WORKER_POOL can inadvertently be imported from a worker, leading to the error.
import multiprocessing
WORKER_POOL = multiprocessing.Pool()
A simple but reliable approach for a workaround is:
import multiprocessing
import multiprocessing.pool

class MyClass:

    @property
    def worker_pool(self) -> multiprocessing.pool.Pool:
        # Ref: https://stackoverflow.com/a/63984747/
        try:
            return self._worker_pool  # type: ignore
        except AttributeError:
            # pylint: disable=protected-access
            self.__class__._worker_pool = multiprocessing.Pool()  # type: ignore
            return self.__class__._worker_pool  # type: ignore
            # pylint: enable=protected-access
In the above workaround, MyClass.worker_pool can be used without the error. If you think this approach can be improved upon, let me know.
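A short usage sketch of the lazy property (illustrative only, assuming the MyClass above; the pool is created on first access):

def square(x):
    return x * x

if __name__ == '__main__':
    print(MyClass().worker_pool.map(square, range(10)))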
Since Python 3.7 we can create a non-daemonic ProcessPoolExecutor.
Using if __name__ == "__main__": is necessary while using multiprocessing.
from concurrent.futures import ProcessPoolExecutor as Pool

num_pool = 10

def main_pool(num):
    print(num)
    strings_write = (f'{num}-{i}' for i in range(num))
    with Pool(num) as subp:
        subp.map(sub_pool, strings_write)
    return None

def sub_pool(x):
    print(f'{x}')
    return None

if __name__ == "__main__":
    with Pool(num_pool) as p:
        p.map(main_pool, list(range(1, num_pool+1)))
Here is how you can start a pool even if you are already in a daemonic process. This was tested in Python 3.8.5.
First, define the Undaemonize context manager, which temporarily deletes the daemon state of the current process.
class Undaemonize(object):
    '''Context manager to resolve AssertionError: daemonic processes are not allowed to have children

    Tested in Python 3.8.5'''
    def __init__(self):
        self.p = multiprocessing.process.current_process()
        if 'daemon' in self.p._config:
            self.daemon_status_set = True
        else:
            self.daemon_status_set = False
        self.daemon_status_value = self.p._config.get('daemon')

    def __enter__(self):
        if self.daemon_status_set:
            del self.p._config['daemon']

    def __exit__(self, type, value, traceback):
        if self.daemon_status_set:
            self.p._config['daemon'] = self.daemon_status_value
Now you can start a pool as follows, even from within a daemon process:
with Undaemonize():
    pool = multiprocessing.Pool(1)
pool.map(...)  # you can do something with the pool outside of the context manager
While the other approaches here aim to create pool that is not daemonic in the first place, this approach allows you to start a pool even if you are in a daemonic process already.
