Python multiprocessing pool force distribution of process

Python multiprocessing pool force distribution of process - python

This question comes as a result of trying to combine logging with a multiprocessing pool. Under Linux there is nothing to do; the module containing my pool worker method inherits the main app logger properties. Under Windows I have to initialize the logger in each process, which I do by running pool.map_async with the initializer method. The problem is that the method runs so quickly that it gets executed more than once in some processes and not at all in others. I can get it to work properly if I add a short time delay to the method but this seems inelegant.
Is there a way to force the pool to distribute the processes evenly?
(some background: http://plumberjack.blogspot.de/2010/09/using-logging-with-multiprocessing.html)
The code is as follows, I can't really post the whole module ;-)
The call is this:
# Set up logger on Windows platforms
if os.name == 'nt':
_ = pool.map_async(ml.worker_configurer,
[self._q for _ in range(mp.cpu_count())])
The function ml.worker_configurer is this:
def worker_configurer(queue, delay=True):
h = QueueHandler(queue)
root = logging.getLogger()
root.addHandler(h)
root.setLevel(logging.DEBUG)
if delay:
import time
time.sleep(1.0)
return
New worker configurer
def worker_configurer2(queue):
root = logging.getLogger()
if not root.handlers:
h = QueueHandler(queue)
root.addHandler(h)
root.setLevel(logging.DEBUG)
return

You can do something like this:
sub_logger = None
def get_logger():
global sub_logger
if sub_logger is None:
# configure logger
return sub_logger
def worker1():
logger = get_logger()
# DO WORK
def worker2():
logger = get_logger()
# DO WORK
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
result = pool.map_async(worker1, some_data)
result.get()
result = pool.map_async(worker2, some_data)
result.get()
# and so on and so forth
Because each process has its own memory space (and thus it's own set of global variables), you can set the initial global logger to None and only configure the logger if it has not been previously configured.

Related

Is it right to init multiprocess in class init?

from multiprocessing.dummy import Pool as ThreadPool
class TSNew:
def __init__(self):
self.redis_client = redis.StrictRedis(host="172.17.31.147", port=4401, db=0)
self.global_switch = 0
self.pool = ThreadPool(40) # init pool
self.dnn_model = None
self.nnf = None
self.md5sum_nnf = "initialize"
self.thread = threading.Thread(target=self.load_model_item)
self.ts_picked_ids = None
self.thread.start()
self.memory = deque(maxlen=3000)
self.process = threading.Thread(target=self.process_user_dict)
self.process.start()
def load_model_item(self):
'''
code
'''
def predict_memcache(self,user_dict):
'''
code
'''
def process_user_dict(self):
while True:
'''
code to generate user_dicts which is a list
'''
results = self.pool.map(self.predict_memcache, user_dicts)
'''
code
'''
TSNew_ = TSNew()
def get_user_result():
logging.info("----------------come in ------------------")
if request.method == 'POST':
user_dict_json = request.get_data()# userid
if user_dict_json == '' or user_dict_json is None:
logging.info("----------------user_dict_json is ''------------------")
return ''
try:
user_dict = json.loads(user_dict_json)
except:
logging.info("json load error, pass")
return ''
TSNew_.memory.append(user_dict)
logging.info('add to deque TSNew_.memory size: %d PID: %d', len(TSNew_.memory), os.getpid())
logging.info("add to deque userid: %s, nation: %s \n",user_dict['user_id'], user_dict['user_country'])
return 'SUCCESS\n'
#app.route('/', methods=['POST'])
def get_ts_gbdt_id():
return get_user_result()
from werkzeug.contrib.fixers import ProxyFix
app.wsgi_app = ProxyFix(app.wsgi_app)
if __name__ == '__main__':
app.run(host='0.0.0.0', port=4444)
I create a multi thread pool in class __init__ and I use the self.pool
to map the function of predict_memcache.
I have two doubts:
(a) Should I initialize the pool in __init__ or just init it right before
results = self.pool.map(self.predict_memcache, user_dicts)
(b) Since the pool is a multi thread operation and it is executed in the thread of process_user_dict, so is there any hidden error ?
Thanks.

Question (a):
It depends. If you need to run process_user_dict more than once, then it makes sense to start the pool in the constructor and keep it running. Creating a thread pool always comes with some overhead and by keeping the pool alive between calls to process_user_dict you would avoid that additional overhead.
If you just want to process one set of input, you can as well create your pool right inside process_user_dict. But probably not right before results = self.pool.map(self.predict_memcache, user_dicts) because that would create a pool for every iteration of your surrounding while loop.
In your specific case, it does not make any difference. You create your TSNew_ object on module-level, so that it remains alive (and with it the thread pool) while your app is running; the same thread pool from the same TSNew instance is used to process all the requests during the lifetime of app.run().
Since you seem to be using that construct with self.process = threading.Thread(target=self.process_user_dict) as some sort of listener on self.memory, creating the pool in the constructor is functionally equivalent to creating the pool inside of process_user_dict (but outside the loop).
Question (b):
Technically, there is no hidden error by default when creating a thread inside a thread. In the end, any additional thread's ultimate parent is always the MainThread, that is implicitly created for every instance of a Python interpreter. Basically, every time you create a thread inside a Python program, you create a thread in a thread.
Actually, your code does not even create a thread inside a thread. Your self.pool is created inside the MainThread. When the pool is instantiated via self.pool = ThreadPool(40) it creates the desired number (40) of worker threads, plus one worker handler thread, one task handler thread and one result handler thread. All of these are child threads of the MainThread. All you do with regards to your pool inside your thread under self.process is calling its map method to assign tasks to it.
However, I do not really see the point of what you are doing with that self.process here.
Making a guess, I would say that you want to start the loop in process_user_dict to act as kind of a listener on self.memory, so that the pool starts processing user_dict as soon as they start showing up in the deque in self.memory. From what I see you doing in get_user_result, you seem to get one user_dict per request. I understand that you might have concurrent user sessions passing in these dicts, but do you really see benfit from process_user_dict running in an infinite loop over simply calling TSNew_.process_user_dict() after TSNew_.memory.append(user_dict)? You could even omit self.memory completely and pass the dict directly to process_user_dict, unless I am missing something you did not show us.

python-daemon and logging: set logging level interactively

I have a python-daemon process that logs to a file via a ThreadedTCPServer (inspired by the cookbook example: https://docs.python.org/2/howto/logging-cookbook.html#sending-and-receiving-logging-events-across-a-network, as I will have many such processes writing to the same file). I am controlling the spawning of the daemon process using subprocess.Popen from an ipython console, and this is how the application will be run. I am able to successfully write to the log file from both the main ipython process, as well as the daemon process, but I am unable to change the level of both by just simply setting the level of the root logger in ipython. Is this something that should be possible? Or will it require custom functionality to set the logging.level of the daemon separately?
Edit: As requested, here is an attempt to provide a pseudo-code example of what I am trying to achieve. I hope that this is a sufficient description.
daemon_script.py
import logging
import daemon
from other_module import function_to_run_as_daemon
class daemon(object):
def __init__(self):
self.daemon_name = __name__
logging.basicConfig() # <--- required, or I don't get any log messages
self.logger = logging.getLogger(self.daemon_name)
self.logger.debug( "Created logger successfully" )
def run(self):
with daemon.daemonContext( files_preserve = [self.logger.handlers[0].stream] )
self.logger.debug( "Daemonised successfully - about to enter function" )
function_to_run_as_daemon()
if __name__ == "__main__":
d = daemon()
d.run()
Then in ipython i would run something like
>>> import logging
>>> rootlogger = logging.getLogger()
>>> rootlogger.info( "test" )
INFO:root:"test"
>>> subprocess.Popen( ["python" , "daemon_script.py"] )
DEBUG:__main__:"Created logger successfully"
DEBUG:__main__:"Daemonised successfully - about to enter function"
# now i'm finished debugging and testing, i want to reduce the level for all the loggers by changing the level of the handler
# Note that I also tried changing the level of the root handler, but saw no change
>>> rootlogger.handlers[0].setLevel(logging.INFO)
>>> rootlogger.info( "test" )
INFO:root:"test"
>>> print( rootlogger.debug("test") )
None
>>> subprocess.Popen( ["python" , "daemon_script.py"] )
DEBUG:__main__:"Created logger successfully"
DEBUG:__main__:"Daemonised successfully - about to enter function"
I think that I may not be approaching this correctly, but, its not clear to me what would work better. Any advice would be appreciated.

The logger you create in your daemon won't be the same as the logger you made in ipython. You could test this to be sure, by just printing out both logger objects themselves, which will show you their memory addresses.
I think a better pattern would be be that you pass if you want to be in "debug" mode or not, when you run the daemon. In other words, call popen like this:
subprocess.Popen( ["python" , "daemon_script.py", "debug"] )
It's up to you, you could pass a string meaning "debug mode is on" as above, or you could pass the log level constant that means "debug", e.g.:
subprocess.Popen( ["python" , "daemon_script.py", "10"] )
(https://docs.python.org/2/library/logging.html#levels)
Then in the daemon's init function use argv for example, to get that argument and use it:
...
import sys
def __init__(self):
self.daemon_name = __name__
logging.basicConfig() # <--- required, or I don't get any log messages
log_level = int(sys.argv[1]) # Probably don't actually just blindly convert it without error handling
self.logger = logging.getLogger(self.daemon_name)
self.logger.setLevel(log_level)
...

Python multiprocessing module do not work

i am trying to write a spider with multiprocessing module
here is my python code:
# -*- coding:utf-8 -*-
import multiprocessing
import requests
class SpiderWorker(object):
def __init__(self, q):
self._q = q
def run(self):
def _crawl_item(url):
requests.get("http://www.baidu.com")
if respon.ok:
print respon.url
while True:
rst = self._q.get()
_crawl_item(rst)
def general_worker():
q = multiprocessing.Queue()
CPU_COUNT = multiprocessing.cpu_count()
worker_processes = [
multiprocessing.Process(target=SpiderWorker(q).run)
for i in range(CPU_COUNT)
]
map( lambda process: process.start(), worker_processes )
return q, worker_processes
maybe it is my process way wrong
every time i run this code, my process tell me
<Process(Process-1, stopped[SIGSEGV])>
hope love it

The major problem here is that you don't have any information on why your processes fail. It could be gevent, but it could just as easily be something else. So learning the actual reason why your processes get terminated is the first step before doing anything else.
What you need is multiprocessing.log_to_stderr():
class SpiderWorker(object):
# ...
def run(self):
logger = multiprocessing.log_to_stderr()
logger.setLevel(multiprocessing.SUBDEBUG)
try:
# Here goes your original run() code
except Exception:
logger.exception('whoopsie')
What this code does:
Creates a special logger which will transmit it's information to the main process and dump it to stderr (console by default).
Configures this logger to report everything, including some internal multiprocessing module events (just in case as you probably don't need them).
Wraps your entire code in catch-all statement so whatever happens there cannot escape your notice.
Runs .exception() method on the logger, which not only logs the message (it's meaningless anyway as we don't know what actually happens) but most importantly logs the entire error traceback - which we actually need.

Python multiprocessing - logging.FileHandler object raises PicklingError

It seems that handlers from the logging module and multiprocessing jobs do not mix:
import functools
import logging
import multiprocessing as mp
logger = logging.getLogger( 'myLogger' )
handler = logging.FileHandler( 'logFile' )
def worker( x, handler ) :
print x ** 2
pWorker = functools.partial( worker, handler=handler )
#
if __name__ == '__main__' :
pool = mp.Pool( processes=1 )
pool.map( pWorker, range(3) )
pool.close()
pool.join()
Out:
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
If I replace pWorker be either one of the following methods, no error is raised
# this works
def pWorker( x ) :
worker( x, handler )
# this works too
pWorker = functools.partial( worker, handler=open( 'logFile' ) )
I don't really understand the PicklingError. Is it because objects of class logging.FileHandler are not pickable? (I googled it but didn't find anything)

The FileHandler object internally uses a threading.Lock to synchronize writes between threads. However, the thread.lock object returned by threading.Lock can't be pickled, which means it can't be sent between processes, which is required to send it to the child via pool.map.
There is a section in the multiprocessing docs that talks about how logging with multiprocessing works here. Basically, you need to let the child process inherit the parent process' logger, rather than trying to explicitly pass loggers or handlers via calls to map.
Note, though, that on Linux, you can do this:
from multiprocessing import Pool
import logging
logger = logging.getLogger( 'myLogger' )
def worker(x):
print handler
print x **2
def initializer(handle):
global handler
handler = handle
if __name__ == "__main__":
handler = logging.FileHandler( 'logFile' )
#pWorker = functools.partial( worker, handler=handler )
pool = Pool(processes=4, initializer=initializer, initargs=(handler,))
pool.map(worker, range(3))
pool.close()
pool.join
initializer/initargs are used to run a method once in each of the pool's child processes as soon as they start. On Linux this allows the handler to be into the child via inheritance, thanks to the way os.fork works. However, this won't work on Windows; because it lacks support for os.fork, it would still need to pickle handler to pass it via initargs.

Python Process Pool non-daemonic?

Would it be possible to create a python Pool that is non-daemonic? I want a pool to be able to call a function that has another pool inside.
I want this because deamon processes cannot create process. Specifically, it will cause the error:
AssertionError: daemonic processes are not allowed to have children
For example, consider the scenario where function_a has a pool which runs function_b which has a pool which runs function_c. This function chain will fail, because function_b is being run in a daemon process, and daemon processes cannot create processes.

The multiprocessing.pool.Pool class creates the worker processes in its __init__ method, makes them daemonic and starts them, and it is not possible to re-set their daemon attribute to False before they are started (and afterwards it's not allowed anymore). But you can create your own sub-class of multiprocesing.pool.Pool (multiprocessing.Pool is just a wrapper function) and substitute your own multiprocessing.Process sub-class, which is always non-daemonic, to be used for the worker processes.
Here's a full example of how to do this. The important parts are the two classes NoDaemonProcess and MyPool at the top and to call pool.close() and pool.join() on your MyPool instance at the end.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import multiprocessing
# We must import this explicitly, it is not imported by the top-level
# multiprocessing module.
import multiprocessing.pool
import time
from random import randint
class NoDaemonProcess(multiprocessing.Process):
# make 'daemon' attribute always return False
def _get_daemon(self):
return False
def _set_daemon(self, value):
pass
daemon = property(_get_daemon, _set_daemon)
# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
Process = NoDaemonProcess
def sleepawhile(t):
print("Sleeping %i seconds..." % t)
time.sleep(t)
return t
def work(num_procs):
print("Creating %i (daemon) workers and jobs in child." % num_procs)
pool = multiprocessing.Pool(num_procs)
result = pool.map(sleepawhile,
[randint(1, 5) for x in range(num_procs)])
# The following is not really needed, since the (daemon) workers of the
# child's pool are killed when the child is terminated, but it's good
# practice to cleanup after ourselves anyway.
pool.close()
pool.join()
return result
def test():
print("Creating 5 (non-daemon) workers and jobs in main process.")
pool = MyPool(5)
result = pool.map(work, [randint(1, 5) for x in range(5)])
pool.close()
pool.join()
print(result)
if __name__ == '__main__':
test()

I had the necessity to employ a non-daemonic pool in Python 3.7 and ended up adapting the code posted in the accepted answer. Below there's the snippet that creates the non-daemonic pool:
import multiprocessing.pool
class NoDaemonProcess(multiprocessing.Process):
#property
def daemon(self):
return False
#daemon.setter
def daemon(self, value):
pass
class NoDaemonContext(type(multiprocessing.get_context())):
Process = NoDaemonProcess
# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class NestablePool(multiprocessing.pool.Pool):
def __init__(self, *args, **kwargs):
kwargs['context'] = NoDaemonContext()
super(NestablePool, self).__init__(*args, **kwargs)
As the current implementation of multiprocessing has been extensively refactored to be based on contexts, we need to provide a NoDaemonContext class that has our NoDaemonProcess as attribute. NestablePool will then use that context instead of the default one.
That said, I should warn that there are at least two caveats to this approach:
It still depends on implementation details of the multiprocessing package, and could therefore break at any time.
There are valid reasons why multiprocessing made it so hard to use non-daemonic processes, many of which are explained here. The most compelling in my opinion is:
As for allowing children threads to spawn off children of its own using
subprocess runs the risk of creating a little army of zombie
'grandchildren' if either the parent or child threads terminate before
the subprocess completes and returns.

As of Python 3.8, concurrent.futures.ProcessPoolExecutor doesn't have this limitation. It can have a nested process pool with no problem at all:
from concurrent.futures import ProcessPoolExecutor as Pool
from itertools import repeat
from multiprocessing import current_process
import time
def pid():
return current_process().pid
def _square(i): # Runs in inner_pool
square = i ** 2
time.sleep(i / 10)
print(f'{pid()=} {i=} {square=}')
return square
def _sum_squares(i, j): # Runs in outer_pool
with Pool(max_workers=2) as inner_pool:
squares = inner_pool.map(_square, (i, j))
sum_squares = sum(squares)
time.sleep(sum_squares ** .5)
print(f'{pid()=}, {i=}, {j=} {sum_squares=}')
return sum_squares
def main():
with Pool(max_workers=3) as outer_pool:
for sum_squares in outer_pool.map(_sum_squares, range(5), repeat(3)):
print(f'{pid()=} {sum_squares=}')
if __name__ == "__main__":
main()
The above demonstration code was tested with Python 3.8.
A limitation of ProcessPoolExecutor, however, is that it doesn't have maxtasksperchild. If you need this, consider the answer by Massimiliano instead.
Credit: answer by jfs

The multiprocessing module has a nice interface to use pools with processes or threads. Depending on your current use case, you might consider using multiprocessing.pool.ThreadPool for your outer Pool, which will result in threads (that allow to spawn processes from within) as opposed to processes.
It might be limited by the GIL, but in my particular case (I tested both), the startup time for the processes from the outer Pool as created here far outweighed the solution with ThreadPool.
It's really easy to swap Processes for Threads. Read more about how to use a ThreadPool solution here or here.

On some Python versions replacing standard Pool to custom can raise error: AssertionError: group argument must be None for now.
Here I found a solution that can help:
class NoDaemonProcess(multiprocessing.Process):
# make 'daemon' attribute always return False
#property
def daemon(self):
return False
#daemon.setter
def daemon(self, val):
pass
class NoDaemonProcessPool(multiprocessing.pool.Pool):
def Process(self, *args, **kwds):
proc = super(NoDaemonProcessPool, self).Process(*args, **kwds)
proc.__class__ = NoDaemonProcess
return proc

I have seen people dealing with this issue by using celery's fork of multiprocessing called billiard (multiprocessing pool extensions), which allows daemonic processes to spawn children. The walkaround is to simply replace the multiprocessing module by:
import billiard as multiprocessing

The issue I encountered was in trying to import globals between modules, causing the ProcessPool() line to get evaluated multiple times.
globals.py
from processing import Manager, Lock
from pathos.multiprocessing import ProcessPool
from pathos.threading import ThreadPool
class SingletonMeta(type):
def __new__(cls, name, bases, dict):
dict['__deepcopy__'] = dict['__copy__'] = lambda self, *args: self
return super(SingletonMeta, cls).__new__(cls, name, bases, dict)
def __init__(cls, name, bases, dict):
super(SingletonMeta, cls).__init__(name, bases, dict)
cls.instance = None
def __call__(cls,*args,**kw):
if cls.instance is None:
cls.instance = super(SingletonMeta, cls).__call__(*args, **kw)
return cls.instance
def __deepcopy__(self, item):
return item.__class__.instance
class Globals(object):
__metaclass__ = SingletonMeta
"""
This class is a workaround to the bug: AssertionError: daemonic processes are not allowed to have children
The root cause is that importing this file from different modules causes this file to be reevalutated each time,
thus ProcessPool() gets reexecuted inside that child thread, thus causing the daemonic processes bug
"""
def __init__(self):
print "%s::__init__()" % (self.__class__.__name__)
self.shared_manager = Manager()
self.shared_process_pool = ProcessPool()
self.shared_thread_pool = ThreadPool()
self.shared_lock = Lock() # BUG: Windows: global name 'lock' is not defined | doesn't affect cygwin
Then import safely from elsewhere in your code
from globals import Globals
Globals().shared_manager
Globals().shared_process_pool
Globals().shared_thread_pool
Globals().shared_lock
I have written a more expanded wrapper class around pathos.multiprocessing here:
https://github.com/JamesMcGuigan/python2-timeseries-datapipeline/blob/master/src/util/MultiProcessing.py
As a side note, if your usecase just requires async multiprocess map as a performance optimization, then joblib will manage all your process pools behind the scenes and allow this very simple syntax:
squares = Parallel(-1)( delayed(lambda num: num**2)(x) for x in range(100) )
https://joblib.readthedocs.io/

This presents a workaround for when the error is seemingly a false-positive. As also noted by James, this can happen to an unintentional import from a daemonic process.
For example, if you have the following simple code, WORKER_POOL can inadvertently be imported from a worker, leading to the error.
import multiprocessing
WORKER_POOL = multiprocessing.Pool()
A simple but reliable approach for a workaround is:
import multiprocessing
import multiprocessing.pool
class MyClass:
#property
def worker_pool(self) -> multiprocessing.pool.Pool:
# Ref: https://stackoverflow.com/a/63984747/
try:
return self._worker_pool # type: ignore
except AttributeError:
# pylint: disable=protected-access
self.__class__._worker_pool = multiprocessing.Pool() # type: ignore
return self.__class__._worker_pool # type: ignore
# pylint: enable=protected-access
In the above workaround, MyClass.worker_pool can be used without the error. If you think this approach can be improved upon, let me know.

Since Python version 3.7 we can create non-daemonic ProcessPoolExecutor
Using if __name__ == "__main__": is necessary while using multiprocessing.
from concurrent.futures import ProcessPoolExecutor as Pool
num_pool = 10
def main_pool(num):
print(num)
strings_write = (f'{num}-{i}' for i in range(num))
with Pool(num) as subp:
subp.map(sub_pool,strings_write)
return None
def sub_pool(x):
print(f'{x}')
return None
if __name__ == "__main__":
with Pool(num_pool) as p:
p.map(main_pool,list(range(1,num_pool+1)))

Here is how you can start a pool, even if you are in a daemonic process already. This was tested in python 3.8.5
First, define the Undaemonize context manager, which temporarily deletes the daemon state of the current process.
class Undaemonize(object):
'''Context Manager to resolve AssertionError: daemonic processes are not allowed to have children
Tested in python 3.8.5'''
def __init__(self):
self.p = multiprocessing.process.current_process()
if 'daemon' in self.p._config:
self.daemon_status_set = True
else:
self.daemon_status_set = False
self.daemon_status_value = self.p._config.get('daemon')
def __enter__(self):
if self.daemon_status_set:
del self.p._config['daemon']
def __exit__(self, type, value, traceback):
if self.daemon_status_set:
self.p._config['daemon'] = self.daemon_status_value
Now you can start a pool as follows, even from within a daemon process:
with Undaemonize():
pool = multiprocessing.Pool(1)
pool.map(... # you can do something with the pool outside of the context manager
While the other approaches here aim to create pool that is not daemonic in the first place, this approach allows you to start a pool even if you are in a daemonic process already.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.