Python multiprocessing - logging.FileHandler object raises PicklingError - python

It seems that handlers from the logging module and multiprocessing jobs do not mix:
import functools
import logging
import multiprocessing as mp
logger = logging.getLogger( 'myLogger' )
handler = logging.FileHandler( 'logFile' )
def worker( x, handler ) :
    print x ** 2
pWorker = functools.partial( worker, handler=handler )
#
if __name__ == '__main__' :
    pool = mp.Pool( processes=1 )
    pool.map( pWorker, range(3) )
    pool.close()
    pool.join()
Out:
cPickle.PicklingError: Can't pickle <type 'thread.lock'>: attribute lookup thread.lock failed
If I replace pWorker by either one of the following, no error is raised:
# this works
def pWorker( x ) :
    worker( x, handler )
# this works too
pWorker = functools.partial( worker, handler=open( 'logFile' ) )
I don't really understand the PicklingError. Is it because objects of class logging.FileHandler are not picklable? (I googled it but didn't find anything.)

The FileHandler object internally uses a threading.Lock to synchronize writes between threads. However, the thread.lock object returned by threading.Lock can't be pickled, which means it can't be sent between processes, which is required to send it to the child via pool.map.
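The pickling failure can be reproduced without any pool at all. The following is a minimal sketch for Python 3 (where the error type and message differ from the Python 2 traceback above); the helper names can_pickle and handler_is_picklable are made up for this illustration:

```python
import logging
import os
import pickle
import tempfile


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
    except (TypeError, AttributeError, pickle.PicklingError):
        return False
    return True


def handler_is_picklable():
    """Build a throwaway FileHandler and report whether it can be pickled."""
    log_dir = tempfile.mkdtemp()
    handler = logging.FileHandler(os.path.join(log_dir, "demo.log"))
    try:
        # The handler holds a lock and an open file object; neither can be pickled.
        return can_pickle(handler)
    finally:
        handler.close()


if __name__ == "__main__":
    print("FileHandler picklable:", handler_is_picklable())
    print("plain string picklable:", can_pickle("logFile"))
```

Anything passed to pool.map (including arguments bound with functools.partial) goes through the same pickling step, which is why the handler cannot travel that way while a plain file name can.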
There is a section in the multiprocessing docs on how logging works with multiprocessing. Basically, you need to let the child process inherit the parent process' logger, rather than trying to explicitly pass loggers or handlers via calls to map.
Note, though, that on Linux, you can do this:
from multiprocessing import Pool
import logging
logger = logging.getLogger( 'myLogger' )
def worker(x):
    print handler
    print x **2
def initializer(handle):
    global handler
    handler = handle
if __name__ == "__main__":
    handler = logging.FileHandler( 'logFile' )
    #pWorker = functools.partial( worker, handler=handler )
    pool = Pool(processes=4, initializer=initializer, initargs=(handler,))
    pool.map(worker, range(3))
    pool.close()
    pool.join()
initializer/initargs are used to run a function once in each of the pool's child processes as soon as they start. On Linux this allows the handler to be passed into the child via inheritance, thanks to the way os.fork works. However, this won't work on Windows; because it lacks support for os.fork, it would still need to pickle handler to pass it via initargs.
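A variation that does not depend on os.fork is to pass only the log file path through initargs and build the handler inside each child. This is a sketch in Python 3 under that assumption, not the answer's original code; init_worker and the message format are illustrative choices:

```python
import logging
import multiprocessing as mp
import os
import tempfile

logger = logging.getLogger("myLogger")


def init_worker(log_path):
    """Runs once in each worker process: create the handler in the child."""
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(process)d %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)


def worker(x):
    result = x ** 2
    logger.info("worker(%d) -> %d", x, result)
    return result


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "logFile")
    # Only a string crosses the process boundary, so nothing unpicklable is sent.
    with mp.Pool(processes=2, initializer=init_worker, initargs=(log_path,)) as pool:
        print(pool.map(worker, range(3)))
    print("log written to", log_path)
```

Note that the logging module does not synchronize writes from several processes to one file, so for heavier use a queue-based handler as described in the logging cookbook is probably the safer design.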

Related

How do I share a list across processes with SyncManager in Python without passing references

In python, the multiprocessing module provides managers that can generate shared lists/dicts between processes.
However, I'm having an issue using these shared objects if the processes accessing the manager are not child processes, but are instead connecting to the manager via Manager.connect.
Here's a very basic example: I'm trying to create a shared list that can be accessed by a group of processes. For this example, I'm just launching this same code twice in two terminal windows:
import os, time
from multiprocessing.managers import SyncManager
def main() -> None:
    print(f"I am process {os.getpid()}")
    print(f"Starting proxy server...")
    manager = SyncManager(address=("127.0.0.1", 8000), authkey=b"noauth")
    try:
        manager.start() # will start the sync process if it doesn't exist
    except:
        manager.connect() # if it does already exist, connect to it instead
    print(f"Proxy server started/connected")
    # would like to generate a shared list that each process can access.
    sharedList = manager.list() # this generates a new list, so each process gets their own, which is not what I want
    sharedList.append(os.getpid())
    time.sleep(20)
if __name__ == '__main__':
    main()
Python's documentation on using remote managers seems similar to what I'm looking for, but there's no information on how to get a Manager.list or Manager.dict shared.
Note: I'd also be perfectly happy sharing a namespace object.
Here's how I ended up solving the problem. You need to spawn a manager process, that is in possession of the shared list, manually.
import multiprocessing
from multiprocessing import process
import os, time, sys
from multiprocessing.managers import SyncManager, ListProxy
from queue import Queue
class SharedStorage(SyncManager):
    pass
def ManagerProcess():
    sys.stdout = open(os.devnull, 'w')
    sys.stderr = open(os.devnull, 'w')
    l = list()
    SharedStorage.register('get_list', lambda: l, ListProxy)
    try:
        ss = SharedStorage(address=("127.0.0.1", 8000), authkey=b"noauth")
        ss.get_server().serve_forever()
    except OSError:
        # failed to listen on port - already in use.
        pass
def main() -> None:
    print(f"I am process {os.getpid()}")
    print(f"Starting proxy server...")
    mainProcess = multiprocessing.Process(target=ManagerProcess, daemon=True)
    mainProcess.start()
    SharedStorage.register('get_list')
    manager = SharedStorage(address=("127.0.0.1", 8000), authkey=b"noauth")
    manager.connect()
    print(f"Proxy server started/connected")
    # required - see https://bugs.python.org/issue7503
    multiprocessing.current_process().authkey = b"noauth"
    # get reference to the shared list object
    shared_list = manager.get_list()
    shared_list.append(os.getpid())
    for i in shared_list:
        print(i)
    time.sleep(20)
if __name__ == '__main__':
    main()
This can be run several times safely, as manager processes spawned by later processes will exit after they cannot listen on the port.

python-daemon and logging: set logging level interactively

I have a python-daemon process that logs to a file via a ThreadedTCPServer (inspired by the cookbook example: https://docs.python.org/2/howto/logging-cookbook.html#sending-and-receiving-logging-events-across-a-network, as I will have many such processes writing to the same file). I am controlling the spawning of the daemon process using subprocess.Popen from an ipython console, and this is how the application will be run. I am able to successfully write to the log file from both the main ipython process, as well as the daemon process, but I am unable to change the level of both by just simply setting the level of the root logger in ipython. Is this something that should be possible? Or will it require custom functionality to set the logging.level of the daemon separately?
Edit: As requested, here is an attempt to provide a pseudo-code example of what I am trying to achieve. I hope that this is a sufficient description.
daemon_script.py
import logging
import daemon
from other_module import function_to_run_as_daemon
# Named Daemon (capitalized) so the class does not shadow the imported daemon module
class Daemon(object):
    def __init__(self):
        self.daemon_name = __name__
        logging.basicConfig() # <--- required, or I don't get any log messages
        self.logger = logging.getLogger(self.daemon_name)
        self.logger.debug( "Created logger successfully" )
    def run(self):
        with daemon.DaemonContext( files_preserve = [self.logger.handlers[0].stream] ):
            self.logger.debug( "Daemonised successfully - about to enter function" )
            function_to_run_as_daemon()
if __name__ == "__main__":
    d = Daemon()
    d.run()
Then in IPython I would run something like:
>>> import logging
>>> rootlogger = logging.getLogger()
>>> rootlogger.info( "test" )
INFO:root:"test"
>>> subprocess.Popen( ["python" , "daemon_script.py"] )
DEBUG:__main__:"Created logger successfully"
DEBUG:__main__:"Daemonised successfully - about to enter function"
# now i'm finished debugging and testing, i want to reduce the level for all the loggers by changing the level of the handler
# Note that I also tried changing the level of the root handler, but saw no change
>>> rootlogger.handlers[0].setLevel(logging.INFO)
>>> rootlogger.info( "test" )
INFO:root:"test"
>>> print( rootlogger.debug("test") )
None
>>> subprocess.Popen( ["python" , "daemon_script.py"] )
DEBUG:__main__:"Created logger successfully"
DEBUG:__main__:"Daemonised successfully - about to enter function"
I think that I may not be approaching this correctly, but it's not clear to me what would work better. Any advice would be appreciated.
The logger you create in your daemon won't be the same as the logger you made in ipython. You could test this to be sure, by just printing out both logger objects themselves, which will show you their memory addresses.
I think a better pattern would be to pass in whether you want "debug" mode or not when you run the daemon. In other words, call Popen like this:
subprocess.Popen( ["python" , "daemon_script.py", "debug"] )
It's up to you: you could pass a string meaning "debug mode is on" as above, or you could pass the log level constant that means "debug", e.g.:
subprocess.Popen( ["python" , "daemon_script.py", "10"] )
(https://docs.python.org/2/library/logging.html#levels)
Then in the daemon's init function use argv for example, to get that argument and use it:
...
import sys
def __init__(self):
    self.daemon_name = __name__
    logging.basicConfig() # <--- required, or I don't get any log messages
    log_level = int(sys.argv[1]) # Probably don't actually just blindly convert it without error handling
    self.logger = logging.getLogger(self.daemon_name)
    self.logger.setLevel(log_level)
...
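As a sketch of the error handling hinted at in the comment above, the argument can be accepted either as a level name ("debug") or as a number ("10"), falling back to a default when it is missing or unrecognized. The function names parse_log_level and level_from_argv, and the INFO default, are assumptions made for this example:

```python
import logging
import sys

# Level names accepted on the command line, mapped to logging's constants.
_LEVELS = {
    name: getattr(logging, name)
    for name in ("CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG")
}


def parse_log_level(arg, default=logging.INFO):
    """Turn 'debug', 'DEBUG' or '10' into a numeric logging level."""
    if arg is None:
        return default
    text = str(arg).strip()
    if text.isdigit():
        return int(text)
    return _LEVELS.get(text.upper(), default)


def level_from_argv(argv, default=logging.INFO):
    """Read the level from argv[1] if present, else use the default."""
    return parse_log_level(argv[1] if len(argv) > 1 else None, default)


if __name__ == "__main__":
    logging.basicConfig()
    log = logging.getLogger("daemon_script")
    log.setLevel(level_from_argv(sys.argv))
    log.debug("only shown when started with 'debug' or '10'")
    log.info("shown at the default level")
```

With this in place, both Popen calls shown above (passing "debug" or "10") would select the same level.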

Python multiprocessing pool force distribution of process

This question comes as a result of trying to combine logging with a multiprocessing pool. Under Linux there is nothing to do; the module containing my pool worker method inherits the main app logger properties. Under Windows I have to initialize the logger in each process, which I do by running pool.map_async with the initializer method. The problem is that the method runs so quickly that it gets executed more than once in some processes and not at all in others. I can get it to work properly if I add a short time delay to the method but this seems inelegant.
Is there a way to force the pool to distribute the processes evenly?
(some background: http://plumberjack.blogspot.de/2010/09/using-logging-with-multiprocessing.html)
The code is as follows, I can't really post the whole module ;-)
The call is this:
# Set up logger on Windows platforms
if os.name == 'nt':
    _ = pool.map_async(ml.worker_configurer,
                       [self._q for _ in range(mp.cpu_count())])
The function ml.worker_configurer is this:
def worker_configurer(queue, delay=True):
    h = QueueHandler(queue)
    root = logging.getLogger()
    root.addHandler(h)
    root.setLevel(logging.DEBUG)
    if delay:
        import time
        time.sleep(1.0)
    return
New worker configurer
def worker_configurer2(queue):
    root = logging.getLogger()
    if not root.handlers:
        h = QueueHandler(queue)
        root.addHandler(h)
        root.setLevel(logging.DEBUG)
    return
You can do something like this:
import logging
import multiprocessing
sub_logger = None
def get_logger():
    global sub_logger
    if sub_logger is None:
        # configure logger (add handlers, set the level, ...)
        sub_logger = logging.getLogger()
    return sub_logger
def worker1(item):
    logger = get_logger()
    # DO WORK
def worker2(item):
    logger = get_logger()
    # DO WORK
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
result = pool.map_async(worker1, some_data)
result.get()
result = pool.map_async(worker2, some_data)
result.get()
# and so on and so forth
Because each process has its own memory space (and thus its own set of global variables), you can set the initial global logger to None and only configure the logger if it has not been previously configured.
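The configure-once pattern described above might look like this when filled in. This is only a sketch: the logger name "pool_worker", the StreamHandler and the message format are placeholders chosen for illustration, not part of the original answer:

```python
import logging
import multiprocessing

_sub_logger = None  # one copy of this global exists per process


def get_logger():
    """Configure the per-process logger on first use, then reuse it."""
    global _sub_logger
    if _sub_logger is None:
        logger = logging.getLogger("pool_worker")  # hypothetical logger name
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(processName)s %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.DEBUG)
        _sub_logger = logger
    return _sub_logger


def worker1(x):
    get_logger().debug("worker1 got %r", x)
    return x + 1


if __name__ == "__main__":
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map_async(worker1, range(4)).get())
```

However many tasks land on the same process, the handler is attached only once there, so uneven task distribution no longer matters.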

Python Process Pool non-daemonic?

Would it be possible to create a python Pool that is non-daemonic? I want a pool to be able to call a function that has another pool inside.
I want this because daemon processes cannot create child processes. Specifically, it will cause the error:
AssertionError: daemonic processes are not allowed to have children
For example, consider the scenario where function_a has a pool which runs function_b which has a pool which runs function_c. This function chain will fail, because function_b is being run in a daemon process, and daemon processes cannot create processes.
The multiprocessing.pool.Pool class creates the worker processes in its __init__ method, makes them daemonic and starts them, and it is not possible to re-set their daemon attribute to False before they are started (and afterwards it's not allowed anymore). But you can create your own sub-class of multiprocessing.pool.Pool (multiprocessing.Pool is just a wrapper function) and substitute your own multiprocessing.Process sub-class, which is always non-daemonic, to be used for the worker processes.
Here's a full example of how to do this. The important parts are the two classes NoDaemonProcess and MyPool at the top and to call pool.close() and pool.join() on your MyPool instance at the end.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import multiprocessing
# We must import this explicitly, it is not imported by the top-level
# multiprocessing module.
import multiprocessing.pool
import time
from random import randint
class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)
# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess
def sleepawhile(t):
    print("Sleeping %i seconds..." % t)
    time.sleep(t)
    return t
def work(num_procs):
    print("Creating %i (daemon) workers and jobs in child." % num_procs)
    pool = multiprocessing.Pool(num_procs)
    result = pool.map(sleepawhile,
                      [randint(1, 5) for x in range(num_procs)])
    # The following is not really needed, since the (daemon) workers of the
    # child's pool are killed when the child is terminated, but it's good
    # practice to cleanup after ourselves anyway.
    pool.close()
    pool.join()
    return result
def test():
    print("Creating 5 (non-daemon) workers and jobs in main process.")
    pool = MyPool(5)
    result = pool.map(work, [randint(1, 5) for x in range(5)])
    pool.close()
    pool.join()
    print(result)
if __name__ == '__main__':
    test()
I had the necessity to employ a non-daemonic pool in Python 3.7 and ended up adapting the code posted in the accepted answer. Below there's the snippet that creates the non-daemonic pool:
import multiprocessing.pool
class NoDaemonProcess(multiprocessing.Process):
    @property
    def daemon(self):
        return False
    @daemon.setter
    def daemon(self, value):
        pass
class NoDaemonContext(type(multiprocessing.get_context())):
    Process = NoDaemonProcess
# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class NestablePool(multiprocessing.pool.Pool):
    def __init__(self, *args, **kwargs):
        kwargs['context'] = NoDaemonContext()
        super(NestablePool, self).__init__(*args, **kwargs)
As the current implementation of multiprocessing has been extensively refactored to be based on contexts, we need to provide a NoDaemonContext class that has our NoDaemonProcess as attribute. NestablePool will then use that context instead of the default one.
That said, I should warn that there are at least two caveats to this approach:
It still depends on implementation details of the multiprocessing package, and could therefore break at any time.
There are valid reasons why multiprocessing made it so hard to use non-daemonic processes, many of which are explained here. The most compelling in my opinion is:
As for allowing children threads to spawn off children of its own using
subprocess runs the risk of creating a little army of zombie
'grandchildren' if either the parent or child threads terminate before
the subprocess completes and returns.
As of Python 3.8, concurrent.futures.ProcessPoolExecutor doesn't have this limitation. It can have a nested process pool with no problem at all:
from concurrent.futures import ProcessPoolExecutor as Pool
from itertools import repeat
from multiprocessing import current_process
import time
def pid():
    return current_process().pid
def _square(i): # Runs in inner_pool
    square = i ** 2
    time.sleep(i / 10)
    print(f'{pid()=} {i=} {square=}')
    return square
def _sum_squares(i, j): # Runs in outer_pool
    with Pool(max_workers=2) as inner_pool:
        squares = inner_pool.map(_square, (i, j))
    sum_squares = sum(squares)
    time.sleep(sum_squares ** .5)
    print(f'{pid()=}, {i=}, {j=} {sum_squares=}')
    return sum_squares
def main():
    with Pool(max_workers=3) as outer_pool:
        for sum_squares in outer_pool.map(_sum_squares, range(5), repeat(3)):
            print(f'{pid()=} {sum_squares=}')
if __name__ == "__main__":
    main()
The above demonstration code was tested with Python 3.8.
A limitation of ProcessPoolExecutor, however, is that it doesn't have maxtasksperchild. If you need this, consider the answer by Massimiliano instead.
Credit: answer by jfs
The multiprocessing module has a nice interface to use pools with processes or threads. Depending on your current use case, you might consider using multiprocessing.pool.ThreadPool for your outer Pool, which will result in threads (that allow to spawn processes from within) as opposed to processes.
It might be limited by the GIL, but in my particular case (I tested both), the startup time for the processes from the outer Pool as created here far outweighed the solution with ThreadPool.
It's really easy to swap Processes for Threads. Read more about how to use a ThreadPool solution here or here.
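A minimal sketch of the ThreadPool-as-outer-pool idea: the outer workers are threads, so each of them is free to start its own process pool. The function names are invented for this illustration, and the inner_pool_cls parameter is only an added convenience so the inner pool type can be swapped when experimenting:

```python
import multiprocessing
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


def sum_of_squares(n, inner_pool_cls=multiprocessing.Pool):
    """Runs in an outer thread, so it may create its own process pool."""
    with inner_pool_cls(2) as inner:
        return sum(inner.map(square, range(n)))


if __name__ == "__main__":
    # Outer pool uses threads; the inner pools use processes.
    with ThreadPool(3) as outer:
        print(outer.map(sum_of_squares, [2, 3, 4]))
```

The outer threads mostly wait on their inner pools, so the GIL is less of a concern here than it would be for CPU-bound work in the threads themselves.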
On some Python versions, replacing the standard Pool with a custom one can raise the error: AssertionError: group argument must be None for now.
Here I found a solution that can help:
class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    @property
    def daemon(self):
        return False
    @daemon.setter
    def daemon(self, val):
        pass
class NoDaemonProcessPool(multiprocessing.pool.Pool):
    def Process(self, *args, **kwds):
        proc = super(NoDaemonProcessPool, self).Process(*args, **kwds)
        proc.__class__ = NoDaemonProcess
        return proc
I have seen people dealing with this issue by using celery's fork of multiprocessing called billiard (multiprocessing pool extensions), which allows daemonic processes to spawn children. The workaround is to simply replace the multiprocessing module with:
import billiard as multiprocessing
The issue I encountered was in trying to import globals between modules, causing the ProcessPool() line to get evaluated multiple times.
globals.py
from processing import Manager, Lock
from pathos.multiprocessing import ProcessPool
from pathos.threading import ThreadPool
class SingletonMeta(type):
    def __new__(cls, name, bases, dict):
        dict['__deepcopy__'] = dict['__copy__'] = lambda self, *args: self
        return super(SingletonMeta, cls).__new__(cls, name, bases, dict)
    def __init__(cls, name, bases, dict):
        super(SingletonMeta, cls).__init__(name, bases, dict)
        cls.instance = None
    def __call__(cls,*args,**kw):
        if cls.instance is None:
            cls.instance = super(SingletonMeta, cls).__call__(*args, **kw)
        return cls.instance
    def __deepcopy__(self, item):
        return item.__class__.instance
class Globals(object):
    __metaclass__ = SingletonMeta
    """
    This class is a workaround to the bug: AssertionError: daemonic processes are not allowed to have children
    The root cause is that importing this file from different modules causes this file to be reevaluated each time,
    thus ProcessPool() gets reexecuted inside that child thread, thus causing the daemonic processes bug
    """
    def __init__(self):
        print "%s::__init__()" % (self.__class__.__name__)
        self.shared_manager = Manager()
        self.shared_process_pool = ProcessPool()
        self.shared_thread_pool = ThreadPool()
        self.shared_lock = Lock() # BUG: Windows: global name 'lock' is not defined | doesn't affect cygwin
Then import safely from elsewhere in your code
from globals import Globals
Globals().shared_manager
Globals().shared_process_pool
Globals().shared_thread_pool
Globals().shared_lock
I have written a more expanded wrapper class around pathos.multiprocessing here:
https://github.com/JamesMcGuigan/python2-timeseries-datapipeline/blob/master/src/util/MultiProcessing.py
As a side note, if your use case just requires an async multiprocess map as a performance optimization, then joblib will manage all your process pools behind the scenes and allow this very simple syntax:
squares = Parallel(-1)( delayed(lambda num: num**2)(x) for x in range(100) )
https://joblib.readthedocs.io/
This presents a workaround for when the error is seemingly a false positive. As also noted by James, this can happen due to an unintentional import from a daemonic process.
For example, if you have the following simple code, WORKER_POOL can inadvertently be imported from a worker, leading to the error.
import multiprocessing
WORKER_POOL = multiprocessing.Pool()
A simple but reliable approach for a workaround is:
import multiprocessing
import multiprocessing.pool
class MyClass:
    @property
    def worker_pool(self) -> multiprocessing.pool.Pool:
        # Ref: https://stackoverflow.com/a/63984747/
        try:
            return self._worker_pool # type: ignore
        except AttributeError:
            # pylint: disable=protected-access
            self.__class__._worker_pool = multiprocessing.Pool() # type: ignore
            return self.__class__._worker_pool # type: ignore
            # pylint: enable=protected-access
In the above workaround, MyClass.worker_pool can be used without the error. If you think this approach can be improved upon, let me know.
Since Python version 3.7 we can create non-daemonic ProcessPoolExecutor
Using if __name__ == "__main__": is necessary while using multiprocessing.
from concurrent.futures import ProcessPoolExecutor as Pool
num_pool = 10
def main_pool(num):
    print(num)
    strings_write = (f'{num}-{i}' for i in range(num))
    with Pool(num) as subp:
        subp.map(sub_pool,strings_write)
    return None
def sub_pool(x):
    print(f'{x}')
    return None
if __name__ == "__main__":
    with Pool(num_pool) as p:
        p.map(main_pool,list(range(1,num_pool+1)))
Here is how you can start a pool, even if you are in a daemonic process already. This was tested in python 3.8.5
First, define the Undaemonize context manager, which temporarily deletes the daemon state of the current process.
class Undaemonize(object):
    '''Context Manager to resolve AssertionError: daemonic processes are not allowed to have children
    Tested in python 3.8.5'''
    def __init__(self):
        self.p = multiprocessing.process.current_process()
        if 'daemon' in self.p._config:
            self.daemon_status_set = True
        else:
            self.daemon_status_set = False
        self.daemon_status_value = self.p._config.get('daemon')
    def __enter__(self):
        if self.daemon_status_set:
            del self.p._config['daemon']
    def __exit__(self, type, value, traceback):
        if self.daemon_status_set:
            self.p._config['daemon'] = self.daemon_status_value
Now you can start a pool as follows, even from within a daemon process:
with Undaemonize():
    pool = multiprocessing.Pool(1)
pool.map(... # you can do something with the pool outside of the context manager
While the other approaches here aim to create pool that is not daemonic in the first place, this approach allows you to start a pool even if you are in a daemonic process already.

Log output of multiprocessing.Process

Is there a way to log the stdout output from a given Process when using the multiprocessing.Process class in python?
The easiest way might be to just override sys.stdout. Slightly modifying an example from the multiprocessing manual:
from multiprocessing import Process
import os
import sys
def info(title):
    print title
    print 'module name:', __name__
    print 'parent process:', os.getppid()
    print 'process id:', os.getpid()
def f(name):
    sys.stdout = open(str(os.getpid()) + ".out", "w")
    info('function f')
    print 'hello', name
if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    q = Process(target=f, args=('fred',))
    q.start()
    p.join()
    q.join()
And running it:
$ ls
m.py
$ python m.py
$ ls
27493.out 27494.out m.py
$ cat 27493.out
function f
module name: __main__
parent process: 27492
process id: 27493
hello bob
$ cat 27494.out
function f
module name: __main__
parent process: 27492
process id: 27494
hello fred
There are only two things I would add to @Mark Rushakoff's answer. When debugging, I found it really useful to change the buffering parameter of my open() calls to 0.
sys.stdout = open(str(os.getpid()) + ".out", "a", buffering=0)
Otherwise the output shows up only intermittently when you tail -f the file, which is maddening. With buffering=0, tail -f works well.
And for completeness, do yourself a favor and redirect sys.stderr as well.
sys.stderr = open(str(os.getpid()) + "_error.out", "a", buffering=0)
Also, for convenience you might dump that into a separate process class if you wish,
class MyProc(Process):
    def run(self):
        # Define the logging in run(), MyProc's entry function when it is .start()-ed
        # p = MyProc()
        # p.start()
        self.initialize_logging()
        print 'Now output is captured.'
        # Now do stuff...
    def initialize_logging(self):
        sys.stdout = open(str(os.getpid()) + ".out", "a", buffering=0)
        sys.stderr = open(str(os.getpid()) + "_error.out", "a", buffering=0)
        print 'stdout initialized'
Here's a corresponding gist.
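The snippets above are Python 2. In Python 3, buffering=0 is only allowed for binary files and raises ValueError in text mode; line buffering (buffering=1) gives a similar tail -f experience. A small sketch under that assumption, with helper names (open_line_buffered, redirect_output, child) invented for this example:

```python
import os
import sys
import tempfile
from multiprocessing import Process


def open_line_buffered(path):
    """Open a text file in append mode that is flushed after every newline."""
    return open(path, "a", buffering=1)


def redirect_output(directory="."):
    """Send this process's stdout and stderr to per-PID files."""
    pid = os.getpid()
    sys.stdout = open_line_buffered(os.path.join(directory, f"{pid}.out"))
    sys.stderr = open_line_buffered(os.path.join(directory, f"{pid}_error.out"))


def child(name, directory):
    redirect_output(directory)
    print("hello", name)


if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    p = Process(target=child, args=("bob", out_dir))
    p.start()
    p.join()
    print(sorted(os.listdir(out_dir)))
```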
You can set sys.stdout = Logger() where Logger is a class whose write method (immediately, or accumulating until a \n is detected) calls logging.info (or any other way you want to log). An example of this in action.
I'm not sure what you mean by "a given" process (who's given it, what distinguishes it from all others...?), but if you mean you know what process you want to single out that way at the time you instantiate it, then you could wrap its target function (and that only) -- or the run method you're overriding in a Process subclass -- into a wrapper that performs this sys.stdout "redirection" -- and leave other processes alone.
Maybe if you nail down the specs a bit I can help in more detail...?
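To make the sys.stdout = Logger() suggestion concrete, here is a sketch of such a class that accumulates text until a newline and then hands each complete line to a logger. The names StreamToLogger, ListHandler and capture_prints are invented for this illustration, and contextlib.redirect_stdout is used in place of assigning sys.stdout directly so the redirection is undone automatically:

```python
import contextlib
import logging


class StreamToLogger:
    """File-like object that forwards complete lines to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level
        self._buffer = ""

    def write(self, text):
        self._buffer += text
        # Emit one log record per complete line; keep any partial line buffered.
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        if self._buffer:
            self.logger.log(self.level, self._buffer)
            self._buffer = ""


class ListHandler(logging.Handler):
    """Collects formatted messages in a list (for demonstration only)."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_prints(func):
    """Run func with stdout redirected into a logger; return the logged lines."""
    logger = logging.getLogger("stdout_capture")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = ListHandler()
    logger.addHandler(handler)
    stream = StreamToLogger(logger)
    try:
        with contextlib.redirect_stdout(stream):
            func()
        stream.flush()
    finally:
        logger.removeHandler(handler)
    return handler.messages


if __name__ == "__main__":
    print(capture_prints(lambda: print("hello from the child")))
```

In a child process, the same object could be installed at the top of the target function (or in run() of a Process subclass), leaving other processes untouched.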
Here is a simple and straightforward way of capturing stdout for multiprocessing.Process using io.TextIOWrapper:
import app
import io
import sys
from multiprocessing import Process
def run_app(some_param):
    out_file = open(sys.stdout.fileno(), 'wb', 0)
    sys.stdout = io.TextIOWrapper(out_file, write_through=True)
    app.run()
app_process = Process(target=run_app, args=('some_param',))
app_process.start()
# Use app_process.terminate() for Python < 3.7 (Process.kill() was added in 3.7).
app_process.kill()
The log_to_stderr() function is the simplest solution.
From PYMOTW:
multiprocessing has a convenient module-level function to enable logging called log_to_stderr(). It sets up a logger object using logging and adds a handler so that log messages are sent to the standard error channel. By default, the logging level is set to NOTSET so no messages are produced. Pass a different level to initialize the logger to the level of detail desired.
import logging
from multiprocessing import Process, log_to_stderr
print("Running main script...")
def my_process(my_var):
    print(f"Running my_process with {my_var}...")
# Initialize logging for multiprocessing.
log_to_stderr(logging.DEBUG)
# Start the process.
my_var = 100
process = Process(target=my_process, args=(my_var,))
process.start()
process.kill()
This code will output both print() statements to stderr.
