Python: How to use different logfiles for processes in multiprocessing.Pool?

I am using multiprocessing.Pool to run a number of independent processes in parallel, not much different from the basic example in the Python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
I would like each process to have a separate log file. I log various info from other modules in my codebase and some third-party packages (none of them is multiprocessing-aware). So, for example, I would like this:
import logging
from multiprocessing import Pool

def f(x):
    logging.info(f"x*x={x*x}")
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, range(10)))
to write to disk:
log1.log
log2.log
log3.log
log4.log
log5.log
How do I achieve it?

You'll need to use Pool's initializer parameter to set up and register the separate loggers immediately after the worker processes start up. Under the hood, the arguments to Pool(initializer) and Pool(initargs) end up being passed to Process(target) and Process(args) when the new worker processes are created...
Pool workers get named in the format {start_method}PoolWorker-{number}, e.g. SpawnPoolWorker-1 if you use spawn as the start method for new processes.
The file number for the logfiles can then be extracted from the assigned worker name with mp.current_process().name.split('-')[1].
import logging
import multiprocessing as mp

def f(x):
    logger.info(f"x*x={x*x}")
    return x*x

def _init_logging(level=logging.INFO, mode='a'):
    worker_no = mp.current_process().name.split('-')[1]
    filename = f"log{worker_no}.log"
    fh = logging.FileHandler(filename, mode=mode)
    fmt = logging.Formatter(
        '%(asctime)s %(processName)-10s %(name)s %(levelname)-8s --- %(message)s'
    )
    fh.setFormatter(fmt)
    logger = logging.getLogger()
    logger.addHandler(fh)
    logger.setLevel(level)
    # make the configured root logger available as a global inside this worker
    globals()['logger'] = logger

if __name__ == '__main__':
    with mp.Pool(5, initializer=_init_logging, initargs=(logging.DEBUG,)) as pool:
        print(pool.map(f, range(10)))
Note that, due to the nature of multiprocessing, there's no guarantee about the exact number of files you end up with in your small example.
Since multiprocessing.Pool (unlike concurrent.futures.ProcessPoolExecutor) starts its workers as soon as the instance is created, you're bound to get the specified Pool(processes) number of files, so five in your case. Actual process scheduling by your OS might still cut this number short, though.
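For comparison, concurrent.futures.ProcessPoolExecutor accepts the same initializer/initargs pair (since Python 3.7), so the same idea carries over. Here is a minimal sketch, not the code from the answer above; it keys the filenames on each worker's PID instead of parsing the process name, since executor workers are not named like ...PoolWorker-N:

import logging
import os
from concurrent.futures import ProcessPoolExecutor

def f(x):
    logging.getLogger().info(f"x*x={x*x}")
    return x*x

def _init_logging(level=logging.INFO):
    # One log file per worker process, named after the worker's PID.
    fh = logging.FileHandler(f"log_{os.getpid()}.log")
    fh.setFormatter(logging.Formatter(
        '%(asctime)s %(processName)-10s %(levelname)-8s --- %(message)s'
    ))
    root = logging.getLogger()
    root.addHandler(fh)
    root.setLevel(level)

if __name__ == '__main__':
    with ProcessPoolExecutor(5, initializer=_init_logging,
                             initargs=(logging.DEBUG,)) as ex:
        print(list(ex.map(f, range(10))))

Note that, as pointed out above, the executor spins its workers up on demand rather than at construction time, so a tiny workload may produce fewer than five files.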

Related

Python create multiple log files

I am using batch processing and call the function below in parallel. I need to create a new log file for each process. Below is sample code:
import logging

def processDocument(inputfilename):
    logfile = inputfilename + '.log'
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO)
    # performing some function
    logging.info("process completed for file")
    logging.shutdown()
It creates a log file, but when I call this function in a batch of 20 runs, only 16 log files get created.
This kind of issue can happen with race conditions between threads.
If the documents are processed independently of each other, I would suggest using multiprocessing via the high-level class concurrent.futures.ProcessPoolExecutor.
If you want to stick with threads because the document processing is more I/O-bound, there is concurrent.futures.ThreadPoolExecutor, which provides the same interface but with threads (see the sketch after the example below).
Last but not least, configure your logging properly, as in this toy example (which uses the standard threading library):
import logging
from sys import stdout
import time
import threading

def processDocument(inputfilename: str):
    logfile = inputfilename + '.log'
    this_thread = threading.current_thread().name
    this_thread_logger = logging.getLogger(this_thread)
    this_thread_logger.setLevel(logging.INFO)  # needed so INFO records are actually emitted
    file_handler = logging.FileHandler(filename=logfile)
    out_handler = logging.StreamHandler(stdout)
    file_handler.setLevel(logging.INFO)
    out_handler.setLevel(logging.INFO)
    this_thread_logger.addHandler(file_handler)
    this_thread_logger.addHandler(out_handler)
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}...')
    time.sleep(1)  # processing
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}... Done')

def main():
    filenames = ['hello', 'hello2', 'hello3']
    threads = [threading.Thread(target=processDocument, args=(name,)) for name in filenames]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    logging.shutdown()

main()
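As suggested above, the same per-file idea also works with concurrent.futures.ThreadPoolExecutor. A minimal sketch that reuses the processDocument function and the imports from the example above (the filenames are just placeholders):

from concurrent.futures import ThreadPoolExecutor

filenames = ['hello', 'hello2', 'hello3']
with ThreadPoolExecutor(max_workers=3) as executor:
    # Each call configures its own logger and file, exactly as in the threading example.
    list(executor.map(processDocument, filenames))
logging.shutdown()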

dask distributed 1.19 client logging?

The following code used to emit logs at some point, but no longer seems to do so. Shouldn't configuration of the logging mechanism in each worker permit logs to appear on stdout? If not, what am I overlooking?
import logging
from distributed import Client, LocalCluster
import numpy as np

def func(args):
    i, x = args
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(name)s %(levelname)s %(message)s')
    logger = logging.getLogger('func %i' % i)
    logger.info('computing svd')
    return np.linalg.svd(x)

if __name__ == '__main__':
    lc = LocalCluster(10)
    c = Client(lc)
    data = [np.random.rand(50, 50) for i in range(50)]
    fut = c.map(func, zip(range(len(data)), data))
    results = c.gather(fut)
    lc.close()
As per this question, I tried putting the logger configuration code into a separate function invoked via c.run(init_logging) right after instantiation of the client, but that didn't make any difference either.
I'm using distributed 1.19.3 with Python 3.6.3 on Linux. I have
logging:
  distributed: info
  distributed.client: info
in ~/.dask/config.yaml.
Evidently the submitted functions do not actually execute until one tries to retrieve the results from the generated futures, i.e., using the line
print(list(results))
before shutting down the local cluster. I'm not sure how to reconcile this with the section in the online docs that seems to state that direct submissions to a cluster are executed immediately.
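For what it's worth, a minimal sketch of the tail of the script with that change, reusing func and the imports from the question; the only difference is that the results are consumed before the cluster is closed:

if __name__ == '__main__':
    lc = LocalCluster(10)
    c = Client(lc)
    data = [np.random.rand(50, 50) for i in range(50)]
    fut = c.map(func, zip(range(len(data)), data))
    results = c.gather(fut)
    print(list(results))  # consume the results before shutting down the cluster
    lc.close()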

Python: why logging in multiprocessing not working

After I ported my script from Mac to Windows (both Python 2.7.*), I found that none of the logging in the subprocess works; only the parent's log messages are written to the file. Here is my example code:
# test log among multiple process env
import logging, os
from multiprocessing import Process

def child():
    logging.info('this is child')

if __name__ == '__main__':
    logging.basicConfig(filename=os.path.join(os.getcwd(), 'log.out'),
                        level=logging.DEBUG, filemode='w',
                        format='[%(filename)s:%(lineno)d]: %(asctime)s - %(levelname)s: %(message)s')
    p = Process(target=child, args=())
    p.start()
    p.join()
    logging.info('this is father')
The output only writes "this is father" into log.out, and the child's log message is missing. How do I make logging work in the child process?
Each child is an independent process, and file handles opened in the parent may be closed in the child after a fork (assuming POSIX). In any case, logging to the same file from multiple processes is not supported; see the logging documentation for suggested approaches.
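One of the documented approaches (from the logging cookbook) is to ship log records from the children to the parent over a queue and let the parent write the single file. A minimal sketch, assuming Python 3, where logging.handlers provides QueueHandler and QueueListener (on 2.7 you would need the cookbook's hand-rolled equivalents):

import logging
import logging.handlers
import multiprocessing
import os

def child(queue):
    # In the child, send all records to the queue instead of a file.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.DEBUG)
    logging.info('this is child')

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    file_handler = logging.FileHandler(os.path.join(os.getcwd(), 'log.out'), mode='w')
    file_handler.setFormatter(logging.Formatter(
        '[%(filename)s:%(lineno)d]: %(asctime)s - %(levelname)s: %(message)s'))

    # The listener runs in the parent and writes everything to one file.
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()

    p = multiprocessing.Process(target=child, args=(queue,))
    p.start()
    p.join()

    root = logging.getLogger()
    root.addHandler(file_handler)
    root.setLevel(logging.DEBUG)
    logging.info('this is father')  # the parent can still log directly

    listener.stop()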

Python multiprocessing pool force distribution of process

This question comes as a result of trying to combine logging with a multiprocessing pool. Under Linux there is nothing to do; the module containing my pool worker method inherits the main app's logger properties. Under Windows I have to initialize the logger in each process, which I do by running pool.map_async with the initializer method. The problem is that the method runs so quickly that it gets executed more than once in some processes and not at all in others. I can get it to work properly if I add a short time delay to the method, but this seems inelegant.
Is there a way to force the pool to distribute the processes evenly?
(some background: http://plumberjack.blogspot.de/2010/09/using-logging-with-multiprocessing.html)
The code is as follows; I can't really post the whole module ;-)
The call is this:
# Set up logger on Windows platforms
if os.name == 'nt':
    _ = pool.map_async(ml.worker_configurer,
                       [self._q for _ in range(mp.cpu_count())])
The function ml.worker_configurer is this:
def worker_configurer(queue, delay=True):
    h = QueueHandler(queue)
    root = logging.getLogger()
    root.addHandler(h)
    root.setLevel(logging.DEBUG)
    if delay:
        import time
        time.sleep(1.0)
    return
The new worker configurer is this:
def worker_configurer2(queue):
    root = logging.getLogger()
    if not root.handlers:
        h = QueueHandler(queue)
        root.addHandler(h)
        root.setLevel(logging.DEBUG)
    return
You can do something like this:
sub_logger = None

def get_logger():
    global sub_logger
    if sub_logger is None:
        pass  # configure logger here
    return sub_logger

def worker1(data_item):
    logger = get_logger()
    # DO WORK

def worker2(data_item):
    logger = get_logger()
    # DO WORK

pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
result = pool.map_async(worker1, some_data)
result.get()
result = pool.map_async(worker2, some_data)
result.get()
# and so on and so forth
Because each process has its own memory space (and thus its own set of global variables), you can set the initial global logger to None and only configure the logger if it has not been previously configured.
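A sketch of how the lazy get_logger pattern might be filled in with the QueueHandler configuration from worker_configurer2 above. This is not the original poster's module: the queue is assumed to be handed to each worker through the pool's initializer, the listener that drains the queue is omitted (as in the question), and Python 3 is assumed so QueueHandler comes from logging.handlers:

import logging
import multiprocessing
from logging.handlers import QueueHandler

sub_logger = None
log_queue = None  # set once per worker by the pool initializer

def init_worker(queue):
    global log_queue
    log_queue = queue

def get_logger():
    global sub_logger
    if sub_logger is None:
        root = logging.getLogger()
        if not root.handlers:          # configure only once per process
            root.addHandler(QueueHandler(log_queue))
            root.setLevel(logging.DEBUG)
        sub_logger = root
    return sub_logger

def worker1(item):
    get_logger().debug("worker1 processing %r", item)
    return item

if __name__ == '__main__':
    queue = multiprocessing.Manager().Queue()
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count(),
                                initializer=init_worker, initargs=(queue,))
    pool.map(worker1, range(8))
    pool.close()
    pool.join()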

python multiprocessing, manager initiates process spawn loop

I have a simple python multiprocessing script that sets up a pool of workers that attempt to append work output to a Manager list. The script has a three-level call chain: main calls f1, which spawns several worker processes that call another function, g1. When one attempts to debug the script (incidentally on Windows 7/64 bit/VS 2010/PyTools), it runs into a nested process-creation loop, spawning an endless number of processes. Can anyone determine why? I'm sure I am missing something very simple. Here's the problematic code:
import multiprocessing
import logging

manager = multiprocessing.Manager()
results = manager.list()

def g1(x):
    y = x*x
    print "processing: y = %s" % y
    results.append(y)

def f1():
    logger = multiprocessing.log_to_stderr()
    logger.setLevel(multiprocessing.SUBDEBUG)
    pool = multiprocessing.Pool(processes=4)
    for i in range(0, 15):
        pool.apply_async(g1, [i])
    pool.close()
    pool.join()

def main():
    f1()

if __name__ == "__main__":
    main()
PS: I tried adding multiprocessing.freeze_support() to main, to no avail.
Basically, what sr2222 mentions in his comment is correct. The multiprocessing manager docs say that the __main__ module must be importable by the children. Each Manager object "corresponds to a spawned child process", so each child is basically re-importing your module (you can see this by adding a print statement at module scope to my fixed version!), which leads to infinite recursion.
One solution would be to move your manager code into f1():
import multiprocessing
import logging

def g1(results, x):
    y = x*x
    print "processing: y = %s" % y
    results.append(y)

def f1():
    logger = multiprocessing.log_to_stderr()
    logger.setLevel(multiprocessing.SUBDEBUG)
    manager = multiprocessing.Manager()
    results = manager.list()
    pool = multiprocessing.Pool(processes=4)
    for i in range(0, 15):
        pool.apply_async(g1, [results, i])
    pool.close()
    pool.join()

def main():
    f1()

if __name__ == "__main__":
    main()
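To observe the re-import mentioned above, a print placed at module scope (right under the imports in the fixed version) fires once in the parent and once more in every child process that spawn creates:

print("module imported in process %s" % multiprocessing.current_process().name)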
