I am using batch processing and call the function below in parallel. I need to create a new log file for each process.
Below is sample code:
import logging

def processDocument(inputfilename):
    logfile = inputfilename + '.log'
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO)
    # performing some function
    logging.info("process completed for file")
    logging.shutdown()
It creates the log file. But when I call this function in a batch 20 times, only 16 log files are created.
This kind of issue can be caused by race conditions between threads.
If the documents are processed independently of each other, I would suggest using multiprocessing via the high-level class concurrent.futures.ProcessPoolExecutor.
If you want to stick to threads because the document processing is more I/O bound, there is concurrent.futures.ThreadPoolExecutor, which provides the same interface but with threads.
Last but not least, configure your logging properly, as in this toy example (which uses the standard threading library):
import logging
from sys import stdout
import time
import threading

def processDocument(inputfilename: str):
    logfile = inputfilename + '.log'
    this_thread = threading.current_thread().name
    # One logger per thread, each with its own handlers
    this_thread_logger = logging.getLogger(this_thread)
    # Without this, the logger inherits the root level (WARNING) and drops info() calls
    this_thread_logger.setLevel(logging.INFO)
    file_handler = logging.FileHandler(filename=logfile)
    out_handler = logging.StreamHandler(stdout)
    file_handler.setLevel(logging.INFO)
    out_handler.setLevel(logging.INFO)
    this_thread_logger.addHandler(file_handler)
    this_thread_logger.addHandler(out_handler)
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}...')
    time.sleep(1)  # processing
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}... Done')

def main():
    filenames = ['hello', 'hello2', 'hello3']
    threads = [threading.Thread(target=processDocument, args=(name,)) for name in filenames]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    logging.shutdown()

main()
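The ThreadPoolExecutor suggestion above could look like the following minimal sketch. It assumes one log file per input path; the names process_document and run_batch are made up for this illustration. Because pool threads are reused, the logger is keyed on the input path rather than the thread name, and the handler is detached once the document is done.

```python
import logging
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_document(input_path):
    """Process one document, logging to '<input_path>.log' (hypothetical helper)."""
    # Key the logger on the input path: pool threads are reused, so a
    # logger named after the thread would accumulate handlers.
    logger = logging.getLogger("doc:" + input_path)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger
    handler = logging.FileHandler(input_path + ".log")
    logger.addHandler(handler)
    try:
        logger.info("process completed for file %s", input_path)
    finally:
        # Detach and close so the file is flushed and the handle released.
        logger.removeHandler(handler)
        handler.close()
    return input_path + ".log"


def run_batch(input_paths, max_workers=4):
    """Run process_document over all paths and return the log file names."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_document, input_paths))


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    paths = [os.path.join(workdir, f"file{i}") for i in range(20)]
    log_files = run_batch(paths)
    print(len(log_files), "log files written to", workdir)
```

Unlike logging.basicConfig(), which does nothing once the root logger already has a handler, each call here attaches its own FileHandler, so every input gets its own file.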
I have been told that logging cannot be used with multiprocessing, and that you have to do the concurrency control yourself in case multiprocessing garbles the log.
But I ran a test, and there seems to be no problem using logging with multiprocessing:
import time
import logging
from multiprocessing import Process, current_process, pool

# setup log
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='/tmp/test.log',
                    filemode='w')

def func(the_time, logger):
    proc = current_process()
    while True:
        if time.time() >= the_time:
            logger.info('proc name %s id %s' % (proc.name, proc.pid))
            return

if __name__ == '__main__':
    the_time = time.time() + 5
    for x in xrange(1, 10):
        proc = Process(target=func, name=x, args=(the_time, logger))
        proc.start()
As you can see from the code, I deliberately let the subprocesses write to the log at the same moment (5 s after start) to increase the chance of a conflict. But there is no conflict at all.
So my question is: can we use logging with multiprocessing?
Why do so many posts say we cannot?
As Matino correctly explained: logging in a multiprocessing setup is not safe, because multiple processes (which know nothing about each other) write to the same file and can interfere with one another.
What happens is that every process holds an open file handle and does an "append write" to that file. The question is under what circumstances the append write is "atomic" (that is, cannot be interrupted by, for example, another process writing to the same file and intermingling its output). This problem applies to every programming language, since in the end they all make a syscall to the kernel. This answer explains under which circumstances a shared log file is OK.
It comes down to checking your pipe buffer size; on Linux it is defined in /usr/include/linux/limits.h and is 4096 bytes. For other OSes you can find a good list here.
That means: if your log line is shorter than 4096 bytes (on Linux), the append is safe, provided the disk is directly attached (i.e. no network in between). For more details, please check the first link in my answer. To test this, you can call logger.info('proc name %s id %s %s' % (proc.name, proc.pid, str(proc.name)*5000)) with different lengths. With 5000, for instance, I already got mixed-up log lines in /tmp/test.log.
In this question there are already quite a few solutions to this, so I won't add my own solution here.
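To check a test run like the one above for interleaved output, a small scanner can help. The sketch below assumes every well-formed line contains the text "proc name" exactly once, as in the test message; find_garbled_lines is a hypothetical helper, and this heuristic will not catch every kind of corruption.

```python
def find_garbled_lines(log_path, marker="proc name"):
    """Return the 1-based numbers of lines that do not contain `marker` exactly once."""
    garbled = []
    with open(log_path, encoding="utf-8", errors="replace") as log_file:
        for number, line in enumerate(log_file, start=1):
            # Zero markers: a fragment split off another record.
            # Two or more markers: several records merged into one line.
            if line.count(marker) != 1:
                garbled.append(number)
    return garbled


if __name__ == "__main__":
    print(find_garbled_lines("/tmp/test.log"))
```

An empty list means no line broke the one-marker rule, which is weaker than proving the writes were atomic.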
Update: Flask and multiprocessing
Web frameworks like Flask run in multiple workers when hosted behind uwsgi or nginx. In that case, multiple processes may write to one log file. Will that cause problems?
Error handling in Flask is done via stdout/stderr, which is then caught by the web server (uwsgi, nginx, etc.). The web server needs to take care that logs are written correctly (see e.g. this flask+nginx example), probably also adding process information so you can associate error lines with processes. From Flask's docs:
By default as of Flask 0.11, errors are logged to your webserver’s log automatically. Warnings however are not.
So you'd still have this issue of intermingled log files if you use warn and the message exceeds the pipe buffer size.
It is not safe to write to a single file from multiple processes.
According to https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
Although logging is thread-safe, and logging to a single file from
multiple threads in a single process is supported, logging to a single
file from multiple processes is not supported, because there is no
standard way to serialize access to a single file across multiple
processes in Python.
One possible solution would be to have each process write to its own file. You can achieve this by writing your own handler that appends the process PID to the file name:
import logging.handlers
import os

class PIDFileHandler(logging.handlers.WatchedFileHandler):

    def __init__(self, filename, mode='a', encoding=None, delay=0):
        filename = self._append_pid_to_filename(filename)
        super(PIDFileHandler, self).__init__(filename, mode, encoding, delay)

    def _append_pid_to_filename(self, filename):
        pid = os.getpid()
        path, extension = os.path.splitext(filename)
        return '{0}-{1}{2}'.format(path, pid, extension)
Then you just need to call addHandler:
logger = logging.getLogger('foo')
fh = PIDFileHandler('bar.log')
logger.addHandler(fh)
Use a queue to handle concurrency correctly while also recovering from errors, by feeding everything to the parent process via a pipe.
from logging.handlers import RotatingFileHandler
import multiprocessing, threading, logging, sys, traceback

class MultiProcessingLog(logging.Handler):
    def __init__(self, name, mode, maxsize, rotate):
        logging.Handler.__init__(self)

        self._handler = RotatingFileHandler(name, mode, maxsize, rotate)
        self.queue = multiprocessing.Queue(-1)

        # Background thread in the parent that drains the queue
        t = threading.Thread(target=self.receive)
        t.daemon = True
        t.start()

    def setFormatter(self, fmt):
        logging.Handler.setFormatter(self, fmt)
        self._handler.setFormatter(fmt)

    def receive(self):
        while True:
            try:
                record = self.queue.get()
                self._handler.emit(record)
            except (KeyboardInterrupt, SystemExit):
                raise
            except EOFError:
                break
            except:
                traceback.print_exc(file=sys.stderr)

    def send(self, s):
        self.queue.put_nowait(s)

    def _format_record(self, record):
        # ensure that exc_info and args
        # have been stringified. Removes any chance of
        # unpickleable things inside and possibly reduces
        # message size sent over the pipe
        if record.args:
            record.msg = record.msg % record.args
            record.args = None
        if record.exc_info:
            dummy = self.format(record)
            record.exc_info = None

        return record

    def emit(self, record):
        try:
            s = self._format_record(record)
            self.send(s)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)

    def close(self):
        self._handler.close()
        logging.Handler.close(self)
The handler does all the file writing from the parent process and uses just one thread to receive messages passed from child processes.
QueueHandler is native in Python 3.2+, and safely handles multiprocessing logging.
Python docs have two complete examples: Logging to a single file from multiple processes
For those using Python < 3.2, just copy QueueHandler into your own code from: https://gist.github.com/vsajip/591589 or alternatively import logutils.
Each process (including the parent process) puts its logging on the Queue, and then a listener thread or process (one example is provided for each) picks those up and writes them all to a file - no risk of corruption or garbling.
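The queue-based setup described above can be sketched with the standard library's QueueHandler and QueueListener. This is a minimal illustration rather than the documented example: the function names start_listener, configure_worker and worker, and the file name combined.log, are made up here.

```python
import logging
import logging.handlers
import multiprocessing


def start_listener(log_queue, log_path):
    """Start a background thread that writes queued records to one file."""
    file_handler = logging.FileHandler(log_path)
    file_handler.setFormatter(logging.Formatter(
        "%(asctime)s %(processName)s %(levelname)s %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()
    return listener, file_handler


def configure_worker(log_queue):
    """Send every record of the current process to the shared queue."""
    root = logging.getLogger()
    # Drop handlers inherited from the parent (fork) to avoid duplicates.
    for handler in root.handlers[:]:
        root.removeHandler(handler)
    root.addHandler(logging.handlers.QueueHandler(log_queue))
    root.setLevel(logging.INFO)


def worker(log_queue, number):
    configure_worker(log_queue)
    logging.info("hello from worker %d", number)


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()
    listener, file_handler = start_listener(log_queue, "combined.log")
    procs = [multiprocessing.Process(target=worker, args=(log_queue, n))
             for n in range(4)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    listener.stop()  # handles the records still queued, then stops the thread
    file_handler.close()
```

Only the listener thread ever touches the file, so the workers never compete for it.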
Note: this question is basically a duplicate of How should I log while using multiprocessing in Python? so I've copied my answer from that question as I'm pretty sure it's currently the best solution.
After porting my script from Mac to Windows (both Python 2.7.*), I find that logging does not work in the subprocess; only the parent's log messages are written to the file. Here is my example code:
# test log among multiple process env
import logging, os
from multiprocessing import Process

def child():
    logging.info('this is child')

if __name__ == '__main__':
    logging.basicConfig(filename=os.path.join(os.getcwd(), 'log.out'),
                        level=logging.DEBUG, filemode='w',
                        format='[%(filename)s:%(lineno)d]: %(asctime)s - %(levelname)s: %(message)s')

    p = Process(target=child, args=())
    p.start()
    p.join()

    logging.info('this is father')
The output only contains "this is father" in log.out; the child's log message is missing. How can I make logging work in the child process?
Each child is an independent process, and file handles in the parent may be closed in the child after a fork (assuming POSIX). In any case, logging to the same file from multiple processes is not supported. See the documentation for suggested approaches.
This question comes as a result of trying to combine logging with a multiprocessing pool. Under Linux there is nothing to do; the module containing my pool worker method inherits the main app logger properties. Under Windows I have to initialize the logger in each process, which I do by running pool.map_async with the initializer method. The problem is that the method runs so quickly that it gets executed more than once in some processes and not at all in others. I can get it to work properly if I add a short time delay to the method but this seems inelegant.
Is there a way to force the pool to distribute the processes evenly?
(some background: http://plumberjack.blogspot.de/2010/09/using-logging-with-multiprocessing.html)
The code is as follows, I can't really post the whole module ;-)
The call is this:
# Set up logger on Windows platforms
if os.name == 'nt':
    _ = pool.map_async(ml.worker_configurer,
                       [self._q for _ in range(mp.cpu_count())])
The function ml.worker_configurer is this:
def worker_configurer(queue, delay=True):
    h = QueueHandler(queue)
    root = logging.getLogger()
    root.addHandler(h)
    root.setLevel(logging.DEBUG)
    if delay:
        import time
        time.sleep(1.0)
    return
New worker configurer:
def worker_configurer2(queue):
    root = logging.getLogger()
    if not root.handlers:
        h = QueueHandler(queue)
        root.addHandler(h)
        root.setLevel(logging.DEBUG)
    return
You can do something like this:
import logging
import multiprocessing

sub_logger = None

def get_logger():
    global sub_logger
    if sub_logger is None:
        # configure the logger once per process
        sub_logger = logging.getLogger(__name__)
    return sub_logger

def worker1():
    logger = get_logger()
    # DO WORK

def worker2():
    logger = get_logger()
    # DO WORK

pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
result = pool.map_async(worker1, some_data)
result.get()
result = pool.map_async(worker2, some_data)
result.get()
# and so on and so forth
Because each process has its own memory space (and thus its own set of global variables), you can set the initial global logger to None and only configure the logger if it has not been previously configured.
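Another way to get exactly one configuration call per worker is the initializer argument of multiprocessing.Pool, which runs the given function once in each worker process as it starts. The sketch below is Python 3 and assumes a QueueHandler-based setup like the one in the question; the work function and the StreamHandler target are placeholders chosen for this illustration.

```python
import logging
import logging.handlers
import multiprocessing


def worker_configurer(queue):
    """Pool initializer: called once in every worker process at startup."""
    root = logging.getLogger()
    # Guard against duplicates, e.g. when a forked worker inherits handlers.
    if not any(isinstance(h, logging.handlers.QueueHandler)
               for h in root.handlers):
        root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.DEBUG)


def work(item):
    """Placeholder task that logs and returns a result."""
    logging.getLogger(__name__).info("processing %s", item)
    return item * 2


if __name__ == "__main__":
    queue = multiprocessing.Queue()
    listener = logging.handlers.QueueListener(queue, logging.StreamHandler())
    listener.start()
    pool = multiprocessing.Pool(processes=2,
                                initializer=worker_configurer,
                                initargs=(queue,))
    results = pool.map(work, [1, 2, 3])
    pool.close()
    pool.join()  # workers exit normally, flushing their queued records
    listener.stop()
    print(results)
```

With this approach no map_async call or sleep is needed to spread the configuration across the workers.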
I have the following code that causes Nuke to hang. Basically, what I'm trying to do is get a list of files and folders from the file system, and I am trying to speed it up through parallel processing. This works perfectly outside of Nuke, but as I said before, running this in Nuke will cause Nuke to hang. Is there a better way to do this that will cause Nuke to not hang? Preferably, I'd like to fix this through Python's standard library, or packages that are platform agnostic. But, if there's no way to do that, then I'm fine with that. Worst case, I will have to go back to not using parallel processing and find other optimizations.
Also, when I run this code in Nuke, I get the following error in the console:
Unknown units in -c from multiprocessing.forking import main; main()
The code:
#!/bin/env python
import multiprocessing
import os

CPU_COUNT = multiprocessing.cpu_count()

def _threaded_master(root):
    in_queue = multiprocessing.JoinableQueue()
    folder_queue = multiprocessing.JoinableQueue()
    file_queue = multiprocessing.JoinableQueue()
    in_queue.put(root)
    for _ in xrange(CPU_COUNT):
        multiprocessing.Process(target=_threaded_slave, args=(in_queue, folder_queue, file_queue)).start()
    in_queue.join()
    return {"folders": folder_queue, "files": file_queue}

def _threaded_slave(in_queue, folder_queue, file_queue):
    while True:
        path_item = in_queue.get()
        if os.path.isdir(path_item):
            for item in os.listdir(path_item):
                path = os.path.join(path_item, item)
                in_queue.put(path)
        in_queue.task_done()

if __name__ == "__main__":
    print _threaded_master(r"/path/to/root")
Here's my code to scan through a large tree of directories using several threads.
I'd originally written the code to use good old multiprocessing.Pool(), because it's very easy and gives you the results of the functions. Input and output queues are not needed. Another difference is it uses processes over threads, which have some tradeoffs.
The Pool has a big drawback: it assumes you have a static list of items to process.
So, I rewrote the code following your original example: input/output queue of directories to process, and an output queue. The caller has to explicitly grab items from the output queue.
For grins I ran a timing comparison with good old os.walk() and... at least on my machine the traditional solution was faster. The two solutions produced quite different numbers of files, which I can't explain.
Have fun!
source
#!/bin/env python
import multiprocessing, threading, time
import logging, os, Queue, sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)-4s %(levelname)s %(threadName)s %(message)s",
    datefmt="%H:%M:%S",
    stream=sys.stderr,
)

def scan_dir(topdir):
    try:
        for name in os.listdir(topdir):
            path = os.path.join(topdir, name)
            yield (path, os.path.isdir(path))
    except OSError:
        logging.error('uhoh: %s', topdir)

def scan_dir_queue(inqueue, outqueue):
    logging.info('start')
    while True:
        try:
            dir_item = inqueue.get_nowait()
        except Queue.Empty:
            break
        res = list( scan_dir(dir_item) )
        logging.debug('- %d paths', len(res))
        for path,isdir in res:
            outqueue.put( (path,isdir) )
            if isdir:
                inqueue.put(path)
    logging.info('done')

def thread_master(root):
    dir_queue = Queue.Queue() # pylint: disable=E1101
    dir_queue.put(root)
    result_queue = Queue.Queue()
    threads = [
        threading.Thread(
            target=scan_dir_queue, args=[dir_queue, result_queue]
        )
        for _ in range(multiprocessing.cpu_count())
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result_queue.queue

if __name__ == "__main__":
    topdir = os.path.expanduser('~')

    start = time.time()
    res = thread_master(topdir)
    print 'threaded:', time.time() - start
    print len(res), 'paths'

    def mywalk(topdir):
        for (dirpath, _dirnames, filenames) in os.walk(topdir):
            for name in filenames:
                yield os.path.join(dirpath, name)

    start = time.time()
    res = list(mywalk(topdir))
    print 'os.walk:', time.time() - start
    print len(res), 'paths'
output
11:56:35 INFO Thread-1 start
11:56:35 INFO Thread-2 start
11:56:35 INFO Thread-3 start
11:56:35 INFO Thread-4 start
11:56:35 INFO Thread-2 done
11:56:35 INFO Thread-3 done
11:56:35 INFO Thread-4 done
11:56:42 INFO Thread-1 done
threaded: 6.49218010902
299230 paths
os.walk: 1.6940600872
175741 paths
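Part of the gap between the two path counts above may come from what each version counts: the threaded scan puts every path on its result queue, directories included, while mywalk() yields only file names. The threaded scan also descends into symlinked directories (os.path.isdir() follows symlinks), which os.walk() does not do by default. A minimal Python 3 sketch of an os.walk() variant that counts directories as well; walk_all_paths is a hypothetical name, not part of the answer's code:

```python
import os


def walk_all_paths(topdir):
    """Yield every directory and file path below topdir (symlinks not followed)."""
    for dirpath, dirnames, filenames in os.walk(topdir):
        # Include directory entries so the count is comparable to a scan
        # that records both files and folders.
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


if __name__ == "__main__":
    top = os.path.expanduser("~")
    print(sum(1 for _ in walk_all_paths(top)), "paths")
```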
Here's a link to refer to: https://learn.foundry.com/nuke/developers/63/pythondevguide/threading.html
What's notable is the warning mentioned in there: nuke.executeInMainThread and nuke.executeInMainThreadWithResult should always be run from a child thread. If run from within the main thread, they freeze NUKE.
So, spawn a new child thread, and do your stuff there.
Is there a way to log the stdout output from a given Process when using the multiprocessing.Process class in python?
The easiest way might be to just override sys.stdout. Slightly modifying an example from the multiprocessing manual:
from multiprocessing import Process
import os
import sys

def info(title):
    print title
    print 'module name:', __name__
    print 'parent process:', os.getppid()
    print 'process id:', os.getpid()

def f(name):
    sys.stdout = open(str(os.getpid()) + ".out", "w")
    info('function f')
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    q = Process(target=f, args=('fred',))
    q.start()
    p.join()
    q.join()
And running it:
$ ls
m.py
$ python m.py
$ ls
27493.out 27494.out m.py
$ cat 27493.out
function f
module name: __main__
parent process: 27492
process id: 27493
hello bob
$ cat 27494.out
function f
module name: __main__
parent process: 27492
process id: 27494
hello fred
There are only two things I would add to @Mark Rushakoff's answer. When debugging, I found it really useful to change the buffering parameter of my open() calls to 0.
sys.stdout = open(str(os.getpid()) + ".out", "a", buffering=0)
Otherwise, madness, because when tail -f-ing the output file the results can be verrry intermittent. buffering=0 works great with tail -f.
And for completeness, do yourself a favor and redirect sys.stderr as well.
sys.stderr = open(str(os.getpid()) + "_error.out", "a", buffering=0)
Also, for convenience you might put that into a separate process class if you wish:
class MyProc(Process):
    def run(self):
        # Define the logging in run(), MyProc's entry function when it is .start()-ed
        #     p = MyProc()
        #     p.start()
        self.initialize_logging()

        print 'Now output is captured.'

        # Now do stuff...

    def initialize_logging(self):
        sys.stdout = open(str(os.getpid()) + ".out", "a", buffering=0)
        sys.stderr = open(str(os.getpid()) + "_error.out", "a", buffering=0)

        print 'stdout initialized'
Here's a corresponding gist.
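The snippets above are Python 2. In Python 3, open() rejects buffering=0 for text-mode files (it raises ValueError), so line buffering (buffering=1) is the closest equivalent. A minimal Python 3 sketch of the same idea; redirect_output and its directory parameter are hypothetical additions for this illustration:

```python
import os
import sys
from multiprocessing import Process


def redirect_output(directory="."):
    """Send this process's stdout and stderr to per-PID files, line-buffered."""
    pid = os.getpid()
    # buffering=1 flushes on every newline, which suits `tail -f`.
    sys.stdout = open(os.path.join(directory, f"{pid}.out"), "a", buffering=1)
    sys.stderr = open(os.path.join(directory, f"{pid}_error.out"), "a", buffering=1)


class MyProc(Process):
    def run(self):
        # run() executes in the child process, so only the child is redirected.
        redirect_output()
        print("Now output is captured.")
        # Now do stuff...


if __name__ == "__main__":
    p = MyProc()
    p.start()
    p.join()
```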
You can set sys.stdout = Logger() where Logger is a class whose write method (immediately, or accumulating until a \n is detected) calls logging.info (or any other way you want to log). An example of this in action.
I'm not sure what you mean by "a given" process (who's given it, what distinguishes it from all others...?), but if you mean you know what process you want to single out that way at the time you instantiate it, then you could wrap its target function (and that only) -- or the run method you're overriding in a Process subclass -- into a wrapper that performs this sys.stdout "redirection" -- and leave other processes alone.
Maybe if you nail down the specs a bit I can help in more detail...?
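The Logger-style stand-in for sys.stdout described above could look like this minimal sketch. It accumulates text until a newline arrives and then forwards each complete line to a logger; the class name StreamToLogger and the file name captured.log are assumptions made for the illustration.

```python
import logging
import sys


class StreamToLogger:
    """File-like object that forwards complete lines to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level
        self._buffer = ""

    def write(self, text):
        # Accumulate until a newline arrives, then log each complete line.
        self._buffer += text
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self.logger.log(self.level, line)
        return len(text)

    def flush(self):
        # Emit whatever is left without a trailing newline.
        if self._buffer:
            self.logger.log(self.level, self._buffer)
            self._buffer = ""


if __name__ == "__main__":
    # Log to a file: a handler that writes to sys.stdout would recurse.
    logging.basicConfig(filename="captured.log", level=logging.INFO)
    sys.stdout = StreamToLogger(logging.getLogger("stdout"))
    print("this line ends up in captured.log")
    sys.stdout = sys.__stdout__
```

To redirect only one process, the assignment to sys.stdout would go inside that process's target function or run() method.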
Here is a simple and straightforward way to capture stdout for a multiprocessing.Process using io.TextIOWrapper:
import app
import io
import sys
from multiprocessing import Process

def run_app(some_param):
    # Re-open the stdout file descriptor unbuffered, then wrap it as text
    out_file = open(sys.stdout.fileno(), 'wb', 0)
    sys.stdout = io.TextIOWrapper(out_file, write_through=True)
    app.run()

app_process = Process(target=run_app, args=('some_param',))
app_process.start()
# Use app_process.terminate() for Python < 3.7 (kill() was added in 3.7).
app_process.kill()
The log_to_stderr() function is the simplest solution.
From PYMOTW:
multiprocessing has a convenient module-level function to enable logging called log_to_stderr(). It sets up a logger object using logging and adds a handler so that log messages are sent to the standard error channel. By default, the logging level is set to NOTSET so no messages are produced. Pass a different level to initialize the logger to the level of detail desired.
import logging
from multiprocessing import Process, log_to_stderr

print("Running main script...")

def my_process(my_var):
    print(f"Running my_process with {my_var}...")

if __name__ == "__main__":
    # Initialize logging for multiprocessing.
    log_to_stderr(logging.DEBUG)

    # Start the process and wait for it to finish.
    my_var = 100
    process = Process(target=my_process, args=(my_var,))
    process.start()
    process.join()
This code sends multiprocessing's own log messages (for example, process start and exit) to stderr. The output of the print() calls still goes to stdout.