Does python logging support multiprocessing?

I have been told that logging cannot be used with multiprocessing, and that you have to do the concurrency control yourself in case multiprocessing garbles the log.
But I did some testing, and it seems there is no problem using logging in multiprocessing:
import time
import logging
from multiprocessing import Process, current_process, pool

# setup log
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='/tmp/test.log',
                    filemode='w')

def func(the_time, logger):
    proc = current_process()
    while True:
        if time.time() >= the_time:
            logger.info('proc name %s id %s' % (proc.name, proc.pid))
            return

if __name__ == '__main__':
    the_time = time.time() + 5
    for x in xrange(1, 10):
        proc = Process(target=func, name=x, args=(the_time, logger))
        proc.start()
As you can see from the code, I deliberately let the subprocesses write their log line at the same moment (5 seconds after start) to increase the chance of a conflict. But there is no conflict at all.
So my question is: can we use logging in multiprocessing? Why do so many posts say we cannot?

As Matino correctly explained: logging in a multiprocessing setup is not safe, as multiple processes (which know nothing about each other) are writing into the same file, potentially interleaving with each other.
Now what happens is that every process holds an open file handle and does an "append write" into that file. The question is under which circumstances the append write is "atomic" (that is, cannot be interrupted by, say, another process writing to the same file and intermingling its output). This problem applies to every programming language, as in the end they will make a syscall to the kernel. This answer explains under which circumstances a shared log file is OK.
It comes down to checking your pipe buffer size; on Linux it is defined in /usr/include/linux/limits.h and is 4096 bytes. For other operating systems you can find a good list here.
That means: if your log line is less than 4096 bytes (on Linux), then the append is safe, provided the disk is directly attached (i.e. no network in between). But for more details please check the first link in my answer. To test this you can do logger.info('proc name %s id %s %s' % (proc.name, proc.pid, str(proc.name)*5000)) with different lengths. With 5000, for instance, I already got mixed-up log lines in /tmp/test.log.
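To make that test concrete, here is a minimal, self-contained sketch of it (the file name, worker count, and 5000-character payload are arbitrary choices for illustration, not from the original post):

import logging
import time
from multiprocessing import Process, current_process

logging.basicConfig(filename='/tmp/interleave_test.log', filemode='w',
                    level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)

def write_long_line(start_at):
    proc = current_process()
    while time.time() < start_at:
        pass  # busy-wait so all workers emit at roughly the same moment
    # A payload well above the typical 4096-byte pipe buffer size
    logger.info('proc %s pid %s %s', proc.name, proc.pid, 'x' * 5000)

if __name__ == '__main__':
    start_at = time.time() + 2
    workers = [Process(target=write_long_line, args=(start_at,)) for _ in range(8)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Inspect /tmp/interleave_test.log: with payloads this large, lines from
    # different processes may appear intermingled.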
In this question there are already quite a few solutions to this, so I won't add my own solution here.
Update: Flask and multiprocessing
Web frameworks like Flask will be run in multiple workers if hosted by uwsgi or nginx. In that case, multiple processes may write into one log file. Will that cause problems?
Error handling in Flask is done via stdout/stderr, which is then caught by the web server (uwsgi, nginx, etc.), which needs to take care that logs are written in the correct fashion (see e.g. this flask+nginx example), probably also adding process information so you can associate error lines with processes. From Flask's docs:
By default as of Flask 0.11, errors are logged to your webserver’s log automatically. Warnings however are not.
So you would still have this issue of intermingled log files if you use warn and the message exceeds the pipe buffer size.
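For the "adding process information" part, here is a hedged sketch (the app, route, and log path are made up for illustration). It does not prevent the interleaving itself; it only makes it visible which worker wrote each line, by stamping the PID via the standard %(process)d format field:

import logging
import os

from flask import Flask

app = Flask(__name__)

# Placeholder path; each record carries the PID of the worker that emitted it.
handler = logging.FileHandler('/tmp/flask-warnings.log')
handler.setLevel(logging.WARNING)
handler.setFormatter(logging.Formatter(
    '%(asctime)s pid=%(process)d %(levelname)s %(message)s'))
app.logger.addHandler(handler)

@app.route('/')
def index():
    app.logger.warning('handled by worker %d', os.getpid())
    return 'ok'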

It is not safe to write to a single file from multiple processes.
According to https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
Although logging is thread-safe, and logging to a single file from
multiple threads in a single process is supported, logging to a single
file from multiple processes is not supported, because there is no
standard way to serialize access to a single file across multiple
processes in Python.
One possible solution is to have each process write to its own file. You can achieve this by writing your own handler that appends the process PID to the file name:
import logging.handlers
import os

class PIDFileHandler(logging.handlers.WatchedFileHandler):

    def __init__(self, filename, mode='a', encoding=None, delay=0):
        filename = self._append_pid_to_filename(filename)
        super(PIDFileHandler, self).__init__(filename, mode, encoding, delay)

    def _append_pid_to_filename(self, filename):
        pid = os.getpid()
        path, extension = os.path.splitext(filename)
        return '{0}-{1}{2}'.format(path, pid, extension)
Then you just need to call addHandler:
logger = logging.getLogger('foo')
fh = PIDFileHandler('bar.log')
logger.addHandler(fh)
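For illustration, a hedged sketch of using it from worker processes (the worker function is made up, and it assumes the PIDFileHandler class above is importable in the worker's module). Because each child constructs the handler itself, every PID ends up with its own bar-<pid>.log, so no two processes ever share a file:

import logging
from multiprocessing import Process

def worker(n):
    # Each child builds its own logger + PIDFileHandler, keyed to its own PID.
    logger = logging.getLogger('foo')
    logger.setLevel(logging.INFO)
    logger.addHandler(PIDFileHandler('bar.log'))
    logger.info('message %d from this worker', n)

if __name__ == '__main__':
    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()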

Use a queue for correct handling of concurrency, and at the same time recover from errors, by feeding everything to the parent process via a pipe.
from logging.handlers import RotatingFileHandler
import multiprocessing, threading, logging, sys, traceback

class MultiProcessingLog(logging.Handler):

    def __init__(self, name, mode, maxsize, rotate):
        logging.Handler.__init__(self)

        self._handler = RotatingFileHandler(name, mode, maxsize, rotate)
        self.queue = multiprocessing.Queue(-1)

        t = threading.Thread(target=self.receive)
        t.daemon = True
        t.start()

    def setFormatter(self, fmt):
        logging.Handler.setFormatter(self, fmt)
        self._handler.setFormatter(fmt)

    def receive(self):
        while True:
            try:
                record = self.queue.get()
                self._handler.emit(record)
            except (KeyboardInterrupt, SystemExit):
                raise
            except EOFError:
                break
            except:
                traceback.print_exc(file=sys.stderr)

    def send(self, s):
        self.queue.put_nowait(s)

    def _format_record(self, record):
        # ensure that exc_info and args
        # have been stringified. Removes any chance of
        # unpickleable things inside and possibly reduces
        # message size sent over the pipe
        if record.args:
            record.msg = record.msg % record.args
            record.args = None
        if record.exc_info:
            dummy = self.format(record)
            record.exc_info = None

        return record

    def emit(self, record):
        try:
            s = self._format_record(record)
            self.send(s)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)

    def close(self):
        self._handler.close()
        logging.Handler.close(self)
The handler does all the file writing from the parent process and uses just one thread to receive messages passed from child processes
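A hedged usage sketch (the file name, rotation settings, and worker function are arbitrary; it assumes fork-based process start, i.e. the Unix 'fork' method, so children inherit the handler): install the handler on the root logger in the parent before starting children; the children's emit() calls then only push records onto the queue, and the parent's receive thread does all the writing:

import logging
import time
from multiprocessing import Process

def worker(n):
    logging.getLogger(__name__).warning('hello from worker %d', n)

if __name__ == '__main__':
    mp_handler = MultiProcessingLog('combined.log', 'a', 1024 * 1024, 5)
    mp_handler.setFormatter(logging.Formatter(
        '%(asctime)s %(processName)s %(levelname)s %(message)s'))
    logging.getLogger().addHandler(mp_handler)

    procs = [Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    time.sleep(0.5)  # crude: give the receive thread a moment to drain the queue
    mp_handler.close()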

QueueHandler is native in Python 3.2+, and safely handles multiprocessing logging.
Python docs have two complete examples: Logging to a single file from multiple processes
For those using Python < 3.2, just copy QueueHandler into your own code from: https://gist.github.com/vsajip/591589 or alternatively import logutils.
Each process (including the parent process) puts its logging on the Queue, and then a listener thread or process (one example is provided for each) picks those up and writes them all to a file - no risk of corruption or garbling.
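A condensed sketch of that pattern (the cookbook examples linked above are more complete and robust; the file name and record contents here are arbitrary):

import logging
import logging.handlers
import multiprocessing

def worker(queue, n):
    # Each child logs only through a QueueHandler; it never touches the file.
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)
    root.info('record %d from %s', n, multiprocessing.current_process().name)

if __name__ == '__main__':
    queue = multiprocessing.Queue(-1)

    # The parent is the only process that writes to the file.
    file_handler = logging.FileHandler('mp.log')
    file_handler.setFormatter(logging.Formatter(
        '%(asctime)s %(processName)s %(levelname)s %(message)s'))
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()

    procs = [multiprocessing.Process(target=worker, args=(queue, i)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    listener.stop()  # flushes anything still queued before exiting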
Note: this question is basically a duplicate of How should I log while using multiprocessing in Python? so I've copied my answer from that question as I'm pretty sure it's currently the best solution.

Related

Python create multiple log files

I am using batch processing and call the function below in parallel. I need to create a new log file for each process.
Below is sample code:
import logging

def processDocument(inputfilename):
    logfile = inputfilename + '.log'
    logging.basicConfig(
        filename=logfile,
        level=logging.INFO)
    # performing some function
    logging.info("process completed for file")
    logging.shutdown()
It creates a log file, but when I call this function in a batch 20 times, only 16 log files get created.
This issue can happen due to race conditions between threads.
If the document processing is independent for each file, I would suggest using multiprocessing via the high-level class concurrent.futures.ProcessPoolExecutor (a sketch follows after the threading example below).
If you want to stick to threads because the document processing is more I/O-bound, there is concurrent.futures.ThreadPoolExecutor, which provides the same interface but with threads.
Last but not least, configure your logging properly, as in this toy example (which uses the standard threading library):
import logging
from sys import stdout
import time
import threading

def processDocument(inputfilename: str):
    logfile = inputfilename + '.log'
    this_thread = threading.current_thread().name
    this_thread_logger = logging.getLogger(this_thread)
    this_thread_logger.setLevel(logging.INFO)  # otherwise INFO records are dropped
    file_handler = logging.FileHandler(filename=logfile)
    out_handler = logging.StreamHandler(stdout)
    file_handler.setLevel(logging.INFO)
    out_handler.setLevel(logging.INFO)
    this_thread_logger.addHandler(file_handler)
    this_thread_logger.addHandler(out_handler)
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}...')
    time.sleep(1)  # processing
    this_thread_logger.info(f'Processing {inputfilename} from {this_thread}... Done')

def main():
    filenames = ['hello', 'hello2', 'hello3']
    threads = [threading.Thread(target=processDocument, args=(name,)) for name in filenames]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    logging.shutdown()

main()
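And here is the process-based variant mentioned above, as a minimal sketch (the file names are placeholders; the per-document log file mirrors the question's setup):

import logging
from concurrent.futures import ProcessPoolExecutor

def process_document(inputfilename: str) -> str:
    # Each task runs in a worker process and logs only to its own file.
    logger = logging.getLogger(inputfilename)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(inputfilename + '.log')
    logger.addHandler(handler)
    # ... actual document processing would go here ...
    logger.info('process completed for file %s', inputfilename)
    handler.close()
    return inputfilename

if __name__ == '__main__':
    names = ['doc1', 'doc2', 'doc3']  # placeholder input names
    with ProcessPoolExecutor() as pool:
        for done in pool.map(process_document, names):
            print(f'{done} finished')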

Python - improving logging to file and console using multiprocessing

I am trying to download a file over a CAN bus using python-can. It involves sending data very quickly (on the order of 2-3 messages per millisecond), and I am trying to log these messages to a file without impacting the sending speed. Doing the file I/O directly slows down the sending due to the logging overhead. I tried various methods to improve this (including using queues and reading the queue from another thread, but this was not much better, possibly due to the GIL). Most of these tests started with the Python logging module and various handlers (QueueHandler/QueueListener, MemoryHandler, etc.).
I've managed to make some significant improvements by moving the file I/O into a separate process. I initially ran into an issue with the overhead of sending data from one process to another, so I now buffer it. Now, instead of the 150% increase I saw with direct file I/O in the main process, I see roughly a 20% increase in time.
I thought that, since the logging is running in another process, I could also print() the data to the console (which I know is relatively expensive), but I see a huge increase in the file download time.
Why does the print() affect the main process even though it is running in a child process?
Code below:
file_logger_mp() is called from the main process and starts the child process that does the logging. The main process then uses the log_hdl function to add messages to a buffer. When the buffer reaches a certain size (100 entries), it is sent to the child process for logging to file or printing to the console.
Device: rpi4. And the main process uses asyncio, in case that affects it.
import atexit
import multiprocessing

def file_logger_mp(logger_name: str, log_file_pth: str):
    conn_rec, conn_send = multiprocessing.Pipe()
    log_hdl_c = MyLogger(conn_send)
    log_hdl = log_hdl_c.log_hdl  # This is used by main code to provide log messages to child process
    listener = MyProcess(conn_rec, log_file_pth)
    atexit.register(log_hdl_c.final_flush, listener)
    listener.start()  # Start the child process
    return log_hdl, listener

class MyLogger():
    def __init__(self, conn_send) -> None:
        self.buffer = []
        self.conn_send = conn_send

    def log_hdl(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) > 100:
            self.conn_send.send(self.buffer)
            self.buffer.clear()

    def final_flush(self, listener):
        self.conn_send.send(self.buffer)
        listener.terminate()

class MyProcess(multiprocessing.Process):
    def __init__(self, queue, f_hdl):
        multiprocessing.Process.__init__(self)
        self.exit = multiprocessing.Event()
        self.queue = queue
        self.f_hdl = f_hdl

    def run(self):
        f = open(self.f_hdl, "w+")

        while not self.exit.is_set():
            try:
                record = self.queue.recv()
                for msg in record:
                    output = str(msg)
                    f.write(output + '\n')
                    print(output)  # This `print()` causes large delays to main process?!
                record.clear()
            except Exception:
                import sys, traceback
                print('Whoops! Problem:', file=sys.stderr)
                traceback.print_exc(file=sys.stderr)

        for msg in record:  # Flush any pending records before finishing
            f.write(str(msg) + '\n')
        f.close()

    def terminate(self):
        self.exit.set()

Python multiprocessing module do not work

I am trying to write a spider with the multiprocessing module. Here is my Python code:
# -*- coding:utf-8 -*-
import multiprocessing
import requests

class SpiderWorker(object):

    def __init__(self, q):
        self._q = q

    def run(self):
        def _crawl_item(url):
            respon = requests.get("http://www.baidu.com")
            if respon.ok:
                print respon.url

        while True:
            rst = self._q.get()
            _crawl_item(rst)

def general_worker():
    q = multiprocessing.Queue()
    CPU_COUNT = multiprocessing.cpu_count()
    worker_processes = [
        multiprocessing.Process(target=SpiderWorker(q).run)
        for i in range(CPU_COUNT)
    ]
    map( lambda process: process.start(), worker_processes )
    return q, worker_processes
Maybe the way I use processes is wrong. Every time I run this code, my process tells me:
<Process(Process-1, stopped[SIGSEGV])>
I hope you can help.
The major problem here is that you don't have any information on why your processes fail. It could be gevent, but it could just as easily be something else. So learning the actual reason why your processes get terminated is the first step before doing anything else.
What you need is multiprocessing.log_to_stderr():
class SpiderWorker(object):
    # ...

    def run(self):
        logger = multiprocessing.log_to_stderr()
        logger.setLevel(multiprocessing.SUBDEBUG)

        try:
            # Here goes your original run() code
            ...
        except Exception:
            logger.exception('whoopsie')
What this code does:
Creates a special logger which will transmit its information to the main process and dump it to stderr (the console by default).
Configures this logger to report everything, including some internal multiprocessing module events (just in case; you probably don't need them).
Wraps your entire code in a catch-all statement, so whatever happens there cannot escape your notice.
Calls the .exception() method on the logger, which not only logs the message (it's meaningless anyway, as we don't know what actually happens) but, most importantly, logs the entire error traceback, which is what we actually need.

Python Watchdog issue - missing events

I am using Python Watchdog to monitor a folder on Ubuntu. It works fine with 1 or 2 files, but when I moved 50 files with the command mv *.xml dest_folder, it received only 2 events and processed only 2 files. Below is the code.
def on_moved(self, event):
    try:
        logger.debug("on_moved event :" + str(event))
        self._validate_xml(event.dest_path)
    except Exception as ex:
        logger.exception(ex)
If I comment out the _validate_xml call then I receive all 45 events.
Can anyone tell me what exactly is happening in Watchdog, and what the best solution for this is?
I haven't used Python Watchdog, but from a generic real-time-systems perspective:
Processing XML with _validate_xml can be slow and make you miss events.
An event is similar to an interrupt: handling should be as fast as possible.
The more you do while handling an event, the less "real-time" your system becomes. What you can do is offload the XML validity check to another process and exchange messages with a Queue (the message being event.dest_path, i.e. the paths you have seen moving). Your event handling will be as simple as putting messages on the queue, and the files can be processed in batches by the consumer of the queue.
In short:
instantiate a Queue
fork() a process
in the on_moved handler, put messages on the queue,
in the forked process, pop messages from the queue and call _validate_xml.
you may optionally leverage multiprocessing.Pool to validate XML files in parallel.
Good luck.
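A rough sketch of that shape (the handler and worker wiring are illustrative, not tested against a specific watchdog version; _validate_xml below is a stand-in for the questioner's slow validation function):

import multiprocessing
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def _validate_xml(path):
    pass  # stand-in for the questioner's slow validation

def validation_worker(queue):
    while True:
        dest_path = queue.get()
        if dest_path is None:  # sentinel to shut down
            break
        _validate_xml(dest_path)  # the slow part, now out of the event thread

class EnqueueOnMovedHandler(FileSystemEventHandler):
    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def on_moved(self, event):
        # Keep the handler as cheap as possible: just record the path.
        self.queue.put(event.dest_path)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=validation_worker, args=(queue,))
    worker.start()

    observer = Observer()
    observer.schedule(EnqueueOnMovedHandler(queue), path='.', recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()

    queue.put(None)  # tell the worker to finish
    worker.join()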
EDIT: tested out on my system; most of the comments above seem not to apply because watchdog's code seems to handle threading just fine.
#!/usr/bin/env python
import time
from watchdog.observers import Observer, api
from watchdog.events import LoggingEventHandler, FileSystemEventHandler, FileMovedEvent
import logging

def counter_gen():
    count = 0
    while True:
        count += 1
        yield count

class XmlValidatorHandler(FileSystemEventHandler):
    sleep_time = 0.1
    COUNTER = counter_gen()

    def on_moved(self, event):
        if isinstance(event, FileMovedEvent):
            print '%s - event %d; validate: %s' % (
                type(self).__name__, self.COUNTER.next(), event.dest_path)
            time.sleep(self.sleep_time)

class SlowXmlValidatorHandler(XmlValidatorHandler):
    sleep_time = 2
    COUNTER = counter_gen()

def get_observer(handler):
    observer = Observer(timeout=0.5)
    observer.event_queue.maxsize = 10
    observer.schedule(handler, path='.', recursive=True)
    return observer

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    event_handler = LoggingEventHandler()
    observer1 = get_observer(XmlValidatorHandler())
    observer2 = get_observer(SlowXmlValidatorHandler())
    observer1.start()
    observer2.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer1.stop()
        observer2.stop()
    observer1.join()
    observer2.join()
I wasn't able to reproduce your issue. Some pointers:
Check the queue maxsize. If you already have items in there and they don't get handled in a timely fashion, then my guess is that the timeout kicks in and the event is lost. You may want to resize the queue in that case.
Check the timeout. If it is configured, you may want to tune that parameter.
Maybe a more complete snippet would help us help you.

Daemonizing python's BaseHTTPServer

I am working on a daemon in which I need to embed an HTTP server. I am attempting to do it with BaseHTTPServer; when I run it in the foreground it works fine, but when I try to fork the daemon into the background, it stops working. My main application continues to work, but BaseHTTPServer does not.
I believe this has something to do with the fact that BaseHTTPServer sends log data to STDOUT and STDERR. I am redirecting those to files. Here is the code snippet:
# Start the HTTP Server
server = HTTPServer((config['HTTPServer']['listen'],config['HTTPServer']['port']),HTTPHandler)

# Fork our process to detach if not told to stay in foreground
if options.foreground is False:
    try:
        pid = os.fork()
        if pid > 0:
            logging.info('Parent process ending.')
            sys.exit(0)
    except OSError, e:
        sys.stderr.write("Could not fork: %d (%s)\n" % (e.errno, e.strerror))
        sys.exit(1)

    # Second fork to put into daemon mode
    try:
        pid = os.fork()
        if pid > 0:
            # exit from second parent, print eventual PID before
            print 'Daemon has started - PID # %d.' % pid
            logging.info('Child forked as PID # %d' % pid)
            sys.exit(0)
    except OSError, e:
        sys.stderr.write("Could not fork: %d (%s)\n" % (e.errno, e.strerror))
        sys.exit(1)

    logging.debug('After child fork')

    # Detach from parent environment
    os.chdir('/')
    os.setsid()
    os.umask(0)

    # Close stdin
    sys.stdin.close()

    # Redirect stdout, stderr
    sys.stdout = open('http_access.log', 'w')
    sys.stderr = open('http_errors.log', 'w')

# Main Thread Object for Stats
threads = []

logging.debug('Kicking off threads')

while ...
    lots of code here
    ...

server.serve_forever()
Am I doing something wrong here or is BaseHTTPServer somehow prevented from becoming daemonized?
Edit: Updated the code to show the additional, previously missing code flow, and that logging.debug in my forked, background daemon shows I am hitting code after the fork.
After a bit of googling I finally stumbled over this BaseHTTPServer documentation and after that I ended up with:
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from SocketServer import ThreadingMixIn

class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """Handle requests in a separate thread."""

server = ThreadedHTTPServer((config['HTTPServer']['listen'],config['HTTPServer']['port']), HTTPHandler)
server.serve_forever()
This for the most part comes after my fork, and it ended up resolving my problem.
Here's how to do this with the python-daemon library:
from BaseHTTPServer import (HTTPServer, BaseHTTPRequestHandler)
import contextlib

import daemon

from my_app_config import config

# Make the HTTP Server instance.
server = HTTPServer(
    (config['HTTPServer']['listen'], config['HTTPServer']['port']),
    BaseHTTPRequestHandler)

# Make the context manager for becoming a daemon process.
daemon_context = daemon.DaemonContext()
daemon_context.files_preserve = [server.fileno()]

# Become a daemon process.
with daemon_context:
    server.serve_forever()
As usual for a daemon, you need to decide how you will interact with the program after it becomes a daemon. For example, you might register a systemd service, or write a PID file, etc. That's all outside the scope of the question though.
In particular, it's outside the scope of the question to ask: once it's become a daemon process (necessarily detached from any controlling terminal), how do I stop the daemon process? That's up to you to decide, as part of defining the program's behaviour.
You start by instantiating an HTTPServer, but you don't actually tell it to start serving in any of the supplied code. In your child process, try calling server.serve_forever().
See this for reference
A simple solution that worked for me was to override the BaseHTTPRequestHandler method log_message(), so we prevent any kind of writing in stdout and avoid problems when demonizing.
class CustomRequestHandler(BaseHTTPServer.BaseHTTPRequestHandler):

    def log_message(self, format, *args):
        pass

    # ... rest of custom class code ...
Just use daemontools or some other similar script instead of rolling your own daemonizing process. It is much better to keep this out of your script.
Also, your best option: don't use BaseHTTPServer. It is really bad. There are many good HTTP servers for Python, e.g. CherryPy or Paste. Both include ready-to-use daemonizing scripts.
Since this has solicited answers since I originally posted, I thought that I'd share a little info.
The issue with the output has to do with the fact that the default handler for the logging module uses the StreamHandler. The best way to handle this is to create your own handlers. In the case where you want to use the default logging module, you can do something like this:
# Get the default logger
default_logger = logging.getLogger('')

# Add the handler
default_logger.addHandler(myotherhandler)

# Remove the default stream handler (iterate over a copy, since we are
# removing handlers from the list while looping)
for handler in list(default_logger.handlers):
    if isinstance(handler, logging.StreamHandler):
        default_logger.removeHandler(handler)
Also at this point I have moved to using the very nice Tornado project for my embedded http servers.
