The following code used to emit logs at some point, but no longer seems to do so. Shouldn't configuration of the logging mechanism in each worker permit logs to appear on stdout? If not, what am I overlooking?
import logging
from distributed import Client, LocalCluster
import numpy as np

def func(args):
    i, x = args
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(name)s %(levelname)s %(message)s')
    logger = logging.getLogger('func %i' % i)
    logger.info('computing svd')
    return np.linalg.svd(x)

if __name__ == '__main__':
    lc = LocalCluster(10)
    c = Client(lc)
    data = [np.random.rand(50, 50) for i in range(50)]
    fut = c.map(func, zip(range(len(data)), data))
    results = c.gather(fut)
    lc.close()
As per this question, I tried putting the logger configuration code into a separate function invoked via c.run(init_logging) right after instantiation of the client, but that didn't make any difference either.
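For reference, a minimal sketch of that attempt (the body of init_logging here is a reconstruction, not the exact code I ran):

import logging

def init_logging():
    # runs once in every worker process via client.run()
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(name)s %(levelname)s %(message)s')

# right after creating the client:
# c = Client(lc)
# c.run(init_logging)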
I'm using distributed 1.19.3 with Python 3.6.3 on Linux. I have
logging:
  distributed: info
  distributed.client: info

in ~/.dask/config.yaml.
Evidently the submitted functions do not actually execute until one tries to retrieve the results from the generated futures, i.e., using the line
print(list(results))
before shutting down the local cluster. I'm not sure how to reconcile this with the section in the online docs that seems to state that direct submissions to a cluster are executed immediately.
Related
I am using multiprocessing.Pool to run a number of independent processes in parallel. It's not much different from the basic example in the Python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
I would like each process to have a separate log file. I log various info from other modules in my codebase and some third-party packages (none of them is multiprocessing aware). So, for example, I would like this:
import logging
from multiprocessing import Pool

def f(x):
    logging.info(f"x*x={x*x}")
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, range(10)))
to write these files to disk:
log1.log
log2.log
log3.log
log4.log
log5.log
How do I achieve it?
You'll need to use Pool's initializer to set up and register the separate loggers immediately after the worker processes start up. Under the hood, Pool hands initializer and initargs to every newly created worker process, which calls initializer(*initargs) once at start-up...
Pool workers get named in the format {start_method}PoolWorker-{number}, e.g. SpawnPoolWorker-1 if you use spawn as the start method for new processes.
The file number for the logfiles can then be extracted from the assigned worker name with mp.current_process().name.split('-')[1].
import logging
import multiprocessing as mp

def f(x):
    logger.info(f"x*x={x*x}")
    return x*x

def _init_logging(level=logging.INFO, mode='a'):
    worker_no = mp.current_process().name.split('-')[1]
    filename = f"log{worker_no}.log"
    fh = logging.FileHandler(filename, mode=mode)
    fmt = logging.Formatter(
        '%(asctime)s %(processName)-10s %(name)s %(levelname)-8s --- %(message)s'
    )
    fh.setFormatter(fmt)
    logger = logging.getLogger()
    logger.addHandler(fh)
    logger.setLevel(level)
    globals()['logger'] = logger

if __name__ == '__main__':
    with mp.Pool(5, initializer=_init_logging, initargs=(logging.DEBUG,)) as pool:
        print(pool.map(f, range(10)))
Note that, due to the nature of multiprocessing, there's no guarantee about the exact number of files you end up with in your small example.
Since multiprocessing.Pool (contrary to concurrent.futures.ProcessPoolExecutor) starts its workers as soon as you create the instance, you're bound to get the specified Pool(processes) number of files, so in your case 5. Actual thread/process scheduling by your OS might cut this number short, though.
Problem
I pass a logging object (logger), which is supposed to append lines to test.log, to a function background_task() that is run by the rq utility (a task-queue manager). The logger has a FileHandler assigned to it so it can log to test.log. Before background_task() is run, you can see the file handler present in logger.handlers, but when the logger is passed to background_task() and background_task() is executed by rq worker, logger.handlers is empty.
But if I ditch rq (and Redis) and just run background_task right away, the content of logger.handlers is preserved. So, it has something to do with rq (and, probably, task queuing in general, it's a new topic for me).
Steps to reproduce
Run add_job.py: python3 add_job.py. You'll see the output of print(logger.handlers) called from within add_job(): there will be a handlers list containing FileHandler added in get_job_logger().
Run command rq worker to start executing the queued task. You'll see the output of print(logger.handlers) once again but this time called from within background_task() and the list will be empty! Handlers of the logging (logger) object somehow get lost when the function that accepts a logger as an argument is run by rq (rq worker). What gives?
Here's what it looks like in the terminal:
$ python3 add_job.py
[<FileHandler /home/user/project/test.log (INFO)>]
$ rq worker
17:44:45 Worker rq:worker:2bbad3623e95438f81396c662cb01284: started, version 1.10.1
17:44:45 Subscribing to channel rq:pubsub:2bbad3623e95438f81396c662cb01284
17:44:45 *** Listening on default...
17:44:45 default: tasks.background_task(<RootLogger root (INFO)>) (5a5301be-efc3-49a7-ab0c-f7cf0a4bd3e5)
[]
Source code
add_job.py
import logging
from logging import FileHandler
from redis import Redis
from rq import Queue
from tasks import background_task

def add_job():
    r = Redis()
    qu = Queue(connection=r)
    logger = get_job_logger()
    print(logger.handlers)
    job = qu.enqueue(background_task, logger)

def get_job_logger():
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger_file_handler = FileHandler('test.log')
    logger_file_handler.setLevel(logging.INFO)
    logger.addHandler(logger_file_handler)
    return logger

if __name__ == '__main__':
    add_job()
tasks.py
def background_task(logger):
    print(logger.handlers)
Answered here.
The FileHandler does not get carried over into the worker. You create the FileHandler in the process that enqueues the job, while rq worker executes the task in a separate process; memory is not shared like that.
Hm, I see... Thanks!
I assumed the FileHandler was being serialized or whatnot when written to Redis as a part of the whole logger object and then reinitialized when popping out of the queue.
Anyway, I'll try passing a file path to the function and initializing the logger from within it. That way the FileHandler object stays in one process.
EDIT: yeah, it works
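For anyone curious, a minimal sketch of that workaround (the logger name here is illustrative; add_job.py then enqueues the path instead of the logger object):

# tasks.py
import logging
from logging import FileHandler

def background_task(log_path):
    # build the logger inside the worker process instead of passing it over the queue
    logger = logging.getLogger('background_task')
    logger.setLevel(logging.INFO)
    logger.addHandler(FileHandler(log_path))
    logger.info('running inside the rq worker')
    print(logger.handlers)

# in add_job.py:
# job = qu.enqueue(background_task, 'test.log')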
I have some simple code for testing browser latency which opens multiple instances of Selenium:
with Pool(processes=args.number_of_browsers) as pool:
    for i in range(args.number_of_browsers):
        logging.info("Starting job on browser #" + str(i))
        pool.apply_async(run, args=(args.refresh_rate, args.jitter, args.duration, args.url, str(i)))
For the purposes of the question, the run function could be as simple as:
def run():
    logging.debug("ANYTHING")
I haven't been able to figure out how to get console output from the worker processes started by multiprocessing.Pool.
Here is a basic working logging example in Python:
import logging

logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s',
                    datefmt='%Y%m%d%H%M%S%p', level=logging.DEBUG)
NODE_NAME = 'Test'
logger = logging.getLogger(NODE_NAME)
logger.info('hello')
Correct logging in a multiprocessing setup needs more configuration than this, though.
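As a rough sketch of what that extra configuration could look like: one way to get console output from Pool workers is to configure logging in an initializer so that each worker process sets up its own handler. The names and format string below are illustrative, not part of the original answer.

import logging
from multiprocessing import Pool

def _init_worker_logging():
    # each worker process configures its own root logger once at start-up
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(processName)s %(levelname)s %(message)s')

def run(browser_id):
    logging.debug("ANYTHING from browser #%s", browser_id)

if __name__ == '__main__':
    with Pool(processes=3, initializer=_init_worker_logging) as pool:
        pool.map(run, range(3))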
I have been told that logging cannot be used in multiprocessing: you have to handle the concurrency control yourself in case multiprocessing garbles the log.
But I did some tests, and it seems there is no problem using logging in multiprocessing:
import time
import logging
from multiprocessing import Process, current_process, pool

# set up logging to a shared file
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S',
                    filename='/tmp/test.log',
                    filemode='w')

def func(the_time, logger):
    proc = current_process()
    while True:
        if time.time() >= the_time:
            logger.info('proc name %s id %s' % (proc.name, proc.pid))
            return

if __name__ == '__main__':
    the_time = time.time() + 5
    for x in range(1, 10):
        proc = Process(target=func, name=x, args=(the_time, logger))
        proc.start()
As you can see from the code, I deliberately let the subprocesses write logs at the same moment (5 seconds after start) to increase the chance of a conflict, but there were no conflicts at all.
So my question is: can we use logging in multiprocessing?
Why do so many posts say we cannot?
As Matino correctly explained: logging in a multiprocessing setup is not safe, as multiple processes (which do not know anything about each other) are writing into the same file, potentially interfering with each other.
Now what happens is that every process holds an open file handle and does an "append write" into that file. The question is under what circumstances the append write is "atomic" (that is, cannot be interrupted by, e.g., another process writing to the same file and intermingling its output). This problem applies to every programming language, as in the end it comes down to a syscall to the kernel. This answer explains under which circumstances a shared log file is OK.
It comes down to checking your pipe buffer size; on Linux that is defined in /usr/include/linux/limits.h and is 4096 bytes. For other OSes you can find a good list here.
That means: if your log line is less than 4096 bytes (on Linux), then the append is safe, provided the disk is directly attached (i.e. no network in between). For more details please check the first link in my answer. To test this you can do logger.info('proc name %s id %s %s' % (proc.name, proc.pid, str(proc.name)*5000)) with different lengths. With 5000, for instance, I already got mixed-up log lines in /tmp/test.log.
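A minimal sketch of that test, adapted from the questioner's code above (the only change is the padded message; run it on Linux and inspect /tmp/test.log for interleaved lines):

import logging
import time
from multiprocessing import Process, current_process

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(message)s',
                    filename='/tmp/test.log',
                    filemode='w')
logger = logging.getLogger(__name__)

def func(the_time, logger):
    proc = current_process()
    while True:
        if time.time() >= the_time:
            # pad the line well past the 4096-byte pipe buffer size
            logger.info('proc name %s id %s %s'
                        % (proc.name, proc.pid, str(proc.name) * 5000))
            return

if __name__ == '__main__':
    the_time = time.time() + 5
    for x in range(1, 10):
        Process(target=func, name=str(x), args=(the_time, logger)).start()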
In this question there are already quite a few solutions to this, so I won't add my own solution here.
Update: Flask and multiprocessing
Web frameworks like Flask will be run with multiple workers if hosted by uwsgi or nginx. In that case, multiple processes may write into one log file. Will that cause problems?
Error handling in Flask is done via stdout/stderr, which is then caught by the webserver (uwsgi, nginx, etc.), which needs to take care that logs are written in the correct fashion (see e.g. this flask+nginx example), probably also adding process information so you can associate error lines with processes. From Flask's docs:
By default as of Flask 0.11, errors are logged to your webserver’s log automatically. Warnings however are not.
So you'd still have this issue of intermingled log files if you use warn and the message exceeds the pipe buffer size.
It is not safe to write to a single file from multiple processes.
According to https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
Although logging is thread-safe, and logging to a single file from multiple threads in a single process is supported, logging to a single file from multiple processes is not supported, because there is no standard way to serialize access to a single file across multiple processes in Python.
One possible solution would be to have each process write to its own file. You can achieve this by writing your own handler that appends the process pid to the end of the file name:
import logging.handlers
import os

class PIDFileHandler(logging.handlers.WatchedFileHandler):

    def __init__(self, filename, mode='a', encoding=None, delay=0):
        filename = self._append_pid_to_filename(filename)
        super(PIDFileHandler, self).__init__(filename, mode, encoding, delay)

    def _append_pid_to_filename(self, filename):
        pid = os.getpid()
        path, extension = os.path.splitext(filename)
        return '{0}-{1}{2}'.format(path, pid, extension)
Then you just need to call addHandler:
logger = logging.getLogger('foo')
fh = PIDFileHandler('bar.log')
logger.addHandler(fh)
Use a queue to handle concurrency correctly and recover from errors, by feeding everything to the parent process via a pipe.
from logging.handlers import RotatingFileHandler
import multiprocessing, threading, logging, sys, traceback

class MultiProcessingLog(logging.Handler):

    def __init__(self, name, mode, maxsize, rotate):
        logging.Handler.__init__(self)

        self._handler = RotatingFileHandler(name, mode, maxsize, rotate)
        self.queue = multiprocessing.Queue(-1)

        t = threading.Thread(target=self.receive)
        t.daemon = True
        t.start()

    def setFormatter(self, fmt):
        logging.Handler.setFormatter(self, fmt)
        self._handler.setFormatter(fmt)

    def receive(self):
        while True:
            try:
                record = self.queue.get()
                self._handler.emit(record)
            except (KeyboardInterrupt, SystemExit):
                raise
            except EOFError:
                break
            except:
                traceback.print_exc(file=sys.stderr)

    def send(self, s):
        self.queue.put_nowait(s)

    def _format_record(self, record):
        # ensure that exc_info and args
        # have been stringified. Removes any chance of
        # unpickleable things inside and possibly reduces
        # message size sent over the pipe
        if record.args:
            record.msg = record.msg % record.args
            record.args = None
        if record.exc_info:
            dummy = self.format(record)
            record.exc_info = None

        return record

    def emit(self, record):
        try:
            s = self._format_record(record)
            self.send(s)
        except (KeyboardInterrupt, SystemExit):
            raise
        except:
            self.handleError(record)

    def close(self):
        self._handler.close()
        logging.Handler.close(self)
The handler does all of the file writing from the parent process and uses just one thread to receive messages passed from child processes.
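A hypothetical usage of the handler above (the file name and rotation settings are illustrative):

import logging

# attach the queue-backed handler to the root logger in the parent process;
# on fork-based platforms the child processes inherit it, and their records
# are funnelled back through the queue to the single writer thread
mp_handler = MultiProcessingLog('mptest.log', mode='a', maxsize=10 * 1024 * 1024, rotate=5)
mp_handler.setFormatter(
    logging.Formatter('%(asctime)s %(processName)s %(levelname)s %(message)s'))
logging.getLogger().addHandler(mp_handler)
logging.getLogger().setLevel(logging.INFO)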
QueueHandler is native in Python 3.2+, and safely handles multiprocessing logging.
Python docs have two complete examples: Logging to a single file from multiple processes
For those using Python < 3.2, just copy QueueHandler into your own code from: https://gist.github.com/vsajip/591589 or alternatively import logutils.
Each process (including the parent process) puts its logging on the Queue, and then a listener thread or process (one example is provided for each) picks those up and writes them all to a file - no risk of corruption or garbling.
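A minimal sketch of that pattern with a listener thread in the parent process (the file name and format string are illustrative, not taken verbatim from the docs):

import logging
import logging.handlers
import multiprocessing

def worker(queue):
    # every worker process logs through a QueueHandler instead of touching the file
    root = logging.getLogger()
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)
    root.info('hello from %s', multiprocessing.current_process().name)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    file_handler = logging.FileHandler('combined.log')
    file_handler.setFormatter(
        logging.Formatter('%(asctime)s %(processName)s %(levelname)s %(message)s'))
    # the listener thread in the parent is the only thing that writes to the file
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    procs = [multiprocessing.Process(target=worker, args=(queue,)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    listener.stop()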
Note: this question is basically a duplicate of How should I log while using multiprocessing in Python? so I've copied my answer from that question as I'm pretty sure it's currently the best solution.
I'm wondering how to set up a more specific logging system. All my tasks use
logger = logging.getLogger(__name__)
as a module-wide logger.
I want celery to log to "celeryd.log" and my tasks to "tasks.log", but I have no idea how to get this working. Using CELERYD_LOG_FILE from django-celery I can route all celeryd-related log messages to celeryd.log, but there is no trace of the log messages created in my tasks.
Note: This answer is outdated as of Celery 3.0, where you now use get_task_logger() to get your per-task logger set up. Please see the Logging section of the What's new in Celery 3.0 document for details.
Celery has dedicated support for logging, per task. See the Task documentation on the subject:
You can use the workers logger to add diagnostic output to the worker log:
@celery.task()
def add(x, y):
    logger = add.get_logger()
    logger.info("Adding %s + %s" % (x, y))
    return x + y
There are several logging levels available, and the workers loglevel setting decides whether or not they will be written to the log file.
Of course, you can also simply use print as anything written to standard out/-err will be written to the log file as well.
Under the hood this is all still the standard python logging module. You can set the CELERYD_HIJACK_ROOT_LOGGER option to False to allow your own logging setup to work, otherwise Celery will configure the handling for you.
However, for tasks, the .get_logger() call does allow you to set up a separate log file per individual task. Simply pass in a logfile argument and it'll route log messages to that separate file:
@celery.task()
def add(x, y):
    logger = add.get_logger(logfile='tasks.log')
    logger.info("Adding %s + %s" % (x, y))
    return x + y
Last but not least, you can just configure your top-level package in the Python logging module and give it a file handler of its own. I'd set this up using the celery.signals.after_setup_task_logger signal; here I assume all your modules live in a package called foo.tasks (as in foo.tasks.email and foo.tasks.scaling):
from celery.signals import after_setup_task_logger
import logging

def foo_tasks_setup_logging(**kw):
    logger = logging.getLogger('foo.tasks')
    if not logger.handlers:
        handler = logging.FileHandler('tasks.log')
        formatter = logging.Formatter(logging.BASIC_FORMAT)  # you may want to customize this
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.propagate = False

after_setup_task_logger.connect(foo_tasks_setup_logging)
Now any logger whose name starts with foo.tasks will have all its messages sent to tasks.log instead of to the root logger (which doesn't see any of these messages because .propagate is False).
Just a hint: Celery has its own logging handler:
from celery.utils.log import get_task_logger
logger = get_task_logger(__name__)
Also, Celery logs all output from the task. More details in the Celery docs on task logging.
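A minimal sketch of get_task_logger in use inside a task (the app name and broker URL are placeholders, not from the answer above):

from celery import Celery
from celery.utils.log import get_task_logger

app = Celery('tasks', broker='redis://localhost:6379/0')  # broker URL is a placeholder
logger = get_task_logger(__name__)

@app.task
def add(x, y):
    # these messages end up in the worker log, prefixed with the task name and id
    logger.info('Adding %s + %s', x, y)
    return x + y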
Append
--concurrency=1 --loglevel=INFO
to the command that runs the Celery worker,
e.g.: python xxxx.py celery worker --concurrency=1 --loglevel=INFO
It is better to set the log level inside each Python file as well.