Apache Airflow: Control over logging [Disable/Adjust logging level] - python

I am using Airflow 1.7.1.3 installed using pip
I would like to limit logging to ERROR level for the workflows executed by the scheduler. I could not find anything beyond setting the log file location in the settings.py file.
Online resources led me to a Google Groups discussion, but there was not much information there either.
Any idea how to control logging in Airflow?

I tried the workaround below, and it seems to work for setting LOGGING_LEVEL outside of settings.py:
Update settings.py:
Remove or comment line:
LOGGING_LEVEL = logging.INFO
Add line:
LOGGING_LEVEL = os.path.expanduser(conf.get('core', 'LOGGING_LEVEL'))
Update airflow.cfg configuration file:
Add line under [core]:
logging_level = WARN
Restart webserver and scheduler services
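For reference, a minimal sketch of the resulting line in settings.py, assuming conf is the Airflow configuration object already imported there; the getLevelName() conversion is my own addition (the original workaround stores the raw string):
# Sketch, not the stock settings.py.
# old:
# LOGGING_LEVEL = logging.INFO
# new: read the level name from airflow.cfg and convert it to a numeric constant
LOGGING_LEVEL = logging.getLevelName(conf.get('core', 'LOGGING_LEVEL').upper())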
Use the environment variable AIRFLOW__CORE__LOGGING_LEVEL=WARN.
See the official docs for details.

The logging functionality and its configuration will be changed in version 1.9 with this commit

As Dimo Boyadzhiev pointed out the change, adding the documentation path for more info.
File - $AIRFLOW_HOME/airflow.cfg
# Logging level
logging_level = INFO
fab_logging_level = WARN

The only solution I am aware of is changing LOGGING_LEVEL in the settings.py file. The default level is set to INFO.
AIRFLOW_HOME = os.path.expanduser(conf.get('core', 'AIRFLOW_HOME'))
SQL_ALCHEMY_CONN = conf.get('core', 'SQL_ALCHEMY_CONN')
LOGGING_LEVEL = logging.INFO
DAGS_FOLDER = os.path.expanduser(conf.get('core', 'DAGS_FOLDER'))

If you are using docker-compose.yaml:
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.1.2}
  environment:
    &airflow-common-env
    # ... the other parameters
    AIRFLOW__CORE__LOGGING_LEVEL: DEBUG  # add this line
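If the goal is only to quiet particular noisy loggers rather than change the global level, the standard library works on its own; a minimal sketch (the logger names are assumptions and vary between Airflow versions):
import logging

# Raise the threshold only on loggers that are too chatty; the names below
# are assumptions and may differ between Airflow versions.
for name in ('airflow', 'airflow.task', 'sqlalchemy.engine'):
    logging.getLogger(name).setLevel(logging.ERROR)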

Related

How to run one part of the code locally and another in production (Heroku) in Python?

I have a Django webapp, and I'd like to check if it's running on the Heroku stack (for conditional enabling of debugging, etc.) Is there any simple way to do this? An environment variable, perhaps?
I know I can probably also do it the other way around - that is, have it detect if it's running on a developer machine, but that just doesn't "sound right".
An ENV var seems to be the most obvious way of doing this. Either look for an ENV var that you know exists, or set your own:
import os

on_heroku = False
if 'YOUR_ENV_VAR' in os.environ:
    on_heroku = True
more at: http://devcenter.heroku.com/articles/config-vars
Similar to what Neil suggested, I would do the following:
debug = True
if 'SOME_ENV_VAR' in os.environ:
    debug = False
I've seen some people use if 'PORT' in os.environ:, but unfortunately the PORT variable is also present when you run foreman start locally, so there is no way to distinguish between local testing with foreman and deployment on Heroku.
I'd also recommend using one of the env vars that:
Heroku has out of the box (rather than setting and checking for your own)
is unlikely to be found in your local environment
At the date of posting, Heroku has the following environment variables:
['PATH', 'PS1', 'COLUMNS', 'TERM', 'PORT', 'LINES', 'LANG', 'SHLVL', 'LIBRARY_PATH', 'PWD', 'LD_LIBRARY_PATH', 'PYTHONPATH', 'DYNO', 'PYTHONHASHSEED', 'PYTHONUNBUFFERED', 'PYTHONHOME', 'HOME', '_']
I generally go with if 'DYNO' in os.environ:, because it seems to be the most Heroku specific (who else would use the term dyno, right?).
And I also prefer to format it like an if-else statement because it's more explicit:
if 'DYNO' in os.environ:
    debug = False
else:
    debug = True
First set the environment variable ON_HEROKU on heroku:
$ heroku config:set ON_HEROKU=1
Then in settings.py
import os
# define if on heroku environment
ON_HEROKU = 'ON_HEROKU' in os.environ
Read more about it here: https://devcenter.heroku.com/articles/config-vars
My solution:
$ heroku config:set HEROKU=1
These environment variables are persistent – they will remain in place across deploys and app restarts – so unless you need to change values, you only need to set them once.
Then you can test its presence in your app:
>>> 'HEROKU' in os.environ
True
The most reliable way would be to set an environment variable as above.
If that's not possible, there are a few signs you can look for in the filesystem, but they are not foolproof:
Heroku instances all have the path /app - the files and scripts that are running will be under this too, so you can check for the presence of the directory and/or that the scripts are being run from under it.
There is an empty directory /etc/heroku
/etc/hosts may have some heroku related domains added
~ $ cat /etc/hosts
<snip>.dyno.rt.heroku.com
Any of these can and may change at any moment.
Your mileage may vary.
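A small sketch combining those filesystem hints, with the caveat above that none of them are guaranteed to stay stable:
import os

def looks_like_heroku():
    # Heuristic only: checks the filesystem hints described above.
    if os.path.isdir('/app') and os.path.isdir('/etc/heroku'):
        return True
    try:
        with open('/etc/hosts') as f:
            return 'heroku.com' in f.read()
    except IOError:
        return False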
DATABASE_URL environment variable
in_heroku = False
if 'DATABASE_URL' in os.environ:
    in_heroku = True
I think you need to enable the database for your app with:
heroku addons:create heroku-postgresql:hobby-dev
but it is free and likely what you are going to do anyway.
Heroku makes this environment variable available when running its apps, in particular for usage as:
import dj_database_url

if in_heroku:
    DATABASES = {'default': dj_database_url.config()}
else:
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.sqlite3',
            'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
        }
    }
Not foolproof as that variable might be defined locally, but convenient for simple cases.
heroku run env
might also show other possible variables like:
DYNO_RAM
WEB_CONCURRENCY
but I'm not sure if those are documented like DATABASE_URL.
Short version: check that the time zone is UTC/GMT:
if not 'ORIGINAL_TIMEZONE' in os.environ:
    f = os.popen('date +%Z')
    tz = f.read().upper()
    os.environ['ORIGINAL_TIMEZONE'] = tz
tz = os.environ['ORIGINAL_TIMEZONE']
if tz != '' and (not 'utc' in tz.lower()) and (not 'gmt' in tz.lower()):
    print 'Definitely not running on Heroku (or in production in general)'
else:
    print 'Assume that we are running on Heroku (or in production in general)'
This is more conservative than if tz=='UTC\n': if in doubt, assume that we are in production. Note that we are saving the timezone to an environment variable because settings.py may be executed more than once. In fact, the development server executes it twice, and the second time the system timezone is already 'UTC' (or whatever is in settings.TIMEZONE).
Long version:
make absolutely sure that we never run on Heroku with DEBUG=True, and that we never run the development server on Heroku even with DEBUG=False. From settings.py:
import os
import sys

RUNNING_DEV_SERVER = (len(sys.argv) > 1) and (sys.argv[1] == 'runserver')
DEBUG = RUNNING_DEV_SERVER
TEMPLATE_DEBUG = DEBUG

# Detect the timezone
if not 'ORIGINAL_TIMEZONE' in os.environ:
    f = os.popen('date +%Z')
    tz = f.read().upper()
    os.environ['ORIGINAL_TIMEZONE'] = tz

tz = os.environ['ORIGINAL_TIMEZONE']
print ('DEBUG: %d, RUNNING_DEV_SERVER: %d, system timezone: %s ' % (DEBUG, RUNNING_DEV_SERVER, tz))

if not (DEBUG or RUNNING_DEV_SERVER):
    SECRET_KEY = os.environ['SECRET_KEY']
else:
    print 'Running in DEBUG MODE! Hope this is not in production!'
    SECRET_KEY = 'DEBUG_INSECURE_SECRET_KEY_ae$kh(7b%$+a fcw_bdnzl#)$t88x7h2-p%eg_ei5m=w&2p-)1+'

    # But what if we are idiots and are still somehow running with DEBUG=True in production?!
    # 1. Make sure the SECRET_KEY environment variable is not set
    assert 'SECRET_KEY' not in os.environ
    # 2. Make sure the timezone is not UTC or GMT (indicating production)
    tz = os.environ['ORIGINAL_TIMEZONE']
    assert tz != '' and (not 'UTC' in tz) and (not 'GMT' in tz)
    # 3. Look for environment variables suggesting we are in PROD
    for key in os.environ:
        for red_flag in ['heroku', 'amazon', 'aws', 'prod', 'gondor']:
            assert not red_flag in key.lower()
            assert not red_flag in os.environ[key].lower()
If you really want to run the development server on Heroku, I suggest you add an environment variable specifying the date when you can do that. Then only proceed if this date is today. This way you'll have to change this variable before you begin development work, but if you forget to unset it, next day you will still be protected against accidentally running it in production. Of course, if you want to be super-conservative, you can also specify, say, a 1-hour window when exceptions apply.
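A sketch of that date-window idea, using a hypothetical ALLOW_DEV_SERVER_UNTIL variable (the name and format are mine, not an existing convention):
import datetime
import os
import sys

RUNNING_DEV_SERVER = (len(sys.argv) > 1) and (sys.argv[1] == 'runserver')

# Hypothetical guard: only allow `runserver` if ALLOW_DEV_SERVER_UNTIL is set
# to today's date (YYYY-MM-DD), so a forgotten variable expires on its own.
if RUNNING_DEV_SERVER:
    allowed = os.environ.get('ALLOW_DEV_SERVER_UNTIL', '')
    today = datetime.date.today().isoformat()
    assert allowed == today, 'Refusing to run the dev server; set ALLOW_DEV_SERVER_UNTIL=%s' % today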
Lastly, if you decided to adopt the approach suggested above, while you are at it, also install django-security, add djangosecurity to INSTALLED_APPS, and add to the end of your settings.py:
if not (DEBUG or RUNNING_DEV_SERVER):
    ### Security
    SECURE_SSL_REDIRECT = True
    SECURE_CONTENT_TYPE_NOSNIFF = True
    SECURE_HSTS_SECONDS = 86400000
    SECURE_HSTS_INCLUDE_SUBDOMAINS = True
    SECURE_BROWSER_XSS_FILTER = True
    SESSION_COOKIE_SECURE = True
    SESSION_COOKIE_HTTPONLY = True
    CSRF_COOKIE_HTTPONLY = True  # May have problems with Ajax
    CSRF_COOKIE_SECURE = True

Read Celery configuration from Python properties file

I have an application that needs to initialize Celery and other things (e.g. database). I would like to have a .ini file that contains the application's configuration. This should be passed to the application at runtime.
development.ini:
[celery]
broker=amqp://localhost/
backend=amqp://localhost/
task.result.expires=3600
[database]
# database config
# ...
celeryconfig.py:
from celery import Celery
import ConfigParser

config = ConfigParser.RawConfigParser()
config.read(...)  # Pass this from the command line somehow

celery = Celery('myproject.celery',
                broker=config.get('celery', 'broker'),
                backend=config.get('celery', 'backend'),
                include=['myproject.tasks'])

# Optional configuration, see the application user guide.
celery.conf.update(
    CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires')
)

# Initialize database, etc.

if __name__ == '__main__':
    celery.start()
To start Celery, I call:
celery worker --app=myproject.celeryconfig -l info
Is there any way to pass in the config file without doing something ugly like setting an environment variable?
Alright, I took Jordan's advice and used an environment variable. This is what I get in celeryconfig.py:
from celery import Celery
import os
import sys
import ConfigParser

CELERY_CONFIG = 'CELERY_CONFIG'

if not CELERY_CONFIG in os.environ:
    sys.stderr.write('Missing env variable "%s"\n\n' % CELERY_CONFIG)
    sys.exit(2)

configfile = os.environ['CELERY_CONFIG']
if not os.path.isfile(configfile):
    sys.stderr.write('Can\'t read file: "%s"\n\n' % configfile)
    sys.exit(2)

config = ConfigParser.RawConfigParser()
config.read(configfile)

celery = Celery('myproject.celery',
                broker=config.get('celery', 'broker'),
                backend=config.get('celery', 'backend'),
                include=['myproject.tasks'])

# Optional configuration, see the application user guide.
celery.conf.update(
    CELERY_TASK_RESULT_EXPIRES=config.getint('celery', 'task.result.expires'),
)

if __name__ == '__main__':
    celery.start()
To start Celery:
$ export CELERY_CONFIG=development.ini
$ celery worker --app=myproject.celeryconfig -l info
How is setting an environment variable ugly? You can set an environment variable alongside the current version of your application, derive it from the hostname, or have your build/deployment process overwrite the file: in development, copy development.ini to settings.ini in a well-known location, and in production, copy production.ini to settings.ini.
Any of these options are quite common. Using a configuration management tool such as Chef or Puppet to put the file in place is a good option.
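A sketch of the hostname-based variant mentioned above; the 'prod-' prefix and the file names are assumptions for illustration:
import socket
import ConfigParser  # configparser on Python 3

# Pick the .ini file from the hostname; 'prod-' is an assumed naming scheme.
hostname = socket.gethostname()
configfile = 'production.ini' if hostname.startswith('prod-') else 'development.ini'

config = ConfigParser.RawConfigParser()
config.read(configfile)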

Why can't it find my celery config file?

/home/myuser/mysite-env/lib/python2.6/site-packages/celery/loaders/default.py:53:
NotConfigured: No celeryconfig.py module found! Please make sure it exists
and is available to Python.
  NotConfigured)
I even defined it in my /etc/profile and also in my virtual environment's "activate". But it's not reading it.
Now, in Celery 4.1, you can solve that problem with the following code (the easiest way):
import celeryconfig
from celery import Celery
app = Celery()
app.config_from_object(celeryconfig)
For example, a small celeryconfig.py:
BROKER_URL = 'pyamqp://'
CELERY_RESULT_BACKEND = 'redis://localhost'
CELERY_ROUTES = {'task_name': {'queue': 'queue_name_for_task'}}
Another very simple way:
from celery import Celery
app = Celery('tasks')
app.conf.update(
    result_expires=60,
    task_acks_late=True,
    broker_url='pyamqp://',
    result_backend='redis://localhost'
)
Or using a configuration class/object:
from celery import Celery
app = Celery()
class Config:
    enable_utc = True
    timezone = 'Europe/London'
app.config_from_object(Config)
# or using the fully qualified name of the object:
# app.config_from_object('module:Config')
Or, as was mentioned, by setting CELERY_CONFIG_MODULE:
import os
from celery import Celery
#: Set default configuration module name
os.environ.setdefault('CELERY_CONFIG_MODULE', 'celeryconfig')
app = Celery()
app.config_from_envvar('CELERY_CONFIG_MODULE')
Also see:
Create config: http://docs.celeryproject.org/en/latest/userguide/application.html#configuration
Configuration fields: http://docs.celeryproject.org/en/latest/userguide/configuration.html
I had a similar problem with my tasks module. A simple
# celery config is in a non-standard location
import os
os.environ['CELERY_CONFIG_MODULE'] = 'mypackage.celeryconfig'
in my package's __init__.py solved this problem.
Make sure you have celeryconfig.py in the same location where you are running 'celeryd', or otherwise make sure it is available on the Python path.
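If celeryconfig.py lives somewhere else, one way to make it importable is to extend sys.path before Celery tries to load it; the path below is purely hypothetical:
import sys

# Hypothetical location of the directory that contains celeryconfig.py
sys.path.insert(0, '/path/to/dir/containing/celeryconfig')

import celeryconfig  # should now import cleanly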
You can work around this with the environment... or use --config, which requires:
a path relative to CELERY_CHDIR from /etc/defaults/celeryd
a python module name, not a filename.
The error message could probably use these two facts.

Python logging module logs on Mac, but not Linux

I am experiencing an issue using the logging module in my app. I am working in Eclipse against the LDT Python (Py 2.7) interface (rather than PyDev) on my MacBook Pro. The logging module works through Eclipse; however, when I transfer my app over to RHEL 5 with Python 2.7, logging does not seem to be working at all. It is not throwing any exceptions; it is just not logging anything to console or file (it creates the file, though).
Code:
import datetime
import logging

# Initialize logging
log = logging.getLogger('pepPrep')
# Log to stderr
console = logging.StreamHandler()
console.setLevel(logging.INFO)
# Log to file
logname = 'pepPrep.' + datetime.datetime.now().strftime("%Y%m%d_%H:%M") + '.log'
filelog = logging.FileHandler(logname)
filelog.setLevel(logging.DEBUG)
# set a format
formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
# tell the handler to use this format
console.setFormatter(formatter)
filelog.setFormatter(formatter)
# add the handler to the root logger
log.addHandler(console)
log.addHandler(filelog)
log.info('This is a test')
log.debug('This is a test2')
Any pointers on how I can make this work?
The default threshold for logging is WARNING, so INFO and DEBUG messages are not output by default. To do so, add e.g.
logging.getLogger().setLevel(logging.DEBUG)
to get DEBUG and INFO messages.
You can confirm this is your problem by doing
log.warning('This is a test3')
before adding that setLevel, and confirming that the warning is actually output.
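Putting that together, a minimal corrected version of the snippet from the question; setting the level on the named logger works just as well as setting it on the root logger:
import logging

log = logging.getLogger('pepPrep')
log.setLevel(logging.DEBUG)     # let DEBUG/INFO records reach the handlers

console = logging.StreamHandler()
console.setLevel(logging.INFO)  # the console still shows only INFO and above
log.addHandler(console)

log.info('This is a test')
log.debug('This is a test2')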

How do I configure the Python logging module in Django?

I'm trying to configure logging for a Django app using the Python logging module. I have placed the following bit of configuration code in my Django project's settings.py file:
import logging
import logging.handlers
import os
date_fmt = '%m/%d/%Y %H:%M:%S'
log_formatter = logging.Formatter(u'[%(asctime)s] %(levelname)-7s: %(message)s (%(filename)s:%(lineno)d)', datefmt=date_fmt)
log_dir = os.path.join(PROJECT_DIR, "var", "log", "my_app")
log_name = os.path.join(log_dir, "nyrb.log")
bytes = 1024 * 1024 # 1 MB
if not os.path.exists(log_dir):
    os.makedirs(log_dir)
handler = logging.handlers.RotatingFileHandler(log_name, maxBytes=bytes, backupCount=7)
handler.setFormatter(log_formatter)
handler.setLevel(logging.DEBUG)
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger().addHandler(handler)
logging.getLogger(__name__).info("Initialized logging subsystem")
At startup, I get a couple Django-related messages, as well as the "Initialized logging subsystem", in the log files, but then all the log messages end up going to the web server logs (/var/log/apache2/error.log, since I'm using Apache), and use the standard log format (not the formatter I designated). Am I configuring logging incorrectly?
Kind of anti-climactic, but it turns out there was a third-party app installed in the project that had its own logging configuration that was overriding the one I set up (it modified the root logger, for some reason -- not very kosher for a Django app!). Removed that code and everything works as expected.
See this other answer. Note that settings.py is usually imported twice, so you should avoid creating multiple handlers. Better logging support is coming to Django in 1.3 (hopefully), but for now you should ensure that if your setup code is called more than once, there are no adverse effects.
I'm not sure why your logged messages are going to the Apache logs, unless you've (somewhere else in your code) added a StreamHandler to your root logger with sys.stdout or sys.stderr as the stream. You might want to print out logging.getLogger().handlers just to see it's what you'd expect to see.
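One common guard against settings.py being imported twice is to check for existing handlers before adding new ones; a minimal sketch along the lines of the setup in the question (handler is the RotatingFileHandler built there):
import logging
import logging.handlers

root = logging.getLogger()
# Only attach the rotating file handler on the first import of settings.py.
if not any(isinstance(h, logging.handlers.RotatingFileHandler) for h in root.handlers):
    root.addHandler(handler)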
I used this with success (although it does not rotate):
# in settings.py
import logging
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s %(levelname)s %(funcName)s %(lineno)d \
\033[35m%(message)s\033[0m',
    datefmt = '[%d/%b/%Y %H:%M:%S]',
    filename = '/tmp/my_django_app.log',
    filemode = 'a'
)
I'd suggest trying an absolute path, too.
I guess logging stops when Apache forks the process. After that happens, because all file descriptors are closed during daemonization, the logging system tries to reopen the log file and, as far as I understand, uses a relative file path:
log_dir = os.path.join(PROJECT_DIR, "var", "log", "my_app")
log_name = os.path.join(log_dir, "nyrb.log")
But there is no “current directory” once the process has been daemonized. Try using an absolute log_dir path. Hope that helps.
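A sketch of what that might look like, deriving PROJECT_DIR from __file__ so the resulting log path is absolute regardless of the working directory (the derivation is an assumption about the project layout):
import os

# Anchor PROJECT_DIR to this settings.py file instead of the working directory.
PROJECT_DIR = os.path.dirname(os.path.abspath(__file__))

log_dir = os.path.join(PROJECT_DIR, "var", "log", "my_app")
log_name = os.path.join(log_dir, "nyrb.log")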
