I have a module which is expensive to import (it involves downloading a ~20MB index file), which is used by a celery worker. Unfortunately I can't figure out how to have the module imported only once, and only by the celery worker.
Version 1 tasks.py file:
from celery import shared_task

import expensive_module

@shared_task
def f():
    expensive_module.do_stuff()
When I organize the file this way the expensive module is imported both by the web server and the celery instance, which is what I'd expect since the tasks module is imported in both and they're different processes.
Version 2 tasks.py file:
from celery import shared_task

@shared_task
def f():
    import expensive_module
    expensive_module.do_stuff()
In this version the web server never imports the module (which is good), but the module gets re-imported by the celery worker every time f.delay() is called. This is what really confuses me. In this scenario, why is the module re-imported every time this function is run by the celery worker? How can I re-organize this code to have only the celery worker import the expensive module, and have the module imported only once?
As a follow-on, less important question: in Version 1 of the tasks.py file, why does the web instance import the expensive module twice? Both times it's imported from urls.py when Django runs self._urlconf_module = import_module(self.urlconf_name).
Make a duplicate tasks.py file for the web server that contains only empty task stubs and none of the expensive imports.
For the celery worker, use Version 1, where you import the module once at the top of the file instead of every time the task is called.
Been there and it works. A sketch of the idea follows.
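Roughly, the split might look like this, assuming the two copies of tasks.py are deployed separately and given an explicit, matching task name (the name 'myapp.f' is illustrative, not from the original post):

# tasks.py deployed with the celery worker -- does the real work
from celery import shared_task

import expensive_module  # imported once, when the worker process starts


@shared_task(name='myapp.f')
def f():
    expensive_module.do_stuff()


# tasks.py deployed with the web server -- an empty stub with the same name
from celery import shared_task


@shared_task(name='myapp.f')
def f():
    pass  # never executed here; .delay() only sends a message with the task name

On the web side, f.delay() only serializes the task name and arguments onto the queue, so the stub body is never executed and expensive_module is never imported there.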
I have a luigi pipeline where some luigi.Tasks have conflicting pip dependencies. This causes issues because those tasks are part of the same pipeline (i.e. one task requires the other). I would not want to create separate pipelines, as I would then no longer be able to inspect the full pipeline in the scheduler. What are the best practices in this case?
Example: You have two Python packages, each defining a luigi.Task.
However, packageA needs a different version of a library than packageB:
packageA/task1.py requires mypackage==1.0.0
packageB/task2.py requires mypackage==0.9.0
Let's say the pipeline is:
task1 -> task2 -> wrappertask
This is an issue as in task2 I have to import task1 in order to define the requires method:
# packageB/task2.py, needs mypackage==0.9.0
import luigi

from task1 import Task1  # cannot do this as I would need mypackage==1.0.0


class Task2(luigi.Task):
    id = luigi.Parameter()

    def requires(self):
        return Task1(id=self.id)
    ...
If you run luigi with the local scheduler, all scheduling and task execution occur in a single Python process. So any Python package imported into the global namespace stays present, and new import statements for that package will be ignored. You also have to know that luigi instantiates all task classes once, but calls the requires() and output() methods of each task multiple times during scheduling and task execution.
So you only want to import the troubling package into the local namespace of the task method where you use it. Make sure that the package in its two versions is available on the PYTHONPATH, e.g. as pack1 and pack2, so that you can use an import statement like the one below (a fuller sketch follows the snippet):
...
...
import pack1 as pack
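Putting the pieces together, a minimal sketch of that pattern might look like the following. It assumes the two versions are installed side by side as pack1 and pack2, and the do_stuff() calls are placeholders rather than anything from the original pipeline:

# packageA/task1.py
import luigi


class Task1(luigi.Task):
    id = luigi.Parameter()

    def run(self):
        import pack1 as pack  # the ==1.0.0 flavour, local to run()
        pack.do_stuff()


# packageB/task2.py
import luigi

from task1 import Task1  # safe now: task1.py no longer imports pack1 at module level


class Task2(luigi.Task):
    id = luigi.Parameter()

    def requires(self):
        return Task1(id=self.id)

    def run(self):
        import pack2 as pack  # the ==0.9.0 flavour, local to run()
        pack.do_stuff()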
Hi, I am working on a project where I need celery beat to run long-term periodic tasks. The problem is that after starting celery beat, the task waits the full specified interval before running for the first time.
I want to fire the task on load for the first time and then run it periodically.
I have seen this question on Stack Overflow and this issue on GitHub, but didn't find a reliable solution.
Any suggestions on this one?
Since this does not seem possible, I suggest a different approach: call the task explicitly when you need it and let the scheduler continue scheduling the tasks as usual. You can call the task on startup using one of the following methods (you probably need to take care of multiple calls of the ready() method if the task is not idempotent). Alternatively, call the task from the command line using celery call after your Django server startup command.
The best place to call it will most of the time be the ready() method of the current app's AppConfig class:
from django.apps import AppConfig

from myapp.tasks import my_task


class RockNRollConfig(AppConfig):
    # ...

    def ready(self):
        my_task.delay(1, 2, 3)
Notice the use of .delay(), which puts the invocation on the celery queue and doesn't slow down starting the server.
See: https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.AppConfig and https://docs.celeryproject.org/en/stable/userguide/calling.html.
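If the task is not idempotent, one way to guard against the multiple ready() calls mentioned above is a small cache-based lock. This is only a sketch, assuming a shared cache backend such as Redis or memcached; the key name and timeout are arbitrary:

from django.apps import AppConfig
from django.core.cache import cache


class RockNRollConfig(AppConfig):
    # ...

    def ready(self):
        from myapp.tasks import my_task

        # cache.add() only succeeds if the key is not set yet, so even if
        # ready() runs in several processes the task is queued only once
        # per timeout window.
        if cache.add('my_task_queued_on_startup', True, timeout=3600):
            my_task.delay(1, 2, 3)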
I'm trying to find a way to constantly poll a server every x seconds from the ready() function of Django, basically something which looks like this:
from django.apps import AppConfig


class ApiConfig(AppConfig):
    name = 'api'

    def ready(self):
        import threading
        import time

        from django.conf import settings
        from api import utils

        def refresh_ndt_servers_list():
            while True:
                utils.refresh_servers_list()
                time.sleep(settings.WAIT_SECONDS_SERVER_POLL)

        thread1 = threading.Thread(target=refresh_ndt_servers_list)
        thread1.start()
I just want my utils.refresh_servers_list() to be executed when Django starts/is ready, and to re-execute that same method (which populates my DB) every settings.WAIT_SECONDS_SERVER_POLL seconds indefinitely. The problem is that if I run python manage.py migrate, the ready() function is also called, the polling thread starts, and the command never finishes. I would like to avoid starting this polling during migrations.
Thanks!
AppConfig.ready() is meant to "... perform initialization tasks ..." and make your app ready to run / serve requests. The actual working logic of the app should run after the Django app is initialized.
For launching a task at regular intervals, a cron job can be used.
Or, set up a periodic celery task with celery beat.
Also, the task in question seems to perform updates in the database (good for it to be atomic), and it may be critical that only a single instance of it runs at a time. One cron job or one celery task takes care of that.
However, the next run may still start if the previous one has not finished yet, or the task may be launched manually for some reason, so adding some locking logic to the task to check that only one instance is running (or locking the database table for the run) may be desired. A sketch combining both ideas is below.
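A minimal sketch of that combination, assuming a hypothetical Celery app, an illustrative task name, and a shared cache for the lock (none of these identifiers come from the question):

# celery.py -- a beat schedule that replaces the polling thread
from celery import Celery

app = Celery('api', broker='redis://localhost:6379/0')

app.conf.beat_schedule = {
    'refresh-servers-list': {
        'task': 'api.tasks.refresh_servers_list',
        'schedule': 60.0,  # seconds; would come from settings in practice
    },
}


# api/tasks.py -- the task, protected by a simple cache-based lock
from celery import shared_task
from django.core.cache import cache


@shared_task(name='api.tasks.refresh_servers_list')
def refresh_servers_list():
    # cache.add() succeeds only if the key does not exist yet, so an
    # overlapping run skips the work instead of updating the DB twice.
    if not cache.add('refresh_servers_list_lock', True, timeout=55):
        return
    try:
        from api import utils
        utils.refresh_servers_list()
    finally:
        cache.delete('refresh_servers_list_lock')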
I have a job added in apscheduler which loads some data in memory, and I am deleting all the objects after the job is complete. If I run this job as a plain Python script it works fine and the memory is released once the process exits. But with apscheduler the memory usage does not come down. I am using BackgroundScheduler. Thanks in advance.
I was running quite a few tasks via apscheduler. I suspected this setup led to R14 errors on Heroku, with dyno memory overload, crashes and restarts occurring daily. So I spun up another dyno and scheduled a few jobs to run very frequently.
Watching the metrics tab in Heroku, it immediately became clear that apscheduler was the culprit.
Removing jobs after they're run was recommended to me. But this is of course a bad idea when running cron and interval jobs as they won't run again.
What finally solved it was tweaking the ThreadPoolExecutor (lowering the max number of workers), see this answer on Stack Overflow and this and this post on GitHub. I definitely suggest you read the docs on this.
Other diagnostics resources: 1, 2.
Example code:
import logging

from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor
from apscheduler.schedulers.blocking import BlockingScheduler

from tests import overloadcheck

logging.basicConfig()
logging.getLogger('apscheduler').setLevel(logging.DEBUG)

sched = BlockingScheduler(
    executors={
        'threadpool': ThreadPoolExecutor(max_workers=9),
        'processpool': ProcessPoolExecutor(max_workers=3)
    }
)

@sched.scheduled_job('interval', minutes=10, executor='threadpool')
def message_overloadcheck():
    overloadcheck()

sched.start()
Or, if you, like I do, love to run heavy tasks, try the ProcessPoolExecutor as an alternative to, or in addition to, the ThreadPoolExecutor, but make sure to select it for the specific jobs in that case (see the example after the update below).
Update: you also need to import ProcessPoolExecutor if you wish to use it; I added this to the code above.
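For example, routing a heavy job to the process pool defined above is just a matter of naming that executor in the decorator; heavy_crunch() is a hypothetical function, not part of the original setup:

@sched.scheduled_job('interval', minutes=30, executor='processpool')
def message_heavy_crunch():
    # runs in a worker process of the 'processpool' executor instead of a
    # thread in the main process; the function must be picklable (top-level)
    heavy_crunch()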
Just in case anyone is using Flask-APScheduler and having memory leak issues, it took me a while to realize that it expects any configuration settings to be in your Flask Config, not when you instantiate the scheduler.
So if you (like me) did something like this:
from flask_apscheduler import APScheduler
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
bg_scheduler = BackgroundScheduler(executors={'threadpool': ThreadPoolExecutor(max_workers=1)})
scheduler = APScheduler(scheduler=bg_scheduler)
scheduler.init_app(app)
scheduler.start()
then for whatever reason, when jobs are run in the Flask request context, it will not recognize the executor, 'threadpool', or any other configuration settings you may have set.
However, if you set these same options in the Flask Config class as:
class Config(object):
    #: Enable build completion checks?
    SCHEDULER_API_ENABLED = True
    #: Sets max workers to 1 which reduces memory footprint
    SCHEDULER_EXECUTORS = {"default": {"type": "threadpool", "max_workers": 1}}
    # ... other Flask configuration options
and then do (back in the main script)
scheduler = APScheduler()
scheduler.init_app(app)
scheduler.start()
then the configuration settings actually do get set. I'm guessing when I called scheduler.init_app in the original script, Flask-APScheduler saw that I hadn't set any of those settings in my Flask Config and so overwrote them with default values, but not 100% sure.
Regardless, hopefully this helps anyone who has tried the top-rated answer but is also using Flask-APScheduler as a wrapper and might still be seeing memory issues.
I have a Celery 3.1.19 setup which uses a BROKER_URL including a virtual host.
# in settings.py
BROKER_URL = 'amqp://guest:guest@localhost:5672/yard'
Celery starts normally, loads the tasks, and the tasks I define with the @app.task decorator work fine. I assume that my RabbitMQ and celery configuration on this end is correct.
Tasks I define with @shared_task and load with app.autodiscover_tasks also load correctly on start. However, if I call such a task, the message ends up in the (still existing) amqp://guest:guest@localhost:5672/ default virtual host.
Question: What am I missing here? Where do shared tasks get their actual configuration from?
And here some more details:
# celery_app.py
from celery import Celery

celery_app = Celery('celery_app')
celery_app.config_from_object('settings')
celery_app.autodiscover_tasks(['connectors'])


@celery_app.task
def i_do_work():
    print 'this works'
And in connectors/tasks.py (with an __init__.py in the same folder):
# in connectors/tasks.py
from celery import shared_task


@shared_task
def I_do_not_work():
    print 'bummer'
And again, the shared task also gets picked up by the Celery instance. It just somehow lacks the context to send messages to the right BROKER_URL.
Btw, why are shared_tasks so poorly documented? Do they rely on some Django context? I am not using Django.
Or do I need additional parameters in my settings?
Thanks a lot.
The celery_app was not yet imported at application start. Within my project, I added the following code to __init__.py at the same module level as my celery_app definition.
from __future__ import absolute_import

try:
    from .celery_app import celery_app
except ImportError:
    # just in case someone develops application without
    # celery running
    pass
I was confused by the fact that Celery seems to come with a perfectly working default app. In this case, a more interface-like structure with a NotImplementedError might have been more helpful. Nevertheless, Celery is awesome.