Django 3.0: Running backgound infinite loop in app ready() - python

I'm trying to find a way to constantly poll a server every x seconds from the ready() function of Django, basically something which looks like this:
from django.apps import AppConfig
class ApiConfig(AppConfig):
name = 'api'
def ready(self):
import threading
import time
from django.conf import settings
from api import utils
def refresh_ndt_servers_list():
while True:
utils.refresh_servers_list()
time.sleep(settings.WAIT_SECONDS_SERVER_POLL)
thread1 = threading.Thread(target=refresh_ndt_servers_list)
thread1.start()
I just want my utils.refresh_servers_list() to be executed when Django starts/is ready and re-execute that same method (which populates my DB) every settings.WAIT_SECONDS_SERVER_POLL seconds indefinitely. The problem with that is if I run python manage.py migrate the ready() function gets called and never finishes. I would like to avoid calling this function during migration.
Thanks!

AppConfig.ready() is to "... perform initialization tasks ..." and make your app ready to run / serve requests. Actual app working logic should be run after django app is initialized.
For launching task at regular intervals cron job can be used.
Or, setup a periodic celery task with celery beat.
Also, provided task seems to be performing updates in database (good for it to be atomic). It may be critical for it to have only single running instance of it. One instance of cronjob or one celery task take care of that.
However, the next job may still run if previous one has not yet finished or just be launched manually for some reason - adding some locking logic into task to check that only one is running (or lock database table for the run) may be desired.

Related

Atomic Code in gunicorn multiprocessing / only run code in worker 1?

I am new to gunicorn multiprocessing (by calling gunicorn --worker=X).
I am using it with Flask to provide the WSGI implementation for our productive frontend. To use multiprocessing, we pass the above mentioned parameter to unicorn.
Our Flask application also uses APScheduler (via Flask-APScheduler) to run a cron task every Y hours. This task searches for new database entries to process, and when it finds them, starts processing them one by one.
The process should only be run by one worker obviously. But because of gunicorn, X workers are now spawned, each running the task every X hours, creating race conditions.
Is there a way to make the code atomic so that I can set the "processed" variable in the DB entry to true? Or, maybe, tell gunicorn to only run that specific code on the parent process, or first spawned worker?
Thanks for every input! :-)
The --preload parameter for gunicorn gives an opportunity to run code just in the parent worker.
All the code that is run before app.run() (or whatever you called your Flask() object) is apparently run on the parent process.
Didn't find any documentation on this unfortunately, but this post lead me to it.
So, running the APScheduler code before it makes sure that it's only run (or registered, in this case) once.

Running Django celery on load

Hi I am working on a project where i need celery beat to run long term periodic tasks. But the problem is that after starting the celery beat it is taking the specified time to run for the first time.
I want to fire the task on load for the first time and then run periodically.
I have seen this question on stackoverflow and this issue on GitHub, but didn't found a reliable solution.
Any suggestions on this one?
Since this does not seem possible I suggest a different approach. Call the task explicitly when you need and let the scheduler continue scheduling the tasks as usual. You can call the task on startup by using one the following methods (you probably need to take care of the multiple calls of the ready method if the task is not idempotent). Alternatively call the task from the command line by using celery call after your django server startup command.
The best place to call it will most of the times be in the ready() function of the current apps AppConfig class:
from django.apps import AppConfig
from myapp.tasks import my_task
class RockNRollConfig(AppConfig):
# ...
def ready(self):
my_task.delay(1, 2, 3)
Notice the use of .delay() which puts the invocation on the celery que and doesn't slow down starting the server.
See: https://docs.djangoproject.com/en/3.2/ref/applications/#django.apps.AppConfig and https://docs.celeryproject.org/en/stable/userguide/calling.html.

Memory usage not getting lowered even after job is completed successfully

I have a job added in apscheduler which loads some data in memory and I am deleting all the objects after the job is complete. Now if I run this job with python it works successfully and memory drop after process exits successfully.But in case of apscheduler the memory usage is not coming down.I am using BackgroundScheduler.Thanks in advance.
I was running quite a few tasks via apscheduler. I suspected this setup led to R14 errors on Heroku, with dyno memory overload, crashes and restarts occurring daily. So I spun up another dyno and scheduled a few jobs to run very frequently.
Watching the metrics tab in Heroku, it immediately became clear that apscheduler was the culprit.
Removing jobs after they're run was recommended to me. But this is of course a bad idea when running cron and interval jobs as they won't run again.
What finally solved it was tweaking the threadpoolexecutioner (lowering max number of workers), see this answer on Stackoverflow and this and this post on Github. I definitely suggest you read the docs on this.
Other diagnostics resources: 1, 2.
Example code:
import logging
from apscheduler.executors.pool import ThreadPoolExecutor, ProcessPoolExecutor
from apscheduler.schedulers.blocking import BlockingScheduler
from tests import overloadcheck
logging.basicConfig()
logging.getLogger('apscheduler').setLevel(logging.DEBUG)
sched = BlockingScheduler(
executors={
'threadpool': ThreadPoolExecutor(max_workers=9),
'processpool': ProcessPoolExecutor(max_workers=3)
}
)
#sched.scheduled_job('interval', minutes=10, executor='threadpool')
def message_overloadcheck():
overloadcheck()
sched.start()
Or, if you like I do, love to run heavy tasks—try the ProcessPoolExecutor as an alternative, or addition to the ThreadPool, but make sure to call it from specific jobs in such case.
Update: And, you need to import ProcessPoolExecutor as well if you wish to use it, added this to code.
Just in case anyone is using Flask-APScheduler and having memory leak issues, it took me a while to realize that it expects any configuration settings to be in your Flask Config, not when you instantiate the scheduler.
So if you (like me) did something like this:
from flask_apscheduler import APScheduler
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.executors.pool import ThreadPoolExecutor
bg_scheduler = BackgroundScheduler(executors={'threadpool': ThreadPoolExecutor(max_workers=1)})
scheduler = APScheduler(scheduler=bg_scheduler)
scheduler.init_app(app)
scheduler.start()
then for whatever reason, when jobs are run in the Flask request context, it will not recognize the executor, 'threadpool', or any other configuration settings you may have set.
However, if you set these same options in the Flask Config class as:
class Config(object):
#: Enable build completion checks?
SCHEDULER_API_ENABLED = True
#: Sets max workers to 1 which reduces memory footprint
SCHEDULER_EXECUTORS = {"default": {"type": "threadpool", "max_workers": 1}
# ... other Flask configuration options
and then do (back in the main script)
scheduler = APScheduler()
scheduler.init_app(app)
scheduler.start()
then the configuration settings actually do get set. I'm guessing when I called scheduler.init_app in the original script, Flask-APScheduler saw that I hadn't set any of those settings in my Flask Config and so overwrote them with default values, but not 100% sure.
Regardless, hopefully this helps anyone who has tried the top-rated answer but is also using Flask-APScheduler as a wrapper and might still be seeing memory issues.

Celery: why do I need a broker for periodic tasks?

I have a standalong script that scrapes a page, initiates a connection to a database, and writes database to it. I need it to execute periodically after x hours. I can make it with using a bash script, with the pseudocode:
while true
do
python scraper.py
sleep 60*60*x
done
From what I read about message brokers, they are used for sending "signals" from one running program to another, like HTTP in principle. Like I have a piece of code that accepts an email id from user, it sends signal with email-id to another piece of code that will send the email.
I need celery to run a periodic task on heroku. I already have a mongodb on a separate server. WHy do I need to run another server for rabbitmq or redis just for this? Can I use celery without the broker?
Celery architecture is designed to scale and distribute tasks across several servers. For sites like yours it might be an overkill. Queue service is generally needed to maintain the task list and signal the status of finished tasks.
You might want to take a look in Huey instead. Huey is small-scale Celery "Clone" needing only Redis as an external dependency, not RabbitMQ. It's still using Redis queue mechanism to line the tasks in queue.
There also exists Advanced Python scheduler which does not need even Redis, but can hold the state of the queue in memory in-process.
Alternatively if you have very small amount of periodical tasks, no delayed tasks, I would just use Cron and pure Python scripts to run the tasks.
As the Celery documentation explains:
Celery communicates via messages, usually using a broker to mediate between clients and workers. To initiate a task, a client adds a message to the queue, which the broker then delivers to a worker.
You can use your existing MongoDB database as broker. see Using MongoDB.
For the application like this, its better use Django Background Tasks
,
Installation
Install from PyPI:
pip install django-background-tasks
Add to INSTALLED_APPS:
INSTALLED_APPS = (
# ...
'background_task',
# ...
)
Migrate your database:
python manage.py makemigrations background_task
python manage.py migrate
Creating and registering tasks
To register a task use the background decorator:
from background_task import background
from django.contrib.auth.models import User
#background(schedule=60)
def notify_user(user_id):
# lookup user by id and send them a message
user = User.objects.get(pk=user_id)
user.email_user('Here is a notification', 'You have been notified')
This will convert the notify_user into a background task function. When you call it from regular code it will actually create a Task object and stores it in the database. The database then contains serialised information about which function actually needs running later on. This does place limits on the parameters that can be passed when calling the function - they must all be serializable as JSON. Hence why in the example above a user_id is passed rather than a User object.
Calling notify_user as normal will schedule the original function to be run 60 seconds from now:
notify_user(user.id)
This is the default schedule time (as set in the decorator), but it can be overridden:
notify_user(user.id, schedule=90) # 90 seconds from now
notify_user(user.id, schedule=timedelta(minutes=20)) # 20 minutes from now
notify_user(user.id, schedule=timezone.now()) # at a specific time
Also you can run original function right now in synchronous mode:
notify_user.now(user.id) # launch a notify_user function and wait for it
notify_user = notify_user.now # revert task function back to normal function.
Useful for testing.
You can specify a verbose name and a creator when scheduling a task:
notify_user(user.id, verbose_name="Notify user", creator=user)

Trigger Django module on Database update

I want to develop an application that monitors the database for new records and allows me to execute a method in the context of my Django application when a new record is inserted.
I am planning to use an approach where a Celery task checks the database for changes since the last check and triggers the above method.
Is there a better way to achieve this?
I'm using SQLite as the backend and tried apsw's setupdatehook API, but it doesn't seem to run my module in Django context.
NOTE: The updates are made by a different application outside Django.
Create a celery task to do whatever it is you need to do with the object:
tasks.py
from celery.decorators import task
#task()
def foo(object):
object.do_some_calculation()
Then create a django signal that is fired every time an instance of your Model is saved , queuing up your task in Celery:
models.py
class MyModel(models.Model):
...
from django.db.models.signals import post_save
from django.dispatch import receiver
from mymodel import tasks
#receiver(post_save, sender=MyModel)
def queue_task(sender, instance, created, **kwargs):
tasks.foo.delay(object=instance)
What's important to note that is django's signals are synchronous, in other words the queue_task function runs within the request cycle, but all the queue_task function is doing is telling Celery to handle the actual guts of the work (do_some_calculation) in theb background
A better way would be to have that application that modifies the records call yours. Or at least make a celery queue entry so that you don't really have to query the database too often to see if something changed.
But if that is not an option, letting celery query the database to find if something changed is probably the next best option. (surely better than the other possible option of calling a web service from the database as a trigger, which you should really avoid.)

Categories