I'm trying to cache a large resource file among tasks using Celery 4.0.2.
Looking at the documentation, I found the section on task instantiation and caching:
http://docs.celeryproject.org/en/latest/userguide/tasks.html#instantiation
This can also be useful to cache resources. For example, a base Task class that caches a database connection:
from celery import Task

class DatabaseTask(Task):
    _db = None

    @property
    def db(self):
        if self._db is None:
            self._db = Database.connect()
        return self._db
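The linked section of the docs pairs this base class with tasks that declare it via the base= argument of the task decorator, roughly like this (process_rows and process_row are just the docs' illustrative names):

@app.task(base=DatabaseTask)
def process_rows():
    # process_rows.db resolves to the cached connection on the task instance
    for row in process_rows.db.table.all():
        process_row(row)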
In my case I made a few changes to cache my big resource file instead. The object is shared among the tasks, but the memory used by the big resource stays cached in the task forever.
from celery import Task

class BigResourceTask(Task):
    _resource = None

    @property
    def resource(self):
        if self._resource is None:
            self._resource = load_big_resource()
        return self._resource
How can I free that memory or make it expire after the execution of all the related tasks?
Since you create _resource on demand after checking whether it exists, you can simply delete it whenever you want:
# complete all the tasks
del BigResourceTask._resource # free memory
# do something else
r = BigResourceTask.resource # create when needed
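If the attribute has already been populated on the registered task instance (which is where self._resource ends up once the property has run), a sketch of freeing it could also look like this, using Celery's task registry; the task name "my_app.process_big_file" is only a placeholder:

def free_big_resource(app):
    # look the task up in the app's registry (app.tasks maps names to instances)
    task = app.tasks.get("my_app.process_big_file")
    if task is not None:
        # drop the cached reference; the property will rebuild it on next access
        task._resource = None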
I would like to use Django signals to trigger a celery task like so:
def delete_content(sender, instance, **kwargs):
    task_id = uuid()
    task = delete_libera_contents.apply_async(kwargs={"instance": instance}, task_id=task_id)
    task.wait(timeout=300, interval=2)
But I'm always running into kombu.exceptions.EncodeError: Object of type MusicTracks is not JSON serializable
Now I'm not sure how to handle the MusicTracks instance, since it's a model class instance. How can I properly pass such instances to my task?
In my tasks.py I have the following:
@app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance, **kwargs):
    libera_backend = instance.file.libera_backend
    ...
Never send a model instance to a Celery task. You should only send simple values, for example the instance's primary key, and then, inside the task, fetch the instance by that pk and run your logic on it.
Your code should look like this:
views.py
def delete_content(sender, **kwargs):
    task_id = uuid()
    task = delete_libera_contents.apply_async(kwargs={"instance_pk": sender.pk}, task_id=task_id)
    task.wait(timeout=300, interval=2)
task.py
@app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance_pk, **kwargs):
    instance = Instance.objects.get(pk=instance_pk)
    libera_backend = instance.file.libera_backend
    ...
You can find this rule in the Celery documentation (I can't find the link right now). One of the reasons, as I understand it: imagine this situation:
you send your instance to a Celery task (and the task is delayed for some reason for 5 minutes),
then your project runs some logic on this instance before the task has finished,
then the task finally runs and works on the old version of the instance, so the data ends up inconsistent.
(This is my own reasoning, not taken from the documentation.)
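A minimal sketch of the difference, with a hypothetical Article model and task (none of these names are from the original):

@app.task
def rename_by_pk(instance_pk):
    # re-fetching by pk means the task always works on the current row,
    # including any changes made between .delay() and actual execution
    obj = Article.objects.get(pk=instance_pk)
    obj.title = obj.title.upper()
    obj.save()

# If the whole instance were serialized into the message instead, the worker
# would operate on (and save) a stale snapshot, silently overwriting any edits
# made while the task was waiting in the queue.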
First off, sorry for making the question a bit confusing, especially for the people that have already written an answer.
In my case, the delete_content signal can be triggered from three different models, so it actually looks like this:
@receiver(pre_delete, sender=MusicTracks)
@receiver(pre_delete, sender=Movies)
@receiver(pre_delete, sender=TvShowEpisodes)
def delete_content(sender, instance, **kwargs):
    delete_libera_contents.delay(instance_pk=instance.pk)
So every time one of these models triggers a delete action, this signal will also trigger a celery task to actually delete the stuff in the background (all stored on S3).
As I cannot and should not pass instances around directly, as pointed out by @oruchkin, I pass instance.pk to the celery task. I then have to find the instance inside the task, because the task doesn't know which model triggered the delete action:
@app.task(name="Delete Libera Contents", queue='high_priority_tasks')
def delete_libera_contents(instance_pk, **kwargs):
    if Movies.objects.filter(pk=instance_pk).exists():
        instance = Movies.objects.get(pk=instance_pk)
    elif MusicTracks.objects.filter(pk=instance_pk).exists():
        instance = MusicTracks.objects.get(pk=instance_pk)
    elif TvShowEpisodes.objects.filter(pk=instance_pk).exists():
        instance = TvShowEpisodes.objects.get(pk=instance_pk)
    else:
        # log the problem and raise so the task is marked as failed
        logger.error("Task: 'Delete Libera Contents', reports: No instance found (code: JFN4LK) - Warning")
        raise ValueError("No instance found (code: JFN4LK)")
    libera_backend = instance.file.libera_backend
You might ask why I don't simply pass the sender from the signal to the celery task. I also tried this and again, as already pointed out, I cannot pass instances, and I fail with:
kombu.exceptions.EncodeError: Object of type ModelBase is not JSON serializable
So it really does seem I have to look the instance up by hand with the if-elif-else chain inside the celery task.
I am building a tool that fetches data from a different database, transforms it, and stores it in my own database. I'm migrating from APScheduler to Celery, but I ran into the following problem:
I use a class I call JobRecords to store when a job ran, whether it was successful, and which errors it encountered. I use this to know not to look too far back for updated entries, especially since some tables have multiple millions of rows.
Since the system is the same for all jobs, I created a subclass from the celery Task object. I make sure the job is executed within the Flask app context, and I fetch the latest time this Job finished successfully. I also make sure I register a value for now to avoid timing issues between querying the database and adding the job record.
from datetime import datetime
from traceback import format_exception

from celery import Task
from sqlalchemy import func
from sqlalchemy.orm import Session, scoped_session

# `app` (the Flask app) and `current_celery_app` come from the project's own
# modules; those imports were stripped from this example.


class RecordedTask(Task):
    """
    Task subclass that uses JobRecords to get the last run date
    and add new JobRecords on completion
    """
    now: datetime = None
    ignore_result = True
    _session: scoped_session = None
    success: bool = True
    info: dict = None

    @property
    def session(self) -> Session:
        """Making sure we have one global session instance"""
        if self._session is None:
            from app.extensions import db
            self._session = db.session
        return self._session
    def __call__(self, *args, **kwargs):
        from app.models import JobRecord
        kwargs['last_run'] = (
            self.session.query(func.max(JobRecord.run_at_))
            .filter(JobRecord.job_id == self.name, JobRecord.success)
            .first()
        )[0] or datetime.min
        self.now = kwargs['now'] = datetime.utcnow()
        with app.app_context():
            super(RecordedTask, self).__call__(*args, **kwargs)

    def on_failure(self, exc, task_id, args: list, kwargs: dict, einfo):
        self.session.rollback()
        self.success = False
        self.info = dict(
            args=args,
            kwargs=kwargs,
            error=exc.args,
            exc=format_exception(exc.__class__, exc, exc.__traceback__),
        )
        app.logger.error(f"Error executing job '{self.name}': {exc}")

    def on_success(self, retval, task_id, args: list, kwargs: dict):
        app.logger.info(f"Executed job '{self.name}' successfully, adding JobRecord")
        for entry in self.to_trigger:
            if len(entry) == 2:
                job, kwargs = entry
            else:
                job, = entry
                kwargs = {}
            app.logger.info(f"Scheduling job '{job}'")
            current_celery_app.signature(job, **kwargs).delay()

    def after_return(self, *args, **kwargs):
        from app.models import JobRecord
        record = JobRecord(
            job_id=self.name,
            run_at_=self.now,
            info=self.info,
            success=self.success,
        )
        self.session.add(record)
        self.session.commit()
        self.session.remove()
I added an example of a job to update a model called Location, but there are a lot of jobs just like this one.
@celery.task(bind=True, name="update_locations")
def update_locations(self, last_run: datetime = datetime.min, **_):
    """Get the locations from the external database and check for updates"""
    locations: List[ExternalLocation] = ExternalLocation.query.filter(
        ExternalLocation.updated_at_ >= last_run
    ).order_by(ExternalLocation.id).all()
    app.logger.info(f"ExternalLocation: collected {len(locations)} updated locations")
    for update_location in locations:
        existing_location: Location = Location.query.filter(
            Location.external_id == update_location.id
        ).first()
        if existing_location is None:
            self.session.add(Location.from_worker(update_location))
        else:
            existing_location.update_from_worker(update_location)
The problem is that when I run this job, the Location objects are not committed with the JobRecord, so only the latter is created. If I track it with the debugger, Location.query.count() returns the correct value inside the function, but as soon as it enters the on_success callback, it's back to 0, and self._session.new returns an empty dict.
I already tried adding the session as a property to make sure it's the same instance everywhere, but the problem still persists. Maybe it has something to do with it being a scoped_session because of Flask-SQLAlchemy?
Sorry about the large amount of code, I did try to strip as much away as possible. Any help is welcome!
I found out that the culprit was the combination of scoped_session and the Flask app context. Like any contextmanager, running the code with app.app_context() triggered the __exit__ function on leaving, which in turn caused the ScopedRegistry, where the scoped_session was stored, to be cleared. Then, a new session was created, the JobRecords were added to that, and that session was committed. Therefore, the locations would not be written to the database.
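A minimal self-contained sketch of that behaviour, assuming a plain Flask-SQLAlchemy setup (the Item model and the in-memory database are only for illustration):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///:memory:"
db = SQLAlchemy(app)

class Item(db.Model):
    id = db.Column(db.Integer, primary_key=True)

with app.app_context():
    db.create_all()
    db.session.add(Item())   # pending in the scoped session
    # leaving the block runs __exit__, and Flask-SQLAlchemy's teardown handler
    # calls db.session.remove(), discarding the uncommitted Item

with app.app_context():
    db.session.commit()              # commits nothing from the first block
    print(Item.query.count())        # 0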
There are two possible solutions. If you don't use sessions in files other than your task, you can add a session property to the task. This way, you avoid the scoped_session altogether and can clean up in your after_return function.
@property
def session(self):
    if self._session is None:
        from dashboard.extensions import db
        self._session = db.create_session(options={})()
    return self._session
However, I was accessing the session in my model definition files as well, through from extensions import db. Therefore, I was using two different sessions. I ended up using app.app_context().push() instead of the context manager, thus avoiding the __exit__ function:
app.app_context().push()
super(RecordedTask, self).__call__(*args, **kwargs)
My aim is to provide a web framework with access to a Pyro daemon that performs time-consuming tasks at its first loading. So far, I have managed to keep in memory (outside of the web app) a single instance of a class that takes care of the time-consuming loading at its initialization. I can also query it with my web app. The code for the daemon is:
import Pyro4

@Pyro4.expose
@Pyro4.behavior(instance_mode='single')
class Store(object):
    def __init__(self):
        self._store = ...  # the expensive loading

    def query_store(self, query):
        # Useful query tool to expose to the web framework.
        # Not time consuming, provided self._store is loaded.
        return ...

with Pyro4.Daemon() as daemon:
    uri = daemon.register(Store)
    with Pyro4.locateNS() as ns:
        ns.register('thing', uri)
    daemon.requestLoop()
The issue I am having is that although a single instance is created, it is only created at the first proxy query from the web app. This is normal behavior according to the doc, but not what I want, as the first query is still slow because of the initialization of Store.
How can I make sure the instance is already created as soon as the daemon is started?
I was thinking of creating a proxy instance of Store in the code of the daemon, but this is tricky because the event loop must be running.
EDIT
It turns out that daemon.register() can accept either a class or an object, which could be a solution. This is however not recommended in the doc (link above) and that feature apparently only exists for backwards compatibility.
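For reference, a minimal sketch of that workaround, i.e. registering a pre-built object so the expensive __init__ runs before the request loop starts (illustrative only, and discouraged by the docs as noted above):

import Pyro4

@Pyro4.expose
class Store(object):
    def __init__(self):
        self._store = {"ready": True}   # stands in for the expensive loading

    def query_store(self, query):
        return self._store.get(query)

store = Store()                          # the slow work happens here, up front

with Pyro4.Daemon() as daemon:
    uri = daemon.register(store)         # register the object, not the class
    with Pyro4.locateNS() as ns:
        ns.register('store', uri)
    daemon.requestLoop()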
Do whatever initialization you need outside of your Pyro code. Cache it somewhere. Use the instance_creator parameter of the @behavior decorator for maximum control over how and when an instance is created. You can even consider pre-creating server instances yourself and retrieving one from a pool if you so desire? Anyway, one possible way to do this is like so:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

def instance_creator(cls):
    print("(Pyro is asking for a server instance! Creating one!)")
    return cls(cached_initialized_stuff)

@Pyro4.behavior(instance_mode="percall", instance_creator=instance_creator)
class Server:
    def __init__(self, init_stuff):
        self.init_stuff = init_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
But this complexity is not needed for your scenario: just initialize the thing (that takes a long time) and cache it somewhere. Instead of re-initializing it every time a new server object is created, just refer to the cached pre-initialized result. Something like this:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

@Pyro4.behavior(instance_mode="percall")
class Server:
    def __init__(self):
        self.init_stuff = cached_initialized_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
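For completeness, a possible client-side sketch, assuming the daemon above is running and reachable through the name server as "test.server":

import Pyro4

with Pyro4.Proxy("PYRONAME:test.server") as proxy:
    # the slow initialization already happened when the daemon started,
    # so this call only pays the cost of the remote invocation itself
    print(proxy.work())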
I'm trying to create some celery tasks as classes, but am having some difficulty. The classes are:
class BaseCeleryTask(app.Task):

    def is_complete(self):
        """Default method for checking if the celery task has completed."""
        # simply return result (since by default tasks return boolean indicating completion)
        try:
            return self.result
        except AttributeError:
            logger.error('Result not defined. Make sure task has run!')
            return False

class MacroReportTask(BaseCeleryTask):

    def run(self, params):
        """Override the default run method with signal factory run."""
        # hold on to the factory
        process = MacroCountryReport(params)
        self.result = process.run()
        return self.result
but when I initialize the app and check app.tasks (or run the worker), the app doesn't seem to have the above tasks in its registry. Other function-based tasks (using the app.task() decorator) seem to be registered fine.
I run the above task as:
process = SignalFactoryTask()
process.delay(params)
Celery worker errors with the following message:
Received unregistered task of type None.
I think the issue I'm having is: how do I add custom classes to the task registry as I do with regular function based tasks?
Ran into the exact same issue; it took hours to find the solution because I'm 90% sure it's a bug. In your class tasks, try the following:
class BaseCeleryTask(app.Task):
    def __init__(self):
        self.name = "[modulename].BaseCeleryTask"

class MacroReportTask(app.Task):
    def __init__(self):
        self.name = "[modulename].MacroReportTask"
It seems registering it with the app still has a bug where the name isn't automatically configured. Let me know if that works.
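If that alone doesn't register the tasks, one more thing worth trying (this is a suggestion of mine, not something confirmed above) is to register an instance explicitly; Celery 4+ provides app.register_task for class-based tasks:

class MacroReportTask(BaseCeleryTask):
    name = "[modulename].MacroReportTask"

    def run(self, params):
        process = MacroCountryReport(params)
        self.result = process.run()
        return self.result

# bind and register an instance so it shows up in app.tasks under its name
macro_report_task = MacroReportTask()
app.register_task(macro_report_task)
# macro_report_task.delay(params)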
Is there a way to track whether a thread is running, using some kind of ID/reference to the thread object, so I would know if the thread is really running at some specific time?
I have functionality that starts manufacturing processes in the application in threaded mode. If there are no server restarts and nothing goes wrong, everything completes normally in the background.
But if, for example, the server is restarted while any of the threads are running, they are killed, and the production of some order gets stuck in the running state, because the state is only changed after the thread completes.
I was thinking of some scheduler that would check those production orders, and if it can't find a related thread for a running production order, it would assume the thread is dead and the order has to be restarted.
But how can I make it track properly?
I have this code:
from threading import Thread

def action_produce_threaded(self):
    thread = Thread(target=self.action_produce_thread)
    thread.start()
    return {'type': 'ir.actions.act_window_close'}
def action_produce_thread(self):
    """Threaded method to start job in background."""
    # Create a new database cursor.
    new_cr = sql_db.db_connect(self._cr.dbname).cursor()
    context = self._context
    job = None
    with api.Environment.manage():
        # Create a new environment on newly created cursor.
        # Here we don't have a valid self.env, so we can safely
        # assign env to self.
        new_env = api.Environment(new_cr, self._uid, context)
        Jobs = self.with_env(new_env).env['mrp.job']
        try:
            # Create a new job and commit it.
            # This commit is required to know that process is started.
            job = Jobs.create({
                'production_id': context.get('active_id'),
                'state': 'running',
                'begin_date': fields.Datetime.now(),
            })
            new_cr.commit()
            # Now call base method `do_produce` in the new cursor.
            self.with_env(new_env).do_produce()
            # When job will be done, update state and end_date.
            job.write({
                'state': 'done',
                'end_date': fields.Datetime.now(),
            })
        except Exception as e:
            # If we are here, then we have an exception. This exception will
            # be written to our job record and the changes committed.
            # If the job doesn't exist, then roll back all changes.
            if job:
                job.write({
                    'state': 'exception',
                    'exception': e,
                })
                new_cr.commit()
            else:
                new_cr.rollback()
        finally:
            # Here commit all transactions and close cursor.
            new_cr.commit()
            new_cr.close()
So now, at the part where the job is created, it can get stuck when something goes wrong. It will be stuck in the 'running' state, because it won't be updated in the database anymore.
Should I use some singleton class that tracks threads through their lifetime, so that a cron job run periodically could check it and decide which threads are really running and which were killed unexpectedly?
P.S. There is probably some good practice for doing this; if so, please advise.
I was able to solve this by writing a singleton class. It tracks live threads, and if the server is restarted, all references to live threads disappear and I won't find them anymore (only newly started threads will be found). That way I know for sure which threads are dead and can safely be restarted, and which cannot.
I don't know if this is the most optimal way to solve such a problem, but here it is:
class ObjectTracker(object):
    """Singleton class to track current live objects."""

    class __ObjectTracker:
        objects = {}

        @classmethod
        def add_object(cls, resource, obj):
            """Add object and the resource that identifies it into class dict."""
            cls.objects[resource] = obj

        @classmethod
        def get_object(cls, resource):
            """Get object using resource as identifier."""
            return cls.objects.get(resource)

        @classmethod
        def pop_object(cls, resource):
            """Pop object if it exists."""
            return cls.objects.pop(resource, None)

    instance = None

    def __new__(cls):
        """Instantiate only once."""
        if not ObjectTracker.instance:
            ObjectTracker.instance = ObjectTracker.__ObjectTracker()
        return ObjectTracker.instance

    def __getattr__(self, name):
        """Return from singleton instance."""
        return getattr(self.instance, name)

    def __setattr__(self, name, value):
        """Set on singleton instance."""
        return setattr(self.instance, name, value)
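A possible usage sketch of the tracker with a worker thread and a periodic check (the job_id and the functions here are placeholders, not from the original code):

from threading import Thread
import time

def do_work(job_id):
    time.sleep(1)                              # stands in for the real job
    ObjectTracker().pop_object(job_id)         # de-register on normal completion

job_id = 42                                    # placeholder identifier
worker = Thread(target=do_work, args=(job_id,))
ObjectTracker().add_object(job_id, worker)     # register before starting
worker.start()

# later, e.g. from a periodic cron job:
tracked = ObjectTracker().get_object(job_id)
if tracked is None or not tracked.is_alive():
    # no live thread for this "running" job: it was killed or the server was
    # restarted, so the job record can safely be reset or restarted
    pass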
P.S. this was used as starting example for my singleton: http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Singleton.html#id4