I have several Celery tasks I'm executing within a Django view (more specifically within Django Rest Framework's perform_create method).
What I'm trying to achieve is to immediately (that is, as soon as the task has an id/is in the results backend) access the TaskResult object and do something with it, like this:
tasks = [do_something.s(a) for a in (1, 2, 3, 4)]
results = group(*tasks).apply_async()
for result in results.children:
    task = TaskResult.objects.get(task_id=result.task_id)
    do_something_with_task_object(task)
Now, this fails with django_celery_results.models.DoesNotExist: TaskResult matching query does not exist.
I have not tried it yet, but I could make this work with something like the following snippet. That strikes me as plain wrong and ugly, though, and it also busy-waits until the tasks are finished:
while not all([TaskResult.objects.filter(task_id=t.task_id).exists() for t in results.children]):
    pass
Is there some way to make this work in a nice and clean fashion?
It turns out that a) the moment you ask a question on StackOverflow, you're able to answer it yourself and b) Django transaction management does everything you need.
If you wrap the call to apply_async in a transaction.atomic() block, all is fine, e.g.
from django.db import transaction

with transaction.atomic():
    results = group(*tasks).apply_async()

TaskResult.objects.get(task_id=results.children[0].task_id)
I don't know if it worked for everyone, but with django-celery-results==2.2.0 the transaction.atomic() context manager doesn't seem to work anymore.
On the other hand, a post_save signal seems to do the trick:
# models.py
from django.db import transaction
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=TaskResult)
def after_task_result(sender, instance, created, **kwargs):
    if created:
        transaction.on_commit(lambda: do_something())  # on_commit calls the callable with no arguments
However, with the signal I lose the local variables from the view that are not passed along when the model instance is created. In that case, it is still the ugly code that works best.
# views.py
while not TaskResult.objects.filter(task_id=task.id).exists():
    pass
task = TaskResult.objects.get(task_id=task.id)
# do something more complex with local variables
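If polling really is unavoidable, a small sleep and a timeout at least keep it from spinning a CPU core. A rough sketch, reusing the names above and an arbitrary 10-second deadline (both assumptions, not part of the original code):
import time

# Poll for the TaskResult instead of busy-waiting, and give up after a deadline.
deadline = time.monotonic() + 10  # assumed timeout, tune as needed
while not TaskResult.objects.filter(task_id=task.id).exists():
    if time.monotonic() > deadline:
        raise TimeoutError("TaskResult for %s never appeared" % task.id)
    time.sleep(0.1)
task_result = TaskResult.objects.get(task_id=task.id)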
I have a Python app split across different files. One of them, models.py, contains, among PyQt5 table models, several maps referenced from several PyQt5 form files:
# first lines:
agents_id_map = \
    {agent.name: agent.id for agent in db.session.query(db.Agent, db.Agent.id)}
# ....
# ~2000 more lines
I want to keep these maps centralized in one place. I'm also using SQLAlchemy; the Agent class is defined in a db.py file. I use these maps to fill in the foreign key on another object, say an invoice, like:
invoice = db.Invoice()
# Here is a reference
invoice.agent_id = models.agents_id_map[agent_combo.currentText()]
db.session.add(invoice)
db.session.commit()
The problem is that the models.py module gets cached and several parts of the application access stale data: if a running instance A of the app creates a new agent and another running instance B wants to create a new invoice, B won't see the agent created by A unless the app is restarted. The same happens if a user in a single running instance creates an agent and then wants to create an invoice. My candidate solutions are:
Reload the module, to get the whole code executed again, but this could be very expensive.
Isolate the code that builds those maps in another file, say maps.py, which would be cheaper to reload, and change all code that references it through refactoring.
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Certainly: put your maps inside a function, or even better, a class.
If I understand this problem correctly, you have stateful data (maps) which need regenerating under some condition (every time they are accessed? Or just every time the db is updated?). I would do something like this:
class Mappings:
    def __init__(self, db):
        self._db = db
        ...  # do any initial db stuff you need to here

    def id_map(self, thing):
        db_thing = getattr(self._db, thing.title())
        return {x.name: x.id for x in self._db.session.query(db_thing, db_thing.id)}

    def other_property_map(self, prop):
        ...  # etc

mapping = Mappings(db)
mapping.id_map("agent")
This assumes that the mapping example you've given is your major use-case, but this model could easily be adapted for almost any other mapping you might want.
You would write a method for every kind of 'mapping' you need, and it would return the desired dictionary. Note that here I've assumed you handle setting up the db elsewhere and pass a fully initialised db access object to the class, which is probably what you want to do---this class is just about encapsulating mapper state, not re-inventing your ORM.
Caching
I have not provided any caching. But if you have complete control over the db, it is easy enough to run a hook before any db commit that checks whether you've touched a particular model, and then mark the corresponding maps as needing a rebuild. Something like this:
class DbAccess(Mappings):
    def __init__(self, db, models):
        super().__init__(db)
        self._cached_map = {model: {} for model in models}

    def db_update(self, model: str, params: dict):
        if model in self._cached_map:
            self._cached_map[model] = {}  # wipe cache for this model
        self._db.update_with_model(model, params)  # dummy fn

    def id_map(self, thing: str):
        try:
            return self._cached_map[thing]["id"]
        except KeyError:
            # cache miss: rebuild via the parent class and store it
            self._cached_map.setdefault(thing, {})["id"] = super().id_map(thing)
            return self._cached_map[thing]["id"]
I don't really think DbAccess should inherit from Mappings---put it all in one class, or have a DB class and a Mappings mixin and inherit from both. I just didn't want to write everything out again.
I've not written any real db access routines, (hence my dummy fn) as I don't know how you're doing it (but clearly using an ORM). But the basic idea is just to handle the caching yourself, by storing the mapping every time, but deleting all the stored mappings every time you do any commit transactions involving the model in question (thus rebuilding the cache as needed).
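For illustration, here's how the DbAccess sketch above might be exercised; the model names and the update_with_model call are placeholders from this answer, not a real API:
access = DbAccess(db, models=["agent", "invoice"])
access.id_map("agent")                             # first call queries the db and caches the result
access.id_map("agent")                             # second call is served from the cache
access.db_update("agent", {"name": "new agent"})   # any write touching agents wipes that cache
access.id_map("agent")                             # rebuilt with fresh data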
Aside
Note that if you really do have 2,000 lines of manually declared mappings of the form thing.name: thing.id, you really should generate them at runtime anyway. Declarative is all very well and good, but writing out 2,000 permutations of the same thing isn't declarative, it's just time-consuming---and it's doing a job that a simple loop loading the data into RAM at startup could do for you.
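For example (a sketch under the assumption that each mapped model has name and id columns, as in the question; the model list itself is hypothetical), the whole block collapses into a loop run at startup:
id_maps = {}
for model in (db.Agent, db.Customer, db.Product):  # hypothetical list of mapped models
    id_maps[model.__name__] = {
        row.name: row.id
        for row in db.session.query(model.name, model.id)
    }
# id_maps["Agent"] now plays the role of agents_id_map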
The Problem
I am using a pattern to cache SQLAlchemy objects in Redis. Whenever an instance is modified and committed, I want to clear the relevant caches so that it will be reloaded when next fetched. This clear must happen after commit to avoid race conditions (another thread querying cache, missing, and reloading stale data into the cache).
I have fought with this for a long time, coming up with various solutions that work sometimes but nothing bulletproof. This seems like a straightforward enough problem that there should be a solution to it. How do I trigger some code every time a change is committed to a SQLAlchemy instance?
What I've Tried
Events
I've tried to stitch together some SQLAlchemy events to achieve my goal with varying levels of success. Listening to "after_insert" and "after_update" will tell me when an object is modified, and "after_commit" tells me that whatever was modified was saved, so I had a scheme where the first two events would register listeners for "after_commit", which would in turn pass the object to my cache clearing function. Like this:
def _register_after_commit(_: Mapper, __: Connection, target: MyClass) -> None:
    """ Generic callback that adds this function for a target change without params """
    targets.add(target)  # Clear cache uses this set to know which instances to clear
    event.listen(get_session(), "after_commit", clear_cache)

event.listen(MyClass, "after_insert", _register_after_commit)
event.listen(MyClass, "after_update", _register_after_commit)
This works most of the time, but I occasionally get DetachedInstanceError when accessing the attributes on the target that I need in order to clear it from the cache (e.g. id). I've read that this happens because of automatic expiry during commit, which makes SQLAlchemy want to refresh all attributes. I can't turn off auto-expiring, nor can I expunge every object that passes through here; either of those could end up breaking other pieces of the code base.
A Custom Session
I made my own session class that looked something like this:
class SessionWithCallback(scoped_session):
    """ A version of orm.Session which can call a method after commit completes """

    def __init__(self, session_factory, scopefunc=None) -> None:
        super().__init__(session_factory=session_factory, scopefunc=scopefunc)
        self._callbacks = {}

    def add_callback(self, func, *args, **kwargs) -> None:
        """
        Adds a callback to be called after commit, ensuring only a single
        instance of the callback for each set of args and kwargs is used
        """
        key = f"{func}.{args}.{kwargs}"
        self._callbacks[key] = (func, args, kwargs)

    def run_callbacks(self) -> None:
        """ Executes all callbacks """
        for (func, args, kwargs) in self._callbacks.values():
            func(*args, **kwargs)
        self._callbacks = {}

    def commit(self) -> None:
        """ Flush and commit the current transaction """
        super().commit()
        self.run_callbacks()
Then instead of _register_after_commit using the "after_commit" event, it would call the current session's add_callback function. This seemed to work when running tests with just SQLAlchemy, but it fell apart when integrating with a Flask app which uses these models and utilizes Flask-SQLAlchemy. I followed instructions to customize session (overriding create_session on the SQLAlchemy instance) but as soon as I commit anything I get an exception that scoped_session has no attribute add_callback. I stepped through and it is using my class internally somehow, but the session it gives me is not an instance of my class. Confusing.
I've Considered
Storing primary keys in my listeners, then requiring the callbacks to open a session and query for a fresh instance themselves if they need more info (see the sketch after this list). It might work, but it feels like extra I/O that I shouldn't need, and with several different callbacks for one instance, having them all re-query feels like a lot of work.
Having some global place to store callbacks instead of on the Session, so that I can avoid the add_callback function. I'd need to still make this session-specific and thread-safe though. Easy enough in Flask, but Flask isn't the only app that needs to share this code.
Just doing these cache clears manually... but that's bound to cause developer error.
Spawning some time-delayed job to clear cache from "after_insert/update". That gets really complicated really fast and sounds like a real headache. For instance, how do you decide how long to wait?
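For what it's worth, here is a rough sketch of the first idea (capture primary keys in the mapper events, clear the cache after commit). It sidesteps DetachedInstanceError by reading target.id before commit-time expiry, but the module-level set is neither session-specific nor thread-safe, which is exactly the remaining concern; MyClass, Session and redis_client stand in for the question's own objects, and the cache key scheme is made up:
from sqlalchemy import event

_pending_cache_clears = set()

@event.listens_for(MyClass, "after_insert")
@event.listens_for(MyClass, "after_update")
def _remember_dirty(mapper, connection, target):
    # Grab the primary key now, while the instance is still loaded,
    # so commit-time expiry can't trigger DetachedInstanceError later.
    _pending_cache_clears.add(target.id)

@event.listens_for(Session, "after_commit")
def _clear_cache(session):
    while _pending_cache_clears:
        obj_id = _pending_cache_clears.pop()
        redis_client.delete("myclass:%s" % obj_id)  # hypothetical cache key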
I am trying to use a column's values as a radio button's choices, using the code below.
forms.py
# retrieving data from the database and assigning it to the diction list
diction = polls_datum.objects.values_list('poll_choices', flat=True)

# initializing list and dictionary
OPTIONS1 = {}
OPTIONS = []

# creating the dictionary keyed 0 to the number of options in the list
for i in range(len(diction)):
    OPTIONS1[i] = diction[i]

# creating tuples from the dictionary above
# OPTIONS = zip(OPTIONS1.keys(), OPTIONS1.values())
for i in OPTIONS1:
    k = (i, OPTIONS1[i])
    OPTIONS.append(k)

class polls_form(forms.ModelForm):
    options = forms.ChoiceField(choices=OPTIONS, widget=forms.RadioSelect())

    class Meta:
        model = polls_model
        fields = ['options']
Using a form, I save the choices into a field (poll_choices). When I try to display them on the index page, new choices are not reflected until a server restart.
Can someone help with this, please?
of course "it is not reflecting until a server restart" - that's obvious when you remember that django server processes are long-running processes (it's not like PHP where each script is executed afresh on each request), and that top-level code (code that's at the module's top-level, not in a function) is only executed once per process when the module is first imported. As a general rule: don't do ANY db query at a module's top-level or at the top-level of a class statement - at best you'll get stale data, at worse it will crash your server process (if you're doing query before everything has been properly setup by django, or if you're doing query based on a schema update before the migration has been applied).
The possible solutions are either to wait until the form's initialisation to set up your field's choices, or to pass a callable as the form field's choices option, cf. https://docs.djangoproject.com/en/2.1/ref/forms/fields/#django.forms.ChoiceField.choices
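A minimal sketch of the callable option, assuming the same polls_datum model as above; the callable is re-run each time the form is initialized, so new rows appear without a restart:
def poll_options():
    # re-query the database on every form initialisation
    return list(enumerate(polls_datum.objects.values_list('poll_choices', flat=True)))

class polls_form(forms.ModelForm):
    options = forms.ChoiceField(choices=poll_options, widget=forms.RadioSelect)

    class Meta:
        model = polls_model
        fields = ['options']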
Also, the way you're building your choices list is uselessly complicated - you could do it as a one-liner:
OPTIONS = list(enumerate(polls_datum.objects.values_list('poll_choices', flat=True)))
but it's also very brittle - you're relying on the current db content and ordering for the choice values, when you should use the polls_datum's pk instead (which is guaranteed to be stable).
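A pk-keyed version of that one-liner (still assuming poll_choices is a text column on polls_datum) would look like:
OPTIONS = list(polls_datum.objects.values_list('pk', 'poll_choices'))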
And finally: since you're working with what seems to be a related model, you may want to use a ModelChoiceField instead.
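A possible ModelChoiceField version, assuming each radio option really corresponds to a polls_datum row (the field and model names are guesses based on the question):
class polls_form(forms.ModelForm):
    options = forms.ModelChoiceField(
        queryset=polls_datum.objects.all(),  # re-evaluated when the form is used, so no stale choices
        widget=forms.RadioSelect,
    )

    class Meta:
        model = polls_model
        fields = ['options']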
For future reference:
What version of Django are you using?
Have you read up on the documentation of ModelForms? https://docs.djangoproject.com/en/2.1/topics/forms/modelforms/
I'm not sure what you're trying to do with diction to dictionary to tuple. I think you could skip a step there and your future self will thank you for that.
Try to follow some tutorials and understand why certain steps are being taken. I can see from your code that you're rather new to coding or Python and there's room for improvement. Not trying to talk you down, but I'm trying to push you into the direction of becoming a better developer ;-)
REAL ANSWER:
That being said, I think the solution is to move the loading of the data into your form class, rather than leaving it 'loose' in forms.py. See bruno's answer for more information on this.
If you want to reload the data on each request that loads the form, you should create a function that gets called every time the form is loaded (for example in the form's __init__ function).
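A sketch of that approach (same hypothetical models as above): build the choices inside __init__ so the query runs on every form instantiation:
class polls_form(forms.ModelForm):
    options = forms.ChoiceField(choices=(), widget=forms.RadioSelect)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # query runs here, on every form instantiation, not at import time
        self.fields['options'].choices = list(
            enumerate(polls_datum.objects.values_list('poll_choices', flat=True))
        )

    class Meta:
        model = polls_model
        fields = ['options']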
I'm experiencing some behavior that I can't quite figure out. I'm using Cassandra to store message objects, and I'm using Celery for async pulls and pushes to the database. Everything is working fine, except for a single Celery task; the other tasks that use the same code/classes work. Here's a rough breakdown of the code logic:
db_manager = DBManager()

class User(object):

    def __init__(self, user_id):
        ...  # normal init stuff
        self.loader()

    @run_async
    def loader(self):
        ...  # loads from database if found, otherwise pulls from API

    # THIS WORKS
    @celery.task(name='user-to-db', filter=task_method)
    def to_db(self):
        # db_manager is a custom backend that handles relevant db reads, writes, etc.
        db_manager.add('users', self.user_payload)

    # THIS WORKS
    @celery.task(name='load-friends', filter=task_method)
    def load_friends(self):
        # Checks secondary redis index for friends of user
        friends = redis.srandmember('users:the-users-id:friends', self.id, 20)
        if not friends:
            profiles = load_friends_from_api(user_id=self.id)
        else:
            query = "SELECT * FROM keyspace.users WHERE id IN ({friends})".format(friends=friends)
        # Init a User object for every friend
        loaded_friends = [User(friend) for friend in profiles]
        # Returns a class container with all the instances of User(friend), accessible through a class property
        return FriendContainer(self.id, loaded_friends)

    # THIS DOES NOT WORK
    @celery.task(name='get-user-messages', filter=task_method)
    def get_user_messages(self):
        # THIS IS WHERE IT FAILS #
        messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
        # THAT LINE ABOVE #
        # Init a message class object for every message payload in database
        msgs = [Message(m, user=self) for m in messages]
        # Returns a message container class holding all the message objects, accessible through a class property
        return MessageContainer(msgs)
This last class method throws an error:
File "/usr/local/lib/python2.7/dist-packages/kombu/serialization.py", line 356, in pickle_dumps
return dumper(obj, protocol=pickle_protocol)
EncodeError: Can't pickle <class 'cassandra.io.eventletreactor.message'>: attribute lookup cassandra.io.eventletreactor.message failed
cassandra.io.eventletreactor.message points to a user-defined type in Cassandra that I use as a container for message objects per user. The line that throws this error is:
messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
This is the method from DBManager():
class DBManager(object):
    ...  # stuff ...

    def get(self, query):
        # I do some stuff to prepare the query, namely substituting `WHERE this = that`
        # with `WHERE this = ?` to create a Cassandra prepared statement.
        statement = cassandra.prepare(query_prepared)
        # I want these messages as a dict, not the default namedtuple
        cassandra.row_factory = dict_factory
        # User id is parsed out of query
        results = cassandra.execute(statement, (user_id,))
        rows = results.current_rows
        # rows is a list of dicts, no weird class references or anything in there
        return rows
I've read that making Celery tasks out of class methods is/was kind of experimental, but I can't figure out why all the other method-tasks that use the same instance of DBManager work fine.
The problem seems to be localized to some issue with the user-defined type message that's not playing nice within the Cassandra driver; however, if I run the get method from DBManager within the Celery task itself, it works. That is, if I copy/paste the code that is throwing the error from DBManager.get into User.get_user_messages, it works fine. If I try to call DBManager.get from within User.get_user_messages, it breaks.
I just can't figure out where the problem is. I can do all the following just fine:
Run the get_user_messages method without Celery, and it works.
Run the get_user_messages method WITH Celery if I run the get method code right in the Celery task method itself.
I can run other methods registered as Celery tasks that point to other methods in DBManager that use the Cassandra driver, even ones that insert the same message user-defined type into the database.
I've tried pickling ALL THE THINGS all the way down myself, and in various combinations, and can't reproduce the error.
What I have not tried:
Change serializer to json or yaml. There are a few convenience items in the db payload that won't serialize with either of those two.
Use dill instead of pickle. It seems like this should work without having to switch serializers given that I can get various parts working separately.
I could just say screw it and run the query directly through the Cassandra driver instead of my DBManager class, but I feel like this should be solvable and I'm just missing something really, really obvious, so obvious that I'm not seeing it. Any suggestions on where to look would be greatly appreciated.
In case of relevance: Cassandra 3.3, CQL 3.4, DataStax python driver 3.1
Meh, I found the problem, and it WAS really obvious. I guess I didn't actually try pickling all the things, just most of the things, and I didn't catch this in my 4am debugging stupor.
At any rate, cassandra.row_factory = dict_factory doesn't actually return everything as a dict when a row contains a user-defined type. It gives a dict like {'label': message(x='this', y='that')}, where message is a namedtuple. The Cassandra driver creates that namedtuple dynamically inside a class instance, so pickle couldn't find it.
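One way out along those lines (a sketch, not the poster's exact fix) is to flatten the UDT values into plain dicts before the rows leave DBManager.get, so nothing dynamically generated has to be pickled by Celery:
def _plain(value):
    # namedtuple-style UDT instances expose _asdict(); anything else passes through untouched
    return value._asdict() if hasattr(value, "_asdict") else value

rows = [{key: _plain(value) for key, value in row.items()} for row in rows]
return rows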
Iterating over a queryset, like so:
class Book(models.Model):
    # <snip some other stuff>
    activity = models.PositiveIntegerField(default=0)
    views = models.PositiveIntegerField(default=0)

    def calculate_statistics(self):
        self.activity = self.views * 4
        self.save()

def cron_job_calculate_all_book_statistics():
    for book in Book.objects.all():
        book.calculate_statistics()
...works just fine. However, this is a cron task, and book.views is being incremented while it runs. If book.views is modified while this cron job is running, the modification gets reverted.
Now, book.views is not modified by the cron job itself, but it is cached during the .all() queryset call, so when book.save() runs I have a feeling it uses the old book.views value.
Is there a way to make sure that only the activity field is updated? Alternatively, let's say there are 100,000 books; this will take quite a while to run, and book.views will be from whenever the queryset originally started. Is the solution just to use an .iterator()?
UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.
def calculate_statistics(self):
    self.activity = self.views + self.hearts.count() * 2
    # Can't do self.comments.count with a comments GenericRelation, because Comment uses
    # a TextField for object_pk, and that breaks the whole system. Lame.
    self.activity += Comment.objects.for_model(self).count() * 4
    self.save()
The following will do the job for you in Django 1.1, no loop necessary:
from django.db.models import F
Book.objects.all().update(activity=F('views')*4)
You can have a more complicated calculation too:
for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())
Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.
In addition to what others have said if you are iterating over a large queryset you should use iterator():
Book.objects.filter(stuff).order_by(stuff).iterator()
this stops Django from caching the items as it iterates (caching them could use a ton of memory for a large result set).
No matter how you solve this, beware of transaction-related issues. E.g. the default transaction isolation level is REPEATABLE READ, at least for the MySQL backend. This, plus the fact that both Django and the db backend work in a specific autocommit mode with an ongoing transaction, means that even if you use whrde's (very nice) suggestion, the value of `views` could already be stale. I could be wrong here, but consider yourself warned.