The Problem
I am using a pattern to cache SQLAlchemy objects in Redis. Whenever an instance is modified and committed, I want to clear the relevant caches so that it will be reloaded when next fetched. This clear must happen after commit to avoid race conditions (another thread querying cache, missing, and reloading stale data into the cache).
I have fought with this for a long time, coming up with various solutions that work sometimes but nothing bulletproof. This seems like a straightforward enough problem that there should be a solution to it. How do I trigger some code every time a change is committed to a SQLAlchemy instance?
What I've Tried
Events
I've tried to stitch together some SQLAlchemy events to achieve my goal with varying levels of success. Listening to "after_insert" and "after_update" will tell me when an object is modified, and "after_commit" tells me that whatever was modified was saved, so I had a scheme where the first two events would register listeners for "after_commit", which would in turn pass the object to my cache clearing function. Like this:
def _register_after_commit(_: Mapper, __: Connection, target: MyClass) -> None:
    """ Generic listener that records a changed target and registers the after_commit hook """
    targets.add(target)  # clear_cache uses this set to know which instances to clear
    event.listen(get_session(), "after_commit", clear_cache)

event.listen(MyClass, "after_insert", _register_after_commit)
event.listen(MyClass, "after_update", _register_after_commit)
This works most of the time, but I occasionally get a DetachedInstanceError when accessing the attributes on the target that I need in order to clear it from the cache (e.g. id). I've read that this happens because of automatic expiration on commit, which makes SQLAlchemy want to refresh all attributes. I can't turn off auto-expiring, nor can I expunge every object that passes through here; either of those could end up breaking other pieces of the code base.
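For illustration, one way around the expiration is to read what you need while the instance is still attached and keep only plain values around for after_commit. A rough sketch, not a finished solution: MyClass and get_session come from the snippets above, while redis_client and the cache-key scheme are placeholders, and the module-level set is not yet session-specific or thread-safe.

from sqlalchemy import event

pending_cache_keys = set()  # module-level scratch space (assumed; not thread-safe)

@event.listens_for(MyClass, "after_insert")
@event.listens_for(MyClass, "after_update")
def _remember_cache_key(mapper, connection, target):
    # target.id is still loaded at this point; after the commit it may be expired
    pending_cache_keys.add((type(target).__name__, target.id))

@event.listens_for(get_session(), "after_commit")
def _clear_cache(session):
    while pending_cache_keys:
        model_name, obj_id = pending_cache_keys.pop()
        redis_client.delete(f"{model_name}:{obj_id}")  # hypothetical cache-key scheme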
A Custom Session
I made my own session class that looked something like this:
class SessionWithCallback(scoped_session):
    """ A version of orm.Session which can call a method after commit completes """

    def __init__(self, session_factory, scopefunc=None) -> None:
        super().__init__(session_factory=session_factory, scopefunc=scopefunc)
        self._callbacks = {}

    def add_callback(self, func, *args, **kwargs) -> None:
        """
        Adds a callback to be called after commit, ensuring only a single
        instance of the callback for each set of args and kwargs is used
        """
        key = f"{func}.{args}.{kwargs}"
        self._callbacks[key] = (func, args, kwargs)

    def run_callbacks(self) -> None:
        """
        Executes all callbacks
        """
        for (func, args, kwargs) in self._callbacks.values():
            func(*args, **kwargs)
        self._callbacks = {}

    def commit(self) -> None:
        """ Flush and commit the current transaction """
        super().commit()
        self.run_callbacks()
Then instead of _register_after_commit using the "after_commit" event, it would call the current session's add_callback function. This seemed to work when running tests with just SQLAlchemy, but it fell apart when integrating with a Flask app that uses these models and relies on Flask-SQLAlchemy. I followed instructions for customizing the session (overriding create_session on the SQLAlchemy instance), but as soon as I commit anything I get an exception saying scoped_session has no attribute add_callback. I stepped through and it is using my class internally somehow, but the session it gives me is not an instance of my class. Confusing.
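For what it's worth, with plain SQLAlchemy the usual shape is to subclass Session rather than scoped_session and hand it to sessionmaker via class_=. A hedged sketch only (engine is assumed to exist, and wiring this into Flask-SQLAlchemy's factory is a separate question); it also shows why the scoped_session proxy can't see a custom method directly.

from sqlalchemy.orm import Session, scoped_session, sessionmaker

class CallbackSession(Session):
    """Session that runs registered callbacks after a successful commit."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._callbacks = {}

    def add_callback(self, func, *args, **kwargs):
        self._callbacks[f"{func}.{args}.{kwargs}"] = (func, args, kwargs)

    def commit(self):
        super().commit()
        callbacks, self._callbacks = self._callbacks, {}
        for func, args, kwargs in callbacks.values():
            func(*args, **kwargs)

factory = sessionmaker(bind=engine, class_=CallbackSession)  # engine assumed to exist
session_registry = scoped_session(factory)

# The scoped_session proxy only forwards Session's standard methods, so a custom
# method has to be reached through the registry's __call__, i.e. the real Session:
session_registry().add_callback(print, "committed")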
I've Considered
Storing primary keys in my listeners, then requiring each callback to open a session and query for a fresh instance if it needs more info. That might work, but it feels like extra I/O that I shouldn't need, and since I could have several different callbacks for one instance, having them all query feels like a lot of work.
Having some global place to store callbacks instead of on the Session, so that I can avoid the add_callback function. I'd still need to make this session-specific and thread-safe, though. That's easy enough in Flask, but Flask isn't the only app that needs to share this code. (A rough sketch of this idea follows these options.)
Just doing these cache clears manually... but that's bound to cause developer error.
Spawning some time-delayed job to clear cache from "after_insert/update". That gets really complicated really fast and sounds like a real headache. For instance, how do you decide how long to wait?
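Here is what that global, context-local registry could look like with the standard library's contextvars; the names are made up, and run_callbacks would still need to be wired to the "after_commit" event.

import contextvars

# One callback registry per logical context (thread, task, or request).
_post_commit_callbacks = contextvars.ContextVar("post_commit_callbacks")

def add_callback(func, *args, **kwargs):
    try:
        registry = _post_commit_callbacks.get()
    except LookupError:
        registry = {}
        _post_commit_callbacks.set(registry)
    # same de-duplication key scheme as the custom session above
    registry[f"{func}.{args}.{kwargs}"] = (func, args, kwargs)

def run_callbacks(session):
    try:
        registry = _post_commit_callbacks.get()
    except LookupError:
        return
    callbacks = list(registry.values())
    registry.clear()
    for func, args, kwargs in callbacks:
        func(*args, **kwargs)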
Related
I have a Python app split across different files. One of them, models.py, contains, among PyQt5 table models, several maps that are referred to from several PyQt5 form files:
# first lines:
agents_id_map = \
{agent.name:agent.id for agent in db.session.query(db.Agent, db.Agent.id)}
# ....
# ~2,000 more lines of similar maps
I want to keep these kinds of maps centralized in a single place. I'm also using SQLAlchemy; the Agent class is defined in a db.py file. I use these maps to fill in the foreign key of another object, say an invoice, like:
invoice = db.Invoice()
# Here is a reference
invoice.agent_id = models.agents_id_map[agent_combo.currentText()]
# ...
db.session.add(invoice)
db.session.commit()
The problem is that the models.py module gets cached, so several parts of the application access old data: if one running instance A of the app creates a new agent and another running instance B wants to create a new invoice, B won't see the new Agent created by A unless the app is restarted. This also happens if a user in the same running instance creates an agent and then wants to create an invoice. My solutions are:
Reload the module, to get the whole code executed again, but this could be very expensive.
Isolate the code building those maps in another file, say maps.py, which would be less expensive to reload, and change all the code that references it through refactoring.
Is there a solution that would allow me to touch only the code building those maps and the rest of the application remains ignorant of the change, and every time the map is referenced from another module or even the same, the code gets executed, effectively re-building maps with fresh data?
Certainly: put your maps inside a function, or even better, a class.
If I understand this problem correctly, you have stateful data (maps) which need regenerating under some condition (every time they are accessed? Or just every time the db is updated?). I would do something like this:
class Mappings:
    def __init__(self, db):
        self._db = db
        ...  # do any initial db stuff you need to here

    def id_map(self, thing):
        db_thing = getattr(self._db, thing.title())
        return {x.name: x.id for x in self._db.session.query(db_thing)}

    def other_property_map(self, prop):
        ...  # etc.

mapping = Mappings(db)
mapping.id_map("agent")
This assumes that the mapping example you've given is your major use-case, but this model could easily be adapted for almost any other mapping you might want.
You would write a method of every kind of 'mapping' you need, and it would return the desired dictionary. Note that here I've assumed you handle setting up the db elsewhere and pass a fully initialised db access object to the class, which is probably what you want to do---this class is just about encapsulating mapper state, not re-inventing your orm.
Caching
I have not provided any caching. But if you have complete control over the db, it is easy enough to run a hook before you do any db commits looking to see if you've touched any particular model, and then state that those need rebuilding. Something like this:
class DbAccess(Mappings):
    def __init__(self, db, models):
        super().__init__(db)
        self._cached_map = {model: {} for model in models}

    def db_update(self, model: str, params: dict):
        try:
            self._cached_map[model] = {}  # wipe cache
        except KeyError:
            pass
        self._db.update_with_model(model, params)  # dummy fn

    def id_map(self, thing: str):
        try:
            return self._cached_map[thing]["id"]
        except KeyError:
            self._cached_map[thing]["id"] = super().id_map(thing)
            return self._cached_map[thing]["id"]
I don't really think DbAccess should inherit from Mappings---put it all in one class, or have a DB class and a Mappings mixin and inherit from both. I just didn't want to write everything out again.
I've not written any real db access routines, (hence my dummy fn) as I don't know how you're doing it (but clearly using an ORM). But the basic idea is just to handle the caching yourself, by storing the mapping every time, but deleting all the stored mappings every time you do any commit transactions involving the model in question (thus rebuilding the cache as needed).
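Usage would then look roughly like this (the model names are made up):

db_access = DbAccess(db, models=["agent", "invoice"])

db_access.id_map("agent")                        # first call queries the db and caches the map
db_access.id_map("agent")                        # served from the cache
db_access.db_update("agent", {"name": "Alice"})  # wipes the cached agent maps
db_access.id_map("agent")                        # rebuilt with fresh data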
Aside
Note that if you really do have 2,000 lines of manually declared mappings of the form thing.name: thing.id, you really should generate them at runtime anyhow. Declarative is all very well and good, but writing out 2,000 permutations of the same thing isn't declarative, it's just time-consuming, and it does by hand a job that a simple loop could do for you at startup by putting the data in RAM.
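For example, a rough sketch of building them in a loop at startup (the model names here are placeholders):

id_maps = {}
for model_name in ("Agent", "Customer", "Supplier"):  # placeholder model names
    model = getattr(db, model_name)
    id_maps[model_name] = {obj.name: obj.id for obj in db.session.query(model)}

# id_maps["Agent"] then plays the role of the hand-written agents_id_map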
I have a function which is registered as an event listener on a SQLAlchemy model, as shown in the code snippets below (not fully functional, as I don't show the db fixture), which should be enough to explain the problem.
root/myapp/models.py:
class MyModel:
    id = Column(UUID, primary_key=True)
    value = ''

    @classmethod
    def register_hook(cls, hook_fn):
        event.listen(cls, "after_update", hook_fn, propagate=True)
root/myapp/app.py:
from models import MyModel

def hook_fn(mapper, connection, target):
    print('fired hook!')

MyModel.register_hook(hook_fn)
root/test/conftest.py:
@pytest.fixture
def patched_hook_fn(mocker):
    with mocker.patch("root.myapp.app.hook_fn") as patched:
        yield patched
root/test/tests.py:
def test_hook_fires_on_change(db, patched_hook_fn):
    model = MyModel(value="initial")
    db.session.commit()
    model.value = "changed"
    db.session.commit()  # hook fires here
    assert patched_hook_fn.called  # assert fails
What I'd like to know is:
Why doesn't the patched function get called?
Is there a simple way in a debug session to see where I should be patching in the with mocker.patch("myapp.app.hook_fn") as patched line?
It doesn't get called because you've already registered the unpatched version with the event system. SQLAlchemy does not read the value at root.myapp.app.hook_fn every time the event is fired, so even if you later set root.myapp.app.hook_fn = some_other_function (which is what patch is doing), it has no visible effect.
The way to fix this is to simply force your app to read the value every time the event is fired, by introducing a level of indirection:
MyModel.register_hook(lambda *args, **kwargs: hook_fn(*args, **kwargs))
This takes advantage of the way Python resolves names at call time: the lambda looks up hook_fn in the module's globals each time the event fires, so rebinding root.myapp.app.hook_fn (which is what patch does) changes what the lambda actually calls.
As for your second question, there's no straightforward way to figure out what you need to patch because in order to patch it directly you need to figure out where it is stored in the internals of SQLAlchemy, and depending on that, even in your tests, is quite fragile.
I'm experiencing some behavior that I can't quite figure out. I'm using Cassandra to store message objects, and I'm using Celery for async pulls and pushes to the database. Everything is working fine, except for a single Celery task; the other tasks that use the same code/classes work. Here's a rough breakdown of the code logic:
db_manager = DBManager()

class User(object):
    def __init__(self, user_id):
        ...  # normal init stuff
        self.loader()

    @run_async
    def loader(self):
        ...  # loads from database if found, otherwise pulls from API

    # THIS WORKS
    @celery.task(name='user-to-db', filter=task_method)
    def to_db(self):
        # db_manager is a custom backend that handles relevant db reads, writes, etc.
        db_manager.add('users', self.user_payload)

    # THIS WORKS
    @celery.task(name='load-friends', filter=task_method)
    def load_friends(self):
        # Checks secondary redis index for friends of user
        friends = redis.srandmember('users:the-users-id:friends', self.id, 20)
        if not friends:
            profiles = load_friends_from_api(user_id=self.id)
        else:
            query = "SELECT * FROM keyspace.users WHERE id IN ({friends})".format(friends=friends)
        # Init a User object for every friend
        loaded_friends = [User(friend) for friend in profiles]
        # Returns a class container with all the instances of User(friend), accessible through a class property
        return FriendContainer(self.id, loaded_friends)

    # THIS DOES NOT WORK
    @celery.task(name='get-user-messages', filter=task_method)
    def get_user_messages(self):
        # THIS IS WHERE IT FAILS #
        messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
        # THAT LINE ABOVE #
        # Init a message class object for every message payload in database
        msgs = [Message(m, user=self) for m in messages]
        # Returns a message container class holding all the message objects, accessible through a class property
        return MessageContainer(msgs)
This last class method throws an error:
File "/usr/local/lib/python2.7/dist-packages/kombu/serialization.py", line 356, in pickle_dumps
return dumper(obj, protocol=pickle_protocol)
EncodeError: Can't pickle <class 'cassandra.io.eventletreactor.message'>: attribute lookup cassandra.io.eventletreactor.message failed
cassandra.io.eventletreactor.message points to a user-defined type in Cassandra that I use as a container for message objects per user. The line that throws this error is:
messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
This is the method from DBManager():
class DBManager(object):
    ...  # stuff

    def get(self, query):
        # I do some stuff to prepare the query, namely substituting `WHERE this = that`
        # for `WHERE this = ?` to create a Cassandra prepared statement.
        statement = cassandra.prepare(query_prepared)
        # I want these messages as a dict, not the default namedtuple
        cassandra.row_factory = dict_factory
        # User id is parsed out of query
        results = cassandra.execute(statement, (user_id,))
        rows = results.current_rows
        # rows is a list of dicts, no weird class references or anything in there
        return rows
I've read that making Celery tasks out of class methods is/was kind of experimental, but I can't figure out why all the other methods-turned-tasks that use the same instance of DBManager are working.
The problem seems to be localized to some issue with the user-defined type message that's not playing nice within the Cassandra driver; however, if I run the get method from DBManager within the Celery task itself, it works. That is, if I copy/paste the code that is throwing the error from DBManager.get into User.get_user_messages, it works fine. If I try to call DBManager.get from within User.get_user_messages, it breaks.
I just can't figure out where the problem is. I can do all the following just fine:
Run the get_user_messages method without Celery, and it works.
Run the get_user_messages method WITH Celery if I run the get method code right in the Celery task method itself.
I can run other methods registered as Celery tasks that point to other methods in DBManager that use the Cassandra driver, even ones that insert the same message user-defined type into the database.
I've tried pickling ALL THE THINGS all the way down myself, and in various combinations, and can't reproduce the error.
What I have not tried:
Change serializer to json or yaml. There are a few convenience items in the db payload that won't serialize with either of those two.
Use dill instead of pickle. It seems like this should work without having to switch serializers given that I can get various parts working separately.
I could just say screw it and run the query directly through the Cassandra driver instead of my DBManager class, but I feel like this should be solvable and I'm just missing something really, really obvious, so obvious that I'm not seeing it. Any suggestions on where to look would be greatly appreciated.
In case of relevance: Cassandra 3.3, CQL 3.4, DataStax python driver 3.1
Meh, I found the problem, and it WAS really obvious. I guess I didn't actually try pickling all the things, just most of the things, and I didn't catch this in my 4am debugging stupor.
At any rate, cassandra.row_factory = dict_factory, when the result contains a user-defined type, doesn't actually return everything as a dict. It gives a dict like {'label': message(x='this', y='that')}, where message is a namedtuple. The Cassandra driver dynamically creates that namedtuple inside a class instance, so pickle couldn't find it.
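A minimal sketch of one way to flatten those values before the rows leave DBManager.get (an assumption about the payload, not tested against this schema): namedtuples expose _asdict(), so the UDT values can be converted to plain dicts that the task serializer can pickle.

def _to_plain(value):
    # The driver builds the UDT value as a dynamically created namedtuple;
    # convert it to a dict so the result can be pickled by the task serializer.
    if hasattr(value, "_asdict"):
        return dict(value._asdict())
    return value

rows = [{key: _to_plain(value) for key, value in row.items()} for row in rows]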
Context: Pyramid, SQLAlchemy session with ZopeTransactionExtension, Pyramid transaction manager.
This doc (http://zodb.readthedocs.org/en/latest/transactions.html#more-features-and-things-to-keep-in-mind-about-transactions) says:
Before-commit hooks
In some cases, it may be desirable to execute some code right before a transaction is committed. For example, if an operation needs to be performed on all objects changed during a transaction, it might be better to call it once at commit time instead of every time an object is changed, which could slow things down.
I need to do exactly that (get a list of changed objects, whether flushed or not), but the problem is hook functions that can be added by current_transaction.addBeforeCommitHook() appear to receive only args and kwargs passed by the programmers: not a list of changed objects, not a transaction, etc.
Q: How can a hook get access to the objects changed in the current transaction before they are flushed?
I worked out a solution, although not with a "before commit" hook, but rather by inheriting from ZopeTransactionExtension:
class ZopeTransactionExtensionWithRequest(ZopeTransactionExtension):
    def before_flush(self, session, flush_context, instances):
        super(ZopeTransactionExtensionWithRequest, self).before_flush(session, flush_context, instances)
        for sqa_inst in session.dirty:
            pass  # inspect / collect each changed object here
Perhaps you can access session.dirty with get_current_request
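For completeness, the before-commit-hook route from the question can also see the session if you close over it when registering the hook. A rough sketch, assuming the transaction package's standard API and a hypothetical process_changed handler:

import transaction

def register_changed_objects_hook(session):
    txn = transaction.get()

    def hook():
        # Runs just before the transaction manager commits, so the session's
        # pending changes are still inspectable here.
        changed = set(session.new) | set(session.dirty) | set(session.deleted)
        process_changed(changed)  # hypothetical handler

    txn.addBeforeCommitHook(hook)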
I just started using SQLAlchemy and get a DetachedInstanceError and can't find much information on this anywhere. I am using the instance outside a session, so it is natural that SQLAlchemy is unable to load any relations if they are not already loaded, however, the attribute I am accessing is not a relation, in fact this object has no relations at all. I found solutions such as eager loading, but I can't apply to this because this is not a relation. I even tried "touching" this attribute before closing the session, but it still doesn't prevent the exception. What could be causing this exception for a non-relational property even after it has been successfully accessed once before? Any help in debugging this issue is appreciated. I will meanwhile try to get a reproducible stand-alone scenario and update here.
Update: This is the actual exception message with a few stacks:
File "/home/hari/bin/lib/python2.6/site-packages/SQLAlchemy-0.6.1-py2.6.egg/sqlalchemy/orm/attributes.py", line 159, in __get__
return self.impl.get(instance_state(instance), instance_dict(instance))
File "/home/hari/bin/lib/python2.6/site-packages/SQLAlchemy-0.6.1-py2.6.egg/sqlalchemy/orm/attributes.py", line 377, in get
value = callable_(passive=passive)
File "/home/hari/bin/lib/python2.6/site-packages/SQLAlchemy-0.6.1-py2.6.egg/sqlalchemy/orm/state.py", line 280, in __call__
self.manager.deferred_scalar_loader(self, toload)
File "/home/hari/bin/lib/python2.6/site-packages/SQLAlchemy-0.6.1-py2.6.egg/sqlalchemy/orm/mapper.py", line 2323, in _load_scalar_attributes
(state_str(state)))
DetachedInstanceError: Instance <ReportingJob at 0xa41cd8c> is not bound to a Session; attribute refresh operation cannot proceed
The partial model looks like this:
metadata = MetaData()
ModelBase = declarative_base(metadata=metadata)
class ReportingJob(ModelBase):
    __tablename__ = 'reporting_job'
    job_id = Column(BigInteger, Sequence('job_id_sequence'), primary_key=True)
    client_id = Column(BigInteger, nullable=True)
And the field client_id is what is causing this exception with a usage like the below:
Query:
jobs = session \
    .query(ReportingJob) \
    .filter(ReportingJob.job_id == job_id) \
    .all()
if jobs:
    # FIXME(Hari): Workaround for the attribute getting lazy-loaded.
    jobs[0].client_id
    return jobs[0]
This is what triggers the exception later out of the session scope:
msg = msg + ", client_id: %s" % job.client_id
I found the root cause while trying to narrow down the code that caused the exception. I placed the same attribute access code at different places after session close and found that it definitely doesn't cause any issue immediately after the close of query session. It turns out the problem starts appearing after closing a fresh session that is opened to update the object. Once I understood that the state of the object is unusable after a session close, I was able to find this thread that discussed this same issue. Two solutions that come out of the thread are:
Keep a session open (which is obvious)
Specify expire_on_commit=False to sessionmaker().
The 3rd option is to manually set expire_on_commit to False on the session once it is created, something like: session.expire_on_commit = False. I verified that this solves my issue.
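For reference, the two expire_on_commit forms look roughly like this (engine setup assumed):

from sqlalchemy.orm import sessionmaker

# Option 2: at the factory level
Session = sessionmaker(bind=engine, expire_on_commit=False)

# Option 3: on an individual session after it is created
session = Session()
session.expire_on_commit = False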
We were getting similar errors, even with expire_on_commit set to False. In the end it was actually caused by having two sessionmakers that were both getting used to make sessions in different requests. I don't really understand what was going on, but if you see this exception with expire_on_commit=False, make sure you don't have two sessionmakers initialized.
I had a similar problem with the DetachedInstanceError: Instance <> is not bound to a Session;
The situation was quite simple: I pass the session and the record to be updated to my function, and it merges the record and commits it to the database. In the first sample I would get the error, as I was lazy and thought that I could just return the merged object so my operating record would be updated (i.e. its is_modified value would be false). It did return the updated record, and is_modified was now false, but subsequent uses threw the error. I think this was compounded by related child records, but I'm not entirely sure of that.
def EditStaff(self, session, record):
    try:
        r = session.merge(record)
        session.commit()
        return r
    except:
        return False
After much googling and reading about sessions etc, I realized that since I had captured the instance r before the commit and returned it, when that same record was sent back to this function for another edit/commit it had lost its session.
So to fix this I just query the database for the record just updated and return it to keep it in session and mark its is_modified value back to false.
def EditStaff(self, session, record):
    try:
        session.merge(record)
        session.commit()
        r = self.GetStaff(session, record)
        return r
    except:
        return False
Setting the expire_on_commit=False also avoided the error as mentioned above, but I don't think it actually addresses the error, and could lead to many other issues IMO.
To throw my cause & solution into the ring, I use flask and flask-sqlalchemy to manage all my session stuff. This is fine when I'm doing things via the site, but when doing things via command line and scripts, you have to ensure that anything that's doing flask-y things has to do it with the flask context.
So, in my situation, I needed to get things from a database (using flask-sqlalchemy), then render them to templates (using flask's render_template), then email them (using flask-mail).
In code, what I'd done was something like,
def render_obj(db_obj):
    with app.app_context():
        return render_template('template_for_my_db_obj.html', db_obj=db_obj)

def get_renders():
    my_db_objs = MyDbObj.query.all()
    renders = []
    for day, _db_objs in itertools.groupby(my_db_objs, MyDbObj.get_date):
        renders.extend(list(map(render_obj, _db_objs)))
    return renders

def email_report():
    renders = get_renders()
    report = '\n'.join(renders)
    with app.app_context():
        mail.send(Message('Subject', ['me@me.com'], html=report))
(this is basically pseudocode, I was doing other things in the grouping section)
And when I ran it, I'd get through the first _db_obj, but then I'd get the error on every pass after that.
The culprit? with app.app_context().
Basically it does a few things when you come out of that context, including kinda freshening up the db connections. One of the things that comes from that is getting rid of the last session that was around, which was the session that all the my_db_objs were associated with.
There's a few different options for solutions, but I went with a variant of,
def render_obj(db_obj):
    return render_template('template_for_my_db_obj.html', db_obj=db_obj)

def get_renders():
    my_db_objs = MyDbObj.query.all()
    renders = []
    for day, _db_objs in itertools.groupby(my_db_objs, MyDbObj.get_date):
        renders.extend(list(map(render_obj, _db_objs)))
    return renders

def email_report():
    with app.app_context():
        renders = get_renders()
        report = '\n'.join(renders)
        mail.send(Message('Subject', ['me@me.com'], html=report))
Only one with app.app_context(), which wraps them all. The main thing you need to do (if you have a setup like mine) is ensure any db object you're using stays "inside" whatever app_context you're using. If you do what I did in the first iteration, all your db objects will lose their session, ending in a DetachedInstanceError like mine.
My issue was a simple oversight:
I created an object, added and committed it to the db, and after that I tried to access one of the original object's attributes without refreshing the session with session.refresh(object):
user = UserFactory()
session.add(user)
session.commit()
# missing session.refresh(user) and causing the problem
print(user.name)
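With the refresh added, the same flow works:

user = UserFactory()
session.add(user)
session.commit()
session.refresh(user)  # re-loads the expired attributes while still bound to the session
print(user.name)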
As for me (a newbie), I made an indentation mistake and closed the session inside my loop, in which I loop over each row, do some operation, and commit each time.
So for newbies like me, check your code before setting things like expire_on_commit=False; it may lead you into another trap.
My solution to this error was also a simple oversight, which I don't think any of the other answers cover.
My function is fetching object x, modifying it, then returning the original x, because I would like the older version.
Before committing and returning x, I was calling expunge_all, but it was "too late", as the object was already marked dirty.
The solution was simply to expunge the object as early as possible.
# pseudo code
x = session.fetch_x()
# adding the following line fixed it
session.expunge(x)
y = session.update(x)
return x
I had a similar problem in my current project and this fix worked for me: check your DB relationships for the option lazy=True and change it to lazy='dynamic'.
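For illustration, the change on a relationship declaration looks like this (the model and table names here are hypothetical):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Child(Base):
    __tablename__ = "child"
    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey("parent.id"))

class Parent(Base):
    __tablename__ = "parent"
    id = Column(Integer, primary_key=True)
    # lazy="dynamic" makes Parent.children return a Query instead of loading
    # the collection onto the instance itself
    children = relationship("Child", lazy="dynamic")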