For performance reasons, I've got a denormalized database where some tables contain data which has been aggregated from many rows in other tables. I'd like to maintain this denormalized data cache by using SQLAlchemy events. As an example, suppose I was writing forum software and wanted each Thread to have a column tracking the combined word count of all comments in the thread in order to efficiently display that information:
class Thread(Base):
    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    title = Column(UnicodeText(), nullable=False)
    word_count = Column(Integer, nullable=False, default=0)

class Comment(Base):
    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    thread_id = Column(UUID, ForeignKey('thread.id', ondelete='CASCADE'), nullable=False)
    thread = relationship('Thread', backref='comments')
    message = Column(UnicodeText(), nullable=False)

    @property
    def word_count(self):
        return len(self.message.split())
So every time a comment is inserted (for the sake of simplicity let's say that comments are never edited or deleted), we want to update the word_count attribute on the associated Thread object. I'd want to do something like:
def after_insert(mapper, connection, target):
    thread = target.thread
    thread.word_count = sum(c.word_count for c in thread.comments)
    print("updated cached word count to", thread.word_count)

event.listen(Comment, "after_insert", after_insert)
So when I insert a Comment, I can see the event firing and see that it has correctly calculated the word count, but that change is not saved to the Thread row in the database. I don't see any caveats about updating other tables in the after_insert documentation, though I do see some caveats in some of the others, such as after_delete.
So is there a supported way to do this with SQLAlchemy events? I'm already using SQLAlchemy events for lots of other things, so I'd like to do everything that way instead of having to write database triggers.
The after_insert() event is one way to do this, and you might notice it is passed a SQLAlchemy Connection object, instead of a Session as is the case with other flush-related events. The mapper-level flush events are normally intended to be used to invoke SQL directly on the given Connection:
@event.listens_for(Comment, "after_insert")
def after_insert(mapper, connection, target):
    thread_table = Thread.__table__
    thread = target.thread
    connection.execute(
        thread_table.update().
            where(thread_table.c.id == thread.id).
            values(word_count=sum(c.word_count for c in thread.comments))
    )
    print("updated cached word count to", thread.word_count)
What is notable here is that invoking an UPDATE statement directly is also a lot more performant than running that attribute change through the whole unit of work process again.
However, an event like after_insert() isn't really needed here, as we know the value of "word_count" before the flush even happens. We actually know it as soon as Comment and Thread objects are associated with each other, and we could just as well keep Thread.word_count completely fresh in memory at all times using attribute events:
def _word_count(msg):
    return len(msg.split())

@event.listens_for(Comment.message, "set")
def set(target, value, oldvalue, initiator):
    if target.thread is not None:
        target.thread.word_count += (_word_count(value) - _word_count(oldvalue))

@event.listens_for(Comment.thread, "set")
def set(target, value, oldvalue, initiator):
    # the new Thread, if any
    if value is not None:
        value.word_count += _word_count(target.message)

    # the old Thread, if any
    if oldvalue is not None:
        oldvalue.word_count -= _word_count(target.message)
The great advantage of this method is that there's no need to iterate through thread.comments, which for an unloaded collection would mean emitting another SELECT.
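To make that concrete, here is a small usage sketch (assuming the models from the question, the two listeners above, and an already-configured session; nothing hits the database until the commit):

thread = Thread(title=u"Example thread", word_count=0)
comment = Comment(message=u"hello world")

comment.thread = thread        # fires the Comment.thread "set" listener
print(thread.word_count)       # 2 -- kept fresh in memory, no SQL emitted yet

session.add(thread)
session.commit()               # the up-to-date word_count is flushed as part of the normal INSERT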
Still another method is to do it in before_flush(). Below is a quick and dirty version, which can be refined to more carefully analyze what has changed in order to determine if the word_count needs to be updated or not:
@event.listens_for(Session, "before_flush")
def before_flush(session, flush_context, instances):
    for obj in session.new | session.dirty:
        if isinstance(obj, Thread):
            obj.word_count = sum(c.word_count for c in obj.comments)
        elif isinstance(obj, Comment):
            obj.thread.word_count = sum(c.word_count for c in obj.thread.comments)
I'd go with the attribute event method as it is the most performant and up-to-date.
You can do this with SQLAlchemy-Utils aggregated columns: http://sqlalchemy-utils.readthedocs.org/en/latest/aggregates.html
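For reference, a sketch adapted from the SQLAlchemy-Utils documentation; it counts comments rather than summing word counts (an aggregated column needs a SQL aggregate expression, so a word count would have to be stored as its own column or computed with SQL string functions):

import sqlalchemy as sa
from sqlalchemy_utils import aggregated

class Thread(Base):
    __tablename__ = 'thread'
    id = sa.Column(sa.Integer, primary_key=True)
    title = sa.Column(sa.Unicode(255))

    @aggregated('comments', sa.Column(sa.Integer, default=0))
    def comment_count(self):
        return sa.func.count('1')

    comments = sa.relationship('Comment', backref='thread')

class Comment(Base):
    __tablename__ = 'comment'
    id = sa.Column(sa.Integer, primary_key=True)
    thread_id = sa.Column(sa.Integer, sa.ForeignKey('thread.id'))
    message = sa.Column(sa.UnicodeText)

SQLAlchemy-Utils then keeps comment_count up to date whenever comments are flushed.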
Related
I have some code with a Widget object that must undergo some processing periodically. Widgets have a relationship to a Process object that tracks individual processing attempts and holds data about those attempts, such as state information, start and end times, and the result. The relationship looks something like this:
class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    attempts = relationship('Process')

class Process(Base):
    __tablename__ = 'process'
    id = Column(Integer, primary_key=True)
    widget_id = Column(Integer, ForeignKey('widget.id'))
    start = Column(DateTime)
    end = Column(DateTime)
    success = Column(Boolean)
I want to have a method on Widget to check whether it's time to process that widget yet, or not. It needs to look at all the attempts, find the most recent successful one, and see if it is older than the threshold.
One option is to iterate Widget.attempts using a list comprehension. Assuming now and delay are reasonable datetime and timedelta objects, then something like this works when defined as a method on Widget:
def ready(self):
    recent_success = [attempt for attempt in self.attempts
                      if attempt.success is True and attempt.end >= now - delay]
    if recent_success:
        return False
    return True
That seems like good idiomatic Python, but it's not making good use of the power of the SQL database backing the data, and it's probably less efficient than running a similar SQL query, especially once there are a large number of Process objects in the attempts list. I'm having a hard time figuring out the best way to implement this as a query, though.
It's easy enough to run the query inside Widget, something like this:
def ready(self):
    recent_success = session.query(Process).filter(
        and_(
            Process.widget_id == self.id,
            Process.success == True,
            Process.end >= now - delay
        )
    ).order_by(Process.end.desc()).first()
    if recent_success:
        return False
    return True
But I run into problems in unit tests with getting session set properly inside the module that defines Widget. It seems to me that's a poor style choice, and probably not how SQLAlchemy objects are meant to be structured.
I could make the ready() function something external to the Widget class, which would fix the problems with setting session in unit tests, but that seems like poor OO structure.
I think the ideal would be if I could somehow filter Widget.attempts with SQL-like code that's more efficient than a list comprehension, but I haven't found anything that suggests that's possible.
What is actually the best approach for something like this?
You are thinking in the right direction. Any solution inside the Widget instance implies that you need to load and check every instance. An external query-based process will have better performance and easier testability.
You can get all the Widget instances which need to be scheduled for next processing using this query:
q = (
    session
    .query(Widget)
    .filter(~Widget.attempts.any(and_(
        Process.success == True,
        Process.end >= now - delay,
    )))
)
widgets_to_process = q.all()
If you really want to have a property on the model, I would not create a separate query, but just use the relationship:
def ready(self, at_time):
    successes = [
        attempt
        for attempt in sorted(self.attempts, key=lambda v: v.end)
        if attempt.success and attempt.end >= at_time  # at_time = now - delay
    ]
    return not successes
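A hypothetical usage sketch of that ready() helper; delay and schedule_processing are placeholders for your own interval and scheduling code:

from datetime import datetime, timedelta

delay = timedelta(hours=1)              # assumed re-processing interval
cutoff = datetime.utcnow() - delay      # i.e. at_time = now - delay

for widget in session.query(Widget):
    if widget.ready(cutoff):
        schedule_processing(widget)     # placeholder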
I am working on a library that needs to read several objects that have a one-to-many relationship with another class. The following code represents a highly simplified version of the code of my library:
class Plan(Base):
    __tablename__ = 'plan'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    points = relationship('Point')

    def __init__(self, name):
        self.name = name

class Point(Base):
    __tablename__ = 'point'
    id = Column(Integer, primary_key=True)
    coordinates = Column(String)
    plan_id = Column(Integer, ForeignKey('plan.id'))

    def __init__(self, coordinates):
        self.coordinates = coordinates
I need to read all the data of the relationship (all the points, in the example above), perform an operation with all the points, and continue with the next object (a Plan object in the example). Since I cannot load all the data into memory because of the high number of points each plan has, what I would like to do is this: after completing the operation with the point objects related to a plan (which I no longer need afterwards), unload them and continue the process with the next plan. I have tried the following approaches:
plan.points = []
or
for point in plan.points:
    plan.points.remove(point)
But each of these approaches queues up an UPDATE or DELETE query (depending on how the cascade attribute is set on the relationship), which adds considerable overhead to the process, increasing the total execution time by a factor of 2 or 3 and queueing a potentially harmful SQL operation that would run if the session were committed.
Is there a way to unload objects without generating these UPDATE/DELETE queries and increasing the execution time?
By default, your objects are only stored in the Session using weak references, meaning that you can simply delete your own references to the objects to clear them from memory.
That said, you are free to call Session.expunge_all() to remove anything still held by the session. Don't call this if you have pending changes that still need to be committed; issue a Session.flush() first.
You can expunge individual Point objects with the Session.expunge() method too.
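A rough sketch of that advice, processing one Plan at a time and expunging before moving on; process_points() is a placeholder for the real per-plan operation:

plan_ids = [pid for (pid,) in session.query(Plan.id)]
for pid in plan_ids:
    plan = session.query(Plan).get(pid)
    process_points(plan.points)   # loads this plan's points
    session.expunge_all()         # drops them from the session; no UPDATE/DELETE is emitted
    # when `plan` is rebound on the next iteration, the old Plan and its
    # Point objects are unreferenced and can be garbage collected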
See Using the Session for more details.
I have a flask app where I made a bunch of classes all with relationships to each other:
User
Course
Lecture
Note
Queue
Asset
So I'm trying to make a new lecture and note, and I have a method defined for each thing.
create Note
def createPad(user, course, lecture):
    lecture.queues.first().users.append(user)
    # make new etherpad for user to wait in
    newNote = Note(dt)  # init also creates a new pad at /p/groupID$noteID
    db.session.add(newNote)
    #db.session.commit()
    # add note to user, course, and lecture
    user.notes.append(newNote)
    course.notes.append(newNote)
    lecture.notes.append(newNote)
    db.session.commit()
    return newNote
createLecture
def createLecture(user, course):
    # create new lecture
    now = datetime.now()
    dt = now.strftime("%Y-%m-%d-%H-%M")
    newLecture = Lecture(dt)
    db.session.add(newLecture)
    # add lecture to course, add new queue to lecture, add user to queue, add new user to lecture
    course.lectures.append(newLecture)
    newQueue = MatchQueue('neutral')
    db.session.add(newQueue)
    newLecture.users.append(user)
    # hook up the new queue to the user, lecture
    newQueue.users.append(user)
    newQueue.lecture = newLecture
    # put new lecture in correct course
    db.session.commit()
    newLecture.groupID = pad.createGroupIfNotExistsFor(newLecture.course.name + dt)['groupID']
    db.session.commit()
    return newLecture
which is all called from some controller logic
newlec = createLecture(user, courseobj)
# make new pad
newNote = createPad(user, courseobj, newlec)
# make lecture live
newlec.live = True
db.session.commit()
redirect(somewhere)
This ends up throwing this error:
ObjectDereferencedError: Can't emit change event for attribute 'Queue.users' - parent object of type has been garbage collected.
At lecture.queues.first().users.append(user) in createPad.
I have no clue what this means. I think I'm lacking some fundamental knowledge of sqlalchemy here (I am a sqlalchemy noob). What's going on?
lecture.queues.first().users.append(user)
it means:
The first() method hits the database and produces an object; I'm not following your mappings, but my guess is it's a Queue object.
Then, you access the "users" collection on that Queue.
At this point, Python itself garbage collects the Queue - it is not referenced anywhere once "users" has been returned. This is how reference-counting garbage collection works.
Then you attempt to append a "user" to "users". SQLAlchemy has to track changes to all mapped attributes; if you were to say Queue.name = "some name", SQLAlchemy needs to register that with the parent Queue object so it knows to flush it. If you say Queue.users.append(someuser), same idea, it needs to register the change event with the parent.
SQLAlchemy can't do this, because the Queue is gone. Hence the error is raised. SQLAlchemy has a weakref to the parent here, so it knows exactly what has happened (and we can't prevent it, because people get very upset when we create unnecessary reference cycles in their object models).
The solution is very easy, and also easier to read: assign the query result to a variable:
queue = lecture.queues.first()
queue.users.append(user)
I did a bunch of refactoring and passed the related objects in during instantiation, which made things a lot neater. Somehow the problem went away.
One thing I did differently was in my many-to-many relationships: I changed the backref to a db.backref():
courses = db.relationship('Course', secondary=courseTable, backref=db.backref('users', lazy='dynamic'))
lectures = db.relationship('Lecture', secondary=lectureTable, backref=db.backref('users', lazy='dynamic'))
notes = db.relationship('Note', secondary=noteTable, backref=db.backref('users', lazy='dynamic'))
queues = db.relationship('Queue', secondary=queueTable, backref=db.backref('users', lazy='dynamic'))
I want to create a flat forum, where threads are not a separate table, using a composite primary key for posts.
So posts have two fields forming a natural key: thread_id and post_number, where the former is the ID of the thread they are part of, and the latter is their position in the thread. If you aren't convinced, check below the line.
My problem is that I don't know how to tell SQLAlchemy:
when committing the addition of new Post instances with thread_id tid, look up how many posts with thread_id tid exist, and autoincrement from that number on.
Why do I think that schema is a good idea? Because it's natural and performant:
class Post(Base):
    number = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    thread_id = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    title = Column(Text)  # nullable for non-first posts
    text = Column(Text, nullable=False)
    ...

PAGESIZE = 10

# test values
tid = 5
page = 4
Entire Thread (query):
thread5 = session.query(Post).filter_by(thread_id=5)
Thread title:
title = thread5.filter_by(number=0).one().title
Thread page:
page4 = thread5.filter(
    Post.number >= (page * PAGESIZE),
    Post.number < ((page + 1) * PAGESIZE)).all()
# or
page4 = thread5.offset(page * PAGESIZE).limit(PAGESIZE).all()
Number of pages:
ceil(thread5.count() / PAGESIZE)
You can probably do this with an SQL expression as a default value (see the default argument). Give it a callable like this:
from sqlalchemy.sql import select, func

def maxnumber_for_threadid(context):
    # next post number for this thread: current max + 1
    return select([func.max(post_table.c.number) + 1]).where(
        post_table.c.thread_id == context.current_parameters['thread_id'])
I'm not absolutely sure you can return an SQL expression from a default callable; you may have to actually execute this query and return a scalar value inside the callback. (The cursor should be available from the context parameter.)
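If the expression form doesn't work, here is a sketch of the scalar-returning variant; it assumes the execution context exposes its connection as context.connection (it does in recent SQLAlchemy versions), that post_table is Post.__table__, and next_number_for_threadid is just an illustrative name:

from sqlalchemy.sql import select, func

def next_number_for_threadid(context):
    thread_id = context.current_parameters['thread_id']
    # COALESCE handles the first post in a thread (no rows yet -> number 0)
    return context.connection.scalar(
        select([func.coalesce(func.max(post_table.c.number), -1) + 1])
        .where(post_table.c.thread_id == thread_id)
    )

# used as: number = Column(Integer, primary_key=True, autoincrement=False,
#                          default=next_number_for_threadid)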
However, I strongly recommend you do what @kindall says and just use another auto-incrementing sequence for the number column. What you want to do is actually very tricky to get right even without SQLAlchemy. For example, if you are using an MVCC database, you need to introduce special row-level locking so that the number of rows with a matching thread_id does not change while you are running the transaction. How this is done is database-dependent. For example, with MySQL InnoDB, you need to do something like this:
BEGIN TRANSACTION;
SELECT MAX(number)+1 FROM posts WHERE thread_id=? FOR UPDATE;
INSERT INTO posts (thread_id, number) VALUES (?, ?); -- number is from previous query
COMMIT;
If you didn't use FOR UPDATE, then conceivably another connection trying to insert a new post into the same thread at the same time could have gotten the same value for number.
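For completeness, a rough ORM-level sketch of the same locking pattern; it assumes SQLAlchemy 0.9+ (Query.with_for_update()), an InnoDB backend, and add_post is a made-up helper name:

from sqlalchemy import func

def add_post(session, thread_id, **fields):
    # SELECT COALESCE(MAX(number), -1) + 1 ... FOR UPDATE locks the thread's rows
    next_number = (session.query(func.coalesce(func.max(Post.number), -1) + 1)
                   .filter(Post.thread_id == thread_id)
                   .with_for_update()
                   .scalar())
    post = Post(thread_id=thread_id, number=next_number, **fields)
    session.add(post)
    session.commit()   # releases the lock
    return post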
So rather than being performant, post inserts are actually quite slow (relatively speaking) because of the extra query and locking required.
All this is resolved by using a separate sequence and not worrying about post number incrementing only within a thread_id.
You should just use a global post number that increments for posts in any thread. Then you don't need to figure out the right number to use for a given thread. A given thread, then, might have posts numbered 7, 20, 42, 51, and so on. This does not matter because you can easily get the number of posts in the thread from the size of the recordset you get back from the query, and you can easily number the posts in the HTML output separately from the actual post numbers.
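A quick sketch of what that looks like at display time, reusing the query objects from the question; render_post stands in for whatever template call you use:

posts = (session.query(Post)
         .filter_by(thread_id=tid)
         .order_by(Post.number)
         .offset(page * PAGESIZE)
         .limit(PAGESIZE)
         .all())

for position, post in enumerate(posts, start=page * PAGESIZE):
    render_post(position, post)   # display `position`, not post.number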
I'm having a bit of trouble in Google App Engine ensuring that my data is correct when using an ancestor relationship without key names.
Let me explain a little more: I've got a parent entity category, and I want to create a child entity item. I'd like to create a function that takes a category name and item name, and creates both entities if they don't exist. Initially I created one transaction and created both in the transaction if needed using a key name, and this worked fine. However, I realized I didn't want to use the name as the key as it may need to change, and I tried within my transaction to do this:
def add_item_txn(category_name, item_name):
    category_query = db.GqlQuery("SELECT * FROM Category WHERE name=:category_name", category_name=category_name)
    category = category_query.get()
    if not category:
        category = Category(name=category_name, count=0)
    item_query = db.GqlQuery("SELECT * FROM Item WHERE name=:name AND ANCESTOR IS :category", name=item_name, category=category)
    item_results = item_query.fetch(1)
    if len(item_results) == 0:
        item = Item(parent=category, name=item_name)
db.run_in_transaction(add_item_txn, "foo", "bar")
What I found when I tried to run this is that App Engine rejects this as it won't let you run a query in a transaction: Only ancestor queries are allowed inside transactions.
Looking at the example Google gives about how to address this:
def decrement(key, amount=1):
    counter = db.get(key)
    counter.count -= amount
    if counter.count < 0:  # don't let the counter go negative
        raise db.Rollback()
    db.put(counter)

q = db.GqlQuery("SELECT * FROM Counter WHERE name = :1", "foo")
counter = q.get()
db.run_in_transaction(decrement, counter.key(), amount=5)
I attempted to move my fetch of the category to before the transaction:
def add_item_txn(category_key, item_name):
    category = category_key.get()
    item_query = db.GqlQuery("SELECT * FROM Item WHERE name=:name AND ANCESTOR IS :category", name=item_name, category=category)
    item_results = item_query.fetch(1)
    if len(item_results) == 0:
        item = Item(parent=category, name=item_name)
category_query = db.GqlQuery("SELECT * FROM Category WHERE name=:category_name", category_name="foo")
category = category_query.get()
if not category:
    category = Category(name=category_name, count=0)
db.run_in_transaction(add_item_txn, category.key(), "bar")
This seemingly worked, but I found when I ran this with a number of requests that I had duplicate categories created, which makes sense, as the category is queried outside the transaction and multiple requests could create multiple categories.
Does anyone have any idea how I can create these categories properly? I tried to put the category creation into a transaction, but received the error about ancestor queries only again.
Thanks!
Simon
Here is an approach to solving your problem. It is not an ideal approach in many ways, and I sincerely hope that some other AppEngineer will come up with a neater solution than I have. If not, give this a try.
My approach utilizes the following strategy: it creates entities that act as aliases for the Category entities. The name of the Category can change, but the alias entity will retain its key, and we can use elements of the alias's key to create a keyname for your Category entities, so we will be able to look up a Category by its name, but its storage is decoupled from its name.
The aliases are all stored in a single entity group, and that allows us to use a transaction-friendly ancestor query, so we can lookup or create a CategoryAlias without risking that multiple copies will be created.
When I want to look up or create a Category and Item combo, I can use the category's keyname to programmatically generate a key inside the transaction, and we are allowed to get an entity via its key inside a transaction.
class CategoryAliasRoot(db.Model):
    count = db.IntegerProperty()
    # Not actually used in current code; just here to avoid having an empty
    # model definition.

    __singleton_keyname = "categoryaliasroot"

    @classmethod
    def get_instance(cls):
        # get_or_insert is inherently transactional; no chance of
        # getting two of these objects.
        return cls.get_or_insert(cls.__singleton_keyname, count=0)

class CategoryAlias(db.Model):
    alias = db.StringProperty()

    @classmethod
    def get_or_create(cls, category_alias):
        alias_root = CategoryAliasRoot.get_instance()

        def txn():
            existing_alias = cls.all().ancestor(alias_root).filter('alias = ', category_alias).get()
            if existing_alias is None:
                existing_alias = CategoryAlias(parent=alias_root, alias=category_alias)
                existing_alias.put()
            return existing_alias

        return db.run_in_transaction(txn)

    def keyname_for_category(self):
        return "category_" + str(self.key().id())

    def rename(self, new_name):
        self.alias = new_name
        self.put()

class Category(db.Model):
    pass

class Item(db.Model):
    name = db.StringProperty()
def get_or_create_item(category_name, item_name):
    def txn(category_keyname):
        category_key = db.Key.from_path('Category', category_keyname)
        existing_category = db.get(category_key)
        if existing_category is None:
            existing_category = Category(key_name=category_keyname)
            existing_category.put()

        existing_item = Item.all().ancestor(existing_category).filter('name = ', item_name).get()
        if existing_item is None:
            existing_item = Item(parent=existing_category, name=item_name)
            existing_item.put()
        return existing_item

    cat_alias = CategoryAlias.get_or_create(category_name)
    return db.run_in_transaction(txn, cat_alias.keyname_for_category())
Caveat emptor: I have not tested this code. Obviously, you will need to change it to match your actual models, but I think that the principles that it uses are sound.
UPDATE:
Simon, in your comment, you mostly have the right idea; although, there is an important subtlety that you shouldn't miss. You'll notice that the Category entities are not children of the dummy root. They do not share a parent, and they are themselves the root entities in their own entity groups. If the Category entities did all have the same parent, that would make one giant entity group, and you'd have a performance nightmare because each entity group can only have one transaction running on it at a time.
Rather, the CategoryAlias entities are the children of the bogus root entity. That allows me to query inside a transaction, but the entity group doesn't get too big because the Items that belong to each Category aren't attached to the CategoryAlias.
Also, the data in the CategoryAlias entity can change without changing the entity's key, and I am using the alias's key as a data point for generating a keyname that can be used in creating the actual Category entities themselves. So, I can change the name that is stored in the CategoryAlias without losing my ability to match that entity with the same Category.
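So, with the classes above, a rename looks roughly like this (untested, same caveat as the rest of the code):

# The keyname derived from the alias entity's id never changes, so existing
# Items stay attached to the same Category.
alias = CategoryAlias.get_or_create("Cooking")
alias.rename("Cookery")

# lookups by the new name resolve to the same Category entity
item = get_or_create_item("Cookery", "Pancakes")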
A couple of things to note (I think they're probably just typos) -
The first line of your transactional method calls get() on a key - this is not a documented function. You don't need to have the actual category object in the function anyway - the key is sufficient in both of the places where you are using the category entity.
You don't appear to be calling put() on either the category or the item (but since you say you are getting data in the datastore, I assume you have left this out for brevity?)
As far as a solution goes - you could attempt to add a value in memcache with a reasonable expiry -
if memcache.add("category.%s" % category_name, True, 60): create_category(...)
This at least stops you creating multiples. It is still a bit tricky to know what to do if the query does not return the category but you cannot grab the lock from memcache; that means the category is in the process of being created.
If the originating request comes from the task queue, then just throw an exception so the task gets re-run.
Otherwise you could wait a bit and query again, although this is a little dodgy.
If the request comes from the user, then you could tell them there has been a conflict and to try again.
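A rough sketch of how those cases might fit together; get_category and create_category are placeholders for your own query and creation helpers:

import time
from google.appengine.api import memcache

def get_or_create_category(category_name):
    category = get_category(category_name)            # plain query, placeholder
    if category is not None:
        return category
    if memcache.add("category.%s" % category_name, True, 60):
        return create_category(category_name)         # placeholder
    # someone else holds the lock: the category is being created right now
    time.sleep(0.5)
    category = get_category(category_name)
    if category is None:
        raise RuntimeError("category %r is being created, try again" % category_name)
    return category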