I have some code with a Widget object that must undergo some processing periodically. Widgets have a relationship to a Process object that tracks individual processing attempts and holds data about those attempts, such as state information, start and end times, and the result. The relationship looks something like this:
class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    attempts = relationship('Process')

class Process(Base):
    __tablename__ = 'process'
    id = Column(Integer, primary_key=True)
    widget_id = Column(Integer, ForeignKey('widget.id'))
    start = Column(DateTime)
    end = Column(DateTime)
    success = Column(Boolean)
I want a method on Widget that checks whether it's time to process that widget yet. It needs to look at all the attempts, find the most recent successful one, and see whether it is older than the threshold.
One option is to iterate over Widget.attempts with a list comprehension. Assuming now and delay are reasonable datetime and timedelta objects, something like this works when defined as a method on Widget:
def ready(self):
    recent_success = [
        attempt for attempt in self.attempts
        if attempt.success is True and attempt.end >= now - delay
    ]
    if recent_success:
        return False
    return True
That seems like good idiomatic Python, but it doesn't make use of the power of the SQL database backing the data, and it's probably less efficient than an equivalent SQL query, especially once there are a large number of Process objects in the attempts list. I'm having a hard time figuring out the best way to implement this as a query, though.
It's easy enough to run the query inside Widget, with something like this:
def ready(self):
    recent_success = session.query(Process).filter(
        and_(
            Process.widget_id == self.id,
            Process.success == True,
            Process.end >= now - delay
        )
    ).order_by(Process.end.desc()).first()
    if recent_success:
        return False
    return True
But I run into problems in unit tests with getting session set properly inside the module that defines Widget. It seems to me that's a poor style choice, and probably not how SQLAlchemy objects are meant to be structured.
I could make the ready() function something external to the Widget class, which would fix the problems with setting session in unit tests, but that seems like poor OO structure.
I think the ideal would be if I could somehow filter Widget.attempts with SQL-like code that's more efficient than a list comprehension, but I haven't found anything that suggests that's possible.
What is actually the best approach for something like this?
You are thinking in the right direction. Any solution inside the Widget instance implies that you need to load and check every instance; moving the check into an external query gives better performance and easier testability.
You can get all the Widget instances that need to be scheduled for the next processing run using this query (note the ~, which negates the EXISTS so that we select widgets without a recent successful attempt):
q = (
    session
    .query(Widget)
    .filter(~Widget.attempts.any(and_(
        Process.success == True,
        Process.end >= now - delay,
    )))
)
widgets_to_process = q.all()
If you really want to have a method on the model, I would not create a separate query, but just use the relationship:
def ready(self, at_time):
    # at_time = now - delay
    recent_successes = [
        attempt
        for attempt in self.attempts
        if attempt.success and attempt.end >= at_time
    ]
    # ready for processing when there is no recent successful attempt
    return not recent_successes
I have been working a bit with the Peewee ORM, and I'm trying to understand how to get the url field from the table. The condition is that the visible column needs to be true as well: if visible is True and store_id is 4, then return all the urls as a set.
I currently have something like this:
from peewee import (
    Model,
    TextField,
    BooleanField,
    IntegrityError
)
from playhouse.pool import PooledPostgresqlDatabase

# -------------------------------------------------------------------------
# Connection to Postgresql
# -------------------------------------------------------------------------
postgres_pool = PooledPostgresqlDatabase(
    'xxxxxxx',
    host='xxxxxxxx',
    user='xxxxxxxx',
    password='xxxxxx',
    max_connections=20,
    stale_timeout=30,
)

# ------------------------------------------------------------------------------- #
class Products(Model):
    store_id = TextField(column_name='store_id')
    url = TextField(column_name='url')
    visible = BooleanField(column_name='visible')

    class Meta:
        database = postgres_pool
        db_table = "develop"

    @classmethod
    def get_urls(cls):
        try:
            return set([i.url for i in cls.select().where((cls.store_id == 4) & (cls.visible))])
        except IntegrityError:
            return None
However, using this method takes around 0.13s, which feels a bit longer than it should. I believe that's due to the for loop and to building the set(). I wonder if there is a possibility that Peewee can do something like cls.select(cls.url).where((cls.store_id == 4) & (cls.visible)) and return the result as a set()?
How many products do you have? How big is this set? Why not use distinct() so that the database de-duplicates them for you? What indexes do you have? All of these questions are much more pertinent than "how do I make this python loop faster".
I'd suggest that you need an index on (store_id, visible), or a partial index on store_id where visible:
create index "product_urls" on "products" ("store_id") where "visible"
You could even use a covering index but this may take up a lot of disk space:
create index "product_urls" on "products" ("store_id", "url") where visible
Once you've got the actual query sped up with an index, you can also use distinct() to make the db de-dupe the URLs before sending them to Python. Additionally, since you only need the URL, just select that column and use the tuples() method to avoid constructing full model instances:
@classmethod
def get_urls(cls):
    query = cls.select(cls.url).where((cls.store_id == 4) & cls.visible)
    return set(url for url, in query.distinct().tuples())
Lastly please read the docs: http://docs.peewee-orm.com/en/latest/peewee/querying.html#iterating-over-large-result-sets
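For very large result sets, here is a sketch of what those docs describe, assuming the Products model from the question: iterator() streams rows instead of caching them all in memory, and it chains naturally with distinct() and tuples().
# a sketch for large result sets: iterator() avoids caching every row,
# which the linked docs recommend; tuples() skips model construction
query = (Products
         .select(Products.url)
         .where((Products.store_id == 4) & Products.visible)
         .distinct()
         .tuples())
urls = set(url for url, in query.iterator())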
I am currently trying to create a CustomUser entity in my app engine project upon a user signing in for the first time. I would like CustomUser entities to be unique, and I would like to prevent the same entity from being created more than once. This would be fairly easy to do if I can supply it with an ancestor upon entity creation, as this will make the transaction strongly consistent.
Unfortunately, this is not the case, due to the fact that a CustomUser entity is a root entity, and it will thus be eventually consistent, not strongly consistent. Because of this, there are instances when the entity is created twice, which I would like to prevent as this will cause problems later on.
So the question is, is there a way I can prevent the entity from being created more than once? Or at least make the commit of the ancestor entity strongly consistent to prevent duplication? Here's my code, and interim (hacky) solution.
import time
import logging

from google.appengine.ext import ndb

# sample Model
class CustomUser(ndb.Model):
    user_id = ndb.StringProperty(required=True)
    some_data = ndb.StringProperty(required=True)
    some_more_data = ndb.StringProperty(required=True)

externally_based_user_id = "id_taken_from_somewhere_else"

# check if this id already exists in the Model.
# If it does not exist yet, create it
user_entity = CustomUser.query(
    CustomUser.user_id == externally_based_user_id,
    ancestor=None).get()

if not user_entity:
    # prepare the entity
    user_entity = CustomUser(
        user_id=externally_based_user_id,
        some_data="some information",
        some_more_data="even more information",
        parent=None
    )
    # write the entity to ndb
    user_key = user_entity.put()
    # inform of success
    logging.info("user " + str(user_key) + " created")

    # eventual consistency workaround - loop and keep checking if the
    # entity has already been created
    #
    # I understand that a while loop may not be the wisest solution.
    # I can also use a for loop with n range to avoid going around the loop infinitely.
    # Both however seem to be band-aid solutions
    entity_check = None
    while not entity_check:
        entity_check = CustomUser.query(
            CustomUser.user_id == externally_based_user_id,
            ancestor=None).get()
        # time.sleep to prevent the instance from consuming too much processing power and
        # memory, although I'm not certain if this has any real effect apart from
        # reducing the number of loops
        if not entity_check:
            time.sleep(0.5)
EDIT: Solution I ended up using based on both of Daniel Roseman's suggestions. This can be further simplified by using get_or_insert as suggested by voscausa. I've stuck to using the usual way of doing things to make things clearer.
import logging

from google.appengine.ext import ndb

# ancestor Model
# we can skip the creation of an empty class like this, and just use a string when
# retrieving a key
class PhantomAncestor(ndb.Model):
    pass

# sample Model
class CustomUser(ndb.Model):
    # user_id is now redundant, since we use it as the entity's key id
    # user_id = ndb.StringProperty(required=True)
    some_data = ndb.StringProperty(required=True)
    some_more_data = ndb.StringProperty(required=True)

externally_based_user_id = "id_taken_from_somewhere_else"

# construct the entity key using information we know.
# entity_key = ndb.Key(*arbitrary ancestor kind*, *arbitrary ancestor id*, *Model*, *user_id*)
# we can also use the string "PhantomAncestor" instead of passing in an empty class like so:
# entity_key = ndb.Key("SomeRandomString", externally_based_user_id, CustomUser, externally_based_user_id)
# check this page on how to construct a key: https://cloud.google.com/appengine/docs/python/ndb/keyclass#Constructors
entity_key = ndb.Key(PhantomAncestor, externally_based_user_id, CustomUser, externally_based_user_id)

# check if this id already exists in the Model.
user_entity = entity_key.get()

# If it does not exist yet, create it
if not user_entity:
    # prepare the entity with the full key constructed above (including the
    # PhantomAncestor parent), so the entity lands at the same key we get() on
    user_entity = CustomUser(
        # user_id=externally_based_user_id,
        some_data="some information",
        some_more_data="even more information",
        key=entity_key
    )
    # write the entity to ndb
    user_key = user_entity.put()
    # inform of success
    logging.info("user " + str(user_key) + " created")

# we should also be able to use CustomUser.get_or_insert to simplify the code further
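For reference, a minimal sketch of the get_or_insert simplification mentioned above, assuming the same models. get_or_insert performs the key lookup and the creation in a single transaction, so the entity can never be created twice (and since a key get is already strongly consistent, the phantom ancestor is optional here):
# transactional "fetch by key, create if missing" in one call
user_entity = CustomUser.get_or_insert(
    externally_based_user_id,  # becomes the entity's key id
    parent=ndb.Key(PhantomAncestor, externally_based_user_id),
    some_data="some information",
    some_more_data="even more information",
)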
A couple of things here.
First, note that the ancestor doesn't have to actually exist. If you want a strongly consistent query, you can use any arbitrary key as an ancestor.
A second option would be to use user_id as your key. Then you can do a key get, rather than a query, which again is strongly consistent.
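A minimal sketch of that second option, reusing the names from the question: make the external id the key id of a root entity, then fetch it with a key get, which is strongly consistent:
# build the key from the external id; no ancestor needed for a key get
key = ndb.Key(CustomUser, externally_based_user_id)
user_entity = key.get()
if user_entity is None:
    user_entity = CustomUser(
        key=key,
        # kept to satisfy the original model's required property;
        # it becomes redundant once the id is part of the key
        user_id=externally_based_user_id,
        some_data="some information",
        some_more_data="even more information",
    )
    user_entity.put()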
I'm working on a project in SQLAlchemy. I've got a Command class which has custom serialization/deserialization methods called toBinArray() and fromBinArray(bytes). I use them for TCP communication (I don't want to use pickle because my functions produce smaller output).
Command has several subclasses; let's call them CommandGet, CommandSet, etc. They have additional methods and attributes, and they redefine the serialization methods to keep track of their own attributes. I'm keeping all of them in one table using the polymorphic_identity mechanism.
The problem is that there are a lot of subclasses and each has different attributes. I previously wrote a mapping for each of them, but that way the table has a huge number of columns.
I would like to write a mechanism that serializes every instance (using self.toBinArray()) into the attribute self._bin_array (stored in a Binary column) before every write to the DB, and loads the attributes (using self.fromBinArray(value)) after every load of an instance from the DB.
I have already found the answer to part of my question: I can call self.fromBinArray(self._bin_array) in a function with the @orm.reconstructor decorator. It is inherited by every Command subclass and executes the proper inherited version of fromBinArray(). My question is how to automate serialization on writing to the DB (I know I can set self._bin_array manually, but that's very troublesome).
P.S. Part of my code, my main class:
class Command(Base):
    __tablename__ = "commands"

    dbid = Column(Integer, Sequence("commands_seq"), primary_key=True)
    cmd_id = Column(SmallInteger)
    instance_dbid = Column(Integer, ForeignKey("instances.dbid"))
    type = Column(String(20))
    _bin_array = Column(Binary)

    __mapper_args__ = {
        "polymorphic_on": type,
        "polymorphic_identity": "Command",
    }

    @orm.reconstructor
    def init_on_load(self):
        self.fromBinArray(self._bin_array)

    def fromBinArray(self, b):
        (...)

    def toBinArray(self):
        (...)
EDIT: I've found a solution (below, in an answer), but are there any other solutions? Maybe some shortcut to attach the event-listening function inside the class body?
It looks like the solution was simpler than I expected: you need to use an event listener for the before_insert (and/or before_update) event. I've found information (source) that
reconstructor() is a shortcut into a larger system of “instance level”
events, which can be subscribed to using the event API - see
InstanceEvents for the full API description of these events.
And that gave me the clue:
@event.listens_for(Command, 'before_insert', propagate=True)
def serialize_before_insert(mapper, connection, target):
    print("serialize_before_insert")
    target._bin_array = target.toBinArray()
You can also use the event.listen() function to bind the event listener, but I personally prefer the decorator way. It's very important to add propagate=True in the declaration so subclasses can inherit the listener!
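For completeness, a sketch of the equivalent event.listen() registration for the same listener:
# equivalent to the decorator above; propagate=True again lets
# subclasses of Command inherit the listener
event.listen(Command, 'before_insert', serialize_before_insert, propagate=True)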
I'm using a declarative SQLAlchemy class to perform computations. Part of the computation requires me to perform it for all configurations provided by a different table; there are no foreign key relationships between the two tables.
This analogy is nothing like my real application, but hopefully will help to comprehend what I want to happen.
I have a set of cars and a list of paint colors.
The car object has a factory which provides the car in all possible colors:
from sqlalchemy import *
from sqlalchemy.orm import *

def PaintACar(car, color):
    pass

Base = declarative_base()

class Colors(Base):
    __tablename__ = u'colors'
    id = Column('id', Integer, primary_key=True)
    color = Column('color', Unicode)

class Car(Base):
    __tablename__ = u'car'
    id = Column('id', Integer, primary_key=True)
    model = Column('model', Unicode)

    # is this somehow possible?
    all_color_objects = collection(...)

    # I know this is possible, but would like to know if there's another way
    @property
    def all_colors(self):
        s = Session.object_session(self)
        return s.query(Colors).all()

    def CarColorFactory(self):
        for color in self.all_color_objects:
            yield PaintACar(self, color)
My question: is it possible to produce all_color_objects somehow, without having to resort to finding the session and manually issuing a query as in the all_colors property?
It's been a while, so I'm providing the best answer I saw (as a comment by zzzeek). Basically, I was looking for one-off syntactic sugar. My original 'ugly' implementation works just fine.
What better way would there be here besides getting a Session and producing the query you want? Are you looking for being able to add to the collection and have that automatically flush things? (just add the objects to the Session?) Do you not like using object_session(self)? (you can build some mixin class or something that hides that for you?) It's not really clear what the problem is. The objects here have no relationship to the parent class, so there's no particular intelligence SQLAlchemy would be able to add.
– zzzeek Jun 17 at 5:03
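For what it's worth, a minimal sketch of the mixin zzzeek hints at, hiding the object_session() lookup (the mixin name is an assumption, not from the original):
# reusable mixin so mapped classes don't repeat the session lookup
class SessionMixin(object):
    @property
    def session(self):
        return Session.object_session(self)

class Car(Base, SessionMixin):
    __tablename__ = u'car'
    id = Column('id', Integer, primary_key=True)
    model = Column('model', Unicode)

    @property
    def all_colors(self):
        # the same query as before, with the boilerplate hidden
        return self.session.query(Colors).all()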
For performance reasons, I've got a denormalized database where some tables contain data which has been aggregated from many rows in other tables. I'd like to maintain this denormalized data cache by using SQLAlchemy events. As an example, suppose I was writing forum software and wanted each Thread to have a column tracking the combined word count of all comments in the thread in order to efficiently display that information:
class Thread(Base):
    __tablename__ = 'thread'

    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    title = Column(UnicodeText(), nullable=False)
    word_count = Column(Integer, nullable=False, default=0)

class Comment(Base):
    __tablename__ = 'comment'

    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    thread_id = Column(UUID, ForeignKey('thread.id', ondelete='CASCADE'), nullable=False)
    thread = relationship('Thread', backref='comments')
    message = Column(UnicodeText(), nullable=False)

    @property
    def word_count(self):
        return len(self.message.split())
So every time a comment is inserted (for the sake of simplicity, let's say that comments are never edited or deleted), we want to update the word_count attribute on the associated Thread object. I'd want to do something like:
def after_insert(mapper, connection, target):
    thread = target.thread
    thread.word_count = sum(c.word_count for c in thread.comments)
    print("updated cached word count to", thread.word_count)

event.listen(Comment, "after_insert", after_insert)
So when I insert a Comment, I can see the event firing and see that it has correctly calculated the word count, but that change is not saved to the Thread row in the database. I don't see any caveats about updating other tables in the after_insert documentation, though I do see some caveats in some of the others, such as after_delete.
So is there a supported way to do this with SQLAlchemy events? I'm already using SQLAlchemy events for lots of other things, so I'd like to do everything that way instead of having to write database triggers.
The after_insert() event is one way to do this, and you might notice it is passed a SQLAlchemy Connection object, instead of a Session as is the case with other flush-related events. The mapper-level flush events are intended to be used normally to invoke SQL directly on the given Connection:
@event.listens_for(Comment, "after_insert")
def after_insert(mapper, connection, target):
    thread_table = Thread.__table__
    thread = target.thread
    connection.execute(
        thread_table.update().
        where(thread_table.c.id == thread.id).
        values(word_count=sum(c.word_count for c in thread.comments))
    )
    print("updated cached word count to", thread.word_count)
What is notable here is that invoking an UPDATE statement directly is also a lot more performant than running that attribute change through the whole unit-of-work process again.
However, an event like after_insert() isn't really needed here, as we know the value of "word_count" before the flush even happens. We actually know it as soon as Comment and Thread objects are associated with each other, and we could just as well keep Thread.word_count completely fresh in memory at all times using attribute events:
def _word_count(msg):
    return len(msg.split())

@event.listens_for(Comment.message, "set")
def message_set(target, value, oldvalue, initiator):
    if target.thread is not None:
        target.thread.word_count += _word_count(value) - _word_count(oldvalue)

@event.listens_for(Comment.thread, "set")
def thread_set(target, value, oldvalue, initiator):
    # the new Thread, if any
    if value is not None:
        value.word_count += _word_count(target.message)
    # the old Thread, if any
    if oldvalue is not None:
        oldvalue.word_count -= _word_count(target.message)
the great advantage of this method is that there's also no need to iterate through thread.comments, which for an unloaded collection means another SELECT is emitted.
Still another method is to do it in before_flush(). Below is a quick and dirty version, which could be refined to analyze more carefully what has changed in order to determine whether word_count needs to be updated or not:
@event.listens_for(Session, "before_flush")
def before_flush(session, flush_context, instances):
    for obj in session.new | session.dirty:
        if isinstance(obj, Thread):
            obj.word_count = sum(c.word_count for c in obj.comments)
        elif isinstance(obj, Comment):
            obj.thread.word_count = sum(c.word_count for c in obj.thread.comments)
I'd go with the attribute event method as it is the most performant and up-to-date.
You can do this with SQLAlchemy-Utils aggregated columns: http://sqlalchemy-utils.readthedocs.org/en/latest/aggregates.html
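For illustration, a minimal sketch based on the aggregated decorator from those docs. Note that word_count must be a real column on Comment here (an assumption beyond the original models), because the aggregate is computed in SQL:
import sqlalchemy as sa
from sqlalchemy.orm import relationship
from sqlalchemy_utils import aggregated

class Thread(Base):
    __tablename__ = 'thread'
    id = sa.Column(sa.Integer, primary_key=True)
    title = sa.Column(sa.UnicodeText, nullable=False)
    comments = relationship('Comment', backref='thread')

    # SQLAlchemy-Utils keeps this column up to date whenever
    # related Comment rows are inserted, updated, or deleted
    @aggregated('comments', sa.Column(sa.Integer, default=0))
    def word_count(self):
        return sa.func.sum(Comment.word_count)

class Comment(Base):
    __tablename__ = 'comment'
    id = sa.Column(sa.Integer, primary_key=True)
    thread_id = sa.Column(sa.Integer, sa.ForeignKey('thread.id'))
    message = sa.Column(sa.UnicodeText, nullable=False)
    # stored per comment so the database can sum it
    word_count = sa.Column(sa.Integer, nullable=False, default=0)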