SQLAlchemy: Conditional autoincrement - python

I want to create a flat forum, where threads are not a separate table, with a composite primary key for posts.
So posts have two fields forming a natural key: thread_id and post_number, where the former is the ID of the thread they are part of, and the latter is their position in the thread. If you aren't convinced, check below the line.
My problem is that I don't know how to tell SQLAlchemy:
when committing the addition of new Post instances with thread_id tid, look up how many posts with thread_id tid already exist, and autoincrement from that number on.
Why do I think that schema is a good idea? Because it's natural and performant:
class Post(Base):
    number = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    thread_id = Column(Integer, primary_key=True, autoincrement=False, nullable=False)
    title = Column(Text)  # nullable for non-first posts
    text = Column(Text, nullable=False)
    ...
PAGESIZE = 10

# test
tid = 5
page = 4

Entire thread (query):

thread5 = session.query(Post).filter_by(thread_id=5)

Thread title:

title = thread5.filter_by(number=0).one().title

Thread page:

page4 = thread5.filter(
    Post.number >= (page * PAGESIZE),
    Post.number < ((page + 1) * PAGESIZE)).all()
# or
page4 = thread5.offset(page * PAGESIZE).limit(PAGESIZE).all()

Number of pages:

ceil(thread5.count() / PAGESIZE)

You can probably do this with an SQL expression as a default value (see the default argument). Give it a callable like this:
from sqlalchemy.sql import select, func

def maxnumber_for_threadid(context):
    # next number: current max + 1, or 0 for the first post in a thread
    return select([func.coalesce(func.max(post_table.c.number) + 1, 0)]).\
        where(post_table.c.thread_id == context.current_parameters['thread_id'])
I'm not absolutely sure you can return an SQL expression from a default callable--you may have to actually execute this query and return a scalar value inside the callback. (The cursor should be available from the context parameter.)
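If the expression form doesn't pan out, a minimal sketch of the execute-and-return-a-scalar variant could look like this (assuming the post_table Table from the question; context.connection and context.current_parameters are part of the execution context SQLAlchemy passes to default callables):

from sqlalchemy.sql import select, func

def nextnumber_for_threadid(context):
    tid = context.current_parameters['thread_id']
    # run the query on the connection doing the INSERT and return a plain int
    maxnumber = context.connection.execute(
        select([func.max(post_table.c.number)])
        .where(post_table.c.thread_id == tid)
    ).scalar()
    return 0 if maxnumber is None else maxnumber + 1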
However, I strongly recommend you do what @kindall says and just use another auto-incrementing sequence for the number column. What you want to do is actually very tricky to get right even without SQLAlchemy. For example, if you are using an MVCC database, you need to introduce special row-level locking so that the number of rows with a matching thread_id does not change while you are running the transaction. How this is done is database-dependent. For example, with MySQL InnoDB, you need to do something like this:
BEGIN TRANSACTION;
SELECT MAX(number)+1 FROM posts WHERE thread_id=? FOR UPDATE;
INSERT INTO posts (thread_id, number) VALUES (?, ?); -- number is from previous query
COMMIT;
If you didn't use FOR UPDATE, then conceivably another connection trying to insert a new post into the same thread at the same time could have gotten the same value for number.
So rather than being performant, post inserts are actually quite slow (relatively speaking) because of the extra query and locking required.
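For reference, a sketch of that same locked pattern through SQLAlchemy (assuming the Post model above; Query.with_for_update() emits the FOR UPDATE clause on backends that support it):

from sqlalchemy import func

# lock the thread's rows, compute the next number, insert, commit
max_number = (
    session.query(func.max(Post.number))
    .filter(Post.thread_id == tid)
    .with_for_update()
    .scalar()
)
post = Post(thread_id=tid,
            number=0 if max_number is None else max_number + 1,
            text="...")
session.add(post)
session.commit()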
All this is resolved by using a separate sequence and not worrying about post number incrementing only within a thread_id.

You should just use a global post number that increments for posts in any thread. Then you don't need to figure out the right number to use for a given thread. A given thread, then, might have posts numbered 7, 20, 42, 51, and so on. This does not matter because you can easily get the number of posts in the thread from the size of the recordset you get back from the query, and you can easily number the posts in the HTML output separately from the actual post numbers.
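A sketch of what that simpler schema and display-time numbering might look like (render() is a hypothetical view function):

class Post(Base):
    __tablename__ = 'posts'
    id = Column(Integer, primary_key=True)  # one global auto-incrementing id
    thread_id = Column(Integer, nullable=False, index=True)
    title = Column(Text)
    text = Column(Text, nullable=False)

posts = session.query(Post).filter_by(thread_id=tid).order_by(Post.id).all()
# number the posts 0..n-1 for display, independently of their actual ids
for display_number, post in enumerate(posts):
    render(display_number, post)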


Filtering a relationship attribute in SQLAlchemy

I have some code with a Widget object that must undergo some processing periodically. Widgets have a relationship to a Process object that tracks individual processing attempts and holds data about those attempts, such as state information, start and end times, and the result. The relationship looks something like this:
class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    attempts = relationship('Process')

class Process(Base):
    __tablename__ = 'process'
    id = Column(Integer, primary_key=True)
    widget_id = Column(Integer, ForeignKey('widget.id'))
    start = Column(DateTime)
    end = Column(DateTime)
    success = Column(Boolean)
I want to have a method on Widget to check whether it's time to process that widget yet, or not. It needs to look at all the attempts, find the most recent successful one, and see if it is older than the threshold.
One option is to iterate Widget.attempts using a list comprehension. Assuming now and delay are reasonable datetime and timedelta objects, then something like this works when defined as a method on Widget:
def ready(self):
    recent_success = [attempt for attempt in self.attempts
                      if attempt.success is True and attempt.end >= now - delay]
    if recent_success:
        return False
    return True
That seems like good idiomatic Python, but it's not making good use of the power of the SQL database backing the data, and it's probably less efficient than running a similar SQL query, especially once there are a large number of Process objects in the attempts list. I'm having a hard time figuring out the best way to implement this as a query, though.
It's easy enough to run the query inside Widget, something like this:
def ready(self):
    recent_success = session.query(Process).filter(
        and_(
            Process.widget_id == self.id,
            Process.success == True,
            Process.end >= now - delay
        )
    ).order_by(Process.end.desc()).first()
    if recent_success:
        return False
    return True
But I run into problems in unit tests with getting session set properly inside the module that defines Widget. It seems to me that's a poor style choice, and probably not how SQLAlchemy objects are meant to be structured.
I could make the ready() function something external to the Widget class, which would fix the problems with setting session in unit tests, but that seems like poor OO structure.
I think the ideal would be if I could somehow filter Widget.attempts with SQL-like code that's more efficient than a list comprehension, but I haven't found anything that suggests that's possible.
What is actually the best approach for something like this?
You are thinking in the right direction. Any solution inside the Widget instance means processing the widgets one at a time; handling this with an external query gives better performance and easier testability.
You can get all the Widget instances which need to be scheduled for the next processing using this query (note the ~ negation: we want the widgets without a recent successful attempt):

q = (
    session
    .query(Widget)
    .filter(~Widget.attempts.any(and_(
        Process.success == True,
        Process.end >= now - delay,
    )))
)
widgets_to_process = q.all()
If you really want to have a method on the model, I would not create a separate query, but just use the relationship:

def ready(self, at_time):
    successes = [
        attempt
        for attempt in self.attempts
        if attempt.success and attempt.end >= at_time  # at_time = now - delay
    ]
    return not successes  # ready when there is no recent success
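As for filtering Widget.attempts with SQL directly, a lazy='dynamic' relationship turns the collection into a Query bound to whatever session the instance is attached to, which also sidesteps the module-level session problem in unit tests. A sketch under those assumptions:

class Widget(Base):
    __tablename__ = 'widget'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    # 'dynamic' makes Widget.attempts a Query instead of a list
    attempts = relationship('Process', lazy='dynamic')

    def ready(self, at_time):
        recent = self.attempts.filter(
            Process.success == True,
            Process.end >= at_time,
        ).first()
        return recent is None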

Why am I getting a duplicate foreign key error?

I'm trying to use Python and SQLAlchemy to insert stuff into the database, but it's giving me a duplicate foreign key error. I didn't have any problems when I was executing the SQL queries to create the tables earlier.
You're getting the duplicate-key error because you've written the code as a one-to-one relationship, when it is at least a one-to-many relationship.
SQL doesn't let you insert more than one row with the same value in a primary key or unique column. When you try to insert the same value again without having set up a proper relationship between the tables, the database rejects the insert and throws the error you're getting.
The code below sets up a one-to-many relationship for your tables, using Flask-SQLAlchemy to connect to the database. If you aren't using Flask yourself, figure out the translation, or use it.
class ChildcareUnit(db.Model):
    Childcare_id = db.Column('ChildcareUnit_id', db.Integer, primary_key=True)
    fullname = db.Column(db.String(250), nullable=False)
    shortname = db.Column(db.String(250), nullable=False)
    _Menu = db.relationship('Menu')

    def __init__(self, fullname, shortname):
        self.fullname = fullname
        self.shortname = shortname

    def __repr__(self):
        return '<ChildcareUnit %r>' % self.Childcare_id

class Menu(db.Model):
    menu_id = db.Column('menu_id', db.Integer, primary_key=True)
    menu_date = db.Column('Date', db.Date, nullable=True)
    idChildcareUnit = db.Column(db.Integer, db.ForeignKey('ChildcareUnit.ChildcareUnit_id'))
    ChildcareUnits = db.relationship('ChildcareUnit')

    def __init__(self, menu_date):
        self.menu_date = menu_date

    def __repr__(self):
        return '<Menu %r>' % self.menu_id
A couple of differences to note here. The columns are now db.Column(), not Column(). That's Flask-SQLAlchemy at work: the db object carries the connection to your database, and defining columns through it ties the tables to that connection.
Also, look at the db.relationship() attributes added to the tables. This is what tells the ORM that the two tables have a one-to-many relationship. (It doesn't actually have to appear on both sides: declaring it on one side with a backref covers both directions, as sketched below.)
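A minimal sketch of that one-sided form (attribute names are illustrative; the foreign key points at Flask-SQLAlchemy's default childcare_unit table name):

class ChildcareUnit(db.Model):
    ChildcareUnit_id = db.Column(db.Integer, primary_key=True)
    # backref adds a .childcare_unit attribute to Menu automatically
    menus = db.relationship('Menu', backref='childcare_unit')

class Menu(db.Model):
    menu_id = db.Column(db.Integer, primary_key=True)
    idChildcareUnit = db.Column(
        db.Integer, db.ForeignKey('childcare_unit.ChildcareUnit_id'))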
Lastly, look at __repr__. One correction to a common misconception: __repr__ has nothing to do with generating foreign keys, and neither __repr__ nor __str__ affects the schema or performance at all. They only control how your model instances are printed.
__repr__ is meant to produce an unambiguous, developer-facing representation, which is what you see in logs and the interactive shell.
__str__ is meant to produce human-friendly output for display.
Defining __repr__ is still well worth doing, because the default <Menu object at 0x...> output makes debugging much harder.
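For example:

class Menu(db.Model):
    menu_id = db.Column(db.Integer, primary_key=True)

    def __repr__(self):
        return '<Menu %r>' % self.menu_id   # unambiguous: <Menu 42>

    def __str__(self):
        return 'Menu #%s' % self.menu_id    # human-friendly: Menu #42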

peewee, mysql and auto incrementing id

I have a model in the peewee ORM with a unique=True field. I'm saving data to my MySQL db like this:
try:
    model.save()
except IntegrityError:  # do not save if it's already in db
    pass
But when peewee tries to save data that is already in the db, MySQL increments the id anyway and the order of ids is broken. How do I avoid this behavior?
Here's the model I'm trying to save:
class FeedItem(Model):
    vendor = ForeignKeyField(Vendor, to_field='name')
    url = CharField(unique=True)
    title = CharField()
    pub = DateTimeField()
    rating = IntegerField(default=0)
    img = CharField(default='null')

    def construct(self, vendor, url, title):
        self.vendor = vendor
        self.url = url
        self.title = title
        self.pub = datetime.now()
        self.save()

    class Meta:
        database = db
And here's how I'm saving it:
for article in feedparser.parse(vendor.feed)['items']:
    try:
        entry = FeedItem()
        entry.construct(vendor.name, article.link, article.title)
    except IntegrityError:
        pass
MySQL increments id and ids order is broken. How to avoid this behavior?
You don't.
The database-generated identifier is outside your control. It's generated by the database. There's no guarantee that all identifiers have to be sequential and without gaps, just that they're unique. There are any number of things which would result in a number not being present in that sequence, such as:
A record was deleted.
A record was attempted to be inserted, which generated an ID, but the insert in some way failed after that ID was generated.
A record was inserted as part of a transaction which wasn't committed.
A set of IDs was generated to memory as part of an internal optimization in the database engine and the engine went down before the IDs were used.
A record was inserted with an explicit ID, causing the auto-increment feature to re-adjust to the new value.
There may be more I'm not considering. But the point is that you simply don't control that value, the database engine does.
If you want to control that value then don't use autoincrement. Though be aware that this would come with a whole host of other problems that you'd need to solve which autoincrement solves for you. Or you'd have to switch to a GUID instead of an integer, which itself could result in other considerations you'd need to account for.
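If the goal is simply to stop burning ids on duplicates in the common case, you can check for the row instead of relying on the IntegrityError. A sketch using peewee's get_or_create (url is the unique field here; a concurrent race can still consume an id):

for article in feedparser.parse(vendor.feed)['items']:
    # only attempts the INSERT (and consumes an id) when no row
    # with this url exists yet
    entry, created = FeedItem.get_or_create(
        url=article.link,
        defaults={
            'vendor': vendor.name,
            'title': article.title,
            'pub': datetime.now(),
        },
    )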
I'm not positive if this will work but you can try something like:
try:
    with database.atomic():
        model.save()
except IntegrityError:
    pass  # Model already exists.
By wrapping in atomic() the code will execute in a transaction (or savepoint if you are already in a transaction). This may lead to the ID sequence remaining intact.
I agree with David's answer, though, which is that really this is a database detail and should not be part of your application logic. If you need monotonically incrementing IDs you should implement that yourself.

sqlalchemy integrity error

I have a table where the primary key (id) isn't the key by which I distinguish between records. For that, I have a unique constraint across 3 columns.
In order to be able to merge records, I've added a classmethod that retrieves the relevant record if it exists; otherwise, it returns a new record.
class Foo(Base):
    __table_args__ = (sa.UniqueConstraint('bar', 'baz', 'qux'),)

    id = sa.Column(Identifier, sa.Sequence('%s_id_seq' % __tablename__), nullable=False, primary_key=True)
    bar = sa.Column(sa.BigInteger)
    baz = sa.Column(sa.BigInteger)
    qux = sa.Column(sa.BigInteger)
    a1 = sa.Column(sa.BigInteger)
    a2 = sa.Column(sa.BigInteger)

    @classmethod
    def get(cls, bar=None, baz=None, qux=None, **kwargs):
        item = session.query(cls).\
            filter(cls.bar == bar).\
            filter(cls.baz == baz).\
            filter(cls.qux == qux).\
            first()
        if item:
            for k, v in kwargs.iteritems():
                if getattr(item, k) != v:
                    setattr(item, k, v)
        else:
            item = cls(bar=bar, baz=baz, qux=qux, **kwargs)
        return item
This works well most of the time, but every once in a while I get an IntegrityError when trying to merge an item:
foo = Foo.get(**item)
session.merge(foo)
As I understand it, this happens because merge tries to insert a record when a record with the same unique fields already exists.
Is there something wrong with the get function? What am I missing here?
(BTW: I realize this might look awkward, but I need a unique sequential ID, and to avoid problems with DBs that don't support sequences on non-primary-key columns, I made it this way.)
Edit 1:
Changed orm.db to session so the example would be clearer
Edit 2:
I have this system running on several platforms, and it seems that this only happens with MySQL on top of Ubuntu (the other platform is Oracle on top of RedHat). Also, in some weird way, it happens much more to specific end users.
Regarding MySQL, I tried both mysql and mysql+mysqldb as connection strings, but both produce this error.
Regarding the end users, it makes no sense, and I don't know what to make of it ...
Indeed your method is prone to integrity errors. What happens is that when you invoke Foo.get(1, 2, 3) two times, you do not flush the session in between. The second time you invoke it, the ORM query again finds nothing--there is still no actual row in the DB--so a second new object is created with a different identity. Then on commit/flush these two clash, causing the integrity error. This can be avoided by flushing the session after each object creation.
Now, things work differently in SQLAlchemy when the primary key is known in merge/ORM get: if the matching primary key is found already loaded in the session, SQLAlchemy realizes that these two must be the same object. However, no such check is done on unique indexes. It might be possible to do that as well, but it would only make race conditions rarer, as two sessions might still create the same (bar, baz, qux) triplet at the same time.
TL;DR:

else:
    item = cls(bar=bar, baz=baz, qux=qux, **kwargs)
    session.add(item)
    session.flush()
I seem to have found the culprit!
Yesterday I had a situation where clients got the IntegrityError due to wrong timezones, which got me thinking in that direction.
One of the fields I was using to identify the models is a Date column (I had no idea it was related, hence I didn't even mention it, sorry ...). However, since I call the actions from a Flex client using AMF, and since ActionScript doesn't have a Date object without a time component, I'm transferring a Date object with zeroed time. I guess that in some situations I got a different time in those dates, which raised an IntegrityError.
I would have expected SA to strip the time from datetime values in the case of Date columns, as I think the DB would have: Model.date_column == datetime_value should cast the datetime to datetime.date before making the comparison.
As for my solution, I simply make sure that the value is cast to datetime.date before I query the DB ...
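A minimal sketch of that normalization (Foo.date_column is an illustrative name):

import datetime

def as_date(value):
    # collapse datetimes to plain dates so comparisons against a
    # Date column don't depend on a stray time component
    if isinstance(value, datetime.datetime):
        return value.date()
    return value

item = session.query(Foo).filter(Foo.date_column == as_date(value)).first()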
So far, yesterday and today have been quiet with no complaints. I'll keep an eye on it and report should anything change ...
Thank you all for your help.
Cheers,
Ofir

Can SQLAlchemy events be used to update a denormalized data cache?

For performance reasons, I've got a denormalized database where some tables contain data which has been aggregated from many rows in other tables. I'd like to maintain this denormalized data cache by using SQLAlchemy events. As an example, suppose I was writing forum software and wanted each Thread to have a column tracking the combined word count of all comments in the thread in order to efficiently display that information:
class Thread(Base):
    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    title = Column(UnicodeText(), nullable=False)
    word_count = Column(Integer, nullable=False, default=0)

class Comment(Base):
    id = Column(UUID, primary_key=True, default=uuid.uuid4)
    thread_id = Column(UUID, ForeignKey('thread.id', ondelete='CASCADE'), nullable=False)
    thread = relationship('Thread', backref='comments')
    message = Column(UnicodeText(), nullable=False)

    @property
    def word_count(self):
        return len(self.message.split())
So every time a comment is inserted (for the sake of simplicity let's say that comments are never edited or deleted), we want to update the word_count attribute on the associated Thread object. So I'd want to do something like
def after_insert(mapper, connection, target):
    thread = target.thread
    thread.word_count = sum(c.word_count for c in thread.comments)
    print("updated cached word count to", thread.word_count)

event.listen(Comment, "after_insert", after_insert)
So when I insert a Comment, I can see the event firing and see that it has correctly calculated the word count, but that change is not saved to the Thread row in the database. I don't see any caveats about updating other tables in the after_insert documentation, though I do see some caveats in some of the others, such as after_delete.
So is there a supported way to do this with SQLAlchemy events? I'm already using SQLAlchemy events for lots of other things, so I'd like to do everything that way instead of having to write database triggers.
the after_insert() event is one way to do this, and you might notice it is passed a SQLAlchemy Connection object, instead of a Session as is the case with other flush-related events. The mapper-level flush events are normally intended to be used to invoke SQL directly on the given Connection:
@event.listens_for(Comment, "after_insert")
def after_insert(mapper, connection, target):
    thread_table = Thread.__table__
    thread = target.thread
    connection.execute(
        thread_table.update().
        where(thread_table.c.id == thread.id).
        values(word_count=sum(c.word_count for c in thread.comments))
    )
    print("updated cached word count to", thread.word_count)
What is notable here is that invoking an UPDATE statement directly is also a lot more performant than running that attribute change through the whole unit-of-work process again.
However, an event like after_insert() isn't really needed here, as we know the value of word_count before the flush even happens. Since Comment and Thread objects are associated with each other as they are created, we could just as well keep Thread.word_count completely fresh in memory at all times using attribute events:
from sqlalchemy.orm.attributes import NO_VALUE

def _word_count(msg):
    return len(msg.split()) if msg else 0

@event.listens_for(Comment.message, "set")
def message_set(target, value, oldvalue, initiator):
    if target.thread is not None:
        # oldvalue is NO_VALUE the first time the attribute is set
        old = 0 if oldvalue is NO_VALUE else _word_count(oldvalue)
        target.thread.word_count += _word_count(value) - old

@event.listens_for(Comment.thread, "set")
def thread_set(target, value, oldvalue, initiator):
    # the new Thread, if any
    if value is not None:
        value.word_count += _word_count(target.message)
    # the old Thread, if any (NO_VALUE for a brand-new comment)
    if oldvalue is not None and oldvalue is not NO_VALUE:
        oldvalue.word_count -= _word_count(target.message)
The great advantage of this method is that there's no need to iterate through thread.comments, which for an unloaded collection would mean another SELECT being emitted.
Still another method is to do it in before_flush(). Below is a quick and dirty version, which could be refined to analyze more carefully what has changed in order to determine whether word_count needs to be updated or not:
@event.listens_for(Session, "before_flush")
def before_flush(session, flush_context, instances):
    for obj in session.new | session.dirty:
        if isinstance(obj, Thread):
            obj.word_count = sum(c.word_count for c in obj.comments)
        elif isinstance(obj, Comment):
            obj.thread.word_count = sum(c.word_count for c in obj.thread.comments)
I'd go with the attribute event method, as it is the most performant and keeps the cached value fresh at all times.
You can do this with SQLAlchemy-Utils aggregated columns: http://sqlalchemy-utils.readthedocs.org/en/latest/aggregates.html
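For example, a sketch based on the pattern in the SQLAlchemy-Utils docs (adapted to these models; it counts comments rather than words, since a SQL-side word count is database-specific):

import sqlalchemy as sa
from sqlalchemy_utils import aggregated

class Thread(Base):
    __tablename__ = 'thread'
    id = sa.Column(sa.Integer, primary_key=True)
    title = sa.Column(sa.UnicodeText, nullable=False)

    # maintained automatically whenever related comments change
    @aggregated('comments', sa.Column(sa.Integer, default=0))
    def comment_count(self):
        return sa.func.count('1')

    comments = sa.orm.relationship('Comment', backref='thread')

class Comment(Base):
    __tablename__ = 'comment'
    id = sa.Column(sa.Integer, primary_key=True)
    thread_id = sa.Column(sa.Integer, sa.ForeignKey('thread.id'))
    message = sa.Column(sa.UnicodeText, nullable=False)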
