SQLAlchemy IntegrityError - Python

I have a table where the primary key (id) isn't the key by which I distinguish between records; for that, I have a unique constraint over 3 columns.
To be able to merge records, I've added a classmethod that retrieves the relevant record if it exists; otherwise, it returns a new record.
class Foo(Base):
    __table_args__ = (sa.UniqueConstraint('bar', 'baz', 'qux'),)

    id = sa.Column(Identifier, sa.Sequence('%s_id_seq' % __tablename__),
                   nullable=False, primary_key=True)
    bar = sa.Column(sa.BigInteger)
    baz = sa.Column(sa.BigInteger)
    qux = sa.Column(sa.BigInteger)
    a1 = sa.Column(sa.BigInteger)
    a2 = sa.Column(sa.BigInteger)

    @classmethod
    def get(cls, bar=None, baz=None, qux=None, **kwargs):
        item = session.query(cls).\
            filter(cls.bar == bar).\
            filter(cls.baz == baz).\
            filter(cls.qux == qux).\
            first()
        if item:
            for k, v in kwargs.iteritems():
                if getattr(item, k) != v:
                    setattr(item, k, v)
        else:
            item = cls(bar=bar, baz=baz, qux=qux, **kwargs)
        return item
This works well most of the time, but every once in a while I get an IntegrityError when trying to merge an item:
foo = Foo.get(**item)
session.merge(foo)
As I understand it, this happens because merge tries to insert a record when a record with the same unique fields already exists.
Is there something wrong with the get function? What am I missing here?
(BTW: I realize this might look awkward, but I need a unique sequential ID, and to avoid problems with DBs that don't support sequences on non-primary-key columns, I made it this way.)
Edit 1:
Changed orm.db to session so the example would be clearer
Edit 2:
I have this system running on several platforms, and it seems this only happens with MySQL on Ubuntu (the other platform is Oracle on RedHat). Also, in some weird way, it happens much more to specific end users.
Regarding MySQL, I tried both mysql and mysql+mysqldb as connection strings, but both produce this error.
Regarding the end users, it makes no sense, and I don't know what to make of it ...

Indeed, your method is prone to integrity errors. What happens is that when you invoke Foo.get(1, 2, 3) twice, you do not flush the session in between. The second time you invoke it, the ORM query finds nothing again - because there is no actual row in the DB yet - and a new object is created with a different identity. Then on commit/flush these two clash, causing the integrity error. This can be avoided by flushing the session after each object creation.
Now, things work differently in SQLAlchemy if the primary key is known in merge/ORM get: if the matching primary key is found already loaded in the session, SQLAlchemy realizes that these two must be the same object. However, no such check is done on unique indexes. It might be possible to do that as well, but it would only make race conditions rarer, as two sessions might still create the same (bar, baz, qux) triplet at the same time.
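A minimal illustration of that identity-map behavior (a sketch assuming the Foo model from the question and an already-persisted row with id 1):

# Within one session, an ORM get() by primary key short-circuits
# through the identity map, so both lookups yield the very same object:
a = session.query(Foo).get(1)
b = session.query(Foo).get(1)
assert a is b

# There is no equivalent short-circuit for rows matched only by the
# (bar, baz, qux) unique constraint, which is why Foo.get() can hand
# back two distinct pending objects for the same triplet.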
TL;DR:
else:
    item = cls(bar=bar, baz=baz, qux=qux, **kwargs)
    session.add(item)
    session.flush()
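Put together, a sketch of the whole classmethod with the flush added (same assumptions as the question's code; the IntegrityError fallback is an extra safety net for the cross-session race described above, and note the rollback also discards any other pending changes in the session):

from sqlalchemy.exc import IntegrityError

@classmethod
def get(cls, bar=None, baz=None, qux=None, **kwargs):
    item = session.query(cls).\
        filter(cls.bar == bar).\
        filter(cls.baz == baz).\
        filter(cls.qux == qux).\
        first()
    if item:
        for k, v in kwargs.iteritems():
            if getattr(item, k) != v:
                setattr(item, k, v)
    else:
        item = cls(bar=bar, baz=baz, qux=qux, **kwargs)
        session.add(item)
        try:
            # Flush now so a later get() in this session finds the row
            # instead of creating a duplicate pending object.
            session.flush()
        except IntegrityError:
            # Another session inserted the same triplet first;
            # roll back and load its row instead.
            session.rollback()
            item = session.query(cls).\
                filter(cls.bar == bar).\
                filter(cls.baz == baz).\
                filter(cls.qux == qux).\
                one()
    return item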

I seem to have found the culprit!
Yesterday I had a situation where clients got the IntegrityError due to wrong timezones, which got me thinking in that direction.
One of the fields I was using to identify the models is a Date column (I had no idea it was related, hence I didn't even mention it, sorry ...). Since I call the actions from a Flex client using AMF, and since ActionScript doesn't have a Date object without time, I'm transferring a Date object with zeroed time. I guess that in some situations I got a non-zero time in those dates, which raised an IntegrityError.
I would have expected SA to strip the time from datetime values in the case of Date columns, as I think the DB would; that is, Model.date_column == datetime_value should cast the datetime to datetime.date before making the comparison.
As for my solution, I simply make sure that the value is cast to datetime.date before I query the DB ...
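A minimal sketch of that normalization (the payload key created_on is hypothetical):

import datetime

def as_date(value):
    # AMF delivers datetimes with a zeroed (but occasionally shifted)
    # time component; compare Date columns against plain dates only.
    if isinstance(value, datetime.datetime):
        return value.date()
    return value

item['created_on'] = as_date(item['created_on'])  # hypothetical Date field
foo = Foo.get(**item)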
So far, yesterday and today have been quiet with no complaints. I'll keep an eye on it and report back should anything change ...
Thank you all for your help.
Cheers,
Ofir

Related

Why am I getting a duplicate foreign key error?

I'm trying to use Python and SQLAlchemy to insert stuff into the database, but it's giving me a duplicate foreign key error. I didn't have any problems when I was executing the SQL queries to create the tables earlier.
You're getting the duplicate because you've written the code as a one-to-one relationship when it is at least a one-to-many relationship.
SQL won't let you insert more than one row with the same key value. It creates keys for each row, and when you try to insert a duplicate without having set up any relationship between the tables, it throws the error you're getting.
The code below sets up a one-to-many relationship for your tables, using Flask-SQLAlchemy to connect to the database. If you aren't using Flask yourself, figure out the translation, or use it.
class ChildcareUnit(db.Model):
    childcare_id = db.Column('ChildcareUnit_id', db.Integer, primary_key=True)
    fullname = db.Column(db.String(250), nullable=False)
    shortname = db.Column(db.String(250), nullable=False)
    menus = db.relationship('Menu', back_populates='childcare_unit')

    def __init__(self, fullname, shortname):
        self.fullname = fullname
        self.shortname = shortname

    def __repr__(self):
        return '<ChildcareUnit %r>' % self.childcare_id

class Menu(db.Model):
    menu_id = db.Column('menu_id', db.Integer, primary_key=True)
    menu_date = db.Column('Date', db.Date, nullable=True)
    id_childcare_unit = db.Column(db.Integer,
                                  db.ForeignKey('childcare_unit.ChildcareUnit_id'))
    childcare_unit = db.relationship('ChildcareUnit', back_populates='menus')

    def __init__(self, menu_date):
        self.menu_date = menu_date

    def __repr__(self):
        return '<Menu %r>' % self.menu_id
A couple of differences to note here. The columns are now db.Column(), not Column(). This is the Flask-SQLAlchemy extension at work: it ties the column definitions to the database connection the extension manages.
Also, look at the db.relationship() attributes I've added to both tables. They are what tell the ORM that the two tables have a one-to-many relationship. They need to be on both tables, and with back_populates each side names the other, so the two ends stay in sync.
Lastly, look at __repr__. Despite what you'll sometimes hear, it has nothing to do with generating foreign keys - the db.ForeignKey() column above is what creates the key relationship. __repr__ and its cousin __str__ only control how an instance is displayed, and neither affects database performance.
__repr__ is meant to give an unambiguous, developer-facing representation, which makes debugging relationship problems like this one much easier, while __str__ is meant to be human-friendly. Define at least __repr__ so your objects print as something more useful than a bare memory address.
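A quick usage sketch under the same assumptions (a configured Flask-SQLAlchemy app, with the models as defined above):

import datetime

unit = ChildcareUnit('Sunshine Daycare', 'Sunshine')
menu = Menu(datetime.date(2024, 1, 8))
menu.childcare_unit = unit   # equivalently: unit.menus.append(menu)

db.session.add(unit)         # the menu is cascaded in via the relationship
db.session.commit()

print(unit.menus)            # [<Menu 1>]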

With SQLAlchemy, how can I convert a row to a "real" Python object?

I've been using SQLAlchemy with Alembic to simplify the database access I use, and any data structure changes I make to the tables. This had been working out really well, up until I started to notice more and more issues with SQLAlchemy "expiring" fields, from my point of view, nearly at random.
A case in point would be this snippet,
class HRDecimal(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[HRDecimal]
        """
        return list(HRDecimal.query.filter(*filters))

class Meta(Model):
    dec_id = Column(String(50), index=True)

    @staticmethod
    def qfilter(*filters):
        """
        :rtype : list[Meta]
        """
        return list(Meta.query.filter(*filters))
Code:
ids = ['1', '2', '3']  # obviously fake list of ids
decs = HRDecimal.qfilter(
    HRDecimal.dec_id.in_(ids))
metas = Meta.qfilter(
    Meta.dec_id.in_(ids))

combined = []
for ident in ids:
    combined.append((
        ident,
        [dec for dec in decs if dec.dec_id == ident],
        [hm for hm in metas if hm.dec_id == ident]
    ))
For the above there wasn't a problem, but when I'm processing a list of ids that may contain a few thousand entries, the process started taking a huge amount of time, and if run from a web request in Flask, the thread would often be killed.
When I started poking around to find out why this was happening, the key area was
[dec for dec in decs if dec.dec_id == ident],
[hm for hm in metas if hm.dec_id == ident]
During the combining of these (what I thought were) plain Python objects, calling dec.dec_id and hm.dec_id drops into SQLAlchemy code. In the best case, we go into
def __get__(self, instance, owner):
    if instance is None:
        return self

    dict_ = instance_dict(instance)
    if self._supports_population and self.key in dict_:
        return dict_[self.key]
    else:
        return self.impl.get(instance_state(instance), dict_)
of InstrumentedAttribute in sqlalchemy/orm/attributes.py, which seems to be very slow. Even worse, I've observed times when fields had expired, and then we enter
def get(self, state, dict_, passive=PASSIVE_OFF):
    """Retrieve a value from the given object.

    If a callable is assembled on this object's attribute, and
    passive is False, the callable will be executed and the
    resulting value will be set as the new value for this attribute.
    """
    if self.key in dict_:
        return dict_[self.key]
    else:
        # if history present, don't load
        key = self.key
        if key not in state.committed_state or \
                state.committed_state[key] is NEVER_SET:
            if not passive & CALLABLES_OK:
                return PASSIVE_NO_RESULT

            if key in state.expired_attributes:
                value = state._load_expired(state, passive)
of AttributeImpl in the same file. The horrible issue here is that state._load_expired re-runs the SQL query completely. So in a situation like this, with a big list of idents, we end up running thousands of "small" SQL queries against the database, where I think we should only have been running the two "large" ones at the top.
Now, I've gotten around the expiry issue through how I initialise the database for Flask, with session_options, changing
app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(app)
to
app = Flask(__name__)
CsrfProtect(app)
db = SQLAlchemy(
    app,
    session_options=dict(autoflush=False, autocommit=False, expire_on_commit=False))
This has definitely improved the situation where a row's fields just seemed to expire (from my observations) at random, but the "normal" slowness of attribute access in SQLAlchemy is still an issue for what we're currently running.
Is there any way with SQLAlchemy, to get a "real" Python object returned from a query, instead of a proxied one like it is now, so it isn't being affected by this?
Your randomness is probably related to either explicitly committing or rolling back at an inconvenient time, or due to auto-commit of some kind. In its default configuration a SQLAlchemy session expires all ORM-managed state when a transaction ends. This is usually a good thing, since when a transaction ends you've no idea what the current state of the DB is. It can be disabled, as you've done with expire_on_commit=False.
The ORM is also ill suited for extremely large bulk operations in general, as explained here. It is very well suited for handling complex object graphs and persisting those to a relational database with much less effort on your part, as it organizes the required inserts etc. for you. An important part of that is tracking changes to instance attributes. The SQLAlchemy Core is better suited for bulk.
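For illustration, a minimal Core-style bulk insert against the question's HRDecimal table (hypothetical data; assumes the table's remaining columns have defaults):

# One executemany INSERT for all rows, bypassing ORM change tracking.
db.session.execute(
    HRDecimal.__table__.insert(),
    [{'dec_id': ident} for ident in ids]
)
db.session.commit()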
It looks like you're performing 2 queries that produce a potentially large number of results and then doing a manual "group by" on the data, but in a rather inefficient way: for each id you have, you scan the entire list of results, which is O(nm), where n is the number of ids and m the number of results. Instead, you should group the results into lists of objects by id first and then perform the "join". On some other database systems you could handle the grouping in SQL directly, but alas, MySQL has no notion of arrays, other than JSON.
A possibly more performant version of your grouping could be for example:
from itertools import groupby
from operator import attrgetter

ids = ['1', '2', '3']  # obviously fake list of ids

# Order the results by `dec_id` for Python itertools.groupby. Cannot
# use your `qfilter()` method as it produces lists, not queries.
decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    order_by(HRDecimal.dec_id).\
    all()
metas = Meta.query.\
    filter(Meta.dec_id.in_(ids)).\
    order_by(Meta.dec_id).\
    all()

key = attrgetter('dec_id')
decs_lookup = {dec_id: list(g) for dec_id, g in groupby(decs, key)}
metas_lookup = {dec_id: list(g) for dec_id, g in groupby(metas, key)}

combined = [(ident,
             decs_lookup.get(ident, []),
             metas_lookup.get(ident, []))
            for ident in ids]
Note that since in this version we iterate over the queries only once, all() is not strictly necessary, but it should not hurt much either. The grouping could also be done without sorting in SQL, using defaultdict(list):
from collections import defaultdict

decs = HRDecimal.query.filter(HRDecimal.dec_id.in_(ids)).all()
metas = Meta.query.filter(Meta.dec_id.in_(ids)).all()

decs_lookup = defaultdict(list)
metas_lookup = defaultdict(list)

for d in decs:
    decs_lookup[d.dec_id].append(d)
for m in metas:
    metas_lookup[m.dec_id].append(m)

combined = [(ident, decs_lookup[ident], metas_lookup[ident])
            for ident in ids]
And finally to answer your question, you can fetch "real" Python objects by querying for the Core table instead of the ORM entity:
decs = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    with_entities(HRDecimal.__table__).\
    all()
which will result in a list of namedtuple-like objects that can easily be converted to dicts with _asdict().
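For instance, a minimal conversion of those results to plain dicts (same models as above):

rows = HRDecimal.query.\
    filter(HRDecimal.dec_id.in_(ids)).\
    with_entities(HRDecimal.__table__).\
    all()

# Plain dicts: no instrumentation, no expiry, detached from the session.
plain = [row._asdict() for row in rows]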

Marking an object as clean in SQLAlchemy ORM

Is there any way to explicitly mark an object as clean in the SQLAlchemy ORM?
This is related partly to a previous question on bulk update strategies.
I want to, within a before_flush event listener, mark a bunch of objects as actually not needing to be flushed, because they are manually synced with the database by other means.
I have tried the strategy below, but it results in the object being removed from the session, which then can cause problems later when a lazy load happens.
@event.listens_for(SignallingSession, 'before_flush')
def before_flush(session, flush_context, instances):
    ledgers = []

    if session.dirty:
        for elem in session.dirty:
            if session.is_modified(elem, include_collections=False):
                if isinstance(elem, Wallet):
                    session.expunge(elem)  # causes problems later
                    ledgers.append(Ledger(id=elem.id, amount=elem.balance))

    if ledgers:
        session.bulk_save_objects(ledgers)
        session.execute('UPDATE wallet w JOIN ledger l ON w.id = l.id SET w.balance = l.amount')
        session.execute('TRUNCATE ledger')
I want to do something like:
session.dirty.remove(MyObject)
But that doesn't work, as session.dirty is a computed property, not a regular attribute. I've been digging around the instrumentation code but can't see how I might fool the dirty list into not containing something. I see there is also a history on the object state that will need taking care of as well.
Any ideas? The underlying database is MySQL if that makes any difference.
-Matt
When you modify the database outside of the ORM, you can let the ORM know the current database state by using set_committed_value().
Example:
from sqlalchemy.orm.attributes import set_committed_value

wallet = session.query(Wallet).filter_by(id=123).one()
wallet.balance = 0

session.execute("UPDATE wallet SET balance = 0 WHERE id = 123;")

# Tell the ORM that 0 is now the persisted value for wallet.balance.
set_committed_value(wallet, "balance", 0)

session.commit()  # won't issue additional SQL to update wallet
If you really wanted to mark the instance as not dirty, you can muck with the internals of SQLAlchemy:
from sqlalchemy import inspect

state = inspect(wallet)
session.identity_map._modified.discard(state)
state.modified = False

print(wallet in session.dirty)  # False
Let me summarize this insanity.
from sqlalchemy.orm import attributes
attributes.instance_state(your_object).committed_state.clear()
Easy. (no)

PonyORM (Python) "Value was updated outside of current transaction" but it wasn't

I'm using Pony ORM version 0.7 with a Sqlite3 database on disk, and running into this issue: I am performing a select, then an update, then a select, then another update, and getting an error message of
pony.orm.core.UnrepeatableReadError: Value of Task.order_id for
Task[23654] was updated outside of current transaction (was: 1, now: 2)
I've reduced the problem to the minimal set of commands that causes it (i.e. removing anything makes the problem not occur):
@db_session
def test_method():
    tasks = list(map(Task.to_dict, Task.select()))
    db.execute("UPDATE Task SET order_id=order_id*2")
    task_to_move = select(task for task in Task if task.order_id == 2).first()
    task_to_move.order_id = 1

test_method()
For completeness's sake, here is the definition of Task:
class Task(db.Entity):
    text = Required(unicode)
    heading = Required(int)
    create_timestamp = Required(datetime)
    done_timestamp = Optional(datetime)
    order_id = Required(int)
Also, if I remove the constraint task.order_id == 2 from my select, the problem no longer occurs, so I assume it has something to do with querying on a field that has been changed since the transaction started. But I don't know why the error message says it was changed by a different transaction (unless db.execute runs in a separate transaction because it is raw SQL?).
I've already looked at a similar question, but the problem there was different (Pony ORM reports record "was updated outside of current transaction" while there is no other transaction), and at the documentation (https://docs.ponyorm.com/transactions.html), but neither solved my problem.
Any ideas what might be going on here?
Pony uses optimistic concurrency control by default. For each attribute, Pony remembers its current value (potentially modified by application code) as well as the original value that was read from the database. During UPDATE, Pony checks that the value of the column in the database is still the same. If the value has changed, Pony assumes that some concurrent transaction changed it and throws an exception in order to avoid a "lost update" situation.
If you execute a raw SQL query, Pony does not know what exactly was modified in the database. So when Pony sees that the column value has changed, it mistakenly thinks the value was changed by another transaction.
To avoid the problem, you can mark the order_id attribute as volatile. Pony will then assume that the value of the attribute can change at any time (by a trigger or a raw SQL update) and will exclude it from optimistic checks:
class Task(db.Entity):
    text = Required(unicode)
    heading = Required(int)
    create_timestamp = Required(datetime)
    done_timestamp = Optional(datetime)
    order_id = Required(int, volatile=True)
Note that Pony caches the value of a volatile attribute and will not re-read it from the database until the object is saved, so in some situations you can get an obsolete value in Python.
Update:
Starting from release 0.7.4, you can also pass the optimistic=False option to db_session to turn off optimistic checks for a specific transaction that uses raw SQL queries:
with db_session(optimistic=False):
    ...
or
@db_session(optimistic=False)
def some_function():
    ...
It is also now possible to specify optimistic=False on an attribute instead of volatile=True. Pony will then skip optimistic checks for that attribute but will still treat it as non-volatile.

How can you keep the Django ORM from making mistakes when you pass the wrong kind of object?

We found this while testing: one machine was set up with MyISAM as the default engine and one with InnoDB as the default engine. We have code similar to the following:
class StudyManager(models.Manager):
    def scored(self, school=None, student=None):
        qset = self.get_queryset()  # a Manager queries via get_queryset()
        if school:
            qset = qset.filter(school=school)
        if student:
            qset = qset.filter(student=student)
        return qset.order_by('something')
The problem code looked like this:
print Study.objects.scored(student).count()
which meant that the "student" was being treated as a school. This got through testing with MyISAM because student.id == school.id: MyISAM can't do a rollback, so the tables get completely re-created for each test (resetting the auto-increment id field). InnoDB caught these errors because a rollback evidently does not reset the auto-increment fields.
Problem is, during testing there could be many other errors going uncaught due to duck typing, since all models have an id field. I'm worried about the ids on objects lining up (in production or in testing) and that causing problems / failing to find the bugs.
I could add asserts like so:
class StudyManager(models.Manager):
    def scored(self, school=None, student=None):
        qset = self.get_queryset()
        if school:
            assert isinstance(school, School)
            qset = qset.filter(school=school)
        if student:
            assert isinstance(student, Student)
            qset = qset.filter(student=student)
        return qset.order_by('something')
But this looks nasty, and is a lot of work (to go back and retrofit). It's also slower in debug mode.
I've thought about coercing the id field of each model into model_id (student_id for Student, school_id for School), so that a School would not have a student_id. This would only involve specifying the primary key field, but Django has a shortcut for that in .pk, so I'm guessing it might not help in all cases.
Is there a more elegant solution to catching this kind of bug? Being an old C++ hand, I kind of miss type safety.
This is an aspect of Python and has nothing to do with Django per se.
By defining default values for function parameters you do not eliminate the concept of positional arguments - you simply make it possible to not specify all parameters when invoking the function. @mVChr is correct in saying that you need to get into the habit of using the parameter name(s) when you call the routine, particularly when there is inherent ambiguity in what it is being called with.
You might also consider having two separate routines whose names quite clearly identify their expected parameter types.
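If you are on Python 3, one way to enforce that naming habit at the API level (a sketch, not from the original code) is to make the parameters keyword-only:

from django.db import models

class StudyManager(models.Manager):
    # The bare `*` makes school/student keyword-only, so a call like
    # scored(student) raises TypeError instead of silently binding
    # the student to `school`.
    def scored(self, *, school=None, student=None):
        qset = self.get_queryset()
        if school:
            qset = qset.filter(school=school)
        if student:
            qset = qset.filter(student=student)
        return qset.order_by('something')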
