SQLAlchemy and explicit locking - python

I have multiple processes that can potentially insert duplicate rows into the database. These inserts do not happen very frequently (a few times every hour) so it is not performance critical.
I've tried an exists() check before doing the insert, like so:
# Assume we're inserting a Camera object, a valid SQLAlchemy ORM object that inherits from declarative_base...
try:
    stmt = exists().where(Camera.id == camera_id)
    exists_result = session.query(Camera).with_lockmode("update").filter(stmt).first()
    if exists_result is None:
        session.add(Camera(...))  # Lots of parameters, just assume it works
        session.commit()
except IntegrityError as e:
    session.rollback()
The problem I'm running into is that the exists() check doesn't lock the table, so there is a chance that multiple processes could attempt to insert the same object at the same time. In such a scenario, one process succeeds with the insert and the others fail with an IntegrityError exception. While this works, it doesn't feel "clean" to me.
I would really like some way of locking the Camera table before doing the exists() check.

Perhaps this might be of interest to you:
https://groups.google.com/forum/?fromgroups=#!topic/sqlalchemy/8WLhbsp2nls
You can lock the tables by executing the SQL directly. I'm not sure what that looks like in Elixir, but in plain SA it'd be something like:
conn = engine.connect()
conn.execute("LOCK TABLES Pointer WRITE")
# do stuff with conn
conn.execute("UNLOCK TABLES")


Querying objects added to a non committed session in SQLAlchemy

I originally asked this question without much context and got downvoted, so let's try again...
For one, I don't follow the logic behind SQLAlchemy's session.add. I understand that it queues the object for insertion, and I understand that session.query looks in the connected database rather than in the session, but is it at all possible, within SQLAlchemy, to query the session without first doing session.flush? My expectation from something which reads session.query is that it queries the session...
I am now manually looking in session.new after a None comes out of session.query().first().
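For what it's worth, that workaround can look like this (a minimal sketch, assuming the Location model shown further down; session.new is the collection of pending, not-yet-flushed objects):

u = session.query(Location).filter_by(code=u'mario').first()
if u is None:
    # Fall back to scanning the pending (unflushed) objects by hand
    u = next((obj for obj in session.new
              if isinstance(obj, Location) and obj.code == u'mario'), None)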
There are two reasons why I don't want to do session.flush before my session.query:
one is based on efficiency fears (why should I write to the database, and query the database, while I am still within a session which the user may want to roll back?);
two is that I've adopted a fairly large program which manages to define its own Session whose instances cause flush to also commit.
So really the core of this question is who's helping me find an error in a GPL program on github!
This is a code snippet with a surprising behaviour in bauble/ghini:
# setting up things in ghini
# <replace-later>
import bauble
import bauble.db as db
db.open('sqlite:///:memory:', verify=False)
from bauble.prefs import prefs
import bauble.pluginmgr as pluginmgr
prefs.init()
prefs.testing = True
pluginmgr.load()
db.create(True)
Session = bauble.db.Session
from bauble.plugins.garden import Location
# </replace-later>
# now just plain straightforward usage
session = Session()
session.query(Location).delete()
session.commit()
u0 = session.query(Location).filter_by(code=u'mario').first()
print u0
u1 = Location(code=u'mario')
session.add(u1)
session.flush()
u2 = session.query(Location).filter_by(code=u'mario').one()
print u1, u2, u1==u2
session.rollback()
u3 = session.query(Location).filter_by(code=u'mario').first()
print u3
the output here is:
None
mario mario True
mario
For comparison, here is what I think is just standard, simple code to set up a database:
from sqlalchemy import Column, Unicode
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Location(Base):
    __tablename__ = 'location'
    code = Column(Unicode(64), index=True, primary_key=True)
    def __init__(self, code=None):
        self.code = code
    def __repr__(self):
        return self.code

from sqlalchemy import create_engine
engine = create_engine('sqlite:///joindemo.db')
Base.metadata.create_all(engine)

from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine, autoflush=False)
with this, the output of the same above code snippet is less surprising:
None
mario mario True
None
The reason why flushes in bauble end up emitting a COMMIT is line 133 in db.py, where they handle their history table:
table.insert(dict(table_name=mapper.local_table.name,
                  table_id=instance.id, values=str(row),
                  operation=operation, user=user,
                  timestamp=datetime.datetime.today())).execute()
Instead of issuing the additional SQL in the event handler using the passed-in transactional connection, as they should, they execute the statement as is, which means it ends up using the engine as the bind (found through the table's metadata). Executing using the engine has autocommit behaviour. Since bauble always uses a SingletonThreadPool, there's just one connection per thread, so that statement ends up committing the flushed changes as well. I wonder if this bug is the reason why bauble disables autoflush...
The fix is to change the event handling to use the transactional connection:
class HistoryExtension(orm.MapperExtension):
    """
    HistoryExtension is a
    :class:`~sqlalchemy.orm.interfaces.MapperExtension` that is added
    to all classes that inherit from bauble.db.Base so that all
    inserts, updates, and deletes made to the mapped objects are
    recorded in the `history` table.
    """
    def _add(self, operation, mapper, connection, instance):
        """
        Add a new entry to the history table.
        """
        ...  # a ton of code here
        table = History.__table__
        stmt = table.insert(dict(table_name=mapper.local_table.name,
                                 table_id=instance.id, values=str(row),
                                 operation=operation, user=user,
                                 timestamp=datetime.datetime.today()))
        connection.execute(stmt)

    def after_update(self, mapper, connection, instance):
        self._add('update', mapper, connection, instance)

    def after_insert(self, mapper, connection, instance):
        self._add('insert', mapper, connection, instance)

    def after_delete(self, mapper, connection, instance):
        self._add('delete', mapper, connection, instance)
It's worth noting that MapperExtension has been deprecated since version 0.7.
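For reference, the non-deprecated way to hook these events is the sqlalchemy.event API. A minimal sketch of equivalent listeners (listening on the mapper function covers all mapped classes; the values/user columns from the elided code are omitted here):

import datetime
from sqlalchemy import event
from sqlalchemy.orm import mapper

def _history_entry(operation, mapper_, connection, instance):
    # Same point as above: always use the transactional connection
    connection.execute(History.__table__.insert().values(
        table_name=mapper_.local_table.name,
        table_id=instance.id,
        operation=operation,
        timestamp=datetime.datetime.today()))

@event.listens_for(mapper, 'after_insert')
def after_insert(mapper_, connection, instance):
    _history_entry('insert', mapper_, connection, instance)

@event.listens_for(mapper, 'after_update')
def after_update(mapper_, connection, instance):
    _history_entry('update', mapper_, connection, instance)

@event.listens_for(mapper, 'after_delete')
def after_delete(mapper_, connection, instance):
    _history_entry('delete', mapper_, connection, instance)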
Regarding your views about the session I quote "Session Basics", which you really should read through:
In the most general sense, the Session establishes all conversations with the database and represents a “holding zone” for all the objects which you’ve loaded or associated with it during its lifespan. It provides the entrypoint to acquire a Query object, which sends queries to the database using the Session object’s current database connection, ...
and "Is the Session a cache?":
Yeee…no. It’s somewhat used as a cache, in that it implements the identity map pattern, and stores objects keyed to their primary key. However, it doesn’t do any kind of query caching. This means, if you say session.query(Foo).filter_by(name='bar'), even if Foo(name='bar') is right there, in the identity map, the session has no idea about that. It has to issue SQL to the database, get the rows back, and then when it sees the primary key in the row, then it can look in the local identity map and see that the object is already there. It’s only when you say query.get({some primary key}) that the Session doesn’t have to issue a query.
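That last point is easy to demonstrate. A minimal sketch, assuming the Location model from the snippets above:

loc = Location(code=u'mario')
session.add(loc)
session.flush()
# get() consults the identity map first, so no SELECT is emitted here
same = session.query(Location).get(u'mario')
assert same is loc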
So:
My expectation from something which reads session.query is that it queries the session...
Your expectations are wrong. The Session handles talking to the DB – among other things.
There are two reasons why I don't want to do session.flush before my session.query:
one is based on efficiency fears (why should I write to the database, and query the database, while I am still within a session which the user may want to roll back?);
Because your DB may do validation, have triggers, and generate values for some columns – primary keys, timestamps, and the like. The data you thought you were inserting may end up as something else in the DB, and the Session has absolutely no way to know about that.
Also, why should SQLAlchemy implement a sort of in-memory DB of its own, with its own query engine, and all the problems that come with synchronizing two databases? How would SQLAlchemy support all the different operations and functions of the different DBs you query against? Your simple equality predicate example just scratches the surface.
When you rollback, you roll back the DB's transaction (along with the session's unflushed changes).
two is that I've adopted a fairly large program which manages to define its own Session whose instances cause flush to also commit.
That is caused by the event handling bug described above.

Catch python DatabaseErrors generically

I have a database schema that might be implemented in a variety of different database engines (let's say an MS Access database that I'll connect to with pyodbc, or a SQLite database that I'll connect to via the built-in sqlite3 module, as a simple example).
I'd like to create a factory function/method that returns a database connection of the appropriate type based on some parameter, similar to the following:
def createConnection(connType, params):
    if connType == 'sqlite':
        return sqlite3.connect(params['filename'])
    elif connType == 'msaccess':
        return pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ={};'.format(params['filename']))
    else:
        # do something else
Now I've got some query code that should work with any connection type (since the schema is identical no matter the underlying DB engine) but may throw an exception that I'll need to catch:
db = createConnection(params['dbType'], params)
cursor = db.cursor()
try:
    cursor.execute('SELECT A, B, C FROM TABLE')
    for row in cursor:
        print('{},{},{}'.format(row.A, row.B, row.C))
except DatabaseError as err:
    # Do something...
The problem I'm having is that the DatabaseError classes from each DB API 2.0 implementation don't share a common base class (other than the way-too-generic Exception), so I don't know how to catch these exceptions generically. Obviously I could do something like the following:
try:
    # as before
except sqlite3.DatabaseError as err:
    # do something
except pyodbc.DatabaseError as err:
    # do something again
...where I include an explicit except block for each possible database engine. But this seems distinctly non-Pythonic to me.
How can I generically catch DatabaseErrors from different underlying DB API 2.0 database implementations?
There are a number of approaches:
Use a catch-all exception and then work out what exception it is. If it is not in your list, raise the exception again (or your own). See: Python When I catch an exception, how do I get the type, file, and line number?
Perhaps you want to take the problem in a different way: your factory code should also provide the exception to test for (see the sketch after this list).
A simpler approach in my view (and the one I use in practice), is to have a class for all database connections, and to subclass it for each specific database type/syntax. Inheritance allows you to take care of all specificities. For some reason, I never had to worry about this issue.
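A minimal sketch of the second approach, reusing the factory from the question (names are illustrative): the factory hands back the driver's DatabaseError class along with the connection, so the calling code stays engine-agnostic.

import sqlite3
import pyodbc

def createConnection(connType, params):
    # Return the connection together with the driver's DatabaseError class
    if connType == 'sqlite':
        return sqlite3.connect(params['filename']), sqlite3.DatabaseError
    elif connType == 'msaccess':
        conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
                              'DBQ={};'.format(params['filename']))
        return conn, pyodbc.DatabaseError
    else:
        raise ValueError('unknown connType: %r' % connType)

db, DatabaseError = createConnection(params['dbType'], params)
cursor = db.cursor()
try:
    cursor.execute('SELECT A, B, C FROM TABLE')
except DatabaseError as err:
    # One except clause, whatever the underlying driver
    ...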

conditional add statement in SQLAlchemy

Suppose I want to upload several SQL records, to a table that may not be populated yet. If there is a record with a primary key("ID") that already exists, either in the table or in the records to be committed to a table, I want to replace the existing record with the new record.
I'm using MSSQL (SQL Server 2008).
My first guess would be
try:
    session.add(record)
    session.commit()
except:
    session.query(Class).\
        filter(Class.ID == record.ID).\
        update(some expression)
    session.commit()
What should the expression be? And is there a cleaner (and safer!) way of doing this?
In general unless using statements that guarantee atomicity, you'll always have to account for race conditions that might arise from multiple actors trying to either insert or update (don't forget delete). Even the MERGE statement, though a single statement, can have race conditions if not used correctly.
Traditionally this kind of "upsert" is performed using stored procedures or other SQL or implementation specific features available, like the MERGE statement.
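For illustration, a MERGE issued through SQLAlchemy on SQL Server might look like this sketch (the things table and value column are made up; note the WITH (HOLDLOCK) hint, without which MERGE itself can race):

from sqlalchemy import text

upsert = text("""
    MERGE things WITH (HOLDLOCK) AS target
    USING (SELECT :id AS ID) AS source
    ON target.ID = source.ID
    WHEN MATCHED THEN
        UPDATE SET value = :value
    WHEN NOT MATCHED THEN
        INSERT (ID, value) VALUES (:id, :value);
""")
session.execute(upsert, {'id': record.ID, 'value': record.value})
session.commit()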
An SQLAlchemy solution has to either attempt the insert and perform an update if an integrity error is raised, or perform the update and attempt an insert if no rows were affected. It should be prepared to retry in case both operations fail (a row might get deleted or inserted in between):
from sqlalchemy.exc import IntegrityError

while True:  # Infinite loop, use a retry counter if necessary
    try:
        # Begin a savepoint; prevents the whole transaction failing
        # in case of an integrity error
        with session.begin_nested():
            session.add(record)
            # Flush instead of commit, we need the transaction intact
            session.flush()
        # If the flush is successful, break out of the loop as the insert
        # was performed
        break
    except IntegrityError:
        # Attempt the update. If the session has to reflect the changes
        # performed by the update, change the `synchronize_session` argument.
        if session.query(Class).\
                filter_by(ID=record.ID).\
                update({...},
                       synchronize_session=False):
            # 1 or more rows were affected (hopefully 1)
            break
        # Nothing was updated, perhaps a DELETE in between?
        # Both operations have failed, retry

session.commit()
Regarding
If there is a record with a primary key("ID") that already exists, either in the table or in the records to be committed to a table, I want to replace the existing record with the new record.
If you can be sure that no concurrent updates to the table in question will happen, you can use Session.merge for this kind of task:
# Records have primary key set, on which merge can either load existing
# state and merge, or create a new record in the session if none was found.
for record in records:
    merged_record = session.merge(record)
    # Note that merged_record is not record

session.commit()
The SQLAlchemy merge will first check if an instance with the given primary key exists in the identity map. If it doesn't, and load is passed as True, it'll check the database for the primary key. If a given instance has no primary key, or an instance cannot be found, a new instance will be created.
The merge will then copy the state of the given instance onto the located/created instance. The new instance is returned.
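The load flag mentioned above is worth a quick sketch:

# load=True (the default): merge may SELECT by primary key to locate existing state
merged = session.merge(record)

# load=False: trust the given instance to already match the DB state; no SELECT
# is emitted. Only safe when you know the state really is in sync.
merged = session.merge(record, load=False)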
No. There is a much better pattern for doing this. Do a query first to see if the record already exists, and then proceed accordingly.
Using your syntax, it would be something like the following:
result = session.query(Class).filter(Class.ID == record.ID).first()

# If record does not exist in DB, then add record
if result is None:
    try:
        session.add(record)
        session.commit()
    except:
        session.rollback()
        log.error('Rolling back transaction in query-none block')
# If record does exist, then update value of record in DB
else:
    try:
        session.query(Class).\
            filter(Class.ID == record.ID).\
            update(some expression)
        session.commit()
    except:
        session.rollback()
        log.error('Rolling back transaction')
It's usually a good idea to wrap your database operations in a try/except block, so you're on the right track with the try portion of what you wrote. Depending on what you're doing, the except block should typically show an error message or do a db rollback.

Can I use the same cursor while looping through it?

I am iterating through a SELECT result, like this:
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user=...)  # and so on
cur = conn.cursor()
cur.execute("SELECT * FROM some_table")
for row in cur:
    # some stuff I'm doing
    # sometimes I need to perform another SELECT here
The question is, can I use cur again inside the for loop, or do I have to create another cursor (or even more - another connection)?
I guess I am missing some basic knowledge about databases or Python here... I am actually quite new with both. Also, my attempts to google the answer have failed.
I would even guess that I have to create another cursor; in fact I think I used it like this for a while before realizing it might be wrong, and it seemed to work. But I am a bit confused now and can't guarantee it, so I just want to make sure.
You have to create a new cursor. Otherwise, cur is now holding the results of your new "inner" select instead of your "outer" one.
This may work anyway, depending on your database library and your luck, but you shouldn't count on it. I'll try to explain below.
You don't need a new connection, however.
So:
cur.execute("SELECT * FROM some_table")
for row in cur:
# some stuff I'm doing
inner_cur = conn.cursor()
inner_cur.execute("SELECT * FROM other_table WHERE column = row[1]")
for inner_row in inner_cur:
# stuff
So, why does it sometimes work?
Well, let's look at what a for row in cur: loop really does under the covers:
temp_iter = iter(cur)
while True:
    try:
        row = next(temp_iter)
    except StopIteration:
        break
    # your code runs here
Now, that iter(cur) call invokes the __iter__ method on the cursor object. What does that do? That's up to cur, an object of the cursor type provided by your database library.
The obvious implementation is to return some kind of iterator that has a reference to either the cursor object, or to the same row collection that the cursor object is using under the covers. This is what happens when you call iter on a list, for example.
But there's nothing requiring the database library to implement its __iter__ that way. It could create a copy of the row set for the iterator to use. Or, more plausibly, it could make the iterator refer to the current row set… but then change the cursor itself to refer to a different one when you next call execute. If it does that, then the old iterator keeps reading the old row set, and you can get a new iterator that iterates the new row set.
You shouldn't rely on that happening just because a database library is allowed to do that. But you also shouldn't rely on it not happening, of course.
In more concrete terms, imagine this type:
class Cursor(object):
    # other stuff
    def __iter__(self):
        return iter(self.rowset)
    def execute(self, sql, *args):
        self.rowset = self.db.do_the_real_work(sql, *args)
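With a type like that, an iterator created before the second execute() keeps its reference to the old row set, which is exactly why re-using the cursor can appear to work:

cur.execute("SELECT * FROM some_table")    # cur.rowset is now row set A
it = iter(cur)                             # it holds a reference to row set A
cur.execute("SELECT * FROM other_table")   # cur.rowset replaced by row set B
next(it)                                   # still yields rows from row set A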

When should I be calling flush() on SQLAlchemy?

I'm new to SQLAlchemy and have inherited a somewhat messy codebase without access to the original author.
The code is littered with calls to DBSession.flush(), seemingly any time the author wanted to make sure data was being saved. At first I just followed the patterns I saw in this code, but as I read the docs, it seems this is unnecessary - autoflushing should be in place. Additionally, I've run into a few cases with AJAX calls that generate the error "InvalidRequestError: Session is already flushing".
Under what scenarios would I legitimately want to keep a call to flush()?
This is a Pyramid app, and SQLAlchemy is being setup with:
DBSession = scoped_session(sessionmaker(extension=ZopeTransactionExtension(), expire_on_commit=False))
Base = declarative_base()
The ZopeTransactionExtension on the DBSession in conjunction with the pyramid_tm being active on your project will handle all commits for you. The situations where you need to flush are:
You want to create a new object and get back the primary key.
DBSession.add(obj)
DBSession.flush()
log.info('look, my new object got primary key %d', obj.id)
You want to try to execute some SQL in a savepoint and rollback if it fails without invalidating the entire transaction.
import transaction
from sqlalchemy.exc import IntegrityError

sp = transaction.savepoint()
try:
    foo = Foo()
    foo.id = 5
    DBSession.add(foo)
    DBSession.flush()
except IntegrityError:
    log.error('something already has id 5!!')
    sp.rollback()
In all other cases involving the ORM, the transaction will be aborted for you upon exception, or committed upon success automatically by pyramid_tm. If you execute raw SQL, you will need to execute transaction.commit() yourself or mark the session as dirty via zope.sqlalchemy.mark_changed(DBSession) otherwise there is no way for the ZTE to know the session has changed.
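For example, the raw SQL case might look like this (a minimal sketch; mark_changed comes from the zope.sqlalchemy package):

from zope.sqlalchemy import mark_changed

DBSession.execute("UPDATE foo SET bar = 1")
# The ZTE only notices ORM flushes, so flag the session as dirty;
# pyramid_tm will then commit this at the end of the request.
mark_changed(DBSession)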
Also, you should leave expire_on_commit at the default of True unless you have a really good reason not to.
