I'm starting a new application and looking at using an ORM -- in particular, SQLAlchemy.
Say I've got a column 'foo' in my database and I want to increment it. In straight sqlite, this is easy:
db = sqlite3.connect('mydata.sqlitedb')
cur = db.cursor()
cur.execute('update stuff set foo = foo + 1')
I figured out the SQLAlchemy SQL-builder equivalent:
engine = sqlalchemy.create_engine('sqlite:///mydata.sqlitedb')
md = sqlalchemy.MetaData(engine)
table = sqlalchemy.Table('stuff', md, autoload=True)
upd = table.update(values={table.c.foo:table.c.foo+1})
engine.execute(upd)
This is slightly slower, but there's not much in it.
Here's my best guess for a SQLAlchemy ORM approach:
# snip definition of Stuff class made using declarative_base
# snip creation of session object
for c in session.query(Stuff):
    c.foo = c.foo + 1
session.flush()
session.commit()
This does the right thing, but it takes just under fifty times as long as the other two approaches. I presume that's because it has to bring all the data into memory before it can work with it.
Is there any way to generate the efficient SQL using SQLAlchemy's ORM? Or using any other python ORM? Or should I just go back to writing the SQL by hand?
SQLAlchemy's ORM is meant to be used together with the SQL layer, not hide it. But you do have to keep one or two things in mind when using the ORM and plain SQL in the same transaction. Basically, from one side, ORM data modifications will only hit the database when you flush the changes from your session. From the other side, SQL data manipulation statements don't affect the objects that are in your session.
So if you say
for c in session.query(Stuff).all():
    c.foo = c.foo + 1
session.commit()
it will do what it says: fetch all the objects from the database, modify them all, and then, when it's time to flush the changes to the database, update the rows one by one.
Instead you should do this:
session.execute(update(stuff_table, values={stuff_table.c.foo: stuff_table.c.foo + 1}))
session.commit()
This will execute as one query as you would expect, and because at least the default session configuration expires all data in the session on commit you don't have any stale data issues.
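That expiry behaviour is controlled by the session's expire_on_commit flag, which defaults to True in modern SQLAlchemy; a minimal sketch of spelling it out explicitly (assuming the engine from the question):

from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine, expire_on_commit=True)  # True is the default
session = Session()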
In the almost-released 0.5 series you could also use this method for updating:
session.query(Stuff).update({Stuff.foo: Stuff.foo + 1})
session.commit()
That will basically run the same SQL statement as the previous snippet, but also select the changed rows and expire any stale data in the session. If you know you aren't using any session data after the update you could also add synchronize_session=False to the update statement and get rid of that select.
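For example, a minimal sketch of the same update with the synchronization turned off:

session.query(Stuff).update({Stuff.foo: Stuff.foo + 1}, synchronize_session=False)
session.commit()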
session.query(Clients).filter(Clients.id.in_(client_id_list)).update({'status': status})
session.commit()
Try this =)
There are several ways to UPDATE using sqlalchemy:

1)
for c in session.query(Stuff).all():
    c.foo += 1
session.commit()

2)
session.query(Stuff).update({"foo": Stuff.foo + 1})
session.commit()

3)
conn = engine.connect()
table = Stuff.__table__
stmt = table.update().values({'foo': Stuff.foo + 1})
conn.execute(stmt)
conn.commit()
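For what it's worth, on SQLAlchemy 1.4/2.0 the same bulk UPDATE is usually written with the update() construct executed through the session; a minimal sketch assuming the Stuff model from above:

from sqlalchemy import update

stmt = update(Stuff).values(foo=Stuff.foo + 1)
session.execute(stmt)
session.commit()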
Here's an example of how to solve the same problem without having to map the fields manually:
from sqlalchemy import Column, ForeignKey, Integer, String, Date, DateTime, text, create_engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.orm.attributes import InstrumentedAttribute
engine = create_engine('postgresql://postgres@localhost:5432/database')
session = sessionmaker()
session.configure(bind=engine)
Base = declarative_base()
class Media(Base):
    __tablename__ = 'media'

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    slug = Column(String, nullable=False)
    type = Column(String, nullable=False)

    def update(self):
        s = session()
        mapped_values = {}
        for item in Media.__dict__.iteritems():
            field_name = item[0]
            field_type = item[1]
            is_column = isinstance(field_type, InstrumentedAttribute)
            if is_column:
                mapped_values[field_name] = getattr(self, field_name)
        s.query(Media).filter(Media.id == self.id).update(mapped_values)
        s.commit()
So to update a Media instance, you can do something like this:
media = Media(id=123, title="Titular Line", slug="titular-line", type="movie")
media.update()
If it is because of the overhead in terms of creating objects, then it probably can't be sped up at all with SA.
If it is because it is loading up related objects, then you might be able to do something with lazy loading. Are there lots of objects being created due to references? (i.e., getting a Company object also gets all of the related People objects).
Without testing, I'd try:
for c in session.query(Stuff).all():
    c.foo = c.foo + 1
session.commit()
(IIRC, commit() works without flush()).
I've found that at times doing a large query and then iterating in python can be up to 2 orders of magnitude faster than lots of queries. I assume that iterating over the query object is less efficient than iterating over a list generated by the all() method of the query object.
[Please note comment below - this did not speed things up at all].
Related
I am currently using the SQLAlchemy ORM to deal with my db operations. Now I have a SQL command which requires ON CONFLICT (id) DO UPDATE. The method on_conflict_do_update() seems to be the correct one to use. But the post here says the code has to switch to SQLAlchemy Core and the high-level ORM functionalities are missing. I am confused by this statement since I think code like the demo below can achieve what I want while keeping the functionality of the SQLAlchemy ORM.
class Foo(Base):
    ...
    bar = Column(Integer)

foo = Foo(bar=1)
insert_stmt = insert(Foo).values(bar=foo.bar)
do_update_stmt = insert_stmt.on_conflict_do_update(
    set_=dict(
        bar=insert_stmt.excluded.bar,
    )
)
session.execute(do_update_stmt)
I haven't tested it on my project since it will require a huge amount of modification. Can I ask if this is the correct way to deal with ON CONFLICT (id) DO UPDATE with SQLALchemy ORM?
As noted in the documentation, the constraint= argument is
The name of a unique or exclusion constraint on the table, or the constraint object itself if it has a .name attribute.
so we need to pass the name of the PK constraint to .on_conflict_do_update().
We can get the PK constraint name via the inspection interface:
from sqlalchemy import inspect
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

insp = inspect(engine)
pk_constraint_name = insp.get_pk_constraint(Foo.__tablename__)["name"]
print(pk_constraint_name)  # tbl_foo_pkey

new_bar = 123
insert_stmt = insert(Foo).values(id=12345, bar=new_bar)
do_update_stmt = insert_stmt.on_conflict_do_update(
    constraint=pk_constraint_name, set_=dict(bar=new_bar)
)
with Session(engine) as session, session.begin():
    session.execute(do_update_stmt)
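If looking up the constraint name is inconvenient, on_conflict_do_update() also accepts index_elements, a list of column names forming the conflict target; a minimal sketch assuming the primary key column is simply id:

do_update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=['id'], set_=dict(bar=new_bar)
)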
I have two tables, say A and B. Both have a primary key id. They have a many-to-many relationship, SEC.
SEC = Table('sec', Base.metadata,
    Column('a_id', Integer, ForeignKey('A.id'), primary_key=True, nullable=False),
    Column('b_id', Integer, ForeignKey('B.id'), primary_key=True, nullable=False)
)

class A(Base):
    ...
    id = Column(Integer, primary_key=True)
    ...
    rels = relationship('B', secondary=SEC)

class B(Base):
    ...
    id = Column(Integer, primary_key=True)
    ...
Let's consider this piece of code.
a = A()
b1 = B()
b2 = B()
a.rels = [b1, b2]
...
#some place later
b3 = B()
a.rels = [b1, b3] # errors sometimes
Sometimes, I get an error at the last line saying
duplicate key value violates unique constraint a_b_pkey
In my understanding, I think it tries to add (a.id, b.id) into the 'sec' table again, resulting in a unique constraint error. Is that what it is? If so, how can I avoid this? If not, why do I have this error?
The problem is that you want to make sure the instances you create are unique. We can create an alternate constructor that checks a cache of existing uncommitted instances, or queries the database for an existing committed instance, before returning a new instance.
Here is a demonstration of such a method:
from sqlalchemy import Column, Integer, String, ForeignKey, Table
from sqlalchemy.engine import create_engine
from sqlalchemy.ext.declarative.api import declarative_base
from sqlalchemy.orm import sessionmaker, relationship

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(engine)
Base = declarative_base(engine)
session = Session()

class Role(Base):
    __tablename__ = 'role'

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False, unique=True)

    @classmethod
    def get_unique(cls, name):
        # get the session cache, creating it if necessary
        cache = session._unique_cache = getattr(session, '_unique_cache', {})
        # create a key for memoizing
        key = (cls, name)
        # check the cache first
        o = cache.get(key)
        if o is None:
            # check the database if it's not in the cache
            o = session.query(cls).filter_by(name=name).first()
            if o is None:
                # create a new one if it's not in the database
                o = cls(name=name)
                session.add(o)
            # update the cache
            cache[key] = o
        return o
Base.metadata.create_all()
# demonstrate cache check
r1 = Role.get_unique('admin') # this is new
r2 = Role.get_unique('admin') # from cache
session.commit() # doesn't fail
# demonstrate database check
r1 = Role.get_unique('mod') # this is new
session.commit()
session._unique_cache.clear() # empty cache
r2 = Role.get_unique('mod') # from database
session.commit() # nop
# show final state
print session.query(Role).all() # two unique instances from four create calls
The get_unique method was inspired by the example from the SQLAlchemy wiki. This version is much less convoluted, favoring simplicity over flexibility. I have used it in production systems with no problems.
There are obviously improvements that can be added; this is just a simple example. The get_unique method could be inherited from a UniqueMixin, to be used for any number of models. More flexible memoizing of arguments could be implemented. This also puts aside the problem of multiple threads inserting conflicting data mentioned by Ants Aasma; handling that is more complex but should be an obvious extension. I leave that to you.
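As a rough illustration of the mixin idea, here is a sketch under the same assumptions as the example above (a single unique name column and a module-level session); UniqueMixin itself is not part of the original code:

class UniqueMixin(object):
    @classmethod
    def get_unique(cls, name):
        # identical logic to Role.get_unique above, just shared between models
        cache = session._unique_cache = getattr(session, '_unique_cache', {})
        key = (cls, name)
        o = cache.get(key)
        if o is None:
            o = session.query(cls).filter_by(name=name).first()
            if o is None:
                o = cls(name=name)
                session.add(o)
            cache[key] = o
        return o

class Role(UniqueMixin, Base):
    __tablename__ = 'role'

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False, unique=True)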
The error you mention is indeed from inserting a conflicting value into the sec table. To be sure that it comes from the operation you think it does, and not some earlier change, turn on SQL logging and check what values it is trying to insert before erroring out.
When overwriting a many-to-many collection value, SQLAlchemy compares the new contents of the collection with the state in the database and correspondingly issues delete and insert statements. Unless you are poking around in SQLAlchemy internals, there should be two ways to encounter this error.
The first is concurrent modification: Process 1 fetches the value a.rels and sees that it is empty; meanwhile Process 2 also fetches a.rels, sets it to [b1, b2] and commits, flushing the (a,b1),(a,b2) tuples. Process 1 then sets a.rels to [b1, b3], believing the previous contents were empty, and when it tries to flush the sec tuple (a,b1) it gets a duplicate key error. The correct action in such cases is usually to retry the transaction from the top. You can use serializable transaction isolation to instead get a serialization error in this case, which is distinct from a business-logic error that causes a duplicate key error.
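For reference, serializable isolation can be requested at the engine level; a minimal sketch (the URL is only a placeholder, and you still need to retry the transaction when a serialization failure occurs):

engine = create_engine('postgresql://user:pass@localhost/mydb',
                       isolation_level='SERIALIZABLE')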
The second case happens when you have managed to convince SQLAlchemy that it doesn't need to know the database state, by setting the loading strategy of the rels attribute to noload. This can be done when defining the relationship by adding the lazy='noload' parameter, or when querying, by calling .options(noload(A.rels)) on the query. SQLAlchemy will assume that the sec table has no matching rows for objects loaded with this strategy in effect.
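Concretely, those two forms look roughly like this (a sketch reusing the A/B models from the question):

from sqlalchemy.orm import relationship, noload

# on the relationship definition:
rels = relationship('B', secondary=SEC, lazy='noload')

# or per query:
a = session.query(A).options(noload(A.rels)).first()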
I have a SQLAlchemy ORM model that currently looks a bit like this:
Base = declarative_base()
class Database(Base):
    __tablename__ = "databases"
    __table_args__ = (
        saschema.PrimaryKeyConstraint('db', 'role'),
        {
            'schema': 'defines',
        },
    )

    db = Column(String, nullable=False)
    role = Column(String, nullable=False)
    server = Column(String)
Here's the thing: in practice this model exists in multiple databases, and in those databases it'll exist in multiple schemas. For any one operation I'll only use one (database, schema) tuple.
Right now, I can set the database engine using this:
Session = scoped_session(sessionmaker())
Session.configure(bind=my_db_engine)
# ... do my operations on the model here.
But I'm not sure how I can change the __table_args__ at execution time so that the schema will be the right one.
One option is to use the create_engine to bind the models to a schema/database rather than do so in the actual database.
# first connect to the database that holds the DB and schema
engine1 = create_engine('mysql://user:pass@db.com/schema1')
Session = sessionmaker()
session = Session(bind=engine1)

# fetch the first database
database = session.query(Database).first()

engine2 = create_engine('mysql://user:pass@%s/%s' % (database.server, database.db))
session2 = Session(bind=engine2)
I don't know that this is ideal, but it is one way to do it. If you cache the list of databases beforehand, then in most cases you only have to create one session.
I'm also looking for an elegant solution for this problem. If there are standard tables/models, but different table names, databases, and schemas at runtime, how is this handled? Others have suggested writing some sort of function that takes a tablename and schema argument, and constructs the model for you. I've found that using __abstract__ helps. I've suggested a solution here that may be useful. It involves adding a Base with a specific schema/metadata into the inheritance.
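For illustration, one way that can look is to build the declarative Base per schema at startup; this is only a sketch (model_base and the reduced column list are mine, with the composite primary key declared directly on the columns rather than in __table_args__):

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base

def model_base(schema):
    # build a declarative Base whose models all live in the given schema
    Base = declarative_base()

    class SchemaBase(Base):
        __abstract__ = True
        __table_args__ = {'schema': schema}

    return SchemaBase

# pick the schema at startup or from configuration
DefinesBase = model_base('defines')

class Database(DefinesBase):
    __tablename__ = 'databases'

    db = Column(String, primary_key=True)
    role = Column(String, primary_key=True)
    server = Column(String)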
Because of legacy data that is not available in the database but only in some external files, I want to create a SQLAlchemy object which contains data read from those files but is not written to the database when I execute session.flush().
My code looks like this:
try:
    return session.query(Phone).populate_existing().filter(Phone.mac == ident).one()
except:
    return self.createMockPhoneFromLicenseFile(ident)

def createMockPhoneFromLicenseFile(self, ident):
    # Some code to read necessary data from file deleted....
    phone = Phone()
    phone.mac = foo
    phone.data = bar
    phone.state = "Read from legacy file"
    phone.purchaseOrderPosition = self.getLegacyOrder(ident)
    # SQLAlchemy magic doesn't seem to work here, probably because we don't insert the created
    # phone object into the database. So we set the id fields manually.
    phone.order_id = phone.purchaseOrderPosition.order_id
    phone.order_position_id = phone.purchaseOrderPosition.order_position_id
    return phone
Everything works fine except that on a session.flush() executed later in the application, SQLAlchemy tries to write the created Phone object to the database (which fortunately doesn't succeed, because phone.state is longer than the data type allows), and that breaks the function issuing the flush.
Is there any way to prevent SQLAlchemy from trying to write such an object?
Update
While I didn't find anything on
using_mapper_options(save_on_init=False)
in the Elixir documentation (maybe you can provide a link?), it seemed to me worth a try (I would have preferred a way to prevent a single instance from being written instead of the whole entity).
At first it seemed that the statement had no effect, and I suspected my SQLAlchemy/Elixir versions of being too old, but then I found out that the connection to the PurchaseOrderPosition entity (which I didn't modify), made with
phone.purchaseOrderPosition = self.getLegacyOrder(ident)
causes the phone object to be written again. If I remove the statement, everything seems to be fine.
You need to do
import elixir
elixir.options_defaults['mapper_options'] = { 'save_on_init': False }
to prevent Entity instances which you instantiate being auto-added to the session. Ideally, this should be done as early as possible in your code. You can also do this on a per-entity basis, via using_mapper_options(save_on_init=False) - see the Elixir documentation for more details.
Update:
See this post on the Elixir mailing list indicating that this is the solution.
Also, as Ants Aasma points out, you can use cascade options on the Elixir relationship to set up cascade options in SQLAlchemy. See this page for more details.
Well, sqlalchemy doesn't, by default.
Consider the following self-contained example code.
from sqlalchemy import Column, Integer, Unicode, create_engine
from sqlalchemy.orm import create_session
from sqlalchemy.ext.declarative import declarative_base
e = create_engine('sqlite://')
Base = declarative_base(bind=e)
class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(Unicode(50))
# create the empty table and a session
Base.metadata.create_all()
s = create_session(bind=e, autoflush=False, autocommit=False)
# assert the table is empty
assert s.query(User).all() == []
# create a new User instance but don't save it to database:
u = User()
u.name = 'siebert'
# I could run s.add(u) here but I won't
s.flush()
s.commit()
# assert the table is still empty
assert s.query(User).all() == []
So I'm not sure what's implicitly adding your instances to the session. Normally you have to manually call s.add(u) to make it go to the session. I'm not familiar with Elixir, so perhaps this is some Elixir trickery... Maybe you could remove it from the session by using session.expunge().
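For reference, if an instance has somehow ended up in the session (as seems to be happening here via Elixir), taking it back out again is a one-liner, using the names from the question:

session.expunge(phone)  # phone becomes detached; the next flush won't INSERT it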
Old post, but I came across a similar issue; in my case, in SQLAlchemy, it was caused by cascading on backrefs:
http://docs.sqlalchemy.org/en/rel_0_7/orm/session.html#backref-cascade
Turn it off on your backrefs so that you have to explicitly add things to the session.
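A hedged sketch of what that looks like on a relationship, using the Phone/PurchaseOrderPosition names from the question (columns and foreign keys are omitted; cascade_backrefs is the pre-2.0 parameter and is gone in SQLAlchemy 2.0, where this is the default behaviour):

from sqlalchemy.orm import relationship, backref

class PurchaseOrderPosition(Base):
    # ... columns and the foreign key from Phone omitted ...

    # With cascade_backrefs=False, assigning
    #     phone.purchaseOrderPosition = some_position
    # no longer cascades the pending Phone into the session through the backref;
    # phones have to be session.add()-ed explicitly.
    phones = relationship('Phone',
                          backref=backref('purchaseOrderPosition'),
                          cascade_backrefs=False)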