Dynamic Datasets and SQLAlchemy


I am refactoring some old SQLite3 SQL statements in Python into SQLAlchemy. In our framework, we have the following SQL statements, which take in a dict with certain known keys and potentially any number of unexpected keys and values (depending on what information was provided).
import sqlite3
import sys
def dict_factory(cursor, row):
    d = {}
    for idx, col in enumerate(cursor.description):
        d[col[0]] = row[idx]
    return d
def Create_DB(db):
    # Delete the database
    from os import remove
    remove(db)
    # Recreate it and format it as needed
    with sqlite3.connect(db) as conn:
        conn.row_factory = dict_factory
        conn.text_factory = str
        cursor = conn.cursor()
        cursor.execute("CREATE TABLE [Listings] ([ID] INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL UNIQUE, [timestamp] REAL NOT NULL DEFAULT(( datetime ( 'now' , 'localtime' ) )), [make] VARCHAR, [model] VARCHAR, [year] INTEGER);")
def Add_Record(db, data):
    with sqlite3.connect(db) as conn:
        conn.row_factory = dict_factory
        conn.text_factory = str
        cursor = conn.cursor()
        #get column names already in table
        cursor.execute("SELECT * FROM 'Listings'")
        col_names = list(map(lambda x: x[0], cursor.description))
        #check if column doesn't exist in table, then add it
        for i in data.keys():
            if i not in col_names:
                cursor.execute("ALTER TABLE 'Listings' ADD COLUMN '{col}' {type}".format(col=i, type='INT' if type(data[i]) is int else 'VARCHAR'))
        #Insert record into table
        cursor.execute("INSERT INTO Listings({cols}) VALUES({vals});".format(cols = str(data.keys()).strip('[]'),
            vals=str([data[i] for i in data]).strip('[]')
            ))
#Database filename
db = 'test.db'
Create_DB(db)
data = {'make': 'Chevy',
'model' : 'Corvette',
'year' : 1964,
'price' : 50000,
'color' : 'blue',
'doors' : 2}
Add_Record(db, data)
data = {'make': 'Chevy',
'model' : 'Camaro',
'year' : 1967,
'price' : 62500,
'condition' : 'excellent'}
Add_Record(db, data)
This level of dynamism is necessary because there's no way we can know what additional information will be provided, but, regardless, it's important that we store all information provided to us. This has never been a problem in our framework, as we've never expected an unwieldy number of columns in our tables.
While the above code works, it's obviously not a clean implementation, which is why I'm trying to refactor it into SQLAlchemy's cleaner, more robust ORM paradigm. I started going through SQLAlchemy's official tutorials and various examples and have arrived at the following code:
from sqlalchemy import Column, String, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
class Listing(Base):
    __tablename__ = 'Listings'
    id = Column(Integer, primary_key=True)
    make = Column(String)
    model = Column(String)
    year = Column(Integer)
engine = create_engine('sqlite:///')
session = sessionmaker()
session.configure(bind=engine)
Base.metadata.create_all(engine)
data = {'make':'Chevy',
'model' : 'Corvette',
'year' : 1964}
record = Listing(**data)
s = session()
s.add(record)
s.commit()
s.close()
and it works beautifully with that data dict. Now, when I add a new keyword, such as
data = {'make':'Chevy',
'model' : 'Corvette',
'year' : 1964,
'price' : 50000}
I get a TypeError: 'price' is an invalid keyword argument for Listing error. To try and solve the issue, I modified the class to be dynamic, too:
class Listing(Base):
    __tablename__ = 'Listings'
    id = Column(Integer, primary_key=True)
    make = Column(String)
    model = Column(String)
    year = Column(Integer)
    def __checker__(self, data):
        for i in data.keys():
            if i not in [a for a in dir(self) if not a.startswith('__')]:
                if type(i) is int:
                    setattr(self, i, Column(Integer))
                else:
                    setattr(self, i, Column(String))
            else:
                self[i] = data[i]
But I quickly realized this would not work at all, for several reasons: the class was already initialized, the data dict cannot be fed into the class without reinitializing it, it's a hack more than anything, and so on. The more I think about it, the less obvious the solution using SQLAlchemy seems to me. So, my main question is: how do I implement this level of dynamism using SQLAlchemy?
I've researched a bit to see if anyone has had a similar issue. The closest I've found was Dynamic Class Creation in SQLAlchemy, but it only talks about the constant attributes ("tablename" et al.). I believe the unanswered https://stackoverflow.com/questions/29105206/sqlalchemy-dynamic-attribute-change may be asking the same question. While Python is not my forte, I consider myself a highly skilled programmer (C++ and JavaScript are my strongest languages) in the context of scientific/engineering applications, so I may not be hitting the correct Python-specific keywords in my searches.
I welcome any and all help.

class Listing(Base):
    __tablename__ = 'Listings'
    id = Column(Integer, primary_key=True)
    make = Column(String)
    model = Column(String)
    year = Column(Integer)
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            if hasattr(self, k):
                setattr(self, k, v)
            else:
                # unknown key: add the column to the table, then to the class
                engine.execute("ALTER TABLE %s ADD COLUMN %s" % (self.__tablename__, k))
                setattr(self.__class__, k, Column(k, String))
                setattr(self, k, v)
This might work ... maybe. I am not entirely sure, as I did not test it.
A better solution would be to use a relational table:
class Attribs(Base):
    __tablename__ = 'Attribs'
    id = Column(Integer, primary_key=True)
    listing_id = Column(Integer, ForeignKey("Listings.id"))
    name = Column(String)
    val = Column(String)
class Listing(Base):
    __tablename__ = 'Listings'
    id = Column(Integer, primary_key=True)
    attributes = relationship("Attribs", backref="listing")
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            # appending to the relationship lets SQLAlchemy fill in listing_id on flush
            self.attributes.append(Attribs(name=k, val=str(v)))
    def __str__(self):
        return "\n".join(["A LISTING"] + ["%s:%s" % (a.name, a.val) for a in self.attributes])
Another solution would be to store JSON:
import json
class Listing(Base):
    __tablename__ = 'Listings'
    id = Column(Integer, primary_key=True)
    data = Column(String)
    def __init__(self, **kwargs):
        self.data = json.dumps(kwargs)
        self.data_dict = kwargs
The best solution would be to use a NoSQL key-value store (maybe even just a simple JSON file, or perhaps shelve, or even pickle, I guess).
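A middle ground between the fixed-column model and the JSON-only model can be sketched as follows: keep the known keys as real columns and route everything else into one JSON column. This is only an illustration, not the answerer's code. The `extras` column name and the key-splitting constructor are assumptions, and it assumes SQLAlchemy 1.4 or newer with an in-memory SQLite database.

```python
from sqlalchemy import JSON, Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Listing(Base):
    __tablename__ = "Listings"

    id = Column(Integer, primary_key=True)
    make = Column(String)
    model = Column(String)
    year = Column(Integer)
    # Hypothetical catch-all column for keys that are not mapped columns.
    extras = Column(JSON, nullable=False, default=dict)

    def __init__(self, **kwargs):
        # Split the incoming dict into mapped columns and everything else.
        known = {c.name for c in self.__table__.columns} - {"id", "extras"}
        fixed = {k: v for k, v in kwargs.items() if k in known}
        extras = {k: v for k, v in kwargs.items() if k not in known}
        super().__init__(extras=extras, **fixed)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Listing(make="Chevy", model="Corvette", year=1964,
                        price=50000, color="blue"))
    session.add(Listing(make="Chevy", model="Camaro", year=1967,
                        condition="excellent"))
    session.commit()
    listings = session.query(Listing).order_by(Listing.id).all()
    summary = [(item.model, item.extras) for item in listings]
```

One caveat with this design: a plain JSON column does not notice in-place changes to the dict, so either assign a new dict to `extras` or wrap the column with `sqlalchemy.ext.mutable.MutableDict`.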

Related

SQLAlchemy - pass a dynamic tablename to query function?

I have a simple polling script that polls entries based on new ID's in a MSSQL table. I'm using SQLAlchemy's ORM to create a table class and then query that table. I want to be able to add more tables "dynamically" without coding it directly into the method.
My polling function:
def poll_db():
    query = db.query(
        Transactions.ID).order_by(Transactions.ID.desc()).limit(1)
    # Continually poll for new images to classify
    max_id_query = query
    last_max_id = max_id_query.scalar()
    while True:
        max_id = max_id_query.scalar()
        if max_id > last_max_id:
            print(
                f"New row(s) found. "
                f"Processing ids {last_max_id + 1} through {max_id}"
            )
            # Insert ML model
            id_query = db.query(Transactions).filter(
                Transactions.ID > last_max_id)
            df_from_query = pd.read_sql_query(
                id_query.statement, db.bind, index_col='ID')
            print(f"New query was made")
            last_max_id = max_id
        time.sleep(5)
My table model:
import sqlalchemy as db
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import defer, relationship, query
from database import SessionLocal, engine
insp = db.inspect(engine)
db_list = insp.get_schema_names()
Base = declarative_base(cls=BaseModel)
class Transactions(Base):
    __tablename__ = 'simulation_data'
    sender_account = db.Column('sender_account', db.BigInteger)
    recipient_account = db.Column('recipient_account', db.String)
    sender_name = db.Column('sender_name', db.String)
    recipient_name = db.Column('recipient_name', db.String)
    date = db.Column('date', db.DateTime)
    text = db.Column('text', db.String)
    amount = db.Column('amount', db.Float)
    currency = db.Column('currency', db.String)
    transaction_type = db.Column('transaction_type', db.String)
    fraud = db.Column('fraud', db.BigInteger)
    swift_bic = db.Column('swift_bic', db.String)
    recipient_country = db.Column('recipient_country', db.String)
    internal_external = db.Column('internal_external', db.String)
    ID = Column('ID', db.BigInteger, primary_key=True)
QUESTION
How can I pass the table class name "dynamically", along the lines of poll_db(tablename) with tablename='Transactions', instead of writing similar queries for multiple tables, such as:
query = db.query(Transactions.ID).order_by(Transactions.ID.desc()).limit(1)
query2 = db.query(Transactions2.ID).order_by(Transactions2.ID.desc()).limit(1)
query3 = db.query(Transactions3.ID).order_by(Transactions3.ID.desc()).limit(1)
The tables will have identical structure, but different data.

I can't give you a full example right now (will edit later) but here's one hacky way to do it (the documentation will probably be a better place to check):
def dynamic_table(tablename):
    for class_name, cls in Base._decl_class_registry.items():
        # not every registry entry is a model class, so look up __tablename__ defensively
        if getattr(cls, '__tablename__', None) == tablename:
            return cls
Transactions2 = dynamic_table("simulation_data")
assert Transactions2 is Transactions
The returned class is the model you want. Keep in mind that Base can only access the tables that have been subclassed already so if you have them in other modules you need to import them first so they are registered as Base's subclasses.
For selecting columns, something like this should work:
def dynamic_table_with_columns(tablename, *columns):
    cls = dynamic_table(tablename)
    subset = []
    for col_name in columns:
        # skip names that are not attributes of the model instead of raising
        column = getattr(cls, col_name, None)
        if column is not None:
            subset.append(column)
    # in case no columns were given
    if not subset:
        return db.query(cls)
    return db.query(*subset)
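The lookup above relies on the private `_decl_class_registry` attribute, which to my knowledge is no longer present on the declarative base from SQLAlchemy 1.4 onward. A rough equivalent using the public `Base.registry.mappers` collection, plus a query helper that takes the model class as a parameter, might look like this. The two small models, `model_for_table` and `latest_id` are made up for the illustration, and it assumes SQLAlchemy 1.4 or newer.

```python
from sqlalchemy import BigInteger, Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Transactions(Base):
    __tablename__ = "simulation_data"
    ID = Column(BigInteger, primary_key=True)
    text = Column(String)


class Transactions2(Base):
    __tablename__ = "simulation_data_2"
    ID = Column(BigInteger, primary_key=True)
    text = Column(String)


def model_for_table(tablename):
    """Return the mapped class whose table is named `tablename`."""
    for mapper in Base.registry.mappers:
        if mapper.local_table.name == tablename:
            return mapper.class_
    raise LookupError(f"no mapped class for table {tablename!r}")


def latest_id(session, model):
    """Highest ID currently stored for the given model class."""
    return session.query(model.ID).order_by(model.ID.desc()).limit(1).scalar()


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Transactions(ID=1, text="a"),
        Transactions(ID=2, text="b"),
        Transactions2(ID=7, text="c"),
    ])
    session.commit()
    latest = {
        name: latest_id(session, model_for_table(name))
        for name in ("simulation_data", "simulation_data_2")
    }
```

With the model passed in as an argument, a polling function only needs one query definition regardless of how many identically structured tables exist.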

Moving data from sqlalchemy to a pandas DataFrame

I am trying to load an SQLAlchemy query result into a pandas DataFrame.
When I do:
df = pd.DataFrame(LPRRank.query.all())
I get
>>> df
0 <M. Misty || 1 || 18>
1 <P. Patch || 2 || 18>
...
...
But, what I want is each column in the database to be a column in the dataframe:
0 M. Misty 1 18
1 P. Patch 2 18
...
...
and when I try:
dff = pd.read_sql_query(LPRRank.query.all(), db.session())
I get an Attribute Error:
AttributeError: 'SignallingSession' object has no attribute 'cursor'
and
dff = pd.read_sql_query(LPRRank.query.all(), db.session)
also gives an error:
AttributeError: 'scoped_session' object has no attribute 'cursor'
What I'm using to generate the list of objects is:
app = Flask(__name__)
db = SQLAlchemy(app)
class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False)
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))
    def __repr__(self):
        return '<{} || {} || {}>'.format(self.candid,
                                         self.rank, self.user_id)
This question:
How to convert SQL Query result to PANDAS Data Structure?
is error free, but gives each row as an object, which is not what I want. I can access the individual columns in the returned object, but it seems like there should be a better way to do it.
The documentation at pandas.pydata.org is great if you already understand what is going on and just need to review syntax. The documentation from April 20, 2016 (the 1319 page pdf) identifies a pandas connection as still experimental on p.872.
Now, SQLALCHEMY/PANDAS - SQLAlchemy reading column as CLOB for Pandas to_sql is about specifying the SQL type. Mine is SQLAlchemy which is the default.
And, sqlalchemy pandas to_sql OperationalError, Writing to MySQL database with pandas using SQLAlchemy, to_sql, and SQLAlchemy/pandas to_sql for SQLServer -- CREATE TABLE in master db are about writing to the SQL database, which produces an operational error, a database error, and a 'create table' error, none of which is my problem.
This one, SQLAlchemy Pandas read_sql from jsonb, wants to turn a jsonb attribute into columns: not my cup o' tea.
This previous question SQLAlchemy ORM conversion to pandas DataFrame addresses my issue but the solution: using query.session.bind is not my solution. I'm opening /closing sessions with db.session.add(), and db.session.commit(), but when I use db.session.bind as specified in the second answer here, then I get an Attribute Error:
AttributeError: 'list' object has no attribute '_execute_on_connection'

Simply add an __init__ method to your model and query the class before building the DataFrame. Specifically, the code below creates an iterable of tuples that pandas.DataFrame() binds into columns.
class LPRRank(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    candid = db.Column(db.String(40), index=True, unique=False)
    rank = db.Column(db.Integer, index=True, unique=False)
    user_id = db.Column(db.Integer, db.ForeignKey('lprvote.id'))
    def __init__(self, candid=None, rank=None, user_id=None):
        self.data = (candid, rank, user_id)
    def __repr__(self):
        # __repr__ must return a string, not a tuple
        return str((self.candid, self.rank, self.user_id))
data = db.session.query(LPRRank).all()
df = pd.DataFrame([(d.candid, d.rank, d.user_id) for d in data],
                  columns=['candid', 'rank', 'user_id'])
Alternatively, use the SQLAlchemy ORM based on your defined Model class, LPRRank, to run read_sql:
df = pd.read_sql(sql = db.session.query(LPRRank)\
.with_entities(LPRRank.candid,
LPRRank.rank,
LPRRank.user_id).statement,
con = db.session.bind)
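The reason the question's attempts fail is that `query.all()` returns a list of model objects, while pandas wants either a SQL statement plus a connectable or plain row data. As a self-contained sketch of the "select the columns, not the objects" idea, the following uses plain declarative SQLAlchemy (1.4 or newer) with in-memory SQLite instead of Flask-SQLAlchemy; the table name and sample rows are assumptions for the demo.

```python
import pandas as pd
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class LPRRank(Base):
    __tablename__ = "lpr_rank"
    id = Column(Integer, primary_key=True)
    candid = Column(String(40), index=True)
    rank = Column(Integer, index=True)
    user_id = Column(Integer)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        LPRRank(candid="M. Misty", rank=1, user_id=18),
        LPRRank(candid="P. Patch", rank=2, user_id=18),
    ])
    session.commit()

    # Selecting columns yields tuple-like rows rather than LPRRank objects.
    stmt = select(LPRRank.candid, LPRRank.rank, LPRRank.user_id).order_by(LPRRank.rank)
    result = session.execute(stmt)
    columns = list(result.keys())
    df = pd.DataFrame([tuple(row) for row in result], columns=columns)
```

Building the frame from rows avoids depending on how a particular pandas version accepts SQLAlchemy connectables in `read_sql`.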

The Parfait answer is good but could have two problems:
Efficiency: each object creation implies duplicating data into a DataFrame, so a list of DataFrames could take time to create.
It does not mirror a DataFrame as a collection of rows.
Thus the example below provides a parent class that corresponds to a DataFrame and a child class that corresponds to a row of that DataFrame.
The code below provides two ways to get a DataFrame; the DataFrame object is created only on demand so as not to waste CPU and memory.
If the DataFrame is needed at creation time, you only have to add a constructor (def __init__(self, rows: List[MyDataFrameRow] = None)...), create a new attribute, and assign it the result of self.data_frame.
from typing import Tuple
from pandas import DataFrame, read_sql
from sqlalchemy import Column, Integer, String, Float, ForeignKey
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, Session
Base = declarative_base()
class MyDataFrame(Base):
    __tablename__ = 'my_data_frame'
    id = Column(Integer, primary_key=True)
    rows = relationship('MyDataFrameRow', cascade='all,delete')
    @property
    def data_frame(self) -> DataFrame:
        columns = MyDataFrameRow.data_frame_columns()
        return DataFrame([[getattr(row, column) for column in columns] for row in self.rows],
                         columns=columns)
    @staticmethod
    def to_data_frame(identifier: int, session: Session) -> DataFrame:
        query = session.query(MyDataFrameRow).join(MyDataFrame).filter(MyDataFrame.id == identifier)
        return read_sql(query.statement, session.get_bind())
class MyDataFrameRow(Base):
    __tablename__ = 'my_data_row'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    age = Column(Integer)
    number_of_children = Column(Integer)
    height = Column(Integer)
    parent_id = Column(Integer, ForeignKey('my_data_frame.id'))
    @staticmethod
    def data_frame_columns() -> Tuple[str, ...]:
        # data columns only: skip the primary key and foreign key columns
        return tuple(column.name for column in MyDataFrameRow.__table__.columns if len(column.foreign_keys) == 0
                     and column.primary_key is False)
...
session = Session(...)
df1 = MyDataFrame.to_data_frame(1, session)
my_table_obj = session.query(MyDataFrame).filter(MyDataFrame.id == 1).one()
df2 = my_table_obj.data_frame

I'm using flask-sqlalchemy with reflection to build my models but this worked for me:
import pandas as pd
from app.models import Runs
from app import db
def get_db_data_df():
    df_runs = pd.read_sql(Runs.__table__.name, con=db.get_engine(), index_col=None)
    return df_runs

Validating SQLAlchemy Fields

I have a dictionary, created by a programmatic process, that looks like
{'field1': 3, 'field2': 'TEST'}
I feed this dictionary into the model as its fields (for example: Model(**dict)).
I want to run a series of unit tests that determine whether the fields are of a valid data type.
How do I validate that these data types are valid for my database (MySQL) without having to do an insertion and rollback? That would introduce flakiness into my tests, as I would be interacting with an actual database, correct?

I do not have much experience with sqlalchemy but if you use data-types in Columns of your models, won't that work?
This link might help you : http://docs.sqlalchemy.org/en/rel_0_9/core/type_basics.html

Here's a rudimentary way to do what you asked
class Sample_Table(Base):
    __tablename__ = 'Sample_Table'
    __table_args__ = {'sqlite_autoincrement': True}
    id = Column(Integer, primary_key=True, nullable=False)
    col1 = Column(Integer)
    col2 = Column(Integer)
    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            col_type = str(self.__table__.c[k].type)
            try:
                if str(type(v).__name__) in col_type.lower():
                    setattr(self, k, v)
                else:
                    raise Exception("BAD COLUMN TYPE FOR COL " + k)
            except ValueError as e:
                print(e)
If you try to use the above to insert a record with a column type that is different than what you specified, it will throw an error, i.e. it will not perform an insertion and rollback.
To prove that this works, try the following full-working code:
from sqlalchemy import Column, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
Base = declarative_base()
class Sample_Table(Base):
    __tablename__ = 'Sample_Table'
    __table_args__ = {'sqlite_autoincrement': True}
    id = Column(Integer, primary_key=True, nullable=False)
    col1 = Column(Integer)
    col2 = Column(Integer)
    def __init__(self, **kwargs):
        for k,v in kwargs.items():
            col_type = str(self.__table__.c[k].type)
            try:
                if str(type(v).__name__) in col_type.lower():
                    setattr(self, k, v)
                else:
                    raise Exception("BAD COLUMN TYPE FOR COL " + k)
            except ValueError as e:
                print(e)
engine = create_engine('sqlite:///')
session = sessionmaker()
session.configure(bind=engine)
s = session()
Base.metadata.create_all(engine)
data = {"col1" : 1, "col2" : 2}
record = Sample_Table(**data)
s.add(record) #works
s.commit()
data = {"col1" : 1, "col2" : "2"}
record = Sample_Table(**data)
s.add(record) #doesn't work!
s.commit()
s.close()
(Even though I used SQLite, it will work for a MySQL database as well.)
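The string comparison between the Python type name and the column type name can be avoided, since SQLAlchemy column types expose a `python_type` attribute for the common cases. A possible unit-test-friendly checker that never touches a database could look like the sketch below. The `type_errors` helper and the model are hypothetical, it assumes SQLAlchemy 1.4 or newer, and some column types do not implement `python_type` at all, so treat it as a starting point only.

```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class SampleTable(Base):
    __tablename__ = "sample_table"
    id = Column(Integer, primary_key=True)
    field1 = Column(Integer)
    field2 = Column(String(20))


def type_errors(model, data):
    """Return a list of problems; an empty list means every value matches its column."""
    columns = model.__table__.columns
    problems = []
    for key, value in data.items():
        if key not in columns.keys():
            problems.append(f"{key}: no such column")
            continue
        expected = columns[key].type.python_type
        if value is None:
            continue  # NULL handling is left to the nullable constraint
        # bool is a subclass of int, so reject it explicitly for non-bool columns
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = type_errors(SampleTable, {"field1": 3, "field2": "TEST"})
bad = type_errors(SampleTable, {"field1": "3", "field2": "TEST", "field3": 1})
```

Because nothing is inserted, the check is independent of whether the target database is SQLite or MySQL.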

SQLAlchemy temporary table with Declarative Base

I need a temporary table in my programme. I have seen that this can be achieved with the "mapper" syntax in this way:
t = Table(
't', metadata,
Column('id', Integer, primary_key=True),
# ...
prefixes=['TEMPORARY'],
)
Seen here
But, my whole code is using the declarative base, it is what I understand, and I would like to stick to it. There is the possibility of using a hybrid approach but if possible I'd avoid it.
This is a simplified version of what my declarative class looks like:
import sqlalchemy as alc
class Tempo(Base):
    """
    Class for temporary table used to process data coming from xlsx
    @param Base Declarative Base
    """
    # TODO: make it completely temporary
    __tablename__ = 'tempo'
    drw = alc.Column(alc.String)
    date = alc.Column(alc.Date)
    check_number = alc.Column(alc.Integer)
Thanks in advance!
EDITED WITH THE NEW PROBLEMS:
Now the class looks like this:
import sqlalchemy as alc
class Tempo(Base):
    """
    Class for temporary table used to process data coming from xlsx
    @param Base Declarative Base
    """
    # TODO: make it completely temporary
    __tablename__ = 'tempo'
    __table_args__ = {'prefixes': ['TEMPORARY']}
    drw = alc.Column(alc.String)
    date = alc.Column(alc.Date)
    check_number = alc.Column(alc.Integer)
And when I try to insert data in this table, I get the following error message:
sqlalchemy.exc.OperationalError: (OperationalError) no such table:
tempo u'INSERT INTO tempo (...) VALUES (?, ?, ?, ?, ?, ?, ?, ?)' (....)
It seems the table doesn't exist just by declaring it. I have seen something like create_all() that might be the solution for this (it's funny to see how new ideas come while explaining thoroughly)
Then again, thank you very much!

Is it possible to use __table_args__? See https://docs.sqlalchemy.org/en/14/orm/declarative_tables.html#orm-declarative-table-configuration
class Tempo(Base):
    """
    Class for temporary table used to process data coming from xlsx
    @param Base Declarative Base
    """
    # TODO: make it completely temporary
    __tablename__ = 'tempo'
    __table_args__ = {'prefixes': ['TEMPORARY']}
    drw = alc.Column(alc.String)
    date = alc.Column(alc.Date)
    check_number = alc.Column(alc.Integer)
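The "no such table" error in the question comes from the table never being created, and with temporary tables there is a second catch: they are visible only on the connection that created them. A minimal sketch that creates the table and uses it on one and the same connection is shown below. It assumes SQLAlchemy 1.4 or newer and in-memory SQLite, and the `id` primary key is an addition of mine, since a declarative class needs a primary key to be mapped.

```python
import datetime

from sqlalchemy import Column, Date, Integer, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Tempo(Base):
    __tablename__ = "tempo"
    __table_args__ = {"prefixes": ["TEMPORARY"]}  # emits CREATE TEMPORARY TABLE

    id = Column(Integer, primary_key=True)  # assumed; not in the original class
    drw = Column(String)
    date = Column(Date)
    check_number = Column(Integer)


engine = create_engine("sqlite://")

with engine.begin() as conn:
    # Create the table on this connection only, instead of create_all(engine).
    Tempo.__table__.create(conn)

    # Bind the session to the same connection so it can see the temp table.
    session = Session(bind=conn)
    session.add(Tempo(drw="A-1", date=datetime.date(2020, 1, 2), check_number=7))
    session.flush()
    row_count = session.query(Tempo).count()

    temp_tables = [
        r[0] for r in conn.execute(
            text("SELECT name FROM sqlite_temp_master WHERE type = 'table'"))
    ]
    persistent_tables = [
        r[0] for r in conn.execute(
            text("SELECT name FROM sqlite_master WHERE type = 'table'"))
    ]
    session.close()
```

With a pooled engine, a session that checks out a different connection would not see `tempo`, which is why the sketch pins everything to `conn`.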

Old question, but if anyone out there wants to create a temp table from an existing declarative table model on the fly rather than having it always be a part of your model/code, you can try the following approach. Copying __table_args__ is a little tricky since it can have multiple formats and any Index objects need to be recreated so they aren't associated with the old table.
import time
import sqlalchemy as sa
from sqlalchemy.schema import CreateTable
def copy_table_args(model, **kwargs):
    """Try to copy existing __table_args__, override params with kwargs"""
    table_args = model.__table_args__
    if isinstance(table_args, tuple):
        new_args = []
        for arg in table_args:
            if isinstance(arg, dict):
                table_args_dict = arg.copy()
                table_args_dict.update(**kwargs)
                # append the updated copy, not the original dict
                new_args.append(table_args_dict)
            elif isinstance(arg, sa.Index):
                index = sa.Index(
                    arg.name,
                    *[col for col in arg.columns.keys()],
                    unique=arg.unique,
                    **arg.kwargs,
                )
                new_args.append(index)
            else:
                # TODO: need to handle Constraints
                raise Exception(f"Unhandled table arg: {arg}")
        table_args = tuple(new_args)
    elif isinstance(table_args, dict):
        table_args = {
            k: (v.copy() if hasattr(v, "copy") else v) for k, v in table_args.items()
        }
        table_args.update(**kwargs)
    else:
        raise Exception(f"Unexpected __table_args__ type: {table_args}")
    return table_args
def copy_table_from_model(conn, model, **kwargs):
    model_name = model.__name__ + "Tmp"
    table_name = model.__table__.name + "_" + str(time.time()).replace(".", "_")
    # forward kwargs (e.g. prefixes) so they actually reach __table_args__
    table_args = copy_table_args(model, extend_existing=True, **kwargs)
    args = {c.name: c.copy() for c in model.__table__.c}
    args["__tablename__"] = table_name
    args["__table_args__"] = table_args
    copy_model = type(model_name, model.__bases__, args)
    print(str(CreateTable(copy_model.__table__)))
    copy_model.__table__.create(conn)
    return copy_model
def temp_table_from_model(conn, model, **kwargs):
    return copy_table_from_model(conn, model, prefixes=["TEMPORARY"])
Note: I haven't added logic to handle copying Constraints, and this is lightly tested against MySQL. Also note that if you do this with non-temporary tables and auto-named indexes (i.e. Column(..., index=True)) then this may not play nice with alembic.

How to do an upsert with SqlAlchemy?

I have a record that I want to exist in the database if it is not there, and if it is there already (primary key exists) I want the fields to be updated to the current state. This is often called an upsert.
The following incomplete code snippet demonstrates what will work, but it seems excessively clunky (especially if there were a lot more columns). What is the better/best way?
Base = declarative_base()
class Template(Base):
    __tablename__ = 'templates'
    id = Column(Integer, primary_key = True)
    name = Column(String(80), unique = True, index = True)
    template = Column(String(80), unique = True)
    description = Column(String(200))
    def __init__(self, Name, Template, Desc):
        self.name = Name
        self.template = Template
        self.description = Desc
def UpsertDefaultTemplate():
    sess = Session()
    desired_default = Template("default", "AABBCC", "This is the default template")
    try:
        q = sess.query(Template).filter_by(name = desired_default.name)
        existing_default = q.one()
    except sqlalchemy.orm.exc.NoResultFound:
        #default does not exist yet, so add it...
        sess.add(desired_default)
    else:
        #default already exists. Make sure the values are what we want...
        assert isinstance(existing_default, Template)
        existing_default.name = desired_default.name
        existing_default.template = desired_default.template
        existing_default.description = desired_default.description
    sess.flush()
Is there a better or less verbose way of doing this? Something like this would be great:
sess.upsert_this(desired_default, unique_key = "name")
although the unique_key kwarg is obviously unnecessary (the ORM should be able to easily figure this out) I added it just because SQLAlchemy tends to only work with the primary key. eg: I've been looking at whether Session.merge would be applicable, but this works only on primary key, which in this case is an autoincrementing id which is not terribly useful for this purpose.
A sample use case for this is simply when starting up a server application that may have upgraded its default expected data. ie: no concurrency concerns for this upsert.

SQLAlchemy supports ON CONFLICT with two methods on_conflict_do_update() and on_conflict_do_nothing().
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
    index_elements=[my_table.c.user_email],
    index_where=my_table.c.user_email.like('%@gmail.com'),
    set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
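Applied to the question's Template model, the same construct can target the unique name column instead of the autoincrementing primary key. The sketch below swaps in the SQLite dialect's `insert` so it runs without a PostgreSQL server. It assumes SQLAlchemy 1.4 or newer and an SQLite library of version 3.24 or later, and `upsert_template` is a hypothetical helper name.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Template(Base):
    __tablename__ = "templates"
    id = Column(Integer, primary_key=True)
    name = Column(String(80), unique=True, index=True)
    template = Column(String(80), unique=True)
    description = Column(String(200))


def upsert_template(session, name, template, description):
    """Insert a template, or update the row that already has this name."""
    table = Template.__table__
    stmt = insert(table).values(name=name, template=template, description=description)
    stmt = stmt.on_conflict_do_update(
        index_elements=[table.c.name],  # conflict target: the unique name
        set_={
            "template": stmt.excluded.template,
            "description": stmt.excluded.description,
        },
    )
    session.execute(stmt)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    upsert_template(session, "default", "AABBCC", "This is the default template")
    upsert_template(session, "default", "DDEEFF", "Updated default template")
    session.commit()
    rows = [tuple(r) for r in session.query(
        Template.name, Template.template, Template.description).all()]
```

The second call updates the existing row rather than inserting a duplicate, so the autoincrementing id never has to be known by the caller.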

SQLAlchemy does have a "save-or-update" behavior, which in recent versions has been built into session.add, but previously was the separate session.saveorupdate call. This is not an "upsert" but it may be good enough for your needs.
It is good that you are asking about a class with multiple unique keys; I believe this is precisely the reason there is no single correct way to do this. The primary key is also a unique key. If there were no unique constraints, only the primary key, it would be a simple enough problem: if nothing with the given ID exists, or if ID is None, create a new record; else update all other fields in the existing record with that primary key.
However, when there are additional unique constraints, there are logical issues with that simple approach. If you want to "upsert" an object, and the primary key of your object matches an existing record, but another unique column matches a different record, then what do you do? Similarly, if the primary key matches no existing record, but another unique column does match an existing record, then what? There may be a correct answer for your particular situation, but in general I would argue there is no single correct answer.
That would be the reason there is no built in "upsert" operation. The application must define what this means in each particular case.

Nowadays, SQLAlchemy provides two helpful functions, on_conflict_do_nothing and on_conflict_do_update. Those functions are useful but require you to switch from the ORM interface to the lower-level one, SQLAlchemy Core.
Although those two functions make upserting using SQLAlchemy's syntax not that difficult, these functions are far from providing a complete out-of-the-box solution to upserting.
My common use case is to upsert a big chunk of rows in a single SQL query/session execution. I usually encounter two problems with upserting:
For example, higher level ORM functionalities we've gotten used to are missing. You cannot use ORM objects but instead have to provide ForeignKeys at the time of insertion.
I'm using this following function I wrote to handle both of those issues:
from sqlalchemy import UniqueConstraint, inspect
from sqlalchemy.dialects import postgresql
def upsert(session, model, rows):
    table = model.__table__
    stmt = postgresql.insert(table)
    primary_keys = [key.name for key in inspect(table).primary_key]
    update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
    if not update_dict:
        raise ValueError("insert_or_update resulted in an empty update_dict")
    stmt = stmt.on_conflict_do_update(index_elements=primary_keys,
                                      set_=update_dict)
    seen = set()
    foreign_keys = {col.name: list(col.foreign_keys)[0].column for col in table.columns if col.foreign_keys}
    unique_constraints = [c for c in table.constraints if isinstance(c, UniqueConstraint)]
    def handle_foreignkeys_constraints(row):
        # replace ORM objects with their foreign key values
        for c_name, c_value in foreign_keys.items():
            foreign_obj = row.pop(c_value.table.name, None)
            row[c_name] = getattr(foreign_obj, c_value.name) if foreign_obj else None
        # drop rows that would collide on a unique constraint within this batch
        for const in unique_constraints:
            unique = tuple([const,] + [row[col.name] for col in const.columns])
            if unique in seen:
                return None
            seen.add(unique)
        return row
    rows = list(filter(None, (handle_foreignkeys_constraints(row) for row in rows)))
    session.execute(stmt, rows)

I use a "look before you leap" approach:
# first get the object from the database if it exists
# we're guaranteed to only get one or zero results
# because we're filtering by primary key
switch_command = session.query(Switch_Command).\
filter(Switch_Command.switch_id == switch.id).\
filter(Switch_Command.command_id == command.id).first()
# If we didn't get anything, make one
if not switch_command:
    switch_command = Switch_Command(switch_id=switch.id, command_id=command.id)
# update the stuff we care about
switch_command.output = 'Hooray!'
switch_command.lastseen = datetime.datetime.utcnow()
session.add(switch_command)
# This will generate either an INSERT or UPDATE
# depending on whether we have a new object or not
session.commit()
The advantage is that this is db-neutral and I think it's clear to read. The disadvantage is that there's a potential race condition in a scenario like the following:
we query the db for a switch_command and don't find one
we create a switch_command
another process or thread creates a switch_command with the same primary key as ours
we try to commit our switch_command

There are multiple answers and here comes yet another answer (YAA). Other answers are not that readable due to the metaprogramming involved. Here is an example that
Uses SQLAlchemy ORM
Shows how to create a row if there are zero rows using on_conflict_do_nothing
Shows how to update the existing row (if any) without creating a new row using on_conflict_do_update
Uses the table primary key as the constraint
A longer example is in the original question that this code relates to.
import datetime
import sqlalchemy as sa
import sqlalchemy.orm as orm
from sqlalchemy import text
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session
class PairState(Base):
    __tablename__ = "pair_state"
    # This table has 1-to-1 relationship with Pair
    pair_id = sa.Column(sa.ForeignKey("pair.id"), nullable=False, primary_key=True, unique=True)
    pair = orm.relationship(Pair,
                            backref=orm.backref("pair_state",
                                                lazy="dynamic",
                                                cascade="all, delete-orphan",
                                                single_parent=True, ), )
    # First raw event in data stream
    first_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))
    # Last raw event in data stream
    last_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))
    # The last hypertable entry added
    last_interval_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))
    @staticmethod
    def create_first_event_if_not_exist(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Sets the first event value if not exist yet."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, first_event_at=ts).
            on_conflict_do_nothing()
        )
    @staticmethod
    def update_last_event(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_event_at for a named pair."""
        # Based on the original example of https://stackoverflow.com/a/49917004/315168
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_event_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_event_at": ts})
        )
    @staticmethod
    def update_last_interval(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_interval_at for a named pair."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_interval_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_interval_at": ts})
        )

The code below works fine for me with a Redshift database and will also work with a composite primary key constraint.
SOURCE: this
Only a few modifications were required to create the SQLAlchemy engine in the function start_engine().
import os
from sqlalchemy import Column, Integer, Date, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.dialects import postgresql
Base = declarative_base()
def start_engine():
    engine = create_engine(os.getenv('SQLALCHEMY_URI',
                                     'postgresql://localhost:5432/upsert'))
    # Reflect the existing tables so the schema is known up front
    meta = MetaData()
    meta.reflect(bind=engine)
    return engine
class DigitalSpend(Base):
    __tablename__ = 'digital_spend'
    report_date = Column(Date, nullable=False)
    day = Column(Date, nullable=False, primary_key=True)
    impressions = Column(Integer)
    conversions = Column(Integer)
    def __repr__(self):
        return str([getattr(self, c.name, None) for c in self.__table__.c])
def compile_query(query):
    # Accept either a Core statement or an ORM Query (which exposes .statement)
    compiler = query.compile if not hasattr(query, 'statement') else query.statement.compile
    return compiler(dialect=postgresql.dialect())
def upsert(session, model, rows, as_of_date_col='report_date', no_update_cols=()):
    table = model.__table__
    stmt = insert(table).values(rows)
    # Update every non-key column except those listed in no_update_cols
    update_cols = [c.name for c in table.c
                   if c not in list(table.primary_key.columns)
                   and c.name not in no_update_cols]
    on_conflict_stmt = stmt.on_conflict_do_update(
        index_elements=table.primary_key.columns,
        set_={k: getattr(stmt.excluded, k) for k in update_cols},
        index_where=(getattr(model, as_of_date_col) < getattr(stmt.excluded, as_of_date_col))
    )
    print(compile_query(on_conflict_stmt))
    session.execute(on_conflict_stmt)
# start_engine() returns an Engine, so build a Session from it before calling upsert
session = sessionmaker(bind=start_engine())()
# initial_rows is assumed to be a list of dicts keyed by column name
upsert(session, DigitalSpend, initial_rows, no_update_cols=['conversions'])
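
To see what a conditional upsert like the one above is meant to do without a Postgres or Redshift server, here is a minimal sketch that uses the standard-library sqlite3 module instead of SQLAlchemy (SQLite 3.24+ accepts the same INSERT ... ON CONFLICT ... DO UPDATE form). The table mirrors digital_spend; the helper name upsert_spend and the sample rows are made up for illustration. It assumes the intent is "only overwrite when the incoming report_date is newer, and never touch conversions". As far as I can tell from the SQLAlchemy docs, that kind of condition is attached to DO UPDATE through the where argument of on_conflict_do_update, while index_where is described as the predicate used to infer a partial index, so it may be worth checking the printed SQL.

```python
import sqlite3

def upsert_spend(conn, rows):
    """Insert rows; on a duplicate day, update only if the incoming report is newer."""
    conn.executemany(
        """
        INSERT INTO digital_spend (report_date, day, impressions, conversions)
        VALUES (:report_date, :day, :impressions, :conversions)
        ON CONFLICT (day) DO UPDATE SET
            report_date = excluded.report_date,
            impressions = excluded.impressions
        WHERE digital_spend.report_date < excluded.report_date
        """,
        rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE digital_spend ("
    "report_date TEXT NOT NULL, day TEXT PRIMARY KEY, "
    "impressions INTEGER, conversions INTEGER)"
)

# First report for the day: plain insert
upsert_spend(conn, [{"report_date": "2018-01-02", "day": "2018-01-01",
                     "impressions": 10, "conversions": 1}])
# Newer report for the same day: impressions are updated, conversions are left alone
upsert_spend(conn, [{"report_date": "2018-01-03", "day": "2018-01-01",
                     "impressions": 20, "conversions": 9}])
# Older report for the same day: the WHERE clause turns the update into a no-op
upsert_spend(conn, [{"report_date": "2018-01-01", "day": "2018-01-01",
                     "impressions": 99, "conversions": 9}])

result = conn.execute(
    "SELECT report_date, day, impressions, conversions FROM digital_spend"
).fetchall()
print(result)
```

Here excluded refers to the row that failed to insert, which is the same role stmt.excluded plays in the SQLAlchemy statement.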

This allows access to the underlying models based on string names
def get_class_by_tablename(tablename):
    """Return class reference mapped to table.
    https://stackoverflow.com/questions/11668355/sqlalchemy-get-model-from-table-name-this-may-imply-appending-some-function-to
    :param tablename: String with name of table.
    :return: Class reference or None.
    """
    # Note: _decl_class_registry is a private attribute and may not exist in newer SQLAlchemy versions
    for c in Base._decl_class_registry.values():
        if hasattr(c, '__tablename__') and c.__tablename__ == tablename:
            return c
sqla_tbl = get_class_by_tablename(table_name)
def handle_upsert(session, record_dict, table):
    """
    Handles updates when there are primary key conflicts
    """
    try:
        session.add(table(**record_dict))
        session.flush()  # send the INSERT now so a key conflict surfaces here
    except Exception:
        # Here we'll assume the error is caused by an integrity error
        # We do this because the error classes are passed from the
        # underlying package (pyodbc / sqlite); SQLAlchemy doesn't mask
        # them with its own code - this should be updated to have
        # explicit error handling for each new db engine
        # TODO: add explicit error handling for each db engine
        session.rollback()
        # Query for the conflicting record, then use setattr to change values based on the dict
        c_tbl_primary_keys = [i.name for i in table.__table__.primary_key]  # List of primary key col names
        c_tbl_cols = dict(table.__table__.columns)  # String:Col Object crosswalk
        c_query_dict = {k: record_dict[k] for k in c_tbl_primary_keys if k in record_dict}  # sub-dict from data of primary key:values
        c_oo_query_dict = {c_tbl_cols[k]: v for (k, v) in c_query_dict.items()}  # col-object:query value for primary key cols
        c_target_record = session.query(table).filter(*[k == v for (k, v) in c_oo_query_dict.items()]).first()
        # apply new data values to the existing record
        for k, v in record_dict.items():
            setattr(c_target_record, k, v)

This works for me with sqlite3 and postgres, although it might fail with composite primary key constraints and will most likely fail with additional unique constraints.
try:
    t = self._meta.tables[data['table']]
except KeyError:
    self._log.error('table "%s" unknown', data['table'])
    return
try:
    # First attempt a plain INSERT
    q = insert(t).values(data['values'])
    self._log.debug(q)
    self._db.execute(q)
except IntegrityError:
    # The row already exists: update the non-key columns, matching on the primary key
    self._log.warning('integrity error')
    where_clause = [c == data['values'][c.name] for c in t.c if c.primary_key]
    update_dict = {c.name: data['values'][c.name] for c in t.c if not c.primary_key}
    q = update(t).values(update_dict).where(*where_clause)
    self._log.debug(q)
    self._db.execute(q)
except Exception as e:
    self._log.error('%s: %s', t.name, e)
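
The caveat about additional unique constraints can be made concrete. Below is a minimal sketch of the same insert-then-update fallback, written against the standard-library sqlite3 module rather than SQLAlchemy; the users table, its columns, and the insert_or_update helper are hypothetical. When the IntegrityError comes from a unique column other than the primary key, the fallback UPDATE filters on a primary key that does not exist yet and silently changes nothing.

```python
import sqlite3

def insert_or_update(conn, values):
    """Try INSERT; on IntegrityError fall back to UPDATE by primary key.

    Returns the number of rows written by whichever statement ran last.
    """
    try:
        cur = conn.execute(
            "INSERT INTO users (id, email, name) VALUES (:id, :email, :name)", values
        )
    except sqlite3.IntegrityError:
        cur = conn.execute(
            "UPDATE users SET email = :email, name = :name WHERE id = :id", values
        )
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE, name TEXT)")

# New row: the INSERT succeeds
first = insert_or_update(conn, {"id": 1, "email": "a@example.com", "name": "Ann"})
# Same primary key: the INSERT fails, the UPDATE finds the row
same_pk = insert_or_update(conn, {"id": 1, "email": "a@example.com", "name": "Anna"})
# New primary key but duplicate email: the INSERT fails, the UPDATE matches no row
other_unique = insert_or_update(conn, {"id": 2, "email": "a@example.com", "name": "Bob"})

rows = conn.execute("SELECT id, email, name FROM users").fetchall()
print(first, same_pk, other_unique, rows)
```

Checking the row count of the fallback UPDATE is one cheap way to notice that the record was neither inserted nor updated.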

We had problems with generated default IDs and references, which led to ForeignKeyViolation errors like
update or delete on table "..." violates foreign key constraint
Key (id)=(...) is still referenced from table "...".
so we had to exclude the id from the update dict; otherwise it is always regenerated as a new default value.
In addition, the method returns the created/updated entity.
from sqlalchemy import select
from sqlalchemy.dialects.postgresql import insert  # Important to use the postgresql insert
def upsert(session, data, key_columns, model):
    stmt = insert(model).values(data)
    # Important to exclude the ID for update!
    exclude_for_update = [model.id.name, *key_columns]
    update_dict = {c.name: c for c in stmt.excluded if c.name not in exclude_for_update}
    stmt = stmt.on_conflict_do_update(
        index_elements=key_columns,
        set_=update_dict
    ).returning(model)
    # Wrap the INSERT so the ORM returns a (refreshed) model instance
    orm_stmt = (
        select(model)
        .from_statement(stmt)
        .execution_options(populate_existing=True)
    )
    return session.execute(orm_stmt).scalar()
Example:
class UpsertUser(Base):
    __tablename__ = 'upsert_user'
    id = Column(Id, primary_key=True, default=uuid.uuid4)
    name: str = Column(sa.String, nullable=False)
    user_sid: str = Column(sa.String, nullable=False, unique=True)
    house_admin = relationship('UpsertHouse', back_populates='admin', uselist=False)
class UpsertHouse(Base):
    __tablename__ = 'upsert_house'
    id = Column(Id, primary_key=True, default=uuid.uuid4)
    admin_id: Id = Column(Id, ForeignKey('upsert_user.id'), nullable=False)
    admin: UpsertUser = relationship('UpsertUser', back_populates='house_admin', uselist=False)
# Usage
upserted_user = upsert(session, updated_user, [UpsertUser.user_sid.name], UpsertUser)
Note: only tested on PostgreSQL, but a similar approach could also work for other databases that support ON DUPLICATE KEY UPDATE, e.g. MySQL.

With SQLite, the sqlite_on_conflict='REPLACE' option can be used when defining a UniqueConstraint, and sqlite_on_conflict_unique for a unique constraint on a single column. session.add will then behave much like an upsert. See the official documentation.
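
As a rough illustration of what that option amounts to: SQLAlchemy's SQLite dialect renders it as an ON CONFLICT REPLACE clause on the constraint in CREATE TABLE. The sketch below writes that DDL by hand with the standard-library sqlite3 module (the listing table and its columns are invented for the example). Note that REPLACE deletes the conflicting row and inserts the new one, so columns missing from the new row fall back to their defaults instead of keeping their old values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hand-written equivalent of a unique constraint declared with ON CONFLICT REPLACE
conn.execute(
    "CREATE TABLE listing ("
    "id INTEGER PRIMARY KEY, vin TEXT, price INTEGER, color TEXT, "
    "UNIQUE (vin) ON CONFLICT REPLACE)"
)

conn.execute("INSERT INTO listing (vin, price, color) VALUES ('V1', 50000, 'blue')")
# Same vin again: the old row is deleted and this one takes its place
conn.execute("INSERT INTO listing (vin, price) VALUES ('V1', 62500)")

rows = conn.execute("SELECT vin, price, color FROM listing").fetchall()
print(rows)  # color is NULL because the replacement row did not supply it
```

If the old values of unspecified columns must survive, an ON CONFLICT ... DO UPDATE statement is probably the better fit.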

I use this code for upserts.
Before using it, you should add primary keys to the table in the database.
from sqlalchemy import create_engine
from sqlalchemy import MetaData, Table
from sqlalchemy.inspection import inspect
from sqlalchemy.engine.reflection import Inspector
from sqlalchemy.dialects.postgresql import insert
def upsert(df, engine, table_name, schema=None, chunk_size=1000):
    metadata = MetaData(schema=schema)
    # Reflect the existing table definition from the database
    table = Table(table_name, metadata, schema=schema, autoload_with=engine)
    # only use common columns between df and table.
    table_columns = {column.name for column in table.columns}
    df_columns = set(df.columns)
    intersection_columns = table_columns.intersection(df_columns)
    # pandas expects a list (not a set) as a column indexer
    df1 = df[list(intersection_columns)]
    records = df1.to_dict('records')
    # get list of fields making up primary key
    primary_keys = [key.name for key in inspect(table).primary_key]
    # engine.begin() commits the transaction when the block exits without error
    with engine.begin() as conn:
        chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
        for chunk in chunks:
            stmt = insert(table).values(chunk)
            update_dict = {c.name: c for c in stmt.excluded if not c.primary_key}
            s = stmt.on_conflict_do_update(
                index_elements=primary_keys,
                set_=update_dict)
            conn.execute(s)
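
The column filtering and chunking steps can be tried in isolation. Here is a small sketch in plain Python, with records as a list of dicts standing in for df.to_dict('records'); the helper names common_columns and chunked are hypothetical and not part of pandas or SQLAlchemy.

```python
def common_columns(records, table_columns):
    """Keep only the keys that also exist as table columns."""
    allowed = set(table_columns)
    return [{k: v for k, v in rec.items() if k in allowed} for rec in records]

def chunked(records, chunk_size=1000):
    """Yield successive slices of at most chunk_size records."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]

# Sample data: "extra" is not a table column and should be dropped
records = [{"id": i, "price": i * 10, "extra": "x"} for i in range(5)]
filtered = common_columns(records, ["id", "price", "color"])
batches = list(chunked(filtered, chunk_size=2))
print([len(b) for b in batches])
```

Each batch would then go into one insert(table).values(chunk) statement, which keeps individual statements at a bounded size.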
