SQLAlchemy upsert Function for MySQL - python

I have used the following documentation as a guide and tried to implement an upsert mechanism for my Games table. I want to be able to dynamically update all columns of the selected table at once (without having to specify each column individually). I have tried different approaches, but none produced a valid SQL query that can be executed. What did I misunderstand, and what are the errors in the code?
https://docs.sqlalchemy.org/en/12/dialects/mysql.html?highlight=on_duplicate_key_update#insert-on-duplicate-key-update-upsert
https://github.com/sqlalchemy/sqlalchemy/issues/4483
class Game(CustomBase, Base):
    __tablename__ = 'games'

    game_id = Column('id', Integer, primary_key=True)
    date_time = Column(DateTime, nullable=True)
    hall_id = Column(Integer, ForeignKey(SportPlace.id), nullable=False)
    team_id_home = Column(Integer, ForeignKey(Team.team_id))
    team_id_away = Column(Integer, ForeignKey(Team.team_id))
    score_home = Column(Integer, nullable=True)
    score_away = Column(Integer, nullable=True)
    ...
def put_games(games):  # games is / must be a list of type Game
    insert_stmt = insert(Game).values(games)
    #insert_stmt = insert(Game).values(id=Game.game_id, data=games)
    on_upsert_stmt = insert_stmt.on_duplicate_key_update(**games)
    print(on_upsert_stmt)
    ...
I regularly load the original data from an external API (including IDs) and want to update my database with it, i.e. update existing entries (with the same ID) with the new data and add missing ones, without completely reloading the database.
Updates
The actual code results in

TypeError: on_duplicate_key_update() argument after ** must be a mapping, not list

With the commented-out line (#insert_stmt = ...) used instead of the first insert_stmt, the error message is

sqlalchemy.exc.CompileError: Unconsumed column names: data
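For comparison, a minimal sketch of a dynamic upsert along the lines of the linked documentation, assuming games is a list of dicts keyed by column name and that a session is in scope (both assumptions, not part of the original code): on_duplicate_key_update() expects a mapping, and the statement's inserted attribute refers to the values of the row that would have been inserted.

from sqlalchemy.dialects.mysql import insert

def put_games(games):  # assumption: games is a list of dicts, not ORM objects
    insert_stmt = insert(Game).values(games)
    # Map every non-primary-key column to its would-be inserted value, so
    # that all columns are updated dynamically on duplicate key.
    update_cols = {
        col.name: insert_stmt.inserted[col.name]
        for col in Game.__table__.columns
        if not col.primary_key
    }
    upsert_stmt = insert_stmt.on_duplicate_key_update(**update_cols)
    session.execute(upsert_stmt)  # assumption: a session is in scope
    session.commit()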

Related

Selecting columns from tables created using double quotation marks fails

I connected a PostgreSQL database to Apache Superset and am playing around with its SQL editor. I'm running into a problem where I cannot do a left join between two tables with an associated id.
SELECT id, profile_name FROM "ProductionRun"
LEFT JOIN "StatsAssociation" ON "ProductionRun".id = "StatsAssociation".production_run_id;
Is my syntax above correct? The tables must be referenced with double quotation marks because they were created case-sensitive. This returns only the id and profile_name columns of the ProductionRun table, without joining with the StatsAssociation table.
I created the tables using sqlalchemy and here are the table schema:
ProductionRun
class ProductionRun(Base):
    __tablename__ = 'ProductionRun'

    id = Column(Integer, primary_key=True, autoincrement=True)
    profile_name = Column(String, nullable=False)
StatsAssociation
class StatsAssociation(Base):
    __tablename__ = 'StatsAssociation'

    production_run_id = Column(Integer, ForeignKey('ProductionRun.id'), primary_key=True)
    stats_package_id = Column(Integer, ForeignKey('StatsPackage.id'), unique=True, nullable=False)
    stats_package = relationship('StatsPackage', back_populates='stats_association', cascade='all,delete')
    production_run = relationship('ProductionRun', back_populates='stats_association')
When I view the tables, they both exist, and StatsAssociation has a production_run_id column which shares the same ids as ProductionRun.
This was originally posted as a comment.
You're not selecting any column from the "StatsAssociation" table, so it is expected that nothing from it shows up. To get columns in the output of the SELECT query, you need to list them; the only exception I can currently think of is using "TableName".* or * in SELECT.
For example, and just to start you off:
SELECT id, profile_name, production_run_id
FROM ...
where ... is the rest of your query.
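Applied to the query from the question, the complete statement would look something like this (a sketch based on the models shown above):

SELECT "ProductionRun".id, profile_name, production_run_id
FROM "ProductionRun"
LEFT JOIN "StatsAssociation"
    ON "ProductionRun".id = "StatsAssociation".production_run_id;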

Update 50K+ records in a table *at once* using SqlAlchemy

Background
I'm working with a table (in a postgres db), let's call it Person. It is related to a table, JobTitle through the association table PersonJobTitleAssociation. (Each person can have many job titles.)
engine = create_engine(DB_URI)
Base = declarative_base(engine)

class Person(Base):
    __tablename__ = 'person'

    id = Column(Integer, unique=True, primary_key=True)
    name = Column(String, unique=False)

    # relationship with all job_titles
    all_job_titles = relationship(
        'JobTitle',
        secondary='person_job_title_association',
        order_by='desc(person_job_title_association.c.date_created)')

    # Update this
    magic_value = Column(String, unique=False)

class PersonJobTitleAssociation(Base):
    __tablename__ = 'person_job_title_association'

    person_id = Column(Integer, ForeignKey('person.id'), primary_key=True)
    job_title_id = Column(Integer, ForeignKey('job_title.id'), primary_key=True)
    date_created = Column(DateTime, nullable=False, default=datetime.datetime.utcnow)

class JobTitle(Base):
    __tablename__ = 'job_title'

    id = Column(Integer, unique=True, primary_key=True)
    name = Column(String, unique=True)

# Once everything is declared, bind to the session
session = sessionmaker(bind=engine)()
Problem
I'd like to access each Person and their most recent JobTitle and apply some_magic_function() to the person's name and job title. (A stand-in for "some operation which must be done in Python".)
import random
import string

def some_magic_function(name, job_title):
    """This operation must be done in python"""
    # Update the job_title if blank
    if not job_title:
        job_title = 'unspecified'
    # Get a random character and check if it's in our person's name
    char = random.choice(string.ascii_letters)
    if char in name:
        return job_title.upper()
    else:
        return job_title.lower()
I'm updating values like so:
(Let's pretend this query is optimized and doesn't need to be improved)
query = session.query(Person)\
    .options(joinedload(Person.all_job_titles))\
    .order_by(Person.id)

# operate on all people
for person in query:
    # Get and set the magic value
    magic_value = some_magic_function(person.name, person.all_job_titles[0])
    if person.magic_value != magic_value:
        person.magic_value = magic_value

# Finally, once complete, commit the session
session.commit()
Issue
Querying and updating values is pretty fast on the Python side, but things get really slow when calling session.commit(). After some research, it appears SQLAlchemy locks the entire person table each time it updates a value. Further, each update is executed as its own statement (that's 50K independent SQL commands for 50K records).
Desired outcome
I'd love a Pythonic solution which would update all 50K records in one swoop.
I've considered using a read_only session, then passing the update values into an array of tuples and sending the updates through a with_updates session. This seems like a more SQL-friendly approach, but it is a bit heavy-handed and convoluted.
Much appreciated!
You might be able to reduce the number of round trips to the database simply by enabling the driver's fast execution helpers, but as a more explicit approach, produce a temporary/derived table of changes one way or another:

- CREATE TEMPORARY TABLE and COPY
- (VALUES ...) AS ..., possibly combined with explicit use of execute_values()
- unnest() an array of rows
- from JSON using json(b)_to_recordset()

allowing you to bulk-send the changes and do UPDATE ... FROM:
import csv
from io import StringIO

# Pretending that the query is optimized and deterministic
query = session.query(Person)\
    .options(joinedload(Person.all_job_titles))\
    .order_by(Person.id)

# Prepare data for COPY
changes_csv = StringIO()
changes_writer = csv.writer(changes_csv)

for p in query:
    mv = some_magic_function(p.name, p.all_job_titles[0])
    if p.magic_value != mv:
        changes_writer.writerow((p.id, mv))

changes_csv.seek(0)

session.execute("""
    CREATE TEMPORARY TABLE new_magic (
        person_id INTEGER,
        value TEXT
    ) ON COMMIT DROP
""")

# Access the underlying psycopg2 connection directly to obtain a cursor
with session.connection().connection.cursor() as cur:
    stmt = "COPY new_magic FROM STDIN WITH CSV"
    cur.copy_expert(stmt, changes_csv)

# Make sure that the planner has proper statistics to work with
session.execute("ANALYZE new_magic ( person_id )")

session.execute("""
    UPDATE person
    SET magic_value = new_magic.value
    FROM new_magic
    WHERE person.id = new_magic.person_id
""")

session.commit()
Not exactly "Pythonic" in the sense that it does not let the ORM figure out what to do, but on the other hand explicit is better than implicit.

Converting a pandas dataframe to a class and saving using orm

My code works with a mixture of pandas dataframes and ORM tables. Because I wanted to speed up data retrieval using an index (as opposed to reading an entire file into a dataframe and re-writing it each time), I created a class statement to facilitate ORM queries. But I'm struggling to put it all together.
Here is my class statement:
engine_local = create_engine(Config.SQLALCHEMY_DATABASE_URI_LOCAL)
Base_local = declarative_base()
Base_local.metadata.create_all(engine_local)
Session_local = sessionmaker(bind=engine_local)
Session_local.configure(bind=engine_local)
session_local = Session_local()

class Clients(Base_local):
    __tablename__ = 'clients'

    id = sa.Column(sa.Integer, primary_key=True)
    client_id = sa.Column(sa.Integer, primary_key=True)
    client_year = sa.Column(sa.Integer, primary_key=True)
    client_cnt = sa.Column(sa.Integer, nullable=False)
    date_posted = sa.Column(sa.DateTime, nullable=False, default=datetime.utcnow)
    client_company = sa.Column(sa.Integer, nullable=False)
    client_terr = sa.Column(sa.Integer, nullable=False)
    client_credit = sa.Column(sa.Integer, nullable=False)
    client_ann_prem = sa.Column(sa.Float)

    def __repr__(self):
        return f"Clients('{self.client_id}', '{self.client_year}', '{self.client_ann_prem}')"

meta = sa.MetaData()
meta.bind = engine_local
meta.drop_all(engine_local)
meta.create_all(engine_local)
And here is my panda definition statement:
clients_df = pd.DataFrame(client_id, columns=feature_list)
clients_df['client_year'] = client_year
clients_df['client_cnt'] = client_cnt
clients_df['client_company'] = client_company
clients_df['client_terr'] = client_terr
clients_df['client_credit'] = client_credit
clients_df['client_ann_prem'] = client_ann_prem
I have an initialize step where I need to save this entire dataframe for the first time (so it will constitute the entire database and can write over any pre-existing data). Later, however, I will want to import only a portion of the table based on client_year, and then append the updated dataframe to the existing table.
Questions I am struggling with:
Is it useful to define a class at all? (I'm choosing this path since I believe ORM is easier than raw SQL.)
Will the pd.to_sql statement automatically match the dataframe to the class definition?
If I want to create new versions of the table (e.g. for a threaded process), can I create inherited classes based upon Clients without having to go through an initialize step? (e.g. a Clients01 and Clients02 table)
Thanks!
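As a sketch of the plain pandas route for the initialize and append steps (assuming the dataframe columns match the table's column names; to_sql() matches by column name and does not consult the ORM class definition):

# Initialize: write the whole dataframe, dropping any pre-existing table.
# Note: 'replace' recreates the table from the dataframe's dtypes, so the
# schema declared on Clients is not preserved.
clients_df.to_sql('clients', engine_local, if_exists='replace', index=False)

# Later: append only the updated portion to the existing table
subset_df = clients_df[clients_df['client_year'] == 2020]  # hypothetical filter
subset_df.to_sql('clients', engine_local, if_exists='append', index=False)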

Instance <xxx> has been deleted after commit(), but when and why?

Here's a really simple piece of code. After adding the "poll" instance to the DB and committing, I cannot later read it. SQLAlchemy fails with the error:
Instance '<PollStat at 0x7f9372ea72b0>' has been deleted, or its row is otherwise not present.
Weirdly, this does not happen if I replace the ts_start/ts_run primary key with an integer autoincrement one. Is it possible that DateTime columns are not suitable as primary keys?
db = Session()
poll = models.PollStat(
    ts_start=datetime.datetime.now(),
    ts_run=ts_run,
    polled_tools=0)
db.add(poll)
db.commit()  # I want to commit here in case something fails later
print(poll.polled_tools)  # this fails
PollStat in module models.py:
class PollStat(Base):
    __tablename__ = 'poll_stat'

    ts_run = Column(Integer, primary_key=True)
    ts_start = Column(DateTime, primary_key=True)
    elapsed_ms = Column(Integer, default=None)
    polled_tools = Column(Integer, default=0)
But if I do this:
class PollStat(Base):
    __tablename__ = 'poll_stat'

    id = Column(Integer, primary_key=True)
    ts_run = Column(Integer)
    ts_start = Column(DateTime)
    elapsed_ms = Column(Integer, default=None)
    polled_tools = Column(Integer, default=0)
it works. Why?
For anyone who still has this problem: this error happened to me because I submitted a JSON object with an id of 0. I use the same form for adding and editing the object, so when editing it would normally carry an ID, but when creating the item the id property needs to be deleted before inserting. Some databases don't accept an ID of 0; the row is created, but the ID of 0 no longer matches the row's actual ID, hence the error.
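A plausible explanation for the DateTime case above (an assumption, since the database backend isn't given in the question): commit() expires the instance, so print(poll.polled_tools) triggers a re-SELECT by primary key. If the backend stores DATETIME with less precision than Python's datetime (MySQL's DATETIME, for example, truncates microseconds by default), the persisted key differs from the in-memory one, the refresh finds no row, and exactly this error is raised. A quick way to test the hypothesis:

import datetime

# If the microseconds survive the round trip through the column type,
# precision is not the issue; if they are dropped, the re-SELECT by
# primary key after commit() cannot find the row.
ts = datetime.datetime.now()
print(ts)                         # e.g. 2020-01-01 12:00:00.123456
print(ts.replace(microsecond=0))  # what a second-precision DATETIME stores

# Workaround sketch: truncate before inserting, so the in-memory key
# matches what the database stores.
poll = models.PollStat(
    ts_start=datetime.datetime.now().replace(microsecond=0),
    ts_run=ts_run,
    polled_tools=0)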

How to count child table items with or without join to parent table using SQLAlchemy?

I used SQLAlchemy to create a SQLite database which stores bibliographic data of some documents, and I want to query the number of authors of each document.
I know how to do this in raw SQL, but how can I achieve the same result using SQLAlchemy? Is it possible without using a join?
Here are the classes that I have defined:
class WosDocument(Base):
    __tablename__ = 'wos_document'

    document_id = Column(Integer, primary_key=True)
    unique_id = Column(String, unique=True)
    ......

    authors = relationship('WosAuthor', back_populates='document')

class WosAuthor(Base):
    __tablename__ = 'wos_author'

    author_id = Column(Integer, primary_key=True, autoincrement=True)
    document_unique_id = Column(String, ForeignKey('wos_document.unique_id'))
    document = relationship('WosDocument', back_populates='authors')
    last_name = Column(String)
    first_name = Column(String)
And my goal is to get the same result as this SQL query does:
SELECT a.unique_id, COUNT(*)
FROM wos_document AS a
LEFT JOIN wos_author AS b
ON a.unique_id = b.document_unique_id
GROUP BY a.unique_id
I tried the codes below:
session.query(WosDocument.unique_id, len(WosDocument.authors)).all()
session.query(WosDocument.unique_id, func.count(WosDocument.authors)).all()
The first line raises an error, and the second doesn't give me the desired result; it returns only one row, and I don't recognize what it is:
[('000275510800023', 40685268)]
Since the WosDocument object has a one-to-many relationship authors, I supposed that I could query the author count of each document without using a join explicitly, but I can't find out how to do this with SQLAlchemy.
Can you help me? Thanks!
If you have set up the relationship correctly in your models, the query would look like:
db.session.query(ParentTable.pk, func.count('*').label("count")).join(ChildTable).group_by(ParentTable).all()
The join() method is documented at
https://docs.sqlalchemy.org/en/latest/orm/query.html#sqlalchemy.orm.query.Query.join
If you don't join() explicitly, you would need to deal with something like parent.relations as a field.
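Applied to the models above, a sketch that mirrors the original LEFT JOIN (note that func.count(WosAuthor.author_id) yields 0 for documents without authors, whereas the raw COUNT(*) counts the unmatched row as 1):

from sqlalchemy import func

# LEFT JOIN via outerjoin along the relationship, grouped per document
q = (session.query(WosDocument.unique_id,
                   func.count(WosAuthor.author_id))
     .outerjoin(WosDocument.authors)
     .group_by(WosDocument.unique_id))

for unique_id, author_count in q:
    print(unique_id, author_count)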
