Best way to perform bulk insert SQLAlchemy

I have a table called products
which has the following columns:
id, product_id, data, activity_id
What I am essentially trying to do is copy a batch of existing products, update the activity_id on the copies, and insert them as new entries in the products table.
Example:
I already have 70 existing entries in products with activity_id 2.
Now I want to create another 70 entries with the same data except for an updated activity_id.
I could have thousands of existing entries that I'd like to copy, setting the copied entries' activity_id to a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
    product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/fastest way to do these bulk inserts so that it saves time? The current performance is OK, but is there a better solution?

You don't need to load these objects locally; all you really want is to have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
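As a minimal sketch of that server-side copy, the snippet below runs the same INSERT ... SELECT against an in-memory SQLite database using only the standard library. The copy_products helper, the products table definition and the sample rows are assumptions made for illustration; the table name follows the question (products) rather than the SQL above (product).

```python
import sqlite3

def copy_products(conn, old_activity_id, new_activity_id):
    """Copy every row with old_activity_id, giving the copies new_activity_id."""
    cur = conn.execute(
        "INSERT INTO products (product_id, data, activity_id) "
        "SELECT product_id, data, ? FROM products WHERE activity_id = ?",
        (new_activity_id, old_activity_id),
    )
    return cur.rowcount  # number of rows the INSERT created

# Hypothetical schema and sample data, mirroring the question's 70 rows
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, product_id INTEGER, data TEXT, activity_id INTEGER)"
)
conn.executemany(
    "INSERT INTO products (product_id, data, activity_id) VALUES (?, ?, 2)",
    [(n, f"payload-{n}") for n in range(70)],
)

copied = copy_products(conn, old_activity_id=2, new_activity_id=3)
conn.commit()
```

No row data travels to Python here: the only values sent to the database are the two activity ids.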
Queries like that are something you could do with SQLAlchemy Core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to:
1. Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method. You could also get the same object from the models.Product query; more on that later.
2. Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
3. Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
4. Apply that updated statement to the Insert object for the table via Insert.from_select().
5. Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward: Insert.from_select() needs the columns that are inserted and the SELECT query. We have both, as the SELECT object gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement

def insert_from_query(model, query, **replace):
    # The SQLAlchemy core definition of the table
    table = inspect(model).local_table
    # and the underlying core select statement to source new rows from
    select = query.statement

    # validate assumptions: make sure the query produces rows from the above table
    assert table in select.froms, f"{query!r} must produce rows from {model!r}"
    assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"

    # updated select, replacing the indicated columns
    as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
    replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
    from_select = select.with_only_columns([
        replacements.get(c.name, c)
        for c in table.columns
        if not c.primary_key
    ])

    return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

Related

Using arbitrary sqlalchemy select results (e.g., from a CTE) to create ORM instances

When I create an instance of an object (in my example below, a Company) I want to automagically create default, related objects. One way is to use a per-row, after-insert trigger, but I'm trying to avoid that route and use CTEs which are easier to read and maintain. I have this SQL working (underlying db is PostgreSQL and the only thing you need to know about table company is its primary key is: id SERIAL PRIMARY KEY and it has one other required column, name VARCHAR NOT NULL):
with new_company as (
    -- insert my company row, returning the whole row
    insert into company (name)
    values ('Acme, Inc.')
    returning *
),
other_related as (
    -- herein I join to `new_company` and create default related rows
    -- in other tables. Here we use, effectively, a no-op - what it
    -- actually does is not germane to the issue.
    select id from new_company
)
-- Having created the related rows, we return the row we inserted into
-- table `company`.
select * from new_company;
The above works like a charm and with the recently added Select.add_cte() (in sqlalchemy 1.4.21) I can write the above with the following python:
import sqlalchemy as sa
from myapp.models import Company
new_company = (
    sa.insert(Company)
    .values(name='Acme, Inc.')
    .returning(Company)
    .cte(name='new_company')
)
other_related = (
    sa.select(sa.text('new_company.id'))
    .select_from(new_company)
    .cte('other_related')
)
fetch_company = (
    sa.select(sa.text('* from new_company'))
    .add_cte(other_related)
)
print(fetch_company)
And the output is:
WITH new_company AS
(INSERT INTO company (name) VALUES (:param_1) RETURNING company.id, company.name),
other_related AS
(SELECT new_company.id FROM new_company)
SELECT * from new_company
Perfect! But when I execute the above query I get back a Row:
>>> result = session.execute(fetch_company).fetchone()
>>> print(result)
(26, 'Acme, Inc.')
I can create an instance with:
>>> result = session.execute(fetch_company).fetchone()
>>> company = Company(**result)
But this instance, if added to the session, is in the wrong state, pending, and if I flush and/or commit, I get a duplicate key error because the company is already in the database.
If I try using Company in the select list, I get a bad query because sqlalchemy automagically sets the from-clause and I cannot figure out how to clear or explicitly set the from-clause to use my CTE.
I'm looking for one of several possible solutions:
annotate an arbitrary query in some way to say, "build an instance of MyModel, but use this table/alias", e.g., query = sa.select(Company).select_from(new_company.alias('company'), reset=True).
tell a session that an instance is persistent regardless of what the session thinks about the instance, e.g., company = Company(**result); session.add(company, force_state='persistent')
Obviously I could do another round-trip to the db with a call to session.merge() (as discussed in early comments of this question) so the instance ends up in the correct state, but that seems terribly inefficient especially if/when used to return lists of instances.

Add table columns to select without adding it to select_from

I have a prepared function in the database, which I want to call using Gino. This function has a return type equal to one of the tables, that is created using declarative. What I try to do is:
select(MyModel).select_from(func.my_function);
The problem is that SQLAlchemy automatically detects the table in my select and adds it implicitly to select_from. The resulting SQL contains both my function and the table name in the FROM clause, and the result is a Cartesian product of the function result and the whole table (not what I want really).
My question is – can I somehow specify that I want to select all the columns for a model without having the corresponding class in the FROM?

You have to specify the columns explicitly (as a list) if you don't want SQLAlchemy to automatically add MyModel to the FROM clause.
Either list them one by one:
select([your_model_table.c.column1, your_model_table.c.column2]).select_from(func.my_function())
Or, if you want all columns:
select(your_model_table.columns).select_from(func.my_function())

Too many server roundtrips w/ psycopg2

I am making a script that should create a schema for each customer. I’m fetching all metadata from a database that defines what each customer’s schema should look like, and then creating it. Everything is well defined: the types, names of tables, etc. A customer has many tables (e.g., address, customers, contact, item, etc.), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, and then Alter Table and add each metadata (This is done for each table).
Right now my script runs in about a minute for each customer, which I think is too slow. It has something to do with me having a loop, and in that loop, I’m altering each table.
I think that instead of me altering (which might be not so clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
    con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string in a string..):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...

Consider creating each table with a serial type (i.e., autonumber) ID field and then using ALTER TABLE for all other fields. Use sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular str.format for data types, which are neither identifiers nor literals in the SQL statement.
from psycopg2 import sql

# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm=sql.Identifier("tester"),
                                  tbl=sql.Identifier("table")))

# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""
for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS; only the data type is filled in by str.format
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm=sql.Identifier("tester"),
                                            tbl=sql.Identifier("table"),
                                            col=sql.Identifier(item[0])))
Alternatively, use str.join with a list comprehension to build one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
    "id" serial,
    {vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n    ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
                                        shm=sql.Identifier("tester"),
                                        tbl=sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
    "id" serial,
    "last_seen" date,
    "valid_from" timestamp
)
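The one-statement-per-table idea can be tried without a PostgreSQL server. The sketch below is an illustration under assumptions, not psycopg2 code: it uses the standard library's sqlite3, a hypothetical quote_ident helper standing in for sql.Identifier, and a hypothetical ALLOWED_TYPES allowlist, since data types cannot be quoted like identifiers. The table names and columns are sample data taken from the question.

```python
import sqlite3

ALLOWED_TYPES = {"date", "timestamp", "integer", "text"}  # hypothetical allowlist

def quote_ident(name):
    """Wrap an identifier in double quotes, doubling embedded quotes (hypothetical helper)."""
    return '"' + name.replace('"', '""') + '"'

def build_create_table(table, columns):
    """Return one CREATE TABLE statement covering every (name, type) pair."""
    defs = []
    for name, sql_type in columns:
        if sql_type.lower() not in ALLOWED_TYPES:
            raise ValueError(f"unexpected column type: {sql_type!r}")
        defs.append(f"{quote_ident(name)} {sql_type}")
    body = ",\n    ".join(['"id" integer primary key'] + defs)
    return f"CREATE TABLE IF NOT EXISTS {quote_ident(table)} (\n    {body}\n)"

# Sample metadata: every table gets the same columns, as in the question
tables = {
    "billing": [("last_seen", "date"), ("valid_from", "timestamp")],
    "address": [("last_seen", "date"), ("valid_from", "timestamp")],
}

conn = sqlite3.connect(":memory:")
for table, columns in tables.items():
    conn.execute(build_create_table(table, columns))  # one statement per table
conn.commit()

# Column names of the created table, read back from SQLite's catalog
created = [row[1] for row in conn.execute('PRAGMA table_info("billing")')]
```

Compared with CREATE TABLE followed by one ALTER TABLE per column, this sends a single statement per table, which is where the roundtrip saving comes from.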

However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error. Rather, I get an SQL syntax error. A values list is for conveying data. But ALTER TABLE is not about data, it is about metadata. You can't use a values list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. And you can't have a comma between name and type. And you can't have parentheses around each pair. And each pair needs to be introduced with "ADD"; you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to it to tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?

Optimizing an Update statement with many records in SQLAlchemy

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
    table = Table(table_name, db.MetaData, autoload=True,
                  autoload_with=db.engine)
    conn = db.engine.connect()
    values = []
    for id, value in pairs:
        values.append({'row_id': id, 'match_value': str(value)})
    stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
    conn.execute(stmt, values)

Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
                  .where(table.c.id == temp_table.c.row_id))
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table. Older SQLAlchemy releases did not support that natively and needed a custom compiled construct (e.g. this gist); since SQLAlchemy 1.2, the insert() construct in sqlalchemy.dialects.mysql provides on_duplicate_key_update().
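The temp-table approach can be sketched end to end with the standard library's sqlite3 in place of MySQL. Everything named here (the items and staging tables, the bulk_update helper, the sample pairs) is an assumption for illustration, and the correlated subquery is used only because it is portable; MySQL would normally express the second step as an UPDATE with a join.

```python
import sqlite3

def bulk_update(conn, pairs):
    """Update items.match_value for many ids through a staging table."""
    conn.execute(
        "CREATE TEMP TABLE staging (row_id INTEGER PRIMARY KEY, match_value TEXT)"
    )
    # One batched insert loads every (id, value) pair
    conn.executemany(
        "INSERT INTO staging (row_id, match_value) VALUES (?, ?)",
        [(row_id, str(value)) for row_id, value in pairs],
    )
    # One UPDATE touches every staged row
    conn.execute(
        "UPDATE items SET match_value = "
        "(SELECT s.match_value FROM staging s WHERE s.row_id = items.id) "
        "WHERE id IN (SELECT row_id FROM staging)"
    )
    conn.execute("DROP TABLE staging")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, match_value TEXT)")
conn.executemany("INSERT INTO items (id) VALUES (?)", [(n,) for n in range(1, 6)])

bulk_update(conn, [(1, "a"), (3, "c"), (5, 42)])
conn.commit()
rows = conn.execute("SELECT id, match_value FROM items ORDER BY id").fetchall()
```

Rows whose ids are not staged keep their existing values, because the WHERE clause limits the UPDATE to staged ids.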

According to the "fast execution helpers" section of the SQLAlchemy documentation, batch update statements can be issued as one statement. In my experiments, this trick reduced update or deletion time from 30 minutes to 1 minute.
engine = create_engine(
    "postgresql+psycopg2://scott:tiger@host/dbname",
    executemany_mode='values_plus_batch',
    executemany_values_page_size=5000, executemany_batch_page_size=5000)

Find parent with certain combination of child rows - SQLite with Python

There are several parts to this question. I am working with sqlite3 in Python 2.7, but I am less concerned with the exact syntax, and more with the methods I need to use. I think the best way to ask this question is to describe my current database design, and what I am trying to accomplish. I am new to databases in general, so I apologize if I don't always use correct nomenclature.
I am modeling refrigeration systems (using Modelica--not really important to know), and I am using the database to manage input data, results data, and models used for that data.
My top parent table is Model, which contains the columns:
id, name, version, date_created
My child table under Model is called Design. It is used to create a unique id for each combination of design input parameters and the model used. The columns it contains are:
id, model_id, date_created
I then have two child tables under Design, one called Input, and the other called Result. We can just look at Input for now, since one example should be enough. The columns for input are:
id, value, design_id, parameter_id, component_id
parameter_id and component_id are foreign keys to their own tables. The Parameter table has the following columns:
id, name, units
Some example rows for Parameter under name are: length, width, speed, temperature, pressure (there are many dozens more). The Component table has the following columns:
id, name
Some example rows for Component under name are: compressor, heat_exchanger, valve.
Ultimately, in my program I want to search the database for a specific design. I want to be able to search a specific design to be able to grab specific results for that design, or to know whether or not a model simulation with that design has already been run previously, to avoid re-running the same data point.
I also want to be able to grab all the parameters for a given design, and insert it into a class I have created in Python, which is then used to provide inputs to my models. In case it helps for solving the problem, the classes I have created are based on the components. So, for example, I have a compressor class, with attributes like compressor.speed, compressor.stroke, compressor.piston_size. Each of these attributes should have their own row in the Parameter table.
So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values? Just as a side note, my friend helped me design this database. He knows databases, but not my application super well. It is possible that I designed it poorly for what I want to accomplish.
Here is a simple picture trying to map a certain combination of parameters with certain values to a design_id, where I have taken out component_id for simplicity:
Picture of simplified tables

Simply join the necessary tables. Your schema properly reflects normalization (separating tables into logical groupings) and can scale for one-to-many relationships. Specifically, to answer your question --So, how would I query this database efficiently to find if there is a design that matches a long list (let's assume 100+) of parameters with specific values?-- consider below approaches:
Inner Join with Where Clause
For a handful of parameters, use an inner join with a WHERE ... IN () clause. The query below returns design fields joined to the input and parameters tables, filtered for specific parameter names, which Python can pass as parameterized values, even iteratively in a loop:
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN parameters p ON p.id = i.parameter_id
WHERE p.name IN ('param1', 'param2', 'param3', 'param4', 'param5', ...)
Inner Join with Temp Table
If the list runs to 100+ values, consider a temp table that filters the parameters table down to the specific parameters:
# CREATE EMPTY TABLE (SAME STRUCTURE AS parameters)
sql = "CREATE TABLE tempparams AS SELECT id, name, units FROM parameters WHERE 0;"
cur.execute(sql)
db.commit()

# ITERATIVELY APPEND TO TEMP
sql = "INSERT INTO tempparams (id, name, units) \
       SELECT p.id, p.name, p.units \
       FROM parameters p \
       WHERE p.name = ?;"
for i in paramslist:        # LIST OF 100+ ITEMS
    cur.execute(sql, (i,))  # CURSOR OBJECT COMMAND; PARAMS MUST BE PASSED AS A SEQUENCE
db.commit()                 # DB OBJECT COMMIT ACTION
Then, join main design and input tables with new temp table holding specific parameters:
SELECT d.id, d.model_id, d.date_created
FROM design d
INNER JOIN input i ON d.id = i.design_id
INNER JOIN tempparams t ON t.id = i.parameter_id
The same process can work with the components table as well.
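The joins above return a design as soon as any one of the listed parameters matches. To require that every (parameter, value) pair matches, one common pattern is to group by design and compare the number of matched parameters with the number requested. The sketch below is an illustration under assumptions rather than part of the answer above: it uses sqlite3 with simplified parameter and input tables (component_id left out, as in the question's picture), a hypothetical find_matching_designs helper, and made-up sample values.

```python
import sqlite3

def find_matching_designs(conn, wanted):
    """Return ids of designs whose inputs include every (parameter name, value) pair."""
    conn.execute("CREATE TEMP TABLE wanted (name TEXT PRIMARY KEY, value REAL)")
    conn.executemany(
        "INSERT INTO wanted (name, value) VALUES (?, ?)", list(wanted.items())
    )
    rows = conn.execute(
        """
        SELECT i.design_id
        FROM input i
        JOIN parameter p ON p.id = i.parameter_id
        JOIN wanted w ON w.name = p.name AND w.value = i.value
        GROUP BY i.design_id
        HAVING COUNT(DISTINCT p.id) = (SELECT COUNT(*) FROM wanted)
        ORDER BY i.design_id
        """
    ).fetchall()
    conn.execute("DROP TABLE wanted")
    return [design_id for (design_id,) in rows]

# Hypothetical, simplified schema and sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parameter (id INTEGER PRIMARY KEY, name TEXT UNIQUE, units TEXT)")
conn.execute(
    "CREATE TABLE input ("
    "id INTEGER PRIMARY KEY, value REAL, design_id INTEGER, parameter_id INTEGER)"
)
conn.executemany(
    "INSERT INTO parameter (id, name, units) VALUES (?, ?, ?)",
    [(1, "length", "m"), (2, "width", "m"), (3, "speed", "rpm")],
)
conn.executemany(
    "INSERT INTO input (value, design_id, parameter_id) VALUES (?, ?, ?)",
    [(1.5, 1, 1), (2.0, 1, 2), (3000.0, 1, 3),   # design 1
     (1.5, 2, 1), (2.5, 2, 2), (3000.0, 2, 3)],  # design 2 differs in width
)

full_match = find_matching_designs(conn, {"length": 1.5, "width": 2.0, "speed": 3000.0})
partial_match = find_matching_designs(conn, {"length": 1.5})
```

This finds designs that contain all the requested pairs; a design with additional inputs still qualifies, so an exact-match search would also need to compare each design's total input count. Comparing REAL values with = relies on the stored numbers being exactly equal, which may need a tolerance for computed values.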