Optimizing an Update statement with many records in SQLAlchemy - python

I am trying to update many records at a time using SQLAlchemy, but am finding it to be very slow. Is there an optimal way to perform this?
For some reference, I am performing an update on 40,000 records and it took about 1 hour.
Below is the code I am using. The table_name refers to the table which is loaded, the column is the single column which is to be updated, and the pairs refer to the primary key and new value for the column.
def update_records(table_name, column, pairs):
table = Table(table_name, db.MetaData, autoload=True,
autoload_with=db.engine)
conn = db.engine.connect()
values = []
for id, value in pairs:
values.append({'row_id': id, 'match_value': str(value)})
stmt = table.update().where(table.c.id == bindparam('row_id')).values({column: bindparam('match_value')})
conn.execute(stmt, values)

Passing a list of arguments to execute() essentially issues 40k individual UPDATE statements, which is going to have a lot of overhead. The solution for this is to increase the number of rows per query. For MySQL, this means inserting into a temp table and then doing an update:
# assuming temp table already created
conn.execute(temp_table.insert().values(values))
conn.execute(table.update().values({column: temp_table.c.match_value})
.where(table.c.id == temp_table.c.row_id))
Or, alternatively, you can use INSERT ... ON DUPLICATE KEY UPDATE to avoid creating the temp table, but SQLAlchemy does not support that natively, so you'll need to use a custom compiled construct for that (e.g. this gist).

According to document fast-execution-helpers, batch update statements can be issued as one statement. In my experiments, this trick reduce update or deletion time from 30 mins to 1 mins.
engine = create_engine(
"postgresql+psycopg2://scott:tiger#host/dbname",
executemany_mode='values_plus_batch',
executemany_values_page_size=5000, executemany_batch_page_size=5000)

Related

purging a huge data mysql table using python

I have a 1000M data table where i need to have a automated script just keeping last 7 days and delete the before days. I want to do it using python and chunks concept. Want to delete chunk wise.
do we have any library with this chunk concept related to mysql on python?
If no, can anyone suggest me a best method of how to use chunk or apply this with mysql
I'm unaware of a Python package that has an API for "chunking" deletes from a MySQL table. SqlAlchemy provides a fluent interface that can do this but it's not much different than the SQL. I suggest using PyMySql.
import datetime
import pymysql.cursors
connection = pymysql.connect(
host='host',
user='user',
password='password',
database='database'
)
seven_days_before_now = datetime.datetime.now() - datetime.timedelta(days=7)
chunksize = 1000
with connection.cursor() as cursor:
sql = 'DELETE FROM `mytable` WHERE `timestamp` < %s ORDER BY `id` LIMIT %s;'
num_deleted = None
while num_deleted != 0:
num_deleted = cursor.execute(sql, (seven_days_before_now, chunksize))
connection.commit()
The LIMIT just limits the number of deleted rows to the chunksize. The ORDER BY ensures that the DELETE is deterministic and it sorts by the primary key because the primary key is guaranteed to be indexed; so even though it sorts for each chunk, at least it's sorting on an indexed column. Remove the ORDER BY if deterministic behavior is unnecessary, it will result in faster execution time.
You'll need to replace the connection details, table name, column name and chunksize. Also, this solution assumes that the table has a column named id which is the primary key and an auto-incrementing integer. You'll need to make some changes if your schema differs.
As Bernd Buffen commented: the correct way to get the behavior you desire is to partition the table. Please consider a migration to do so.
And, of course: stop using Python 2, it's been unsupported for almost two years as of the first version of this answer.

Best way to perform bulk insert SQLAlchemy

I have a tabled called products
which has following columns
id, product_id, data, activity_id
What I am essentially trying to do is copy bulk of existing products and update it's activity_id and create new entry in the products table.
Example:
I already have 70 existing entries in products with activity_id 2
Now I want to create another 70 entries with same data except for updated activity_id
I could have thousands of existing entries that I'd like to make a copy of and update the copied entries activity_id to be a new id.
products = self.session.query(model.Products).filter(filter1, filter2).all()
This returns all the existing products for a filter.
Then I iterate through products, then simply clone existing products and just update activity_id field.
for product in products:
product.activity_id = new_id
self.uow.skus.bulk_save_objects(simulation_skus)
self.uow.flush()
self.uow.commit()
What is the best/ fastest way to do these bulk entries so it kills time, as of now it's OK performance, is there a better solution?
You don't need to load these objects locally, all you really want to do is have the database create these rows.
You essentially want to run a query that creates the rows from the existing rows:
INSERT INTO product (product_id, data, activity_id)
SELECT product_id, data, 2 -- the new activity_id value
FROM product
WHERE activity_id = old_id
The above query would run entirely on the database server; this is far preferable over loading your query into Python objects, then sending all the Python data back to the server to populate INSERT statements for each new row.
Queries like that are something you could do with SQLAlchemy core, the half of the API that deals with generating SQL statements. However, you can use a query built from a declarative ORM model as a starting point. You'd need to
Access the Table instance for the model, as that then lets you create an INSERT statement via the Table.insert() method.
You could also get the same object from models.Product query, more on that later.
Access the statement that would normally fetch the data for your Python instances for your filtered models.Product query; you can do so via the Query.statement property.
Update the statement to replace the included activity_id column with your new value, and remove the primary key (I'm assuming that you have an auto-incrementing primary key column).
Apply that updated statement to the Insert object for the table via Insert.from_select().
Execute the generated INSERT INTO ... FROM ... query.
Step 1 can be achieved by using the SQLAlchemy introspection API; the inspect() function, applied to a model class, gives you a Mapper instance, which in turn has a Mapper.local_table attribute.
Steps 2 and 3 require a little juggling with the Select.with_only_columns() method to produce a new SELECT statement where we swapped out the column. You can't easily remove a column from a select statement but we can, however, use a loop over the existing columns in the query to 'copy' them across to the new SELECT, and at the same time make our replacement.
Step 4 is then straightforward, Insert.from_select() needs to have the columns that are inserted and the SELECT query. We have both as the SELECT object we have gives us its columns too.
Here is the code for generating your INSERT; the **replace keyword arguments are the columns you want to replace when inserting:
from sqlalchemy import inspect, literal
from sqlalchemy.sql import ClauseElement
def insert_from_query(model, query, **replace):
# The SQLAlchemy core definition of the table
table = inspect(model).local_table
# and the underlying core select statement to source new rows from
select = query.statement
# validate asssumptions: make sure the query produces rows from the above table
assert table in select.froms, f"{query!r} must produce rows from {model!r}"
assert all(c.name in select.columns for c in table.columns), f"{query!r} must include all {model!r} columns"
# updated select, replacing the indicated columns
as_clause = lambda v: literal(v) if not isinstance(v, ClauseElement) else v
replacements = {name: as_clause(value).label(name) for name, value in replace.items()}
from_select = select.with_only_columns([
replacements.get(c.name, c)
for c in table.columns
if not c.primary_key
])
return table.insert().from_select(from_select.columns, from_select)
I included a few assertions about the model and query relationship, and the code accepts arbitrary column clauses as replacements, not just literal values. You could use func.max(models.Product.activity_id) + 1 as a replacement value (wrapped as a subselect), for example.
The above function executes steps 1-4, producing the desired INSERT SQL statement when printed (I created a products model and query that I thought might be representative):
>>> print(insert_from_query(models.Product, products, activity_id=2))
INSERT INTO products (product_id, data, activity_id) SELECT products.product_id, products.data, :param_1 AS activity_id
FROM products
WHERE products.activity_id != :activity_id_1
All you have to do is execute it:
insert_stmt = insert_from_query(models.Product, products, activity_id=2)
self.session.execute(insert_stmt)

SQLite get id and insert when using executemany

I am optimising my code, and reducing the amount of queries. These used to be in a loop but I am trying to restructure my code to be done like this. How do I get the second query working so that it uses the id entered in the first query from each row. Assume that the datasets are in the right order too.
self.c.executemany("INSERT INTO nodes (node_value, node_group) values (?, (SELECT node_group FROM nodes WHERE node_id = ?)+1)", new_values)
#my problem is here
new_id = self.c.lastrowid
connection_values.append((node_id, new_id))
#insert entry
self.c.executemany("INSERT INTO connections (parent, child, strength) VALUES (?,?,1)", connection_values)
These queries used to be a for loop but were taking too long so I am trying to avoid using a for loop and doing the query individually. I believe their might be a way with combining it into one query but I am unsure how this would be done.
You will need to either insert rows one at a time or read back the rowids that were picked by SQLite's ID assignment logic; as documented in Autoincrement in SQLite, there is no guarantee that the IDs generated will be consecutive and trying to guess them in client code is a bad idea.
You can do this implicitly if your program is single-threaded as follows:
Set the AUTOINCREMENT keyword in your table definition. This will guarantee that any generated row IDs will be higher than any that appear in the table currently.
Immediately before the first statement, determine the highest ROWID in use in the table.
oldmax ← Execute("SELECT max(ROWID) from nodes").
Perform the first insert as before.
Read back the row IDs that were actually assigned with a select statement:
NewNodes ← Execute("SELECT ROWID FROM nodes WHERE ROWID > ? ORDER BY ROWID ASC", oldmax) .
Construct the connection_values array by combining the parent ID from new_values and the child ID from NewNodes.
Perform the second insert as before.
This may or may not be faster than your original code; AUTOINCREMENT can slow down performance, and without actually doing the experiment there's no way to tell.
If your program is writing to nodes from multiple threads, you'll need to guard this algorithm with a mutex as it will not work at all with multiple concurrent writers.

Bulk upsert (insert-update) a csv in postgres [duplicate]

A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)
Here are some examples for insert ... on conflict ... (pg 9.5+) :
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;`
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
I am trying to contribute with another solution for the single insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform first the insertion, and in case the record is already present, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if there are no deletions of rows of the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.
SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a#b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
index_elements=[my_table.c.user_email],
index_where=my_table.c.user_email.like('%#gmail.com'),
set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15, is possible to use MERGE command. It actually has been presented as the first of the main improvements of this new version.
It uses a WHEN MATCHED / WHEN NOT MATCHED conditional in order to choose the behaviour when there is an existing row with same criteria.
It is even better than standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
INSERT (customer_id, balance)
VALUES (t.customer_id, t.transaction_value)
WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS
Tested on Postgresql 9.3
Since this question was closed, I'm posting here for how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
"""Groups data into chunks each with (at most) `chunksize` items.
https://stackoverflow.com/a/22919323/408556
"""
if chunksize:
i = iter(content)
generator = (list(it.islice(i, chunksize)) for _ in it.count())
else:
generator = iter([content])
return it.takewhile(bool, generator)
def gen_resources(records):
"""Yields a dictionary if the record's id already exists, a row object
otherwise.
"""
ids = {item[0] for item in session.query(Posts.id)}
for record in records:
is_row = hasattr(record, 'to_dict')
if is_row and record.id in ids:
# It's a row but the id already exists, so we need to convert it
# to a dict that updates the existing record. Since it is duplicate,
# also yield True
yield record.to_dict(), True
elif is_row:
# It's a row and the id doesn't exist, so no conversion needed.
# Since it's not a duplicate, also yield False
yield record, False
elif record['id'] in ids:
# It's a dict and the id already exists, so no conversion needed.
# Since it is duplicate, also yield True
yield record, True
else:
# It's a dict and the id doesn't exist, so we need to convert it.
# Since it's not a duplicate, also yield False
yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
for records in chunk(data, chunksize):
resources = gen_resources(records)
sorted_resources = sorted(resources, key=itemgetter(1))
for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
items = [g[0] for g in group]
if dupe:
_upsert = partial(session.bulk_update_mappings, Posts)
else:
_upsert = session.add_all
try:
_upsert(items)
session.commit()
except IntegrityError:
# A record was added or deleted after we checked, so retry
#
# modify accordingly by adding additional exceptions, e.g.,
# except (IntegrityError, ValidationError, ValueError)
db.session.rollback()
upsert(items)
except Exception as e:
# Some other error occurred so reduce chunksize to isolate the
# offending row(s)
db.session.rollback()
num_items = len(items)
if num_items > 1:
upsert(items, num_items // 2)
else:
print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).

How to insert several thousand columns into sqlite3?

Similar to my last question, but I ran into problem lets say I have a simple dictionary like below but its Big, when I try inserting a big dictionary using the methods below I get operational error for the c.execute(schema) for too many columns so what should be my alternate method to populate an sql databases columns? Using the alter table command and add each one individually?
import sqlite3
con = sqlite3.connect('simple.db')
c = con.cursor()
dic = {
'x1':{'y1':1.0,'y2':0.0},
'x2':{'y1':0.0,'y2':2.0,'joe bla':1.5},
'x3':{'y2':2.0,'y3 45 etc':1.5}
}
# 1. Find the unique column names.
columns = set()
for _, cols in dic.items():
for key, _ in cols.items():
columns.add(key)
# 2. Create the schema.
col_defs = [
# Start with the column for our key name
'"row_name" VARCHAR(2) NOT NULL PRIMARY KEY'
]
for column in columns:
col_defs.append('"%s" REAL NULL' % column)
schema = "CREATE TABLE simple (%s);" % ",".join(col_defs)
c.execute(schema)
# 3. Loop through each row
for row_name, cols in dic.items():
# Compile the data we have for this row.
col_names = cols.keys()
col_values = [str(val) for val in cols.values()]
# Insert it.
sql = 'INSERT INTO simple ("row_name", "%s") VALUES ("%s", "%s");' % (
'","'.join(col_names),
row_name,
'","'.join(col_values)
)
If I understand you right, you're not trying to insert thousands of rows, but thousands of columns. SQLite has a limit on the number of columns per table (by default 2000), though this can be adjusted if you recompile SQLite. Never having done this, I do not know if you then need to tweak the Python interface, but I'd suspect not.
You probably want to rethink your design. Any non-data warehouse / OLAP application is highly unlikely to need or be terribly efficient with thousands of columns (rows, yes) and SQLite is not a good solution for a data warehouse / OLAP type situation. You may get a bit further with something like an entity-attribute-value setup (not a normal recommendation for genuine relational databases, but a valid application data model and much more likely to accommodate your needs without pushing the limits of SQLite too far).
If you really are adding a massive number of rows and are running into problems, maybe your single transaction is getting too large.
Do a COMMIT (commit()) after a given number of lines (or even after each insert as a test) if that is acceptable.
Thousands of rows should be easily doable with sqlite. Getting to millions and above, at some point there might be need for more. Depends on a lot of things, of course.

Categories