Efficient insert of multiple rows with SQLAlchemy/SQLite3 when duplicate entries exist

I'm inserting multiple rows into an SQLite3 table using SQLAlchemy, and frequently the entries are already in the table. It is very slow to insert the rows one at a time, and catch the exception and continue if the row already exists. Is there an efficient way to do this? If the row already exists, I'd like to do nothing.

You can use an SQL statement
INSERT OR IGNORE INTO ... etc. ...
to simply ignore the insert if it is a duplicate. The IGNORE conflict clause is described in the SQLite documentation on conflict resolution (ON CONFLICT).
In SQLAlchemy you can add OR IGNORE as a prefix to the Insert construct, which places OR IGNORE between INSERT and INTO in the generated statement; see the prefix_with() method in the SQLAlchemy documentation.
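As a rough illustration of the OR IGNORE approach, here is a minimal sketch using Python's built-in sqlite3 module rather than SQLAlchemy. The table name items and its columns are made up for the example; the point is that a single executemany() call inserts the new rows and silently skips the ones whose key already exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items (id, name) VALUES (1, 'existing')")

rows = [(1, "duplicate"), (2, "new"), (3, "also new")]

# One executemany() call; rows whose primary key already exists are skipped
# instead of raising sqlite3.IntegrityError.
conn.executemany("INSERT OR IGNORE INTO items (id, name) VALUES (?, ?)", rows)
conn.commit()

result = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
print(result)
```

With SQLAlchemy Core, calling prefix_with("OR IGNORE") on an insert construct should render the same INSERT OR IGNORE statement on SQLite.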

If you are happy to run 'native' sqlite SQL you can just do:
REPLACE INTO my_table(id, col2, ..) VALUES (1, 'value', ...);
REPLACE INTO my_table(...);
...
COMMIT
However, REPLACE overwrites an existing row rather than leaving it untouched, and it isn't portable across all DBMSs, which is why it's not found in the general SQLAlchemy dialect.
Another thing you could do is use the SQLAlchemy ORM and define a 'domain model', a Python class which maps to your database table. Then you can create many instances of your domain class and call session.merge(domain_object) on each of the items you wish to insert or update (early SQLAlchemy releases offered session.save_or_update() for this, which has since been removed), and finally call session.commit() to write the items to your database table.
This question looks like a duplicate of SQLAlchemy - INSERT OR REPLACE equivalent
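Since the question asks to do nothing when a row exists, it may help to see how REPLACE differs from OR IGNORE. The following is a small sketch with Python's sqlite3 module; the column layout of my_table is an assumption made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed two-column layout, for illustration only.
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, col2 TEXT)")
conn.execute("INSERT INTO my_table (id, col2) VALUES (1, 'original')")

# REPLACE deletes the conflicting row and inserts the new one,
# so the stored value changes.
conn.execute("REPLACE INTO my_table (id, col2) VALUES (1, 'replaced')")
after_replace = conn.execute(
    "SELECT col2 FROM my_table WHERE id = 1"
).fetchone()[0]

# INSERT OR IGNORE leaves the existing row untouched.
conn.execute("INSERT OR IGNORE INTO my_table (id, col2) VALUES (1, 'ignored')")
after_ignore = conn.execute(
    "SELECT col2 FROM my_table WHERE id = 1"
).fetchone()[0]

print(after_replace, after_ignore)
```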

Related

How to get the data object of a newly inserted data row and flask-mysqldb?

I have work in Perl where I am able to get the newly created data object ID by passing the result back to a variable. For example:
my $data_obj = $schema->resultset('PersonTable')->create(\%psw_rec_hash);
Where the $data_obj contains the primary key's column value.
I want to be able to do the same thing using Python 3.7, Flask and flask-mysqldb,
but without having to do another query. I want to be able to use the specific
record's primary key column value for another method.
Python and flask-mysqldb inserts data like so:
query = "INSERT INTO PersonTable (fname, mname, lname) VALUES('Phil','','Vil')"
cursor = db.connection.cursor()
cursor.execute(query)
db.connection.commit()
cursor.close()
The PersonTable has a primary key column called id, so the newly inserted data row would look like:
23, 'Phil', '', 'Vil'
There are 22 rows of data before the newly inserted one. I don't want to search for the row by its values, because more than one entry could hold the same data; all I want is the most recently inserted row.
Can I do something similar to Perl with python 3.7 and flask-mysqldb?

You may want to consider the Flask-SQLAlchemy package to help you with this.
Although the syntax is going to be slightly different from Perl, what you can do is, when you create the model object, you can set it to a variable. Then, when you either flush or commit on the Database session, you can pull up your primary key attribute on that model object you had created (whether it's "id" or something else), and use it as needed.
SQLAlchemy supports MySQL, as well as several other relational databases. In addition, it is able to help prevent SQL injection attacks so long as you use model objects and add/delete them to your database session, as opposed to straight SQL commands.
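If staying with a plain DB-API cursor is preferred, the cursor's lastrowid attribute is one way to get the new primary key without a second query. The sketch below uses sqlite3 as a stand-in so it runs without a MySQL server; it assumes an auto-incrementing id column. MySQLdb cursors (which flask-mysqldb hands out) also expose lastrowid, but use %s placeholders instead of ?.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in schema; the real PersonTable lives in MySQL.
conn.execute(
    "CREATE TABLE PersonTable ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, fname TEXT, mname TEXT, lname TEXT)"
)

cursor = conn.cursor()
cursor.execute(
    "INSERT INTO PersonTable (fname, mname, lname) VALUES (?, ?, ?)",
    ("Phil", "", "Vil"),
)
# Primary key generated for the row this cursor just inserted.
new_id = cursor.lastrowid
conn.commit()
cursor.close()

print(new_id)
```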

how to write sql to update some field given only one record in the target table

I got a table named test in MySQL database.
There are some fields in the test table, say, name.
However, there is only 0 or 1 record in the table.
When a new record arrives, say name = fox, I'd like to update that field in the test table.
I use Python to talk to MySQL, and my question is how to write the SQL.
PS. I tried to do this without a WHERE clause, but failed.
Suppose I've got the connection to the db, like the following:
conn = MySQLdb.connect(host=myhost, ...)

What you need here is a query which does the Merge kind of operation on your data. Algorithmically:
When record exists
do Update
Else
do Insert
You can go through this article to get a fair idea on doing things in this situation:
http://www.xaprb.com/blog/2006/06/17/3-ways-to-write-upsert-and-merge-queries-in-mysql/
What I personally recommend is the INSERT.. ON DUPLICATE KEY UPDATE
In your scenario, something like
INSERT INTO test (name)
VALUES ('fox')
ON DUPLICATE KEY UPDATE
name = 'fox';
Using this kind of a query you can handle the situation in one single shot.
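The "update if the record exists, else insert" logic outlined above can also be sketched from the Python side. This example uses sqlite3 as a stand-in for MySQL and assumes a single writer; the helper name set_name is hypothetical. One caveat when porting to MySQL: by default MySQL reports only rows whose values actually changed, so an UPDATE that sets the same value again yields a rowcount of 0.

```python
import sqlite3


def set_name(conn, name):
    """Update the single row if it exists, otherwise insert it.

    Hypothetical helper; assumes no concurrent writers.
    """
    cur = conn.execute("UPDATE test SET name = ?", (name,))
    if cur.rowcount == 0:
        # No row was updated, so the table was empty: insert instead.
        conn.execute("INSERT INTO test (name) VALUES (?)", (name,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (name TEXT)")

set_name(conn, "fox")   # table was empty, so this inserts
set_name(conn, "wolf")  # row exists, so this updates
rows = conn.execute("SELECT name FROM test").fetchall()
print(rows)
```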

Bulk upsert (insert-update) a csv in postgres [duplicate]

A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE KEY UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".

9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
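To make the shape of such a retry loop concrete, here is a minimal client-side sketch. It uses Python's sqlite3 module purely as a stand-in so that it runs without a server, and the names upsert_with_retry and max_attempts are hypothetical. The rollback discards the whole transaction, which is only acceptable when the upsert is the only work in it; treat this as an outline of the control flow, not a vetted implementation.

```python
import sqlite3


def upsert_with_retry(conn, row_id, somedata, max_attempts=5):
    """Try UPDATE, then INSERT; retry if another writer wins the race."""
    for _ in range(max_attempts):
        cur = conn.execute(
            "UPDATE testtable SET somedata = ? WHERE id = ?", (somedata, row_id)
        )
        if cur.rowcount > 0:
            # The row existed and was updated.
            conn.commit()
            return "updated"
        try:
            conn.execute(
                "INSERT INTO testtable (id, somedata) VALUES (?, ?)",
                (row_id, somedata),
            )
            conn.commit()
            return "inserted"
        except sqlite3.IntegrityError:
            # Someone inserted the row between our UPDATE and INSERT:
            # roll back and go around again so the UPDATE can match it.
            conn.rollback()
    raise RuntimeError("upsert did not succeed after %d attempts" % max_attempts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testtable (id INTEGER PRIMARY KEY, somedata TEXT NOT NULL)")
conn.execute("INSERT INTO testtable (id, somedata) VALUES (1, 'fred'), (2, 'bob')")
conn.commit()

first = upsert_with_retry(conn, 2, "Joe")
second = upsert_with_retry(conn, 3, "Alan")
contents = conn.execute("SELECT id, somedata FROM testtable ORDER BY id").fetchall()
print(first, second, contents)
```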
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)

Here are some examples for insert ... on conflict ... (pg 9.5+) :
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
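The examples above repeat the new values in the SET clause. Both PostgreSQL and SQLite (3.24 and newer) also provide an excluded pseudo-table that refers to the row that failed to insert, which is convenient for batches. The sketch below runs against SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, and assumes the bundled SQLite is at least 3.24; the starting row is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dummy (id INTEGER PRIMARY KEY, name TEXT, size INTEGER)")
# Illustrative starting row.
conn.execute("INSERT INTO dummy (id, name, size) VALUES (1, 'old_name', 1)")

rows = [(1, "new_name", 3), (2, "second", 5)]

# excluded.<column> is the value from the row that hit the conflict,
# so the same statement works for every row in the batch.
conn.executemany(
    "INSERT INTO dummy (id, name, size) VALUES (?, ?, ?) "
    "ON CONFLICT (id) DO UPDATE SET name = excluded.name, size = excluded.size",
    rows,
)
conn.commit()

result = conn.execute("SELECT id, name, size FROM dummy ORDER BY id").fetchall()
print(result)
```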

I am trying to contribute with another solution for the single insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform first the insertion, and in case the record is already present, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if there are no deletions of rows of the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.

SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a@b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
    index_elements=[my_table.c.user_email],
    index_where=my_table.c.user_email.like('%@gmail.com'),
    set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert

MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15, it is possible to use the MERGE command. It was presented as the first of the main improvements in this new version.
It uses a WHEN MATCHED / WHEN NOT MATCHED conditional in order to choose the behaviour when there is an existing row with same criteria.
It is even better than standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
INSERT (customer_id, balance)
VALUES (t.customer_id, t.transaction_value)

WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS
Tested on Postgresql 9.3

Since this question was closed, I'm posting here for how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
    """Groups data into chunks each with (at most) `chunksize` items.
    https://stackoverflow.com/a/22919323/408556
    """
    if chunksize:
        i = iter(content)
        generator = (list(it.islice(i, chunksize)) for _ in it.count())
    else:
        generator = iter([content])
    return it.takewhile(bool, generator)
def gen_resources(records):
    """Yields a dictionary if the record's id already exists, a row object
    otherwise.
    """
    ids = {item[0] for item in session.query(Posts.id)}
    for record in records:
        is_row = hasattr(record, 'to_dict')
        if is_row and record.id in ids:
            # It's a row but the id already exists, so we need to convert it
            # to a dict that updates the existing record. Since it is duplicate,
            # also yield True
            yield record.to_dict(), True
        elif is_row:
            # It's a row and the id doesn't exist, so no conversion needed.
            # Since it's not a duplicate, also yield False
            yield record, False
        elif record['id'] in ids:
            # It's a dict and the id already exists, so no conversion needed.
            # Since it is duplicate, also yield True
            yield record, True
        else:
            # It's a dict and the id doesn't exist, so we need to convert it.
            # Since it's not a duplicate, also yield False
            yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
    for records in chunk(data, chunksize):
        resources = gen_resources(records)
        sorted_resources = sorted(resources, key=itemgetter(1))
        for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
            items = [g[0] for g in group]
            if dupe:
                _upsert = partial(session.bulk_update_mappings, Posts)
            else:
                _upsert = session.add_all
            try:
                _upsert(items)
                session.commit()
            except IntegrityError:
                # A record was added or deleted after we checked, so retry
                #
                # modify accordingly by adding additional exceptions, e.g.,
                # except (IntegrityError, ValidationError, ValueError)
                session.rollback()
                upsert(items)
            except Exception as e:
                # Some other error occurred so reduce chunksize to isolate the
                # offending row(s)
                session.rollback()
                num_items = len(items)
                if num_items > 1:
                    upsert(items, num_items // 2)
                else:
                    print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).

How to use executemany() to insert multiple rows but ignore the erroneous ones?

I am using Python MySQLdb to insert data into a mysql database.
InsertList contains many rows. All are valid except for a few which violate database integrity rules.
If I run the code below, the command returns an error.
Cursor1.executemany(query, InsertList)
How can I force executemany() to insert the rows which are valid but ignore the few which are erroneous? The erroneous ones are caused by duplicate values in the new row. Do I have to use execute() one by one to insert the rows instead?
Thank you for your help.

Use the SQL command
INSERT IGNORE INTO
instead of plain
INSERT INTO
See the MySQL documentation for INSERT for details. Note that there is also REPLACE (to replace duplicates with the new values) and INSERT ... ON DUPLICATE KEY UPDATE for more control.

Python - Bulk Select then Insert from one DB to another

I'm looking for some help on how to do this in Python using sqlite3
Basically I have a process which downloads a DB (temp) and then needs to insert its records into a 2nd identical DB (the main db), and at the same time ignore/bypass any possible duplicate key errors
I was thinking of two scenarios but am unsure how to best do this in Python
Option 1:
create 2 connections and cursor objects, 1 to each DB
select from DB 1 eg:
dbcur.executemany('SELECT * from table1')
rows = dbcur.fetchall()
insert them into DB 2:
dbcur.execute('INSERT INTO table1 VALUES (:column1, :column2)', rows)
dbcon.commit()
This of course does not work as I'm not sure how to do it properly :)
Option 2 (which I would prefer, but not sure how to do):
SELECT and INSERT in 1 statement
Also, I have 4 tables within the DB's each with varying columns, can I skip naming the columns on the INSERT statement?
As far as the duplicate keys go, I have read I can use 'ON DUPLICATE KEY' to handle
eg.
INSERT INTO table1 VALUES (:column1, :column2) ON DUPLICATE KEY UPDATE set column1=column1

You can ATTACH two databases to the same connection with code like this:
import sqlite3
connection = sqlite3.connect('/path/to/temp.sqlite')
cursor=connection.cursor()
cursor.execute('ATTACH "/path/to/main.sqlite" AS master')
There is no ON DUPLICATE KEY syntax in sqlite as there is in MySQL. This SO question contains alternatives.
So to do the bulk insert in one sql statement, you could use something like
cursor.execute('INSERT OR REPLACE INTO master.table1 SELECT * FROM table1')
See this page for information about REPLACE and other ON CONFLICT options.
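Putting the ATTACH and INSERT ... SELECT steps together, a self-contained sketch might look like the following. It builds two throwaway database files in a temporary directory so it can run anywhere; the paths, sample rows, and the alias main_db are all invented for the example. It uses OR IGNORE rather than OR REPLACE, on the assumption that existing rows in the main database should be left as they are.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
temp_path = os.path.join(workdir, "temp.sqlite")
main_path = os.path.join(workdir, "main.sqlite")

# Build two small databases with the same schema (illustrative data only).
for path, rows in (
    (temp_path, [(1, "from temp"), (2, "only in temp")]),
    (main_path, [(1, "from main")]),
):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE table1 (column1 INTEGER PRIMARY KEY, column2 TEXT)")
    db.executemany("INSERT INTO table1 VALUES (?, ?)", rows)
    db.commit()
    db.close()

connection = sqlite3.connect(temp_path)
# Attach the main database to the same connection under an alias.
connection.execute("ATTACH DATABASE ? AS main_db", (main_path,))
# Copy every row in one statement; rows whose key already exists
# in the main database are skipped.
connection.execute("INSERT OR IGNORE INTO main_db.table1 SELECT * FROM table1")
connection.commit()

merged = connection.execute(
    "SELECT column1, column2 FROM main_db.table1 ORDER BY column1"
).fetchall()
connection.close()
print(merged)
```

Because SELECT * is used, the column lists do not need to be spelled out, provided both tables declare their columns in the same order.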

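The approach in option 1 is sound, but the two calls are swapped: use execute() for the SELECT and executemany() for the INSERT, with positional ? placeholders, since fetchall() returns a list of tuples.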
If you need filtering to bypass duplicate keys, do the insert into a temporary table and then use SQL commands to eliminate duplicates and merge them into the target table.
