I'm running a python script with sqlalchemy to export and then import data from production to a postgres db on a daily basis. The script runs successfully once and then the second time and beyond the script fails. As you will see in the script below, the error returned suggets the dependencies in the tables (foreign keys) are the cause of the import failure, however, I do not understand why this issue is not circumvented by the sorted_tables object. I've opted to remove any of the intialiation code like repoistory imports, db connection objects to simplify the post and reduce clutter.
def create_db(src,dst,src_schema,dst_schema,drop_dst_schema=False):
if drop_dst_schema:
post_db.engine.execute('DROP SCHEMA IF EXISTS {0} CASCADE'.format(dst_schema))
print "Schema {0} Dropped".format(dst_schema)
post_db.engine.execute('CREATE SCHEMA IF NOT EXISTS {0}'.format(dst_schema))
post_db.engine.execute('GRANT USAGE ON SCHEMA {0} TO {0}_ro'.format(dst_schema))
post_db.engine.execute('GRANT USAGE ON SCHEMA {0} TO {0}_rw'.format(dst_schema))
print "Schema {0} Created".format(dst_schema)
def create_table(tbl, dst_schema):
dest_table=tbl
dest_table.schema=dst_schema
for col in dest_table.columns:
if hasattr(col.type, 'collation'):
col.type.collation = None
if col.name == 'id':
dest_table.append_constraint(PrimaryKeyConstraint(col))
col.type=convert(col.type)
timestamp_col=Column ('timestamp',DateTime(timezone=False), server_default=func.now())
#print tbl.c
dest_table.append_column(timestamp_col)
dest_table.create(post_db.engine,checkfirst=True)
post_db.engine.execute('GRANT INSERT ON {1} to {0}_ro'.format(dst_schema, dest_table))
post_db.engine.execute('GRANT ALL PRIVILEGES ON {1} to {0}_rw'.format(dst_schema, dest_table))
print "Table {0} created".format(dest_table)
create_db(mysql_db.engine,post_db.engine,src_schema,dst_schema,drop_dst_schema=False)
mysql_meta=MetaData(bind=mysql_db.engine)
mysql_meta.reflect(schema=src_schema)
post_meta=MetaData(bind=post_db.engine)
post_meta.reflect(schema=dst_schema)
script_begin=time.time()
rejected_list=[]
for table in mysql_meta.sorted_tables:
df=mysql_db.sql_retrieve('select * from {0}'.format(table.name))
df=df.where((pd.notnull(df)), None)
print "Table {0} : {1}".format(table.name,len(df))
dest_table=table
dest_table.schema = dst_schema
dest_table.drop(post_db.engine, checkfirst=True)
create_table(dest_table, dst_schema)
print "Table {0} emptied".format(dest_table.name)
try:
start=time.time()
if len(df)>10000:
for g,df_new in df.groupby(np.arange(len(df))//10000):
dict_items=df_new.to_dict(orient='records')
post_db.engine.connect().execute(dest_table.insert().values(dict_items))
else:
dict_items=df.to_dict(orient='records')
post_db.engine.connect().execute(dest_table.insert().values(dict_items))
loadtime=time.time()-start
print "Data loaded with datasize {0}".format(str(len(df)))
print "Table {0} loaded to BI database with loadtime {1}".format(dest_table.name,loadtime)
except:
print "Table {0} could not be loaded".format(dest_table.name)
rejected_list.append(dest_table.name)
If I drop the entire dst_schema before importing the data, the import succeeds.
This is the erorr I see:
sqlalchemy.exc.InternalError: (psycopg2.InternalError) cannot drop table A because other objects depend on it
DETAIL: constraint fk_rails_111193 on table B depends on table A
HINT: Use DROP ... CASCADE to drop the dependent objects too.
[SQL: '\nDROP TABLE A']
Can someone steer me into a possible solution?
Are there better alternatives other than dropping the dst_schema before importing the data to the destination db (drop_dst_schema=true)?
def create_db(src,dst,src_schema,dst_schema,drop_dst_schema=True)
Has anyone have an idea why sorted_tables does not drop the dependencies in the schema? Am I misunderstanding this object?
You have several options:
Drop the whole schema every time
If you have a complex schema, with any kind of closed loop reference chain, your best option is to always drop the whole schema.
You could have some self-referencing tables (such as a persons table, with a self-relation of type person parent-of person). You could also have a schema where table A references table B which references table A. For instance, you have one table persons and one companies, and two relations (probably, with intermediat tables): company employs persons, and persons trade shares of companies.
In cases like this, that are realistic, no matter what you do with sorted_tables, this will never work.
If you're actually replicating data from another DB, and can afford the time, dropping and recreating the whole schema is the easiest to implement solution. Your code will be much simpler: less cases to consider.
DROP CASCADE
You can also drop the tables using DROP CASCADE. If one table is referenced by another, this will drop both (or as many as necessary). You have to make sure the order in which you DROP and CREATE gives you the end result you expect. I'd check very carefully that this scenario works in all cases.
Drop all FK constraints, then recreate them at the end
Also, there is also one last possibility: drop all FK constraints for all tables before manipulating them, and recreate them at the end. This way, you'll be able to drop any table at any moment.
I have table with unique contraint on one colum like:
CREATE TABLE entity (
id INT NOT NULL AUTO_INCREMENT,
zip_code INT NOT NULL,
entity_url VARCHAR(255) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY ix_uniq_zip_code_entity_url (zip_code, entity_url)
);
and corresponding SQLAlchemy model. I adding a lot of records and do not want to commit session after each record. My assumption that better to call session.add(new_record) multiple times and one time session.commit().
But while adding new records I could get IntegrityError because constraint is violated. It is normal situation and I want just skip such records insertion. But looks like I can only revert entire transaction.
Also I do not want to add another complex checks "get all records from database where zip_code in [...] and entity_url in [...] then drop matched data from records_to_insert".
Is there way to instuct SQLAlchemy drop records which violates constraint?
My assumption that better to call session.add(new_record) multiple times and one time session.commit().
You might want to revisit this assumption. Batch processing of a lot of records usually lends itself to multiple commits -- what if you have 10k records and your code raises an exception on the 9,999th? You'll be forced to start over. The core question here is whether or not it makes sense for one of the records to exist in the database without the rest. If it does, then there's no problem committing on each entry (performance issues aside). In that case, you can simply catch the IntegrityError and call session.rollback() to continue down the list of records.
In any case, a similar question was asked on the SQLA mailing list and answered by the library's creator, Mike Bayer. He recommended removing the duplicates from the list of new records yourself, as this is easy to do with a dictionary or a set. This could be as simple as a dict comprehension:
new_entities = { (entity['zip_code'], entity['url']): entity for entity in new_entities}
(This would pick the last-seen duplicate as the one to add to the DB.)
Also note that he uses the SQLAlchemy core library to perform the inserts, rather than the ORM's session.add() method:
sess.execute(Entry.__table__.insert(), params=inserts)
This is a much faster option if you're dealing with a lot of records (like in his example, with 100,000 records).
If you decide to insert your records row by row, you can check if it already exists before you do your insert. This may or may not be more elegant and efficient:
def record_exists(session, some_id):
return session.query(exists().where(YourEntity.id == some_id)).scalar()
for item in items:
if not record_exists(session, item.some_id):
session.add(record)
session.flush()
else:
print "Already exists, skipping..."
session.commit()
I need create a lot of unique codes and insert it in database.
Of course I can write something like this:
codes = set()
while len(codes) < codes_size:
c = generate_code()
if len(Codes.objects.filter(code=c)) == 0:
codes.add(Codes(c))
Codes.objects.bulk_create(codes)
But when database already contain a lot of codes it works very slow.
If insert code after each generation - it's very slow too.
Now best idea - not verify code until bulk_create. And if bulk_create raise exception then regenerate all codes again. Exceptions very rare, but when database will be more bigger then and exceptions will be more often.
bulk_create not say which code raise exception.
My understanding is that bulk_create() performs it's operation within a transaction which is not committed if an error occurs. This means that either all inserts succeeded, or none succeeded.
For example, if a code is generated that is a duplicate of one that is already in the table, or a duplicate of another within the batch, an IntegrityError exception would be raised and none of the codes would have been inserted into the database.
In terms of exceptions, you'll likely get a subclass of django.db.utils.DatabaseError, e.g. django.db.utils.IntegrityError. Because you don't know which database error will be raised you should catch DatabaseError.
I don't think that there is any way to know from the exception which of the codes caused the problem.
One way to handle this is to generate all of the codes up front and then, within a transaction, test whether they already exist in the table in one go using filter() and insert them if there are no duplicates:
from django.db import transaction
codes = set()
while len(codes) < codes_size:
codes.add(generate_code())
with transaction.atomic():
# check for any duplicate codes in table...
dups = Codes.objects.filter(code__in=codes)
if len(dups):
print 'Duplicate code(s) generated: {}'.format(dup.code for dup in dups)
# remove dups from codes, and generate additional codes to make up the shortfall.
# Note that this might also result in duplicates....
else:
Codes.objects.bulk_create(Codes(code) for code in codes)
You still need to handle database exceptions that are not due to duplicate values. For simplicity I've left that out.
A very frequently asked question here is how to do an upsert, which is what MySQL calls INSERT ... ON DUPLICATE UPDATE and the standard supports as part of the MERGE operation.
Given that PostgreSQL doesn't support it directly (before pg 9.5), how do you do this? Consider the following:
CREATE TABLE testtable (
id integer PRIMARY KEY,
somedata text NOT NULL
);
INSERT INTO testtable (id, somedata) VALUES
(1, 'fred'),
(2, 'bob');
Now imagine that you want to "upsert" the tuples (2, 'Joe'), (3, 'Alan'), so the new table contents would be:
(1, 'fred'),
(2, 'Joe'), -- Changed value of existing tuple
(3, 'Alan') -- Added new tuple
That's what people are talking about when discussing an upsert. Crucially, any approach must be safe in the presence of multiple transactions working on the same table - either by using explicit locking, or otherwise defending against the resulting race conditions.
This topic is discussed extensively at Insert, on duplicate update in PostgreSQL?, but that's about alternatives to the MySQL syntax, and it's grown a fair bit of unrelated detail over time. I'm working on definitive answers.
These techniques are also useful for "insert if not exists, otherwise do nothing", i.e. "insert ... on duplicate key ignore".
9.5 and newer:
PostgreSQL 9.5 and newer support INSERT ... ON CONFLICT (key) DO UPDATE (and ON CONFLICT (key) DO NOTHING), i.e. upsert.
Comparison with ON DUPLICATE KEY UPDATE.
Quick explanation.
For usage see the manual - specifically the conflict_action clause in the syntax diagram, and the explanatory text.
Unlike the solutions for 9.4 and older that are given below, this feature works with multiple conflicting rows and it doesn't require exclusive locking or a retry loop.
The commit adding the feature is here and the discussion around its development is here.
If you're on 9.5 and don't need to be backward-compatible you can stop reading now.
9.4 and older:
PostgreSQL doesn't have any built-in UPSERT (or MERGE) facility, and doing it efficiently in the face of concurrent use is very difficult.
This article discusses the problem in useful detail.
In general you must choose between two options:
Individual insert/update operations in a retry loop; or
Locking the table and doing batch merge
Individual row retry loop
Using individual row upserts in a retry loop is the reasonable option if you want many connections concurrently trying to perform inserts.
The PostgreSQL documentation contains a useful procedure that'll let you do this in a loop inside the database. It guards against lost updates and insert races, unlike most naive solutions. It will only work in READ COMMITTED mode and is only safe if it's the only thing you do in the transaction, though. The function won't work correctly if triggers or secondary unique keys cause unique violations.
This strategy is very inefficient. Whenever practical you should queue up work and do a bulk upsert as described below instead.
Many attempted solutions to this problem fail to consider rollbacks, so they result in incomplete updates. Two transactions race with each other; one of them successfully INSERTs; the other gets a duplicate key error and does an UPDATE instead. The UPDATE blocks waiting for the INSERT to rollback or commit. When it rolls back, the UPDATE condition re-check matches zero rows, so even though the UPDATE commits it hasn't actually done the upsert you expected. You have to check the result row counts and re-try where necessary.
Some attempted solutions also fail to consider SELECT races. If you try the obvious and simple:
-- THIS IS WRONG. DO NOT COPY IT. It's an EXAMPLE.
BEGIN;
UPDATE testtable
SET somedata = 'blah'
WHERE id = 2;
-- Remember, this is WRONG. Do NOT COPY IT.
INSERT INTO testtable (id, somedata)
SELECT 2, 'blah'
WHERE NOT EXISTS (SELECT 1 FROM testtable WHERE testtable.id = 2);
COMMIT;
then when two run at once there are several failure modes. One is the already discussed issue with an update re-check. Another is where both UPDATE at the same time, matching zero rows and continuing. Then they both do the EXISTS test, which happens before the INSERT. Both get zero rows, so both do the INSERT. One fails with a duplicate key error.
This is why you need a re-try loop. You might think that you can prevent duplicate key errors or lost updates with clever SQL, but you can't. You need to check row counts or handle duplicate key errors (depending on the chosen approach) and re-try.
Please don't roll your own solution for this. Like with message queuing, it's probably wrong.
Bulk upsert with lock
Sometimes you want to do a bulk upsert, where you have a new data set that you want to merge into an older existing data set. This is vastly more efficient than individual row upserts and should be preferred whenever practical.
In this case, you typically follow the following process:
CREATE a TEMPORARY table
COPY or bulk-insert the new data into the temp table
LOCK the target table IN EXCLUSIVE MODE. This permits other transactions to SELECT, but not make any changes to the table.
Do an UPDATE ... FROM of existing records using the values in the temp table;
Do an INSERT of rows that don't already exist in the target table;
COMMIT, releasing the lock.
For example, for the example given in the question, using multi-valued INSERT to populate the temp table:
BEGIN;
CREATE TEMPORARY TABLE newvals(id integer, somedata text);
INSERT INTO newvals(id, somedata) VALUES (2, 'Joe'), (3, 'Alan');
LOCK TABLE testtable IN EXCLUSIVE MODE;
UPDATE testtable
SET somedata = newvals.somedata
FROM newvals
WHERE newvals.id = testtable.id;
INSERT INTO testtable
SELECT newvals.id, newvals.somedata
FROM newvals
LEFT OUTER JOIN testtable ON (testtable.id = newvals.id)
WHERE testtable.id IS NULL;
COMMIT;
Related reading
UPSERT wiki page
UPSERTisms in Postgres
Insert, on duplicate update in PostgreSQL?
http://petereisentraut.blogspot.com/2010/05/merge-syntax.html
Upsert with a transaction
Is SELECT or INSERT in a function prone to race conditions?
SQL MERGE on the PostgreSQL wiki
Most idiomatic way to implement UPSERT in Postgresql nowadays
What about MERGE?
SQL-standard MERGE actually has poorly defined concurrency semantics and is not suitable for upserting without locking a table first.
It's a really useful OLAP statement for data merging, but it's not actually a useful solution for concurrency-safe upsert. There's lots of advice to people using other DBMSes to use MERGE for upserts, but it's actually wrong.
Other DBs:
INSERT ... ON DUPLICATE KEY UPDATE in MySQL
MERGE from MS SQL Server (but see above about MERGE problems)
MERGE from Oracle (but see above about MERGE problems)
Here are some examples for insert ... on conflict ... (pg 9.5+) :
Insert, on conflict - do nothing.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict do nothing;`
Insert, on conflict - do update, specify conflict target via column.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict(id)
do update set name = 'new_name', size = 3;
Insert, on conflict - do update, specify conflict target via constraint name.
insert into dummy(id, name, size) values(1, 'new_name', 3)
on conflict on constraint dummy_pkey
do update set name = 'new_name', size = 4;
I am trying to contribute with another solution for the single insertion problem with the pre-9.5 versions of PostgreSQL. The idea is simply to try to perform first the insertion, and in case the record is already present, to update it:
do $$
begin
insert into testtable(id, somedata) values(2,'Joe');
exception when unique_violation then
update testtable set somedata = 'Joe' where id = 2;
end $$;
Note that this solution can be applied only if there are no deletions of rows of the table.
I do not know about the efficiency of this solution, but it seems to me reasonable enough.
SQLAlchemy upsert for Postgres >=9.5
Since the large post above covers many different SQL approaches for Postgres versions (not only non-9.5 as in the question), I would like to add how to do it in SQLAlchemy if you are using Postgres 9.5. Instead of implementing your own upsert, you can also use SQLAlchemy's functions (which were added in SQLAlchemy 1.1). Personally, I would recommend using these, if possible. Not only because of convenience, but also because it lets PostgreSQL handle any race conditions that might occur.
Cross-posting from another answer I gave yesterday (https://stackoverflow.com/a/44395983/2156909)
SQLAlchemy supports ON CONFLICT now with two methods on_conflict_do_update() and on_conflict_do_nothing():
Copying from the documentation:
from sqlalchemy.dialects.postgresql import insert
stmt = insert(my_table).values(user_email='a#b.com', data='inserted data')
stmt = stmt.on_conflict_do_update(
index_elements=[my_table.c.user_email],
index_where=my_table.c.user_email.like('%#gmail.com'),
set_=dict(data=stmt.excluded.data)
)
conn.execute(stmt)
http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=conflict#insert-on-conflict-upsert
MERGE in PostgreSQL v. 15
Since PostgreSQL v. 15, is possible to use MERGE command. It actually has been presented as the first of the main improvements of this new version.
It uses a WHEN MATCHED / WHEN NOT MATCHED conditional in order to choose the behaviour when there is an existing row with same criteria.
It is even better than standard UPSERT, as the new feature gives full control to INSERT, UPDATE or DELETE rows in bulk.
MERGE INTO customer_account ca
USING recent_transactions t
ON t.customer_id = ca.customer_id
WHEN MATCHED THEN
UPDATE SET balance = balance + transaction_value
WHEN NOT MATCHED THEN
INSERT (customer_id, balance)
VALUES (t.customer_id, t.transaction_value)
WITH UPD AS (UPDATE TEST_TABLE SET SOME_DATA = 'Joe' WHERE ID = 2
RETURNING ID),
INS AS (SELECT '2', 'Joe' WHERE NOT EXISTS (SELECT * FROM UPD))
INSERT INTO TEST_TABLE(ID, SOME_DATA) SELECT * FROM INS
Tested on Postgresql 9.3
Since this question was closed, I'm posting here for how you do it using SQLAlchemy. Via recursion, it retries a bulk insert or update to combat race conditions and validation errors.
First the imports
import itertools as it
from functools import partial
from operator import itemgetter
from sqlalchemy.exc import IntegrityError
from app import session
from models import Posts
Now a couple helper functions
def chunk(content, chunksize=None):
"""Groups data into chunks each with (at most) `chunksize` items.
https://stackoverflow.com/a/22919323/408556
"""
if chunksize:
i = iter(content)
generator = (list(it.islice(i, chunksize)) for _ in it.count())
else:
generator = iter([content])
return it.takewhile(bool, generator)
def gen_resources(records):
"""Yields a dictionary if the record's id already exists, a row object
otherwise.
"""
ids = {item[0] for item in session.query(Posts.id)}
for record in records:
is_row = hasattr(record, 'to_dict')
if is_row and record.id in ids:
# It's a row but the id already exists, so we need to convert it
# to a dict that updates the existing record. Since it is duplicate,
# also yield True
yield record.to_dict(), True
elif is_row:
# It's a row and the id doesn't exist, so no conversion needed.
# Since it's not a duplicate, also yield False
yield record, False
elif record['id'] in ids:
# It's a dict and the id already exists, so no conversion needed.
# Since it is duplicate, also yield True
yield record, True
else:
# It's a dict and the id doesn't exist, so we need to convert it.
# Since it's not a duplicate, also yield False
yield Posts(**record), False
And finally the upsert function
def upsert(data, chunksize=None):
for records in chunk(data, chunksize):
resources = gen_resources(records)
sorted_resources = sorted(resources, key=itemgetter(1))
for dupe, group in it.groupby(sorted_resources, itemgetter(1)):
items = [g[0] for g in group]
if dupe:
_upsert = partial(session.bulk_update_mappings, Posts)
else:
_upsert = session.add_all
try:
_upsert(items)
session.commit()
except IntegrityError:
# A record was added or deleted after we checked, so retry
#
# modify accordingly by adding additional exceptions, e.g.,
# except (IntegrityError, ValidationError, ValueError)
db.session.rollback()
upsert(items)
except Exception as e:
# Some other error occurred so reduce chunksize to isolate the
# offending row(s)
db.session.rollback()
num_items = len(items)
if num_items > 1:
upsert(items, num_items // 2)
else:
print('Error adding record {}'.format(items[0]))
Here's how you use it
>>> data = [
... {'id': 1, 'text': 'updated post1'},
... {'id': 5, 'text': 'updated post5'},
... {'id': 1000, 'text': 'new post1000'}]
...
>>> upsert(data)
The advantage this has over bulk_save_objects is that it can handle relationships, error checking, etc on insert (unlike bulk operations).
Assume that I have a table user_count defined as follows:
id primary key, auto increment
user_id unique
count default 0
What I want to do is increment count by one when an existing record of a user exists or else insert a new record.
Currently, I do it this way (in Python):
try:
cursor.execute("INSERT INTO user_count (user_id) VALUES (%s)", user.id)
except IntegrityError:
cursor.execute("UPDATE user_count SET count = count+1 WHERE user_id = %s", user.id)
And it can also be implement this way:
cursor.execute("INSERT INTO user_count (user_id) VALUES (user_id) ON DUPLICATE KEY UPDATE count = count + 1", user.id)
What's the difference between these two ways, and which one is better?
The second one is a single SQL command which makes use of the feature that the database offers for solving exactly the problem you have here.
It'd use that as it should be faster and more reliable.
The first one is a fallback if that feature is not available (older database version?).
The first one uses an exception to direct the flow of the program, which isn't what you should do unless you have no other solutions to it (e.g. getting exclusive access to a file). Also, it takes the work from the database which should know better to handle the case.
The second code handles all the work in the database which in turn can optimize the query plan to a very efficient manner.
I would use the second solution as the database usually knows better than yourself how to handle a case.