I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
# Point the sequence at the current max(id) so new inserts don't collide
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)
About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increment counters (sequences) don't advance when an id is specified explicitly.
    # This sets the sequence's current value to the highest existing primary key.
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
Which I did using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
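If you want to run it across the whole project after loading the data, here is a rough sketch (assuming every model uses a single integer auto PK; apps.get_models() is the standard registry call, and empty tables are skipped since max() would return NULL):
from django.apps import apps

# Hypothetical post-load pass: resync the sequence for every auto-PK model.
for model in apps.get_models():
    if model._meta.pk.get_internal_type() in ("AutoField", "BigAutoField") and model.objects.exists():
        synchronize_last_sequence(model)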
One option is to use natural keys during serialization and deserialization. That way, when you insert the data into PostgreSQL, it will auto-increment the primary key field and keep everything in line.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
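For reference, a rough sketch of what the natural-key plumbing looks like; the Author model, its name field, and the manager are illustrative, but get_by_natural_key / natural_key and the serializer flags are Django's standard natural-key hooks:
from django.core import serializers
from django.db import models

class AuthorManager(models.Manager):
    def get_by_natural_key(self, name):
        # Used during deserialization to find the row without its numeric id
        return self.get(name=name)

class Author(models.Model):
    name = models.CharField(max_length=150, unique=True)

    objects = AuthorManager()

    def natural_key(self):
        # Used during serialization instead of the primary key
        return (self.name,)

# Dump without numeric PKs so PostgreSQL assigns fresh ones on load:
data = serializers.serialize(
    "json",
    Author.objects.all(),
    use_natural_foreign_keys=True,
    use_natural_primary_keys=True,
)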
I'm wondering what the best strategy is for using insert-only permissions to a postgres db with Peewee. I'd like this in order to be certain that a specific user can't read any data back out of the database.
I granted INSERT permissions to my table, 'test', in postgres. But I've run into the problem that when I try to save new rows with something like:
thing = Test(value=1)
thing.save()
The sql actually contains a RETURNING clause that needs more permissions (namely, SELECT) than just insert:
INSERT INTO "test" ("value") VALUES (1) RETURNING "test"."id"
It seems like the same SQL is generated when I try to use query = Test.insert(value=1); query.execute() as well.
From looking around, it seems like you need to either grant SELECT privileges or use a more exotic feature like "row level security" in Postgres. Is there any way to go about this with peewee out of the box? Or another suggestion of how to add new rows with truly write-only permissions?
You can omit the returning clause by explicitly writing your INSERT query and supplying a blank RETURNING. Peewee uses RETURNING whenever possible so that the auto-generated PK can be recovered in a single operation, but it is possible to disable it:
# Empty call to returning will disable the RETURNING clause:
iq = Test.insert(value=1).returning()
iq.execute()
You can also override this for all INSERT operations by setting the returning_clause attribute on the DB to False:
db = PostgresqlDatabase(...)
db.returning_clause = False
This is not an officially supported approach, though, and may have unintended side-effects or weird behavior - caveat emptor.
What's the potential pitfall of always using 'implicit_returning': False in SQLAlchemy?
I've encountered problems a number of times when working on MSSQL tables that have triggers defined, and since the DB is in replication, all of the tables have triggers.
I'm not sure now what the problem exactly is. It has something to do with auto-increment fields - maybe because I'm prefetching the auto-incremented value so I can insert it in another table.
If I don't set 'implicit_returning': False for the table, when I try to insert values, I get this error:
The target table of the DML statement cannot have any enabled triggers
if the statement contains an OUTPUT clause without INTO clause.
So what if I put __table_args__ = {'implicit_returning': False} into all mapped classes just to be safe?
Particularly frustrating for me is that the local DB I use for development & testing is not replicated and doesn't need that option, but the production DB is replicated, so when I deploy changes they sometimes don't work. :)
As you probably already know, the cause of your predicament is described in SQLAlchemy Docs as the following:
SQLAlchemy by default uses OUTPUT INSERTED to get at newly generated
primary key values via IDENTITY columns or other server side
defaults. MS-SQL does not allow the usage of OUTPUT INSERTED on
tables that have triggers. To disable the usage of OUTPUT INSERTED
on a per-table basis, specify implicit_returning=False for each
Table which has triggers.
If you set your SQLAlchemy engine to echo the SQL, you will see that by default, it does this:
INSERT INTO [table] (id, ...) OUTPUT inserted.[id] VALUES (...)
But if you disable implicit_returning, it does this instead:
INSERT INTO [table] (id, ...) VALUES (...); select scope_identity()
So the question, "Is there any harm in disabling implicit_returning for all tables just in case?" is really, "Is there any disadvantage to using SCOPE_IDENTITY() instead of OUTPUT INSERTED?"
I'm no expert, but I get the impression that although OUTPUT INSERTED is the preferred method these days, SCOPE_IDENTITY() is usually fine too. In the past, SQL Server 2008 (and maybe earlier versions too?) had a bug where SCOPE_IDENTITY sometimes didn't return the correct value, but I hear that has now been fixed (see this question for more detail). (On the other hand, other techniques like @@IDENTITY and IDENT_CURRENT() are still dangerous since they can return the wrong value in corner cases. See this answer and the others on that same page for more detail.)
The big advantage that OUTPUT INSERTED still has is that it can work for cases where you are inserting multiple rows via a single INSERT statement. Is that something you are doing with SQLAlchemy? Probably not, right? So it doesn't matter.
Note that if you are going to have to disable implicit_returning for many tables, you could avoid a bit of boilerplate by making a mixin for it (and whichever other columns and properties you want all of the tables to inherit):
from sqlalchemy import Column, Integer, String

class AutoincTriggerMixin():
    __table_args__ = {
        'implicit_returning': False
    }
    id = Column(Integer, primary_key=True, autoincrement=True)

class SomeModel(AutoincTriggerMixin, Base):
    some_column = Column(String(1000))
    ...
See this page in the SQLAlchemy documentation for more detail. As an added bonus, it makes it more obvious which tables involve triggers.
I have a Django app that has a model (Person) defined, and I also have another DB (containing a table, Appointment) that has no models defined (it's not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person, so that the Person table mirrors the Appointment table. It is this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall()  # from Appointment table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except:
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approaches:
1. bulk_create: Cannot use this because I have to update sometimes.
2. Delete all and then bulk_create: There are dependencies on the Person model from other things, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: Cannot do this because the Person table's PK is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check for duplicates on two keys, I think this approach would work.
A few ideas:
EDIT: See Trewq's comment to this and, first of all, create indexes on your tables…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (should not be one with 20k records), fetch all data first from the appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts into a Python list for a later bulk_create instead of saving them directly (see the sketch after this list).
As a last resort: Write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option.
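A rough sketch combining the first three points above (one transaction, a single SELECT for the existing rows, bulk_create for the new ones); it reuses the names from the question (sourcedb, myoffice, the myrecord column positions) and assumes vendorid plus office identify a row:
from django.db import transaction

res = sourcedb.fetchall()  # from the Appointment table, as in the question
vendor_ids = [r[12] for r in res]

with transaction.atomic():
    # One SELECT for all existing rows instead of one query per record
    existing = {
        p.vendorid: p
        for p in Person.objects.filter(vendorid__in=vendor_ids, office=myoffice)
    }
    to_create = []
    for r in res:
        person = existing.get(r[12]) or Person(vendorid=r[12], office=myoffice)
        person.firstname, person.midname, person.lastname = r[0], r[1], r[2]
        person.address1, person.address2 = r[3], r[4]
        if person.pk is None:
            to_create.append(person)  # defer new rows to a single bulk_create
        else:
            person.save()             # updates still go row by row
    Person.objects.bulk_create(to_create)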
I apologize if my question turns out to be silly, but I'm rather new to Django, and I could not find an answer anywhere.
I have the following model:
class BlackListEntry(models.Model):
    user_banned = models.ForeignKey(auth.models.User, related_name="user_banned")
    user_banning = models.ForeignKey(auth.models.User, related_name="user_banning")
Now, when I try to create an object like this:
BlackListEntry.objects.create(user_banned=int(user_id),user_banning=int(banning_id))
I get the following error:
Cannot assign "1": "BlackListEntry.user_banned" must be a "User" instance.
Of course, if I replace it with something like this:
user_banned = User.objects.get(pk=user_id)
user_banning = User.objects.get(pk=banning_id)
BlackListEntry.objects.create(user_banned=user_banned,user_banning=user_banning)
everything works fine. The question is:
Does my solution hit the database to retrieve both users, and if yes, is it possible to avoid it, just passing ids?
The answer to your question is: YES.
Django will hit the database (at least) three times: twice to retrieve the two User objects and a third time to save your desired information. This causes absolutely unnecessary overhead.
Just try:
BlackListEntry.objects.create(user_banned_id=int(user_id),user_banning_id=int(banning_id))
This is the default name pattern for the FK fields generated by the Django ORM. This way you can set the information directly and avoid the extra queries.
If you wanted to query for the already saved BlackListEntry objects, you can navigate the attributes with a double underscore, like this:
BlackListEntry.objects.filter(user_banned__id=int(user_id),user_banning__id=int(banning_id))
This is how you access properties in Django querysets: with a double underscore. Then you can compare against the value of the attribute.
Though they look very similar, the two work completely differently. The first one sets an attribute directly, while the second one is parsed by Django, which splits it at the '__' and queries the database the right way, with the second part being the name of an attribute.
You can always compare user_banned and user_banning with the actual User objects, instead of their ids. But there is no use for this if you don't already have those objects with you.
Hope it helps.
I do believe that when you fetch the users, it is going to hit the db...
To avoid it, you would have to write the raw SQL to do the update using the method described here:
https://docs.djangoproject.com/en/dev/topics/db/sql/
If you decide to go that route, keep in mind that you are responsible for protecting yourself from SQL injection attacks.
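If you do go that route, passing the values as query parameters (rather than formatting them into the string) is what protects you; a minimal sketch, with the table name assumed from the model above (Django defaults to <app_label>_blacklistentry):
from django.db import connection

with connection.cursor() as cursor:
    # %s placeholders are filled in by the database driver, never by string formatting
    cursor.execute(
        "INSERT INTO myapp_blacklistentry (user_banned_id, user_banning_id) VALUES (%s, %s)",
        [int(user_id), int(banning_id)],
    )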
Another alternative would be to cache the user_banned and user_banning objects.
But in all likelihood, simply grabbing the users and creating the BlackListEntry won't cause you any noticeable performance problems. Caching or executing raw sql will only provide a small benefit. You're probably going to run into other issues before this becomes a problem.
How can I get the id of the created record in SQLAlchemy?
I'm doing:
engine.execute("insert into users values (1,'john')")
When you execute a plain text statement, you're at the mercy of the DBAPI you're using as to whether or not the new PK value is available and via what means. With the SQLite and MySQL DBAPIs you'll have it as result.lastrowid, which just gives you the value of .lastrowid for the cursor. With PG, Oracle, etc., there's no ".lastrowid" - as someone else said you can use "RETURNING" for those, in which case results are available via result.fetchone() (although using RETURNING with Oracle, again not taking advantage of SQLAlchemy expression constructs, requires several awkward steps), or if RETURNING isn't available you can use direct sequence access (NEXTVAL in PG), or a "post fetch" operation (CURRVAL in PG, @@identity or scope_identity() in MSSQL).
Sounds complicated, right? That's why you're better off using table.insert(). SQLAlchemy's primary system of providing newly generated PKs is designed to work with these constructs. Once you're there, the result.last_inserted_ids() method gives you the newly generated (possibly composite) PK in all cases, regardless of backend. The above methods of .lastrowid, sequence execution, RETURNING etc. are all dealt with for you (0.6 uses RETURNING when available).
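For illustration, a minimal sketch of the expression-language route this answer recommends; on recent SQLAlchemy the generated key is exposed as result.inserted_primary_key (last_inserted_ids() is the older 0.5/0.6-era spelling):
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("sqlite://")  # any backend; PK retrieval is handled per dialect
metadata = MetaData()
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(50)),
)
metadata.create_all(engine)

with engine.begin() as conn:
    result = conn.execute(users.insert().values(name="john"))
    print(result.inserted_primary_key)  # the newly generated (possibly composite) PK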
There's an extra clause you can add: RETURNING
i.e.
INSERT INTO users (name, address) VALUES ('richo', 'beaconsfield') RETURNING id
Then just retrieve a row as if your INSERT were a SELECT statement.
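For completeness, a hedged sketch of doing that from SQLAlchemy with a textual statement and bound parameters (PostgreSQL syntax, reusing the engine from the question):
from sqlalchemy import text

with engine.begin() as conn:
    result = conn.execute(
        text("INSERT INTO users (name, address) VALUES (:name, :address) RETURNING id"),
        {"name": "richo", "address": "beaconsfield"},
    )
    new_id = result.fetchone()[0]  # read the RETURNING row just like a SELECT result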