How can I improve django mysql copy performance?

I have a Django app with a model (Person) defined, and I also have some other DBs (one of them contains a table Appointment) that have no models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to the Person table, so that the Person table mirrors the information in the Appointment table. It is set up this way because there are multiple independent DBs like Appointment that need to be copied to the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall()  # rows from the Appointment table
for myrecord in res:
    try:
        # Reuse the existing Person for this vendor id and office, if there is one
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:
        # No match yet: start a new, unsaved Person
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approaches:
1. bulk_create: I cannot use this because I sometimes have to update.
2. Delete all and then bulk_create: other things depend on the Person model, so I cannot delete records from it.
3. INSERT ... ON DUPLICATE KEY UPDATE: I cannot do this because the Person table's PK (primary key) is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check on two duplicate keys, I think this approach would work.

A few ideas:
EDIT: See Trewq's comment to this and create Indexes on your tables first of all…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (it should not be with 20K records), fetch all data from the Appointment table first and then fetch all existing Person objects using a single SELECT query instead of one per record.
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly.
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option.
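The "one SELECT, then split into inserts and updates" idea can be sketched in plain Python as follows. This is only an illustration under assumptions: the helper names `split_inserts_and_updates` and `batches` are hypothetical, and the lookup dict is assumed to be built from a single query such as `{p.vendorid: p for p in Person.objects.filter(office=myoffice)}`.

```python
from itertools import islice


def split_inserts_and_updates(rows, existing_by_key, key_index=12):
    """Partition source rows into new rows and rows matching an existing record.

    existing_by_key maps vendorid -> existing object and is assumed to have
    been built from ONE query against the Person table.
    """
    to_insert, to_update = [], []
    for row in rows:
        existing = existing_by_key.get(row[key_index])
        if existing is None:
            to_insert.append(row)               # candidate for bulk_create
        else:
            to_update.append((existing, row))   # needs an UPDATE via save()
    return to_insert, to_update


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

In the Django app, each batch would then be processed inside `with transaction.atomic():`, calling save() on the updated objects and Person.objects.bulk_create(...) on the new ones. The batch size is something to measure rather than guess.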

Related

Does Django provide any built-in way to update PostgreSQL autoincrement counters?

I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)

About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increments (called sequences) don't update the 'last_id' value if you manually specify an ID.
    # This sets the last incremented number to the last id
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-incremental number for sequence " + sequence_name + " synchronized.")
Which I did using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
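For comparison, Django's `sqlsequencereset` management command prints reset statements for an app's tables. The sketch below builds a statement of that general shape by hand. The helper name `build_sequence_reset_sql` is hypothetical, and it assumes the table and column names are trusted identifiers (for example taken from model._meta), since they are interpolated into the SQL.

```python
def build_sequence_reset_sql(table, pk_column="id"):
    """Return SQL that moves a PostgreSQL serial sequence to the current max id.

    pg_get_serial_sequence looks up the sequence name, so it does not have to
    be guessed as '<table>_<column>_seq'. The third setval argument is false
    when the table is empty, so the next generated value is then 1.
    """
    return (
        "SELECT setval(pg_get_serial_sequence('\"{table}\"', '{pk}'), "
        "coalesce(max(\"{pk}\"), 1), max(\"{pk}\") IS NOT null) "
        "FROM \"{table}\";"
    ).format(table=table, pk=pk_column)


sql = build_sequence_reset_sql("app_person")
print(sql)
```

The resulting string could be passed to cursor.execute() in place of the hand-built query above. It still only covers single-column primary keys.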

One option is to use natural keys during serialization and deserialization. That way, when you insert the records into PostgreSQL, it will auto-increment the primary key field and keep everything in line.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.

How do large scale databases handle locks?

I have a question about databases and updating rows.
I am currently running a flask application, and the endpoint runs a command like this (Please accept this pseudocode / shorthand syntax)
select * from Accounts where used = "False"
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
return username
However, what if 100 people call this endpoint at the same time?
How can I avoid collisions? (Meaning two people don't get the same username from the table because the update statement hasn't run yet when the second person queries.)
The obvious solution is a lock, something like the following. This way, if both people hit the endpoint at the exact same time, the second person will have to wait for the lock to be released.
Global lock
----
with lock:
    select * from Accounts where used = "False"
    username = (first/random row of returned set)
    update Accounts set used = "True" where name = username
    return username
I believe this would work, but it wouldn't be a great solution. Does anyone have any better ideas for this? I'm sure companies have this issue all the time with data consistency; how do they solve it?
Thanks!

MySQL / InnoDB offers four transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE.
Assuming you perform all commands in a single transaction, the SERIALIZABLE isolation level makes InnoDB treat a plain SELECT as a locking read (it takes shared locks on the rows it reads), so two transactions cannot both read the same row and then both update it: one of them has to wait, or is rolled back on a deadlock and must retry.
With the READ UNCOMMITTED, READ COMMITTED, and REPEATABLE READ (InnoDB's default) isolation levels, a plain SELECT takes no row locks, so it would be possible for two or more users to read the same row while used = False and both try to set it to used = True.
I think it would be better if you refactored your database layout into two tables: one with all possible names, and the other with used names, with a unique constraint on the name column. For every new user, you would insert a new row into the used names table. If you tried to insert multiple users with the same name, you would get a unique constraint violated error, and would be able to try again with a different name.
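A minimal sketch of that unique-constraint idea, using Python's built-in sqlite3 module as a stand-in for MySQL so it can run anywhere. The table name `used_names` and the helper `claim_first_free_name` are assumptions made for illustration.

```python
import sqlite3


def claim_first_free_name(conn, candidates):
    """Try each candidate; return the first name this caller manages to insert."""
    for name in candidates:
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute("INSERT INTO used_names (name) VALUES (?)", (name,))
            return name
        except sqlite3.IntegrityError:
            continue  # the name is already taken; try the next one
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE used_names (name TEXT NOT NULL UNIQUE)")

first = claim_first_free_name(conn, ["alice", "bob"])
second = claim_first_free_name(conn, ["alice", "bob"])
third = claim_first_free_name(conn, ["alice", "bob"])
```

The database enforces uniqueness, so no application-level lock is needed. The cost is a retry whenever two callers race for the same name.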

Global locks on a database are a VERY bad thing. They will slow everything down immensely. Instead there are table locks (to be avoided), row locks (these are fine), and transactions.
Use a transaction. This serves to isolate your changes from others and theirs from yours. It also allows you to throw away all the changes, rollback, if there's a problem so you don't leave a change halfway done. Unless you have a very good reason otherwise, you should ALWAYS be in a transaction.
MySQL supports SELECT ... FOR UPDATE, which tells the database that you're going to update the selected rows in this transaction, so those rows get locked.
To use your pseudo-code example...
begin transaction
select * from Accounts where used = "False" for update
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
commit
return username
Basically, transactions make a set of SQL statements "atomic" meaning they happen in a single operation from the point-of-view of concurrent use.
Other points... you should update with the primary key of Accounts to avoid the possibility of using a non-unique field. Maybe the primary key is username, maybe it isn't.
Second, a select without an order by can return in any order it wants. If you're working through a queue of accounts, you should probably specify some sort of order to ensure the oldest ones get done first (or whatever you decide your business logic will be). Even order by rand() will do a better job than relying on the default table ordering.
Finally, if you're only going to fetch one row, add a limit 1 so the database doesn't do a bunch of extra work.
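Putting those points together, here is a rough Python sketch of the pseudo-code. It assumes a DB-API connection to MySQL with autocommit off (for example from PyMySQL or mysqlclient, which both use %s placeholders) and a hypothetical integer primary key column `id` on Accounts. Neither assumption comes from the question.

```python
def claim_account(conn):
    """Claim one unused account in a single transaction and return its name.

    Returns None when no unused account is left.
    """
    cur = conn.cursor()
    try:
        # Lock exactly one candidate row; concurrent callers wait here.
        cur.execute(
            "SELECT id, name FROM Accounts WHERE used = 'False' "
            "ORDER BY id LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        account_id, name = row
        # Update by primary key, as suggested above.
        cur.execute("UPDATE Accounts SET used = 'True' WHERE id = %s", (account_id,))
        conn.commit()  # releases the row lock
        return name
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Callers that were blocked on the locked row continue once the commit releases it, and should then pick up a different account.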

First of all, I would add a new field to the table; let's call it session id. Each client that connects to that endpoint should have a unique session id, something that sets it apart from the other clients.
Instead of doing a select and then an update, I would first update a single record, setting its session id field to the client's session id, and then retrieve the record based on the session id:
update Accounts
set used = "True", sessionid=...
where used = "False" and sessionid is null limit 1;
select name from Accounts where sessionid=...;
This way you avoid the need for locking.
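A runnable sketch of this update-then-select pattern, using sqlite3 as a stand-in for MySQL. SQLite builds commonly lack UPDATE ... LIMIT, so the sketch selects the target row with a subquery instead. The `id` column, the UUID session ids, and the function name `claim_with_session` are assumptions made for illustration.

```python
import sqlite3
import uuid


def claim_with_session(conn, session_id):
    """Mark one free account with session_id, then read back which one it was."""
    with conn:  # one transaction around the UPDATE
        conn.execute(
            "UPDATE accounts SET used = 'True', sessionid = ? "
            "WHERE id = (SELECT id FROM accounts "
            "WHERE used = 'False' AND sessionid IS NULL ORDER BY id LIMIT 1)",
            (session_id,),
        )
    row = conn.execute(
        "SELECT name FROM accounts WHERE sessionid = ?", (session_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, "
    "used TEXT DEFAULT 'False', sessionid TEXT)"
)
conn.executemany("INSERT INTO accounts (name) VALUES (?)", [("alice",), ("bob",)])
conn.commit()

got_1 = claim_with_session(conn, uuid.uuid4().hex)
got_2 = claim_with_session(conn, uuid.uuid4().hex)
got_3 = claim_with_session(conn, uuid.uuid4().hex)
```

Because the claim is a single UPDATE statement, the database serializes competing writers on that row, and each caller only ever reads back the row tagged with its own session id.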

How to tell if CQLEngine made an insert or update through the Model class Save

I am using Python 3.4 and CQLEngine. In my code, I save an object through an overridden save method as follows:
class Foo(Model, ...):
    id = columns.Integer(primary_key=True)
    bar = columns.Text()
    ...
    def save(self):
        super(Foo, self).save()
and I would like to know from the return value of save() whether it performed an insert or an update.

INSERT and UPDATE are synonyms in Cassandra, with very few exceptions. Here is a description of INSERT that briefly touches on a difference:
An INSERT writes one or more columns to a record in a Cassandra table
atomically and in isolation. No results are returned. You do not have
to define all columns, except those that make up the key. Missing
columns occupy no space on disk.
If the column exists, it is updated. You can qualify table names by
keyspace. INSERT does not support counters, but UPDATE does.
Internally, the insert and update operation are identical.
You don't know whether it will be an insert or an update; you can look at it as a data save request, and the coordinator then determines what it is.
This answers your original question: you can't tell from the return value of the save function whether it was an insert or an update.
In answer to your comment below, which explained why you wanted that output: you can't reliably get this information out of Cassandra, but you can use lightweight transactions to a certain extent and run two statements sequentially with the same rows of data:
INSERT ... IF NOT EXISTS followed by UPDATE ... IF EXISTS
In the target table you will need a column where each of these statements writes a value unique to each call. Then you can select data based on the primary keys of your dataset and see how many rows have each value. This will roughly tell you how many updates and how many inserts there were. However, if there were any concurrent processes, they may have overwritten your data with their own tokens, so this method will not be very accurate and will work (like any other method with databases like Cassandra) only where there are no concurrent processes.

What are some ways to maintain data consistency at the application layer of NoSQL?

My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is done at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
Table 1: (user_id) email, employee_id, first name, last name, etc ...
Table 2: (email) user_id
Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data -- what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 + 3? Also ensuring inserts?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?

The short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB Items and may be persisted. This way, if a logical group of actions, a.k.a. a transaction, has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :) , the dynamodb-mapper transaction engine supports:
single/multiple targets
sub transactions
transaction creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main developers of the dynamodb-mapper project. Feel free to contribute :)

Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
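The "live with stale index entries and clean them up on read" idea from option A can be sketched with plain dicts standing in for the tables. The function name `get_user_by_email` and the record layout are assumptions made for illustration, not DynamoDB API calls.

```python
def get_user_by_email(email, users, email_index):
    """Resolve a user through the email index, repairing a stale entry if found.

    `users` maps user_id -> user record (the primary table);
    `email_index` maps email -> user_id (the hand-rolled index).
    """
    user_id = email_index.get(email)
    if user_id is None:
        return None
    user = users.get(user_id)
    if user is None or user.get("email") != email:
        # The index points at a deleted or changed user: drop the stale entry.
        email_index.pop(email, None)
        return None
    return user


users = {"u1": {"email": "a@example.com", "employee_id": 17}}
email_index = {"a@example.com": "u1", "jdoe@example.com": "u2"}  # u2 was deleted

found = get_user_by_email("a@example.com", users, email_index)
missing = get_user_by_email("jdoe@example.com", users, email_index)
```

After the second lookup the stale `jdoe@example.com` entry is gone from the index, which is the lazy clean-up described above. A periodic housekeeping script would do the same check over the whole index.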
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.

Selecting/Querying objects from django bulk_create?

If I use bulk_create to insert objects:
objList = [a, b, c,] #none are saved
model.objects.bulk_create(objList)
The ids of the objects would not be updated (see https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create).
So I can't use these guys as foreign key objects. I thought of querying them back from the database after they're bulk created and then using them as foreign key objects, but I don't have their ids to query them. How do I query these objects from the database (given that there can be duplicate values in columns other than the id)? Or is there a better way to make bulk created items as foreign keys?

If you have only three objects, as in your example, you might want to call save on each individually, wrapping the calls within a transaction, if it needs to be atomic.
If there are many more, which is likely the reason for using bulk_create, you could potentially loop through them instead and call save on each. Again, you could wrap that in a transaction if required. Though, one might not like this option as running tonnes of insert queries could potentially be a problem for some database setups.
Alternatively, a hack would be to add some known unique identifier to the object so you could re-query these after save.
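A sketch of that last idea, tagging every row of one bulk insert with a shared batch token and querying the rows back by it. It uses sqlite3 rather than Django so that it runs standalone; the `batch_id` column and the helper name `bulk_insert_and_fetch_ids` are hypothetical. In Django, the equivalent would be an extra field on the model filled in before bulk_create and then filtered on afterwards.

```python
import sqlite3
import uuid


def bulk_insert_and_fetch_ids(conn, names):
    """Insert all names in one batch, then read back their generated ids."""
    batch_id = uuid.uuid4().hex  # unique token shared by this batch only
    with conn:
        conn.executemany(
            "INSERT INTO item (name, batch_id) VALUES (?, ?)",
            [(name, batch_id) for name in names],
        )
    return conn.execute(
        "SELECT id, name FROM item WHERE batch_id = ? ORDER BY id", (batch_id,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT, batch_id TEXT)")

first = bulk_insert_and_fetch_ids(conn, ["a", "b", "a"])
second = bulk_insert_and_fetch_ids(conn, ["a"])
```

Duplicate values in the other columns do not matter here, because the batch token, not the data, identifies the rows.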
