I have a question about databases and updating rows.
I am currently running a flask application, and the endpoint runs a command like this (Please accept this pseudocode / shorthand syntax)
select * from Accounts where used = "False"
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
return username
However, what if 100 people call this endpoint at the same time?
How can I avoid collisions? (Meaning two people don't get the same username from the table because the second person's query runs before the first person's update statement.)
The obvious solution is a lock, something like the following -- this way, if both people hit the endpoint at the exact same time, the second person has to wait for the lock to be released.
Global lock
----
with lock:
    select * from Accounts where used = "False"
    username = (first/random row of returned set)
    update Accounts set used = "True" where name = username
    return username
I believe this would work, but it wouldn't be a great solution. Does anyone have any better ideas for this? I'm sure companies run into this data-consistency issue all the time; how do they solve it?
Thanks!
MySQL / InnoDB offers four transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE.
Assuming you run all the commands in a single transaction: under SERIALIZABLE, plain SELECTs take shared locks, so two transactions can't both read and then update the same row; one of them waits (or fails with a deadlock and has to retry). In the case of 100 users, effectively one user proceeds while the remaining 99 wait in a queue.
Under REPEATABLE READ (the InnoDB default), READ COMMITTED, and READ UNCOMMITTED, a plain SELECT is a non-locking read, so two or more users could read the same row while used = "False" and both try to set it to "True", unless you lock the row explicitly with SELECT ... FOR UPDATE.
I think it would be better if you refactored your database layout into two tables: one with all possible names, and one with used names, with a unique constraint on the name column. For every new user you would insert a row into the used-names table. If two users tried to claim the same name, the second insert would fail with a unique-constraint violation, and that user could retry with a different name.
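Here is a rough sketch of that idea in Python, assuming the PyMySQL driver and a used_names table with a UNIQUE constraint on name (the driver choice, connection settings, and table/column names are illustrative, not from the original answer):

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="...", database="app")

def claim_name(candidate):
    """Try to claim candidate; True if this client won, False if it was taken."""
    try:
        with conn.cursor() as cur:
            cur.execute("INSERT INTO used_names (name) VALUES (%s)", (candidate,))
        conn.commit()
        return True
    except pymysql.err.IntegrityError:
        # unique constraint violated: somebody else already claimed this name
        conn.rollback()
        return False

The database itself arbitrates the race, so no application-level lock is needed.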
Global locks on a database are a VERY bad thing. They will slow everything down immensely. Instead there are table locks (to be avoided), row locks (these are fine), and transactions.
Use a transaction. This serves to isolate your changes from others and theirs from yours. It also allows you to throw away all the changes, rollback, if there's a problem so you don't leave a change halfway done. Unless you have a very good reason otherwise, you should ALWAYS be in a transaction.
MySQL supports SELECT FOR UPDATE which tells the database that you're going to update the selected rows in this transaction so those rows get locked.
To use your pseudo-code example...
begin transaction
select * from Accounts where used = "False" for update
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
commit
return username
Basically, transactions make a set of SQL statements "atomic" meaning they happen in a single operation from the point-of-view of concurrent use.
Other points... you should update with the primary key of Accounts to avoid the possibility of using a non-unique field. Maybe the primary key is username, maybe it isn't.
Second, a select without an order by can return in any order it wants. If you're working through a queue of accounts, you should probably specify some sort of order to ensure the oldest ones get done first (or whatever you decide your business logic will be). Even order by rand() will do a better job than relying on the default table ordering.
Finally, if you're only going to fetch one row, add a limit 1 so the database doesn't do a bunch of extra work.
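Putting those points together, here is a minimal sketch of the whole flow in Python, assuming a DB-API connection such as PyMySQL with autocommit disabled so the statements run inside one transaction (the driver and the ORDER BY column are my assumptions; table and column names follow the pseudocode):

def claim_account(conn):
    # conn: a DB-API connection (e.g. PyMySQL) with autocommit disabled
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT name FROM Accounts "
                "WHERE used = 'False' "
                "ORDER BY name LIMIT 1 FOR UPDATE"  # lock the chosen row
            )
            row = cur.fetchone()
            if row is None:
                conn.rollback()
                return None  # no free accounts left
            username = row[0]
            cur.execute(
                "UPDATE Accounts SET used = 'True' WHERE name = %s", (username,)
            )
        conn.commit()  # releases the row lock
        return username
    except Exception:
        conn.rollback()
        raise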
First of all, I would add a new field to the table; let's call it session id. Each client that connects to that endpoint should have a unique session id that sets it apart from the other clients.
Instead of doing a select and then an update, I would first update a single record, setting its session id field to the client's session id, and then retrieve the record based on the session id:
update Accounts
set used = "True", sessionid = ...
where used = "False" and sessionid is null limit 1;
select name from Accounts where sessionid = ...;
This way you avoid the need for explicit locking: the single UPDATE claims a row atomically, so two clients can never grab the same account.
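A sketch of this approach, again assuming a DB-API connection such as PyMySQL (where cursor.execute() returns the affected row count) and a uuid4 per request as the session id; these are my illustrative assumptions:

import uuid

def claim_account(conn):
    session_id = str(uuid.uuid4())  # unique per client/request
    with conn.cursor() as cur:
        claimed = cur.execute(
            "UPDATE Accounts SET used = 'True', sessionid = %s "
            "WHERE used = 'False' AND sessionid IS NULL LIMIT 1",
            (session_id,),
        )
        if not claimed:  # no free rows were left to claim
            conn.rollback()
            return None
        cur.execute("SELECT name FROM Accounts WHERE sessionid = %s", (session_id,))
        username = cur.fetchone()[0]
    conn.commit()
    return username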
I'm migrating a Django site from MySQL to PostgreSQL. The quantity of data isn't huge, so I've taken a very simple approach: I've just used the built-in Django serialize and deserialize routines to create JSON records, and then load them in the new instance, loop over the objects, and save each one to the new database.
This works very nicely, with one hiccup: after loading all the records, I run into an IntegrityError when I try to add new data after loading the old records. The Postgres equivalent of a MySQL autoincrement ID field is a serial field, but the internal counter for serial fields isn't incremented when id values are specified explicitly. As a result, Postgres tries to start numbering records at 1 -- already used -- causing a constraint violation. (This is a known issue in Django, marked wontfix.)
There are quite a few questions and answers related to this, but none of the answers seem to address the issue directly in the context of Django. This answer gives an example of the query you'd need to run to update the counter, but I try to avoid making explicit queries when possible. I could simply delete the ID field before saving and let Postgres do the numbering itself, but there are ForeignKey references that will be broken in that case. And everything else works beautifully!
It would be nice if Django provided a routine for doing this that intelligently handles any edge cases. (This wouldn't fix the bug, but it would allow developers to work around it in a consistent and correct way.) Do we really have to just use a raw query to fix this? It seems so barbaric.
If there's really no such routine, I will simply do something like the below, which directly runs the query suggested in the answer linked above. But in that case, I'd be interested to hear about any potential issues with this approach, or any other information about what I might be doing wrong. For example, should I just modify the records to use UUIDs instead, as this suggests?
Here's the raw approach (edited to reflect a simplified version of what I actually wound up doing). It's pretty close to Pere Picornell's answer, but his looks more robust to me.
from django.db import connection

table = model._meta.db_table
cur = connection.cursor()
# reset the serial sequence to the current max(id) so new inserts don't collide
cur.execute(
    "SELECT setval('{}_id_seq', (SELECT max(id) FROM {}))".format(table, table)
)
About the debate: my case is a one-time migration, and my decision was to run this function right after I finish each table's migration, although you could call it anytime you suspect integrity could be broken.
from django.db import connections

def synchronize_last_sequence(model):
    # PostgreSQL auto-increments (sequences) don't update their last value when
    # you insert rows with explicit IDs, so set it to the current max(pk).
    sequence_name = model._meta.db_table + "_" + model._meta.pk.name + "_seq"
    with connections['default'].cursor() as cursor:
        cursor.execute(
            "SELECT setval('" + sequence_name + "', (SELECT max(" + model._meta.pk.name + ") FROM " +
            model._meta.db_table + "))"
        )
    print("Last auto-increment number for sequence " + sequence_name + " synchronized.")
Which I did using the SQL query you proposed in your question.
It's been very useful to find your post. Thank you!
It should work with custom PKs but not with multi-field PKs.
One option is to use natural keys during serialization and deserialization. That way when you insert it into PostgreSQL, it will auto-increment the primary key field and keep everything inline.
The downside to this approach is that you need to have a set of unique fields for each model that don't include the id.
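As a sketch of what that looks like on reasonably recent Django versions (the model and field names here are illustrative, not from the question):

from django.db import models

class PersonManager(models.Manager):
    def get_by_natural_key(self, email):
        return self.get(email=email)

class Person(models.Model):
    email = models.EmailField(unique=True)  # the natural key: unique, not the id
    name = models.CharField(max_length=200)

    objects = PersonManager()

    def natural_key(self):
        return (self.email,)

# Dump without hard-coded primary keys, e.g.:
#   python manage.py dumpdata app --natural-foreign --natural-primary > data.json
# On loading, Postgres assigns fresh ids, so the serial sequence stays in step.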
I have a Django app with a model (Person) defined, and I also have a database (containing a table Appointment) that doesn't have any models defined (it's not meant to be connected to the Django app).
I need to move some data from the Appointment table over to Person so that the Person table mirrors the Appointment table. It is set up this way because there are multiple independent databases like Appointment that need to be copied into the Person table (so I don't want to make any architectural changes to this setup).
Here is what I do now:
res = sourcedb.fetchall()  # from Appointment table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except IndexError:
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: Cannot use this because I have to update sometimes.
2. delete all and then bulk_create: there are dependencies on the Person model from other things, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's PK is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check for duplicates on two keys, this approach would work, I think.
A few ideas:
EDIT: See Trewq's comment on this answer, and first of all create indexes on your tables…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (should not be one with 20k records), fetch all data first from the appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record
Use bulk_create even if some of them are updates. This will still issue UPDATE queries for your updates, but it will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly.
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this; a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option. (A sketch combining the first three ideas follows this list.)
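For illustration, a rough sketch of the batching approach under the assumptions above (field names follow the question; Person is the model from your app and the sync_appointments name is mine):

from django.db import transaction

def sync_appointments(res, myoffice):
    # res: rows fetched from the Appointment table, as in the question
    existing = {
        p.vendorid: p
        for p in Person.objects.filter(office=myoffice)  # one query instead of 20K
    }
    to_create = []
    with transaction.atomic():  # one transaction instead of one per save()
        for myrecord in res:
            person = existing.get(myrecord[12])
            if person is None:
                person = Person(vendorid=myrecord[12], office=myoffice)
                existing[myrecord[12]] = person
                to_create.append(person)  # queue the insert for one bulk_create
            person.firstname = myrecord[0]
            person.midname = myrecord[1]
            person.lastname = myrecord[2]
            person.address1 = myrecord[3]
            person.address2 = myrecord[4]
            if person.pk is not None:
                person.save()  # updates still go one by one
        Person.objects.bulk_create(to_create)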
My Python web application uses DynamoDB as its datastore, but this is probably applicable to other NoSQL tables where index consistency is handled at the application layer. I'm de-normalizing data and creating indices in several tables to facilitate lookups.
For example, for my users table:
* Table 1: (user_id) email, employee_id, first name, last name, etc ...
* Table 2: (email) user_id
* Table 3: (employee_id) user_id
Table 1 is my "primary table" where user info is stored. If the user_id is known, all info about a user can be retrieved in a single GET query.
Table 2 and 3 enable lookups by email or employee_id, requiring a query to those tables first to get the user_id, then a second query to Table 1 to retrieve the rest of the information.
My concern is with the de-normalized data -- what is the best way to handle deletions from Table 1 to ensure the matching data gets deleted from Tables 2 and 3? And likewise, how do I ensure inserts land in all three tables?
Right now my chain of events is something like:
1. Insert row in table 1
2. Insert row in table 2
3. Insert row in table 3
Does it make sense to add "checks" at the end? Something like:
4. Check that all 3 rows have been inserted.
5. If a row is missing, remove rows from all tables and raise an error.
Any other techniques?
Short answer is: there is no way to ensure consistency. This is the price you agreed to pay when moving to NoSQL, in exchange for performance and scalability.
DynamoDB-mapper has a "transaction engine". Transaction objects are plain DynamoDB items and may be persisted. This way, if a logical group of actions (a transaction) has succeeded, we can be sure of it by looking at the persisted status. But we have no means to be sure it has not...
To do a bit of advertisement :), the dynamodb-mapper transaction engine supports:
* single/multiple targets
* sub-transactions
* transactions creating objects (not released yet)
If you are rolling your own mapper (which is an enjoyable task), feel free to have a look at our source code: https://bitbucket.org/Ludia/dynamodb-mapper/src/52c75c5df921/dynamodb_mapper/transactions.py
Disclaimer: I am one of the main contributors to the dynamodb-mapper project. Feel free to contribute :)
Disclaimer: I haven't actually used DynamoDB, just looked through the data model and API, so take this for what it's worth.
The use case you're giving is one primary table for the data, with other tables for hand-rolled indices. This really sounds like work for an RDBMS (maybe with some sharding for growth). But if that won't cut it, here are a couple of ideas which may or may not work for you.
A. Leave it as it is. If you'll never serve data from your index tables, then maybe you can afford to have lazy deletion and insertion as long as you handle the primary table first. Say this happens:
1) Delete JDoe from Main table
xxxxxxxxxx Process running code crashes xxxxxxx
2) Delete from email index // Never gets here
3) Delete from employee_id index // Never gets here
Well, if an "email" query comes in, you'll resolve the corresponding user_id from the index (now stale), but it won't show up on the main table. You know that something is wrong, so you can return a failure/error and clean up the indexes. In other words, you just live with some stale data and save yourself the trouble, cleaning it up as necessary. You'll have to figure out how much stale data to expect, and maybe write a script that does some housekeeping daily.
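As a rough sketch of that lazy-cleanup read path (my illustration, not part of the original answer), using boto3 and hypothetical table names:

import boto3

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("users")                    # main table, keyed by user_id
users_by_email = dynamodb.Table("users_by_email")  # hand-rolled index table

def get_user_by_email(email):
    index_row = users_by_email.get_item(Key={"email": email}).get("Item")
    if index_row is None:
        return None  # no such user
    user = users.get_item(Key={"user_id": index_row["user_id"]}).get("Item")
    if user is None:
        # stale index entry: the main row was deleted but cleanup never ran
        users_by_email.delete_item(Key={"email": email})
        return None
    return user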
B. If you really want to simulate locks and transactions, you could consider using something like Apache Zookeeper, which is a distributed system for managing shared resources like locks. It'd be more work and overhead, but you could probably set it up to do what you want.
From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity despite being a cross group transaction? Can you please help modify this example accordingly?
An XG transaction can be applied across max 25 entity groups. Ancestor query limits the query to a single entity group, and you would be able to do queries within those 25 entity groups in a single XG transaction.
A transactional query without parent would potentially include all entity groups in the application and lock everything up, so you get an error message instead.
In App Engine one usually tries to avoid monotonically increasing ids. The auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing ids and that it'll work for you performance-wise, how about:
* Have some kind of model representation of userId to enable key_name usage and direct lookup
* Query for userId outside the transaction, and get the highest candidate id
* In a transaction, do a get_or_insert-style lookup: UserId.get_by_key_name(candidateid+1). If it is already present and pointing to a different user, try again with +2 and so on until you find a free one and create it, updating the userid attribute of the user at the same time (a rough sketch of this probing loop follows at the end of this answer).
If the XG transaction updating UserId+User is too slow, perhaps create the UserId plus a task in a (non-XG) transaction, and let the executing task associate the UserId and User afterwards. Or use a single backend that serializes UserId creation, perhaps allowing put_async with retries to avoid holes in the sequence, doing something like 50 creations per second.
If it's possible to use userName as key_name you can do direct lookup instead of query and make things faster and cheaper.
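Here is a rough sketch of the probing idea with the old db API, assuming a root UserId model whose key_name is the stringified numeric id (the UserId model shape and the assign_user_id helper are hypothetical):

from google.appengine.ext import db

class UserId(db.Model):
    # key_name is str(numeric_id); points back at the User it was handed to
    user = db.ReferenceProperty()

def assign_user_id(user_key, candidate):
    """Try candidate, candidate+1, ... until a free UserId key_name is claimed."""
    while True:
        key_name = str(candidate)

        def txn():
            existing = UserId.get_by_key_name(key_name)
            if existing is not None:
                return None  # taken; the caller retries with the next number
            UserId(key_name=key_name, user=user_key).put()
            return candidate

        if db.run_in_transaction(txn) is not None:
            return candidate
        candidate += 1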
Cross group transactions allow you to perform a transaction across multiple groups, but they don't remove the prohibition on queries inside transactions. You need to perform the query outside the transaction, and pass the ID of the entity in (and then check any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
Every datastore entity has a key; a key (among other things) has either a numeric id that App Engine assigns to it or a key_name which you can give it.
In your case it looks like you can use the numeric id. After you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This id is not sequential but is guaranteed to be unique.
Let me explain my particular situation:
A business can pick a neighbourhood it is in; this choice is persisted to the DB, and a signal is fired when the object is saved. A method listens for this signal and should, only once, update all other users who follow this neighbourhood. The method checks whether each user is already following this business; for every user that is not following the business but is following the neighbourhood, a follow relation is created in the DB. Everything should be fine: if the user is already following the business, no relation is set...
But what happens is that sometimes two or more of these transactions run at the same time, each of them checking whether the user is following the business. Since none of them can see a follow relation between the user and the business yet, multiple follow relations end up being created.
I tried making sure the signal isn't sent multiple times, but I'm not sure why these multiple transactions are happening at the same time.
While I have found some answers to doing row locking when trying to avoid concurrency problems on updates, I am at a loss about how to make sure that only one insert happens.
Is table locking the only way to ensure that only one insert of a kind happens?
import hashlib

from django.db.models.signals import post_save

# when a business updates its neighborhood, this sets the follow relationships
def follow_biz(sender, instance, created, **kwargs):
    if instance.neighborhood:
        neighborhood_followers = FollowNeighborhood.objects.filter(neighborhood=instance.neighborhood)
        # print 'INSTANCE Neighborhood %s ' % instance.neighborhood
        for follower in neighborhood_followers:
            if not Follow.objects.filter(user=follower.user, business=instance).count():
                follow = Follow(business=instance, user=follower.user)
                follow.save()

# a unique static string to prevent the signal from being connected twice
follow_biz_signal_uid = hashlib.sha224("follow_biz").hexdigest()

# signal
post_save.connect(follow_biz, sender=Business, dispatch_uid=follow_biz_signal_uid)
By ensuring uniqueness[1] of the rows at the database level, using a constraint on the relevant columns, Django will, AFAICT, do the right thing and insert or update as necessary.
In this case the constraint should be on the user and business id.
[1] Of course ensuring uniqueness where applicable is always a good idea.
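A minimal sketch of that constraint, assuming a Follow model shaped like the one implied by the signal handler above (the model definition and the follow_once helper are illustrative, not from the question):

from django.db import IntegrityError, models

class Follow(models.Model):
    user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    business = models.ForeignKey('Business', on_delete=models.CASCADE)

    class Meta:
        unique_together = ('user', 'business')  # enforced by the database

# In the signal handler, let the database arbitrate concurrent inserts:
def follow_once(user, business):
    try:
        Follow.objects.get_or_create(user=user, business=business)
    except IntegrityError:
        # a concurrent request created the same row first; nothing left to do
        pass

Even if several signal handlers race, the unique constraint guarantees at most one Follow row per (user, business) pair; the losers simply see an IntegrityError and move on.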