google app engine cross group transactions needing parent ancestor - python

From my understanding, @db.transactional(xg=True) allows for transactions across entity groups; however, the following code returns "queries inside transactions must have ancestors".
@db.transactional(xg=True)
def insertUserID(self, userName):
    user = User.gql("WHERE userName = :1", userName).get()
    highestUser = User.all().order('-userID').get()
    nextUserID = highestUser.userID + 1
    user.userID = nextUserID
    user.put()
Do you need to pass in the key for each entity even though it is a cross-group transaction? Can you please help modify this example accordingly?

An XG transaction can span at most 25 entity groups. An ancestor query limits the query to a single entity group, and you would be able to run queries within those 25 entity groups in a single XG transaction.
A transactional query without an ancestor would potentially include every entity group in the application and lock everything up, so you get an error message instead.
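For illustration, here is a minimal sketch of an ancestor query inside a transaction, assuming all User entities were created under a common parent key (the parent_key shown here is hypothetical and not part of the original code):

from google.appengine.ext import db

# Hypothetical common parent; User entities would need to be created
# with parent=parent_key for this ancestor query to find them.
parent_key = db.Key.from_path('UserRoot', 'root')

@db.transactional
def get_user(userName):
    # The ANCESTOR IS clause restricts the query to one entity group,
    # which is what makes it legal inside a transaction.
    return User.gql("WHERE ANCESTOR IS :1 AND userName = :2",
                    parent_key, userName).get()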
In App Engine one usually tries to avoid monotonically increasing ids. The auto-assigned ones might go like 101, 10001, 10002 and so on. If you know that you need monotonically increasing ids and it will work for you performance-wise, how about:
1. Have some kind of model representation of userId to enable key_name usage and direct lookup.
2. Query for userId outside the transaction to get the highest candidate id.
3. In the transaction, do a get_or_insert-style lookup: UserId.get_by_key_name(candidateid+1). If it is already present and points to a different user, try again with +2 and so on until you find a free one and create it, updating the userid attribute of the user at the same time (a rough sketch follows below).
If the XG transaction updating UserId+User is too slow, perhaps create UserId+task in a (non-XG) transaction, and let the executing task associate UserId and User afterwards. Or use a single backend that serializes UserId creation, perhaps with put_async, retrying to avoid holes in the sequence; that could do something like 50 creations per second.
If it's possible to use userName as key_name you can do a direct lookup instead of a query and make things faster and cheaper.
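A rough sketch of that idea, reusing the User model from the question; the UserId model, allocate_user_id and _claim names are illustrative, not a drop-in implementation:

from google.appengine.ext import db

class UserId(db.Model):
    # key_name is the string form of the numeric id
    user = db.ReferenceProperty(User)

def allocate_user_id(user):
    # Outside any transaction: find the highest id currently in use.
    highest = User.all().order('-userID').get()
    candidate = (highest.userID if highest and highest.userID else 0) + 1
    while True:
        if _claim(candidate, user):
            return candidate
        candidate += 1   # id was already taken, try the next one

@db.transactional(xg=True)
def _claim(candidate, user):
    existing = UserId.get_by_key_name(str(candidate))
    if existing is not None:
        return False     # already claimed (possibly by another request)
    UserId(key_name=str(candidate), user=user).put()
    user.userID = candidate
    user.put()           # second entity group, hence the XG transaction
    return True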

Cross group transactions allow you to perform a transaction across multiple groups, but they don't remove the prohibition on queries inside transactions. You need to perform the query outside the transaction, and pass the ID of the entity in (and then check any invariants specified in the query still hold) - or, as Shay suggests, use IDs so you don't have to do a query in the first place.
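A minimal sketch of that pattern applied to the original code (db API; note this only re-checks whether the user already has an id, it does not by itself guarantee next_id is unique under concurrency):

def insert_user_id(userName):
    # Queries run outside the transaction...
    user = User.gql("WHERE userName = :1", userName).get()
    highest = User.all().order('-userID').get()
    next_id = (highest.userID if highest and highest.userID else 0) + 1
    # ...and only gets/puts by key happen inside it.
    assign_user_id(user.key(), next_id)

@db.transactional
def assign_user_id(user_key, next_id):
    user = db.get(user_key)
    if user.userID is not None:
        return           # invariant changed since the query ran
    user.userID = next_id
    user.put()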

Every datastore entity has a key; a key (among other things) has either a numeric id that App Engine assigns to it or a key_name which you can give it.
In your case it looks like you can use the numeric id: after you call put() on the user entity you will have user.key().id() (or user.key.id() if you're using NDB), which will be unique for each user (as long as all the users have the same parent, which is None in your code).
This id is not sequential but is guaranteed to be unique.
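For example, with the db API as in the question (the userName value here is just a placeholder):

user = User(userName='alice')
user.put()                      # the datastore assigns the numeric id here
user_id = user.key().id()       # db API; with ndb it would be user.key.id()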

Related

Django Is it correct to use uuid for a sales transaction

Is it correct to use a UUID as a sales order transaction id? For example, in an ecommerce website, when someone orders a product or products, is it OK to use the UUID as a unique identifier for the order transaction?
Yes. It is universally unique, so you can use it as a unique ID for a transaction, or for whatever you want to be able to identify, perhaps across systems, for storage, and so on.
You can also consider alternatives: sometimes just a sequence number works, if the orders all come from one place and in order, and it's useful to know the order. With UUIDs you don't have that, unless you store the sequence somewhere of course.
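For example, generating a random UUID as an order reference in Python could look like this (the field shown in the comment is illustrative, not from the question):

import uuid

order_ref = str(uuid.uuid4())   # random version-4 UUID, vanishingly unlikely to collide
# As a Django model field it might be:
# order_ref = models.UUIDField(default=uuid.uuid4, unique=True, editable=False)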

Google Cloud Datastore Indexes for count queries

Google Cloud Datastore mandates that composite indexes be built in order to query on multiple fields of one kind. Take the following queries for example:
class Greeting(ndb.Model):
    user = ndb.StringProperty()
    place = ndb.StringProperty()

# Query 1
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').fetch()

# Query 2
Greeting.query(Greeting.user == 'yash@gmail.com', Greeting.place == 'London').count()
I am using python with ndb to access cloud datastore. In the above example, Query 1 raises NeedIndexError if there is no composite index defined on user and place. But Query 2 works fine even if there is no index on user and place.
I would like to understand how cloud datastore fetches the count (Query 2) without the index when it mandates the index for fetching the list of entities (Query 1). I understand it stores Stats per kind per index which would result in quicker response for counts on existing indexes (Refer docs). But I'm unable to explain the above behaviour.
Note: There is no issue when querying on one property of a given kind, as Cloud Datastore has built-in indexes on single properties by default.
There is no clear and direct explanation of why this happens, but most likely it is because of how the improved query planner uses zigzag merge joins over the built-in single-property indexes.
You can read more about this here: https://cloud.google.com/appengine/articles/indexselection#Improved_Query_Planner
The likely reason count() works while fetch() does not is that with count() you don't need to keep a lot of results in memory.
So in the case of count() the work can easily be scaled by splitting it into multiple chunks processed in parallel and then just summing the corresponding counts into one; you can't do this cheaply with cursors/result sets.
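A purely conceptual illustration of that point in plain Python (this is not Datastore internals, just the shape of the argument): summing per-chunk counts needs almost no memory, while merging full per-chunk result sets means collecting everything in one place.

chunks = [['a', 'b'], ['c'], ['d', 'e', 'f']]    # pretend each chunk is one worker's slice

# count()-style: each worker reports a single integer, and they are simply summed.
total = sum(len(chunk) for chunk in chunks)      # 6

# fetch()-style: every entity has to be gathered and held together.
merged = [entity for chunk in chunks for entity in chunk]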

How do large scale databases handle locks?

I have a question about databases and updating rows.
I am currently running a flask application, and the endpoint runs a command like this (Please accept this pseudocode / shorthand syntax)
select * from Accounts where used = "False"
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
return username
However, what if 100 people run a call to this endpoint at the same time?
How can I avoid collisions? (Meaning two people don't get the same username from the table, because the update statement hasn't run yet before the second person queries.)
The obvious solution is a lock, something like the following -- this way, if both people hit the endpoint at the exact same time, the second person has to wait for the lock to release:
Global lock
----
with lock:
    select * from Accounts where used = "False"
    username = (first/random row of returned set)
    update Accounts set used = "True" where name = username
    return username
I believe this would work, but it wouldn't be a great solution. Does anyone have any better ideas for this? I'm sure companies run into this data-consistency issue all the time; how do they solve it?
Thanks!
MySQL / InnoDB offers four transaction isolation levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE.
Assuming you perform all commands in a single transaction, with REPEATABLE READ and SERIALIZABLE isolation levels, only one transaction accessing the same rows would be executed at a time, so in the case of 100 users, only 1 user would be executing the transaction while the remaining 99 would be waiting in queue.
With READ UNCOMMITTED and READ COMMITTED isolation levels, it would be possible for two or more users to read the same row when it was used = False and try to set it to used = True.
I think it would be better if you refactored your database layout into two tables: one with all possible names, and the other with used names, with a unique constraint on the name column. For every new user, you would insert a new row into the used names table. If you tried to insert multiple users with the same name, you would get a unique constraint violated error, and would be able to try again with a different name.
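A rough sketch of that claim-by-insert idea, assuming a used_names table with a UNIQUE constraint on name already exists; pymysql is just an example driver, and the function name is illustrative:

import pymysql

conn = pymysql.connect(host='localhost', user='app', password='...',
                       database='mydb', autocommit=True)

def claim_username(candidate):
    """Try to claim a name; True if we got it, False if someone else did."""
    try:
        with conn.cursor() as cur:
            # used_names.name is UNIQUE, so only one concurrent insert
            # of the same name can succeed.
            cur.execute("INSERT INTO used_names (name) VALUES (%s)", (candidate,))
        return True
    except pymysql.err.IntegrityError:
        return False     # unique constraint violated: retry with another name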
Global locks on a database are a VERY bad thing. They will slow everything down immensely. Instead there are table locks (to be avoided), row locks (these are fine), and transactions.
Use a transaction. This serves to isolate your changes from others and theirs from yours. It also allows you to throw away all the changes, rollback, if there's a problem so you don't leave a change halfway done. Unless you have a very good reason otherwise, you should ALWAYS be in a transaction.
MySQL supports SELECT FOR UPDATE which tells the database that you're going to update the selected rows in this transaction so those rows get locked.
To use your pseudo-code example...
begin transaction
select * from Accounts where used = "False" for update
username = (first/random row of returned set)
update Accounts set used = "True" where name = username
commit
return username
Basically, transactions make a set of SQL statements "atomic" meaning they happen in a single operation from the point-of-view of concurrent use.
Other points... you should update with the primary key of Accounts to avoid the possibility of using a non-unique field. Maybe the primary key is username, maybe it isn't.
Second, a select without an order by can return in any order it wants. If you're working through a queue of accounts, you should probably specify some sort of order to ensure the oldest ones get done first (or whatever you decide your business logic will be). Even order by rand() will do a better job than relying on the default table ordering.
Finally, if you're only going to fetch one row, add a limit 1 so the database doesn't do a bunch of extra work.
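Putting those pieces together from Python might look roughly like this (pymysql is just an example driver; the connection details and claim_account name are assumptions, while the table and column names follow the question's pseudocode):

import pymysql

conn = pymysql.connect(host='localhost', user='app', password='...',
                       database='mydb')   # autocommit off, so statements run in a transaction

def claim_account():
    with conn.cursor() as cur:
        # Lock one unused row; concurrent transactions block on it until we commit.
        cur.execute(
            "SELECT name FROM Accounts WHERE used = 'False' "
            "ORDER BY name LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None              # no free accounts left
        username = row[0]
        cur.execute("UPDATE Accounts SET used = 'True' WHERE name = %s", (username,))
    conn.commit()
    return username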
First of all, I would add a new field to the table; let's call it session id. Each client that connects to that endpoint should have a unique session id, something that sets it apart from the other clients.
Instead of doing a select and then an update, I would first update a single record, setting its session id field to the client's session id, and then retrieve the record based on the session id:
update Accounts
set used = "True", sessionid = ...
where used = "False" and sessionid is null
limit 1;

select name from Accounts where sessionid = ...;
This way you avoid the need for locking.
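A rough sketch of that approach from Python, using a UUID as the session id (pymysql again is just an example driver; all names are illustrative):

import uuid
import pymysql

conn = pymysql.connect(host='localhost', user='app', password='...',
                       database='mydb', autocommit=True)

def claim_account():
    session_id = str(uuid.uuid4())   # unique per request
    with conn.cursor() as cur:
        # Atomically tag one free row with our session id...
        cur.execute(
            "UPDATE Accounts SET used = 'True', sessionid = %s "
            "WHERE used = 'False' AND sessionid IS NULL LIMIT 1",
            (session_id,)
        )
        if cur.rowcount == 0:
            return None              # no free accounts left
        # ...then read back the row we just claimed.
        cur.execute("SELECT name FROM Accounts WHERE sessionid = %s", (session_id,))
        return cur.fetchone()[0]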

How can I improve django mysql copy performance?

I have a Django app that has a model (Person) defined, and I also have some DBs (each containing an Appointment table) that do not have any models defined (they are not meant to be connected to the Django app).
I need to move some data from the Appointment table over to the Person table, such that the Person table mirrors all the information in the Appointment table. It is this way because there are multiple independent DBs like Appointment that need to be copied into the Person table (so I do not want to make any architectural changes to how this is set up).
Here is what I do now:
res = sourcedb.fetchall()  # from Appointment Table
for myrecord in res:
    try:
        existingrecord = Person.objects.filter(vendorid=myrecord[12], office=myoffice)[0]
    except:
        existingrecord = Person(vendorid=myrecord[12], office=myoffice)
    existingrecord.firstname = myrecord[0]
    existingrecord.midname = myrecord[1]
    existingrecord.lastname = myrecord[2]
    existingrecord.address1 = myrecord[3]
    existingrecord.address2 = myrecord[4]
    existingrecord.save()
The problem is that this is way too slow (takes about 8 minutes for 20K records). What can I do to speed this up?
I have considered the following approach:
1. bulk_create: Cannot use this because I have to update sometimes.
2. Delete all and then bulk_create: there are dependencies on the Person model from other things, so I cannot delete records in the Person model.
3. INSERT ... ON DUPLICATE KEY UPDATE: cannot do this because the Person table's PK (primary key) is different from the Appointment table's PK. The Appointment PK is copied into the Person table. If there were a way to check against two duplicate keys, this approach would work, I think.
A few ideas:
EDIT: See Trewq's comment to this and create Indexes on your tables first of all…
Wrap it all in a transaction using with transaction.atomic():, as by default Django will create a new transaction per save() call which can become very expensive. With 20K records, one giant transaction might also be a problem, so you might have to write some code to split your transactions into multiple batches. Try it out and measure!
If RAM is not an issue (should not be one with 20k records), fetch all data first from the appointment table and then fetch all existing Person objects using a single SELECT query instead of one per record
Use bulk_create even if some of the records are updates. This will still issue UPDATE queries for your updates, but will reduce all your INSERT queries to just one or a few, which is still an improvement. You can distinguish inserts from updates by the fact that inserts won't have a primary key set before calling save(); collect the inserts in a Python list for a later bulk_create instead of saving them directly.
As a last resort: write raw SQL to make use of MySQL's INSERT … ON DUPLICATE KEY UPDATE syntax. You don't need the same primary key for this, a UNIQUE key would suffice. Keys can span multiple columns, see Django's Meta.unique_together model option. (A sketch combining these ideas follows after this list.)
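A sketch combining those ideas; sourcedb, myoffice and the myrecord indices follow the question's code, while BATCH_SIZE and the dict lookup are assumptions of this sketch:

from django.db import transaction

BATCH_SIZE = 1000
res = sourcedb.fetchall()   # from the Appointment table, as in the question

# One SELECT for all existing Person rows of this office, keyed by vendorid.
existing = {p.vendorid: p for p in Person.objects.filter(office=myoffice)}

to_create, to_update = [], []
for myrecord in res:
    person = existing.get(myrecord[12]) or Person(vendorid=myrecord[12], office=myoffice)
    person.firstname, person.midname, person.lastname = myrecord[0], myrecord[1], myrecord[2]
    person.address1, person.address2 = myrecord[3], myrecord[4]
    (to_update if person.pk else to_create).append(person)

# One transaction instead of one per save(); for much larger data sets this
# could be split into several smaller atomic batches, as noted above.
with transaction.atomic():
    Person.objects.bulk_create(to_create, batch_size=BATCH_SIZE)
    for person in to_update:
        person.save()   # still one UPDATE per row; Django 2.2+ offers bulk_update()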

What does the write limitation on ancestor queries mean?

According to the documentation, when using ancestor queries the following limitation will be enforced:
Ancestor queries allow you to make strongly consistent queries to the datastore; however, entities with the same ancestor are limited to 1 write per second.
class Customer(ndb.Model):
    name = ndb.StringProperty()

class Purchase(ndb.Model):
    price = ndb.IntegerProperty()

purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
purchase3 = Purchase(parent=customer_entity.key)

purchase1.put()
purchase2.put()
purchase3.put()
Taking the same example, if I were to write three purchases at the same time, would I get an exception, since they are less than a second apart?
Here you can find two excellent videos about the datastore, strong consistency and entity groups. Datastore Introduction and Datastore Query, Index and Transaction.
Regarding your example: you can use put_multi(), which "counts" as a single entity-group write.
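For the example above, that would look like this (ndb, as in the question; Purchase and customer_entity are the same names used there):

from google.appengine.ext import ndb

purchase1 = Purchase(parent=customer_entity.key)
purchase2 = Purchase(parent=customer_entity.key)
purchase3 = Purchase(parent=customer_entity.key)

# One batch write to the entity group instead of three separate writes.
ndb.put_multi([purchase1, purchase2, purchase3])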
