I have to increment three different counters in a single transaction. Besides that, I have to manipulate three other entities as well. I get
too many entity groups in a single transaction
I've used the recipe from https://developers.google.com/appengine/articles/sharding_counters to implement my counters. I increment my counters inside some model (class) methods, depending on business logic.
As a workaround I implemented a deferred increment method that uses tasks to update the counter. But that doesn't scale well if the number of counters increases further, as there is also a limit on the number of tasks in a single transaction (I think it's 5), and I guess it's not the most efficient way.
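For illustration, roughly what that deferred increment looks like (increment_counter stands in for the sharded-counter increment function from the article; the helper name is mine):

from google.appengine.ext import deferred

def increment_deferred(name):
    # Enqueue the actual increment as a task tied to the current
    # transaction: it is only enqueued if the transaction commits.
    # increment_counter is the sharded-counter helper (assumed name).
    deferred.defer(increment_counter, name, _transactional=True)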
I also found https://github.com/DocSavage/sharded_counter/blob/master/counter.py, which seems to ensure the counter gets updated even in case of a db error, by going through memcache. But I don't want to increment my counters if the transaction fails.
Another idea is to remember the counters I have to increment during a web request and to increment them in a single deferred task. I don't know how to implement this in a clean and thread-safe way without passing objects created in the request to the model methods. I think this code would be ugly and not in the same transaction:
def my_request_handler():
    counter_session = model.counter_session()
    model.mylogic(counter_session, other_params)
    counter_session.write()
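For concreteness, roughly what I imagine the counter_session doing (all names are made up; increment_counter again stands for the sharded-counter helper):

from google.appengine.ext import deferred

class CounterSession(object):
    """Collects counter deltas during a request and flushes them once."""

    def __init__(self):
        self._deltas = {}

    def increment(self, name, delta=1):
        self._deltas[name] = self._deltas.get(name, 0) + delta

    def write(self):
        if self._deltas:
            # _transactional=True ties the task to the current
            # transaction, so it only runs if the commit succeeds.
            deferred.defer(_apply_deltas, self._deltas, _transactional=True)

def _apply_deltas(deltas):
    for name, delta in deltas.items():
        increment_counter(name, delta)  # assumed sharded-counter helper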
Any experiences or ideas?
BTW: I'm using python, ndb and flask
It would be ok if the counter is not 100% accurate.
As said in Transactions and entity groups:
the simplest approach is to determine which entities you need to be able to process in the same transaction. Then, when you create those entities, place them in the same entity group by declaring them with a common ancestor. They will then all be in the same entity group and you will always be able to update and read them transactionally.
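For example, a rough sketch of what that looks like with ndb (kind names and key ids are illustrative):

from google.appengine.ext import ndb

class Counter(ndb.Model):
    count = ndb.IntegerProperty(default=0)

# All counters share one ancestor, so they live in one entity group
# and can be read and updated in a single transaction.
ROOT_KEY = ndb.Key('CounterRoot', 'counters')

@ndb.transactional
def increment_all(names):
    counters = []
    for name in names:
        key = ndb.Key('Counter', name, parent=ROOT_KEY)
        counter = key.get() or Counter(key=key)
        counter.count += 1
        counters.append(counter)
    ndb.put_multi(counters)

Note that this trades the throughput of sharded counters for transactionality: a single entity group is limited to roughly one write per second.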
Related
I'm trying to achieve strong consistency with ndb using Python.
It looks like I'm missing something, as my reads behave as if they're not strongly consistent.
The query is:
links = Link.query(ancestor=lead_key).filter(Link.last_status == None).fetch(keys_only=True)
if links:
    do_action()
The key structure is:
Lead root (generic key) -> Lead -> Website (one per lead) -> Link
I have many tasks that are executed concurrently using TaskQueue, and this query is performed at the end of every task. Sometimes I'm getting a "too much contention" exception when updating the last_status field, but I deal with it using retries. Can that break strong consistency?
The expected behavior is having do_action() called when there are no links left with last_status equal to None. The actual behavior is inconsistent: sometimes do_action() is called twice and sometimes not called at all.
Using an ancestor key to get strong consistency has a limitation: you're limited to one update per second per entity group. One way to work around this is to shard the entity groups. Sharding Counters describes the technique. It's an old article, but as far as I know, the advice is still sound.
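The core of the technique looks roughly like this (a simplified sketch, not the article's exact code):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 20  # more shards = more write throughput

class CounterShard(ndb.Model):
    name = ndb.StringProperty()
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def increment(name):
    # Pick a random shard; each shard is its own root entity, so
    # writes are spread out instead of contending on one entity group.
    shard_id = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    shard = CounterShard.get_by_id(shard_id)
    if shard is None:
        shard = CounterShard(id=shard_id, name=name)
    shard.count += 1
    shard.put()

def get_count(name):
    # Reading sums all shards; the total can lag slightly behind.
    return sum(s.count for s in CounterShard.query(CounterShard.name == name))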
Adding to Dave's answer, which is the first thing to check:
One thing which isn't well documented and can be a bit surprising is that the contention can be caused by read operations as well, not only by the write ones.
Whenever a transaction starts the entity groups being accessed (by read or write ops, doesn't matter) are marked as such. The too much contention error indicates that too many parallel transactions simultaneously try to access the same entity group. It can happen even if none of the transactions actually attempts to write!
Note: this contention is NOT emulated by the development server, it can only be seen when deployed on GAE, with the real datastore!
What can add to the confusion is the automatic retries of the transactions, which can happen after both actual write conflicts and plain access contention. These retries may appear to the end user as suspicious repeated execution of some code paths, which I suspect could explain your reports of do_action() being called twice.
Usually when you run into such problems you have to revisit your data structures and/or the way you're accessing them (your transactions). In addition to solutions that maintain strong consistency (which can be quite expensive), you may want to re-check whether consistency is actually a must. In some cases it's added as a blanket requirement just because it appears to simplify things. From my experience it doesn't :)
There is nothing in your sample that ensures that your code is only called once.
For the moment, I am going to assume that your "do_action" function does something to the Link entities, specifically that it sets the "last_status" property.
If you do not perform the query and the write to the Link Entity inside a transaction, then it is possible for two different requests (task queue tasks) to get results back from the query, then both write their new value to the Link entity (with the last write overwriting the previous value).
Remember that even if you do use a transaction, you don't know until the transaction is successfully completed that nobody else tried to perform a write. This is important if you are trying to do something external to datastore (for example, making a http request to an external system), as you may see http requests from transactions that would eventually fail with a concurrent modification exception.
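As a rough sketch (names are assumed, not your actual code), the status update and the "is anything still pending?" check can live in one ancestor transaction, and the external action only runs after the transaction has committed:

from google.appengine.ext import ndb

@ndb.transactional
def finish_link(lead_key, link_key, new_status):
    # Update this task's Link and decide, inside the transaction,
    # whether any other Link under the same ancestor is still pending.
    link = link_key.get()
    link.last_status = new_status
    link.put()
    # Queries inside a transaction see a snapshot taken at its start,
    # so exclude the key we just wrote ourselves.
    pending = Link.query(ancestor=lead_key).filter(
        Link.last_status == None).fetch(keys_only=True)
    return all(key == link_key for key in pending)

# Only after a successful commit do we know nobody raced us.
if finish_link(lead_key, link_key, 'done'):
    do_action()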
I'm working on something to clear my database of ~10,000 entities, and my plan is to put it in a task that deletes 200 at a time using ndb.delete_multi() and then recursively calls itself again until there are no entities left.
For now, I don't have the recursion in it yet so I could run the code a few times manually and check for errors, quota use, etc. The code is:
entities = MyModel.query_all(ndb.Key('MyModel', '*defaultMyModel')).fetch(200)
key_list = ndb.put_multi(entities)
ndb.delete_multi(key_list)
All the query_all() does is query MyModel and return everything.
I've done some testing by commenting out things and running the method, and it looks like the first two lines take up the expected amount of writes (~200).
Running the third line, ndb.delete_multi(), takes up about 8% of my 50,000 daily write allowance, so about 4000 writes--20 times as many as I think it should be doing.
I've also made sure the key_list contains only 200 keys with logging.
Any ideas on why this takes up so many writes? Am I using the method wrong? Or does it just use a ton of memory? In that case, is there any way for me to do this more efficiently?
Thanks.
When you delete an entity, the Datastore has to remove an entity and a record from an index for each indexed property as well as for each custom index. The number of writes is not dependent on which delete method you use.
Your code example is extremely inefficient. If you are deleting large numbers of entities then you will need to batch the code below, but you should be retrieving the data with a keys_only query and then deleting:
from google.appengine.ext import ndb

ndb.delete_multi(
    MyModel.query().fetch(keys_only=True)
)
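If the number of entities is too large for a single fetch, a rough way to batch it (the batch size is an arbitrary choice):

from google.appengine.ext import ndb

BATCH_SIZE = 500  # arbitrary; tune as needed

while True:
    keys = MyModel.query().fetch(BATCH_SIZE, keys_only=True)
    if not keys:
        break
    ndb.delete_multi(keys)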
In regards to the number of write operations (see Andrei's answer), ensure that only the properties on your model that actually need to be indexed have an index enabled.
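For example (an illustrative model, not yours):

from google.appengine.ext import ndb

class MyModel(ndb.Model):
    # Unindexed properties cost no index writes on put or delete.
    payload = ndb.TextProperty()              # TextProperty is never indexed
    note = ndb.StringProperty(indexed=False)  # explicitly unindexed
    created = ndb.DateTimeProperty()          # indexed, because it is queried on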
I have several tens of thousands of related small entities (NDB on top of Master/Slave; I will have to move to HRD one day...) which I'd like to put in the same entity group to enable transactions.
Small subsets of those entities will be updated by transactions.
What are the performance implications of this setup?
Does it mean the whole group gets locked during the update? I.e. one transaction at a time.
Thanks!
There's an approximate performance limit of 1 write transaction per second to an entity group.
The whole group does get locked for the update. A subsequent transaction will fail and retry.
10k entities in an entity group sounds like a lot, but it really depends on your write patterns. For example, if only a few entities in the group are ever updated, it may not be a big issue. However, if random users are constantly updating random entities in the group, you'll want to split it up into more entity groups.
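If you do need to split it up, a rough sketch of spreading entities across several group roots (the numbers and names are illustrative; entities that must be updated in the same transaction should still share a root):

import hashlib
from google.appengine.ext import ndb

NUM_GROUPS = 16  # illustrative; pick based on the expected write rate

def group_root(entity_id):
    # Deterministically map an entity to one of NUM_GROUPS root keys so
    # writes are spread across entity groups instead of one big group.
    shard = int(hashlib.md5(str(entity_id)).hexdigest(), 16) % NUM_GROUPS
    return ndb.Key('ThingGroup', 'group-%d' % shard)

# e.g. SmallThing(parent=group_root(some_id), ...).put()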
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially I'm performing an 'or' query, but comparing different properties, so I can't use the built-in OR... I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and made_by set to the current user.
Which would be less demanding on the datastore (ie financially cost less):
Query to get everything, i.e. Thing.query(Thing.belongs_to == some_thing.key), and iterate through the results, keeping the visible ones, plus the ones that aren't visible but were made_by the current user?
Query to get the visible ones, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones made by the current user, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Option 1 would return many unneeded results, like non-visible Things made by other users, which I think means many reads from the datastore? Option 2 is two whole queries, though, which is also possibly unnecessarily heavy, right? I'm still trying to work out which kinds of interaction with the database cause which kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each datastore read and for each entity returned by a query, so with the first approach you pay to query and read all the data, including entities you don't need; with the second approach you only pay for what you need.
Secondly, you also pay for backend or frontend instance time, and with the first method you will be spending time iterating through all your results, whereas with the second method you spend no extra time.
I can't see a way where the first option is better. (maybe if you only have a few entities??)
To understand what reads and queries cost, scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how Reads, Writes and Small operations are added up for reads, writes and queries.
I would also just query for the ones that are made by the current user instead of visible=False and made_by=current_user; this way you don't need a composite index, which will save some time. You can also make visible a partial index, saving some space as well (only index it when True, assuming you never need to query for the False ones). You will need to do a little work to remove duplicates, but that is probably not too bad.
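A rough sketch of that approach, using the tasklets you mention (names are illustrative):

from google.appengine.ext import ndb

@ndb.tasklet
def visible_or_mine(some_thing, current_user_key):
    # Run both keys-only queries concurrently, then de-duplicate.
    visible_fut = Thing.query(
        Thing.belongs_to == some_thing.key,
        Thing.visible == True).fetch_async(keys_only=True)
    mine_fut = Thing.query(
        Thing.belongs_to == some_thing.key,
        Thing.made_by == current_user_key).fetch_async(keys_only=True)
    visible_keys, mine_keys = yield visible_fut, mine_fut
    things = yield ndb.get_multi_async(set(visible_keys) | set(mine_keys))
    raise ndb.Return(things)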
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better though. Loading tons of objects that you don't care about is simply going to put a heavy burden on the data store that I don't think an extra query would be comparable to. Of course, it depends on how many extra things, etc.
I'm looking for a way to keep the equivalent of persistent global variables in app engine (python). What I'm doing is creating a global kind that I initialize once (i.e. when I reset all my database objects when I'm testing). I have things in there like global counters, or the next id to assign certain kinds I create.
Is this a decent way to do this sort of thing or is there generally another approach that is used?
The datastore is the only place you can have guaranteed-persistent data that are also modifiable. So you can have a single large object, or several smaller ones (with a name attribute and others), depending on your desired access patterns -- but live in the datastore it must. You can use memcache for faster cache that usually persists across queries, but any memcache entry could go away any time, so you'll always need it to be backed by the datastore (in particular, any change must go to the datastore, not just to memcache).
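For example, a rough sketch of a single datastore-backed "globals" entity (kind and property names are illustrative):

from google.appengine.ext import ndb

class Globals(ndb.Model):
    next_widget_id = ndb.IntegerProperty(default=0)

GLOBALS_KEY = ndb.Key(Globals, 'singleton')

@ndb.transactional
def allocate_widget_id():
    # Every change goes to the datastore; memcache (or ndb's own
    # caching) only ever sits in front of it as a cache.
    g = GLOBALS_KEY.get() or Globals(key=GLOBALS_KEY)
    g.next_widget_id += 1
    g.put()
    return g.next_widget_id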