In my Google App Engine application, whenever a user purchases a number of contracts, these events are executed (simplified for clarity):
user.cash is decreased
user.contracts is increased by the number
contracts.current_price is updated.
market.no_of_transactions is increased by 1.
In an RDBMS, these would be placed within the same transaction. As I understand it, the Google datastore does not allow entities of more than one model to be in the same transaction.
What is the correct approach to this issue? How can I ensure that if a write fails, all preceding writes are rolled back?
Edit: I had obviously missed entity groups. Now I'd appreciate some further information on how they are used. Another point to clarify: Google says "Only use entity groups when they are needed for transactions. For other relationships between entities, use ReferenceProperty properties and Key values, which can be used in queries". Does this mean I have to define both a reference property (since I need to query on them) and a parent-child relationship (for transactions)?
Edit 2: And finally, how do I define two parents for an entity if the entity is being created to establish an n-to-n relationship between the 2 parents?
After thorough research, I have found that a distributed transaction layer, which works around the single-entity-group restriction, has been developed in userland with the help of some Google people. So far, however, it has not been released and is only available in Java.
Let me add a quote from the Datastore documentation:
A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller.
You could create a pseudo root entity and put everything below it. Then you can execute everything in a single transaction.
shanyu, you mentioned the distributed transaction layer that lets you operate across arbitrarily many entity groups in a single transaction. It actually has been released, it just hasn't been advertised very loudly. It was designed and written by Daniel Wilkerson and Erick Armbrust, with some consulting on my part. Dan describes it in this talk.
Nick Johnson has also described how to do "transfer" type operations across entity groups, similar to what you describe. It's not as general purpose as tapioca-orm (the transaction layer mentioned above), but it's simpler and lighter weight.
There's a related built-in feature, transactional tasks, that lets you add a task to a queue within a datastore transaction, such that it will only be added if the transaction commits successfully. That task can then do more datastore operations, including a transaction on a different entity group. It's not as strong as Dan and Erick's solution, but it does give you guaranteed eventual consistency across entity groups, which is good enough for many use cases, without the extra overhead.
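The transactional-task guarantee can be modeled in a few lines of plain Python. This is only a toy sketch, not the App Engine API: ToyTransaction, purchase, bump_market and run_queued_tasks are hypothetical names, and the datastore and task queue are stood in for by a dict and a list. On App Engine the enqueue step would correspond to something like taskqueue.add(..., transactional=True) called inside the transaction.

```python
class ToyTransaction(object):
    """Toy stand-in for a datastore transaction (not the App Engine API).

    Writes and tasks are buffered and applied only on commit.
    """

    def __init__(self, store, queue):
        self.store = store
        self.queue = queue
        self.pending_writes = {}
        self.pending_tasks = []

    def put(self, key, entity):
        self.pending_writes[key] = entity

    def add_task(self, func, *args):
        self.pending_tasks.append((func, args))

    def commit(self):
        self.store.update(self.pending_writes)
        self.queue.extend(self.pending_tasks)


def run_in_transaction(store, queue, body):
    txn = ToyTransaction(store, queue)
    body(txn)  # an exception here discards the buffered writes and tasks
    txn.commit()


def bump_market(store, market_key):
    # Runs later, as its own step, on a different "entity group".
    market = dict(store[market_key])
    market['no_of_transactions'] += 1
    store[market_key] = market


def purchase(store, queue, user_key, market_key, quantity, price):
    def body(txn):
        user = dict(store[user_key])
        cost = quantity * price
        if user['cash'] < cost:
            raise ValueError('insufficient cash')
        user['cash'] -= cost
        user['contracts'] += quantity
        txn.put(user_key, user)
        # Enqueued only if the surrounding transaction commits.
        txn.add_task(bump_market, store, market_key)
    run_in_transaction(store, queue, body)


def run_queued_tasks(queue):
    while queue:
        func, args = queue.pop(0)
        func(*args)
```

The user's cash and contracts change atomically; the market counter catches up when the queued task runs, which is the eventual consistency described above. A failed purchase leaves both the store and the queue untouched.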
In response to your questions: 1) You're not required to use both reference properties and parent/child relationships (i.e. entity groups). That guideline just means that entity groups limit datastore write throughput, since writes are serialized per entity group. You should be aware of that if you're considering structuring your data into entity groups just for ancestor queries.
2) An entity can't have more than one parent. If you want to model a many-to-many relationship, you should generally use a ListProperty of reference properties (i.e. keys). See this article and this talk for details.
Related
From What can be done in a transaction (highlight from me):
All Datastore operations in a transaction must operate on entities in
the same entity group if the transaction is a single group
transaction, or on entities in a maximum of twenty-five entity
groups if the transaction is a cross-group (XG) transaction.
Is there an actual definition corresponding to that 25 number that I can reference in my python app code? Or an API call returning it? I'd prefer to use one, if available, rather than create my own definition of it, just in case Google decides to change it down the road...
Update: To clarify, I'm talking about the equivalent of _MAX_EG_PER_TXN which I just spotted in LiveTxn._GetTracker() from the SDK's google/appengine/datastore/datastore_stub_util.py file:
    if self._allow_multiple_eg:
      Check(len(self._entity_groups) < _MAX_EG_PER_TXN,
            'operating on too many entity groups in a single transaction.')
As a side note it'd be great for debugging if the tracked groups info from self._entity_groups could be somehow accessed when such exceptions are raised.
An entity group is defined by a top-level (root) entity, that is, one with no parent or ancestor. Every child entity that references it as an ancestor falls into the same entity group.
There is no programmatic way to ask the API for the count, but you should be able to determine it in your own code through explicit design:
No more than 25 entities in a transaction if using top-level entities
No more than 25 distinct parents if using child entities
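As a rough illustration of the counting rule above, the number of entity groups a transaction touches is the number of distinct root keys among the entities involved. The sketch below is plain Python with hypothetical names: key paths are modeled as tuples of (kind, id) pairs, and the limit is a constant defined by hand, since no public constant is assumed to exist. With NDB keys, key.root() should give the same root.

```python
# Documented XG limit, hard-coded here because no public constant is assumed.
MAX_ENTITY_GROUPS_PER_XG_TXN = 25


def entity_group_root(key_path):
    """Return the root (kind, id) pair of a key path.

    A key path looks like (('User', 1), ('Comment', 7)); its entity group
    is identified by the first pair.
    """
    return key_path[0]


def count_entity_groups(key_paths):
    """Count the distinct entity groups touched by a set of key paths."""
    return len(set(entity_group_root(path) for path in key_paths))


def fits_in_xg_transaction(key_paths, limit=MAX_ENTITY_GROUPS_PER_XG_TXN):
    """True if the key paths span no more than `limit` entity groups."""
    return count_entity_groups(key_paths) <= limit
```

Calling fits_in_xg_transaction before starting the transaction lets the application fail early with its own error message, including the list of roots involved, instead of relying on the datastore exception.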
The definition of entity groups is baked into the underlying Megastore storage layer, so it is unlikely to change in any way other than us increasing or removing the limit.
EDIT: FR logged: https://github.com/GoogleCloudPlatform/google-cloud-datastore/issues/121
I'm trying to model a basic linear commenting system for my blog in App Engine (you can see it at http://codeinsider.us). My main classes of objects are:
Users,
Articles,
Comments
One user will have many comments and should be able to view their comments at a glance.
One article will have many comments and should be visible at a glance.
One comment will be associated with exactly one user and exactly one article.
I know how I might build this in a standard relational database - I might have, say, separate tables for comments, users, and articles, with foreign keys to tie them together, uniqueness constraints on articles and users, and none on comments, etc. Nothing fancy.
What's the best way of modeling this in Python App Engine with NDB? ndb.KeyProperty seems interesting, as does StructuredProperty. I don't think I can use StructuredProperty though, since a comment can "belong" to both a User and an Article. But ndb.KeyProperty doesn't seem to do any checking or validation, so I'd have to implement that on my own.
The other thing I can do is just throw in the towel, and store giant JSON blobs in Users and Articles representing the Keys and Kinds of comments. That may not be a bad solution.
Any thoughts?
Edit:
This is going to be high-read, low-write. I may add some engagement on comments (upvotes/downvotes), but even then, it will be heavily weighted towards reads.
I recommend thinking carefully about which features you plan to provide, since structuring your models one way may make some changes difficult in the future.
I would do this as follows:
First, assume some eventual consistency. No matter how you design this, some of your queries will be eventually consistent.
Make a KeyProperty "owner" in Article to store the user_key. If you want strong consistency when querying the articles of a single user, then instead of using the "owner" KeyProperty just make the user_key the parent of the Article (this creates an entity group for the user and its articles, which is fine here).
With comments you can do more things.
If you expect fewer than about 100 comments per article (the exact ceiling depends on how large the Article entity gets in the datastore), create a comments KeyProperty(repeated=True) in Article to store all the comment keys, and then fetch them with get_multi (strong consistency).
To create the comment and also update the Article's comments property you may need a transaction, because you will want both operations to succeed or neither. But the two entities are not in the same entity group, so either 1) use a cross-group transaction, or 2) make the Article the parent of the comment (this second option has some consequences, discussed later). Counting comments is easy, but the approach is limited to roughly 100 comments per article, as noted above.
Create a Comment ndb model with two KeyProperties, "owner" and "article". The article will fetch its comments with a query. Querying all the comments of an Article is eventually consistent unless you make the article the parent of the comment (in that case, of course, don't create the article KeyProperty). This approach allows lots of comments.
The problem with using entity groups is that if, for example, you allow voting on comments, then a single write to any comment will block any other write in the whole entity group of the affected Article. So comment creation and voting by other users may be affected. But don't worry about this if you expect few votes and you keep entity groups small.
If you want to allow comment votes, this can get quite complicated, as you may want, for example, only one vote per user. This will require extra relationships that need to be thought through beforehand.
Personally, I prefer to assume eventual consistency almost always.
More approaches are possible, but I like these two.
A high-read, low-write scenario is the specialty of GAE, so that's a good thing for your purpose.
I'd take advantage of the ancestry feature of GAE Model as it assures you transactional/atomic operations within an entity group. I guess you don't need much of that but it's a good thing to have still.
The right structure is determined by the way you are going to treat/use your data. I'm assuming the typical case in your blog would be to show comments for an article, thus, I'd make your comment model a child of your article model - you could then query comments for a certain (article) ancestor and that would scale magnificently.
I'd include a KeyProperty for the author on the comment, as that would be used mainly to fetch a user from the key I assume. If you want to extend KeyProperty functionality you can do so. Here's an example on how to make KeyProperty behave as ReferenceProperty used to in db. (point 1.)
How do we implement aggregation or composition with NDB on Google App Engine? What is the best way to proceed depending on the use case?
Thanks!
I've tried to use a repeated property. In this very simple example, a Project has a list of Tag keys (I have chosen to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
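    from google.appengine.ext import ndb

    class Project(ndb.Model):
        name = ndb.StringProperty()
        # Tag is a separate model, defined elsewhere; projects share tags by key.
        tags = ndb.KeyProperty(kind=Tag, repeated=True)
        budget = ndb.FloatProperty()
        date_begin = ndb.DateProperty(auto_now_add=True)
        date_end = ndb.DateProperty(auto_now_add=True)

        @classmethod
        def all(cls):
            return cls.query()

        # Instance method: assigning to cls.tags would overwrite the property itself.
        def addTags(self, from_str):
            tagname_list = from_str.split(',')
            tag_list = []
            for tag in tagname_list:
                # Tag.addTag (not shown) is expected to return the Tag's key.
                tag_list.append(Tag.addTag(tag))
            self.tags = tag_list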
--
Edited (2) :
Thanks. In the end, I chose to create a new Model class 'Relation' representing a relation between two entities. It is really more of an association; I admit that my first design was poorly suited to it.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel, which stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would've worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cronjobs that run everyday to push RawModel to BigQuery and another to create the AggregateModel records with data fetched from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)
It really does depend on the use case. For small numbers of items StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities you will then look at setting the parent/ancestor in the Key for composition, and have a KeyProperty pointing to the primary entity in a many to one aggregation.
However the choice will also depend heavily on the actual use pattern as well. Then considerations of efficiency kick in.
The best I can suggest is to consider carefully how you plan to use these relationships: how active are they (i.e. are they constantly changing, with members being added and deleted), and do you need to see all members of the relation most of the time, or just subsets? These considerations may well require adjustments to the approach.
Need a way to improve performance on my website's SQL based Activity Feed. We are using Django on Heroku.
Right now we are using actstream, which is a Django App that implements an activity feed using Generic Foreign Keys in the Django ORM. Basically, every action has generic foreign keys to its actor and to any objects that it might be acting on, like this:
Action:
(Clay - actor) wrote a (comment - action object) on (Andrew's review of Starbucks - target)
As we've scaled, it's become way too slow, which is understandable because it relies on big, expensive SQL joins.
I see at least two options:
Put a Redis layer on top of the SQL database and get activity feeds from there.
Try to circumvent the Django ORM and do all the queries in raw SQL, which I understand can improve performance.
If anyone has thoughts on either of these two, or other ideas, I'd love to hear them.
You might want to look at Materialized Views. Since you're on Heroku, and that uses PostgreSQL generally, you could look at Materialized View Support for PostgreSQL. It is not as mature as for other database servers, but as far as I understand, it can be made to work. To work with the Django ORM, you would probably have to create a new "entity" (not familiar with Django here so modify as needed) for the feed, and then do queries over it as if it was a table. Manual management of the view is a consideration, so look into it carefully before you commit to it.
Hope this helps!
You said redis? Everything is better with redis.
Caching is one of the best ideas in software development. Even if you use Materialized Views, you should also consider caching them; believe me, your users will notice the difference.
Went with an approach that sort of combined the two suggestions.
We created a master list of every action in the database, which included all the information we needed about the actions, and stuck it in Redis. Given an action ID, we can now do a Redis lookup on it and get a dictionary object that is ready to be returned to the front end.
We also created action id lists that correspond to all the different types of activity streams that are available to a user. So given a user id, we have his friends' activity, his own activity, favorite places activity, etc, available for look up. (These I guess correspond somewhat to materialized views, although they are in Redis, not in PSQL.)
So we get a user's feed as a list of action ids. Then we get the details of those actions by look ups on the ids in the master action list. Then we return the feed to the front end.
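A minimal sketch of that two-level layout, with plain Python dicts and lists standing in for Redis so that it runs without a server. The class and method names (FeedStore, record_action, get_feed) and the stream names are hypothetical, not taken from the actual code; in Redis the master list might be a hash or plain string keys holding JSON, and each stream a list of action ids.

```python
import json


class FeedStore(object):
    """In-memory stand-in for the Redis layout described above."""

    def __init__(self):
        # action_id -> JSON blob, ready to be returned to the front end
        self.actions = {}
        # (user_id, stream_name) -> list of action ids, newest first
        self.streams = {}

    def record_action(self, action_id, payload, stream_keys):
        """Store the action once and push its id onto every relevant stream."""
        self.actions[action_id] = json.dumps(payload)
        for key in stream_keys:
            self.streams.setdefault(key, []).insert(0, action_id)

    def get_feed(self, user_id, stream_name, limit=20):
        """Resolve a stream's action ids against the master action list."""
        ids = self.streams.get((user_id, stream_name), [])[:limit]
        return [json.loads(self.actions[i]) for i in ids if i in self.actions]
```

Each action is serialized once at write time and fanned out to the streams that should show it, so reading a feed is a list slice plus key lookups rather than a join.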
Thanks for the suggestions, guys.
In the High-Replication Datastore (I'm using NDB), the consistency is eventual. In order to get a guaranteed complete set, ancestor queries can be used. Ancestor queries also provide a great way to get all the "children" of a particular ancestor with kindless queries. In short, being able to leverage the ancestor model is hugely useful in GAE.
The problem I seem to have is rather simplistic. Let's say I have a contact record and a message record. A given contact record is being treated as the ancestor for each message. However, it is possible that two contacts are created for the same person (user error, different data points, whatever). This situation produces two contact records, which have messages related to them.
I need to be able to "merge" the two records and put all the messages into one big pile. Ideally, I'd be able to modify the ancestor of one record's children.
The only way I can think of doing this is to create a mapping and make my app check whether a record has been merged. If it has, look at the mappings to find one or more related records, and perform queries against those. This seems hugely inefficient. Is there a more "by the book" way of handling this use case?
The only way to change the ancestor of an entity is to delete the old one and create a new one with a new key. This must be done for all child (and grand child, etc) entities in the ancestor path. If this isn't possible, then your listed solution works.
This is required because the ancestor path of an entity is part of its unique key. Parents of entities (i.e., entities in the ancestor path) need not exist, so changing a parent's key will leave the children in the datastore with no parent.
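The delete-and-recreate step can be sketched in plain Python, modeling the datastore as a dict keyed by ancestor paths (tuples of (kind, id) pairs). The function name and data layout are hypothetical. On the real datastore each move would be a put of a copy under the new key followed by a delete of the old entity, ideally inside a transaction, and ids that were unique under the old parent may collide under the new one.

```python
def reparent_descendants(store, old_root, new_root):
    """Move every descendant of old_root under new_root.

    `store` maps key paths such as (('Contact', 2), ('Message', 10)) to
    entities. Because the ancestor path is part of the key, each descendant
    is re-created under a new key and the old key is removed. The old root
    entity itself is left alone.
    """
    old_paths = [p for p in store if p[0] == old_root and len(p) > 1]
    new_paths = dict((p, (new_root,) + p[1:]) for p in old_paths)

    # Refuse to overwrite an existing entity under the new parent.
    for new_path in new_paths.values():
        if new_path in store:
            raise ValueError('key collision: %r' % (new_path,))

    for old_path, new_path in new_paths.items():
        store[new_path] = store.pop(old_path)
    return sorted(new_paths.values())
```

After the move, a single ancestor query on the surviving contact returns all messages, and the duplicate contact can be deleted or kept as a tombstone pointing at the merged record.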