I'm trying to model a basic linear commenting system for my blog in App Engine (you can see it at http://codeinsider.us). My main classes of objects are:
Users,
Articles,
Comments
One user will have many comments and should be able to view their comments at a glance.
One article will have many comments and should be visible at a glance.
One comment will be associated with exactly one user and exactly one article.
I know how I might build this in a standard relational database - I might have, say, separate tables for comments, users, and articles, with foreign keys to tie them together, uniqueness constraints on articles and users, and none on comments, etc. Nothing fancy.
What's the best way of modeling this in Python App Engine with NDB? ndb.KeyProperty seems interesting, as does StructuredProperty. I don't think I can use StructuredProperty, though, since a comment can "belong" to both a User and an Article. And with ndb.KeyProperty, it seems like the KeyProperty doesn't do any checking or validation logic, so I'd have to implement that on my own.
The other thing I can do is just throw in the towel, and store giant JSON blobs in Users and Articles representing the Keys and Kinds of comments. That may not be a bad solution.
Any thoughts?
Edit:
This is going to be high-read, low-write. I may add some engagement on comments (upvotes/downvotes), but even then, it will be heavily weighted towards reads.
I recommend thinking carefully about what features you plan to provide, since structuring your models one way may make some changes difficult later.
I would do this as follows:
First, assume some eventual consistency. No matter how you design this, you will have some eventual consistency in some queries.
Make a KeyProperty "owner" on Article to store the user key. If you want strong consistency when querying a single user's articles, then instead of the "owner" KeyProperty make the user key the parent of the Article (this creates an entity group for the user and their articles, which is fine here).
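A minimal sketch of both options (assuming a User kind as in the question; the names here are illustrative):

from google.appengine.ext import ndb

class Article(ndb.Model):
    # Option A: plain KeyProperty; queries on it are eventually consistent.
    owner = ndb.KeyProperty(kind='User')
    title = ndb.StringProperty()

user_key = ndb.Key('User', 'alice')  # hypothetical user

# Option B: make the user the parent instead; ancestor queries over the
# resulting entity group are strongly consistent.
Article(parent=user_key, title='Hello').put()
articles = Article.query(ancestor=user_key).fetch(20)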
With comments you can do more things.
1. If you expect fewer than roughly 100 comments per article (depending on the Article's size in the datastore it can be more), create a comments KeyProperty(repeated=True) on Article to store all the comment keys, then fetch them with get_multi (strongly consistent). To create a comment and also modify the Article's comments property you will need a transaction, because you want either both operations to happen or neither. But the two entities are not in the same entity group, so either 1) use a cross-group transaction, or 2) make the Article the parent of the comment (this second option has consequences discussed later); see the sketch after this list. Counting comments is easy, but capped by the roughly-100-key limit mentioned above.
2. Create a Comment ndb model with two KeyProperties, "owner" and "article". The article fetches its comments with a query. Querying all the comments of an Article will be eventually consistent unless you make the article the parent of the comment (in that case don't create the article KeyProperty, of course). This approach allows lots of comments.
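A sketch of both options under the assumptions above (kind names are illustrative; note xg=True marking the cross-group transaction needed by option 1):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    owner = ndb.KeyProperty(kind='User')
    article = ndb.KeyProperty(kind='Article')  # omit if Article is the parent
    text = ndb.TextProperty()

class Article(ndb.Model):
    comments = ndb.KeyProperty(kind=Comment, repeated=True)  # option 1

def comments_for(article):
    # Option 1: key lookups are strongly consistent.
    return ndb.get_multi(article.comments)

@ndb.transactional(xg=True)
def add_comment(article_key, user_key, text):
    # Comment and Article live in different entity groups, hence xg=True.
    comment_key = Comment(owner=user_key, article=article_key, text=text).put()
    article = article_key.get()
    article.comments.append(comment_key)
    article.put()

def query_comments(article_key):
    # Option 2: a query; eventually consistent unless Article is the parent.
    return Comment.query(Comment.article == article_key).fetch()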
The problem with using entity groups is that, for example, if you allow voting on comments, a single write to any comment will block any other write in the whole entity group of the affected Article. So comment creation and voting by other users may be affected. But this doesn't really matter if you expect few votes and you keep entity groups small.
If you want to allow comment votes this can get quite complicated, as you may want, for example, only one vote per user. That will require extra relationships that need to be thought through beforehand.
Personally I prefer to assume eventual consistency almost always.
More approaches are possible, but I like these two.
A high-read, low-write scenario is GAE's specialty, so that's a good thing for your purpose.
I'd take advantage of the ancestry feature of the GAE Model, as it guarantees transactional/atomic operations within an entity group. I guess you don't need much of that, but it's still a good thing to have.
The right structure is determined by the way you are going to treat/use your data. I'm assuming the typical case in your blog would be to show comments for an article, thus, I'd make your comment model a child of your article model - you could then query comments for a certain (article) ancestor and that would scale magnificently.
I'd include a KeyProperty for the author on the comment, as I assume it would be used mainly to fetch a user from the key. If you want to extend KeyProperty functionality you can do so; there's an example of how to make KeyProperty behave the way ReferenceProperty used to in db (point 1).
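A sketch combining both points, with comments as children of the article and a KeyProperty for the author (the key ids here are made up):

from google.appengine.ext import ndb

class Comment(ndb.Model):
    author = ndb.KeyProperty(kind='User')
    text = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

article_key = ndb.Key('Article', 123)   # hypothetical
author_key = ndb.Key('User', 'alice')   # hypothetical

# The comment joins the article's entity group...
Comment(parent=article_key, author=author_key, text='Nice post!').put()

# ...so this ancestor query is strongly consistent and scales well.
comments = Comment.query(ancestor=article_key).order(Comment.created).fetch(50)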
Let's assume I am developing a service that provides a user with articles. Users can favourite articles and I am using Solr to store these articles for search purposes.
However, when the user adds an article to their favourites list, I would like to be able to figure out which articles the user has favourited so that I can highlight the favourite button.
I am thinking of two approaches:
Fetch articles from Solr and then loop through each article to fetch the "favourite-status" of this article for this specific user from MySQL.
Whenever a user favourites an article, add this user's ID to a multi-valued column in Solr and check whether the ID of the current user is in this column or not.
I don't know the capacity of the multivalued column... and I also don't think the second approach would be a "good practice" (saving user-related data in index).
What other options do I have, if any? Is approach 2 a correct approach?
I'd go with a modified version of the first one: for now it keeps user-specific data that isn't going to be used for search out of the index (although if you foresee a case where you want to search favourited articles, it would probably be an interesting field to have in the index). For pure display purposes like this case, I'd take all the ids returned from Solr, fetch them in one SQL statement from the database, and then set the UI values from that. It's a fast and easy solution.
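A rough sketch of that flow, assuming a pysolr client, a MySQLdb connection, and a favourites(user_id, article_id) table (all hypothetical names):

import pysolr
import MySQLdb

solr = pysolr.Solr('http://localhost:8983/solr/articles')  # assumed core
db = MySQLdb.connect(db='myapp')                           # assumed schema

def articles_with_fav_status(query, user_id):
    docs = list(solr.search(query))
    ids = [doc['id'] for doc in docs]
    fav_ids = set()
    if ids:
        # One IN query for all returned articles instead of one per article.
        placeholders = ', '.join(['%s'] * len(ids))
        cur = db.cursor()
        cur.execute(
            'SELECT article_id FROM favourites '
            'WHERE user_id = %s AND article_id IN (' + placeholders + ')',
            [user_id] + ids)
        fav_ids = {row[0] for row in cur.fetchall()}
    for doc in docs:
        doc['is_favourite'] = doc['id'] in fav_ids
    return docs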
If you foresee "search only in my fav'd articles" as a use case, I would try to get that information into the index as well (or some other way to filter on whether a specific user has favourited an article). Even then, I'd try to avoid indexing anything more than the id of the user who favourited the article.
Both solutions would work, although the latter requires more code, and the response from Solr could grow large if many users favourite an article, so I'd try to avoid returning the full set of user ids in that case (many favs for a single article).
How do we implement aggregation or composition with NDB on Google App Engine? What is the best way to proceed depending on the use case?
Thanks!
I've tried to use a repeated property. In this very simple example, a Project has a list of Tag keys (I chose to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
class Project(ndb.Model):
    name = ndb.StringProperty()
    tags = ndb.KeyProperty(kind=Tag, repeated=True)
    budget = ndb.FloatProperty()
    date_begin = ndb.DateProperty(auto_now_add=True)
    date_end = ndb.DateProperty(auto_now_add=True)

    @classmethod
    def all(cls):
        return cls.query()

    def addTags(self, from_str):
        # Assign to the instance, not the class, so the keys are persisted
        # when the entity is put(); Tag.addTag is assumed to return a key.
        tag_list = []
        for tag_name in from_str.split(','):
            tag_list.append(Tag.addTag(tag_name))
        self.tags = tag_list
--
Edited (2):
Thanks. In the end, I chose to create a new Model class 'Relation' representing a relation between two entities. It's more of an association; I confess that my first design was ill-suited.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel which stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would've worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cron jobs that run every day: one to push RawModel records to BigQuery and another to create the AggregateModel records with data fetched from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)
It really does depend on the use case. For small numbers of items StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities you will then look at setting the parent/ancestor in the Key for composition, and have a KeyProperty pointing to the primary entity in a many to one aggregation.
However, the choice will also depend heavily on the actual usage pattern; that's where considerations of efficiency kick in.
The best I can suggest is to consider carefully how you plan to use these relationships: how active they are (i.e., constantly changing, adding, deleting), and whether you need to see all members of the relation most of the time or just subsets. These considerations may well require adjustments to the approach.
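In NDB terms the two shapes look roughly like this (kind names are invented for illustration):

from google.appengine.ext import ndb

# Composition: the part's key embeds the whole, so both live in one entity
# group and can be updated together in a transaction.
class Chapter(ndb.Model):
    title = ndb.StringProperty()

book_key = ndb.Key('Book', 42)  # hypothetical whole
Chapter(parent=book_key, title='Intro').put()
chapters = Chapter.query(ancestor=book_key).fetch()

# Aggregation: an independent entity points at the primary entity.
class Employee(ndb.Model):
    department = ndb.KeyProperty(kind='Department')

staff = Employee.query(
    Employee.department == ndb.Key('Department', 'R&D')).fetch()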
Need a way to improve performance on my website's SQL based Activity Feed. We are using Django on Heroku.
Right now we are using actstream, which is a Django App that implements an activity feed using Generic Foreign Keys in the Django ORM. Basically, every action has generic foreign keys to its actor and to any objects that it might be acting on, like this:
Action:
(Clay - actor) wrote a (comment - action object) on (Andrew's review of Starbucks - target)
As we've scaled, it's become way too slow, which is understandable because it relies on big, expensive SQL joins.
I see at least two options:
Put a Redis layer on top of the SQL database and get activity feeds from there.
Try to circumvent the Django ORM and do all the queries in raw SQL, which I understand can improve performance.
If anyone has thoughts on either of these two, or other ideas, I'd love to hear them.
You might want to look at materialized views. Since you're on Heroku, which generally runs PostgreSQL, you could look at Materialized View Support for PostgreSQL. It is not as mature as in other database servers, but as far as I understand it can be made to work. To work with the Django ORM, you would probably have to create a new "entity" (not familiar with Django here, so modify as needed) for the feed, and then run queries over it as if it were a table. Manual management of the view is a consideration, so look into it carefully before you commit to it.
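As a hedged sketch, assuming a materialized view named feed_entries has been created separately in SQL (the name and columns are hypothetical), the Django side could be as simple as raw queries against it:

from django.db import connection

def fetch_feed(user_id, limit=50):
    # feed_entries is a PostgreSQL materialized view, refreshed out of band
    # (e.g. REFRESH MATERIALIZED VIEW feed_entries; on a schedule).
    cursor = connection.cursor()
    cursor.execute(
        'SELECT actor_id, verb, target_id, created_at '
        'FROM feed_entries WHERE user_id = %s '
        'ORDER BY created_at DESC LIMIT %s', [user_id, limit])
    return cursor.fetchall()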
Hope this helps!
You said Redis? Everything is better with Redis.
Caching is one of the best ideas in software development. Even if you use materialized views, you should consider caching those too; believe me, your users will notice the difference.
Went with an approach that sort of combined the two suggestions.
We created a master list of every action in the database, which included all the information we needed about the actions, and stuck it in Redis. Given an action ID, we can now do a Redis lookup on it and get a dictionary object that is ready to be returned to the front end.
We also created action-id lists that correspond to all the different types of activity streams that are available to a user. So given a user id, we have his friends' activity, his own activity, favorite-places activity, etc., available for lookup. (These correspond somewhat to materialized views, I guess, although they live in Redis, not in PSQL.)
So we get a user's feed as a list of action ids, then get the details of those actions by lookups on the ids in the master action list, and return the feed to the front end.
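A condensed sketch of that lookup path with redis-py; the key names are made up, and the real schema presumably differs:

import json
import redis

r = redis.Redis(decode_responses=True)  # assumed local instance

def get_feed(user_id, feed='friends', start=0, count=20):
    # e.g. 'feed:friends:42' holds the list of action ids for user 42
    action_ids = r.lrange('feed:%s:%s' % (feed, user_id),
                          start, start + count - 1)
    pipe = r.pipeline()
    for action_id in action_ids:
        pipe.get('action:%s' % action_id)  # master list: id -> JSON blob
    return [json.loads(blob) for blob in pipe.execute() if blob]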
Thanks for the suggestions, guys.
I can't find the "best" solution for a very simple problem (or maybe not so simple).
I have a classical set of data: posts attached to users, and comments attached to a post and a user.
Now I can't decide how to build the schema/classes.
One way is to store user_id inside comments and posts.
But what happens when I have 200 comments on a page?
Or N posts on a page?
That would mean 200 additional requests to the database just to display user info (such as name and avatar).
Another solution is to embed user data into each comment and each post.
But first, it is huge overhead; second, the model system gets corrupted (I'm using mongoalchemy); third, a user can change his info (like his avatar), and what then? As I understand it, an update operation over huge collections of comments or posts is not a simple operation...
What would you suggest? Are 200 requests per page to MongoDB OK (I must aim for performance)?
Or maybe I am just missing something...
You can avoid the N+1 problem of hundreds of requests by using $in queries. Consider this:
Post {
PosterId: ObjectId
Text: string
Comments: [ObjectId, ObjectId, ...] // option 1
}
Comment {
PostId: ObjectId // option 2 (better)
Created: dateTime,
AuthorName: string,
AuthorId: ObjectId,
Text: string
}
Now you can find the post's comments with an $in query, and you can also easily find all comments made by a specific author.
Of course, you could also store the comments as an embedded array in post, and perform an $in query on the user information when you fetch the comments. That way, you don't need to de-normalize user names and still don't need hundreds of queries.
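For example, with PyMongo under option 2, fetching every comment author in one $in query instead of one query per comment (the collection and database names are assumptions):

from pymongo import MongoClient

db = MongoClient().blog  # hypothetical database

def comments_with_authors(post_id):
    comments = list(db.comments.find({'PostId': post_id}))
    # One $in query for all distinct authors of these comments.
    author_ids = list({c['AuthorId'] for c in comments})
    users = {u['_id']: u for u in db.users.find({'_id': {'$in': author_ids}})}
    for comment in comments:
        comment['author'] = users.get(comment['AuthorId'])
    return comments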
If you choose to denormalize the user names, you will have to update all comments ever made by a user when that user changes e.g. his name. On the other hand, if such operations don't occur very often, it shouldn't be a big deal. Or maybe it's even better to store the name the user had when he made the comment, depending on your requirements.
A general problem with embedding is that different writers will write to the same object, so you will have to use the atomic modifiers (such as $push). This is sometimes harder to use with mappers (I don't know mongoalchemy though), and generally less flexible.
What I would do with mongodb would be to embed the user id into the comments (which are part of the structure of the "post" document).
Three simple hints for better performances:
1) Make sure you have an index on user_id.
2) Use comment pagination to avoid querying the database 200 times (see the sketch below).
3) Caching is your friend
You could cache your user objects so you don't have to query the database each time.
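One concrete reading of these hints, sketched with PyMongo and a separate comments collection (the field names, page size, and plain-dict cache are all assumptions; a real app would use memcached or Redis with expiry):

from pymongo import MongoClient

db = MongoClient().blog  # hypothetical database
db.comments.create_index('user_id')  # hint 1
PAGE_SIZE = 20
user_cache = {}  # hint 3

def comments_page(post_id, page):
    # Hint 2: fetch one page of comments, not all of them.
    comments = list(db.comments.find({'post_id': post_id})
                      .sort('created', -1)
                      .skip(page * PAGE_SIZE)
                      .limit(PAGE_SIZE))
    # Fetch only the authors we haven't cached yet, in a single $in query.
    missing = {c['user_id'] for c in comments} - set(user_cache)
    if missing:
        for user in db.users.find({'_id': {'$in': list(missing)}}):
            user_cache[user['_id']] = user
    for c in comments:
        c['user'] = user_cache.get(c['user_id'])
    return comments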
I like the idea of embedding user data into each post, but then you have to think about what happens when a user's profile is updated; you have to make sure that no post is missed.
I would recommend starting out just by skimming how mongo recommends you handle schemas.
Generally, for "contains" relationships between entities, embedding should be chosen. Use linking when not using linking would result in duplication of data.
http://www.mongodb.org/display/DOCS/Schema+Design
There's a pretty good use case from the MongoDB docs: http://docs.mongodb.org/manual/use-cases/storing-comments/
Conveniently it's also written in Python :-)
In my Google App Engine application, whenever a user purchases a number of contracts, these events are executed (simplified for clarity):
user.cash is decreased
user.contracts is increased by the number
contracts.current_price is updated.
market.no_of_transactions is increased by 1.
In an RDBMS, these would be placed within the same transaction. As far as I can tell, the Google datastore does not allow entities of more than one model to be in the same transaction.
What is the correct approach to this issue? How can I ensure that if a write fails, all preceding writes are rolled back?
Edit: I have obviously missed entity groups. Now I'd appreciate some further information on how they are used. Another point to clarify: Google says "Only use entity groups when they are needed for transactions. For other relationships between entities, use ReferenceProperty properties and Key values, which can be used in queries". Does that mean I have to define both a reference property (since I need to query them) and a parent-child relationship (for transactions)?
Edit 2: And finally, how do I define two parents for an entity if the entity is being created to establish an n-to-n relationship between two parents?
After thorough research, I found that a distributed transaction layer that provides a solution to the single-entity-group restriction has been developed in userland with the help of some Google people. But so far it has not been released and is only available in Java.
Let me add a quote from the Datastore documentation:
A good rule of thumb for entity groups is that they should be about the size of a single user's worth of data or smaller.
You could create a pseudo-root entity and put everything below it. Then you execute everything in a transaction.
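A sketch of that idea with NDB (kind names are illustrative; the same shape works in the older db API). Everything hangs off one pseudo-root, so a single transaction covers all the writes:

from google.appengine.ext import ndb

ROOT = ndb.Key('Root', 'market')  # pseudo-root; nothing is stored at this key

@ndb.transactional
def purchase(user_id, contract_id, n):
    user_key = ndb.Key('User', user_id, parent=ROOT)
    contract_key = ndb.Key('Contract', contract_id, parent=ROOT)
    user, contract = ndb.get_multi([user_key, contract_key])
    user.cash -= n * contract.current_price
    user.contracts += n
    # ...update contract.current_price and market counters here as well...
    ndb.put_multi([user, contract])

Note that this serializes every write in the app through one entity group, so it only suits low write rates.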
Shanyu, you mentioned the distributed transaction layer that lets you operate across arbitrarily many entity groups in a single transaction. It actually has been released, it just hasn't been advertised very loudly. It was designed and written by Daniel Wilkerson and Erick Armbrust, with some consulting on my part. Dan describes it in this talk.
Nick Johnson has also described how to do "transfer"-type operations across entity groups, similar to what you describe. It's not as general-purpose as tapioca-orm, but it's simpler and lighter weight.
There's a related built-in feature, transactional tasks, that lets you add a task to a queue within a datastore transaction, such that it will only be enqueued if the transaction commits successfully. That task can then do more datastore operations, including a transaction on a different entity group. It's not as strong as Dan and Erick's solution, but it does give you guaranteed eventual consistency across entity groups, which is good enough for many use cases, without the extra overhead.
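A sketch of that pattern (the handler URL and model fields are hypothetical):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

@ndb.transactional
def purchase(user_key, cost):
    user = user_key.get()
    user.cash -= cost
    user.put()
    # Enqueued only if this transaction commits; the handler can then run
    # its own transaction on a different entity group (e.g. the market).
    taskqueue.add(url='/tasks/record_transaction', transactional=True)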
In response to your questions: 1) You're not required to use both reference properties and parent/child relationships (i.e., entity groups). That guideline just means that entity groups limit datastore write throughput, since writes are serialized per entity group; you should be aware of that if you're considering structuring your data into entity groups just for ancestor queries.
2) An entity can't have more than one parent. If you want to model a many-to-many relationship, you should generally use a ListProperty of reference properties (i.e., keys). See this article and this talk for details.
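For completeness, the ndb analogue of that ListProperty-of-keys pattern (the kinds are illustrative):

from google.appengine.ext import ndb

class Contract(ndb.Model):
    # A repeated KeyProperty is ndb's analogue of db's ListProperty(db.Key).
    holders = ndb.KeyProperty(kind='User', repeated=True)

user_key = ndb.Key('User', 'alice')  # hypothetical
# Equality against a repeated property matches any element of the list:
mine = Contract.query(Contract.holders == user_key).fetch()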