I am using App Engine with Python 2.7 and the webapp2 framework. I am not using ndb.Model.
I have the following model:
class Story(db.Model):
    name = db.StringProperty()

class UserProfile(db.Model):
    name = db.StringProperty()
    user = db.UserProperty()

class Tracking(db.Model):
    user_profile = db.ReferenceProperty(UserProfile)
    story = db.ReferenceProperty(Story)
    upvoted = db.BooleanProperty()
    flagged = db.BooleanProperty()
A user can upvote and/or flag a story but only once. Hence I came up with the above model.
Now when a user clicks on the upvote link, I try to check in the database whether the user has already voted, so I do the following:
get the user instance from his id: up = db.get(db.Key.from_path('UserProfile', uid))
then get the story instance from its id: s_ins = db.get(db.Key.from_path('Story', sid))
Now it is time to check whether a Tracking entity based on these two exists; if yes, don't allow voting, else allow the user to vote and update the Tracking instance.
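In code, the check I have in mind is roughly this sketch (using the models above, ignoring concurrency for now):

    tracking = (Tracking.all()
                .filter('user_profile =', up)
                .filter('story =', s_ins)
                .get())
    if tracking is None:
        # no record yet, so the vote is allowed
        Tracking(user_profile=up, story=s_ins, upvoted=True, flagged=False).put()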
What is the most convenient way to fetch a Tracking instance given the ids (key().id()) of the user_profile and the story?
What is the most convenient way to save a Tracking model given a user profile id and a story id?
Is there a better way to implement tracking?
You can try tracking using lists of keys instead of having a separate Tracking entry per user/story pair:
class Story(db.Model):
    name = db.StringProperty()

class UserProfile(db.Model):
    name = db.StringProperty()
    user = db.UserProperty()

class Tracking(db.Model):
    story = db.ReferenceProperty(Story)
    upvoted = db.ListProperty(db.Key)
    flagged = db.ListProperty(db.Key)
So when you want to see if a user upvoted for a given story:
(Tracking.all(keys_only=True)
         .filter('story =', db.Key.from_path('Story', sid))
         .filter('upvoted =', db.Key.from_path('UserProfile', uid))
         .get())
Now the only problem here is that the upvoted/flagged lists can't grow too large (I think the limit is 5,000 entries), so you'd have to write code to manage this (that is, when adding to the upvoted/flagged lists, detect whether the cap has been reached, and if so, start a new Tracking object to hold additional values). You will also have to make this transactional, and with the HR datastore you have a roughly one-write-per-second limit per entity group. This may or may not be an issue depending on your expected use case. A way around the write limit would be to implement upvotes/flags using pull queues and have a cron job that pulls and batch-updates Tracking objects as needed.
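A rough sketch of that shard management, assuming a MAX_LIST cap and glossing over the race where two requests create a fresh Tracking shard at the same moment:

    MAX_LIST = 1000  # assumed per-entity cap on the list

    def add_upvote(story_key, user_key):
        # Non-ancestor queries can't run inside a db transaction,
        # so scan the existing shards first.
        shards = Tracking.all().filter('story =', story_key).fetch(100)
        target = None
        for shard in shards:
            if user_key in shard.upvoted:
                return False  # already voted
            if target is None and len(shard.upvoted) < MAX_LIST:
                target = shard

        if target is None:
            # no shard with room; start a new one for this story
            Tracking(story=story_key, upvoted=[user_key], flagged=[]).put()
            return True

        def txn(key):
            fresh = Tracking.get(key)
            if user_key not in fresh.upvoted:
                fresh.upvoted.append(user_key)
                fresh.put()

        db.run_in_transaction(txn, target.key())
        return True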
This method has its pros and cons. The most obvious cons are the ones I just listed. The pros, however, may be worth it. You can get a full list of users who upvoted/flagged a story from a single list (or a few, depending on how popular the story is), with far fewer queries to the datastore. This method should also take less storage, index, and metadata space. Additionally, adding a user to a Tracking object is cheaper: instead of writing a new object plus 2 writes for each property, you would just be charged 1 write for the object plus 2 writes for the list entry (9 vs. 3 writes for adding a user to an already-tracked story, or 9 vs. 7 for untracked stories).
What you propose sounds reasonable.
Don't use the App Engine generated key for Tracking. Because the combination of story/user should be unique, create your own key as a combination of the two. Something like:

tracking = Tracking.get_or_insert("%s-%s" % (story.key().id(), user.key().id()), **params)
If you know the story/user, then you can always fetch the tracking by key name.
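For example (a sketch; get_by_key_name is the standard db lookup by key name):

    key_name = "%s-%s" % (story.key().id(), user.key().id())
    tracking = Tracking.get_by_key_name(key_name)
    already_voted = tracking is not None and tracking.upvoted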
I have an idea for a data model in Django and I was wondering if someone can point out the pros and cons of these two setups.
Setup 1: This would be the obvious one. Using CharFields for each field of each object:
class Person(models.Model):
    name = models.CharField(max_length=255)
    surname = models.CharField(max_length=255)
    city = models.CharField(max_length=255)
Setup 2: This is the one I am thinking about. Using a ForeignKey to objects that contain the values the current object should have:
class Person(models.Model):
    name = models.ForeignKey('Name')
    surname = models.ForeignKey('Surname')
    city = models.ForeignKey('City')

class Chars(models.Model):
    value = models.CharField(max_length=255)

    def __str__(self):
        return self.value

    class Meta:
        abstract = True

class Name(Chars): pass
class Surname(Chars): pass
class City(Chars): pass
So in setup 1, I would create an Object with:
Person.objects.create(name='Name', surname='Surname', city='City')
and each object would have its own data. In setup 2, I would have to do this:
_name = Name.objects.get_or_create(value='Name')[0]
_surname = Surname.objects.get_or_create(value='Surname')[0]
_city = City.objects.get_or_create(value='City')[0]
Person.objects.create(name=_name, surname=_surname, city=_city)
Question: The main purpose of this would be to reuse existing values for multiple objects, but is it worth doing when you consider that you need multiple database hits to create an object?
Choosing the correct design pattern for your application is a very wide area which is influenced by many factors that are even possibly out of scope in a Stack Overflow question. So in a sense your question could be a bit subjective and too broad.
Nevertheless, I would say that assigning a separate model (class) for the first name, another separate one for the last name, etc. is overkill. You might end up overengineering your app.
The main reasoning behind the above recommendation is that you probably do not want to treat a name as a separate entity and possibly attach additional properties to it. Unless you really would need such a feature, a name is usually a plain string that some users happen to have identical.
It does no good to keep name and surname as separate objects/models/DB tables. In your setup, if you don't set name and surname as unique, it makes no sense to put them in a separate model. Even worse, it will incur additional DB work and decrease performance. And if you do set them as unique, then you have to deal with the situation where, e.g., some user changes his name and by default it would change for all users with that name.
On the other hand, for city there are not that many cities, so it's a good idea to keep it as a separate object and refer to it via a foreign key from the user. This saves disk space and makes it easy to get all users from the same city. Even better, you can prepopulate the cities table and provide autocompletion for users entering their city. Though for performance you might still want to keep city as a string on the user model.
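A sketch of that shape, in Django 1.x style to match the question (model names assumed):

    class City(models.Model):
        name = models.CharField(max_length=255, unique=True)

    class Person(models.Model):
        name = models.CharField(max_length=255)     # plain strings
        surname = models.CharField(max_length=255)
        city = models.ForeignKey(City)              # shared, prepopulated table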
Also, regarding a 'gender' field: since there are only a few possible values for this data, it's worth using an enumeration in your code and storing the value in the DB, i.e. use choices instead of a ForeignKey to a separate DB table.
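For example (the field and choice values are illustrative):

    class Person(models.Model):
        GENDER_CHOICES = (
            ('M', 'Male'),
            ('F', 'Female'),
        )
        gender = models.CharField(max_length=1, choices=GENDER_CHOICES)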
We are about to introduce a social aspect into our app, where users can like each other's events.
Getting this wrong would mean a lot of headache later on, hence I would love to get input from some experienced developers on GAE, how they would suggest to model it.
It seems there is a similar question here; however, the OP didn't provide any code to begin with.
Here are two models:
class Event(ndb.Model):
    user = ndb.KeyProperty(kind='User', required=True)  # string kind, since User is defined below
    time_of_day = ndb.DateTimeProperty(required=True)
    notes = ndb.TextProperty()
    timestamp = ndb.FloatProperty(required=True)

class User(UserMixin, ndb.Model):
    firstname = ndb.StringProperty()
    lastname = ndb.StringProperty()
We need to know who has liked an event, in case the user later wants to unlike it. Hence we need to keep a reference. But how?
One way would be to introduce a repeated KeyProperty on the Event class:
class Event(ndb.Model):
    ...
    liked_by = ndb.KeyProperty(kind='User', repeated=True)  # property name assumed
Any user who likes this Event would be stored in this list, and the number of keys in the list gives the number of likes for the event.
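Usage would then look something like this (assuming the liked_by name above):

    if user_key not in event.liked_by:
        event.liked_by.append(user_key)
        event.put()
    num_likes = len(event.liked_by)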
Theoretically that should work. However this post from the creator of Python worries me:
Do not use repeated properties if you have more than 100-1000 values. (1000 is probably already pushing it.) They weren't designed for such use.
And back to square one. How am I supposed to design this?
A repeated property is limited in the number of values it can hold (< 1,000 or so).
One recommended way to break the limit is sharding:
class Event(ndb.Model):
    # use an integer to store the total number of likes
    likes = ndb.IntegerProperty()

class EventLikeShard(ndb.Model):
    # each shard only stores up to 500 users
    event = ndb.KeyProperty(kind=Event)
    users = ndb.KeyProperty(kind=User, repeated=True)
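A sketch of adding a like under this scheme (the function and cap names are mine; the cross-group transaction covers both the Event counter and the shard):

    MAX_PER_SHARD = 500  # matches the comment above

    @ndb.transactional(xg=True)  # Event and shard are separate entity groups
    def _append_like(event_key, shard_key, user_key):
        event = event_key.get()
        shard = shard_key.get() if shard_key else EventLikeShard(event=event_key)
        if user_key in shard.users:
            return False  # already liked
        shard.users.append(user_key)
        event.likes = (event.likes or 0) + 1
        ndb.put_multi([event, shard])
        return True

    def add_like(event_key, user_key):
        # Queries can't run inside the transaction, so find a shard first.
        dup = EventLikeShard.query(EventLikeShard.event == event_key,
                                   EventLikeShard.users == user_key).get(keys_only=True)
        if dup:
            return False  # already liked
        shard = EventLikeShard.query(EventLikeShard.event == event_key).get()
        shard_key = shard.key if (shard and len(shard.users) < MAX_PER_SHARD) else None
        return _append_like(event_key, shard_key, user_key)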
If the limit you need is more than 1,000 but less than about 100k, a simpler way:
class Event(ndb.Model):
    likers = ndb.PickleProperty(compressed=True)
Use another model, Like, where you keep a reference to both the user and the event.
This is the old way of representing a many-to-many relationship in a relational manner. You keep all entities separate and can easily add/remove/count.
I would recommend the usual many-to-many relationship using an EventUser model, given that the design seems to require an unlimited number of users liking an event. The only tricky part is that you must ensure the event/user combination is unique, which can be done using _pre_put_hook. Keeping a likes counter, as proposed by @lucemia, is indeed a good idea.
You would then capture the liked action using a boolean, or you can make it a bit more flexible by including an actions string array. That way, you could also capture actions such as signed-up or attended.
Here is some sample code:
class EventUser(ndb.Model):
    event = ndb.KeyProperty(kind=Event, required=True)
    user = ndb.KeyProperty(kind=User, required=True)
    actions = ndb.StringProperty(repeated=True)

    # make sure event/user is unique
    def _pre_put_hook(self):
        cur_key = self.key
        for entry in self.query(EventUser.user == self.user,
                                EventUser.event == self.event):
            # If cur_key has an id, the user is performing an update
            if cur_key is not None and cur_key.id():
                if cur_key == entry.key:
                    continue
                else:
                    raise ValueError("User '%s' is a duplicated entry." % self.user)
            # If adding
            raise ValueError("User Add '%s' is a duplicated entry." % self.user)
I'm pretty useless when it comes to queries; I'm wondering what the correct structure is for this problem.
Clients are sent data including the key of the object; they use the key to tell the server what the most recent object they downloaded was.
I want to get all objects since that point, the objects have an automatic date attribute.
Additionally, I want to be able to give the 15 (or so) most recent objects to new users who may request using a specific 'new user' key or something similar.
Using the Python 2.7 runtime; I've never used GQL before.
Any help is greatly appreciated.
The Model Class is this:
class Message(db.Model):
    user = db.StringProperty()
    content = db.TextProperty()
    colour = db.StringProperty()
    room = db.StringProperty()
    date = db.DateTimeProperty(auto_now_add=True)
If it is a db.Key object or a string representation of a key using the db (as opposed to the ndb) API:
last_message = Message.get(lastkey)
If you have the key in another representation, such as the key name:
last_message = Message.get_by_key_name(lastkey)
If you have the key as the numeric ID of the object:
last_message = Message.get_by_id(int(lastkey))
Then, you can get the messages since that last message as follows:
messages_since_last_message = Message.all().filter('date >', last_message.date).order('date')
#OR GQL:
messages_since_last_message = Message.gql("WHERE date > :1 ORDER BY date ASC", last_message.date)
You should perhaps use the >= comparator instead, because multiple messages may arrive at the same exact time; then filter out of the result list all messages up to and including the last key you are looking for (whether this matters depends on your use case and how closely together messages can be written). Additionally, with the High Replication datastore, queries are only eventually consistent, so your query is not guaranteed to accurately reflect the datastore unless you use ancestor queries; in that case you limit your entity group to roughly one write per second, which again, depending on your use case, could be a non-issue. An entity group here consists of a root (parent) entity and all of its descendants, which are stored together.
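For brand-new clients with no last key (the 'new user' case in the question), a sketch:

    latest = Message.all().order('-date').fetch(15)
    # or with GQL:
    latest = Message.gql("ORDER BY date DESC").fetch(15)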
First a little setup. Last week I was having trouble implementing a specific methodology I had constructed to manage two unique fields associated with one db.Model object. Since this isn't possible, I created a parent entity class and a child entity class, each having one of the unique values assigned as its key_name. You can find my previous question here, which includes my sample code and a general explanation of my insertion process.
On my original question, someone commented that my solution would not solve my problem of needing two unique fields associated with one db.Model object.
My implementation tried to solve this problem with a static method that creates a ParentEntity whose key_name is assigned one of my unique values. In step two of my process I create a child entity and assign the parent entity to the parent parameter. Both steps are executed within a db transaction, so I assumed this would force the uniqueness constraint to work, since both of my values were stored within two separate key_name fields across two separate models.
The commenter pointed out that this solution would not work because when you set a parent to a child entity, the key_name is no longer unique across the entire model but, instead, is unique across the parent-child entries. Bummer...
I believe that I could solve this new problem by changing how these two models are associated with one another.
First, I create a parent object as mentioned above. Next, I create a child entity and assign my second unique value to its key_name. The difference is that the second entity has a reference property to the parent model. My first entity is assigned to the reference property, but not to the parent parameter. This does not force a one-to-one reference, but it does keep both of my values unique, and I can manage the one-to-one nature of these objects so long as I control the insertion process from within a transaction.
This new solution is still problematic. According to the GAE datastore documentation, you cannot execute multiple db updates in one transaction if the entities involved are not in the same entity group. Since I no longer make my first entity the parent of the second, they are no longer part of the same entity group and cannot be inserted within the same transaction.
I'm back to square one. What can I do to solve this problem? Specifically, what can I do to enforce two unique values associated with one Model entity? As you can see, I am willing to get a bit creative. Can this be done? I know this will involve an out-of-the-box solution, but there has to be a way.
Below is my original code from my question I posted last week. I've added a few comments and code changes to implement my second attempt at solving this problem.
class ParentEntity(db.Model):
    str1_key = db.StringProperty()
    str2 = db.StringProperty()

    @staticmethod
    def InsertData(string1, string2, string3):
        try:
            def txn():
                # create first entity
                prt = ParentEntity(
                    key_name=string1,
                    str1_key=string1,
                    str2=string2)
                prt.put()

                # create User Account entity
                child = ChildEntity(
                    key_name=string2,
                    #parent=prt,  # prt was previously the parent of child
                    parentEnt=prt,
                    str1=string1,
                    str2_key=string2,
                    str3=string3)
                child.put()
                return child
            # This should give me an error, b/c these two entities are
            # no longer in the same entity group. :(
            db.run_in_transaction(txn)
        except Exception, e:
            raise e

class ChildEntity(db.Model):
    # foreign and primary key values
    str1 = db.StringProperty()
    str2_key = db.StringProperty()

    # This is no longer a "parent" but a reference
    parentEnt = db.ReferenceProperty(reference_class=ParentEntity)

    # pertinent data below
    str3 = db.StringProperty()
The system you describe will work, at the cost of transactionality. Note that the second entity is no longer a child entity - it's just another entity with a ReferenceProperty.
This solution may be sufficient to your needs - for instance, if you need to enforce that every user has a unique email address, but this is not your primary identifier for a user, you can insert a record into an 'emails' table first, then if that succeeds, insert your primary record. If a failure occurs after the first operation but before the second, you have an email address associated with no record. You can simply ignore this, or timestamp the record and allow it to be reclaimed after some period of time (for example, 30 seconds, the maximum length of a frontend request).
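A sketch of that two-step pattern (the model and function names here are illustrative, not part of the answer):

    class EmailClaim(db.Model):
        # key_name is the email address itself, so two inserts of
        # the same address collide on the key
        created = db.DateTimeProperty(auto_now_add=True)

    def claim_email(email):
        def txn():
            if EmailClaim.get_by_key_name(email):
                return False
            EmailClaim(key_name=email).put()
            return True
        return db.run_in_transaction(txn)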
If your requirements on transactionality and uniqueness are stronger than that, there are other options with increasing levels of complexity, such as implementing some form of distributed transactions, but it's unlikely you'll actually need that. If you can tell us more about the nature of the records and the unique keys, we may be able to provide more detailed suggestions.
After scratching my head a bit, last night I decided to go with the following solution. I assume this still carries a bit of undesirable overhead for many scenarios; however, I think the overhead may be acceptable for my needs.
The code posted below is a further modification of the code in my question. Most notably, I've created another Model class called EGEnforcer (which stands for Entity Group Enforcer).
The idea is simple. If a transaction can only update multiple records when they belong to one entity group, I must find a way to associate each of the records containing my unique values with the same entity group.
To do this, I create an EGEnforcer entry when the application initially starts. Then, when the need arises to make a new entry in my models, I query the EGEnforcer for the record associated with my paired models. After I get my EGEnforcer record, I make it the parent of both records. Voilà! My data is now all associated with the same entity group.
Since the key_name parameter is unique only within a parent/kind grouping, this should enforce my uniqueness constraints, because all of my FirstEntity (previously ParentEntity) entries will have the same parent. Likewise, my SecondEntity (previously ChildEntity) entries should also have unique key_name values, because their parent is also always the same.
Since both entities also have the same parent, I can execute these entries within the same transaction. If one fails, they all fail.
# My new class containing a unique entry for each pair of models
# associated with one another.
class EGEnforcer(db.Model):
    KEY_NAME_EXAMPLE = 'arbitrary unique value'

    @staticmethod
    def setup():
        '''This only needs to be called once for the lifetime of the
        application. setup() inserts a record into EGEnforcer that will
        be used as a parent for FirstEntity and SecondEntity entries.'''
        ege = EGEnforcer.get_or_insert(EGEnforcer.KEY_NAME_EXAMPLE)
        return ege

class FirstEntity(db.Model):
    str1_key = db.StringProperty()
    str2 = db.StringProperty()

    @staticmethod
    def InsertData(string1, string2, string3):
        try:
            def txn():
                ege = EGEnforcer.get_by_key_name(EGEnforcer.KEY_NAME_EXAMPLE)
                prt = FirstEntity(
                    key_name=string1,
                    parent=ege)  # Our EGEnforcer record.
                prt.put()

                child = SecondEntity(
                    key_name=string2,
                    parent=ege,  # Our EGEnforcer record.
                    parentEnt=prt,
                    str1=string1,
                    str2_key=string2,
                    str3=string3)
                child.put()
                return child
            # This works because our entities are now part of the same
            # entity group.
            db.run_in_transaction(txn)
        except Exception, e:
            raise e

class SecondEntity(db.Model):
    # foreign and primary key values
    str1 = db.StringProperty()
    str2_key = db.StringProperty()

    # This is no longer a "parent" but a reference
    parentEnt = db.ReferenceProperty(reference_class=FirstEntity)

    # Other data...
    str3 = db.StringProperty()
One quick note: Nick Johnson pinpointed my need for this solution:
This solution may be sufficient to your needs - for instance, if you need to enforce that every user has a unique email address, but this is not your primary identifier for a user, you can insert a record into an 'emails' table first, then if that succeeds, insert your primary record.
This is exactly what I need, but my solution is obviously a bit different from your suggestion. My method allows the transaction to either completely succeed or completely fail. Specifically, when a user creates an account, they first log in to their Google account. Next, they are forced to the account creation page if there is no entry associated with their Google account in SecondEntity (which is actually UserAccount in my real scenario). If the insertion process fails, they are redirected to the creation page with the reason for the failure.
This could be because their ID is not unique or, potentially, because of a transaction timeout. If there is a timeout on the insertion of a new user account I will want to know about it, and I will implement some form of checks and balances in the near future. For now I simply want to go live, but this uniqueness constraint is an absolute necessity.
Since my approach is strictly for account creation, and my user account data will not change once created, I believe this should work and scale well for quite a while. I'm open to comments if this is incorrect.
Original Design
Here's how I originally had my Models set up:
class UserData(db.Model):
    user = db.UserProperty()
    favorites = db.ListProperty(db.Key)  # list of story keys
    # ...

class Story(db.Model):
    title = db.StringProperty()
    # ...
On every page that displayed a story I would query UserData for the current user:
user_data = UserData.all().filter('user =', users.get_current_user()).get()
story_is_favorited = (story in user_data.favorites)
New Design
After watching this talk: Google I/O 2009 - Scalable, Complex Apps on App Engine, I wondered if I could set things up more efficiently.
class FavoriteIndex(db.Model):
    favorited_by = db.StringListProperty()
The Story model is the same, but I got rid of the UserData model. Each instance of the new FavoriteIndex model has a Story instance as its parent, and each FavoriteIndex stores a list of user ids in its favorited_by property.
If I want to find all of the stories that have been favorited by a certain user:
index_keys = FavoriteIndex.all(keys_only=True).filter(
    'favorited_by =', users.get_current_user().user_id())
story_keys = [k.parent() for k in index_keys]
stories = db.get(story_keys)
This approach avoids the serialization/deserialization that's otherwise associated with the ListProperty.
Efficiency vs Simplicity
I'm not sure how efficient the new design is, especially after a user decides to favorite 300 stories, but here's why I like it:
1. A favorited story is associated with a user, not with her user data.
2. On a page where I display a story, it's pretty easy to ask the story if it's been favorited (without calling up a separate entity filled with user data):

    fav_index = FavoriteIndex.all().ancestor(story).get()
    fav_of_current_user = users.get_current_user().user_id() in fav_index.favorited_by

3. It's also easy to get a list of all the users who have favorited a story (using the method in #2).
Is there an easier way?
Please help. How is this kind of thing normally done?
What you've described is a good solution. You can optimise it further, however: for each favorite, create a 'UserFavorite' entity as a child entity of the relevant Story entry (or, equivalently, as a child entity of a UserInfo entry), with the key name set to the user's unique ID. This way, you can determine if a user has favorited a story with a simple get:

UserFavorite.get_by_key_name(user_id, parent=a_story)
get operations are 3 to 5 times faster than queries, so this is a substantial improvement.
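A minimal sketch of that pattern (the UserFavorite model and helpers are assumed, not from the question):

    class UserFavorite(db.Model):
        # key_name = the user's unique ID; parent = the Story entity
        created = db.DateTimeProperty(auto_now_add=True)

    def add_favorite(story, user_id):
        UserFavorite(key_name=user_id, parent=story).put()

    def is_favorited(story, user_id):
        return UserFavorite.get_by_key_name(user_id, parent=story) is not None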
I don't want to tackle your actual question, but here's a very small tip: you can replace this code:
if story in user_data.favorites:
    story_is_favorited = True
else:
    story_is_favorited = False
with this single line:
story_is_favorited = (story in user_data.favorites)
You don't even need to put the parentheses around the story in user_data.favorites if you don't want to; I just think that's more readable.
You can make the FavoriteIndex work like a join between the two models:
class FavoriteIndex(db.Model):
    user = db.UserProperty()
    story = db.ReferenceProperty()
or
class FavoriteIndex(db.Model):
    user = db.UserProperty()
    story = db.StringListProperty()
Then your query by user returns one FavoriteIndex object for each story the user has favorited.
You can also query by story to see how many users have favorited it.
You don't want to be scanning through anything unless you know it is limited to a small size.
With your new design you can look up whether a user has favorited a certain story with a query.
You don't need the UserFavorite class entities.
It is a keys_only query, so not as fast as a get(key), but faster than a normal query.
The FavoriteIndex entities all have the same key_name='favs', so you can filter based on __key__:
a_story = ......
a_user_id = users.get_current_user().user_id()
favIndexKey = db.Key.from_path('Story', a_story.key().id_or_name(),
                               'FavoriteIndex', 'favs')
doesFavStory = FavoriteIndex.all(keys_only=True).filter(
    '__key__ =', favIndexKey).filter('favorited_by =', a_user_id).get()
If you use multiple FavoriteIndex entities as children of a Story, you can use the ancestor filter:

doesFavStory = FavoriteIndex.all(keys_only=True).ancestor(a_story).filter(
    'favorited_by =', a_user_id).get()