Minimize subqueries with IN queries on App Engine (Python)

Is there any clever way to avoid making a costly query with an IN clause in cases like the following one?
I'm using Google App Engine to build a Facebook application and at some point I (obviously) need to query the datastore to get all the entities that belong to any of the facebook friends of the given user.
Suppose I have a couple of entities modeled as such:
class Thing(db.Model):
    owner = db.ReferenceProperty(reference_class=User, required=True)
    owner_id = db.StringProperty(required=True)
    ...
and
class User(db.Model):
    id = db.StringProperty(required=True)
    ...
At some point I query Facebook to get the list of friends of a given user and I need to perform the following query
# get all Thing instances that belong to friends
query = Thing.all()
query.filter('owner_id IN', friend_ids)
If I did that, AppEngine would perform a subquery for each id in friend_ids, probably exceeding the maximum number of subqueries any query can spawn (30).
Is there any better way to do this (i.e. minimizing the number of queries)?
I understand that there are no relations and joins using the datastore but, in particular, I would consider adding new fields to the User or Thing class if it helps in making things easier.

I don't think there's an elegant solution, but you could try this:
On the User model, use Facebook ID as the key name, and store each user's list of things in a ListProperty.
class Thing(db.Model):
    ...

class User(db.Model):
    things = db.ListProperty(db.Key)
    ...
Entity creation would go like this:
user = User.get_or_insert(my_facebook_id)
thing = Thing()
thing.put()
user.things.append(thing.key())
user.put()
Retrieval takes 2 queries:
friends = User.get_by_key_name(friend_ids)
thing_keys = []
for friend in friends:
    if friend:  # get_by_key_name returns None for ids with no User entity
        thing_keys.extend(friend.things)
things = db.get(thing_keys)

This Google I/O talk by Brett Slatkin addresses the exact situation you're dealing with. See also his follow up talk this year.
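If you do end up keeping the IN filter, you can at least stay under the 30-subquery cap by splitting the id list into batches and issuing one query per batch. A minimal sketch; the chunking helper is plain Python, and the commented-out datastore loop assumes the Thing model and friend_ids list from the question:

```python
def chunks(seq, size=30):
    """Split a sequence into lists of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

# Hypothetical datastore usage (only runnable on App Engine):
# things = []
# for batch in chunks(friend_ids):
#     query = Thing.all().filter('owner_id IN', batch)
#     things.extend(query.fetch(1000))
```

This still costs one datastore round trip per 30 friends, so for large friend lists the ListProperty approach above stays cheaper.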

Related

Calculate weighted score from Salesforce data in Django

I'm looking to connect my website with Salesforce and have a view that shows a breakdown of a user's activities in Salesforce, then calculates an overall score based on weights assigned to each activity. I'm using Django-Salesforce to initiate the connection and extend the Activity model, but I'm not sure I've set up the Activity or OverallScore classes correctly.
Below is the code for what I already have. Based on other similar questions I've seen, a custom save method seems to be the suggested solution, but my concern is that my database would quickly become massive, as the connection will refresh every 5 minutes.
The biggest question I have is how to setup the "weighted_score" attribute of the Activity class, as I doubt what I have currently is correct.
class Activity(salesforce.models.Model):
    owner = models.ManyToManyField(Profile)
    name = models.CharField(verbose_name='Name', max_length=264, unique=True)
    weight = models.DecimalField(verbose_name='Weight', decimal_places=2, default=0)
    score = models.IntegerField(verbose_name='Score', default=0)
    weighted_score = weight*score

    def __str__(self):
        return self.name

class OverallScore(models.Model):
    factors = models.ManyToManyField(Activity)
    score = Activity.objects.aggregate(Sum('weighted_score'))

    def __str__(self):
        return "OverallScore"
The ideal end result would be each user logged in gets a "live" look at their activity scores and one overall score which is refreshed every 5 minutes from the Salesforce connection, then at the end of the day I would run a cron job to save the end of day results to the database.
Excuse a late, partial response addressing only the parts of the question that are clear.
How to implement the arithmetic on fields in weighted_score depends on whether you prefer an expression on the Django side or on the Salesforce side.
The easiest, but very limited, solution is the @property decorator on a method.
class Activity(salesforce.models.Model):
    ...  # the same fields

    @property
    def weighted_score(self):
        return self.weight * self.score
This can be used in Python code as self.weighted_score, but it cannot be passed to SOQL in any way, and it gives you no more power than writing the longer (self.weight * self.score) in the same place.
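To make the limitation concrete, here is a plain-Python stand-in for the model (no Django or Salesforce involved) showing that the property can only be evaluated object by object, after the rows have already been fetched:

```python
from decimal import Decimal

class Activity:
    """Plain-Python stand-in for the Django model above."""
    def __init__(self, weight, score):
        self.weight = Decimal(weight)
        self.score = score

    @property
    def weighted_score(self):
        return self.weight * self.score

# The overall score has to be summed in Python, because the property
# is invisible to the ORM and to SOQL:
activities = [Activity('0.5', 10), Activity('2', 3)]
overall = sum(a.weighted_score for a in activities)  # Decimal('11.0')
```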
Salesforce SOQL does not support arithmetic expressions in the SELECT clause, but you can define a custom "Formula" field in the Salesforce setup of the Activity object and use it as a normal read-only numeric field in Django. If Activity were the detail side of a Master-Detail Relationship to another Salesforce object, you could apply a very fast SUM, MAX or AVG roll-up formula on that object.
A ManyToMany field requires you to create the binding object in Salesforce Setup manually and to assign it to the through attribute of the ManyToManyField. An example is on the wiki under Foreign Key Support. As a rule of thumb, your object definition must first exist in Salesforce with useful relationships (Master-Detail Relationship or Lookup Relationship) and a manageable data structure. Then you can run python manage.py inspectdb --database=salesforce followed by table names (optionally a list of API names of the used tables, separated by spaces). That produces a lot of code, and you will prune many unnecessary fields and choices, but it is still easier and more reliably functional than asking someone. Salesforce has no special form of custom ManyToMany relationship, therefore everything is expressed by ForeignKey in models.py; Master-Detail is only a comment on the ForeignKey. You can finally create a ManyToManyField manually, but it is mainly syntactic sugar to get a nice mnemonic name for forward and reverse traversal over the two foreign keys on the "through=" binding object.
(The rest of question was too broad and unclear for me.)

Google Cloud Datastore real-world example of a model in Python

I'm new to Google Datastore and Python, but I have a project using them. Even with the great Google documentation, I miss a real-world example of modeling data. So here is a part of my project, a proposed model, and some questions about it.
I'm sure you can help me understand Datastore more clearly, and I think these questions can help beginners like me see how to model data to build a great application!
A soccer feed contains some general information about the match itself, such as: the name of the competition it belongs to, the pool name, the season, the matchday, and the winning team.
For each team, the winner and the loser, we have the detail of the actions that occurred during the match: cards and goals.
For the cards, we have this information: the color, the period it occurred in, a player id, the reason, and the time it occurred.
For the goals, we have the period, a player id, the time, and an assisting player id.
We also have, for each team, the player names, their positions (center, middle, ...), and dates of birth.
Here is the model I would like to use to ingest data from the soccer feed into the Datastore using Python:
I have some entities: Team, Player, Match, TeamData, Card and Goal.
For each match, we will have two TeamData entities, one for each team, with the detail of the actions (cards and goals).
I used KeyProperty between TeamData and Match and between Card/Goal and TeamData, but I think I could use a parent relationship; I don't know which is best.
# Note: ndb has no ReferenceProperty (that is the old db API), so KeyProperty
# is used throughout; types for the fields left untyped in the original are guesses.
class Team(ndb.Model):
    name = ndb.StringProperty()

class Player(ndb.Model):
    teamKey = ndb.KeyProperty(kind=Team)
    name = ndb.StringProperty()
    date_of_birth = ndb.DateProperty()
    position = ndb.StringProperty()

class Match(ndb.Model):
    name_compet = ndb.StringProperty()
    round = ndb.StringProperty()
    season = ndb.StringProperty()
    matchday = ndb.IntegerProperty()
    team1Key = ndb.KeyProperty(kind=Team)
    team2Key = ndb.KeyProperty(kind=Team)
    winning_teamKey = ndb.KeyProperty(kind=Team)

class TeamData(ndb.Model):
    matchKey = ndb.KeyProperty(kind=Match)
    score = ndb.IntegerProperty()
    side = ndb.StringProperty(choices=['home', 'away'])
    teamKey = ndb.KeyProperty(kind=Team)

class Card(ndb.Model):
    teamdataKey = ndb.KeyProperty(kind=TeamData)
    playerKey = ndb.KeyProperty(kind=Player)
    color = ndb.StringProperty()
    period = ndb.StringProperty()
    reason = ndb.StringProperty()
    time = ndb.StringProperty()
    timestamp = ndb.DateTimeProperty()

class Goal(ndb.Model):
    teamdataKey = ndb.KeyProperty(kind=TeamData)
    period = ndb.StringProperty()
    playerKey = ndb.KeyProperty(kind=Player)
    time = ndb.StringProperty()
    type = ndb.StringProperty()
    assistantplayerKey = ndb.KeyProperty(kind=Player)
Here are my questions:
Is this model "correct", and does it allow basic queries (which teams played on a certain day; what is the result, with the detail of cards and goals (player, assistant, reason, time), for a certain match)
and more complex queries (how many goals did a certain player score in a certain season)?
I don't really see the difference between an SQL database and a NoSQL database such as Datastore, except that the Datastore deals with the keys and not us. Can you explain clearly what advantage I get from this NoSQL model?
Thank you for helping me !
NoSQL makes it WAY faster, and not dependent on the size of the data scanned. For a 3-terabyte table in SQL, no matter what you return, it'll take the same "computation time" server side. In Datastore, since it scans DIRECTLY where you need via its indexes, the size of the RETURNED rows/columns actually dictates the time it takes.
On the other hand, it takes a bit more time to save (since it needs to write to multiple indexes), and it CANNOT do server-side computations. For instance, with the Datastore, you can't SUM or AVERAGE. The Datastore ONLY scans and returns; that's why it's so fast. It was never intended to do calculations on your behalf (so the answer to "can it do more complex queries?" is no, but that's not your model, that's the Datastore). One thing that could help with these kinds of sums is to keep a counter in a separate entity and update it as needed (have another entity "totalGoals" with "keyOfPlayer" and "numberOfGoals").
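The counter idea can be sketched without any App Engine code; below, a dict stands in for the Datastore, and the entity and property names (totalGoals, keyOfPlayer, numberOfGoals) are the hypothetical ones from the paragraph above:

```python
class TotalGoals:
    """Stand-in for a hypothetical 'totalGoals' Datastore entity."""
    def __init__(self, key_of_player):
        self.key_of_player = key_of_player
        self.number_of_goals = 0

datastore = {}  # stand-in for the real Datastore

def record_goal(player_key):
    # In ndb this would be a transactional get_or_insert() plus put();
    # the dict plays that role here.
    counter = datastore.setdefault(player_key, TotalGoals(player_key))
    counter.number_of_goals += 1

record_goal('player-42')
record_goal('player-42')
record_goal('player-7')
# datastore['player-42'].number_of_goals is now 2
```

Reading a player's season total is then a single get by key instead of a scan over every Goal entity.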
One thing worth mentioning is eventual consistency. In SQL, when you insert, the data is in the table and can be retrieved immediately. In the Datastore, consistency is not immediate (because it needs to copy to different indexes, you can't know WHEN the insert is completely done). There are ways to force consistency: ancestor queries are one of them, as is querying directly by key, or opening your Datastore viewer.
Another thing, even if it won't affect you (in the same spirit of providing a question for other beginners, I try to include as much as I can think of), is that ancestor queries, to make them safe, actually FREEZE the entity group they are using (entity group = parent + children + children of children + etc.) when you query one.
Other questions? Refer to the docs about entities, indexes, queries, and modeling for strong consistency. Or feel free to ask; I'll edit my answer accordingly :)

How to implement composition/aggregation with NDB on GAE

How do we implement aggregation or composition with NDB on Google App Engine? What is the best way to proceed, depending on the use case?
Thanks !
I've tried to use a repeated property. In this very simple example, a Project has a list of Tag keys (I chose to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
class Project(ndb.Model):
    name = ndb.StringProperty()
    tags = ndb.KeyProperty(kind=Tag, repeated=True)
    budget = ndb.FloatProperty()
    date_begin = ndb.DateProperty(auto_now_add=True)
    date_end = ndb.DateProperty(auto_now_add=True)

    @classmethod
    def all(cls):
        return cls.query()

    @classmethod
    def addTags(cls, from_str):
        tagname_list = from_str.split(',')
        tag_list = []
        for tag in tagname_list:
            tag_list.append(Tag.addTag(tag))
        cls.tags = tag_list
--
Edited (2) :
Thanks. Finally, I have chosen to create a new Model class 'Relation' representing a relation between two entities. It's more of an association; I confess that my first design was ill-suited.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel which stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would've worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cron jobs that run every day: one to push RawModel records to BigQuery and another to create the AggregateModel records with data fetched from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)
It really does depend on the use case. For small numbers of items, StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities, you would then look at setting the parent/ancestor in the key for composition, and having a KeyProperty pointing to the primary entity for a many-to-one aggregation.
However, the choice will also depend heavily on the actual usage pattern; that is where considerations of efficiency kick in.
The best I can suggest is to consider carefully how you plan to use these relationships: how active are they (i.e., are they constantly changing, with additions and deletions), and do you need to see all members of the relation most of the time, or just subsets? These considerations may well require adjustments to the approach.

Custom properties not saved correctly for Expando models in repeated StructuredProperty

I am trying to use an Expando model as a repeated StructuredProperty in another model. Namely, I would like to add an indefinite number of Accounts to my User model. As Accounts can have different properties depending on their types (Accounts are references to social network accounts, and for example Twitter requires more information than Facebook for its OAuth process), I have designed my Account model as an Expando. I've added all basic information in the model definition, but I plan to add custom properties for specific social networks (e.g., a specific access_token_secret property for Twitter).
1/ Can you confirm the following design (Expando in repeated StructuredProperty) should work?
class Account(ndb.Expando):
    account_type = ndb.StringProperty(required=True, choices=['fb', 'tw', 'li'])
    account_id = ndb.StringProperty()
    state = ndb.StringProperty()
    access_token = ndb.StringProperty()

class HUser(User):
    email = ndb.StringProperty(required=True, validator=validate_email)
    created = ndb.DateTimeProperty(auto_now_add=True)
    accounts = ndb.StructuredProperty(Account, repeated=True)
2/ Now the problem I am facing: when I add a Facebook account to my HUser instance, everything works fine; however, the problem arises when I append a Twitter account to that same instance and add a new property not declared in the model, like this:
for account in huser.accounts:
    if account.state == "state_we_re_looking_for" and account.account_type == 'tw':
        # we found the appropriate Twitter account reference
        account.access_token_secret = "..."  # store the access token secret fetched from Twitter API
        huser.put()  # save to the Datastore
        break
This operation is supposed to save the access token secret in the Twitter Account instance of my User, but in fact it saves it in the Facebook Account instance (at index 0)!
What am I doing wrong?
Thanks.
This is a fundamental problem with how ndb stores the StructuredProperty. Datastore does not currently have a way to store this, so ndb basically explodes your properties.
For example, consider the entity:
HUser(email='test@example.com',
      accounts=[Account(account_type='fb',
                        account_id='1',
                        state='1',
                        access_token='1'),
                Account(account_type='tw',
                        account_id='2',
                        state='2',
                        access_token='2',
                        access_token_secret='2')])
This will actually get stored in an entity that looks like:
{
    email : 'test@example.com',
    accounts.account_type : ['fb', 'tw'],
    accounts.account_id : ['1', '2'],
    accounts.state : ['1', '2'],
    accounts.access_token : ['1', '2'],
    accounts.access_token_secret : ['2']
}
Because you are using an ndb.Expando, ndb doesn't know that it should pad the access_token_secret list with a None for the Facebook account. When ndb repopulates your entities, it fills in the access_token_secret on the first account it sees, which is the Facebook account.
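The misalignment can be reproduced in plain Python. This toy loop mimics, in a very simplified way, how the parallel value lists get zipped back into account objects on read:

```python
# Parallel-list storage, as in the exploded entity above:
stored = {
    'accounts.account_type': ['fb', 'tw'],
    'accounts.access_token': ['1', '2'],
    'accounts.access_token_secret': ['2'],  # only one value, no None placeholder
}

# Simplified positional repopulation: values are matched back by index.
accounts = [{} for _ in stored['accounts.account_type']]
for prop, values in stored.items():
    field = prop.split('.', 1)[1]
    for i, value in enumerate(values):
        accounts[i][field] = value

# The lone secret lands on accounts[0] (the 'fb' account), not the 'tw' one.
```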
Restructuring your data sounds like the right way to go about this, but you may want to make your HUser an ancestor of the Account for that HUser so that you query for a user's accounts using strong consistency.
From what I understand, it seems like App Engine NDB does not support Expando entities containing Expando entities themselves.
One thing that I didn't realize at first is that my HUser model inherits from Google's User class, which is precisely an Expando model!
So without even knowing it, I was trying to put a repeated StructuredProperty of Expando objects inside another Expando, which seemingly is not supported (I didn't find anything clearly written on this limitation, however).
The solution is to design the data model in a different way. I put my Account objects in a separate entity kind (and this time, they are truly Expando objects!), and I've added a KeyProperty to reference the HUser entity. This involves more read/write ops, but the code is actually much simpler to read now...
I'll mark my own question as answered, unless someone has another interesting input regarding the limitation found here.

How to limit/offset a SQLAlchemy ORM relation's results?

In case I have a User model and an Article model, where user and article have a one-to-many relation, I can access articles like this:
user = session.query(User).filter_by(id=1).one()
print user.articles
But this lists all of the user's articles. What if I want to limit them to 10? In Rails there is an all() method which can take limit/offset; in SQLAlchemy there is also an all() method, but it takes no params. How do I achieve this?
Edit:
It seems user.articles[10:20] is valid, but the SQL didn't use 10/20 in the query. So will it in fact load all matched data and slice it in Python?
The solution is to use a dynamic relationship as described in the collection configuration techniques section of the SQLAlchemy documentation.
By specifying the relationship as
class User(...):
    # ...
    articles = relationship('Article', order_by='desc(Article.date)', lazy='dynamic')
you can then write user.articles.limit(10), which will generate and execute a query to fetch the last ten articles by the user. Or you can use the [x:y] slice syntax if you prefer, which will automatically generate a LIMIT/OFFSET clause.
Performance should be reasonable unless you want to query the last ten articles for each of 100 or so users (in which case at least 101 queries will be sent to the server).
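Putting it all together, here is a self-contained sketch against an in-memory SQLite database (SQLAlchemy 1.4+; the Article model and ordering by id are assumptions, since the question doesn't show the models):

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    # lazy='dynamic' makes user.articles a Query instead of a loaded list
    articles = relationship('Article', lazy='dynamic',
                            order_by='desc(Article.id)')

class Article(Base):
    __tablename__ = 'articles'
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, ForeignKey('users.id'))

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(User(id=1))
session.add_all([Article(id=i, user_id=1) for i in range(1, 16)])
session.commit()

user = session.get(User, 1)
recent = user.articles.limit(10).all()  # emits SELECT ... LIMIT 10
page = user.articles[10:15]             # emits LIMIT 5 OFFSET 10
```

Both the limit(10) call and the [10:15] slice are rendered as LIMIT/OFFSET in the emitted SQL, so only the requested rows are loaded into Python.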
