I'm new to Google Datastore and Python, but I have a project that uses them. Even with Google's good documentation, I'm missing a real-world example of data modeling. So here is part of my project, a proposed model, and some questions about it.
I'm sure you can help me understand Datastore more clearly, and I think these questions can help other beginners like me see how to model data for a good application.
A soccer feed contains some general information about the match itself, such as the name of the competition it belongs to, the pool name, the season, the matchday, and the winning team.
For each team, the winner and the loser, we have the details of the actions that occurred during the match: cards and goals.
For cards, we have this information: the color, the period in which it occurred, a player id, the reason, and the time it occurred.
For goals, we have the period, a player id, the time, and the id of the assisting player.
For each team we also have the player details: name, position (center, middle…), and date of birth.
Here is the model I would like to use to ingest data from the soccer feed into the Datastore using Python:
I have these entities: Team, Player, Match, TeamData, Card and Goal.
For each match, we will have two TeamData entities, one per team, plus the details of the actions (cards and goals).
I used a KeyProperty between TeamData and Match and between Card/Goal and TeamData, but I think I could use a parent relationship instead; I don't know which is best.
from google.appengine.ext import ndb

class Team(ndb.Model):
    name = ndb.StringProperty()

class Player(ndb.Model):
    teamKey = ndb.KeyProperty(kind=Team)
    name = ndb.StringProperty()
    date_of_birth = ndb.DateProperty()  # type assumed
    position = ndb.StringProperty()

class Match(ndb.Model):
    name_compet = ndb.StringProperty()
    round = ndb.StringProperty()
    season = ndb.StringProperty()  # type assumed
    matchday = ndb.IntegerProperty()  # type assumed
    team1Key = ndb.KeyProperty(kind=Team)
    team2Key = ndb.KeyProperty(kind=Team)
    winning_teamKey = ndb.KeyProperty(kind=Team)

class TeamData(ndb.Model):
    match = ndb.KeyProperty(kind=Match)
    score = ndb.IntegerProperty()  # type assumed
    side = ndb.StringProperty()  # 'away' or 'home'
    teamKey = ndb.KeyProperty(kind=Team)

class Card(ndb.Model):
    teamdata = ndb.KeyProperty(kind=TeamData)
    playerKey = ndb.KeyProperty(kind=Player)
    color = ndb.StringProperty()
    period = ndb.StringProperty()
    reason = ndb.StringProperty()
    time = ndb.StringProperty()  # type assumed, same as Goal.time
    timestamp = ndb.DateTimeProperty()  # type assumed

class Goal(ndb.Model):
    teamdata = ndb.KeyProperty(kind=TeamData)
    period = ndb.StringProperty()
    playerkey = ndb.KeyProperty(kind=Player)
    time = ndb.StringProperty()
    type = ndb.StringProperty()
    assistantplayerKey = ndb.KeyProperty(kind=Player)
Here are my questions:
Is this model "correct", and does it allow basic queries (which teams played on a certain day, what the result was for a certain match, with the details of cards and goals: player, assistant, reason, time)
and more complex queries (how many goals a certain player scored in a certain season)?
I don't really see the difference between a SQL database and a NoSQL database such as Datastore, except that Datastore manages the keys rather than us. Can you explain clearly what advantage I get from this NoSQL model?
Thank you for helping me!
NoSQL makes reads MUCH faster, and independent of the size of the data scanned. With a 3-terabyte table in SQL, a query that has to scan the table takes the same server-side "computation time" no matter how little it returns. In Datastore, since a query goes DIRECTLY to where the matching data is (through an index), the size of the RETURNED rows/columns is what dictates the time it takes.
On the other hand, saving takes a bit more time (since each write has to update multiple indexes), and Datastore CANNOT do server-side computations. For instance, with Datastore you can't SUM or AVERAGE. Datastore ONLY scans and returns, which is why it's so fast. It was never intended to do calculations on your behalf (so the answer to "can it do more complex queries?" is no, but that's down to Datastore, not your model). One thing that helps with this kind of sum is to keep a counter in a separate entity and update it as needed (for example another entity "totalGoals" with "keyOfPlayer" and "numberOfGoals").
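To make the counter idea concrete, here is a minimal sketch in plain Python, with in-memory containers standing in for the two Datastore kinds. The names record_goal, goals_for and total_goals are made up for illustration, and a real version would update the counter entity inside a transaction.

```python
# In-memory stand-ins for two Datastore kinds (hypothetical names).
goals = []          # one record per Goal entity
total_goals = {}    # (player_id, season) -> numberOfGoals


def record_goal(player_id, season, minute):
    """Store the goal and bump the per-player counter in the same step."""
    goals.append({"player_id": player_id, "season": season, "minute": minute})
    key = (player_id, season)
    total_goals[key] = total_goals.get(key, 0) + 1


def goals_for(player_id, season):
    """Read the precomputed total: one lookup by key, no scan, no SUM."""
    return total_goals.get((player_id, season), 0)


record_goal("p10", "2014-15", 12)
record_goal("p10", "2014-15", 67)
record_goal("p7", "2014-15", 80)
print(goals_for("p10", "2014-15"))
```

The cost moves to write time: every new goal does one extra update, and in exchange "how many goals did this player score this season" becomes a single read.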
One thing worth mentioning is eventual consistency. In SQL, when you "insert", the data is in the table and can be retrieved immediately. In Datastore, consistency is not immediate (because the write has to be copied to several indexes, you can't know WHEN the insert is completely done). There are ways to force consistency: ancestor queries are one of them, as is fetching directly by key, or opening your datastore viewer.
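A toy model may help picture this. The sketch below is plain Python and is NOT how Datastore is implemented; ToyStore and its methods are invented for illustration. It only mimics the visible behaviour: a get by key sees the write at once, while a query that relies on an index sees it only after the index has caught up.

```python
class ToyStore:
    """Toy model: entities are readable by key at once, indexes lag behind."""

    def __init__(self):
        self.entities = {}   # key -> entity dict
        self.index = {}      # (property, value) -> set of keys
        self.pending = []    # index updates not yet applied

    def put(self, key, entity):
        self.entities[key] = entity
        self.pending.append((key, entity))  # the index catches up later

    def get(self, key):
        # Lookup by key reads the entity itself, so it is always current.
        return self.entities.get(key)

    def query(self, prop, value):
        # A query only sees what has already reached the index.
        keys = self.index.get((prop, value), set())
        return [self.entities[k] for k in sorted(keys)]

    def apply_pending(self):
        # Simulates the moment when the index updates finally land.
        for key, entity in self.pending:
            for prop, value in entity.items():
                self.index.setdefault((prop, value), set()).add(key)
        self.pending = []


store = ToyStore()
store.put("goal1", {"player": "p10"})
print(store.get("goal1"))              # visible immediately by key
print(store.query("player", "p10"))    # index has not caught up yet
store.apply_pending()
print(store.query("player", "p10"))    # now the query sees it
```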
Another thing, even if it won't affect you (in the same spirit of providing an answer for other beginners, I try to include as much as I can think of): to make ancestor queries safe, Datastore actually FREEZES the entity group they use (entity group = parent + children + children of children, and so on) while you query it.
Other questions? Refer to the docs about entities, indexes, queries, and modeling for strong consistency. Or feel free to ask, and I'll edit my answer accordingly :)
Related
I'm looking to connect my website with Salesforce and have a view that shows a breakdown of a user's activities in Salesforce, then calculate an overall score based on weights assigned to each activity. I'm using Django-Salesforce to initiate the connection and extend the Activity model, but I'm not sure I've set up the Activity or OverallScore classes correctly.
Below is the code I already have. Based on similar questions I've seen, a custom save method seems to be the suggested approach, but my concern is that my database would quickly become massive, as the connection will refresh every 5 minutes.
The biggest question I have is how to set up the "weighted_score" attribute of the Activity class, as I doubt what I have currently is correct.
class Activity(salesforce.models.Model):
    owner = models.ManyToManyField(Profile)
    name = models.CharField(verbose_name='Name', max_length=264,
                            unique=True)
    weight = models.DecimalField(verbose_name='Weight', decimal_places=2,
                                 default=0)
    score = models.IntegerField(verbose_name='Score', default=0)
    weighted_score = weight*score

    def __str__(self):
        return self.name

class OverallScore(models.Model):
    factors = models.ManyToManyField(Activity)
    score = Activity.objects.aggregate(Sum('weighted_score'))

    def __str__(self):
        return "OverallScore"
The ideal end result would be each user logged in gets a "live" look at their activity scores and one overall score which is refreshed every 5 minutes from the Salesforce connection, then at the end of the day I would run a cron job to save the end of day results to the database.
Excuse a late, partial response that covers only the parts of the question that are clear.
How to implement arithmetic on fields for weighted_score depends on whether you prefer the expression on the Django side or on the Salesforce side.
The easiest, but very limited, solution is the @property decorator on a method.
class Activity(salesforce.models.Model):
    ...  # the same fields

    @property
    def weighted_score(self):
        return self.weight * self.score
This can be used in Python code as self.weighted_score, but it cannot be passed to SOQL in any way, and it gives you no more power than writing the longer (self.weight * self.score) in the same place.
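For illustration only, here is the same pattern in plain Python, without Django or Salesforce, so it can be run anywhere. The Activity class below is a stand-in for the model, and overall_score is a hypothetical helper that sums the property on the Python side, since the database cannot aggregate over it.

```python
from decimal import Decimal


class Activity:
    """Plain-Python stand-in for the Django model (no database involved)."""

    def __init__(self, name, weight, score):
        self.name = name
        self.weight = Decimal(weight)
        self.score = score

    @property
    def weighted_score(self):
        # Computed on every access; nothing is stored.
        return self.weight * self.score


def overall_score(activities):
    """Sum the weighted scores in Python code."""
    return sum((a.weighted_score for a in activities), Decimal("0"))


activities = [Activity("Calls", "0.50", 10), Activity("Emails", "0.25", 8)]
print(overall_score(activities))
```

Because the value is derived on each access, nothing extra is written to the database when the connection refreshes.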
Salesforce SOQL does not support arithmetic expressions in the SELECT clause, but you can define a custom "Formula" field in the Salesforce setup of the Activity object and use it as a normal read-only numeric field in Django. If Activity were the detail side of a Master-Detail Relationship to another Salesforce object, you could apply a very fast sum, max or average formula on that object.
A ManyToMany field requires you to create the binding object in Salesforce Setup manually and to assign it to the through attribute of the ManyToMany field. An example is on the wiki page Foreign Key Support.
As a rule of thumb, your object definition must first exist in Salesforce with useful relationships (either Lookup Relationship or Master-Detail Relationship) and a manageable data structure. Then you can run python manage.py inspectdb --database=salesforce ... table names (optionally a list of API Names of the tables used, separated by spaces). That produces a lot of code, with many unnecessary fields and choices to prune, but it is still easier and more reliable than asking someone.
Salesforce has no special form of custom ManyToMany relationship, so everything is written as ForeignKey in models.py. Master-Detail is only a comment on the ForeignKey. You can finally create a ManyToMany field manually, but it is mainly syntactic sugar: a nice mnemonic name for forward and reverse traversal through the two foreign keys on the "through=" binding object.
(The rest of the question was too broad and unclear for me.)
I am trying to create a Report Card Model. I have with me:
Question ids, answers selected for each question by candidate, correct answer id of each question, weight of each question.
Is it a good idea to create fields like "total marks, average, number of correct answers, number of questions", etc. in my ReportCard model, or should I calculate everything every time a viewer visits the detail view of this report card?
My Model so far:
class ReportCard(models.Model):
    exam = models.OneToOneField(Exam)

class ExamChoiceMade(models.Model):
    report_card = models.OneToOneField(ReportCard)
    question_no = models.PositiveIntegerField(default=0)
    answer_chosen = models.PositiveIntegerField(default=0)
    is_correct = models.BooleanField(default=False)
The first thing to remember is that whatever decision you make, there will be trade-offs. Among all the choices you have, you need to pick the one that best fits your case.
On the web, scalability is the main concern when weighing performance trade-offs.
It is good practice to keep cheaply calculated (non-resource-hungry) fields as model properties, so that they behave like fields of the table but are never stored and are calculated on demand.
If the on-demand calculation is resource hungry, though, your response is going to be very slow. We should be careful to keep the response time under 100 ms for any normal action (including those that merely appear normal to the end user).
So the answer to your question is that whether to store or calculate on demand depends on your requirements.
However, the fields that you have mentioned above don't seem to be resource hungry, so they can just be model properties.
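As a rough sketch of what such on-demand properties could look like, here is a plain-Python version using dataclasses instead of Django models. The Choice class, its weight field, and the meaning of average (marks per question) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Choice:
    question_no: int
    is_correct: bool
    weight: float = 1.0   # assumed: marks awarded for a correct answer


@dataclass
class ReportCard:
    choices: List[Choice] = field(default_factory=list)

    @property
    def number_of_questions(self):
        return len(self.choices)

    @property
    def correct_answers(self):
        return sum(1 for c in self.choices if c.is_correct)

    @property
    def total_marks(self):
        return sum(c.weight for c in self.choices if c.is_correct)

    @property
    def average(self):
        # Assumed meaning: marks earned per question asked.
        if not self.choices:
            return 0.0
        return self.total_marks / self.number_of_questions


card = ReportCard([Choice(1, True, 2.0), Choice(2, False), Choice(3, True)])
print(card.total_marks, card.correct_answers, card.average)
```

Each property is a single pass over the answers of one report card, which is why computing them per request is cheap here.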
I am trying to wrap my head 'round gae datastore, but I do not fully understand the documentation for the Key Class / or maybe it is ancestor relationships in general I do not grasp.
I think what I want is multiple ancestors.
Example:
Say I wanted to model our school's annual sponsored run for charity; school kids run rounds around the track and their relatives (=sponsors) donate to charity for each round completed.
In my mind, I would create the following kinds:
Profile (can be both runner and sponsor)
Run (defines who (cf. profile) runs for what charity, rounds actually completed)
Sponsorship (defines who (cf. profile) donates how much for what run, whether the donation has been made)
I've learned that Datastore is a NoSQL, non-relational database, but I haven't fully grasped it. So my questions are:
a. Is creating an entity for "Sponsorship" even the best way in Datastore? I could also model it as a has-a relationship (every run has sponsors), but since I also want to track the amount sponsored, whether the sponsor paid up, and maybe more, this seems inappropriate.
b. I'd like to easily query all sponsorships made by a single person and also all sponsorships belonging to a certain run.
So, I feel, this would be appropriate:
Profile --is ancestor of--> Run
Profile --is ancestor of--> Sponsorship
Run --is ancestor of--> Sponsorship
Is that sensible?
I can see a constructor for a Key that takes several kinds in ancestor order as arguments. Was that designed for this case? "Run" and "profile" would be at the same "level" (i.e. mum&dad ancestors not father&grandfather) - what would that constructor look like in python?
The primary way of establishing relationships between entities is via the key properties in the entity model. Normally no ancestry is needed.
For example:
class Profile(ndb.Model):
    name = ndb.StringProperty()

class Run(ndb.Model):
    runner = ndb.KeyProperty(kind='Profile')
    rounds = ndb.IntegerProperty()
    sponsorship = ndb.KeyProperty(kind='Sponsorship')

class Sponsorship(ndb.Model):
    run = ndb.KeyProperty(kind='Run')
    donor = ndb.KeyProperty(kind='Profile')
    done = ndb.BooleanProperty()
The ancestry just places entities inside the same entity group (which can be quite limiting!) while enforcing additional relationships on top of the ones already established by the model. See Transactions and entity groups and maybe Contention problems in Google App Engine.
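With key properties, both lookups from the question ("all sponsorships by one person" and "all sponsorships for one run") are plain equality filters. In ndb they would be Sponsorship.query(Sponsorship.donor == profile_key) and Sponsorship.query(Sponsorship.run == run_key). The sketch below mimics that in plain Python so it runs without App Engine; the string keys, the amount_per_round field and the helper names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sponsorship:
    run_key: str              # stands in for ndb.KeyProperty(kind='Run')
    donor_key: str            # stands in for ndb.KeyProperty(kind='Profile')
    amount_per_round: float   # hypothetical field, not in the model above
    done: bool = False


def sponsorships_by_donor(items, donor_key):
    """Equivalent of filtering on the donor key property."""
    return [s for s in items if s.donor_key == donor_key]


def sponsorships_for_run(items, run_key):
    """Equivalent of filtering on the run key property."""
    return [s for s in items if s.run_key == run_key]


all_sponsorships = [
    Sponsorship("run:anna-2015", "profile:grandma", 2.0),
    Sponsorship("run:anna-2015", "profile:uncle", 1.5, done=True),
    Sponsorship("run:ben-2015", "profile:grandma", 1.0),
]

print(len(sponsorships_by_donor(all_sponsorships, "profile:grandma")))
print(len(sponsorships_for_run(all_sponsorships, "run:anna-2015")))
```

Neither lookup needs the Sponsorship to be a child of the Run or the Profile, so the "two parents" problem never comes up.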
How do we implement aggregation or composition with NDB on Google App Engine? What is the best way to proceed depending on the use case?
Thanks!
I've tried to use a repeated property. In this very simple example, a Project has a list of Tag keys (I chose to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
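from google.appengine.ext import ndb

class Project(ndb.Model):
    name = ndb.StringProperty()
    tags = ndb.KeyProperty(kind=Tag, repeated=True)
    budget = ndb.FloatProperty()
    date_begin = ndb.DateProperty(auto_now_add=True)
    date_end = ndb.DateProperty(auto_now_add=True)

    @classmethod
    def all(cls):
        return cls.query()

    # Instance method: assigning to cls.tags in a classmethod would replace
    # the property on the class instead of setting this project's tags.
    def addTags(self, from_str):
        tagname_list = from_str.split(',')
        tag_list = []
        for tag in tagname_list:
            # Tag.addTag is my own helper; assumed to return the Tag's key
            tag_list.append(Tag.addTag(tag))
        self.tags = tag_list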
--
Edited (2) :
Thanks. In the end, I chose to create a new Model class 'Relation' representing a relation between two entities. It's more of an association; I admit that my first design was ill-suited.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel that stores individual, non-aggregated records, and an AggregateModel that stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would have worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cron jobs that run every day: one pushes RawModel records to BigQuery and another creates the AggregateModel records from data fetched from BigQuery.
(Of course, this is only worthwhile if you have lots of data to aggregate.)
It really does depend on the use case. For small numbers of items, StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities, you would instead look at setting the parent/ancestor in the Key for composition, and at having a KeyProperty point to the primary entity for a many-to-one aggregation.
However, the choice will also depend heavily on the actual usage pattern. That is where considerations of efficiency kick in.
The best I can suggest is to consider carefully how you plan to use these relationships: how active are they (i.e. are they constantly changing, with members added and deleted), and do you need to see all members of the relation most of the time, or just subsets? These considerations may well require adjustments to the approach.
Is it possible in any way to query entities using one of their parent's property in GAE, like this (which doesn't work)?
class Car(db.Model):
    title = db.StringProperty()
    type = db.StringProperty()

class Part(db.Model):
    title = db.StringProperty()

car = Car()
car.title = 'BMW X5'
car.type = 'SUV'
car.put()

part = Part(parent = car)
part.title = 'Left door'
part.put()

parts = Part.all()
parts.filter('parent.type ==', 'SUV') # this in particular
I've read about ReferenceProperty and indexes, but I'm not sure what I need.
GAE lets me set a parent on the Part entity, but do I actually need a (kind of duplicate) property like this:
parent = db.ReferenceProperty(Car, required=True)
That would feel like duplicating what the system already does, since the Part has a parent. Or is there another way?
It's not an answer to your question as such, but NDB offers structured properties.
https://developers.google.com/appengine/docs/python/ndb/properties#structured
You can structure a model's properties. For example, you can define a model class Contact containing a list of addresses, each with internal structure.
Although the structured properties instances are defined using the same syntax as for model classes, they are not full-fledged entities. They don't have their own keys in the Datastore. They cannot be retrieved independently of the entity to which they belong. An application can, however, query for the values of their individual fields.
So here the car would contain its parts as a structured property. Whether this is viable for your use case depends on how you structure your data. If you want to know which parts make up a specific car, it seems viable. If you want to filter parts globally, regardless of which car they belong to, you can still do that, but you'll have to make the "parts" inside each car also refer to a different model, as each car contains its own parts. If you see what I mean (I'm not sure I do).
Adding the parent as an explicit property isn't going to help.
You can break it up in two parts though:
for suv in Car.all().filter('type', 'SUV'):
    for part in Part.all().ancestor(suv):
        ...  # do something with part
If you want to query on the property of another (parent) object, you gotta get that object first.
I can think of two solutions to your problem:
Guido's way is to query for the parent, and then query for the part. This way issues more queries.
The second way is to store a copy of parent.type inside your Part. The downsides are that you're storing duplicate data (more storage), and you have to be careful that the data in Part and the data in Car stay in sync. However, you only need to issue one query.
You'll have to figure out which one works better for you.
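To compare the two approaches side by side, here is a small in-memory sketch in plain Python (no App Engine needed). The dictionaries stand in for Car and Part entities, and car_type is a hypothetical name for the copied field; change_car_type shows the extra bookkeeping the copy requires.

```python
# In-memory stand-ins for the Car and Part kinds (illustrative data).
cars = {
    "car1": {"title": "BMW X5", "type": "SUV"},
    "car2": {"title": "Mini", "type": "Hatchback"},
}
parts = [
    {"title": "Left door", "parent": "car1", "car_type": "SUV"},
    {"title": "Bonnet", "parent": "car1", "car_type": "SUV"},
    {"title": "Boot lid", "parent": "car2", "car_type": "Hatchback"},
]


def parts_via_parent(car_type):
    """Approach 1: find the matching cars first, then the parts under each."""
    result = []
    for key, car in cars.items():
        if car["type"] == car_type:
            result.extend(p for p in parts if p["parent"] == key)
    return result


def parts_via_copy(car_type):
    """Approach 2: one pass over parts, using the duplicated car_type."""
    return [p for p in parts if p["car_type"] == car_type]


def change_car_type(car_key, new_type):
    """With approach 2, every change to a car must also update its parts."""
    cars[car_key]["type"] = new_type
    for p in parts:
        if p["parent"] == car_key:
            p["car_type"] = new_type


print([p["title"] for p in parts_via_parent("SUV")])
print([p["title"] for p in parts_via_copy("SUV")])
```

Both functions return the same parts as long as change_car_type is the only way a car's type is ever modified; skip that step once and the copy goes stale.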