Modeling Hierarchical Data - GAE

Modeling Hierarchical Data - GAE - python

I'm new in google-app-engine and google datastore (bigtable) and I've some doubts in order of which could be the best approach to design the required data model.
I need to create a hierarchy model, something like a product catalog, each domain has some subdomains in deep. For the moment the structure for the products changes less than the read requirements. Wine example:
Origin (Toscana, Priorat, Alsacian)
Winery (Belongs only to one Origin)
Wine (Belongs only to one Winery)
All the relations are disjoint and incomplete. Additionally in order of the requirements probably we need to store counters of use for every wine (could require transactions)
In order of the documentation seems there're different potential solutions:
Ancestors management. Using parent relations and transactions
Pseudo-ancestor management. Simulating ancestors with a db.ListProperty(db.Key)
ReferenceProperty. Specifying explicitelly the relation between the classes
But in order of the expected requests to get wines... sometimes by variety, sometimes by origin, sometimes by winery... i'm worried about the behaviour of the queries using these structures (like the multiple joins in a relational model. If you ask for the products of a family... you need to join for the final deep qualifier in the tree of products and join since the family)
Maybe is better to create some duplicated information (in order of the google team recommendations: operations are expensive, but storage is not, so duplicate content should not be seen the main problem)
Some responses of other similar questions suggest:
Store all the parent ids as a hierarchy in a string... like a path property
Duplicate the relations between the Drink entity an all the parents in the tree ...
Any suggestions?
Hi Will,
Our case is more an strict hierarchical approach as you represent in the second example. And the queries is for retrieving list of products, retrieve only one is not usual.
We need to retrieve all the wines from an Origin, from a Winery or from a Variety (If we supose that the variety is another node of the strict hierarchical tree, is only an example)
One way could be include a path property, as you mentioned:
/origin/{id}/winery/{id}/variety/{id}
To allow me to retrieve a list of wines from a variety applying a query like this:
wines_query = Wine.all()
wines_query.filter('key_name >','/origin/toscana/winery/latoscana/variety/merlot/')
wines_query.filter('key_name <','/origin/toscana/winery/latoscana/variety/merlot/zzzzzzzz')
Or like this from an Origin:
wines_query = Wine.all()
wines_query.filter('key_name >','/origin/toscana/')
wines_query.filter('key_name <','/origin/toscana/zzzzzz')
Thank you!

I'm not sure what kinds of queries you'll need to do in addition to those mentioned in the question, but storing the data in an explicit ancestor hierarchy would make the ones you asked about fall out pretty easily.
For example, to get all wines from a particular origin:
origin_key = db.Key.from_path('Origin', 123)
wines_query = db.Query(Wine).ancestor(origin_key)
or to get all wines from a particular winery:
origin_key = db.Key.from_path('Origin', 123)
winery_key = db.Key.from_path('Winery', 456, parent=origin_key)
wines_query = db.Query(Wine).ancestor(winery_key)
and, assuming you're storing the variety as a property on the Wine model, all wines of a particular variety is as simple as
wines_query = Wine.all().filter('variety =', 'merlot')
One possible downside of this strict hierarchical approach is the kind of URL scheme it can impose on you. With a hierarchy that looks like
Origin -> Winery -> Wine
you must know the key name or ID of a wine's origin and winery in order to build a key to retrieve that wine. Unless you've already got the string representation of a wine's key. This basically forces you to have URLs for wines in one of the following forms:
/origin/{id}/winery/{id}/wine/{id}
/wine/{opaque and unfriendly datastore key as a string}
(The first URL could of course be replaced with querystring parameters; the important part is that you need three different pieces of information to identify a given wine.)
Maybe there are other alternatives to these URL schemes that have not occurred to me, though.

Related

Generic data handling in Python

The situation
While reading the Bible (as context) I'd like to point out certain dependencies e.g. of people and locations. Due to swift expandability I'm choosing Python to handle this versatile data. Currently I'm creating many feature vectors independent from each other, containing various information as the database.
In the end I'd like to type in a keyword to search in this whole database, which shall return everything that is in touch with it. Something simple as
results = database(key)
What I'm looking for
Unfortunately I'm not a Pro about different database handling possibilities and I hope you can help me finding an appropriate option.
Are there possibilities that can be used out of the box or do I need to create all logic by myself?

This is a little vague so I'll try to handle the People and Location bit of it to help you get started.
One possibility is to build a SQLite database. (The sqlite3 library + documentation is relatively friendly). Also here's a nice tutorial on getting started with SQLite.
To start, you can create two entity tables:
People: contains details about every person in bible.
Locations: contains details about every location in bible.
You can then create two relationship tables that reference people and locations (as Foreign Keys). For example, one of these relationship tables might be
People_Visited_Locations: contains information about where each person visited in their lifetime. The schema might looks something like this:
| person (Foreign Key)| location (Foreign Key) | year |
Remember that Foreign Key refers to an entry in another table. In our case, person is an existing unique ID from your entity table People, location is an existing unique ID from your entity table Locations, and year could be the year that person went to that location.
Then to fetch every place that some person, say Adam in the bible visited, you can create a Select statement that returns all entries in People_Visited_Locations with Adam as person.
I think key (pun intended) takeaway is how Relationship tables can help you map relationships between entities.
Hope this helps get you started :)

How to model a unique constraint in GAE ndb

I want to have several "bundles" (Mjbundle), which essentially are bundles of questions (Mjquestion). The Mjquestion has an integer "index" property which needs to be unique, but it should only be unique within the bundle containing it. I'm not sure how to model something like this properly, I try to do it using a structured (repeating) property below, but there is yet nothing actually constraining the uniqueness of the Mjquestion indexes. What is a better/normal/correct way of doing this?
class Mjquestion(ndb.Model):
"""This is a Mjquestion."""
index = ndb.IntegerProperty(indexed=True, required=True)
genre1 = ndb.IntegerProperty(indexed=False, required=True, choices=[1,2,3,4,5,6,7])
genre2 = ndb.IntegerProperty(indexed=False, required=True, choices=[1,2,3])
#(will add a bunch of more data properties later)
class Mjbundle(ndb.Model):
"""This is a Mjbundle."""
mjquestions = ndb.StructuredProperty(Mjquestion, repeated=True)
time = ndb.DateTimeProperty(auto_now_add=True)
(With the above model and having fetched a certain Mjbundle entity, I am not sure how to quickly fetch a Mjquestion from mjquestions based on the index. The explanation on filtering on structured properties looks like it works on the Mjbundle type level, whereas I already have a Mjbundle entity and was not sure how to quickly query only on the questions contained by that entity, without looping through them all "manually" in code.)
So I'm open to any suggestion on how to do this better.
I read this informational answer: https://stackoverflow.com/a/3855751/129202 It gives some thoughts about scalability and on a related note I will be expecting just a couple of bundles but each bundle will have questions in the thousands.
Maybe I should not use the mjquestions property of Mjbundle at all, but rather focus on parenting: each Mjquestion created should have a certain Mjbundle entity as parent. And then "manually" enforce uniqueness at "insert time" by doing an ancestor query.

When you use a StructuredProperty, all of the entities that type are stored as part of the containing entity - so when you fetch your bundle, you have already fetched all of the questions. If you stick with this way of storing things, iterating to check in code is the solution.

What are some good examples for App Engine NDB commenting models?

I'm trying to model a basic linear commenting system for my blog in App Engine (you can see it at http://codeinsider.us). My main classes of objects are:
Users,
Articles,
Comments
One user will have many comments and should be able to view their comments at a glance.
One article will have many comments and should be visible at a glance.
One comment will be associated with exactly one user and exactly one article.
I know how I might build this in a standard relational database - I might have, say, separate tables for comments, users, and articles, with foreign keys to tie them together, uniqueness constraints on articles and users, and none on comments, etc. Nothing fancy.
What's the best way of modeling this in Python App Engine with NDB? ndb.KeyProperty seems interesting, as does StructuredProperty. I don't think I can use StructuredProperty though, since a comment can "belong" to both a User and an Article. But with ndb.KeyProperty, it seems like the keyProperty doesn't do any checking or validation logic, so I'd have to implement that on my own.
The other thing I can do is just throw in the towel, and store giant JSON blobs in Users and Articles representing the Keys and Kinds of comments. That may not be a bad solution.
Any thoughts?
Edit:
This is going to be high-read, low-write. I may add some engagement on comments (upvotes/downvotes), but even then, it will be heavily weighted towards reads.

I recommend to you thinking carefully on what features are you planning to provide since structuring your models in some way may difficult some changes in the future.
I will do this as follows:
First, assume some eventual consistency. No matter how you design this, you will have some eventual consistency in some queries.
Make a KeyProperty "owner" in article to store the user_key. If you want to achieve strong consistency when querying the articles of a single user then instead of using the "owner" KeyProperty just make the user_key the parent of the Article (this will create an entity group for the user and it's articles and is fine here).
With comments you can do more things.
If you expect less than 100 (depending on Article size on the
datastore can be more) comments for each article create a comments
KeyProperty(repeated=True) in Article to store all the comments keys
and then get them with get_multi (strong consistency).
To create the comment and also modify the Article comments property
you may need a transaction, because you will want to accomplish the
two operations or non of them. But.. the two entities are not in the
same entity group so: 1) use cross group transaction or 2) make the
parent of the comment the Article (this second option will have some
consequences discussed later) Counts of comments are easy but
limited to 100 or more comments as said before.
Create a Comment ndb model with two KeyProperties, "owner" and
"article". The article will fetch comments with a query. To query
all the comments within an Article you will have eventual
consistency unless you make the article the parent of the comment
(in that case don't create the article KeyProperty of course). This
approach allows lots of comments.
The problem of using entity groups are that for example, if you allow to vote on comments, then a single write operation on each comment will block any write in the hole entity group of the Article affected. So creation and voting by other users may be affected. But don't really care about this if you expect few votes and you keep entity groups small.
If you want to allow comment votes this can get quite complicated as you may want for example only one vote per user. This will require extra relationships that need to be thought before.
Personally I prefer to assume eventual consistency almost always.
More approaches are possible but I like this two.

High read, low write scenario is the specialty on GAE, so that's a good thing for your purpose.
I'd take advantage of the ancestry feature of GAE Model as it assures you transactional/atomic operations within an entity group. I guess you don't need much of that but it's a good thing to have still.
The right structure is determined by the way you are going to treat/use your data. I'm assuming the typical case in your blog would be to show comments for an article, thus, I'd make your comment model a child of your article model - you could then query comments for a certain (article) ancestor and that would scale magnificently.
I'd include a KeyProperty for the author on the comment, as that would be used mainly to fetch a user from the key I assume. If you want to extend KeyProperty functionality you can do so. Here's an example on how to make KeyProperty behave as ReferenceProperty used to in db. (point 1.)

How to implement composition/agregation with NDB on GAE

How do we implement agregation or composition with NDB on Google App Engine ? What is the best way to proceed depending on use cases ?
Thanks !
I've tried to use a repeated property. In this very simple example, a Project have a list of Tag keys (I have chosen to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
class Project(ndb.Model):
name = ndb.StringProperty()
tags = ndb.KeyProperty(kind=Tag, repeated=True)
budget = ndb.FloatProperty()
date_begin = ndb.DateProperty(auto_now_add=True)
date_end = ndb.DateProperty(auto_now_add=True)
#classmethod
def all(cls):
return cls.query()
#classmethod
def addTags(cls, from_str):
tagname_list = from_str.split(',')
tag_list = []
for tag in tagname_list:
tag_list.append(Tag.addTag(tag))
cls.tags = tag_list
--
Edited (2) :
Thanks. Finally, I have chosen to create a new Model class 'Relation' representing a relation between two entities. It's more an association, I confess that my first design was unadapted.

An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel, which a stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would've worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cronjobs that run everyday to push RawModel to BigQuery and another to create the AggregateModel records with data fetched from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)

It really does depend on the use case. For small numbers of items StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities you will then look at setting the parent/ancestor in the Key for composition, and have a KeyProperty pointing to the primary entity in a many to one aggregation.
However the choice will also depend heavily on the actual use pattern as well. Then considerations of efficiency kick in.
The best I can suggest is consider carefully how you plan to use these relationships, how active are they (ie are they constantly changing, adding, deleting), do you need to see all members of the relation most of the time, or just subsets. These consideration may well require adjustments to the approach.

Django design patterns - models with ForeignKey references to multiple classes

I'm working through the design of a Django inventory tracking application, and have hit a snag in the model layout. I have a list of inventoried objects (Assets), which can either exist in a Warehouse or in a Shipment. I want to store different lists of attributes for the two types of locations, e.g.:
For Warehouses, I want to store the address, manager, etc.
For Shipments, I want to store the carrier, tracking number, etc.
Since each Warehouse and Shipment can contain multiple Assets, but each Asset can only be in one place at a time, adding a ForeignKey relationship to the Asset model seems like the way to go. However, since Warehouse and Shipment objects have different data models, I'm not certain how to best do this.
One obvious (and somewhat ugly) solution is to create a Location model which includes all of the Shipment and Warehouse attributes and an is_warehouse Boolean attribute, but this strikes me as a bit of a kludge. Are there any cleaner approaches to solving this sort of problem (Or are there any non-Django Python libraries which might be better suited to the problem?)

what about having a generic foreign key on Assets?

I think its perfectly reasonable to create a "through" table such as location, which associates an asset, a content (foreign key) and a content_type (warehouse or shipment) . And you could set a unique constraint on the asset_fk so thatt it can only exist in one location at a time

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.