MongoEngine: EmbeddedDocument v/s. ReferenceField

An EmbeddedDocument lets you store a document inside another document, while a ReferenceField stores only a reference to it. But they achieve a similar goal. Do they have specific use cases?
PS:
There's already a question on SO, but no good answers.

The answer to this really depends on what you intend to do with the data you are storing in mongodb. It is important to remember that a ReferenceField points to a document in another collection in mongodb, whereas an EmbeddedDocument is stored inside the same document, in the same collection.
Consider this schema:
Person
-> name
-> address
Address
-> street
-> city
-> country
If you expect every person to have only one address and each address to be associated with only one person (a one-to-one relationship), and you are generally going to query the database for one or more Person documents, then the Person.address field should be an EmbeddedDocumentField.
If you expect every person to have more than one address but each address to be associated with only one person (a one-to-many relationship), and you will still mainly query for a Person, then you can use an EmbeddedDocumentListField.
If you expect every person to have more than one address and each address to be associated with many people (a many-to-many relationship), you should probably use ReferenceField.
However, even if the relationship is one-to-one or one-to-many, if the Address is a part of your data model that is of interest in its own right, then it may be advantageous to store it in its own collection because that makes aggregation and indexing easier.
One other point to consider is that, unless you turn it off, mongoengine will automatically de-reference the ReferenceFields of the documents you retrieve. This might introduce performance penalties with lots of ReferenceFields or with references to very large documents.
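To picture the storage difference, here is a minimal sketch that uses plain Python dictionaries as stand-ins for two MongoDB collections. It is not MongoEngine API; the names people, addresses and cities_of, and the sample rows, are made up for illustration.

```python
# Plain-Python stand-in for two MongoDB collections (NOT MongoEngine API).
people = {}     # _id -> person document
addresses = {}  # _id -> address document

# Embedded: the address is stored inside the person document itself.
people[1] = {
    "name": "Alice",
    "address": {"street": "1 Main St", "city": "Springfield", "country": "US"},
}

# Referenced: the address lives in its own collection and people store only
# its _id, so several people can share one address (many-to-many).
addresses[100] = {"street": "2 Side St", "city": "Springfield", "country": "US"}
people[2] = {"name": "Bob", "addresses": [100]}
people[3] = {"name": "Carol", "addresses": [100]}


def cities_of(person_id):
    """Return the cities for a person, following references when needed."""
    person = people[person_id]
    if "address" in person:
        # Embedded: the data arrived together with the person document.
        return [person["address"]["city"]]
    # Referenced: one extra lookup per reference (the "de-reference" step).
    return [addresses[ref]["city"] for ref in person["addresses"]]
```

Updating addresses[100] once changes what both Bob and Carol see, which is the benefit of a reference; the price is the extra lookup on every read.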

It's really about the schema design of your collections in MongoDB. Generally it depends on factors such as the cardinality of the relationship, the way you access the data, and the size of the documents. It's explained well, with some examples, on the official MongoDB blog, and I recommend you take a look at it.

Related

Generic data handling in Python

The situation
While reading the Bible (as context) I'd like to point out certain dependencies, e.g. between people and locations. Because it is easy to extend, I'm choosing Python to handle this versatile data. Currently I'm creating many feature vectors, independent of each other and containing various information, which together serve as the database.
In the end I'd like to type in a keyword to search this whole database, which should return everything that is related to it. Something as simple as
results = database(key)
What I'm looking for
Unfortunately I'm not an expert on the different ways of handling databases, and I hope you can help me find an appropriate option.
Are there possibilities that can be used out of the box or do I need to create all logic by myself?

This is a little vague so I'll try to handle the People and Location bit of it to help you get started.
One possibility is to build a SQLite database. (The sqlite3 library + documentation is relatively friendly). Also here's a nice tutorial on getting started with SQLite.
To start, you can create two entity tables:
People: contains details about every person in the Bible.
Locations: contains details about every location in the Bible.
You can then create two relationship tables that reference people and locations (as Foreign Keys). For example, one of these relationship tables might be
People_Visited_Locations: contains information about the places each person visited in their lifetime. The schema might look something like this:
| person (Foreign Key)| location (Foreign Key) | year |
Remember that Foreign Key refers to an entry in another table. In our case, person is an existing unique ID from your entity table People, location is an existing unique ID from your entity table Locations, and year could be the year that person went to that location.
Then, to fetch every place that some person (say, Adam) visited, you can create a Select statement that returns all entries in People_Visited_Locations with Adam as person.
I think the key (pun intended) takeaway is how relationship tables can help you map relationships between entities.
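The tables described above could be sketched with the sqlite3 module roughly as follows. The lowercase table names, the places_visited helper and the sample rows are only illustrative assumptions, and year is left empty because no dates are given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE people    (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE people_visited_locations (
    person   INTEGER NOT NULL REFERENCES people(id),
    location INTEGER NOT NULL REFERENCES locations(id),
    year     INTEGER
);
""")

# Illustrative sample rows (hypothetical data set).
conn.executemany("INSERT INTO people (id, name) VALUES (?, ?)",
                 [(1, "Adam"), (2, "Moses")])
conn.executemany("INSERT INTO locations (id, name) VALUES (?, ?)",
                 [(1, "Eden"), (2, "Egypt"), (3, "Sinai")])
conn.executemany(
    "INSERT INTO people_visited_locations (person, location, year) VALUES (?, ?, ?)",
    [(1, 1, None), (2, 2, None), (2, 3, None)],
)
conn.commit()


def places_visited(conn, person_name):
    """Return the names of all locations linked to the given person."""
    rows = conn.execute(
        """
        SELECT l.name
        FROM people_visited_locations AS v
        JOIN people    AS p ON p.id = v.person
        JOIN locations AS l ON l.id = v.location
        WHERE p.name = ?
        ORDER BY l.name
        """,
        (person_name,),
    )
    return [name for (name,) in rows]


print(places_visited(conn, "Moses"))  # ['Egypt', 'Sinai']
```

A keyword search like results = database(key) could then be built as a thin wrapper that runs a few such queries, one per relationship table.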
Hope this helps get you started :)

Storing multiple values into a single field in mysql database that preserve order in Django

I've been trying to build a Tutorial system that we usually see on websites. Like the ones we click next -> next -> previous etc to read.
All Posts are stored in a table(model) called Post. Basically like a pool of post objects.
Post.objects.all() will return all the posts.
Now there's another table (model) called Tutorial that will store the following:
class Tutorial(models.Model):
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    tutorial_heading = models.CharField(max_length=100)
    tutorial_summary = models.CharField(max_length=300)
    series = models.CharField(max_length=40)  # <---- Here [10,11,12]
    ...
Here entries in this series field are post_ids stored as a string representation of a list.
example: series will have [10,11,12] where 10, 11 and 12 are post_id that correspond to their respective entries in the Post table.
So my table entry for Tutorial model looks like this.
id heading summary series
"5" "Series 3 Tutorial" "lorem on ullt consequat." "[12, 13, 14]"
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
Now, I've read from several stackoverflow posts that having multiple entries in a single field is a bad idea. And having this relationship to span over multiple tables as a mapping is a better option.
What I want to have is the ability to insert new posts into this series anywhere I want. Maybe in the front or middle. This can be easily accomplished by treating this series as a list and inserting as I please. Altering "[14,12,13]" will reorder the posts that are being displayed.
My question is: is this way of storing multiple values in a field okay for my use case, or will it take a performance hit, or is it generally a bad idea? If it is a bad idea, is there a way to preserve or alter the order by spanning the relationship over another table, or is there an entirely better way to accomplish this in Django or MySQL?

Here entries in this series field are post_ids stored as a string representation of a list.
(...)
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
DON'T DO THIS !!!
You are working with a relational database. There is one proper way to model relationships between entities in a relational database, which is to use foreign keys. In your case, depending on whether a post can belong only to a single tutorial ("one to many" relationship) or to many tutorials at the same time ("many to many" relationship), you'll want either to add a foreign key on tutorial to post, or to use an intermediate "post_tutorials" table with foreign keys on both post and tutorial.
Your solution doesn't allow the database to do its job properly. It cannot enforce integrity constraints (what if you delete a post that's referenced by a tutorial?), it cannot optimize read access (with a proper schema the database can retrieve a tutorial and all its posts in a single query), it cannot follow reverse relationships (given a post, access the tutorial(s) it belongs to), etc. And it requires an external program (python code) to interact with your data, while with proper modeling you just need standard SQL.
Finally - but this is django-specific - using proper schema works better with the admin features, and with django rest framework if you intend to build a rest API.
wrt/ the ordering problem, it's a long-known (and solved) issue: you just need to add an "order" field (a small int should be enough). There are a couple of 3rd-party django apps that add support for this to both your models and the admin, so it's almost plug and play.
IOW, there is absolutely no good reason to denormalize your schema this way and only good reasons to use proper relational modeling. FWIW I once had to work on a project based on some obscure (and hopefully long dead) PHP cms that had the brilliant idea to use your "serialized lists" anti-pattern, and I can tell you it was both a disaster wrt/ performance and a complete nightmare to maintain. So do yourself and the world a favour: don't try to be creative, follow well-known and established best practices instead, and your life will be much happier. My 2 cents...
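As a rough database-level illustration of the intermediate table plus an ordering column, here is a sketch using the sqlite3 module as a stand-in for MySQL. These are not Django models. The column is called position because ORDER is an SQL keyword, and the helpers insert_post and series, like the post titles, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the foreign keys
conn.executescript("""
CREATE TABLE post     (id INTEGER PRIMARY KEY, title   TEXT NOT NULL);
CREATE TABLE tutorial (id INTEGER PRIMARY KEY, heading TEXT NOT NULL);
CREATE TABLE post_tutorials (
    tutorial_id INTEGER NOT NULL REFERENCES tutorial(id),
    post_id     INTEGER NOT NULL REFERENCES post(id),
    position    INTEGER NOT NULL,
    PRIMARY KEY (tutorial_id, post_id)
);
""")


def insert_post(conn, tutorial_id, post_id, position):
    """Insert a post at `position`, shifting later posts down by one."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE post_tutorials SET position = position + 1 "
            "WHERE tutorial_id = ? AND position >= ?",
            (tutorial_id, position),
        )
        conn.execute(
            "INSERT INTO post_tutorials (tutorial_id, post_id, position) "
            "VALUES (?, ?, ?)",
            (tutorial_id, post_id, position),
        )


def series(conn, tutorial_id):
    """Return the post ids of a tutorial in display order."""
    rows = conn.execute(
        "SELECT post_id FROM post_tutorials WHERE tutorial_id = ? ORDER BY position",
        (tutorial_id,),
    )
    return [post_id for (post_id,) in rows]


conn.executemany("INSERT INTO post (id, title) VALUES (?, ?)",
                 [(i, f"Post {i}") for i in (10, 12, 13, 14)])
conn.execute("INSERT INTO tutorial (id, heading) VALUES (5, 'Series 3 Tutorial')")
for position, post_id in enumerate([12, 13, 14]):
    insert_post(conn, 5, post_id, position)

insert_post(conn, 5, 10, 1)  # put post 10 in the middle of the series
print(series(conn, 5))       # [12, 10, 13, 14]
```

With foreign keys enabled, trying to delete a post that a tutorial still references raises sqlite3.IntegrityError, which is the kind of integrity check a serialized list cannot give you.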

I can think of two approaches:
Approach One: Linked List
One way is using linked list like this:
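class Tutorial(models.Model):
    ...
    # Each tutorial points at the one before it; the first one in a series has previous=None.
    previous = models.OneToOneField('self', null=True, blank=True, on_delete=models.SET_NULL, related_name="next")
In this approach, you can walk through each series, starting from its first tutorial, like this:
for tutorial in Tutorial.objects.filter(previous__isnull=True):
    print(tutorial)
    # The reverse accessor raises DoesNotExist at the end of the chain, so test it with hasattr.
    while hasattr(tutorial, 'next'):
        print(tutorial.next)
        tutorial = tutorial.next
This is a somewhat complicated approach. For example, whenever you want to add a new tutorial in the middle of a linked list, you need to make changes in two places:
post = Tutorial.objects.first()
next_post = getattr(post, 'next', None)  # None if post is the last one
new = Tutorial.objects.create(...)
if next_post is not None:
    # Re-point the follower first so that post is free to be referenced again.
    next_post.previous = new
    next_post.save()
new.previous = post
new.save()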
But there is a big benefit to this approach: you don't have to create a new table for the series. Also, if the order of the tutorials is not modified frequently, this extra work is rarely needed.
Approach Two: Create a new Model
You can simply create a new model and FK to Tutorial, like this:
class Series(models.Model):
    name = models.CharField(max_length=255)
class Tutorial(models.Model):
    ...
    series = models.ForeignKey(Series, null=True, blank=True, on_delete=models.SET_NULL, related_name='tutorials')
    order = models.IntegerField(default=0)
    class Meta:
        unique_together = ('series', 'order')  # makes sure the same order value is not used twice within a series
Then you can access the tutorials in a series by:
series = Series.objects.first()
series.tutorials.all().order_by('order')
The advantage of this approach is that it's much more flexible to access Tutorials through a series, but there will be an extra table created for this, and one extra field as well to maintain the order.

Association objects with history for relationships using ORM

A common type of relationship in schemas is this: a joiner table has a datetime element and is meant to store history about relationships between the rows of two other tables over time. These relationships are one-to-one or one-to-many even though we're using an association table which usually implies many-to-many. At any given point in time only one mapping, the latest at that point in time, is valid. For example:
Tables:
Computer: [id, name, description]
Locations: [id, name, address]
ComputerLocations: [id, computers_id, locations_id, timestamp]
A Computers object can only belong to one Locations object at a time (and Locations can have many Computers), but we store the history in the table. Rows in ComputerLocations aren't deleted, only superseded by new rows at query-time. Perhaps in the future some prune-type event will remove older rows as their usefulness is reduced.
What I'm looking to do is model this in SQLAlchemy, specifically in the ORM, so that a Computers class has the following properties:
A new Computer can be created without (independently of) a location (this makes sense because the location table is separate)
A new Location can be created without (independently of) a computer
If a Computer has a location it must be a member of Locations (foreign key constraint)
When updating an existing Computers object's location, a new row will be added to ComputerLocations with a datetime of NOW()
When creating a new Computers object with a location, a new row will be added to ComputerLocations with a datetime of NOW()
Everything should be atomic (i.e. fail if a new Computer is created but the row associating it to a location can't be created)
Is there a specific design pattern or a concrete method in SQLAlchemy ORM to accomplish this? The documentation has a section on Non-traditional mappings that includes mapping a class against multiple tables and to arbitrary selects, so this looks promising. Further, there was another question on Stack Overflow that mentioned vertical tables. Due to my relative inexperience with SQLAlchemy I cannot yet synthesize this information into a robust and elegant solution. Any help would be greatly appreciated.
I'm using MySQL but a solution should be general enough for any database through the SQLAlchemy dialects system.

What are some good examples for App Engine NDB commenting models?

I'm trying to model a basic linear commenting system for my blog in App Engine (you can see it at http://codeinsider.us). My main classes of objects are:
Users,
Articles,
Comments
One user will have many comments and should be able to view their comments at a glance.
One article will have many comments and should be visible at a glance.
One comment will be associated with exactly one user and exactly one article.
I know how I might build this in a standard relational database - I might have, say, separate tables for comments, users, and articles, with foreign keys to tie them together, uniqueness constraints on articles and users, and none on comments, etc. Nothing fancy.
What's the best way of modeling this in Python App Engine with NDB? ndb.KeyProperty seems interesting, as does StructuredProperty. I don't think I can use StructuredProperty though, since a comment can "belong" to both a User and an Article. But with ndb.KeyProperty, it seems like the keyProperty doesn't do any checking or validation logic, so I'd have to implement that on my own.
The other thing I can do is just throw in the towel, and store giant JSON blobs in Users and Articles representing the Keys and Kinds of comments. That may not be a bad solution.
Any thoughts?
Edit:
This is going to be high-read, low-write. I may add some engagement on comments (upvotes/downvotes), but even then, it will be heavily weighted towards reads.

I recommend thinking carefully about which features you plan to provide, since structuring your models in a particular way may make some changes difficult in the future.
I would do this as follows:
First, assume some eventual consistency. No matter how you design this, you will have some eventual consistency in some queries.
Make a KeyProperty "owner" in article to store the user_key. If you want to achieve strong consistency when querying the articles of a single user, then instead of using the "owner" KeyProperty just make the user_key the parent of the Article (this will create an entity group for the user and its articles, and is fine here).
With comments you can do more things.
If you expect fewer than 100 comments per article (possibly more, depending on the size of the Article entity in the datastore), create a comments KeyProperty(repeated=True) in Article to store all the comment keys, and then fetch them with get_multi (strong consistency).
To create the comment and also modify the Article's comments property you may need a transaction, because you will want both operations to succeed or neither of them. But the two entities are not in the same entity group, so either 1) use a cross-group transaction or 2) make the Article the parent of the comment (this second option has some consequences, discussed later). Counting comments is easy, but the approach is limited to roughly 100 comments, as said before.
Alternatively, create a Comment ndb model with two KeyProperties, "owner" and "article". The article fetches its comments with a query. Querying all the comments of an Article is eventually consistent unless you make the article the parent of the comment (in that case don't create the article KeyProperty, of course). This approach allows lots of comments.
The problem with using entity groups is that, for example, if you allow voting on comments, then a single write operation on each comment will block any other write in the whole entity group of the affected Article. So comment creation and voting by other users may be affected. But don't worry about this if you expect few votes and you keep entity groups small.
If you want to allow comment votes this can get quite complicated, as you may want, for example, only one vote per user. This will require extra relationships that need to be thought through beforehand.
Personally I prefer to assume eventual consistency almost always.
More approaches are possible but I like these two.

A high-read, low-write scenario is the specialty of GAE, so that's a good thing for your purpose.
I'd take advantage of the ancestry feature of GAE Model as it assures you transactional/atomic operations within an entity group. I guess you don't need much of that but it's a good thing to have still.
The right structure is determined by the way you are going to treat/use your data. I'm assuming the typical case in your blog would be to show comments for an article, thus, I'd make your comment model a child of your article model - you could then query comments for a certain (article) ancestor and that would scale magnificently.
I'd include a KeyProperty for the author on the comment, as that would be used mainly to fetch a user from the key I assume. If you want to extend KeyProperty functionality you can do so. Here's an example on how to make KeyProperty behave as ReferenceProperty used to in db. (point 1.)

Django design patterns - models with ForeignKey references to multiple classes

I'm working through the design of a Django inventory tracking application, and have hit a snag in the model layout. I have a list of inventoried objects (Assets), which can either exist in a Warehouse or in a Shipment. I want to store different lists of attributes for the two types of locations, e.g.:
For Warehouses, I want to store the address, manager, etc.
For Shipments, I want to store the carrier, tracking number, etc.
Since each Warehouse and Shipment can contain multiple Assets, but each Asset can only be in one place at a time, adding a ForeignKey relationship to the Asset model seems like the way to go. However, since Warehouse and Shipment objects have different data models, I'm not certain how to best do this.
One obvious (and somewhat ugly) solution is to create a Location model which includes all of the Shipment and Warehouse attributes and an is_warehouse Boolean attribute, but this strikes me as a bit of a kludge. Are there any cleaner approaches to solving this sort of problem (Or are there any non-Django Python libraries which might be better suited to the problem?)

What about having a generic foreign key on Assets?

I think it's perfectly reasonable to create a "through" table such as location, which associates an asset, a content (foreign key) and a content_type (warehouse or shipment). And you could set a unique constraint on the asset_fk so that it can only exist in one location at a time.
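A database-level sketch of that "through" table, using the sqlite3 module rather than Django models, might look like this. The content_type and content_id pair plays roughly the role of the two columns a Django generic foreign key stores. The table layout and the helpers place and where_is are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asset (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE location (
    -- UNIQUE: an asset can have at most one location row at a time
    asset_id     INTEGER NOT NULL UNIQUE REFERENCES asset(id),
    content_type TEXT    NOT NULL CHECK (content_type IN ('warehouse', 'shipment')),
    content_id   INTEGER NOT NULL  -- id of the warehouse or shipment row
);
""")


def place(conn, asset_id, content_type, content_id):
    """Put an asset somewhere, replacing its previous location if it had one."""
    with conn:  # delete + insert happen in one transaction
        conn.execute("DELETE FROM location WHERE asset_id = ?", (asset_id,))
        conn.execute(
            "INSERT INTO location (asset_id, content_type, content_id) VALUES (?, ?, ?)",
            (asset_id, content_type, content_id),
        )


def where_is(conn, asset_id):
    """Return (content_type, content_id) for an asset, or None if unplaced."""
    return conn.execute(
        "SELECT content_type, content_id FROM location WHERE asset_id = ?",
        (asset_id,),
    ).fetchone()


conn.execute("INSERT INTO asset (id, name) VALUES (1, 'forklift')")
place(conn, 1, "warehouse", 7)
place(conn, 1, "shipment", 42)  # moving the asset replaces the old row
print(where_is(conn, 1))        # ('shipment', 42)
```

Inserting a second location row for the same asset directly, without removing the first, fails with sqlite3.IntegrityError because of the UNIQUE constraint.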
