We want to reduce the amount of same object instances in one python interpreter.
Example:
class Blog(models.Model):
author=models.ForeignKey(User)
If we iterate the thousand blogs, the same (same id but different python object) author objects get created several times.
Is there a way to make the django ORM reuse the already created user instances?
Example:
for blog in Blog.objects.all():
print (blog.author.username)
If author "foo-writer" has 100 blogs, there are 100 author objects in the memory. That's what we want to avoid.
I think solutions like mem-cached/redis won't help here, since we want to optimize the python process.
I'm not sure if it's database calls or memory usage you're concerned about here.
If the former, then using select_related will help you:
Blog.objects.all().select_related('author')
which will get all the blogs and their associated authors.
If you want to optimize memory, then the best way to do it is to get the relevant author objects manually in one go, store them in a dict, then manually annotate that object on each blog:
blogs = Blog.objects.all()
author_ids = set(b.author_id for b in blogs)
authors = Author.objects.in_bulk(list(author_ids))
for blog in blogs:
blog._author = authors[blog.author_id]
Related
I've been trying to build a Tutorial system that we usually see on websites. Like the ones we click next -> next -> previous etc to read.
All Posts are stored in a table(model) called Post. Basically like a pool of post objects.
Post.objects.all() will return all the posts.
Now there's another Table(model)
called Tutorial That will store the following,
class Tutorial(models.Model):
user = models.ForeignKey(User, on_delete=models.CASCADE)
tutorial_heading = models.CharField(max_length=100)
tutorial_summary = models.CharField(max_length=300)
series = models.CharField(max_length=40) # <---- Here [10,11,12]
...
Here entries in this series field are post_ids stored as a string representation of a list.
example: series will have [10,11,12] where 10, 11 and 12 are post_id that correspond to their respective entries in the Post table.
So my table entry for Tutorial model looks like this.
id heading summary series
"5" "Series 3 Tutorial" "lorem on ullt consequat." "[12, 13, 14]"
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
Now, I've read from several stackoverflow posts that having multiple entries in a single field is a bad idea. And having this relationship to span over multiple tables as a mapping is a better option.
What I want to have is the ability to insert new posts into this series anywhere I want. Maybe in the front or middle. This can be easily accomplished by treating this series as a list and inserting as I please. Altering "[14,12,13]" will reorder the posts that are being displayed.
My question is, Is this way of storing multiple values in field for my usecase is okay. Or will it take a performance hit Or generally a bad idea. If no then is there a way where I can preserve or alter order by spanning the relationship by using another table or there is an entirely better way to accomplish this in Django or MYSQL.
Here entries in this series field are post_ids stored as a string representation of a list.
(...)
So I just read the series field and get all the Posts with the ids in this list then display them using pagination in Django.
DON'T DO THIS !!!
You are working with a relational database. There is one proper way to model relationships between entities in a relational database, which is to use foreign keys. In your case, depending on whether a post can belong only to a single tutorial ("one to many" relationship) or to many tutorials at the same time ("many to many" relationship, you'll want either to had to post a foreign key on tutorial, or to use an intermediate "post_tutorials" table with foreign keys on both post and tutorials.
Your solution doesn't allow the database to do it's job properly. It cannot enforce integrity constraints (what if you delete a post that's referenced by a tutorial ?), it cannot optimize read access (with proper schema the database can retrieve a tutorial and all it's posts in a single query) , it cannot follow reverse relationships (given a post, access the tutorial(s) it belongs to) etc. And it requires an external program (python code) to interact with your data, while with proper modeling you just need standard SQL.
Finally - but this is django-specific - using proper schema works better with the admin features, and with django rest framework if you intend to build a rest API.
wrt/ the ordering problem, it's a long known (and solved) issue, you just need to add an "order" field (small int should be enough). There are a couple 3rd part django apps that add support for this to both your models and the admin so it's almost plug and play.
IOW, there are absolutely no good reason to denormalize your schema this way and only good reasons to use proper relational modeling. FWIW I once had to work on a project based on some obscure (and hopefully long dead) PHP cms that had the brillant idea to use your "serialized lists" anti-pattern, and I can tell you it was both a disaster wrt/ performances and a complete nightmare to maintain. So do yourself and the world a favour: don't try to be creative, follow well-known and established best practices instead, and your life will be much happier. My 2 cents...
I can think of two approaches:
Approach One: Linked List
One way is using linked list like this:
class Tutorial(models.Model):
...
previous = models.OneToOneField('self', null=True, blank=True, related_name="next")
In this approach, you can access the previous Post of the series like this:
for tutorial in Tutorial.objects.filter(previous__isnull=True):
print(tutorial)
while(tutorial.next_post):
print(tutorial.next)
tutorial = tutorial.next
This is kind of complicated approach, for example whenever you want to add a new tutorial in middle of a linked-list, you need to change in two places. Like:
post = Tutorial.object.first()
next_post = post.next
new = Tutorial.objects.create(...)
post.next=new
post.save()
new.next = next_post
new.save()
But there is a huge benefit in this approach, you don't have to create a new table for creating series. Also, there is possibility that the order in tutorials will not be modified frequently, which means you don't need to take too much hassle.
Approach Two: Create a new Model
You can simply create a new model and FK to Tutorial, like this:
class Series(models.Model):
name = models.CharField(max_length=255)
class Tutorial(models.Model):
..
series = models.ForeignKey(Series, null=True, blank=True, related_name='tutorials')
order = models.IntegerField(default=0)
class Meta:
unique_together=('series', 'order') # it will make sure that duplicate order for same series does not happen
Then you can access tutorials in series by:
series = Series.object.first()
series.tutorials.all().order_by('tutorials__order')
Advantage of this approach is its much more flexible to access Tutorials through series, but there will be an extra table created for this, and one extra field as well to maintain order.
I'm trying to provide an interface for the user to write custom queries over the database. I need to make sure they can only query the records they are allowed to. In order to do that, I decided to apply row based access control using django-guardian.
Here is how my schemas look like
class BaseClass(models.Model):
somefield = models.TextField()
class Meta:
permissions = (
('view_record', 'View record'),
)
class ClassA(BaseClass):
# some other fields here
classb = models.ForeignKey(ClassB)
class ClassB(BaseClass):
# some fields here
classc = models.ForeignKey(ClassC)
class ClassC(BaseClass):
# some fields here
I would like to be able to use get_objects_for_group as follows:
>>> group = Group.objects.create('some group')
>>> class_c = ClassC.objects.create('ClassC')
>>> class_b = ClassB.objects.create('ClassB', classc=class_c)
>>> class_a = ClassA.objects.create('ClassA', classb=class_b)
>>> assign_perm('view_record', group, class_c)
>>> assign_perm('view_record', group, class_b)
>>> assign_perm('view_record', group, class_a)
>>> get_objects_for_group(group, 'view_record')
This gives me a QuerySet. Can I use the BaseClass that I defined above and write a raw query over other related classes?
>>> qs.intersection(get_objects_for_group(group, 'view_record'), \
BaseClass.objects.raw('select * from table_a a'
'join table_b b on a.id=b.table_a_id '
'join table_c c on b.id=c.table_b_id '
'where some conditions here'))
Does this approach make sense? Is there a better way to tackle this problem?
Thanks!
Edit:
Another way to tackle the problem might be creating a separate table for each user. I understand the complexity this might add to my application but:
The number of users will not be more than 100s for a long time. Not a consumer application.
Per our use case, it's quite unlikely that I'll need to query across these tables. I won't write a query that needs to aggregate anything from table1, table2, table3 that belongs to the same model.
Maintaining a separate table per customer could have an advantage.
Do you think this is a viable approach?
After researching many options I found out that I can solve this problem at the database level using Row Level Security on PostgreSQL. It ends up being the easiest and the most elegant.
This article helped me a lot to bridge the application level users with PostgreSQL policies.
What I learned by doing my research is:
Separate tables could still be an option in the future when customers can potentially affect each others' query performances since they are allowed to run arbitrary queries.
Trying to solve it at the ORM level is almost impossible if you are planning to use raw or ad-hoc queries.
I think you already know what you need to do. The word you are looking for is multitenancy. Although it is not one table per customer. The best suit for you will be one schema per customer. Unfortunately, the best article I had on multitenancy is no more available. See if you can find a cached version: https://msdn.microsoft.com/en-us/library/aa479086.aspx otherwise there are numerous articles availabe on the internet.
Another viable approach is to take a look at custom managers. You could write one custom manager for each Model-Customer and query it accordingly. But all this will lead to application complexity and will soon get out of your hand. Any bug in the application security layer is a nightmare to you.
Weighing both I'd be inclined to say multitenancy solution as you said in your edit is by far the best approach.
First, you should provide us with more details, how is your architecture set and built, with django so that we can help you. Have you implemented an API? using django template is not really a good idea if you are building a large scale application, consuming a lot of data.Because this can affect the query load massively.I can suggest extracting your front-end from the backend.
I want to make a model where I have the option of pulling a group of items and their descriptions from a postgres database based on tags. What is the most efficient way of doing this for performance using Django 1.9 and Postgres 9.5 on data that I do not really modify often?
I found multiple ways of doing this:
Toxi solution
Where 3 tables are made, one storing the item and descriptions, second storing the tags, and a third table associating tags with items. See here:
What is the most efficient way to store tags in a database?
Using Array fields
Where there is one table with items, descriptions, and an array field that stores tags
Using JSON fields
Where there is one table with items, descriptions, and an JSON field that stores tags. The django docs mention this uses JSONB: https://docs.djangoproject.com/en/1.11/ref/contrib/postgres/fields/
I am guessing that using JSON fields like the following:
from django.contrib.postgres.fields import JSONField
from django.db import models
class Item(models.Model):
name = models.CharField()
description = models.TextField(blank=True, null=True)
tags = JSONField()
is the most efficient for reading data but I am not really sure.
No one of theese is "most efficient", it really depends of what you're focused in.
First is classical relation approach, that provides consistency and normalisation. It also allows you to to retrieve complex information (i.e. give me all items with a tag T) pretty easy (but usually with a cost of joins). I think this should be your goto solution if you're using RDBMS.
Second and third (there mostly the same in your case) are easier for simple retrieve and store (i.e. give me all tags of an item I), but lead to a more work for keeping your data consistent and performing complex queries.
How do we implement agregation or composition with NDB on Google App Engine ? What is the best way to proceed depending on use cases ?
Thanks !
I've tried to use a repeated property. In this very simple example, a Project have a list of Tag keys (I have chosen to code it this way instead of using StructuredProperty because many Project objects can share Tag objects).
class Project(ndb.Model):
name = ndb.StringProperty()
tags = ndb.KeyProperty(kind=Tag, repeated=True)
budget = ndb.FloatProperty()
date_begin = ndb.DateProperty(auto_now_add=True)
date_end = ndb.DateProperty(auto_now_add=True)
#classmethod
def all(cls):
return cls.query()
#classmethod
def addTags(cls, from_str):
tagname_list = from_str.split(',')
tag_list = []
for tag in tagname_list:
tag_list.append(Tag.addTag(tag))
cls.tags = tag_list
--
Edited (2) :
Thanks. Finally, I have chosen to create a new Model class 'Relation' representing a relation between two entities. It's more an association, I confess that my first design was unadapted.
An alternative would be to use BigQuery. At first we used NDB, with a RawModel which stores individual, non-aggregated records, and an AggregateModel, which a stores the aggregate values.
The AggregateModel was updated every time a RawModel was created, which caused some inconsistency issues. In hindsight, properly using parent/ancestor keys as Tim suggested would've worked, but in the end we found BigQuery much more pleasant and intuitive to work with.
We just have cronjobs that run everyday to push RawModel to BigQuery and another to create the AggregateModel records with data fetched from BigQuery.
(Of course, this is only effective if you have lots of data to aggregate)
It really does depend on the use case. For small numbers of items StructuredProperty and repeated properties may well be the best fit.
For large numbers of entities you will then look at setting the parent/ancestor in the Key for composition, and have a KeyProperty pointing to the primary entity in a many to one aggregation.
However the choice will also depend heavily on the actual use pattern as well. Then considerations of efficiency kick in.
The best I can suggest is consider carefully how you plan to use these relationships, how active are they (ie are they constantly changing, adding, deleting), do you need to see all members of the relation most of the time, or just subsets. These consideration may well require adjustments to the approach.
For instance, I have an Author model, which contains 2 parts: author profile, and his articles.
class Author:
# Author profile properties.
name = ...
age = ...
# Author articles properties.
...
When designing the model, I got two options:
1) Create a separate model class called article and add a list of reference in the Author model.
2) Just define all these into the same model and use projection query to read profile.
So which one is better in the sense of read/write/update cost given the fact that most of the read is for profile? What if the article property will get larger and larger in the future?
A single entity if possible is the best solution. But what kind of queries do you use? Do have to query all the articles including the content or only the article metadata. Because you can put a lot of metadata is a single 1Mb entity.
So the real questions are: what kind of queries do you need, how much metadata do you have and what do you do with the result set, like showing a user a result page using a cursor.
Maybe you can use the search API : https://developers.google.com/appengine/docs/python/search/#Putting_Documents_in_an_Index