Best way to denormalize data in Django? [closed]

Closed 11 years ago.
I'm developing a simple web app, and it makes a lot of sense to store some denormalized data.
Imagine a blogging platform that keeps track of Comments, and the BlogEntry model has a "CommentCount" field that I'd like to keep up to date.
One way of doing this would be to use Django signals.
Another way of doing this would be to put hooks directly in my code that creates and destroys Comment objects to synchronously call some methods on BlogEntry to increment/decrement the comment count.
I suppose there are other pythonic ways of accomplishing this with decorators or some other voodoo.
What is the standard Design Pattern for denormalizing in Django? In practice, do you also have to write consistency checkers and data fixers in case of errors?

You have managers in Django.
Use a customized manager to do creates and maintain the FK relationships.
The manager can update the counts as the sets of children are updated.
If you don't want to make customized managers, just extend the save method. Everything you want to do for denormalizing counts and sums can be done in save.
You don't need signals. Just extend save.

I found django-denorm to be useful. It uses database-level triggers instead of signals, but as far as I know, there is also a branch based on a different approach.
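To make the trigger idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. It does not show django-denorm's own API; the table, column and trigger names are assumptions chosen to mirror the BlogEntry/Comment example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema mirroring BlogEntry / Comment from the question.
    CREATE TABLE blog_entry (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        comment_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        blog_entry_id INTEGER NOT NULL REFERENCES blog_entry(id)
    );
    -- The database itself keeps the denormalized counter in sync.
    CREATE TRIGGER comment_added AFTER INSERT ON comment
    BEGIN
        UPDATE blog_entry SET comment_count = comment_count + 1
        WHERE id = NEW.blog_entry_id;
    END;
    CREATE TRIGGER comment_removed AFTER DELETE ON comment
    BEGIN
        UPDATE blog_entry SET comment_count = comment_count - 1
        WHERE id = OLD.blog_entry_id;
    END;
""")

conn.execute("INSERT INTO blog_entry (id, title) VALUES (1, 'First')")
conn.execute("INSERT INTO comment (body, blog_entry_id) VALUES ('nice', 1)")
conn.execute("INSERT INTO comment (body, blog_entry_id) VALUES ('thanks', 1)")
conn.execute("DELETE FROM comment WHERE body = 'nice'")

# Two inserts and one delete leave a single comment on the entry.
count = conn.execute(
    "SELECT comment_count FROM blog_entry WHERE id = 1"
).fetchone()[0]
print(count)
```

This sketch does not cover a comment being moved to a different entry with an UPDATE; that case would need a third trigger.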

The first approach (signals) has the advantage of loosening the coupling between models.
However, signals are somewhat harder to maintain, because dependencies are less explicit (at least, in my opinion).
If the correctness of the comment count is not so important, you could also think of a cron job that will update it every n minutes.
However, no matter the solution, denormalizing will make maintenance more difficult; for this reason I would try to avoid it as much as possible, resorting instead to caches or other techniques. For example, using {% with comments.count as cnt %} in templates may improve performance quite a lot.
Then, if everything else fails, and only in that case, think about what could be the best approach for the specific problem.
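The periodic recount mentioned above (which also works as a consistency fixer) could look roughly like the following. This is a sketch against plain sqlite3 rather than the Django ORM; the function name rebuild_comment_counts and the table and column names are assumptions for illustration.

```python
import sqlite3


def rebuild_comment_counts(conn):
    """Recompute stale comment_count values; return how many rows were fixed."""
    cur = conn.execute("""
        UPDATE blog_entry
        SET comment_count = (
            SELECT COUNT(*) FROM comment
            WHERE comment.blog_entry_id = blog_entry.id
        )
        WHERE comment_count != (
            SELECT COUNT(*) FROM comment
            WHERE comment.blog_entry_id = blog_entry.id
        )
    """)
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema; entry 1 starts with a deliberately wrong count.
    CREATE TABLE blog_entry (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        comment_count INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        blog_entry_id INTEGER NOT NULL REFERENCES blog_entry(id)
    );
    INSERT INTO blog_entry (id, title, comment_count) VALUES (1, 'First', 5);
    INSERT INTO blog_entry (id, title, comment_count) VALUES (2, 'Second', 0);
    INSERT INTO comment (body, blog_entry_id) VALUES ('nice', 1);
    INSERT INTO comment (body, blog_entry_id) VALUES ('thanks', 1);
""")

fixed = rebuild_comment_counts(conn)
counts = conn.execute(
    "SELECT id, comment_count FROM blog_entry ORDER BY id"
).fetchall()
print(fixed, counts)
```

Because only rows whose stored value disagrees with the real count are touched, the return value also tells you how far the denormalized data had drifted.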

Django offers a great and efficient (though not well-known) alternative to counter denormalization.
It will save you many lines of code, and it's really fast since you retrieve the count in the same SQL query.
I will suppose you have these classes:
from django.db import models

class BlogEntry(models.Model):
    title = models.CharField(max_length=200)
    ...

class Comment(models.Model):
    body = models.TextField()
    blog_entry = models.ForeignKey(BlogEntry, on_delete=models.CASCADE)
In your views.py, use annotations:
from django.db.models import Count

def blog_entry_list(request):
    # 'comment' is the default reverse query name for Comment.blog_entry
    blog_entries = BlogEntry.objects.annotate(count=Count('comment'))
    ...
And you will have an extra field on each BlogEntry that contains the count of comments, plus the rest of the BlogEntry fields.
You can use this extra field in the templates too:
{% for blog_entry in blog_entries %}
{{ blog_entry.title }} has {{ blog_entry.count }} comments!
{% endfor %}
This will not only save you coding and maintenance time but it is really efficient (the query takes only a bit longer to be executed).
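To see why the count comes back in the same query, it may help to look at the kind of SQL such an annotation corresponds to: a LEFT OUTER JOIN with a GROUP BY. The sketch below runs that query by hand with sqlite3; the table and column names are assumptions, and the SQL Django actually generates will differ in naming and detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables standing in for BlogEntry and Comment.
    CREATE TABLE blog_entry (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE comment (
        id INTEGER PRIMARY KEY,
        body TEXT NOT NULL,
        blog_entry_id INTEGER NOT NULL REFERENCES blog_entry(id)
    );
    INSERT INTO blog_entry (id, title) VALUES (1, 'First');
    INSERT INTO blog_entry (id, title) VALUES (2, 'Second');
    INSERT INTO comment (body, blog_entry_id) VALUES ('nice', 1);
    INSERT INTO comment (body, blog_entry_id) VALUES ('thanks', 1);
""")

# One query returns each entry together with its comment count.
# The LEFT OUTER JOIN keeps entries that have no comments (count 0).
rows = conn.execute("""
    SELECT e.title, COUNT(c.id) AS comment_count
    FROM blog_entry AS e
    LEFT OUTER JOIN comment AS c ON c.blog_entry_id = e.id
    GROUP BY e.id, e.title
    ORDER BY e.id
""").fetchall()
print(rows)
```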

Why not just get the set of comments, and find the number of elements, using the count() method:
count = blog_entry.comment_set.count()
Then you can pass that into your template.
Or, alternatively, in the template itself, you can do:
{{ blog_entry.comment_set.count }}
to get the number of comments.

Related

Why doesn't Django have an on_update=models.CASCADE option?

My question comes from a situation where I want to emulate SQL's ON UPDATE CASCADE in Django (when I update an id that a ForeignKey points to, the foreign key is updated automatically), but I realized that there is (apparently) no native way to do it.
I have visited this old question where he is trying to do the same.
My question is: why doesn't Django have a native option? Is it bad practice? If so, why? Would it cause problems for my software or its structure?
Is there a simple way to do this?
Thanks in advance!

It is bad practice to update key fields. Note that this field is used for index generation and other computationally expensive operations. This is why key fields must contain a unique, non-repeating value that identifies the record. The value does not need to have any direct meaning related to the record's content; it can be an auto-incrementing number. If you need to update it, consider putting the value in another field of the table so that you can change it without affecting the relationships.
It is for this reason that you will not find in Django the operation you are looking for.
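For reference, the database-level behaviour the question refers to can be seen in isolation with sqlite3. This is only a sketch of what ON UPDATE CASCADE does in SQL, not a way to enable it through Django; the author and book tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        author_id INTEGER NOT NULL
            REFERENCES author(id) ON UPDATE CASCADE
    );
    INSERT INTO author (id, name) VALUES (1, 'Ann');
    INSERT INTO book (id, title, author_id) VALUES (1, 'First', 1);
""")

# Changing the parent key rewrites the referencing column in the child row.
conn.execute("UPDATE author SET id = 42 WHERE id = 1")
author_id = conn.execute("SELECT author_id FROM book WHERE id = 1").fetchone()[0]
print(author_id)
```

The cascade is a rewrite of every referencing row, which is the kind of cost the answer above warns about.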

Django - short non-linear non-predictable ID in the URL

I know there are similar questions (like this, this, this and this) but I have specific requirements and am looking for a less expensive way to do the following (on Django 1.10.2):
Looking to not have sequential/guessable integer ids in the URLs and ideally meet the following requirements:
Avoid UUIDs since that makes the URL really long.
Avoid a custom primary key. It doesn’t seem to work well if the models have ManyToManyFields. Got affected by at least three bugs while trying that (#25012, #24030 and #22997), including messing up the migrations and having to delete the entire db and recreating the migrations (well, lots of good learning too)
Avoid checking for collisions if possible (hence avoid a db lookup for every insert)
Don’t just want to look up by the slug since it’s less performant than just looking up an integer id.
Don’t care too much about encrypting the id - just don’t want it to be a visibly sequential integer.
Note: The app would likely have 5 million records or so in the long term.

After researching a lot of options on SO, blogs etc., I ended up doing the following:
Encoding the id to base32 only for the URLs and decoding it back in urls.py (using an edited version of Django’s util functions to encode to base 36 since I needed uppercase letters instead of lowercase).
Not storing the encoded id anywhere. Just encoding and decoding every time on the fly.
Keeping the default id intact and using it as primary key.
(good hints, posts and especially this comment helped a lot)
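The encode/decode step described above can be sketched in a few lines. This is not the author's code: the function names, the base-32 digit set, and the choice of uppercase letters are assumptions for illustration.

```python
# Assumed digit set: 0-9 followed by A-V, i.e. uppercase base 32.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_id(pk: int) -> str:
    """Turn a non-negative integer primary key into a short URL token."""
    if pk < 0:
        raise ValueError("primary keys are expected to be non-negative")
    if pk == 0:
        return ALPHABET[0]
    digits = []
    while pk:
        pk, remainder = divmod(pk, 32)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_id(token: str) -> int:
    """Turn a URL token back into the integer primary key."""
    # int() understands the 0-9A-V digit set for base 32 and raises
    # ValueError for characters outside it.
    return int(token, 32)


token = encode_id(5_000_000)
print(token, decode_id(token))
```

With around 5 million records the token stays at five characters. Nothing is stored, so there is nothing to collide, but the mapping is trivially reversible and consecutive ids still produce consecutive tokens.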
What this solution helps achieve:
Absolutely no edits to models or post_save signals.
No collision checks needed. Avoiding one extra request to the db.
Lookup still happens on the default id which is fast. Also, no double save() requests on the model for every insert.
Short and sweet encoded ID (the number of characters goes up as the number of records increases but is still not very long)
What it doesn’t help achieve/any drawbacks:
Encryption - the ID is encoded but not encrypted, so the user may still be able to figure out the pattern to get to the id (but I don’t care about it much, as mentioned above).
A tiny overhead of encoding and decoding on each URL construction/request but perhaps that’s better than collision checks and/or multiple save() calls on the model object for insertions.
For reference, looks like there are multiple ways to generate random IDs that I discovered along the way (like Django’s get_random_string, Python’s random, Django’s UUIDField etc.) and many ways to encode the current ID (base 36, base 62, XORing, and what not).
The encoded ID can also be stored as another (indexed) field and looked up every time (like here), but that depends on the performance parameters of the web app (since looking up a varchar id is less performant than looking up an integer id). This identifier field can either be saved from an overridden save() method on the model, or by using a post_save signal (see here), although both approaches need save() to be called twice for every insert.
All ears for optimizations to the above approach. I love SO and the community. Every time there’s so much to learn here.
Update: More than a year after this post, I found this great library called hashids which does pretty much the same thing quite well! It’s available in many languages including Python.

Passing string rather than function in django url pattern

In the Django docs it says about url patterns:
It is possible to pass a string containing the path to a view rather
than the actual Python function object. This alternative is supported
for the time being, though is not recommended and will be removed in a
future version of Django.
Does anyone have any insight as to why this is the case? I find this alternative to be quite handy and can't find anything explaining why this is a bad (or, at least, less than ideal) idea.

I think the 1.8 Release Notes in the repo explain it quite well. Here's a summary of the main points:
In the modern era, we have updated the tutorial to instead recommend importing
your views module and referencing your view functions (or classes) directly.
This has a number of advantages, all deriving from the fact that we are using
normal Python in place of "Django String Magic": the errors when you mistype a
view name are less obscure, IDEs can help with autocompletion of view names,
etc.
Thus patterns() serves little purpose and is a burden when teaching new users
(answering the newbie's question "why do I need this empty string as the first
argument to patterns()?"). For these reasons, we are deprecating it.
Updating your code is as simple as ensuring that urlpatterns is a list of
django.conf.urls.url() instances.

How can I do this in SQLAlchemy?

I have an SQLAlchemy model, say Entity, with a column is_published. When I query it, say by id, I only want the row returned if is_published is True. I think I can achieve this with a filter. But when there is a relationship that I need to access as another_model_obj.entity, I also want it to give the corresponding Entity object only if is_published is True for that instance. How should I do this? One solution would be to wrap every access in an if block, but I use this in too many places, and I would have to remember this detail each time I add another. Is there any way to automate this in SQLAlchemy, or is there a better solution to the problem as a whole? Thank you.

It reads as if you would need a join:
session().query(AnotherModel).join(Entity).filter(Entity.is_published)

This question was asked many times here and still doesn't have a good answer. Here are possible solutions suggested by the author of SQLAlchemy. A more elaborate query class to exclude unpublished objects is provided in iktomi library. It works with SQLAlchemy 0.8.* branch only for now but should be ported to 0.9.* soon. See test cases for limitations (tests that fail are marked with #unittest.skip()) and usage examples.

Can django lazy-load fields in a model?

One of my django models has a large TextField which I often don't need to use. Is there a way to tell django to "lazy-load" this field? i.e. not to bother pulling it from the database unless I explicitly ask for it. I'm wasting a lot of memory and bandwidth pulling this TextField into python every time I refer to these objects.
The alternative would be to create a new table for the contents of this field, but I'd rather avoid that complexity if I can.

This is done when you make the query, using the defer() method, rather than in the model definition. Check it out here in the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/#defer
Now, actually, your alternative solution of refactoring and pulling the data into another table is a really good solution. Some people would say that the need to lazy load fields means there is a design flaw, and the data should have been modeled differently.
Either way works, though!

There are two options for lazy-loading in Django: https://docs.djangoproject.com/en/1.6/ref/models/querysets/#django.db.models.query.QuerySet.only
defer(*fields)
Avoid loading those fields that require expensive processing to convert them to Python objects.
Entry.objects.defer("text")
only(*fields)
Only load the fields that you actually need
Person.objects.only("name")
Personally, I think only is better than defer since the code is not only easier to understand, but also more maintainable in the long run.

For something like this you can just override the default manager. Usually, it's not advised but for a defer() it makes sense:
from django.db import models

class CustomManager(models.Manager):
    def get_queryset(self):
        # Defer the large text field for every query made through this manager
        return super(CustomManager, self).get_queryset().defer('YOUR_TEXTFIELD_FIELDNAME')

class DjangoModel(models.Model):
    objects = CustomManager()
