Django queryset update performance and optimization

I've created a bulk delete function that updates all enabled items' is_active flag. I've tried updating 5000 records with the following statement
Item.objects.filter(owner=request.user.profile, enabled=True, is_active=True).update(is_active=False)
But it is painfully slow and I'm afraid that this is causing my server to run out of memory.
I've previously had the following and it was still quite slow.
items = Item.objects.filter(owner=request.user.profile, enabled=True, is_active=True)
for item in items:
    item.is_active = False
    item.save()
The database being used is SQLite and I am using Django 1.7.
I wish to optimize this operation as much as possible. Any pointers or good query optimization docs would be appreciated.

You say that you are deleting, but in your code you are updating the rows rather than deleting them. That aside, the format you are using in the first snippet is the way to go.
To increase performance you can use index_together with the owner, enabled and is_active fields (note that this adds some write overhead when inserting items).
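For reference, a minimal sketch of what that could look like on the model (the Item fields here are assumptions based on your query; in Django 1.7, index_together lives on Meta):
from django.db import models

class Item(models.Model):
    owner = models.ForeignKey('accounts.Profile')   # hypothetical related model
    enabled = models.BooleanField(default=True)
    is_active = models.BooleanField(default=True)

    class Meta:
        # One composite index covering exactly the filter used in the update()
        index_together = [('owner', 'enabled', 'is_active')]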
But, as @Selcuk commented, if you are aiming for performance, use a more serious database backend such as PostgreSQL.
Btw, take a look at the database optimization docs Django offers so you can pick up some tricks for future implementations ;).

Related

Delete entries from ManyToMany table using _raw_delete

I have a huge amount of data in my DB.
I cannot use the .delete() method because the performance of the Django ORM is insufficient in my case.
The _raw_delete() method suits me because I can do it in Python instead of using raw SQL.
But I have a problem: I have no idea how I can delete the relation (through) tables using _raw_delete. They need to be deleted before the models because I have a RESTRICT constraint in the DB. Any ideas how I can achieve this?
I have found a solution.
You can operate on the link (through) model like this:
link_model = MyModel._meta.get_field('my_m2m_field').remote_field.through
qs = link_model.objects.filter(mymodel_id__in=mymodel_ids)
qs._raw_delete(qs.db)
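Putting it together, a minimal sketch of the full delete order (the stale=True filter is a placeholder for whatever selects your rows):
# Collect the ids of the rows to remove (filter is a placeholder)
mymodel_ids = list(
    MyModel.objects.filter(stale=True).values_list('pk', flat=True)
)

# 1. Raw-delete the m2m through rows first, satisfying the RESTRICT constraint
link_model = MyModel._meta.get_field('my_m2m_field').remote_field.through
link_qs = link_model.objects.filter(mymodel_id__in=mymodel_ids)
link_qs._raw_delete(link_qs.db)   # one DELETE, no signals or cascade collection

# 2. Then raw-delete the models themselves
model_qs = MyModel.objects.filter(pk__in=mymodel_ids)
model_qs._raw_delete(model_qs.db)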

How can I track all SQL query timings and counts in Django?

I'd like to have a Django application record how much time each SQL query took.
The first problem is that SQL queries differ, even when they originate from the same code. That can be solved by normalizing them, so that
SELECT first_name, last_name FROM people WHERE NOW() - birth_date < interval '20' years;
would become something like
SELECT $ FROM people WHERE $ - birth_date < $;
After getting that done, we could just log the normalized query and the query timing to a file, syslog or statsd (for statsd, I'd probably also use a hash of the query as a key, and keep an index of hash->query relations elsewhere).
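A rough sketch of that normalization, assuming a few regexes are good enough for your query shapes (a real SQL parser would be more robust):
import hashlib
import re

def normalize_sql(sql):
    # Collapse the select list and literals so structurally identical
    # queries compare (and hash) equal
    sql = re.sub(r"(?is)select\s+.*?\s+from\s+", "SELECT $ FROM ", sql)
    sql = re.sub(r"'[^']*'", "$", sql)     # string literals
    sql = re.sub(r"\b\d+\b", "$", sql)     # numeric literals
    return sql

def query_key(sql):
    # Short stable key for statsd; keep the hash -> query index elsewhere
    return hashlib.sha1(normalize_sql(sql).encode('utf-8')).hexdigest()[:12]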
The bigger problem, however, is figuring out where that action can be performed. The best place for that I could find is this: https://github.com/django/django/blob/b5bacdea00c8ca980ff5885e15f7cd7b26b4dbb9/django/db/backends/util.py#L46 (note: we do use that ancient version of Django, but I'm fine with suggestions that are relevant only to newer versions).
Ideally, I'd like to make this a Django extension rather than modifying the Django source code. It sounds like I could create another backend, inheriting from the one we currently use, and make its CursorWrapper class's execute method record the timing and increment a counter.
Is that the right approach, or should I be using some other primitives, like QuerySet or something?
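(For what it's worth, newer Django, 2.0 and later, exposes exactly this as a supported hook, connection.execute_wrapper, so no custom backend is needed; a minimal sketch, with the logger name being my own choice:)
import logging
import time
from django.db import connection

logger = logging.getLogger('sql_timings')

def timing_wrapper(execute, sql, params, many, context):
    # Called by Django around every query on this connection
    start = time.monotonic()
    try:
        return execute(sql, params, many, context)
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        logger.info('%.3f ms  %s', elapsed_ms, sql)

# Installed per-request, e.g. from a middleware:
# with connection.execute_wrapper(timing_wrapper):
#     response = get_response(request)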
Django Debug Toolbar has a panel that shows "SQL queries including time to execute and links to EXPLAIN each query":
http://django-debug-toolbar.readthedocs.io/en/stable/panels.html#sql
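The usual dev-only setup for that panel looks roughly like this (see the linked docs for the authoritative steps):
# settings.py (development only)
INSTALLED_APPS += ['debug_toolbar']
MIDDLEWARE += ['debug_toolbar.middleware.DebugToolbarMiddleware']
INTERNAL_IPS = ['127.0.0.1']   # the toolbar only renders for these IPs
# urls.py additionally needs: path('__debug__/', include('debug_toolbar.urls'))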

Sharding a Django Project

I'm starting a Django project and need to shard multiple tables that are likely to all have too many rows. I've looked through threads here and elsewhere, and followed the Django multi-db documentation, but am still not sure how it all stitches together. My models have relationships that would be broken by sharding, so it seems like the options are either to drop the foreign keys or forgo sharding the respective models.
For argument's sake, consider the classic Author, Publisher and Book scenario, but throw in book copies and the users that can own them. Say books and users had to be sharded. How would you approach that? A user may own a copy of a book that's not in the same database.
In general, what are the best practices you have used for the routing and the sharding itself? Did you use Django database routers, manually select a database inside commands based on your sharding logic, or override some parts of the ORM to achieve that?
I'm using PostgreSQL on Ubuntu, if it matters.
Many thanks.
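To make the router option concrete, here is a minimal sketch assuming two shard aliases in DATABASES and placement by pk modulo; in practice you would hash a stable sharding key (e.g. user id) instead, and all names here are hypothetical:
SHARDED_MODELS = {'book', 'user'}
SHARDS = ['shard_0', 'shard_1']   # aliases defined in settings.DATABASES

class ShardRouter(object):
    def _pick(self, hints):
        instance = hints.get('instance')
        if instance is not None and instance.pk is not None:
            return SHARDS[instance.pk % len(SHARDS)]
        return None

    def db_for_read(self, model, **hints):
        if model._meta.model_name in SHARDED_MODELS:
            return self._pick(hints)
        return None   # fall through to the 'default' database

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)

    def allow_relation(self, obj1, obj2, **hints):
        # Allow cross-shard relations at the ORM level; real FK constraints
        # cannot span databases, which is the trade-off discussed above
        return True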
In the past I've done something similar using PostgreSQL table partitioning; however, this merely splits a table up within the same DB. It is helpful in reducing table search time, and it's also nice because you don't need to modify your Django code much (just make sure you query on the fields you're using for the partition constraints).
But it's not sharding.
If you haven't seen it yet, you should check out Sharding Postgres with Instagram.
I agree with @DanielRoseman. Also, how many rows is too many? If you are careful with indexing, you can handle a lot of rows with no performance problems. Keep your indexed values small (ints). I've got tables in excess of 400 million rows that produce sub-second responses even when joined with other many-million-row tables.
It might make more sense to break the user up into multiple tables, so that the user object has a core of commonly used things while the "profile" info lives elsewhere (the standard Django setup, sketched below). Copies would be a small table referencing books, which holds the bulk of the data. Considering how much RAM you can put into a DB server these days, sharding before you have to seems wrong.
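A sketch of that split, with the profile fields being arbitrary examples:
from django.conf import settings
from django.db import models

class Profile(models.Model):
    # Rarely-used columns live here; the core user table stays small
    user = models.OneToOneField(settings.AUTH_USER_MODEL,
                                on_delete=models.CASCADE)
    bio = models.TextField(blank=True)
    avatar_url = models.URLField(blank=True)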

Python: RE vs. Query

I am building a website using Django, and this website uses blocks which are enabled for certain pages.
Right now I use a TextField containing the paths where a block is enabled. When a page is requested, Django retrieves all blocks from the database and does re.search on the TextField.
However, I was wondering whether, in terms of overhead, it would be a better idea to use a separate DB table for block/path pairs, where each row contains a single path and a reference to a block.
A separate DB table is definitely the "right" way to do it, because MySQL has to send all the data from your TEXT fields every time you query. As you add more rows and the TEXT fields get bigger, you'll start to notice performance issues and may eventually crash the server. Also, you'll be able to use VARCHAR and add a unique index to the paths, making lookups lightning fast.
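A minimal sketch of that table, with hypothetical model names:
from django.db import models

class Block(models.Model):
    name = models.CharField(max_length=100)

class BlockPath(models.Model):
    block = models.ForeignKey(Block, on_delete=models.CASCADE,
                              related_name='paths')
    path = models.CharField(max_length=255, db_index=True)

# Finding the blocks for a request is then a single indexed query:
# Block.objects.filter(paths__path=request.path).distinct()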
I am not exactly familiar with Django, but if I am understanding the situation correctly, you should use a table.
In fact this is exactly the kind of use that DB software is designed and optimized for.
No worries. It will actually be faster.
By doing the search yourself, you are trying to implement part of the DB logic on your own. Fun, certainly, but not so fast. :)
Here are some nice links on designing a database:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
http://en.wikipedia.org/wiki/Third_normal_form
Hope this helps. Good luck. :-)

Attribute Cache in Django - What's the point?

I was just looking over EveryBlock's source code and I noticed this code in the alerts/models.py code:
def _get_user(self):
    if not hasattr(self, '_user_cache'):
        from ebpub.accounts.models import User
        try:
            self._user_cache = User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            self._user_cache = None
    return self._user_cache
user = property(_get_user)
I've noticed this pattern around a bunch, but I don't quite understand the use. Is the whole idea to make sure that when accessing the FK on self (self = alert object), you only grab the user object once from the db? Why wouldn't you just rely upon the db caching and Django's ForeignKey() field? I noticed that the model definition only holds the user id and not a foreign key field:
class EmailAlert(models.Model):
    user_id = models.IntegerField()
    ...
Any insights would be appreciated.
I don't know why this is an IntegerField; it looks like it definitely should be a ForeignKey(User) field--you lose things like select_related() here and other things because of that, too.
As to the caching, many databases don't cache results--they (or rather, the OS) will cache the data on disk needed to get the result, so looking it up a second time should be faster than the first, but it'll still take work.
It also still takes a database round-trip to look it up. In my experience with Django, an item lookup can take around 0.5 to 1 ms for an SQL command to a local PostgreSQL server, plus the sometimes nontrivial overhead of QuerySet. 1 ms is a lot if you don't need it: do that a few times and you can turn a 30 ms request into a 35 ms request.
If your SQL server isn't local and you actually have network round-trips to deal with, the numbers get bigger.
Finally, people generally expect accessing a property to be fast; when they're complex enough to cause SQL queries, caching the result is generally a good idea.
Although databases do cache things internally, there's still an overhead in going back to the db every time you want to check the value of a related field - setting up the query within Django, the network latency in connecting to the db and returning the data over the network, instantiating the object in Django, etc. If you know the data hasn't changed in the meantime - and within the context of a single web request you probably don't care if it has - it makes much more sense to get the data once and cache it, rather than querying it every single time.
One of the applications I work on has an extremely complex home page containing a huge amount of data. Previously it was carrying out over 400 db queries to render. I've refactored it now so it 'only' uses 80, using very similar techniques to the one you've posted, and you'd better believe that it gives a massive performance boost.
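(Incidentally, newer Django versions ship django.utils.functional.cached_property, which collapses the hand-rolled pattern above into a decorator; a sketch of the same EmailAlert lookup:)
from django.db import models
from django.utils.functional import cached_property

class EmailAlert(models.Model):
    user_id = models.IntegerField()

    @cached_property
    def user(self):
        # Hits the DB once; the result (even None) is cached on the instance
        from ebpub.accounts.models import User
        try:
            return User.objects.get(id=self.user_id)
        except User.DoesNotExist:
            return None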
