count() on filter chains in Django

When chaining filters in Django, what is the most efficient way to count the records that result from each individual filter, without running the filter twice?
For example:
results = my_model.objects.all()
for filter in my_filters:
    results = results.filter(filter.get_filter_string())
    individual_num_records_affected = my_model.objects.filter(filter.get_filter_string())

Since you only need the number of results at each step, it's more efficient to call .count() on the querysets.
From the querysets documentation:
Note: If you only need to determine the number of records in the set
(and don’t need the actual objects), it’s much more efficient to
handle a count at the database level using SQL’s SELECT COUNT(*).
Django provides a count() method for precisely this reason.
You should not call len() on the queryset, because that loads all of the records into Python objects and then counts them, which you don't need. You only need the number of records, so .count() is the better option.
Also, remember that querysets are evaluated lazily.
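The difference can be seen at the SQL level without Django. The sketch below uses Python's built-in sqlite3 module with a hypothetical record table (an assumption for illustration, not part of the question). The first query ships every matching row to Python before counting; the second lets the database return a single number, which is roughly what .count() asks for.

```python
import sqlite3

# Hypothetical table standing in for a Django model's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO record (status) VALUES (?)",
    [("open",)] * 3 + [("closed",)] * 2,
)

# len()-style: every matching row is transferred, then counted in Python.
rows = conn.execute("SELECT * FROM record WHERE status = 'open'").fetchall()
count_in_python = len(rows)

# count()-style: the database counts and returns a single value.
count_in_db = conn.execute(
    "SELECT COUNT(*) FROM record WHERE status = 'open'"
).fetchone()[0]

print(count_in_python, count_in_db)
```

Both approaches give the same number; they differ only in how much data crosses the connection.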
Internally, a QuerySet can be constructed, filtered, sliced, and
generally passed around without actually hitting the database. No
database activity actually occurs until you do something to evaluate
the queryset.
So until you use the results of the queryset, it is not evaluated, i.e. no database hit occurs.
The statement results = results.filter(filter.get_filter_string()) does not hit the database by itself. You can call .filter() on a queryset as many times as you like; no database hit occurs until you actually use the results.
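To make the laziness concrete, here is a toy model of it. LazyQuery is a hypothetical class written for this sketch, not Django's QuerySet, and the rows and conditions are made up. Chaining filter() only records a condition, while count() is the step that simulates a database hit. With a real QuerySet the loop body would be results = results.filter(...) followed by results.count().

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated query (not Django's QuerySet)."""

    def __init__(self, rows, predicates=(), log=None):
        self._rows = rows
        self._predicates = tuple(predicates)
        # Shared list that records every simulated database hit.
        self.log = log if log is not None else []

    def filter(self, predicate):
        # Chaining only records the condition; nothing is executed yet.
        return LazyQuery(self._rows, self._predicates + (predicate,), self.log)

    def count(self):
        # Evaluation happens here: one simulated hit per call.
        self.log.append("count")
        return sum(
            1 for row in self._rows if all(p(row) for p in self._predicates)
        )


rows = [{"age": a, "active": a % 2 == 0} for a in range(10)]
my_filters = [lambda r: r["age"] >= 4, lambda r: r["active"]]

results = LazyQuery(rows)
counts = []
for condition in my_filters:
    results = results.filter(condition)  # no hit
    counts.append(results.count())       # one hit per step

print(counts, len(results.log))
```

Each step costs exactly one counting query, and the filter itself is never run a second time from scratch.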

Related

Is using any with a QuerySet unoptimal?

Many times, one needs to check if there is at least one element inside a QuerySet. Mostly, I use exists:
if queryset.exists():
    ...
However, I've seen colleagues using Python's any function:
if any(queryset):
    ...
Is using python's any function unoptimal?
My intuition tells me that this is a similar dilemma to one between using count and len: any will iterate through the QuerySet and, therefore, will need to evaluate it. In a case where we will use the QuerySet items, this doesn't create any slowdowns. However, if we need just to check if any pieces of data that satisfy a query exist, this might load data that we do not need.

Is using python's any function unoptimal?
The most Pythonic way would be:
if queryset:
    # …
Indeed, a QuerySet has truthiness True if it contains at least one item, and False otherwise.
In case you later want to enumerate over the queryset (with a for loop for example), it will load the items in the cache if you check its truthiness, so for example:
if queryset:
    for item in queryset:
        # …
will only make one query to the database: one that will fetch all items when you check the if queryset, and then later you can reuse that cache without making a second query.
If you do not consume the queryset later in the process, you can use .exists() instead: it does not load records into memory, but only makes a query to check whether at least one such record exists, so it is less expensive in terms of bandwidth between the application and the database. If you do have to consume the queryset later, however, using .exists() is not a good idea, since you then make two queries.
Using any(queryset), however, makes little sense: you can check whether a queryset contains elements by its truthiness, so any() will usually only make that check slightly less efficient.
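The two strategies can be sketched at the SQL level with the standard-library sqlite3 module. The item table is hypothetical, and the first query is only an approximation of the kind of statement an existence check issues, not Django's exact SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO item (name) VALUES (?)", [("a",), ("b",), ("c",)])

# exists()-style check: ask for at most one row and no column data.
has_items = conn.execute("SELECT 1 FROM item LIMIT 1").fetchone() is not None

# Truthiness-style check: fetch the rows once, then reuse them.
items = conn.execute("SELECT id, name FROM item ORDER BY id").fetchall()
if items:
    names = [name for _id, name in items]  # no second query needed

print(has_items, names)
```

The first form is cheaper when the rows are never used; the second avoids a second round trip when they are.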

How will I get the time taken by a django orm query?

How will I get the time taken by a django orm query?
Also, which of the following queries will be faster?
ShipmentPPTLMapping.objects.get(shipment_id=shipment_id)
OR ShipmentPPTLMapping.objects.filter(shipment_id=shipment_id)[0]
Also, for these queries, which one is faster?
ShipmentPPTLMapping.objects.filter(pptl_id=pptl_id).exclude(bag_seal_status='close').count()
OR ShipmentPPTLMapping.objects.filter(pptl_id=pptl_id, bag_seal_status='open').count()

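One option is to use django-debug-toolbar: it measures every query run by every view.
It is a must for any Django app.
Regarding the other questions:
For the first pair, I would say they perform about the same. In Django's code, get() uses filter() and retrieves the matching element. They do behave differently when there is not exactly one match: get() raises DoesNotExist or MultipleObjectsReturned, while filter(...)[0] raises IndexError on an empty result and silently returns the first of several matches.
The second pair should also be equivalent in structure, because filter and exclude are just different names for filter(Q(...)) and filter(~Q(...)), and for plain fields like these, chaining is equivalent to separating the conditions with a comma: the conditions are joined with AND.
However, because one query tests for "open" and the other for "close", they can return different rows when other statuses exist, and the particular backend may execute them differently. AFAIK this difference can only be measured by profiling.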
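For a quick measurement without extra tooling, a small timing wrapper is enough. The sketch below is an illustration with sqlite3 rather than the ORM: timed() is a hypothetical helper, the mapping table mimics the question's column names, and the "pending" status is an assumption added to show that the two counts need not match.

```python
import sqlite3
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (pptl_id INTEGER, bag_seal_status TEXT)")
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?)",
    [(1, "open"), (1, "close"), (1, "pending"), (2, "open")],
)


def count(where):
    # Count rows for pptl_id 1 that also satisfy the extra condition.
    sql = "SELECT COUNT(*) FROM mapping WHERE pptl_id = 1 AND " + where
    return conn.execute(sql).fetchone()[0]


# exclude(...)-style condition versus filter(...)-style condition.
not_closed, t1 = timed(lambda: count("NOT (bag_seal_status = 'close')"))
only_open, t2 = timed(lambda: count("bag_seal_status = 'open'"))
print(not_closed, only_open, t1, t2)
```

With the ORM, the same wrapper could be called as timed(lambda: queryset.count()) or timed(lambda: list(queryset)), so that the queryset is actually evaluated inside the timed call.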

Efficient way to use filter() twice in Django

I am relatively new to Django and Python, but I have not been able to quite figure this one out.
I essentially want to query the database using filter for a large number of users. Then I want to make a bunch of queries on just this section of users. So I thought it would be most efficient to do the first query with my broader filter parameters, and then make my separate filter queries on that set. In code, it looks like this:
# Get the big group of users, like all people with brown hair.
group_of_users = Data.objects.filter(......)
# Now get all the people with brown hair and blue eyes, and then all with green eyes, etc.
for each haircolor :
    subset_of_group = group_of_users.filter(....)
That is just pseudo-code by the way, I am not that inept. I thought this would be more efficient, but it seems that if I eliminate the first query and simply get the querysets in the for loop, it is much faster (I actually timed it).
I fear this is because when I filter first, and then filter again in the for loop, it is actually running both sets of filter queries on each iteration, so it really does twice the work I want. I thought that with caching this would not matter, as the first filter results would be cached and it would still be faster, but again, I timed it with multiple tests and the single filter is faster. Any ideas?
EDIT:
So it seems that querying for a set of data, and then trying to further query only against that set of data, is not possible. Rather, I should query for a set of data and then further parse that data using regular Python.

As garnertb and lanzz said, it doesn't matter where you use the filter function; the only thing that matters is when you evaluate the query (see when querysets are evaluated). My guess is that in your tests you evaluate the queryset somewhere in your code, and that you do more evaluations in the test with separate filter calls.
Whenever a queryset is evaluated, its results are cached. However, this cache does not carry over if you call another method, such as filter or order_by, on the queryset. So you can't evaluate the bigger set and then use filtering on the queryset to retrieve the smaller sets without doing another query.
If you only have a small set of haircolours, you can get away with doing a query for each haircolour. However, if you have many of them, the number of queries will have a severe impact on performance. In that case it might be better to do one query for the full set of users you want to use, and then do the subsequent processing in Python:
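# One query for the full set; the grouping happens in Python afterwards.
qs = Data.objects.filter(hair='brown')
objects = dict()
for obj in qs:
    # Collect the objects into one list per haircolour.
    objects.setdefault(obj.haircolour, []).append(obj)
for (k, v) in objects.items():
    print("Objects for colour '%s':" % k)
    for obj in v:
        print("- %s" % obj)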

Filtering Django querysets does not perform any database operation, until you actually try to access the result. Filtering only adds conditions to the queryset, which are then used to build the final query when you access the result of the query.
When you assign group_of_users = Data.objects.filter(...), no data is retrieved from the database; you just get a queryset that knows you want records satisfying a specific condition (the filtering parameters you supplied to Data.objects.filter), but it does not pre-fetch those users. After that, when you assign subset_of_group = group_of_users.filter(....), you are not filtering that previous group of users; you are only adding more conditions to the queryset, and still no data has been retrieved from the database. Only when you actually access the results of the queryset (e.g. by iterating over it, by slicing it with a step, or by accessing a single index in it) does the queryset build a (usually) single query that retrieves only the user records satisfying all the filtering conditions accumulated up to that point. It still needs to filter your entire users table to find those matching users; it cannot take advantage of the "previously retrieved" users from the group_of_users = Data.objects.filter(...) queryset, because nothing was actually retrieved at that point.

Your approach is exactly right and it is efficient. The Querysets don't touch the database until they are evaluated, so you can add as many filters as you like and the database won't be touched. Django's excellent documentation provides all the information you need to figure out what operations cause the Queryset to be evaluated.

How do I speed up iteration of large datasets in Django

I have a query set of approximately 1500 records from a Django ORM query. I have used the select_related() and only() methods to make sure the query is tight. I have also used connection.queries to make sure there is only this one query. That is, I have made sure no extra queries are getting called on each iteration.
When I run the query cut and paste from connection.queries it runs in 0.02 seconds. However, it takes seven seconds to iterate over those records and do nothing with them (pass).
What can I do to speed this up? What causes this slowness?

A QuerySet can get pretty heavy when it's full of model objects. In similar situations, I've used the .values method on the queryset to specify the properties I need as a list of dictionaries, which can be much faster to iterate over.
Django documentation: values_list

1500 records is far from a large dataset, and seven seconds is really too much. There is probably some problem in your models. You can easily check this by running (as Brandon says) the values() query and then explicitly creating the 1500 objects by iterating over the dictionaries. Just convert the ValuesQuerySet into a list before the construction to factor out the database connection.
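A minimal sketch of that check, with the database factored out entirely. Record is a hypothetical plain class standing in for the model, and the rows list stands in for list(queryset.values(...)); both are assumptions for illustration.

```python
import time


class Record:
    """Hypothetical plain class standing in for a model."""

    def __init__(self, id, name, language):
        self.id = id
        self.name = name
        self.language = language


# Stand-in for list(queryset.values(...)): plain dicts, already in memory.
rows = [{"id": i, "name": "item %d" % i, "language": "de"} for i in range(1500)]

# Time only the object construction, with no database involved.
start = time.perf_counter()
objects = [Record(**row) for row in rows]
elapsed = time.perf_counter() - start

print("built %d objects in %.4f s" % (len(objects), elapsed))
```

If the same loop over the real model class takes seconds rather than milliseconds, the cost is in the model's construction (for example a heavy __init__ or signal handlers) and not in the query.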

How are you iterating over each item?
items = SomeModel.objects.all()
With a regular for loop:
for item in items:
    print(item)
Or using the QuerySet iterator:
for item in items.iterator():
    print(item)
According to the docs, iterator() can improve performance and reduce memory use, because it does not cache the results on the queryset. The same idea applies when looping over very large Python dictionaries: on Python 2 it's best to use iteritems(), while on Python 3 items() already returns a lazy view.

Does your model's Meta declaration tell it to "order by" a field that is stored off in some other related table? If so, your attempt to iterate might be triggering 1,500 queries as Django runs off and grabs that field for each item, and then sorts them. Showing us your code would help us unravel the problem!

Django: Extending Querysets / Connect multiple filters with OR

I have to work with a queryset that is already filtered, e.g. qs = queryset.filter(language='de'), but in some further operation I need to undo some of the filtering already applied, e.g. to take not only the rows with language='de' but entries in all languages. Is there a way to apply filter again and have the new parameters connected to the existing ones using OR instead of AND? E.g. if the queryset is already filtered for language='de' and I could attach an OR language='en' to it, that would give me what I'm looking for!
Thanks!

I don't believe it is possible to do what you are asking.
The way you do ORs in Django is with Q objects (from django.db.models import Q), like this:
Model.objects.filter(Q(question__startswith='Who') | Q(question__startswith='What'))
so if you actually wanted to do this:
Model.objects.filter(Q(language='de') | Q(language='en'))
you would need to put them both in the same filter() call, so you wouldn't be able to add the other OR clause in a later filter() call.
I think the reason you may be trying to do this is that you are concerned about hitting the database again, but the only way to get accurate results is to hit the database again.
If you are simply concerned about producing clean, DRY code, you can put all the filters that are common to both queries at the top and then "fork" that queryset later, like this:
shared_qs = Model.objects.filter(active=True)
german_entries = shared_qs.filter(language='de')
german_and_english = shared_qs.filter(Q(language='de') | Q(language='en'))
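The same "fork" idea can be sketched at the SQL level, which also shows why the OR has to sit inside one grouped condition. This uses sqlite3 with a hypothetical entry table; the titles() helper and shared_where string are illustrative names, not anything from Django.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (title TEXT, language TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [("a", "de", 1), ("b", "en", 1), ("c", "fr", 1), ("d", "de", 0)],
)

shared_where = "active = 1"  # conditions common to both queries


def titles(extra_where):
    # The extra condition is parenthesised so an OR inside it stays grouped.
    sql = "SELECT title FROM entry WHERE %s AND (%s) ORDER BY title" % (
        shared_where,
        extra_where,
    )
    return [row[0] for row in conn.execute(sql)]


german_entries = titles("language = 'de'")
german_and_english = titles("language = 'de' OR language = 'en'")
print(german_entries, german_and_english)
```

Each fork is a separate query against the database; only the shared condition is written once.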
