Is using any with a QuerySet unoptimal?

Is using any with a QuerySet unoptimal? - python

Many times, one needs to check if there is at least one element inside a QuerySet. Mostly, I use exists:
if queryset.exists():
...
However, I've seen colleagues using python's any function:
if any(queryset):
...
Is using python's any function unoptimal?
My intuition tells me that this is a similar dilemma to one between using count and len: any will iterate through the QuerySet and, therefore, will need to evaluate it. In a case where we will use the QuerySet items, this doesn't create any slowdowns. However, if we need just to check if any pieces of data that satisfy a query exist, this might load data that we do not need.

Is using python's any function unoptimal?
The most Pythonic way would be:
if queryset:
# …
Indeed, a QuerySet has truthiness True if it contains at least one item, and False otherwise.
In case you later want to enumerate over the queryset (with a for loop for example), it will load the items in the cache if you check its truthiness, so for example:
if queryset:
for item in queryset:
# …
will only make one query to the database: one that will fetch all items when you check the if queryset, and then later you can reuse that cache without making a second query.
In case you do not consume the queryset later in the process, then you can work with a .exists() [Django-doc]: this will not load records in memory, but only make a query to check if at least one such record exists, this is thus less expensive in terms of bandwidth between the application and the database. If you however have to consume the queryset later, using .exists() is not a good idea, since then we make two queries.
Using any(queryset) however is non-sensical: you can check if a queryset contains elements by its truthiness, so using any() will usually only make that check slightly less efficient.

Related

How to load django objects requesting DB once and do searches after

Suppose I have a QuerySet of all db entries:
all_db_entries = Entry.objects.all()
And then I want to get some specific objects from it by calling get(param=value) (or any other method).
The problem is, that In documentation of QuerySet methods it is said: "These methods do not use a cache. Rather, they query the database each time they’re called.".
But what I want to achieve is to load all data once (like doing Select *), and only after do some searches on them. I don't want to open a connection to the db every time I call get() in order to avoid a heavy load on it.

You can use values to convert your resulting queryset in an ordinary python list, which you can use to do searches etc., e.g.:
list(MyModel.objects.values('pk', 'field'))
values will fetch the queryset once.

count() on filter chains in Django

When chaining filters in Django, what is the most efficient way of counting the resulting records from an individual filter? Without running the filter twice that is.
i.e.
results = my_model.objects.all()
for filter in my_filters:
results = results.filter(filter.get_filter_string())
individual_num_records_affected = my_model.objects.filter(filter.get_filter_string())

Since you just need the number of results at each step, its more efficient to use .count() on the querysets.
From the querysets documentation:
Note: If you only need to determine the number of records in the set
(and don’t need the actual objects), it’s much more efficient to
handle a count at the database level using SQL’s SELECT COUNT(*).
Django provides a count() method for precisely this reason.
You should not use .len() as it will load all of the record into Python objects and calls len() on the result which you don't need. You just need the number of records and .count() would be the better option.
Also, remember that querysets are evaluated lazy.
Internally, a QuerySet can be constructed, filtered, sliced, and
generally passed around without actually hitting the database. No
database activity actually occurs until you do something to evaluate
the queryset.
So, until you try to use the results of the queryset, the queryset will not be evaluated i.e. no database hit will occur.
The statement results = results.filter(filter.get_filter_string()) will not hit the database until you try to use the results. You can do .filter() on a queryset multiple times but until you don't use it, there would be no database hit.

Efficient way to use filter() twice in Django

I am relatively new to Django and Python, but I have not been able to quite figure this one out.
I essentially want to query the database using filter for a large number of users. Then I want to make a bunch of queries on this just this section of users. So I thought it would be most efficient do first query for my larger filter parameters, and then make my separate filter queries on that set. In code, it looks like this
#Get the big groups of users, like all people with brown hair.
group_of_users = Data.objects.filter(......)
#Now get all the people with brown hair and blue eyes, and then all with green eyes, etc.
for each haircolor :
subset_of_group = group_of_users.filter(....)
That is just pseudo-code by the way, I am not that inept. I thought this would be more efficient, but it seems that if eliminate the first query and simply just get the querysets in the for loop, it is much faster (actually timed).
I fear this is because when I filter first, and then filter each time in the for loop, it is actually doing both sets of filter queries on each for loop execution. So really, doing twice the amount of work I want. I thought with caching this would not matter, as the first filter results would be cached and it would still be faster, but again, I timed it with multiple tests and the single filter is faster. Any ideas?
EDIT:
So it seems that querying for a set of data, and then trying to further query only against that set of data, is not possible. Rather, I should query for a set of data and then further parse that data using regular Python.

As garnertb ans lanzz said, it doesn't matter where you use the filter function, the only thing that matters is when you evaluate the query (see when querysets are evaluated). My guess is that in your tests, you evaluate the queryset somewhere in your code, and that you do more evaluations in your test with separate filter calls.
Whenever a queryset is evaluated, its results are cached. However, this cache does not carry over if you use another method, such as filter or order_by, on the queryset. SO you can't try to evaluate the bigger set, and use filtering on the queryset to retrieve the smaller sets without doing another query.
If you only have a small set of haircolours, you can get away with doing a query for each haircolour. However, if you have many of them, the amount of queries will have a severe impact on performance. In that case it might be better to do a query for the full set of users you want to use, and the do subsequent processing in python:
qs = Data.objects.filter(hair='brown')
objects = dict()
for obj in qs:
objects.setdefault(obj.haircolour, []).append(obj)
for (k, v) in objects.items():
print "Objects for colour '%s':" % k
for obj in v:
print "- %s" % obj

Filtering Django querysets does not perform any database operation, until you actually try to access the result. Filtering only adds conditions to the queryset, which are then used to build the final query when you access the result of the query.
When you assign group_of_users = Data.objects.filter(...), no data is retrieved from the database; you just get a queryset that knows that you want records that satisfy a specific condition (the filtering parameters you supplied to Data.objects.filter), but it does not pre-fetch those actual users. After that, when you assign subset_of_group = group_of_users.filter(....), you don't filter just that previous group of users, but only add more conditions to the queryset; still no data has been retrived from the database at this point. Only when you actually try to access the results of the queryset (by e.g. iterating over the queryset, or by slicing it, or by accessing a single index in it), the queryset will build an (usually) single query that would retrieve only user records that satisfy all filtering conditions you have accumulated in your querysets up to that point. It will still need to filter your entire users table to find those matching users; it cannot take advantage of the "previously retrieved" users from the group_of_users = Data.objects.filter(...) queryset, because nothing has been actually retrieved at that point.

Your approach is exactly right and it is efficient. The Querysets don't touch the database until they are evaluated, so you can add as many filters as you like and the database won't be touched. Django's excellent documentation provides all the information you need to figure out what operations cause the Queryset to be evaluated.

What are the specifics and conditions to django auto-caching query sets

Django stated in their docs that all query sets are automatically cached, https://docs.djangoproject.com/en/dev/topics/db/queries/#caching-and-querysets. But they weren't super specific with the details of this functionality.
The example that they gave was to save the qs in a python variable, and subsequent calls after the first will be taken from the cache.
queryset = Entry.objects.all()
print([p.headline for p in queryset]) # Evaluate the query set.
print([p.pub_date for p in queryset]) # Re-use the cache from the evaluation.
So even if two exact queryset calls were made without a variable subsequently when a user loads a view, would the results not be cached?
# When the user loads the homepage, call number one (not cached)
def home(request):
entries = Entry.objects.filter(something)
return render_to_response(...)
# Call number two, is this cached automatically? Or do I need to import cache and
# manually do it? This is the same method as above, called twice
def home(request):
entries = Entry.objects.filter(something)
return render_to_response(...)
Sorry if this is confusing, I pasted the method twice to make it look like the user is calling it twice, its just one method. Are entries automatically cached?
Thanks

The queryset example you have given rightly indicates that querysets are evaluated lazily i.e the first time they are used. So when subsequently used again, they are not evaluated in the same flow when assigned to a variable. This is not exactly caching but re-using an evaluated expression as long as it is available in an optimized manner.
For the kind of caching you are looking at i.e the same view called twice, you will need to manually cache the database object when it is fetched the first time. Memcached is good for this. Then subsequently check and fetch like in example below.
def view(request):
results = cache.get(request.user.id)
if not results:
results = do_a_ton_of_work()
cache.set(request.user.id, results)
There are of course a lot of other ways to do caching at different levels right from your proxy server to per url caching. Whatever works best for you. Here is a good read on this topic.

It is not cached for two reasons:
When you use just filter, but don't "loop" through the results the queryset is not yet evaluated, which means the cache is still empty.
Even they would be evaluated it is not cached, because when you call the function the second time the queryset is recreated (new local variable), even you created it already the first time you called the function. The second function call does simply not "know" what you did before. Its simply a new queryset instance. In this case you might rely on the database cache though.

Memcached is built-in module, it works nice but even you can try for "johnny cache" for more better results.
you can get the more info here
http://packages.python.org/johnny-cache/

How do I speed up iteration of large datasets in Django

I have a query set of approximately 1500 records from a Django ORM query. I have used the select_related() and only() methods to make sure the query is tight. I have also used connection.queries to make sure there is only this one query. That is, I have made sure no extra queries are getting called on each iteration.
When I run the query cut and paste from connection.queries it runs in 0.02 seconds. However, it takes seven seconds to iterate over those records and do nothing with them (pass).
What can I do to speed this up? What causes this slowness?

A QuerySet can get pretty heavy when it's full of model objects. In similar situations, I've used the .values method on the queryset to specify the properties I need as a list of dictionaries, which can be much faster to iterate over.
Django documentation: values_list

1500 records is far from being a large dataset, and seven seconds is really too much. There is probably some problem in your models, you can easily check it by getting (as Brandon says) the values() query, and then create explicitly the 1500 object by iterating the dictionary. Just convert the ValuesQuerySet into a list before the construction to factor out the db connection.

How are you iterating over each item:
items = SomeModel.objects.all()
Regular for loop on each
for item in items:
print item
Or using the QuerySet iterator
for item in items.iterator():
print item
According to the doc, the iterator() can improve performance. The same applies while looping very large Python list or dictionaries, it's best to use iteritems().

Does your model's Meta declaration tell it to "order by" a field that is stored off in some other related table? If so, your attempt to iterate might be triggering 1,500 queries as Django runs off and grabs that field for each item, and then sorts them. Showing us your code would help us unravel the problem!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.