How do I speed up iteration of large datasets in Django


I have a query set of approximately 1500 records from a Django ORM query. I have used the select_related() and only() methods to make sure the query is tight. I have also used connection.queries to make sure there is only this one query. That is, I have made sure no extra queries are getting called on each iteration.
When I copy the query from connection.queries and run it directly, it takes 0.02 seconds. However, it takes seven seconds to iterate over those records and do nothing with them (pass).
What can I do to speed this up? What causes this slowness?

A QuerySet can get pretty heavy when it's full of model objects. In similar situations, I've used the .values() method on the queryset to fetch only the fields I need as a list of dictionaries, which can be much faster to iterate over.
Django documentation: values_list

1500 records is far from a large dataset, and seven seconds is far too long. There is probably a problem in your models. You can check this easily by fetching the values() query (as Brandon says) and then explicitly creating the 1500 objects by iterating over the dictionaries. Convert the ValuesQuerySet into a list before constructing the objects to factor out the database connection.
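The check described above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: `Record` is a hypothetical stand-in for the real model class, and `rows` stands in for `list(SomeModel.objects.values(...))`, so only object construction is timed.

```python
import time


class Record:
    """Hypothetical stand-in for a model class; swap in the real model."""

    def __init__(self, id, name):
        self.id = id
        self.name = name


def time_construction(rows, factory):
    """Build one object per row and return (objects, elapsed_seconds)."""
    start = time.perf_counter()
    objects = [factory(**row) for row in rows]
    return objects, time.perf_counter() - start


# Stand-in for list(SomeModel.objects.values("id", "name")): plain dicts,
# already materialized, so no database time is included in the measurement.
rows = [{"id": i, "name": "item %d" % i} for i in range(1500)]

objects, elapsed = time_construction(rows, Record)
print("built %d objects in %.4f s" % (len(objects), elapsed))
```

If building the real model objects from the dictionaries takes seconds while the values() query itself is fast, the time is probably going into the model's constructor or its signals rather than into the database.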

How are you iterating over each item?
items = SomeModel.objects.all()
With a regular for loop over each item:
for item in items:
    print(item)
Or with the QuerySet iterator:
for item in items.iterator():
    print(item)
According to the docs, iterator() can improve performance because it does not cache the results on the QuerySet. The same applies when looping over very large dictionaries in Python 2, where it's best to use iteritems(); in Python 3, items() already returns a lazy view.

Does your model's Meta declaration tell it to "order by" a field that is stored off in some other related table? If so, your attempt to iterate might be triggering 1,500 queries as Django runs off and grabs that field for each item, and then sorts them. Showing us your code would help us unravel the problem!

Related

Is using any with a QuerySet unoptimal?

Many times, one needs to check if there is at least one element inside a QuerySet. Mostly, I use exists:
if queryset.exists():
    ...
However, I've seen colleagues using python's any function:
if any(queryset):
    ...
Is using python's any function unoptimal?
My intuition tells me that this is a similar dilemma to one between using count and len: any will iterate through the QuerySet and, therefore, will need to evaluate it. In a case where we will use the QuerySet items, this doesn't create any slowdowns. However, if we need just to check if any pieces of data that satisfy a query exist, this might load data that we do not need.

Is using python's any function unoptimal?
The most Pythonic way would be:
if queryset:
    # …
Indeed, a QuerySet has truthiness True if it contains at least one item, and False otherwise.
In case you later want to enumerate over the queryset (with a for loop for example), it will load the items in the cache if you check its truthiness, so for example:
if queryset:
    for item in queryset:
        # …
will only make one query to the database: one that will fetch all items when you check the if queryset, and then later you can reuse that cache without making a second query.
If you do not consume the queryset later in the process, you can use .exists() [Django-doc]. It does not load records into memory; it only makes a query that checks whether at least one such record exists, which is cheaper in terms of bandwidth between the application and the database. If you do have to consume the queryset later, however, using .exists() is not a good idea, since you then make two queries.
Using any(queryset), however, makes little sense: you can check whether a queryset contains elements by its truthiness, so any() will usually only make that check slightly less efficient.
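To make the trade-off concrete, here is a small sketch using the standard library's sqlite3 module instead of Django. The `item` table and both helper functions are made up for illustration. The first query is roughly the kind of statement an existence check needs, while the second loads every row, as a truthiness check on a queryset does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO item (name) VALUES (?)",
    [("item %d" % i,) for i in range(1500)],
)


def has_rows_cheap(connection):
    """Existence check: ask for at most one row and one constant column."""
    row = connection.execute("SELECT 1 FROM item LIMIT 1").fetchone()
    return row is not None


def has_rows_by_loading(connection):
    """Truthiness-style check: load every row, then test for emptiness."""
    rows = connection.execute("SELECT id, name FROM item").fetchall()
    return bool(rows), rows  # rows stay available for reuse, like a cache


exists = has_rows_cheap(conn)
non_empty, cached_rows = has_rows_by_loading(conn)
print(exists, non_empty, len(cached_rows))
```

The second form only pays off when `cached_rows` is actually used afterwards; otherwise all 1500 rows were transferred just to answer a yes/no question.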

Firestore query takes a too long time to get the value of only one field

Hi, community.
I have a question about Firestore queries in Firebase.
I have a collection of around 18000 documents. I would like to get the value of the same single field from some of these documents. I use the Python firestore_v1 library from the google-cloud-python client. For example, with len(list_edges) == 250:
[db_firestore.document(f"edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
it takes 30+ seconds to evaluate, whereas with the same collection in MongoDB the following takes no more than 3 seconds, and it loads the whole object, not just one field:
list(db_mongo["edges"].find({"city_id":{"$eq":city_id},"id": {"$in": [edge["id"] for edge in list_edges]}}))
Having said that, I thought the solution could be to split the large collection by city_id, so I created a new collection and copied the corresponding documents into it. Now the query looks like:
[db_firestore.document(f"edges/7/edges/{edge['id']}").get({"distance"}).to_dict()["distance"] for edge in list_edges]
where 7 is a city_id.
However, it takes the same time. So, maybe the issue is around the .get() method, but I could not find any optimized solution for my case.
Could you help me with this? Thanks!
EDITED
I've got an answer from Firestore support. The problem is that I make 250 requests, calling .get() for each document separately. The idea is to get all the data I want in a single request, so I need to modify the query.
Let's assume I have the following DB:
an edges collection with multiple edge_id documents. For each new request, I use a newly generated list of the edges I need to fetch.
In MongoDB, I can do this with the $in operator (with edge_id stored inside the document), but in Firestore the 'in' operator only accepts up to 10 values.
So I need to find another way to do this.
Any ideas? Thanks!

Firebase recently added support for a limited in operation. See:
The blog post announcing the feature.
The documentation on in and array-contains-any queries.
From the latter:
cities_ref = db.collection(u'cities')
query = cities_ref.where(u'country', u'in', [u'USA', u'Japan'])
A few caveats though:
You can have at most 10 values in the in clause, and you can have only one in (or array-contains-any) clause in a query.
I am not sure if you can use this operator to select by ID.
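Given the 10-value limit, one possible workaround is to split the list of IDs into batches of at most 10 and issue one in query per batch. The helper below is a plain-Python sketch of the batching step only. The edge IDs are made up, and the commented-out query assumes the ID is also stored as a field named id inside each document, which the question does not confirm for Firestore.

```python
def chunked(values, size=10):
    """Yield consecutive slices of `values`, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield values[start:start + size]


# Hypothetical edge IDs standing in for [edge["id"] for edge in list_edges].
edge_ids = ["edge-%d" % i for i in range(250)]

batches = list(chunked(edge_ids))
print(len(batches), "queries instead of", len(edge_ids), "document reads")

# Each batch could then go into one `in` clause, following the snippet above:
#     query = edges_ref.where(u'id', u'in', batch)   # `edges_ref` is assumed
```

For 250 edges this means 25 requests rather than 250, which is still more round trips than the single MongoDB query but considerably fewer than one .get() per document.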

count() on filter chains in Django

When chaining filters in Django, what is the most efficient way of counting the resulting records from an individual filter? Without running the filter twice that is.
i.e.
results = my_model.objects.all()
for filter in my_filters:
    results = results.filter(filter.get_filter_string())
    individual_num_records_affected = my_model.objects.filter(filter.get_filter_string())

Since you just need the number of results at each step, it's more efficient to use .count() on the querysets.
From the querysets documentation:
Note: If you only need to determine the number of records in the set
(and don’t need the actual objects), it’s much more efficient to
handle a count at the database level using SQL’s SELECT COUNT(*).
Django provides a count() method for precisely this reason.
You should not use len(), as it loads all of the records into Python objects and then counts them, which you don't need. You just need the number of records, so .count() is the better option.
Also, remember that querysets are evaluated lazily.
Internally, a QuerySet can be constructed, filtered, sliced, and
generally passed around without actually hitting the database. No
database activity actually occurs until you do something to evaluate
the queryset.
So, until you try to use the results of the queryset, the queryset will not be evaluated i.e. no database hit will occur.
The statement results = results.filter(filter.get_filter_string()) will not hit the database until you try to use the results. You can call .filter() on a queryset multiple times, but until you use it there will be no database hit.
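The difference between counting in the database and counting in Python can be shown without Django. The sketch below uses the standard library's sqlite3 module with a made-up `record` table: the first query returns a single integer, the second transfers every matching row before counting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE record (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO record (status) VALUES (?)",
    [("open" if i % 3 else "closed",) for i in range(900)],
)

# Database-side count: one integer crosses the connection.
(count_in_db,) = conn.execute(
    "SELECT COUNT(*) FROM record WHERE status = ?", ("open",)
).fetchone()

# Python-side count: every matching row is fetched, then counted.
rows = conn.execute(
    "SELECT id, status FROM record WHERE status = ?", ("open",)
).fetchall()
count_in_python = len(rows)

print(count_in_db, count_in_python)
```

Both numbers agree, but only the first avoids building a Python object for every matching row.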

Efficient way to use filter() twice in Django

I am relatively new to Django and Python, but I have not been able to quite figure this one out.
I essentially want to query the database using filter for a large number of users. Then I want to make a bunch of queries on just this section of users. So I thought it would be most efficient to first run a query with my broader filter parameters, and then make my separate filter queries on that set. In code, it looks like this:
# Get the big group of users, like all people with brown hair.
group_of_users = Data.objects.filter(......)
# Now get all the people with brown hair and blue eyes, then all with green eyes, etc.
for each haircolor :
    subset_of_group = group_of_users.filter(....)
That is just pseudo-code, by the way; I am not that inept. I thought this would be more efficient, but it seems that if I eliminate the first query and simply get the querysets in the for loop, it is much faster (I actually timed it).
I fear this is because when I filter first, and then filter each time in the for loop, it is actually doing both sets of filter queries on each for loop execution. So really, doing twice the amount of work I want. I thought with caching this would not matter, as the first filter results would be cached and it would still be faster, but again, I timed it with multiple tests and the single filter is faster. Any ideas?
EDIT:
So it seems that querying for a set of data, and then trying to further query only against that set of data, is not possible. Rather, I should query for a set of data and then further parse that data using regular Python.

As garnertb and lanzz said, it doesn't matter where you use the filter function; the only thing that matters is when you evaluate the query (see when querysets are evaluated). My guess is that in your tests, you evaluate the queryset somewhere in your code, and that you do more evaluations in your test with separate filter calls.
Whenever a queryset is evaluated, its results are cached. However, this cache does not carry over if you call another method, such as filter or order_by, on the queryset. So you can't evaluate the bigger set and then use filtering on the queryset to retrieve the smaller sets without doing another query.
If you only have a small set of haircolours, you can get away with doing a query for each haircolour. However, if you have many of them, the number of queries will have a severe impact on performance. In that case it might be better to do one query for the full set of users you want to use, and then do subsequent processing in Python:
qs = Data.objects.filter(hair='brown')
# Group the fetched objects by colour in a single pass.
objects = dict()
for obj in qs:
    objects.setdefault(obj.haircolour, []).append(obj)
for (k, v) in objects.items():
    print("Objects for colour '%s':" % k)
    for obj in v:
        print("- %s" % obj)

Filtering Django querysets does not perform any database operation, until you actually try to access the result. Filtering only adds conditions to the queryset, which are then used to build the final query when you access the result of the query.
When you assign group_of_users = Data.objects.filter(...), no data is retrieved from the database; you just get a queryset that knows that you want records that satisfy a specific condition (the filtering parameters you supplied to Data.objects.filter), but it does not pre-fetch those actual users. After that, when you assign subset_of_group = group_of_users.filter(....), you don't filter just that previous group of users, but only add more conditions to the queryset; still no data has been retrieved from the database at this point. Only when you actually try to access the results of the queryset (by e.g. iterating over the queryset, or by slicing it, or by accessing a single index in it) will the queryset build a (usually) single query that retrieves only the user records that satisfy all the filtering conditions you have accumulated in your querysets up to that point. It will still need to filter your entire users table to find those matching users; it cannot take advantage of the "previously retrieved" users from the group_of_users = Data.objects.filter(...) queryset, because nothing has actually been retrieved at that point.
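The behaviour described above can be imitated with a toy class. This is only an illustration of the idea (conditions accumulate, nothing runs until iteration, and each query object keeps its own cache); `LazyQuery` is invented for this sketch and is not how Django implements QuerySet.

```python
class LazyQuery:
    """Toy model of a lazy, chainable query. Not Django's implementation."""

    def __init__(self, source, conditions=()):
        self._source = source            # stands in for the database table
        self._conditions = tuple(conditions)
        self._cache = None               # filled on first evaluation
        self.evaluations = 0             # counts simulated database hits

    def filter(self, predicate):
        # Return a NEW query with one more condition; nothing is evaluated
        # and the parent's cache is not shared with the child.
        return LazyQuery(self._source, self._conditions + (predicate,))

    def __iter__(self):
        if self._cache is None:
            self.evaluations += 1
            self._cache = [
                row for row in self._source
                if all(cond(row) for cond in self._conditions)
            ]
        return iter(self._cache)


people = [
    {"hair": "brown", "eyes": "blue"},
    {"hair": "brown", "eyes": "green"},
    {"hair": "black", "eyes": "blue"},
]

brown = LazyQuery(people).filter(lambda p: p["hair"] == "brown")
brown_blue = brown.filter(lambda p: p["eyes"] == "blue")

print(brown.evaluations, brown_blue.evaluations)  # nothing evaluated yet
result = list(brown_blue)
list(brown_blue)  # second pass reuses brown_blue's own cache
print(result, brown.evaluations, brown_blue.evaluations)
```

Note that evaluating `brown_blue` never evaluates `brown`: the child query carries both conditions and runs them against the full source, which mirrors why filtering an already-filtered queryset does not reuse any previously fetched rows.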

Your approach is exactly right and it is efficient. The Querysets don't touch the database until they are evaluated, so you can add as many filters as you like and the database won't be touched. Django's excellent documentation provides all the information you need to figure out what operations cause the Queryset to be evaluated.

Getting a field from document via mongoengine

I have a places collection from which I am trying to extract place names to suggest to the user, but it takes a long time, and I would like to know if there is any way to optimize it. I use the mongoengine ORM and the database is MongoDB.
Query:
results = Place.objects(name__istartswith=query).only('name')
The query takes very little time, a matter of microseconds.
But when I try to access the names from results:
names = [result.name for result in results]
this line takes a very long time, between 3 and 5 seconds, for a list of around 2500 items.
I have tried using scalar, but then the time increases when I do a union with another list.
Is there a better way to access the list of names?

A queryset isn't executed until it's iterated, so results = Place.objects(name=query).only('name') returns a queryset that hasn't been run yet. When you iterate it, the query takes place and data is sent over the wire.
Is the query slow when running via pymongo? As you don't need them as MongoEngine objects try using as_pymongo - which returns raw dictionaries back.
Other hints are to make sure the query is performant - using an index - see the profiler docs.
