Django large queryset return as response efficiently - python

I have a model in django called "Sample"
I want to query and return a large number of rows ~ 100k based on filters.
However, it's taking up to 4-5 seconds to return the response and I was wondering whether I could make it faster.
(Need to improve converting from queryset to df to response json. Not querying from DB)
My current code looks like this:
@api_view(['POST'])
def retrieve_signal_asset_weight_ts_by_signal(request):
    # code to get item.id here based on request
    qs = Sample.objects.filter(
        data_date__range=[start_date, end_date],
        item__id=item.id).values(*columns_required)
    df = pd.DataFrame(list(qs), columns=columns_required)
    response = df.to_json(orient='records')
    return Response(response, status=status.HTTP_200_OK)
Based on multiple test cases -- I've noticed that the slow part isn't actually getting the data from DB, it's converting it to a DataFrame and then returning as JSON.
It's actually taking about 2 seconds just for this part: df = pd.DataFrame(list(qs), columns=columns_required). I'm looking for a faster way to convert a queryset to JSON that I can send as part of my response object!
Based on this link I've tried other methods, including django-pandas and .values_list(), but they seem to be slower than this. I also noticed many of the answers are quite old, so I was wondering whether Django 3 has anything newer that makes this faster.
Thanks
Django version: 3.2.6

With your code as written, you can't actually conclude:
(Need to improve converting from queryset to df to response json. Not querying from DB)
It's actually taking about 2 seconds just for this part
df = pd.DataFrame(list(qs), columns=columns_required)
Getting data from the database is a lazy operation, so the query is executed only when the data is needed, at list(qs). According to the documentation:
QuerySets are lazy – the act of creating a QuerySet doesn't involve any database activity. You can stack filters together all day long, and Django won't actually run the query until the QuerySet is evaluated.
Try separating the operations:
records = list(qs)
df = pd.DataFrame(records, columns=columns_required)
Now you can determine which operation is time-consuming.
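For instance, a quick way to time each step (a minimal sketch using time.perf_counter; the names mirror the question's code):
import time

t0 = time.perf_counter()
records = list(qs)                    # the database query actually runs here
t1 = time.perf_counter()
df = pd.DataFrame(records, columns=columns_required)
t2 = time.perf_counter()
payload = df.to_json(orient='records')
t3 = time.perf_counter()
print(f"query: {t1 - t0:.2f}s, DataFrame: {t2 - t1:.2f}s, to_json: {t3 - t2:.2f}s")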
You might also take a look at StreamingHttpResponse.
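A rough sketch of that idea, streaming the .values() rows as JSON without building a DataFrame at all (stream_rows is a hypothetical helper; default=str is an assumption to cope with dates and Decimals):
import json
from django.http import StreamingHttpResponse

def stream_rows(qs):
    yield '['
    first = True
    for row in qs.iterator():               # each row is a dict thanks to .values()
        if not first:
            yield ','
        first = False
        yield json.dumps(row, default=str)  # default=str handles dates/Decimals
    yield ']'

response = StreamingHttpResponse(stream_rows(qs), content_type='application/json')
Whether this is faster end to end depends on the client, but it avoids holding the whole payload in memory at once.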

Related

How to retrieve count objects faster on Django?

My goal is to optimize the retrieval of the count of objects I have in my Django model.
I have two models:
Users
Prospects
It's a one-to-many relationship. One User can create many Prospects. One Prospect can only be created by one User.
I'm trying to get the Prospects created by the user in the last 24 hours.
The Prospects model has roughly 7 million rows in my PostgreSQL database; Users only 2,000.
My current code is taking too much time to get the desired results.
I tried to use filter() and count():
import datetime

# get the date, but 24 hours earlier
date_example = datetime.datetime.now() - datetime.timedelta(days=1)

# Filter Prospects that were created by user_id_example
# and whose date is greater than or equal to date_example (i.e. within the last 24 hours)
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)

# get the count of prospects created in the past 24 hours by user_id_example
# this is the problematic call that takes too long to process
count_total_today_prospects = today_prospects.count()
It works, but it takes too much time (5 minutes), because it's checking the entire database instead of just checking what I thought it would: only the prospects created in the last 24 hours by the user.
I also tried using annotate, but it's equally slow, because it's ultimately doing the same thing as the regular .count():
today_prospects.annotate(Count('id'))
How can I get the count in a more optimized way?
Assuming that you don't have it already, I suggest adding an index that includes both user and date fields (make sure that they are in this order, first the user and then the date, because for the user you are looking for an exact match but for the date you only have a starting point). That should speed up the query.
For example:
class Prospect(models.Model):
    ...

    class Meta:
        ...
        indexes = [
            models.Index(fields=['user', 'create_date']),
        ]
        ...
This requires a new migration file (run makemigrations and then migrate) that adds the index to the database.
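For example, with the standard management commands:
python manage.py makemigrations
python manage.py migrate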
After that, your same code should run a bit faster:
count_total_today_prospects = Prospect.objects\
    .filter(user_id='user_id_example', create_date__gte=date_example)\
    .count()
Django's documentation:
A count() call performs a SELECT COUNT(*) behind the scenes, so you should always use count() rather than loading all of the record into Python objects and calling len() on the result (unless you need to load the objects into memory anyway, in which case len() will be faster).
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
If the queryset has already been fully retrieved, count() will use that length rather than perform an extra database query.
Take a look at this link: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#count.
Alternatively, try len(), but only if you are loading the objects into memory anyway.
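To make the docs' distinction concrete, here is a minimal sketch reusing the question's Prospect model:
qs = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)

n = qs.count()        # runs SELECT COUNT(*); no rows are loaded into Python

prospects = list(qs)  # now the rows *are* loaded
n = len(prospects)    # no extra query: just the length of the list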

Django filter based on retrieved values

I have a model which contains a few parameters like the following
{ "user_name" : "john",
"enter_time" : 1442257970.509184,
"frequency" : 30
}
I need to write a filter function, preferably using django-filter or in some other Pythonic way, so that I can filter out those records for which the following returns True:
def test():
    cur_time = time.time()
    if cur_time >= enter_time + frequency:
        return True
    return False
Currently in my views.py I am able to filter based on names and single values.
For example,
records = UserRecord.objects.filter(token=user_token, user_name=name, status='active').distinct()
serializer = RecordSerializer(records, many=True)
I am not sure how to filter based on the condition defined in test(). One workaround was to take serializer.data and process the list of OrderedDicts to filter the content, but I am looking for the Django way of doing it.
You can use F expressions in filters. In your case:
from django.db.models import F

UserRecord.objects.annotate(time_plus_freq=F('enter_time') + F('frequency'))\
    .filter(time_plus_freq__lt=cur_time)
This way the filtering is done in SQL and you do not need to do any of it on the Python side.
If Django has trouble determining the type of the annotated expression, look at the output_field kwarg of annotate().
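For example (a sketch; FloatField is an assumption about how enter_time and frequency are stored):
from django.db.models import ExpressionWrapper, F, FloatField

UserRecord.objects.annotate(
    time_plus_freq=ExpressionWrapper(
        F('enter_time') + F('frequency'),
        output_field=FloatField(),  # tells Django the type of the computed column
    )
).filter(time_plus_freq__lt=cur_time)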
If you really, really want to do it directly, there is a way, but it will run at the application level, not at the database level. You can simply make a generator like this:
records = (record for record in records if test(record))
but this is really bad and you should avoid it.
A much better solution is to rewrite your test condition so that it uses queryset filtering, as mentioned in Ivan's answer.

Fetching queryset data one by one

I am aware that a regular queryset, or the iterator() queryset method, evaluates and returns the entire data set in one shot.
for instance, take this:
my_objects = MyObject.objects.all()

for rows in my_objects:             # Way 1
    ...

for rows in my_objects.iterator():  # Way 2
    ...
Question
In both methods all the rows are fetched in a single go. Is there any way in Django that the queryset rows can be fetched one by one from the database?
Why this weird requirement
At present my query fetches, let's say, n rows, but sometimes I get the Python and Django OperationalError (2006, 'MySQL server has gone away').
So, as a workaround, I am currently using a weird while-loop logic. I was wondering whether there is any native or built-in method for this, or whether my question is even logical in the first place! :)
I think you are looking to limit your query set.
Quote from above link:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
In other words, if you start with a count, you can then loop over the data and take slices as you require them:
cnt = MyObject.objects.count()
start_point = 0
inc = 5
while start_point < cnt:
    filtered = MyObject.objects.all()[start_point:start_point + inc]
    # ... do something with this slice of (at most) inc objects ...
    start_point += inc
Of course you may need to add more error handling to this.
Fetching row by row might be worse. You might want to retrieve in batches of 1000 or so. I have used this Django snippet (not my work) successfully with very large querysets. It doesn't eat up memory, and I've had no trouble with connections going away.
Here's the snippet from that link:
import gc

def queryset_iterator(queryset, chunksize=1000):
    """
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows into
    memory. Using the iterator() method only causes it to not preload all
    the classes.

    Note that the implementation of the iterator does not support ordered
    query sets.
    """
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
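A possible way to use it (process() stands in for whatever you do with each row):
for obj in queryset_iterator(MyObject.objects.all(), chunksize=500):
    process(obj)  # hypothetical per-row handler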
Your approach to solving the (2006, 'MySQL server has gone away') problem is not that logical. If you hit the database for each entry, the number of queries will grow, which will itself become a problem as usage of your application increases.
I think you should close the MySQL connection after iterating over all the elements of the result; then, if you try to make another query, Django will create a new connection.
from django.db import connection

connection.close()
Refer to this for more details.

Does django core pagination retrieve all data first?

I am using Django 1.5. I need to split data into pages. I read the docs here. I am not sure whether it retrieves all the data first or not. Since I have a large table, it would be better to use something like LIMIT. Thanks.
EDIT
I am using a queryset in a ModelManager.
example:
class KeywordManager(models.Manager):
    def currentkeyword(self, kw, bd, ed):
        wholeres = super(KeywordManager, self).get_query_set() \
            .values("sc", "begindate", "enddate") \
            .filter(keyword=kw, begindate__gte=bd, enddate__lte=ed) \
            .order_by('enddate')
        return wholeres
First, a queryset is a lazy object; Django will retrieve the data as soon as you request it, but if you don't, Django won't hit the DB. If you apply a list method such as len() to a queryset, you evaluate the whole thing and force Django to retrieve all the data.
If you pass a queryset to the Paginator, it will not retrieve all the data: as the docs say, for a queryset it uses the .count() method, avoiding converting the queryset into a list and calling len() on it.
If your data is not coming from the database, then yes - Paginator will have to load all the information first in order to determine how to "split" it.
If you're not, and you're simply interacting with the database through Django's auto-generated SQL, then the Paginator performs a query to determine the number of items in the database (i.e. an SQL COUNT()) and uses the value you supplied to determine how many pages to generate. Example: count() returns 43 and you want pages of 10 results; the number of pages generated is 43 // 10 + 1 = 5 (integer division, plus one page for the remainder).
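A minimal sketch of that behaviour (assuming the KeywordManager above is attached to a Keyword model as objects; the page size of 10 is arbitrary):
from django.core.paginator import Paginator

qs = Keyword.objects.currentkeyword(kw, bd, ed)  # lazy: nothing fetched yet
paginator = Paginator(qs, 10)                    # still no query
page = paginator.page(1)                         # one COUNT(*) plus one LIMIT/OFFSET query
for row in page.object_list:
    print(row)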

In Django, how does one filter a QuerySet with dynamic field lookups?

Given a class:
from django.db import models
class Person(models.Model):
    name = models.CharField(max_length=20)
Is it possible, and if so how, to have a QuerySet that filters based on dynamic arguments? For example:
# Instead of:
Person.objects.filter(name__startswith='B')
# ... and:
Person.objects.filter(name__endswith='B')
# ... is there some way, given:
filter_by = '{0}__{1}'.format('name', 'startswith')
filter_value = 'B'
# ... that you can run the equivalent of this?
Person.objects.filter(filter_by=filter_value)
# ... which will throw an exception, since `filter_by` is not
# an attribute of `Person`.
Python's argument expansion may be used to solve this problem:
kwargs = {
    '{0}__{1}'.format('name', 'startswith'): 'A',
    '{0}__{1}'.format('name', 'endswith'): 'Z',
}

Person.objects.filter(**kwargs)
This is a very common and useful Python idiom.
A simplified example:
In a Django survey app, I wanted an HTML select list showing registered users. But because we have 5000 registered users, I needed a way to filter that list based on query criteria (such as just people who completed a certain workshop). In order for the survey element to be re-usable, I needed for the person creating the survey question to be able to attach those criteria to that question (don't want to hard-code the query into the app).
The solution I came up with isn't 100% user-friendly (it requires help from a technical person to create the query), but it does solve the problem. When creating the question, the editor can enter a dictionary into a custom field, e.g.:
{'is_staff':True,'last_name__startswith':'A',}
That string is stored in the database. In the view code it comes back as self.question.custom_query, whose value is a string that looks like a dictionary. We turn it back into a real dictionary with eval() and then stuff it into the queryset with **kwargs:
kwargs = eval(self.question.custom_query)
user_list = User.objects.filter(**kwargs).order_by("last_name")
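As an aside: eval() on a stored string is risky if the field can ever contain untrusted input. ast.literal_eval is a safer drop-in for dict literals (a sketch reusing the answer's names):
import ast

kwargs = ast.literal_eval(self.question.custom_query)  # parses literals only, never executes code
user_list = User.objects.filter(**kwargs).order_by("last_name")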
To extend the previous answer, which drew requests for further code examples, here is some working code that I use with Q. Let's say that a request may or may not include filters on fields like:
publisher_id
date_from
date_until
Those fields can appear in the query, but they may also be absent.
This is how I build filters based on those fields for an aggregated query that cannot be filtered further after the initial queryset execution:
from django.db.models import Q

# prepare filters to apply to the queryset
filters = {}
if publisher_id:
    filters['publisher_id'] = publisher_id
if date_from:
    filters['metric_date__gte'] = date_from
if date_until:
    filters['metric_date__lte'] = date_until

filter_q = Q(**filters)
queryset = Something.objects.filter(filter_q)...
Hope this helps, since I've spent quite some time digging this up.
Edit:
As an additional benefit, you can use lists too. For the previous example, if instead of publisher_id you have a list called publisher_ids, then you could use this piece of code:
if publisher_ids:
    filters['publisher_id__in'] = publisher_ids
django.db.models.Q is exactly what you want, the Django way.
This looks much more understandable to me:
kwargs = {
    'name__startswith': 'A',
    'name__endswith': 'Z',
    # ... add more filters here ...
}
Person.objects.filter(**kwargs)
A really complex search form usually indicates that a simpler model is trying to dig its way out.
How, exactly, do you expect to get the values for the column name and operation?
Where do you get the values of 'name' and 'startswith'?
filter_by = '%s__%s' % ('name', 'startswith')
A "search" form? You're going to -- what? -- pick the name from a list of names? Pick the operation from a list of operations? While open-ended, most people find this confusing and hard-to-use.
How many columns have such filters? 6? 12? 18?
A few? A complex pick-list doesn't make sense. A few fields and a few if-statements make sense.
A large number? Your model doesn't sound right. It sounds like the "field" is actually a key to a row in another table, not a column.
Specific filter buttons. Wait... That's the way the Django admin works. Specific filters are turned into buttons. And the same analysis as above applies. A few filters make sense. A large number of filters usually means a kind of first normal form violation.
A lot of similar fields often means there should have been more rows and fewer fields.
