Enhancing a queryset's performance - python

I'm learning DRF and experimenting with a queryset. I'm trying to optimize to work as efficiently as possible. The goal being to get a list of grades for active students who are majoring in 'Art'.
Based on database optimization techniques,
I've ran some different updates and don't see a difference when I look at the Time returned via the console's Network tab. I DO however, see less logs in the Seq scan when I run the .explain() method on the model filtering. Am I accomplishing anything by doing that?
For example:
Grades.objects.filter(student_id__in=list(student_list)).order_by()
Anything else I can do to improve the below code that I might be missing? - Outside of adding any Foreign or Primary key model changes.
class GradeViewSet(viewsets.ModelViewSet):
serializer_class = GradesSerializer
def retrieve(self, request, *args, **kwargs):
active_students = Student.objects.filter(active=True)
student_list = active_students.filter(major='Art').values_list('student_id')
queryset = Grades.objects.filter(student_id__in=student_list)
serializers = GradesSerializer(queryset, many=True)
return Response(serializers.data)
SQL query I'm attempting to create in Django.
select * from app_grades g
join app_students s on g.student_id = s.student_id
where s.active = true and s.major = 'Art'

Your code will execute two separate database queries, I suggest that you try the following query instead:
queryset = Grades.objects.filter(student__active=True, student__major='Art')
this code will retrieve the exact same records but performing only one query with the appropriate JOIN clause.
You probably want to take a look at this part of the documentation.
Because of the lack of model relations that forbids the use of lookups I suggest that you use an Exists subuery. In this specific case the query will be as follows:
queryset = Grades.objects.annotate(student_passes_filter=Exists(
Student.objects.filter(id=OuterRef('student_id'), active=True, major='Art')
)).filter(student_passes_filter=True)
You will need to import Exists and OuterRef. Note that these are available from Django 1.11 onwards.

You should probably regroup those lines to reduce the number of queries:
active_students = Student.objects.filter(active=True)
student_list = active_students.filter(major='Art').values_list('student_id')
Into:
active_students = Student.objects.filter(active=True, major=‘Art’)
And converting to list then

Related

How to write manager class which use filter field as computed field not as a part of model fields?

I have a model Student with manager StudentManager as given below. As property gives the last date by adding college_duration in join_date. But when I execute this property computation is working well, but for StudentManager it gives an error. How to write manager class which on the fly computes some field using model fields and which is used to filter records.
The computed field is not in model fields. still, I want that as filter criteria.
class StudentManager(models.Manager):
def passed_students(self):
return self.filter(college_end_date__lt=timezone.now())
class Student(models.Model):
join_date = models.DateTimeField(auto_now_add=True)
college_duration = models.IntegerField(default=4)
objects = StudentManager()
#property
def college_end_date(self):
last_date = self.join_date + timezone.timedelta(days=self.college_duration)
return last_date
Error Django gives. when I tried to access Student.objects.passed_students()
django.core.exceptions.FieldError: Cannot resolve keyword 'college_end_date' into field. Choices are: join_date, college_duration
Q 1. How alias queries done in Django ORM?
By using the annotate(...)--(Django Doc) or alias(...) (New in Django 3.2) if you're using the value only as a filter.
Q 2. Why property not accessed in Django managers?
Because the model managers (more accurately, the QuerySet s) are wrapping things that are being done in the database. You can call the model managers as a high-level database wrapper too.
But, the property college_end_date is only defined in your model class and the database is not aware of it, and hence the error.
Q 3. How to write manager to filter records based on the field which is not in models, but can be calculated using fields present in the model?
Using annotate(...) method is the proper Django way of doing so. As a side note, a complex property logic may not be re-create with the annotate(...) method.
In your case, I would change college_duration field from IntegerField(...) to DurationField(...)--(Django Doc) since its make more sense (to me)
Later, update your manager and the properties as,
from django.db import models
from django.utils import timezone
class StudentManager(models.Manager):
<b>def passed_students(self):
default_qs = self.get_queryset()
college_end = models.ExpressionWrapper(
models.F('join_date') + models.F('college_duration'),
output_field=models.DateField()
)
return default_qs \
.annotate(college_end=college_end) \
.filter(college_end__lt=timezone.now().date())</b>
class Student(models.Model):
join_date = models.DateTimeField()
college_duration = models.DurationField()
objects = StudentManager()
#property
def college_end_date(self):
# return date by summing the datetime and timedelta objects
return <b>(self.join_date + self.college_duration).date()
Note:
DurationField(...) will work as expected in PostgreSQL and this implementation will work as-is in PSQL. You may have problems if you are using any other databases, if so, you may need to have a "database function" which operates over the datetime and duration datasets corresponding to your specific database.
Personally, I like this solution,
To quote #Willem Van Olsem's comment:
You don't. The database does not know anything about properties, etc. So it can not filter on this. You can make use of .annotate(..) to move the logic to the database side.
You can either do the message he shared, or make that a model field that auto calculates.
class StudentManager(models.Manager):
def passed_students(self):
return self.filter(college_end_date__lt=timezone.now())
class Student(models.Model):
join_date = models.DateTimeField(auto_now_add=True)
college_duration = models.IntegerField(default=4)
college_end_date = models.DateTimeField()
objects = StudentManager()
def save(self, *args, **kwargs):
# Add logic here
if not self.college_end_date:
self.college_end_date = self.join_date + timezone.timedelta(days-self.college_duration)
return super.save(*args, **kwargs)
Now you can search it in the database.
NOTE: This sort of thing is best to do from the start on data you KNOW you're going to want to filter. If you have pre-existing data, you'll need to re-save all existing instances.
Problem
You’re attempting to query on a row that doesn’t exist in the database. Also, Django ORM doesn’t recognize a property as a field to register.
Solution
The direct answer to your question would be to create annotations, which could be subsequently queried off of. However, I would reconsider your table design for Student as it introduces unnecessary complexity and maintenance overhead.
There’s much more framework/db support for start date, end date idiosyncrasy than there is start date, timedelta.
Instead of storing duration, store end_date and calculate duration in a model method. This makes more not only makes more sense as students are generally provided a start date and estimated graduation date rather than duration, but also because it’ll make queries like these much easier.
Example
Querying which students are graduating in 2020.
Students.objects.filter(end_date__year=2020)

How to avoid ordering by in django queryset, order_by() not working

I have a "big" db whith over 60M records, and I'm trying to paginate by 50.
I have another db whith ~8M records and it works perfectly, but with the 60M amount it just never loads and overflows the db.
I found that the problem was the order_by(id) made by django so I tried using a mysql view already ordered by id, but then django tries to order it again. To avoid this, I used order_by(), which is supposed to avoid any ordering, but it still does it.
def get_queryset(self, request):
qs = super(CropAdmin, self).get_queryset(request)
qs1 = qs.only('id', 'grain__id', 'scan__id', 'scan__acquisition__id',
'validated', 'area', 'crop_date', 'matched_label', 'grain__grain_number', 'filename').order_by()
if request.user.is_superuser:
return qs1
The query made is still using order_by:
SELECT `crops_ordered`.`crop_id`,
`crops_ordered`.`crop_date`,
`crops_ordered`.`area`,
`crops_ordered`.`matched_label`,
`crops_ordered`.`validated`,
`crops_ordered`.`scan_id`,
`crops_ordered`.`grain_id`,
`crops_ordered`.`filename`,
`scans`.`scan_id`,
`scans`.`acquisition_id`,
`acquisitions`.`acquisition_id`,
`grains`.`grain_id`,
`grains`.`grain_number`
FROM `crops_ordered`
INNER JOIN `scans`
ON (`crops_ordered`.`scan_id` = `scans`.`scan_id`)
INNER JOIN `acquisitions`
ON (`scans`.`acquisition_id` = `acquisitions`.`acquisition_id`)
INNER JOIN `grains`
ON (`crops_ordered`.`grain_id` = `grains`.`grain_id`)
**ORDER BY `crops_ordered`.`crop_id` DESC**
LIMIT 50
Any idea on how to fix this? Or a better way to work with a db of this size?
I don't believe order_by() will work as there will most likely be a default parameter when Django implemented this function. Having said that, I believe this thread has the answer that you want.
Edit
The link in that thread might provide too much information at once, although there aren't many details on this either. If you don't like Github, there's also an official documentation page on this method but you'll have to manually look for clear_ordering by using CTRL + f or any equivalence.

Prefetch object not working with order_by queryset

Using Django 11 with PostgreSQL db. I have the models as shown below. I'm trying to prefetch a related queryset, using the Prefetch object and prefetch_related without assigning it to an attribute.
class Person(Model):
name = Charfield()
#property
def latest_photo(self):
return self.photos.order_by('created_at')[-1]
class Photo(Model):
person = ForeignKey(Person, related_name='photos')
created_at = models.DateTimeField(auto_now_add=True)
first_person = Person.objects.prefetch_related(Prefetch('photos', queryset=Photo.objects.order_by('created_at'))).first()
first_person.photos.order_by('created_at') # still hits the database
first_person.latest_photo # still hits the database
In the ideal case, calling person.latest_photo will not hit the database again. This will allow me to use that property safely in a list display.
However, as noted in the comments in the code, the prefetched queryset is not being used when I try to get the latest photo. Why is that?
Note: I've tried using the to_attr argument of Prefetch and that seems to work, however, it's not ideal since it means I would have to edit latest_photo to try to use the prefetched attribute.
The problem is with slicing, it creates a different query.
You can work around it like this:
...
#property
def latest_photo(self):
first_use_the_prefetch = list(self.photos.order_by('created_at'))
then_slice = first_use_the_prefetch[-1]
return then_slice
And in case you want to try, it is not possible to use slicing inside the Prefetch(query=...no slicing here...) (there is a wontfix feature request for this somewhere in Django tracker).

Reorder Django QuerySet by dynamically added field

A have piece of code, which fetches some QuerySet from DB and then appends new calculated field to every object in the Query Set. It's not an option to add this field via annotation (because it's legacy and because this calculation based on another already pre-fetched data).
Like this:
from django.db import models
class Human(models.Model):
name = models.CharField()
surname = models.CharField()
def calculate_new_field(s):
return len(s.name)*42
people = Human.objects.filter(id__in=[1,2,3,4,5])
for s in people:
s.new_column = calculate_new_field(s)
# people.somehow_reorder(new_order_by=new_column)
So now all people in QuerySet have a new column. And I want order these objects by new_column field. order_by() will not work obviously, since it is a database option. I understand thatI can pass them as a sorted list, but there is a lot of templates and other logic, which expect from this object QuerySet-like inteface with it's methods and so on.
So question is: is there some not very bad and dirty way to reorder existing QuerySet by dinamically added field or create new QuerySet-like object with this data? I believe I'm not the only one who faced this problem and it's already solved with django. But I can't find anything (except for adding third-party libs, and this is not an option too).
Conceptually, the QuerySet is not a list of results, but the "instructions to get those results". It's lazily evaluated and also cached. The internal attribute of the QuerySet that keeps the cached results is qs._result_cache
So, the for s in people sentence is forcing the evaluation of the query and caching the results.
You could, after that, sort the results by doing:
people._result_cache.sort(key=attrgetter('new_column'))
But, after evaluating a QuerySet, it makes little sense (in my opinion) to keep the QuerySet interface, as many of the operations will cause a reevaluation of the query. From this point on you should be dealing with a list of Models
Can you try it functions.Length:
from django.db.models.functions import Length
qs = Human.objects.filter(id__in=[1,2,3,4,5])
qs.annotate(reorder=Length('name') * 42).order_by('reorder')

Tastypie Dehydrate reverse relation count

I have a simple model which includes a product and category table. The Product model has a foreign key Category.
When I make a tastypie API call that returns a list of categories /api/vi/categories/
I would like to add a field that determines the "product count" / the number of products that have a giving category. The result would be something like:
category_objects[
{
id: 53
name: Laptops
product_count: 7
},
...
]
The following code is working but the hit on my DB is heavy
def dehydrate(self, bundle):
category = Category.objects.get(pk=bundle.obj.id)
products = Product.objects.filter(category=category)
bundle.data['product_count'] = products.count()
return bundle
Is there a more efficient way to build this query? Perhaps with annotate ?
You can use prefetch_related method of QuerSet to reverse select_related.
Asper documentation,
prefetch_related(*lookups)
Returns a QuerySet that will automatically
retrieve, in a single batch, related objects for each of the specified
lookups.
This has a similar purpose to select_related, in that both are
designed to stop the deluge of database queries that is caused by
accessing related objects, but the strategy is quite different.
If you change your dehydrate function to following then database will be hit single time.
def dehydrate(self, bundle):
category = Category.objects.prefetch_related("product_set").get(pk=bundle.obj.id)
bundle.data['product_count'] = category.product_set.count()
return bundle
UPDATE 1
You should not initialize queryset inside dehydrate function. queryset should be always set in Meta class only. Please have a look at following example from django-tastypie documentation.
class MyResource(ModelResource):
class Meta:
queryset = User.objects.all()
excludes = ['email', 'password', 'is_staff', 'is_superuser']
def dehydrate(self, bundle):
# If they're requesting their own record, add in their email address.
if bundle.request.user.pk == bundle.obj.pk:
# Note that there isn't an ``email`` field on the ``Resource``.
# By this time, it doesn't matter, as the built data will no
# longer be checked against the fields on the ``Resource``.
bundle.data['email'] = bundle.obj.email
return bundle
As per official django-tastypie documentation on dehydrate() function,
dehydrate
The dehydrate method takes a now fully-populated bundle.data & make
any last alterations to it. This is useful for when a piece of data
might depend on more than one field, if you want to shove in extra
data that isn’t worth having its own field or if you want to
dynamically remove things from the data to be returned.
dehydrate() is only meant to make any last alterations to bundle.data.
Your code does additional count query for each category. You're right about annotate being helpfull in this kind of a problem.
Django will include all queryset's fields in GROUP BY statement. Notice .values() and empty .group_by() serve limiting field set to required fields.
cat_to_prod_count = dict(Product.objects
.values('category_id')
.order_by()
.annotate(product_count=Count('id'))
.values_list('category_id', 'product_count'))
The above dict object is a map [category_id -> product_count].
It can be used in dehydrate method:
bundle.data['product_count'] = cat_to_prod_count[bundle.obj.id]
If that doesn't help, try to keep similar counter on category records and use singals to keep it up to date.
Note categories are usually a tree-like beings and you probably want to keep count of all subcategories as well.
In that case look at the package django-mptt.

Categories