My goal is to optimize the retrieval of the count of objects I have in my Django model.
I have two models:
Users
Prospects
It's a one-to-many relationship. One User can create many Prospects. One Prospect can only be created by one User.
I'm trying to get the Prospects created by the user in the last 24 hours.
The Prospects table has roughly 7 million rows in my PostgreSQL database; Users has only 2,000.
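For reference, a minimal sketch of the assumed models (field names inferred from the query below; not my exact schema):
from django.db import models

class User(models.Model):
    ...  # user fields omitted

class Prospect(models.Model):
    # Assumed fields, matching the filter used below
    user = models.ForeignKey(User, on_delete=models.CASCADE)
    create_date = models.DateTimeField(auto_now_add=True)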
My current code is taking too much time to get the desired results.
I tried to use filter() and count():
import datetime
# get the date 24 hours earlier than now
date_example = datetime.datetime.now() - datetime.timedelta(days=1)
# filter Prospects that are created by user_id_example
# and whose create_date is greater than or equal to date_example (i.e. within the last 24 hours)
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)
# get the count of prospects created in the past 24 hours by user_id_example;
# this is the problematic call that takes too long to process
count_total_today_prospects = today_prospects.count()
It works, but it takes too much time (5 minutes), because it's checking the entire database instead of only checking what I thought it would: just the prospects created in the last 24 hours by the user.
I also tried using annotate, but it's equally slow, because it's ultimately doing the same thing as the regular .count():
today_prospects.annotate(Count('id'))
How can I get the count in a more optimized way?
Assuming that you don't have it already, I suggest adding an index that includes both the user and date fields (make sure they are in this order, first the user and then the date, because you are looking for an exact match on the user but only a lower bound on the date). That should speed up the query.
For example:
class Prospect(models.Model):
    ...

    class Meta:
        ...
        indexes = [
            models.Index(fields=['user', 'create_date']),
        ]
        ...
This should create a new migration file when you run makemigrations; run migrate afterwards to add the index to the database.
After that, your same code should run a bit faster:
count_total_today_prospects = Prospect.objects\
.filter(user_id='user_id_example', create_date__gte=date_example)\
.count()
Django's documentation:
A count() call performs a SELECT COUNT(*) behind the scenes, so you should always use count() rather than loading all of the records into Python objects and calling len() on the result (unless you need to load the objects into memory anyway, in which case len() will be faster).
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
If the queryset has already been fully retrieved, count() will use that length rather than perform an extra database query.
Take a look at this link: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#count.
If you are already loading the objects into memory anyway, try len() instead.
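For example, a minimal sketch reusing the names from the question above, contrasting the two:
# count() issues a single SELECT COUNT(*) and never loads rows into Python,
# so it is the right call when you only need the number:
count_total_today_prospects = Prospect.objects.filter(
    user_id='user_id_example', create_date__gte=date_example
).count()

# len() only pays off if the rows are being loaded anyway, because it reuses
# the cached results instead of issuing a second query:
today_prospects = list(Prospect.objects.filter(
    user_id='user_id_example', create_date__gte=date_example
))
count_total_today_prospects = len(today_prospects)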
Related
I am running a webservice where a user sends a word as a request, and I use that word to filter entries in my database (the default Django SQLite). The relationship word-to-entry is one-to-one.
That means there are two possible cases:
The word exists in the database -> Return the associated Entry.
The word doesn't exist -> Throw exception.
The following lookup should then return a QuerySet with 1 or 0 objects:
Entry.objects.filter(word__iexact=word)
Expected Behavior:
Cases 1 and 2 do not differ perceptibly in speed.
Current Behavior:
Case 1 takes at most half a second.
Case 2 takes forever, around 1-2 minutes.
I find this puzzling. If an existing word can be looked up regardless of where it is in the database, why does case 2 take forever? I am not a Django or database expert, so I feel like I'm missing something here. Is it worth setting up a different type of database to see if that helps?
Here is the relevant portion of my code. I'm defining a helper function that gets called from a view:
mysite/myapp/utils.py
from .models import Entry

def get_entry(word):
    if Entry.objects.filter(word__iexact=word).exists():
        queryset = Entry.objects.filter(
            word__iexact=word
        )  # Case-insensitive exact lookup
        entry = queryset[0]  # Retrieve entry from queryset
        return entry
    else:
        raise IndexError
This is normal, especially with a few million records in SQLite and, I'm assuming, without an index.
A search for a missing word always has to go through all records if there is no usable index. A search for a word that exists can stop as soon as the word is found; there is no noticeable difference only if the word you are looking for happens to be the last one in table order.
The early exit actually works because you're using a slice (queryset[0]), which translates to LIMIT 1, so the database can stop looking at the first match.
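If you want the lookup itself to become index-friendly, one portable option is to store a normalized lowercase copy of the word and query it with an exact match. This is a sketch only, not the asker's original schema; the word_lower column and the save() override are assumptions:
from django.db import models

class Entry(models.Model):
    word = models.CharField(max_length=100)
    # Assumed extra column: indexed, always stored lowercase
    word_lower = models.CharField(max_length=100, db_index=True)

    def save(self, *args, **kwargs):
        self.word_lower = self.word.lower()
        super().save(*args, **kwargs)

The lookup then becomes an exact, indexable match instead of iexact:
Entry.objects.filter(word_lower=word.lower()).first()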
Thanks for reading my question! I use Django (3.0.8) with Postgres 12. Below is a simplified model for Inventory. There are about 1M records.
class Inventory(models.Model):
    account = models.ForeignKey(Account, on_delete=models.PROTECT)
    item_id = models.BigIntegerField(unique=True, db_index=True)
    amount = models.IntegerField()
Every hour we receive a new snapshot of one Account (e.g. acc_0) through a REST API. It contains a full list of items (~100k records), not deltas. I want to apply three actions:
Set amount=0 where account==acc_0 if item_id not in the new snapshots.
Set amount=snapshot['amount'] where account==acc_0 if item_id in the new snapshots.
Create new items if snapshot['item_id'] not in Inventory.item_id
Each time, most items already exist in DB and their amount is unchanged, i.e. the real delta is quite small (100-1k records).
What I'm doing now seems not very efficient:
with transaction.atomic():
    new_items = {}
    update_items = {}
    Inventory.objects.filter(account=acc_0).update(amount=0)
    for snapshot in snapshots:
        item_id = snapshot['item_id']
        results = Inventory.objects.filter(item_id=item_id)
        if len(results) == 0:
            new_items[item_id] = Inventory(...)
        else:
            item = results[0]
            item.amount = snapshot['amount']
            update_items[item_id] = item
    Inventory.objects.bulk_create(new_items.values())
    Inventory.objects.bulk_update(update_items.values(), ['amount'])
I was wondering whether I should upload the snapshot to a temporary table and use UPDATE ... SET ... CASE ... JOIN / INSERT INTO ... SELECT ... NOT EXISTS, or whether there is an even more Pythonic way.
There is one similar question: Django Mass Update/Insert Performance but it's also open.
As your question has three parts, I will answer each separately. I'm no expert on Postgres, but I can tell you the most convenient way of solving your problem using Django.
I won't say efficient as I haven't handled large datasets, but seeing as they're all Django-default functions, I will expect them to perform fairly well.
1 and 2: I will assume that your existing inventory records are already sorted by item_id. If they are sorted, a binary search will do the trick (at worst comparable to a linear search, typically O(log N) where N is the number of records). Once you find the item, do whatever you need to. If they are not sorted, don't bother and just go for a standard linear search.
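For illustration only, a rough sketch of that binary-search idea using Python's bisect module, assuming the Inventory model and acc_0 from the question (not the asker's code):
import bisect

# Fetch the existing item_ids for acc_0 once, sorted by the database
existing_ids = list(
    Inventory.objects.filter(account=acc_0)
                     .order_by('item_id')
                     .values_list('item_id', flat=True)
)

def id_exists(item_id):
    # bisect_left returns the insertion point; an exact hit means the id is present
    pos = bisect.bisect_left(existing_ids, item_id)
    return pos < len(existing_ids) and existing_ids[pos] == item_id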
3: You either get the object, or you create one. Django's get_or_create is exactly what you need.
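A minimal sketch of that idea, assuming the Inventory model, acc_0, and the snapshots list from the question (this does not cover zeroing out items missing from the snapshot):
from django.db import transaction

with transaction.atomic():
    for snapshot in snapshots:
        obj, created = Inventory.objects.get_or_create(
            item_id=snapshot['item_id'],
            defaults={'account': acc_0, 'amount': snapshot['amount']},
        )
        if not created and obj.amount != snapshot['amount']:
            # Only touch rows whose amount actually changed
            obj.amount = snapshot['amount']
            obj.save(update_fields=['amount'])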
I have a database table in PostgreSQL which contains 10,000,000 rows and 60 columns (features). I define a Django QuerySet as follows:
MyQ = MyDataBase.objects.filter(Name='Mike', date=date(2018, 2, 11), Class='03')
There are only 5 rows that satisfy the above filter. But when I try something like
MyQ.count() #which equals 5
or
MyQ.aggregate(Sum('Score'))['Score__sum'] #which equals 61
each take about 3 minutes to give me the result. Isn't that weird? Aren't querysets supposed to make life easier by focusing only on the rows we have told them to focus on? Counting 5 rows or summing one of their fields should not take that long. What am I doing wrong?
I should also say this. The first time I tried this code on this table, everything was fine and it took maybe 1 second to fetch the result, but now the 3 minutes is really annoying, and I have not changed anything in the database or the code since then.
Generally if you are filtering your table based on a certain field or number of fields, you should create an index on those fields. It allows the database query planner to take a more optimized path when searching/sorting.
It looks like you're using Postgres from your question, so you can run SELECT * FROM pg_indexes WHERE tablename = 'yourtable'; in psql to see any existing indexes.
Django can create these indexes for you in your model definition. For example, your model MyDataBase might look something like this:
class MyDataBase(models.Model):
    Name = models.TextField(db_index=True)
    date = models.DateField(db_index=True)
    Class = models.TextField(db_index=True)
Here's some more reading specific to creating indexes on Django models: gun.io/blog/learn-indexing-dammit
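If you always filter on these three columns together, a single composite index covering that exact combination can work even better than three separate single-column indexes. A sketch, assuming the same field names as above:
class MyDataBase(models.Model):
    Name = models.TextField()
    date = models.DateField()
    Class = models.TextField()

    class Meta:
        indexes = [
            # One index covering the combination used in the filter
            models.Index(fields=['Name', 'date', 'Class']),
        ]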
This problem relates to a gaming arcade parlor where people go to play games. Each time a person plays, a new entry is created in the database.
My model is like this:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
My view is like this:
today = datetime.now().date()
# Query the DB for gaming_machine objects whose score is 192 or 100, and get the
# count of those objects separately for each (machine_no, score) combination
gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 192, 8)]>
Can I change the returned queryset format to something like this:
{(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 192): 8}  # a dictionary with a tuple (machine_no, score) as the key and the count of such machine_nos as the value
I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds because Django querysets are lazy.
Django's lazy queries...
but I can't do that, as it takes around 1.4 seconds because Django querysets are lazy.
The laziness of Django's querysets actually has (close to) no impact on performance. They are lazy in the sense that they postpone querying the database until you need the result (for example when you start iterating over it). But then they will fetch all the rows. So there is no overhead of fetching one row at a time: all rows are fetched, and then Python iterates over them quite fast.
The laziness is thus not on a row-by-row basis: it does not advance the cursor each time you want to fetch the next row. The communication with the database is thus (quite) limited.
... and why it does not matter (performance-wise)
Unless the number of rows is huge (50,000 or more), the transition to a dictionary should also happen rather fast. So I suspect that the overhead is in the query itself. Django does have to "deserialize" the elements (turn the response into tuples), so there is some extra overhead, but it is usually small compared to the work the query already does without the dictionary comprehension. Typically one pushes work into the query when doing so reduces the amount of data transferred to Python.
For example, by performing the count in the database, the database returns a single integer per group instead of several rows; by filtering, we reduce the number of rows as well (since typically not all rows match a given criterion). Furthermore, the database typically has fast lookup mechanisms that speed up WHERE, GROUP BY, ORDER BY, etc. But merely post-processing the stream into a different shape of object would take the database roughly the same amount of time as it takes Python.
So the dictionary comprehension should do:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        Q(points=100) | Q(points=192), created__startswith=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
Speeding up queries
Since the problem is probably located at the database, you probably want to consider some possibilities for speedup.
Building indexes
Typically the best way to boost the performance of queries is to construct an index on columns that you filter on frequently and that have a large number of distinct values.
In that case the database constructs a data structure that stores, for every value of that column, a list of rows that match that value. As a result, instead of reading through all the rows and selecting the relevant ones, the database can access the data structure directly and typically determine in reasonable time which rows have that value.
Note that this typically only helps if the column contains a large number of distinct values: if, for example, the column only contains two values (in 1% of the cases the value is 0, and in 99% it is 1) and we filter on a very common value, this will not produce much speedup, since the set we need to process has approximately the same size.
So depending on how distinct the values are, we can add indices to the points and created fields:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    score = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
Improve the query
Secondly, we can also aim to improve the query itself, although this might be more database dependent (if we have two queries q1 and q2, then it is possible that q1 works faster than q2 on a MySQL database, and q2 works for example faster than q1 on a PostgreSQL database). So this is quite tricky: there are of course some things that typically work in general, but it is hard to give guarantees.
For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which might perform well - depending on how the database stores timestamps - but it can result in a computationally expensive query if it first needs to convert the datetime. In any case, it is more declarative to use created__date, since it makes it clear that you want the date part of created to equal today, so a more efficient query is probably:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
I am aware that a regular queryset, or the iterator() queryset method, evaluates and returns the entire data set in one shot.
For instance, take this:
my_objects = MyObject.objects.all()
for rows in my_objects:  # Way 1
    ...
for rows in my_objects.iterator():  # Way 2
    ...
Question
In both methods, all the rows are fetched in a single go. Is there any way in Django to fetch the queryset rows one by one from the database?
Why this weird Requirement
At present my query fetches, let's say, n rows, but sometimes I get a Django OperationalError (2006, 'MySQL server has gone away').
As a workaround, I am currently using some weird while-loop logic. So I was wondering if there is any native or built-in method, or whether my question is even logical in the first place! :)
I think you are looking to limit your query set.
Quote from above link:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
In other words, if you start with a count, you can then loop over and take slices as you require them:
cnt = MyObject.objects.count()
start_point = 0
inc = 5
while start_point < cnt:
    filtered = MyObject.objects.all()[start_point:start_point + inc]
    start_point += inc
Of course, you may need to add more error handling.
Fetching row by row might be worse. You might want to retrieve rows in batches of 1000 or so. I have used this Django snippet (not my work) successfully with very large querysets. It doesn't eat up memory, and there is no trouble with connections going away.
Here's the snippet from that link:
import gc

def queryset_iterator(queryset, chunksize=1000):
    '''
    Iterate over a Django Queryset ordered by the primary key.

    This method loads a maximum of chunksize (default: 1000) rows in its
    memory at the same time, while Django normally would load all rows in
    its memory. Using the iterator() method only causes it to not preload
    all the classes.

    Note that the implementation of the iterator does not support ordered
    query sets.
    '''
    pk = 0
    last_pk = queryset.order_by('-pk')[0].pk
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        for row in queryset.filter(pk__gt=pk)[:chunksize]:
            pk = row.pk
            yield row
        gc.collect()
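A usage sketch, assuming the MyObject model from the question; process() is just a placeholder for whatever you do with each row:
for obj in queryset_iterator(MyObject.objects.all(), chunksize=500):
    process(obj)  # placeholder for your per-row work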
Your approach is not that logical for solving the (2006, 'MySQL server has gone away') problem. Hitting the database for each entry increases the number of queries, which will itself become a problem as your application's usage grows.
I think you should close the MySQL connection after iterating over all elements of the result; then, if you try to make another query, Django will create a new connection.
from django.db import connection
connection.close()
Refer to this for more details.