Thanks for reading my question! I use django (3.0.8) with postgres 12. Below is a simplified model for Inventory. There are about 1M records.
class Inventory(models.Model):
    account = models.ForeignKey(Account, on_delete=models.PROTECT)
    item_id = models.BigIntegerField(unique=True, db_index=True)
    amount = models.IntegerField()
Every hour we receive a new snapshot of one Account (e.g. acc_0) through a REST API. It contains the full list of items (~100k records), not deltas. I want to apply 3 actions:
Set amount=0 where account==acc_0 if item_id not in the new snapshots.
Set amount=snapshot['amount'] where account==acc_0 if item_id in the new snapshots.
Create new items if snapshot['item_id'] not in Inventory.item_id
Each time, most items already exist in the DB and their amount is unchanged, i.e. the real delta is quite small (100-1k records).
What I'm doing now seems not very efficient:
with transaction.atomic():
    new_items = {}
    update_items = {}
    Inventory.objects.filter(account=acc_0).update(amount=0)
    for snapshot in snapshots:
        item_id = snapshot['item_id']
        results = Inventory.objects.filter(item_id=item_id)
        if len(results) == 0:
            new_items[item_id] = Inventory(...)
        else:
            item = results[0]
            item.amount = snapshot['amount']
            update_items[item_id] = item
    Inventory.objects.bulk_create(new_items.values())
    Inventory.objects.bulk_update(update_items.values(), ['amount'])
I was wondering whether I should upload the snapshot to a temporary table and use UPDATE ... SET ... CASE ... JOIN and INSERT INTO ... SELECT ... WHERE NOT EXISTS, or whether there is an even better, more pythonic way.
There is one similar question, Django Mass Update/Insert Performance, but it is also still open.
As you have three questions, I will answer in three parts. I'm no expert on Postgres, but I can tell you the most convenient way of solving your problem using Django.
I won't say efficient, as I haven't handled large datasets, but since these are all built-in Django functions, I expect them to perform fairly well.
1 and 2: I will assume that your account objects are already sorted by id. If they are, a binary search will do the trick (O(log N) versus O(N) for a linear search). Once you find the item, do whatever you need to. If they are not sorted, don't bother and just go for a standard linear search.
3: You either get the object, or you create one. Django's get_or_create is exactly what you need.
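A rough sketch of what that could look like for the snapshot loop, assuming the Inventory model and snapshot format from the question (note that this still issues one or two queries per snapshot row, so it is convenient rather than fast):
with transaction.atomic():
    Inventory.objects.filter(account=acc_0).update(amount=0)
    for snapshot in snapshots:
        item, created = Inventory.objects.get_or_create(
            item_id=snapshot['item_id'],
            defaults={'account': acc_0, 'amount': snapshot['amount']},
        )
        if not created:
            item.amount = snapshot['amount']
            item.save(update_fields=['amount'])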
Related
My goal is to optimize the retrieval of the count of objects I have in my Django model.
I have two models:
Users
Prospects
It's a one-to-many relationship. One User can create many Prospects. One Prospect can only be created by one User.
I'm trying to get the Prospects created by the user in the last 24 hours.
The Prospects model has roughly 7 million rows in my PostgreSQL database; Users only 2000.
My current code is taking too much time to get the desired results.
I tried to use filter() and count():
import datetime
# get the date, but 24 hours earlier
date_example = datetime.datetime.now() - datetime.timedelta(days=1)
# Filter Prospects that were created by user_id_example
# and whose create_date is greater than or equal to date_example (i.e. within the last 24 hours)
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)
# get the count of prospects created in the past 24 hours by user_id_example
# this is the problematic call that takes too long to process
count_total_today_prospects = today_prospects.count()
It works, but it takes too much time (5 minutes), because it's scanning the entire table instead of checking only what I thought it would: the prospects created in the last 24 hours by that user.
I also tried using annotate, but it's equally slow, because it's ultimately doing the same thing as the regular .count():
today_prospects.annotate(Count('id'))
How can I get the count in a more optimized way?
Assuming that you don't have it already, I suggest adding an index that covers both the user and date fields (make sure they are in this order, first the user and then the date, because you are looking for an exact match on the user but only a lower bound on the date). That should speed up the query.
For example:
class Prospect(models.Model):
    ...

    class Meta:
        ...
        indexes = [
            models.Index(fields=['user', 'create_date']),
        ]
        ...
Run makemigrations to create a new migration file that adds the index, and migrate to apply it to the database.
After that, your same code should run a bit faster:
count_total_today_prospects = Prospect.objects\
    .filter(user_id='user_id_example', create_date__gte=date_example)\
    .count()
Django's documentation:
A count() call performs a SELECT COUNT(*) behind the scenes, so you should always use count() rather than loading all of the record into Python objects and calling len() on the result (unless you need to load the objects into memory anyway, in which case len() will be faster).
Note that if you want the number of items in a QuerySet and are also retrieving model instances from it (for example, by iterating over it), it’s probably more efficient to use len(queryset) which won’t cause an extra database query like count() would.
If the queryset has already been fully retrieved, count() will use that length rather than perform an extra database query.
Take a look at this link: https://docs.djangoproject.com/en/3.2/ref/models/querysets/#count.
Try to use len().
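A small illustration of the case the quoted docs describe: len() only avoids the extra query if the queryset has already been evaluated, for example because you are iterating over the objects anyway (a sketch, reusing the variables from the question):
today_prospects = Prospect.objects.filter(user_id='user_id_example', create_date__gte=date_example)
prospect_list = list(today_prospects)  # the queryset is evaluated here
total = len(prospect_list)             # no extra database query; reuses the fetched rows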
This problem relates to a gaming arcade parlor where people go into the parlor and play a game. As a person plays, a new entry is created in the database.
My model is like this:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    points = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
My view is like this:
today = datetime.now().date()
# query the db for the gaming_machine objects whose points are 192 or 100, and get the
# count of those objects separately for each (machine_no, points) combination
gaming_machine.objects.filter(Q(points=100) | Q(points=192), created__startswith=today).values_list('machine_no', 'points').annotate(Count('machine_no'))
# this returns a queryset of tuples -> (machine_no, points, count)
<QuerySet [(330, 192, 2), (330, 100, 4), (331, 192, 7), (331, 192, 8)]>
Can I change the returned queryset format to something like this:
{(330, 192): 2, (330, 100): 4, (331, 192): 7, (331, 192): 8}  # that is, a dictionary with a tuple of (machine_no, points) as the key and the count of such machine_nos as the value
I am aware that I can change the format of this queryset on the Python side using something like a dictionary comprehension, but I can't do that, as it takes around 1.4 seconds because Django querysets are lazy.
Django's lazy queries...
but I can't do that, as it takes around 1.4 seconds because Django querysets are lazy.
The laziness of Django's querysets actually has close to no impact on performance. They are lazy in the sense that they postpone querying the database until you need the result (for example when you start iterating over it). But then they fetch all the rows at once, so there is no per-row overhead: all rows are fetched in one go, and Python then iterates over them quite fast.
The laziness is thus not on a row-by-row basis: the queryset does not advance a cursor each time you want the next row. The communication with the database is thus quite limited.
... and why it does not matter (performance-wise)
Unless the number of rows is huge (50,000 or more), the conversion to a dictionary should also happen rather fast, so I suspect that the overhead is due to the query itself. Django does have to "deserialize" the elements (turn the response into tuples), so while that adds some overhead, it is usually small compared to the work that is already done without the dictionary comprehension. Typically one pushes work into the query when doing so reduces the amount of data transferred to Python.
For example, by performing the count in the database, the database returns a single integer per group instead of all the underlying rows, and by filtering we reduce the number of rows as well (since typically not all rows match a given criterion). Furthermore, databases have fast lookup mechanisms that speed up WHERE, GROUP BY, ORDER BY, etc. But post-processing the result into a different object shape would usually take the same order of magnitude of time, whether it is done by the database or in Python.
So the dictionary comprehension should do:
{
d[:2]: d[3]
for d in gaming_machine.objects.filter(
Q(points=100) | Q(points=192),created__startswith=today
).values_list(
'machine_no','points'
).annotate(
Count('machine_no')
)
}
Speeding up queries
Since the problem is probably located at the database, you may want to consider some possibilities for speeding it up.
Building indexes
Typically the best way to boost the performance of queries is to construct an index on the columns that you filter on frequently and that have a large number of distinct values.
In that case the database builds a data structure that stores, for every value of that column, a list of rows that match that value. As a result, instead of reading through all the rows and selecting the relevant ones, the database can access that data structure directly and find the matching rows in reasonable time.
Note that this typically only helps if the column contains a large number of distinct values: if, for example, the column only contains two values (say 1% of the rows have 0 and 99% have 1) and we filter on the very common value, this will not produce much speedup, since the set we still need to process has approximately the same size.
So depending on how distinct the values are, we can add indexes to the points and created fields:
class gaming_machine(models.Model):
    machine_no = models.IntegerField()
    points = models.IntegerField(db_index=True)
    created = models.DateTimeField(auto_now_add=True, db_index=True)
Improve the query
Secondly, we can also aim to improve the query itself, although this is more database dependent (if we have two queries q1 and q2, it is possible that q1 runs faster than q2 on a MySQL database while q2 runs faster than q1 on a PostgreSQL database). So this is quite tricky: there are of course some things that typically work in general, but it is hard to give guarantees.
For example, sometimes x IN (100, 192) works faster than x = 100 OR x = 192 (see here). Furthermore, you use __startswith here, which might perform well - depending on how the database stores timestamps - but it can result in a computationally expensive query if the database first needs to convert the datetime. Anyway, it is more declarative to use created__date, since it makes clear that you want the date of created to be equal to today, so a more efficient query is probably:
{
    d[:2]: d[2]
    for d in gaming_machine.objects.filter(
        points__in=[100, 192], created__date=today
    ).values_list(
        'machine_no', 'points'
    ).annotate(
        Count('machine_no')
    )
}
In one page of my app, I'm trying to display the most expensive car for each company. My models look roughly like this:
class Company(models.Model):
    id = models.IntegerField(primary_key=True)
    company = models.CharField(max_length=100)
    headcount = models.IntegerField(null=False)
    info = models.CharField(max_length=100)

class Car(models.Model):
    id = models.IntegerField(primary_key=True)
    company_unique = models.ForeignKey(Company)
    company = models.CharField(max_length=50)
    name = models.CharField(max_length=100)
    price = models.DecimalField(max_digits=9, decimal_places=2, default=0.00)
So, I want to build a list consisting of every company's single most expensive Car object.
I approached the problem like this:
company_list = Company.objects.all()
most_expensive = []
for company in company_list:
    most_expensive.append(Car.objects.filter(company_unique=company.id).order_by("-price")[0])
However, this seems to be a very inefficient method. I can see with Django Debug Toolbar that this code is making way too many mysql queries.
Can someone suggest a better way to build this list that would hit MySQL maybe just once or twice?
While what you're dealing with is quite a common case, an obvious solution is seemingly lacking.
Solution 1, found in this article. You could probably try something along these lines:
companies = Company.objects.annotate(max_price=Max('car__price'))
values = tuple((company.id, company.max_price) for company in companies)
expensive_cars = Car.objects.extra(where=['(company_unique_id, price) IN %s' % (values,)])
Can't say I like the solution - .extra should be avoided - but I can't think of a better way. I am also not entirely sure this will work at all.
Solution 2, sub-optimal. You can make use of a custom Prefetch object.
prefetch = Prefetch('car_set', queryset=Car.objects.order_by('-price'), to_attr='cars_by_price')
companies = Company.objects.prefetch_related(prefetch)
most_expensive_cars = []
for company in companies:
    most_expensive_cars.append(company.cars_by_price[0])
That should definitely work and fetch everything in two queries, but it is extremely wasteful, since it will load every Car related to the given set of Companies into memory. Do note that the to_attr part is not optional: it stores the prefetched results as a plain list on each company, so indexing it uses the result of the prefetch, whereas slicing or indexing a freshly built queryset (for example company.car_set.order_by('-price')[0]) would run a separate DB query and negate the prefetch.
If you need to access each car's company afterwards (e.g. car.company_unique), don't shy away from using select_related, as suggested by Erik in the comments.
Add a field to Company called something like priciest_car and override save so that every time you save a company, you loop through its related cars and set the most expensive one as priciest_car. Then, when you need the most expensive car for each company, you can just loop through the companies and add company.priciest_car to the list. It's one loop, one SQL call per iteration. The only extra work happens when you save a company, and since that is per company it shouldn't take too long. If it does, find a way to set the priciest_car field only when you know it has changed (see the sketch below).
I swore this was how I was able to handle it, but it seems like I must be mistaken.
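A rough sketch of that idea, assuming a nullable priciest_car foreign key on Company and the default car_set reverse accessor (the field and helper names are illustrative, not from the original code, and the recomputation is shown as an explicit helper rather than a save() override to keep it short):
class Company(models.Model):
    # ... existing fields ...
    priciest_car = models.ForeignKey('Car', null=True, blank=True,
                                     related_name='+', on_delete=models.SET_NULL)

    def refresh_priciest_car(self):
        # Recompute and store the most expensive related car.
        top = self.car_set.order_by('-price').first()
        if top != self.priciest_car:
            self.priciest_car = top
            self.save(update_fields=['priciest_car'])

# Building the list then needs only one query:
most_expensive = [c.priciest_car for c in Company.objects.select_related('priciest_car')]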
I think that's possible with Aggregation:
most_expensive = Car.objects.values('company_unique').annotate(Max('price'))
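Note that this returns dictionaries rather than Car instances; a quick illustration of the shape (the values shown are made up, and price__max is the default alias Django generates for Max('price')):
for row in most_expensive:
    # e.g. {'company_unique': 3, 'price__max': Decimal('79999.00')}
    print(row['company_unique'], row['price__max'])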
The following is in raw SQL, which has its benefits, but I feel like there may be a cleaner way:
from django.db import connection
cursor = connection.cursor()
cursor.execute("SELECT Max(price), company_unique FROM Car GROUP BY company_unique");
price_company = cursor.fetchall()
# This still does one query per car, only it fetches one item at a time.
most_expensive = [Cars.objects.get(price=pc[0],company_unique=pc[1])
for pc in price_company]
If you really wanted to limit it to one query, then you might be able to leverage raw:
most_expensive = Car.objects.raw("""
    SELECT * FROM Car
    INNER JOIN
        (SELECT Max(price) AS price, company_unique FROM Car GROUP BY company_unique) m
        ON m.price = Car.price AND m.company_unique = Car.company_unique
""")
The problem with using raw is that it is not database agnostic, so any refactoring will need to involve re-writing this query. (Oracle, for example, has a different secondary query syntax).
I feel like I should point out that the SELECT Max(price) AS price, company_unique FROM Car GROUP BY company_unique query will be executed no matter what; if you're using a more Django-native solution, it simply happens behind the scenes.
Let's take these models:
User
- name
Product
- name
- category
List
- name
- creation_date
- user (reference)
Product_List
- list (reference)
- product (reference)
How can I retrieve a list of the products that are not in the list?
Should I retrieve all of them and then remove the ones already in the list programmatically (doesn't this make the request slower?):
Get all the products of a certain list of a certain user
Get all the products
Extract the difference (a nested for?)
Sorry, I'm kind of a newbie at this; suggestions and comments are welcome!
Thanks!
If you structure your data like this:
class Product(db.Model):
    # ...

class UserInfo(db.Model):
    # ...

class ProductList(db.Model):
    owner = db.ReferenceProperty(UserInfo)
    products = db.ListProperty(db.Key)
Then you can retrieve the products not in a list like this:
product_keys = set(Product.all(keys_only=True).fetch(1000))
product_list = ProductList.get_by_id(product_list_id)
missing_products = product_keys - set(product_list.products)
missing_products is a set of keys, which you can pass to db.get to retrieve the corresponding product entities.
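For example (a one-liner, reusing the variables from the snippet above):
missing_product_entities = db.get(list(missing_products))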
This will, of course, require retrieving the entire list of products, but this is exactly what a relational database would have to do to satisfy the query too.
For this particular cleanup job, you may want to use a map reduce job to scan through your set of Product_Lists.
In the future, try to keep such situations from occurring - for example, you could make each List an entity group, and when the List is removed, also remove all associated Product_Lists in the same transaction. Alternately, eliminate Product_Lists entirely, and just use a list property in the List entity itself.
In general, the lack of joins is a sacrifice you have to make for scalability. Many joins are only efficient when the entire dataset is in local memory; as such, once you have a database too large to fit in a single server, you will always need to sacrifice joins. This may mean major changes to your schema, and/or turning some operations into batch jobs. GAE is simply forcing you to make these design changes up front, when it's easier to do so, rather than later when you're pressed for time.
OK, this is a 2-part question. I've seen and searched for several methods to get a list of unique values for a class and haven't been happy with any of them so far.
So, does anyone have simple example code for getting unique values, for instance for this model? Here is my super slow example.
import time

class LinkRating2(db.Model):
    user = db.StringProperty()
    link = db.StringProperty()
    rating2 = db.FloatProperty()

def uniqueLinkGet(tabl):
    start = time.time()
    dic = {}
    query = tabl.all()
    for obj in query:
        dic[obj.link] = 1
    end = time.time()
    print end - start
    return dic
My second question: is iterating over the query (instead of calling fetch()) slower? Is there a faster way to write the code below, especially if the number of elements can be larger than 1000?
query = LinkRating2.all()
link1 = 'some random string'
a = query.filter('link = ', link1)
adic = {}
for itema in a:
    adic[itema.user] = itema.rating2
1) One trick to make this query fast is to denormalize your data. Specifically, create another model which simply stores a link as the key. Then you can get a list of unique links by simply reading everything in that table. Assuming that you have many LinkRating2 entities for each link, then this will save you a lot of time. Example:
class Link(db.Model):
    pass  # the only data in this model will be stored in its key

# Whenever a link is added, you can try to add it to the datastore. If it already
# exists, then this is functionally a no-op - it will just overwrite the old copy of
# the same link. Using the link as the key_name ensures there will be no duplicates.
Link(key_name=link).put()

# Get all the unique links by simply retrieving all of the Link entities and extracting
# the key name. You'll need to use cursors if you have >1,000 entities.
unique_links = [x.key().name() for x in Link.all().fetch(1000)]
Another idea: If you need to do this query frequently, then keep a copy of the results in memcache so you don't have to read all of this data from the datastore all the time. A single memcache entry can only store 1MB of data, so you may have to split your links data into chunks to store it in memcache.
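A rough sketch of the memcache idea, assuming the Link model from the snippet above and a result small enough to fit in a single entry:
from google.appengine.api import memcache

def get_unique_links():
    links = memcache.get('unique_links')
    if links is None:
        links = [k.name() for k in Link.all(keys_only=True).fetch(1000)]
        memcache.set('unique_links', links, time=3600)  # cache for an hour
    return links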
2) It is faster to use fetch() instead of using the iterator. The iterator causes entities to be fetched in "small batches" - each "small batch" results in a round-trip to the datastore to get more data. If you use fetch(), then you'll get all the data at once with just one round-trip to the datastore. In short, use fetch() if you know you are going to need lots of results.
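Applied to the second snippet from the question, that would look something like this (a sketch; 1000 was the per-call fetch limit at the time):
results = LinkRating2.all().filter('link = ', link1).fetch(1000)
adic = dict((r.user, r.rating2) for r in results)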