I have a Django system that runs billing for thousands of customers on a regular basis. Here are my models:
class Invoice(models.Model):
    balance = models.DecimalField(
        max_digits=6,
        decimal_places=2,
    )

class Transaction(models.Model):
    amount = models.DecimalField(
        max_digits=6,
        decimal_places=2,
    )
    invoice = models.ForeignKey(
        Invoice,
        on_delete=models.CASCADE,
        related_name='invoices',
        null=False,
    )
When billing is run, thousands of invoices with tens of transactions each are created using several nested for loops, which triggers an insert for each created record. I could run bulk_create() on the transactions for each individual invoice, but this still results in thousands of calls to bulk_create().
How would one bulk-create thousands of related models so that the relationship is maintained and the database is used in the most efficient way possible?
Notes:
I'm looking for a native Django solution that would work on all databases (with the possible exception of SQLite).
My system runs billing in a Celery task to decouple long-running code from active requests, but I am still concerned with how long it takes to complete a billing cycle.
The solution should assume that other requests or running tasks are also reading from and writing to the tables in question.
You could bulk_create() all the Invoice objects, refresh them from the db so that they all have ids, create the Transaction objects for all the invoices, and then also save them with bulk_create(). All of this can be done inside a single transaction.atomic() context.
Also, specifically for Django 1.10 and PostgreSQL, look at this answer.
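A minimal sketch of that approach, assuming the billing input is available as a list of dicts (billing_rows, row['balance'], row['items'] and item['amount'] are placeholder names, not part of the question):

from django.db import transaction

with transaction.atomic():
    invoices = [Invoice(balance=row['balance']) for row in billing_rows]
    Invoice.objects.bulk_create(invoices)

    # On PostgreSQL (Django 1.10+) bulk_create fills in the new ids on the
    # objects; on other backends you would re-fetch the invoices here so
    # every object has a primary key before transactions reference it.

    transactions = [
        Transaction(invoice=invoice, amount=item['amount'])
        for invoice, row in zip(invoices, billing_rows)
        for item in row['items']
    ]
    Transaction.objects.bulk_create(transactions)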
You can do it with two bulk_create() queries, using the following method.
new_invoices = []
new_transactions = []

for loop:                               # outer loop: one Invoice per billing row
    invoice = Invoice(params)
    new_invoices.append(invoice)
    for loop:                           # inner loop: that invoice's transactions
        transaction = Transaction(params)
        transaction.invoice = invoice
        new_transactions.append(transaction)

Invoice.objects.bulk_create(new_invoices)

# bulk_create fills in the new primary keys on backends that return them
# (PostgreSQL), so invoice.id is available here.
for each in new_transactions:
    each.invoice_id = each.invoice.id

Transaction.objects.bulk_create(new_transactions)
Another way to do this is shown in the code snippet below:
from django.db import transaction

new_invoices = []
new_transactions = []

for sth in sth_else:
    ...
    invoice = Invoice(params)
    new_invoices.append(invoice)

with transaction.atomic():
    # Evaluate the queryset now (list()) so it records the ids that existed
    # before the bulk insert; the new invoices are then found by exclusion.
    other_invoice_ids = list(Invoice.objects.values_list('id', flat=True))
    Invoice.objects.bulk_create(new_invoices)
    new_invoice_ids = Invoice.objects.exclude(id__in=other_invoice_ids).values_list('id', flat=True)
    for invoice_id in new_invoice_ids:
        new_transactions.append(Transaction(params, invoice_id=invoice_id))
    Transaction.objects.bulk_create(new_transactions)
I wrote this answer based on this post on another question in the community.
I'm working with Django 1.10 and PostgreSQL 9.6.
I have two models in my project: Order and Customer. Also I'm using Django's auth.User to store login credentials.
Here is a code snippet:
from django.contrib.auth.models import User
from django.db import models

class Order(models.Model):
    created_by = models.ForeignKey(User, null=True, on_delete=models.SET_NULL, related_name='created_orders')
    # ... other fields ...

class Customer(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    # ... other fields ...
Now, I need to show a table of customers with the number of orders created by each of them.
The straightforward code is:
for customer in Customer.objects.all():
    print(customer.user.created_orders.count())
Now I need to avoid the N+1 problem and make Django fetch all the data with a constant number of queries.
I've tried to write something like:
query = Customer.objects.select_related('user').annotate(
    orders_count=Count('user.created_orders')
)
But this gives me an error like Cannot resolve keyword 'user.created_orders' into field.
Can you help me with this?
You should not use a dot (.) here, but two consecutive underscores (__):
query = Customer.objects.select_related('user').annotate(
    orders_count=Count('user__created_orders')
)
Note that you do not need .select_related('user') in order to annotate the queryset. If, however, you plan to use .user later in your logic, it can indeed boost performance.
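For completeness, a small usage sketch (Count comes from django.db.models; the print line is just an illustration):

from django.db.models import Count

query = Customer.objects.select_related('user').annotate(
    orders_count=Count('user__created_orders')
)

# One query fetches all customers with their users joined in, and
# orders_count is available on every row without extra queries.
for customer in query:
    print(customer.user.username, customer.orders_count)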
My problem is that Django inserts entries way too slowly (I didn't even time it, but it was more than 5 minutes) for 100k entries from a pandas CSV file. What I am doing is parsing the CSV file and then saving those objects to PostgreSQL in Django. It is going to be a daily cron job, with CSV files that differ for most of the entries (some can be duplicates from the previous days, or the owner could be the same).
I haven't tried raw queries, but I don't think that would help much.
I am really stuck at this point, honestly. Apart from some iteration manipulations and using a generator rather than an iterator, I cannot really improve the insertion time.
class TrendType(models.Model):
    """ Describes the report type (posts, video, owners) """
    TREND_TYPE = Choices('video', 'posts', 'owners')  # << mnemonic
    title = models.CharField(max_length=50)
    mnemonic = models.CharField(choices=TREND_TYPE, max_length=30)

class TrendSource(models.Model):
    """ Source of the report (file) """
    trend_type = models.ForeignKey(TrendType, on_delete=models.CASCADE)
    load_date = models.DateTimeField()
    filename = models.CharField(max_length=100)

class TrendOwner(models.Model):
    """ Owner of the data (group, user, etc.) """
    TREND_OWNERS = Choices('group', 'user', '_')
    title = models.CharField(max_length=50)
    mnemonic = models.CharField(choices=TREND_OWNERS, max_length=30)

class Owner(models.Model):
    """ Details about the owner """
    link = models.CharField(max_length=255)
    name = models.CharField(max_length=255)
    trend_type = models.ForeignKey(TrendType, on_delete=models.CASCADE)
    trend_owner = models.ForeignKey(TrendOwner, on_delete=models.CASCADE)

class TrendData(models.Model):
    """ Model that wraps all the data together """
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)
    views = models.IntegerField()
    views_u = models.IntegerField()
    likes = models.IntegerField()
    shares = models.IntegerField()
    interaction_rate = models.FloatField()
    mean_age = models.IntegerField()
    source = models.ForeignKey(TrendSource, on_delete=models.CASCADE)
    date_trend = models.DateTimeField()  # << take it as a date
Basically, I would love a good solution for 'fast' insertion into the database, and to know whether that is even possible given these models.
Maybe you don't need an ORM here? You could implement a simple wrapper around the typical SQL requests.
Use bulk reads/writes, e.g. bulk_create() in the Django ORM or in your wrapper (see the sketch after these points).
Check https://docs.djangoproject.com/en/2.2/topics/db/optimization/
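As a rough sketch of what the bulk write could look like, assuming the DataFrame columns map one-to-one onto the TrendData fields and that the owner and source objects are already resolved (df, owner and source are placeholders, not part of the question):

# Sketch only: build all rows in memory, then insert them in batches.
trend_rows = [
    TrendData(
        owner=owner,
        views=row['views'],
        views_u=row['views_u'],
        likes=row['likes'],
        shares=row['shares'],
        interaction_rate=row['interaction_rate'],
        mean_age=row['mean_age'],
        source=source,
        date_trend=row['date_trend'],
    )
    for _, row in df.iterrows()
]
# One INSERT per batch instead of one per row; tune batch_size as needed.
TrendData.objects.bulk_create(trend_rows, batch_size=1000)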
The problem is not with Django but rather with PostgreSQL itself. My suggestion would be to change your backend. PostgreSQL is good at updating data, but there are better databases for inserting data (PostgreSQL vs TimescaleDB). However, I don't think there is a Django ORM backend for TimescaleDB.
My suggestion would be to use Redis. Its primary use is as an in-memory cache, but you can make it persist your data too. There is also an ORM for Python with Redis, called ROM.
I'm trying to improve the performance of one of my Django applications to make it run just a bit smoother, as part of a first iteration in improving what I currently have running. When doing some profiling I've noticed that I have a very high number of SQL queries being executed on a couple of pages.
The dashboard page for instance easily has 250+ SQL queries being executed. Further investigation pointed me to the following piece of code in my views.py:
for project in projects:
    for historicaldata in project.historical_data_for_n_months_ago(i):
        for key in ('hours', 'expenses'):
            history_data[key] = history_data[key] + getattr(historicaldata, key)
Relevant function in models.py file:
def historical_data_for_n_months_ago(self, n=1):
    n_year, n_month = n_months_ago(n)
    try:
        return self.historicaldata_set.filter(year=n_year, month=n_month)
    except HistoricalData.DoesNotExist:
        return []
As you can see, this will cause a lot of queries to be executed for each project in the list. Originally it was set up this way to keep functionality central at the model level and to introduce convenience functions across the application.
What would be possible ways to reduce the number of queries executed when loading this page? I was thinking of either removing the convenience function and just working with select_related() in the view, but it would still need a lot of queries in order to filter out the records for a given year and month.
Thanks a lot in advance!
Edit: As requested, some more info on the related models.
Project
class Project(models.Model):
    name = models.CharField(max_length=200)
    status = models.IntegerField(choices=PROJECT_STATUS_CHOICES, default=1)
    last_updated = models.DateTimeField(default=datetime.datetime.now)
    total_hours = models.DecimalField(default=0, max_digits=10, decimal_places=2)
    total_expenses = models.DecimalField(default=0, max_digits=10, decimal_places=2)

    def __str__(self):
        return "{i.name}".format(i=self)

    def historical_data_for_n_months_ago(self, n=1):
        n_year, n_month = n_months_ago(n)
        try:
            return self.historicaldata_set.filter(year=n_year, month=n_month)
        except HistoricalData.DoesNotExist:
            return []
HistoricalData
class HistoricalData(models.Model):
    project = models.ForeignKey(Project, on_delete=models.CASCADE)
    person = models.ForeignKey(Person, on_delete=models.CASCADE)
    year = models.IntegerField()
    month = models.IntegerField()
    hours = models.DecimalField(max_digits=10, decimal_places=2, default=0)
    expenses = models.DecimalField(max_digits=10, decimal_places=2, default=0)

    def __str__(self):
        return "Historical data {i.month}/{i.year} for {i.person} ({i.project})".format(i=self)
I don't think looping through querysets is ever a good idea, so it would be better if you could find some other way. If you could elaborate on your view function and what exactly it is supposed to do, maybe I could help further.
If you want all the historical_data entries for a project (reverse related) you need to use prefetch_related. Since you want a specific portion of the historical data associated with said project you need to use it with Prefetch.
from django.db.models import Prefetch
Project.objects.prefetch_related(
    Prefetch(
        'historicaldata_set',
        queryset=HistoricalData.objects.filter(year=n_year, month=n_month)
    )
)
After that, you should loop through this dataset in your Django template (if you are using one). You can also pass it to a DRF serializer, and that would also get your work done :)
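For instance, with a to_attr the prefetched rows land directly on each project, and the view's loop no longer queries per project. This is only a sketch that reuses the n_year/n_month and history_data names from the question:

from django.db.models import Prefetch

projects = Project.objects.prefetch_related(
    Prefetch(
        'historicaldata_set',
        queryset=HistoricalData.objects.filter(year=n_year, month=n_month),
        to_attr='history_for_month',
    )
)

# Two queries in total, however many projects there are.
for project in projects:
    for historicaldata in project.history_for_month:
        for key in ('hours', 'expenses'):
            history_data[key] += getattr(historicaldata, key)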
I'm trying to delete all objects from a large queryset. Here is my models.py
from __future__ import unicode_literals

from django.contrib.auth.models import User
from django.db import models

class Fund(models.Model):
    name = models.CharField(max_length=255, blank=False, null=False)
    start_date = models.DateField(default=None, blank=False, null=False)

    def __unicode__(self):
        return self.name

class FundData(models.Model):
    fund = models.ForeignKey(Fund, on_delete=models.CASCADE)
    date = models.DateField(default=None, blank=False, null=False)
    value = models.FloatField(default=None, blank=True, null=True)

    def __unicode__(self):
        return "{} --- Data: {} --- Value: {} ".format(str(self.fund), str(self.date), str(self.value))
But when I try to delete all the records, the query takes too much time and MySQL times out.
Fund.objects.all().delete()
What is the best way to manage this operation inside a view?
Is there a way to do that by calling a Django command from the terminal?
First of all, you can change the MySQL timeout by editing settings.py:
DATABASES = {
    'default': {
        # ...
        'OPTIONS': {
            'connect_timeout': 5,  # your timeout time
        },
        # ...
    }
}
The reasons why .delete() may be slow are:
Django has to ensure cascade deletion functions properly, so it has to look for foreign keys to your models.
Django has to somehow handle the pre- and post-delete signals for your models.
If you are sure that your models don't have cascade deletion or any signals to be handled, you can try to use the private _raw_delete as follows:
queryset._raw_delete(queryset.db)
You can find more details on it here
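In this particular case that could look roughly like the sketch below: clear the child FundData rows first so the Fund delete no longer has to cascade. This is only an option if, as said above, no signals or other cascades need to run:

# _raw_delete is a private API, so test this carefully before relying on it.
fund_data = FundData.objects.all()
fund_data._raw_delete(fund_data.db)

funds = Fund.objects.all()
funds._raw_delete(funds.db)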
Finally I arrived at a simple solution for deleting all the cascading objects: break the large deletion down and delete the Fund objects one by one. Because it is a really long operation (approximately 1 second per object, for thousands of objects) I assigned it to a Celery worker. Slow but safe, I think; if someone has a better solution, please let me know!
@shared_task
def reset_funds():
    for fund in Fund.objects.all():
        print "Delete Fund: {}".format(fund.name)
        fund.delete()
    return "All Funds Deleted!"
I have a Django website with activities. When checking for optimisation opportunities with the django-toolbar, I discovered that the view for an activity's subscription list was really inefficient. It made five database requests per subscription, just to check whether the user is a member.
My models are structured as follows:
class Subscription(models.Model):
    user = models.ForeignKey(User, null=True)
    activity = models.ForeignKey(Activity)

class MemberProfile(models.Model):
    user = models.ForeignKey(User)
    member_in = models.ManyToManyField(WorkYear, blank=True, null=True)

class WorkYear(models.Model):
    year = models.SmallIntegerField(unique=True)
    current = models.BooleanField(default=False, blank=True)
Now, to check if the subscribed user is a member, we must check whether there is a MemberProfile referring to it, with a WorkYear in its member_in field that has current set to True.
I had a property in Subscription called is_member which returned this information. In the template this property was called for every subscription, resulting in a massive amount of database requests.
Instead of doing this, I would like to add a custom field to the QuerySet created in the view.
I've experimented with the extra() function:
subscriptions = activity.subscription_set.extra(
    select={
        'is_member': 'SELECT current FROM activities_subscription LEFT OUTER JOIN (auth_user LEFT OUTER JOIN (members_memberprofile LEFT OUTER JOIN (members_memberprofile_member_in LEFT OUTER JOIN site_main_workyear ON members_memberprofile_member_in.workyear_id = site_main_workyear.id AND site_main_workyear.current = 1) ON members_memberprofile.id = members_memberprofile_member_in.memberprofile_id) ON auth_user.id = members_memberprofile.user_id) ON activities_subscription.user_id = auth_user.id'
    },
    tables=['site_main_workyear', 'members_memberprofile_member_in', 'members_memberprofile', 'auth_user']
).order_by('id')
This is really complex and for some reason it doesn't work. After reloading the page, Python takes 100% CPU and no response is given.
Is there a better and more simple way for doing this? And if not, what am I doing wrong?