I'm populating my database from an API that provides year-to-date stats, and I'll be pulling from this API multiple times a day. Using the year-to-date stats, I need to generate monthly and weekly stats. I'm currently trying to do this by subtracting the stats at the start of the month from the stats at the end of the month and saving it in a separate model, but the process is taking far too long and I need it to go faster.
My models look something like this:
class Stats(models.Model):
    date = models.DateField(default=timezone.now)  # Date pulled from API
    customer_id = models.IntegerField(default=0)  # Simplified for this example
    a = models.IntegerField(default=0)
    b = models.IntegerField(default=0)
    c = models.IntegerField(default=0)
    d = models.IntegerField(default=0)
class Leaderboard(models.Model):
    duration = models.CharField(max_length=7, default="YEARLY")  # "MONTHLY", "WEEKLY"
    customer_id = models.IntegerField(default=0)
    start_stats = models.ForeignKey(Stats, related_name="start_stats")  # Stats from the start of the Year/Month/Week
    end_stats = models.ForeignKey(Stats, related_name="end_stats")  # Stats from the end of the Year/Month/Week
    needs_update = models.BooleanField(default=False)  # set to True only if the end_stats changed (likely when a new pull happened)
    a = models.IntegerField(default=0)
    b = models.IntegerField(default=0)
    c = models.IntegerField(default=0)
    d = models.IntegerField(default=0)
    e = models.IntegerField(default=0)  # A value computed based on a-d, used to sort Leaderboards
I thought I was going to be home free using Leaderboard.objects.filter(needs_update=True).update(a=F("end_stats__a")-F("start_stats__a"), ...), but that gave me an error "Joined field references are not permitted in this query".
I'm currently iterating over the QuerySet Leaderboard.objects.filter(needs_update=True), doing the subtraction operations, and saving (all with @transaction.atomic), but ~380,000 test records processed this way took just over an hour, so I suspect that this way is going to be too slow for what I need.
I'm OK with changing how I store the data if a different format would help this Leaderboard update go faster (maybe do the subtraction when pulling in the data and saving daily deltas instead?), but I feel like I keep rushing towards whatever comes to mind without any idea of what I should be doing in this situation. Any feedback at this point would be very much appreciated.
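The daily-delta idea mentioned above can be sketched in plain Python, independent of the ORM. This is only an illustration, assuming the snapshots for one customer are already sorted by date; `daily_deltas` and `period_total` are hypothetical helper names, not part of the models above.

```python
from datetime import date

def daily_deltas(snapshots):
    """Turn sorted (date, ytd_value) snapshots into (date, delta) pairs.

    The first snapshot has no predecessor, so it produces no delta.
    """
    return [
        (cur_date, cur_value - prev_value)
        for (_, prev_value), (cur_date, cur_value) in zip(snapshots, snapshots[1:])
    ]

def period_total(deltas, start, end):
    """Sum the deltas dated after `start`, up to and including `end`."""
    return sum(delta for day, delta in deltas if start < day <= end)

# Made-up year-to-date values for a single stat of a single customer.
snapshots = [
    (date(2018, 1, 1), 10),
    (date(2018, 1, 2), 14),
    (date(2018, 1, 3), 14),
    (date(2018, 1, 4), 21),
]
deltas = daily_deltas(snapshots)
total = period_total(deltas, date(2018, 1, 1), date(2018, 1, 4))
```

Summing the deltas over a period gives the same number as the end snapshot minus the start snapshot, so with this layout a weekly or monthly figure becomes a sum over a date range rather than a lookup of two specific rows.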
After a lot of tinkering, I think I've got a method that will work for this situation. My test sample is smaller than before (84,600 records), but it completes in 8 seconds, which is about 10,575 records per second (compared to the roughly 6,300 records per minute, or about 105 per second, of my earlier tests).
There's probably a way to refine this even more, but here's what I'm doing:
from django.db.models import F, Subquery, OuterRef
# Get the latest versions of the stats
Leaderboard.objects.filter(needs_update=True).update(
    a=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('a')[:1]),
    b=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('b')[:1]),
    c=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('c')[:1]),
    d=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('d')[:1])
)
# Subtract the earliest versions of the stats
Leaderboard.objects.filter(needs_update=True).update(
    a=F('a') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('a')[:1]),
    b=F('b') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('b')[:1]),
    c=F('c') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('c')[:1]),
    d=F('d') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('d')[:1])
)
# Calculate stats that require earlier stats.
Leaderboard.objects.filter(needs_update=True).update(
    e=F('a') + F('b') * F('c') / F('d'),
    needs_update=False
)
I feel like there should be a way to only use one Subquery per update, which should improve the speed even more.
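One possible way to collapse the first two updates is to subtract the start-of-period subquery from the end-of-period subquery inside a single statement (in ORM terms, assigning a `Subquery(...) - Subquery(...)` expression to each field). That still uses two subqueries per column, but only one pass over the table. The sketch below shows the SQL shape of that idea with the standard-library sqlite3 module. It has not been benchmarked, and the table and column names (`stats`, `leaderboard`, `start_stats_id`, `end_stats_id`) are assumptions modeled on the models above, not the tables Django actually creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (
        id INTEGER PRIMARY KEY, a INTEGER, b INTEGER, c INTEGER, d INTEGER
    );
    CREATE TABLE leaderboard (
        id INTEGER PRIMARY KEY,
        start_stats_id INTEGER REFERENCES stats(id),
        end_stats_id INTEGER REFERENCES stats(id),
        needs_update INTEGER DEFAULT 0,
        a INTEGER DEFAULT 0, b INTEGER DEFAULT 0,
        c INTEGER DEFAULT 0, d INTEGER DEFAULT 0, e INTEGER DEFAULT 0
    );
    -- Made-up sample data: one start snapshot, one end snapshot.
    INSERT INTO stats VALUES (1, 10, 2, 3, 1), (2, 25, 6, 9, 3);
    INSERT INTO leaderboard (id, start_stats_id, end_stats_id, needs_update)
    VALUES (1, 1, 2, 1);
""")

# Build "col = (end value) - (start value)" for each stat column.
columns = ["a", "b", "c", "d"]
assignments = ", ".join(
    f"{col} = (SELECT stats.{col} FROM stats WHERE stats.id = leaderboard.end_stats_id)"
    f" - (SELECT stats.{col} FROM stats WHERE stats.id = leaderboard.start_stats_id)"
    for col in columns
)
conn.execute(f"UPDATE leaderboard SET {assignments} WHERE needs_update = 1")

# e depends on the freshly written a-d, so it stays in a second statement.
conn.execute(
    "UPDATE leaderboard SET e = a + b * c / d, needs_update = 0 "
    "WHERE needs_update = 1"
)

row = conn.execute(
    "SELECT a, b, c, d, e, needs_update FROM leaderboard WHERE id = 1"
).fetchone()
```

The computation of e is kept separate because databases differ on whether a SET clause sees the old or the newly assigned column values within the same UPDATE.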
I want to calculate the average delivery time (in days) of products using a single ORM query (I have 10,000+ records in the database and don't want to loop over them). Here are the models I have:
class Product(models.Model):
    name = models.CharField(max_length=10)
class ProductEvents(models.Model):
    class Status(models.TextChoices):
        IN_TRANSIT = ("in_transit", "In Transit")
        DELIVERED = ("delivered", "Delivered")
    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    status = models.CharField(max_length=255, choices=Status.choices)
    created = models.DateTimeField(blank=True)
Calculating the delivery time for one product looks like this:
product = Product.objects.first()
# delivered_date - in_transit_date = days_taken
duration = product.productevents_set.get(status='delivered').created - product.productevents_set.get(status='in_transit').created
I'd like help getting started so that I can calculate the average delivery time across all Products. I'd prefer it to be done in a single query for performance reasons.
A basic solution is to annotate each Product with the minimum created time of its related events that have the "in_transit" status and the maximum created time of its events with the "delivered" status, then annotate the difference and aggregate the average of the differences:
from django.db.models import Min, Max, Q, F, Avg
Product.objects.annotate(
    start=Min('productevents__created', filter=Q(productevents__status=ProductEvents.Status.IN_TRANSIT)),
    end=Max('productevents__created', filter=Q(productevents__status=ProductEvents.Status.DELIVERED))
).annotate(
    diff=F('end') - F('start')
).aggregate(
    Avg('diff')
)
Returns a dictionary that should look like
{'diff__avg': datetime.timedelta(days=x, seconds=x)}
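Since the question asks for the average in days, the timedelta in that dictionary can be converted afterwards. A minimal sketch, using a made-up value in place of the real query output:

```python
from datetime import timedelta

# Stand-in for the dictionary returned by .aggregate(Avg('diff'));
# the value here is made up for illustration.
result = {'diff__avg': timedelta(days=2, seconds=43200)}

# total_seconds() already includes the days component,
# so dividing by the number of seconds in a day gives fractional days.
average_days = result['diff__avg'].total_seconds() / 86400
```

Using `.days` alone would drop the fractional part, which may or may not be what is wanted.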
I have three Django models:
class Review(models.Model):
    rating = models.FloatField()
    reviewer = models.ForeignKey('Reviewer')
    movie = models.ForeignKey('Movie')
class Movie(models.Model):
    release_date = models.DateTimeField(auto_now_add=True)
class Reviewer(models.Model):
    ...
I would like to write a query that returns the following for each reviewer:
- The reviewer's id
- Their average rating for the 5 most recently released movies
- Their average rating for the 10 most recently released movies
- The release date for the most recent movie they rated a 3 (out of 5) or lower
The result would be formatted:
<Queryset [{'id': 1, 'average_5': 4.7, 'average_10': 4.3, 'most_recent_bad_review': '2018-07-27'}, ...]>
I'm familiar with using .annotate(Avg(...)), but I can't figure out how to write a query that averages just a subset of the potential values. Similarly, I'm lost on how to annotate a query with the most recent rating of 3 or lower.
All of those amount to if statements in Python code, or CASE WHEN expressions in a SQL-like database, so you can use Django's built-in Case and When expressions. In your case you'd probably combine them with Avg, with a separate annotation for each condition, so your queryset would look roughly like:
Model.objects.annotate(
    average_5=Avg(Case(When(then=...), When(then=...))),
    average_10=Avg(Case(When(then=...), When(then=...))),
)
with appropriate conditions inside each When and appropriate then values.
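To pin down what such a query is expected to return, it can help to have a plain-Python reference to compare against on a small data set. The sketch below is not an ORM query. It assumes that "the N most recently released movies" means the N most recent among those the reviewer actually rated, and `reviewer_summary` is a hypothetical helper name.

```python
from datetime import date
from statistics import mean

def reviewer_summary(reviews):
    """Summarize one reviewer's reviews.

    `reviews` is a list of (rating, release_date) tuples.
    """
    # Newest releases first.
    newest_first = sorted(reviews, key=lambda review: review[1], reverse=True)
    ratings = [rating for rating, _ in newest_first]
    bad_dates = [released for rating, released in newest_first if rating <= 3]
    return {
        'average_5': mean(ratings[:5]) if ratings else None,
        'average_10': mean(ratings[:10]) if ratings else None,
        'most_recent_bad_review': bad_dates[0] if bad_dates else None,
    }

# Made-up reviews for a single reviewer.
reviews = [
    (5.0, date(2018, 7, 1)),
    (2.0, date(2018, 6, 1)),
    (4.0, date(2018, 5, 1)),
    (3.0, date(2018, 4, 1)),
    (5.0, date(2018, 3, 1)),
    (1.0, date(2018, 2, 1)),
]
summary = reviewer_summary(reviews)
```

Running the eventual annotated queryset and this function over the same handful of rows is a quick way to check that the When conditions select the intended subset.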
I have a form in my application with several fields. I need to subtract the value of one field from another and store the result in a third field in the database on the fly.
For example, I have two fields:
1. Cost of PR Basic & 2. Cost of PO Basic
I need to calculate Delta: Cost of PO Basic - Cost of PR Basic.
Delta is also a field in the database table.
In models.py I have:
class PR_Data(models.Model):
    Cost_PR_Basic_INR = models.DecimalField(max_digits=19,decimal_places=2)
    Cost_Of_PO_Basic_INR = models.DecimalField(max_digits=19,decimal_places=2)
    Delta = models.DecimalField(max_digits=19,decimal_places=2, editable=True)
So how can I calculate Delta from the values entered in the other two fields and store the result in the Delta field?
Thanks in advance..!!
In your views.py, do something like this:
pr_data = PR_Data()
pr_data.Cost_PR_Basic_INR = <value1>
pr_data.Cost_Of_PO_Basic_INR = <value2>
pr_data.Delta = <value2> - <value1>  # Cost of PO Basic - Cost of PR Basic
pr_data.save()
just an idea how it should work, not the complete code ;)
In a previous question I was asking about how to do a complex query in Django. Here is my example model:
class Foo(models.Model):
    name = models.CharField(max_length=50)
    type = models.CharField(max_length=100, blank=True)
    foo_value = models.CharField(max_length=14, blank=True)
    time_event = models.DateTimeField(blank=True)
    # ... and many many other fields
In my previous question, @BearBrown answered by using the When .. then expression to control my query.
Now I need something more. I need to calculate the mode (most repeated value) of the quarter of the month in the time_event field. Manually, I do it like this:
- I manually iterate over all records for the same user.
- Get the day using q['time_event'].day
- Define quarters using quarts = range(1, 31, 7)
- Then, append the calculated quarters to a list month_quarts.append(quarter if quarter <= 4 else 4)
- Then get the mode value for this specific user qm = mode(month_quarts)
Is there a way to automate this mode calculation function in the When .. then expression instead of manually iterating through all records for every user and calculating it?
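For reference, the manual steps listed above can be condensed into a small plain-Python function. This is only a sketch of the per-user calculation, not a database-side solution; `month_quarter` and `quarter_mode` are hypothetical names, and days 29 to 31 are folded into quarter 4 as in the list above.

```python
from datetime import datetime
from statistics import mode

def month_quarter(day):
    """Map a day of the month (1-31) to a 7-day quarter, capped at 4."""
    return min((day - 1) // 7 + 1, 4)

def quarter_mode(events):
    """Return the most common month quarter among one user's datetimes."""
    return mode([month_quarter(event.day) for event in events])

# Made-up time_event values for a single user.
events = [
    datetime(2018, 1, 3),
    datetime(2018, 1, 9),
    datetime(2018, 2, 10),
    datetime(2018, 3, 12),
    datetime(2018, 3, 30),
]
result = quarter_mode(events)
```

The arithmetic in `month_quarter` is simple enough that it could in principle be rewritten as a database expression on the day of the month, which is what a Case/When version would need to reproduce.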
I have the following four models:
Measurement:
class Measurement(models.Model):
    config = models.ForeignKey(MeasurementConfig)
    energy = models.ForeignKey(Energy)
    dose = models.DecimalField(max_digits=20,decimal_places=9, blank=True, null=True)
MeasurementConfig:
class MeasurementConfig(models.Model):
    date = models.DateTimeField(auto_now_add=True)
    linac = models.ForeignKey(Linac)
Linac:
class Linac(models.Model):
    name = models.CharField(max_length=10)
    genre = models.ForeignKey(Type)
    energies = models.ManyToManyField(Energy)
Energy:
class Energy(models.Model):
    value = models.PositiveIntegerField()
    category = models.ForeignKey(EnergyCategory)
Now I want to get all the dose measurements from the Measurement model for a specific Linac and Energy.
I have used the following code to get these:
linac = get_object_or_404(Linac, name=linacname)
# Get all the energies for the specified linac
measurementconfigs = MeasurementConfig.objects.filter(linac=linac)
identifications = []
for measurementconfig in measurementconfigs:
    identifications.append(measurementconfig.identification)
config = get_object_or_404(MeasurementConfig, identification=identifications[0])
energies = config.linac.energies.all()
# Get the measurements
energymeasurements = []
for energy in energies:
    measurements = Measurement.objects.filter(config__linac=linac).filter(config__linac__energies__exact=energy)
    energymeasurements.append(measurements)
What I expect energymeasurements to look like is like this:
energymeasurements = [(Measurements_**Energy1**), (Measurements_**Energy2**), (Measurements_**EnergyN**)]
Where N is the amount of energies
But what I get is this:
energymeasurements = [(Measurements_**ALLEnergies1**), (Measurements_**ALLEnergies2**), (Measurements_**ALLEnergiesN**)]
Where N is the amount of energies.
So I expect that the query I make gets all the measurements for the specific energy specified by the loop. But instead I get the measurements for all the energies.
I've already tried it without the loop by using only the id of a specific energy, the same problem occurs.
I know the problem is in the query, but I can't find out what it is.
Your objective:
Now I want to get all the dose measurements from the Measurement model for a specific Linac and Energy.
My approach:
- Get the Linac record.
- Use the many-to-many link table to get the appropriate Energy record.
- Get all MeasurementConfig records linked to the Linac record.
- For each MeasurementConfig record, get all Measurement records; use the Energy record to filter this record set.
In the code you posted, it looks like you're trying to filter the Measurements record set on a link that doesn't exist, according to the models you specified.
# Get the measurements
energymeasurements = []
for energy in energies:
    measurements = Measurement.objects.filter(config__linac=linac).filter(config__linac__energies__exact=energy)
    energymeasurements.append(measurements)
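A Measurement isn't linked directly to a Linac, and config__linac__energies follows the Linac's own set of energies rather than the Measurement's energy field, so I think this is why you're not getting the result you expect.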
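Going by the models in the question, Measurement has its own energy foreign key, so one option is to filter on that field directly, for example Measurement.objects.filter(config__linac=linac, energy=energy). The difference between the two filters can be illustrated without a database; the dataclasses below are simplified stand-ins for the models, not Django code, and the sample values are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Linac:
    name: str
    energies: tuple  # energy values the linac supports (the many-to-many set)

@dataclass(frozen=True)
class Measurement:
    linac: Linac   # reached via config.linac in the real models
    energy: int    # the measurement's own energy
    dose: float

linac = Linac("L1", energies=(6, 10))
measurements = [
    Measurement(linac, 6, 1.01),
    Measurement(linac, 6, 0.99),
    Measurement(linac, 10, 1.02),
]

# Like config__linac__energies=energy: asks whether the linac supports the
# energy, which is true for every measurement taken on that linac.
by_linac_energies = {
    e: [m for m in measurements if e in m.linac.energies]
    for e in linac.energies
}

# Like energy=energy: asks whether this measurement was taken at the energy.
by_own_energy = {
    e: [m for m in measurements if m.energy == e]
    for e in linac.energies
}
```

The first grouping returns every measurement for every energy, which matches the behaviour described in the question; the second returns one distinct group per energy.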