I want to calculate the average delivery time (in days) of products using a single ORM query (I have 10,000+ records in the database and don't want to iterate over them in a loop). Here is an example of my models file:
class Product(models.Model):
    name = models.CharField(max_length=10)

class ProductEvents(models.Model):
    class Status(models.TextChoices):
        IN_TRANSIT = ("in_transit", "In Transit")
        DELIVERED = ("delivered", "Delivered")

    product = models.ForeignKey(Product, on_delete=models.CASCADE)
    status = models.CharField(max_length=255, choices=Status.choices)
    created = models.DateTimeField(blank=True)
The delivery time for a single product can be calculated like this:
product = Product.objects.first()
# delivered_date - in_transit_date = days_taken
# The default reverse accessor for the ProductEvents model is productevents_set
duration = product.productevents_set.get(status='delivered').created - product.productevents_set.get(status='in_transit').created
I'd like help getting started so that I can calculate the average delivery time across all Products. I'd prefer to do it in a single query for performance reasons.
A basic solution is to annotate each Product with the minimum created time of its related events that have the "in_transit" status and the maximum created time of its events that have the "delivered" status, then annotate the difference and aggregate the average of the differences:
from django.db.models import Min, Max, Q, F, Avg

Product.objects.annotate(
    start=Min('productevents__created', filter=Q(productevents__status=ProductEvents.Status.IN_TRANSIT)),
    end=Max('productevents__created', filter=Q(productevents__status=ProductEvents.Status.DELIVERED))
).annotate(
    diff=F('end') - F('start')
).aggregate(
    Avg('diff')
)
This returns a dictionary that should look like:
{'diff__avg': datetime.timedelta(days=x, seconds=x)}
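To make the logic of the query easier to check, here is a minimal plain-Python sketch of the same computation. It is not Django code: the `events` tuples and the `average_delivery_time` helper are hypothetical stand-ins for the ProductEvents rows. It assumes that products without both events are left out of the average, which is how SQL's AVG treats NULL differences.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows: (product_id, status, created)
events = [
    (1, "in_transit", datetime(2024, 1, 1)),
    (1, "delivered", datetime(2024, 1, 4)),
    (2, "in_transit", datetime(2024, 1, 2)),
    (2, "delivered", datetime(2024, 1, 3)),
    (3, "in_transit", datetime(2024, 1, 5)),  # not delivered yet
]

def average_delivery_time(rows):
    start = {}  # earliest in_transit time per product (the Min annotation)
    end = {}    # latest delivered time per product (the Max annotation)
    for product_id, status, created in rows:
        if status == "in_transit":
            start[product_id] = min(created, start.get(product_id, created))
        elif status == "delivered":
            end[product_id] = max(created, end.get(product_id, created))
    # Products missing either event are skipped, like NULL diffs in AVG
    diffs = [end[p] - start[p] for p in start if p in end]
    return sum(diffs, timedelta()) / len(diffs) if diffs else None

avg = average_delivery_time(events)
print(avg)
```

Product 1 takes three days and product 2 takes one day, so the average here is two days; product 3 does not contribute because it has no "delivered" event.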
Related
I have a Django application that stores hourly price and volume (OHLCV candles) for several markets. What I'm trying to achieve is to compare the latest volume of all markets and set top10 = True on the 10 markets with the highest volume in the latest candle. What is the most efficient way to do that?
EDIT: The queryset should select the most recent candle in every market and sort those candles by volume, then return the 10 markets that the top 10 candles belong to.
models.py
class Market(models.Model):
    top10 = JSONField(null=True)

class Candle(models.Model):
    market = models.ForeignKey(Market, on_delete=models.CASCADE, related_name='candle', null=True)
    price = models.FloatField(null=True)
    volume = models.FloatField(null=True)
    dt = models.DateTimeField()
Finally, I've figured out the solution, I guess.
latest_distinct = Candle.objects.order_by('market__pk', '-dt').distinct('market__pk')
candle_top = Candle.objects.filter(id__in=latest_distinct).order_by('-volume')[:10]
for item in candle_top:
    item.market.top10 = True
    item.market.save()
latest_distinct = Candle.objects.order_by('market__pk', '-dt').distinct('market__pk') will select the latest candle record for every Market. Note that passing field names to distinct() is only supported on PostgreSQL.
candle_top = Candle.objects.filter(id__in=latest_distinct).order_by('-volume')[:10] will sort items in previous query in descending order and slice 10 greatest ones.
Then you iterate over it setting each market.top10 to True
Notice that I'm assuming that Market's top10 field is a boolean. You can substitute your own logic instead of item.market.top10 = True
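The two-step query above can be pictured in plain Python as "keep the newest candle per market, then rank those by volume". The sketch below is only an illustration of that logic on in-memory tuples; the `candles` data and the `top_markets` helper are hypothetical and not part of the original answer.

```python
from datetime import datetime

# Hypothetical rows: (candle_id, market_id, volume, dt)
candles = [
    (1, "A", 5.0, datetime(2024, 1, 1, 10)),
    (2, "A", 9.0, datetime(2024, 1, 1, 11)),
    (3, "B", 7.0, datetime(2024, 1, 1, 11)),
    (4, "C", 20.0, datetime(2024, 1, 1, 10)),
    (5, "C", 1.0, datetime(2024, 1, 1, 11)),
]

def top_markets(rows, n=10):
    # Step 1: keep only the most recent candle of each market
    latest = {}
    for row in rows:
        market_id, dt = row[1], row[3]
        if market_id not in latest or dt > latest[market_id][3]:
            latest[market_id] = row
    # Step 2: rank those candles by volume, highest first, and keep n
    ranked = sorted(latest.values(), key=lambda r: r[2], reverse=True)
    return [r[1] for r in ranked[:n]]

top2 = top_markets(candles, n=2)
print(top2)
```

Market C has the single largest candle (20.0), but its latest candle has a volume of only 1.0, so it does not make the top two. That is the reason for selecting the latest candle first and sorting second.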
I have found a solution to my own question: select the last candle in every market by its primary key and build a list of lists whose elements are [volume, pk]. Then I sort the nested lists by element 0 (the volume) and select the top 10. It returns a list of the desired markets:
import operator
v = [[m.candle.first().volume, m.candle.first().id] for m in Market.objects.all()]
top = sorted(v, key=operator.itemgetter(0))[-10:]
[Candle.objects.get(id=i).market for i in [t[1] for t in top]]
I want to apply a condition to my Sum function inside annotate. I tried to use Case/When, but it didn't work in my case.
This is my models.py:
class Product(models.Model):
    name = models.CharField(max_length=20)
    cost = models.IntegerField()
    price = models.IntegerField()

class MyModel(models.Model):
    name = models.ForeignKey(Product, on_delete=models.CASCADE)
    order = models.IntegerField()
    price = models.IntegerField()
I want to do something like this:
total = F('price') * F('order')
base = (F('name__cost') + F('name__price')) * F('order')
if total > base:
    income = Sum(F('total') - F('base'))
I tried this:
MyModel.objects.values('name__name').annotate(
    total=Sum(F('price') * F('order'), output_field=IntegerField()),
    base=Sum((F('name__price') + F('name__cost')) * F('order'), output_field=IntegerField()),
    income=Sum(
        Case(
            When(total__gt=F('base'), then=Sum(F('total') - F('base'))),
            default=0,
        ),
        output_field=IntegerField(),
    ),
)
But this raises the following error:
Cannot compute Sum('<CombinedExpression: F(total) - F(base)>'): '<CombinedExpression: F(total) - F(base)>' is an aggregate
I don't want to use .filter(income__gt=0), because it stops the quantity from being counted, and I don't want to count income for products that were sold at a loss.
For example, I create a MyModel(name=mouse, order=2, price=20), and in my Product model I have Product(name=mouse, cost=4, price=10). The income for this product is (2 * 20) - ((4 + 10) * 2) => 40 - 28 = 12. But sometimes the result is negative, for example with a sale price of 10: (2 * 10) - ((4 + 10) * 2) => 20 - 28 = -8.
* I use MySQL 8 as the database.
I want to prevent negative numbers from being added to my income while still counting the quantity in the other columns.
The problem is that you cannot use an aggregate (total and base) inside yet another aggregate in the same query. There is only one GROUP BY clause, and Django cannot automatically produce a valid query here. As far as I understand, you first need to calculate total and base for each row, derive each MyModel's income from them, and only then aggregate:
MyModel.objects.annotate(
    total=F('price') * F('order'),
    base=(F('name__price') + F('name__cost')) * F('order'),
    income=Case(
        When(total__gt=F('base'), then=F('total') - F('base')),
        default=0,
        output_field=IntegerField()
    )
).values('name__name').annotate(income=Sum('income'))
P.S. Please, format your code so people can read it without difficulties :)
P.P.S I can probably see another way, you don't need Sum() for the income because total and base are sums already
MyModel.objects.values('name__name').annotate(
    total=Sum(F('price') * F('order')),
    base=Sum((F('name__price') + F('name__cost')) * F('order')),
).annotate(
    income=Case(
        When(total__gt=F('base'), then=F('total') - F('base')),
        default=0,
        output_field=IntegerField()
    )
)
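The two queries above do not necessarily return the same numbers: the first clamps each row's income at zero before summing, while the second sums total and base per product and clamps only the result. The plain-Python sketch below illustrates the difference using the mouse example from the question (cost 4, price 10, two sales of 2 units at 20 and at 10). The `rows` tuples and both helper functions are hypothetical, not ORM code.

```python
# Hypothetical rows for one product:
# (order, sale_price, product_cost, product_price)
rows = [
    (2, 20, 4, 10),  # total 40, base 28, margin +12
    (2, 10, 4, 10),  # total 20, base 28, margin -8
]

def income_clamped_per_row(rows):
    # Clamp each row at zero, then sum (the first query's approach)
    return sum(max(o * p - (c + pp) * o, 0) for o, p, c, pp in rows)

def income_clamped_per_group(rows):
    # Sum first, then clamp the group's result at zero (the second query's approach)
    total = sum(o * p for o, p, c, pp in rows)
    base = sum((c + pp) * o for o, p, c, pp in rows)
    return max(total - base, 0)

per_row = income_clamped_per_row(rows)
per_group = income_clamped_per_group(rows)
print(per_row, per_group)
```

With these rows the per-row version gives 12 (the loss of 8 is ignored), while the per-group version gives 4 (the loss offsets the profit). Since the question asks that loss-making sales not reduce the income, the per-row form seems closer to what was requested, though that is a reading of the question rather than something it states outright.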
Try this (it may need some tweaking); the idea is to use Conditional Expressions:
from django.db.models import Case, F, IntegerField, Sum, When

MyModel.objects.values('name__name').annotate(
    total=F('price') * F('order'),
    base=(F('name__cost') + F('name__price')) * F('order'),
).annotate(
    income=Case(
        When(total__gt=F('base'), then=Sum(F('total') - F('base'))),
        default=F('total'),
        output_field=IntegerField(),
    )
)
I'm populating my database from an API that provides year-to-date stats, and I'll be pulling from this API multiple times a day. Using the year-to-date stats, I need to generate monthly and weekly stats. I'm currently trying to do this by subtracting the stats at the start of the month from the stats at the end of the month and saving it in a separate model, but the process is taking far too long and I need it to go faster.
My models look something like this :
class Stats(models.Model):
    date = models.DateField(default=timezone.now)  # Date pulled from API
    customer_id = models.IntegerField(default=0)  # Simplified for this example
    a = models.IntegerField(default=0)
    b = models.IntegerField(default=0)
    c = models.IntegerField(default=0)
    d = models.IntegerField(default=0)

class Leaderboard(models.Model):
    duration = models.CharField(max_length=7, default="YEARLY")  # "MONTHLY", "WEEKLY"
    customer_id = models.IntegerField(default=0)
    start_stats = models.ForeignKey(Stats, related_name="start_stats")  # Stats from the start of the Year/Month/Week
    end_stats = models.ForeignKey(Stats, related_name="end_stats")  # Stats from the end of the Year/Month/Week
    needs_update = models.BooleanField(default=False)  # set to True only if the end_stats changed (likely when a new pull happened)
    a = models.IntegerField(default=0)
    b = models.IntegerField(default=0)
    c = models.IntegerField(default=0)
    d = models.IntegerField(default=0)
    e = models.IntegerField(default=0)  # A value computed based on a-d, used to sort Leaderboards
I thought I was going to be home free using Leaderboard.objects.filter(needs_update=True).update(a=F("end_stats__a")-F("start_stats__a"), ...), but that gave me an error "Joined field references are not permitted in this query".
I'm currently iterating over the QuerySet Leaderboard.objects.filter(needs_update=True), doing the subtraction operations, and saving (all with @transaction.atomic), but ~380,000 test records processed this way took just over an hour, so I suspect that this approach is going to be too slow for what I need.
I'm OK with changing how I store the data if a different format would help this Leaderboard update go faster (maybe do the subtraction when pulling in the data and saving daily deltas instead?), but I feel like I keep rushing towards whatever comes to mind without any idea of what I should be doing in this situation. Any feedback at this point would be very much appreciated.
After a lot of tinkering, I think I've got a method that will work for this situation. My test sample is smaller than before (84,600 records), but it completes in 8 seconds, which is about 10,575 records per second (compared to roughly 6,300 records per minute, or about 105 per second, in my earlier tests).
There's probably a way to refine this even more, but here's what I'm doing:
from django.db.models import F, Subquery, OuterRef

# Get the latest versions of the stats
Leaderboard.objects.filter(needs_update=True).update(
    a=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('a')[:1]),
    b=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('b')[:1]),
    c=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('c')[:1]),
    d=Subquery(Stats.objects.filter(pk=OuterRef('end_stats')).values('d')[:1])
)

# Subtract the earliest versions of the stats
Leaderboard.objects.filter(needs_update=True).update(
    a=F('a') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('a')[:1]),
    b=F('b') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('b')[:1]),
    c=F('c') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('c')[:1]),
    d=F('d') - Subquery(Stats.objects.filter(pk=OuterRef('start_stats')).values('d')[:1])
)

# Calculate stats that require earlier stats.
Leaderboard.objects.filter(needs_update=True).update(
    e=F('a') + F('b') * F('c') / F('d'),
    needs_update=False
)
I feel like there should be a way to only use one Subquery per update, which should improve the speed even more.
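One possible refinement, sketched here at the SQL level with Python's built-in sqlite3 module rather than the ORM, is to compute "end minus start" in a single UPDATE so the table is only written once for the a-d columns. The table and column names below (`stats`, `leaderboard`, `start_stats_id`, `end_stats_id`) are assumptions modelled on the Django models above, and only columns a and b are shown. This does not reduce the number of subqueries; it only merges the first two update passes. Whether the ORM equivalent (subtracting one Subquery from another inside update()) performs better on a given database is untested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER);
    CREATE TABLE leaderboard (
        id INTEGER PRIMARY KEY,
        start_stats_id INTEGER, end_stats_id INTEGER,
        needs_update INTEGER, a INTEGER DEFAULT 0, b INTEGER DEFAULT 0
    );
    INSERT INTO stats VALUES (1, 10, 100), (2, 25, 160);
    INSERT INTO leaderboard (id, start_stats_id, end_stats_id, needs_update)
        VALUES (1, 1, 2, 1);
""")

# One pass: each column is set to (end value - start value) via
# two correlated subqueries against the stats table.
conn.execute("""
    UPDATE leaderboard
    SET a = (SELECT a FROM stats WHERE stats.id = leaderboard.end_stats_id)
          - (SELECT a FROM stats WHERE stats.id = leaderboard.start_stats_id),
        b = (SELECT b FROM stats WHERE stats.id = leaderboard.end_stats_id)
          - (SELECT b FROM stats WHERE stats.id = leaderboard.start_stats_id),
        needs_update = 0
    WHERE needs_update = 1
""")

row = conn.execute("SELECT a, b, needs_update FROM leaderboard").fetchone()
print(row)
```

The derived column e would still need its own statement (or the full expression repeated), because within a single UPDATE the right-hand sides see the old column values, not the ones being assigned.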
I have a model which I want to get both the most recent values out of, meaning the values in the most recently added item, and an aggregated value over a period of time. I can get the answers in separate QuerySets and then unite them in Python but I feel like there should be a better ORM approach to this. Anybody know how it can be done?
Simplified example:
class Rating(models.Model):
    movie = models.ForeignKey(Movie, related_name="movieRatings")
    rating = models.IntegerField(blank=True, null=True)
    timestamp = models.DateTimeField(auto_now_add=True)
I wish to get the avg rating in the past month and the most recent rating per movie.
Current approach:
recent_rating = Rating.objects.order_by('movie_id','-timestamp').distinct('movie')
monthly_ratings = Rating.objects.filter(timestamp__gte=datetime.datetime.now() - datetime.timedelta(days=30)).values('movie').annotate(month_rating=Avg('rating'))
And then I need to somehow join them on the movie id.
Thank you!
Try this solution based on Subquery expressions:
import datetime
from django.db.models import OuterRef, Subquery, Avg, DecimalField

# Average rating per movie over the last 30 days, correlated to the outer row's movie
month_rating_subquery = Rating.objects.filter(
    movie=OuterRef('movie'),
    timestamp__gte=datetime.datetime.now() - datetime.timedelta(days=30)
).values('movie').annotate(monthly_avg=Avg('rating'))

# Most recent rating per movie, annotated with that movie's monthly average
result = Rating.objects.order_by('movie', '-timestamp').distinct('movie').values(
    'movie', 'rating'
).annotate(
    monthly_rating=Subquery(month_rating_subquery.values('monthly_avg'), output_field=DecimalField())
)
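As a way to check what the combined queryset is expected to return, here is a small plain-Python sketch that produces the same shape of result from in-memory tuples. The `summarize` helper and the sample data are hypothetical, and it assumes every rating is non-null even though the model allows nulls.

```python
from datetime import datetime, timedelta

def summarize(ratings, now):
    """ratings: iterable of (movie_id, rating, timestamp) tuples."""
    cutoff = now - timedelta(days=30)
    latest = {}  # movie_id -> (timestamp, rating) of the newest rating
    recent = {}  # movie_id -> list of ratings from the last 30 days
    for movie_id, rating, ts in ratings:
        if movie_id not in latest or ts > latest[movie_id][0]:
            latest[movie_id] = (ts, rating)
        if ts >= cutoff:
            recent.setdefault(movie_id, []).append(rating)
    return {
        m: {
            "rating": latest[m][1],
            "monthly_rating": sum(recent[m]) / len(recent[m]) if m in recent else None,
        }
        for m in latest
    }

now = datetime(2024, 3, 1)
sample = [
    (1, 4, datetime(2024, 2, 20)),
    (1, 2, datetime(2024, 2, 25)),
    (1, 5, datetime(2023, 12, 1)),
    (2, 3, datetime(2023, 11, 1)),
]
result = summarize(sample, now)
print(result)
```

Movie 2 has no ratings in the last 30 days, so its monthly value is None; the subquery version should similarly yield NULL for such movies rather than dropping them.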
I suggest you add a property (monthly_rating) to your Rating model using the @property decorator instead of calculating it in your views.py:
@property
def monthly_rating(self):
    return 'calculate your avg rating here'
I have the following four models:
Measurement:
class Measurement(models.Model):
    config = models.ForeignKey(MeasurementConfig)
    energy = models.ForeignKey(Energy)
    dose = models.DecimalField(max_digits=20, decimal_places=9, blank=True, null=True)
MeasurementConfig:
class MeasurementConfig(models.Model):
    date = models.DateTimeField(auto_now_add=True)
    linac = models.ForeignKey(Linac)
Linac:
class Linac(models.Model):
    name = models.CharField(max_length=10)
    genre = models.ForeignKey(Type)
    energies = models.ManyToManyField(Energy)
Energy:
class Energy(models.Model):
    value = models.PositiveIntegerField()
    category = models.ForeignKey(EnergyCategory)
Now I want to get all the dose measurements from the Measurement model for a specific Linac and Energy.
I have used the following code to get these:
linac = get_object_or_404(Linac, name=linacname)
# Get all the energies for the specified linac
measurementconfigs = MeasurementConfig.objects.filter(linac=linac)
identifications = []
for measurementconfig in measurementconfigs:
    identifications.append(measurementconfig.identification)
config = get_object_or_404(MeasurementConfig, identification=identifications[0])
energies = config.linac.energies.all()
# Get the measurements
energymeasurements = []
for energy in energies:
    measurements = Measurement.objects.filter(config__linac=linac).filter(config__linac__energies__exact=energy)
    energymeasurements.append(measurements)
What I expect energymeasurements to look like is like this:
energymeasurements = [(Measurements_**Energy1**), (Measurements_**Energy2**), (Measurements_**EnergyN**)]
Where N is the amount of energies
But what I get is this:
energymeasurements = [(Measurements_**ALLEnergies1**), (Measurements_**ALLEnergies2**), (Measurements_**ALLEnergiesN**)]
Where N is the amount of energies.
So I expect that the query I make gets all the measurements for the specific energy specified by the loop. But instead I get the measurements for all the energies.
I've already tried it without the loop by using only the id of a specific energy, the same problem occurs.
I know the problem is in the query, but I can't find out what it is.
Your objective:
Now I want to get all the dose measurements from the Measurement model for a specific Linac and Energy.
My approach:
Get the Linac record.
Use the many-to-many link table to get the appropriate Energy record.
Get all MeasurementConfig records linked to the Linac record.
For each MeasurementConfig record, get all Measurement records; use the Energy record to filter this record set.
In the code you posted, it looks like you're trying to filter the Measurements record set on a link that doesn't exist, according to the models you specified.
# Get the measurements
energymeasurements = []
for energy in energies:
    measurements = Measurement.objects.filter(config__linac=linac).filter(config__linac__energies__exact=energy)
    energymeasurements.append(measurements)
A Measurement isn't linked to a Linac, so I think this is why you're not getting the result you expect.
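Judging only by the models as posted, each Measurement carries its own energy foreign key, so one possible reading is that the filter should test that field (something like filter(config__linac=linac, energy=energy)) rather than the linac's list of energies, which is the same for every measurement of that linac. That is an assumption about the intent, not something confirmed in the thread. The plain-Python sketch below illustrates the grouping that the question seems to expect; the `measurements` tuples and the `doses_by_energy` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (measurement_id, linac_name, energy_value, dose)
measurements = [
    (1, "L1", 6, 1.01),
    (2, "L1", 10, 0.99),
    (3, "L1", 6, 1.02),
    (4, "L2", 6, 0.98),
]

def doses_by_energy(rows, linac_name):
    # Keep only the chosen linac, then group by the measurement's own energy
    grouped = defaultdict(list)
    for _id, linac, energy, dose in rows:
        if linac == linac_name:
            grouped[energy].append(dose)
    return dict(grouped)

grouped = doses_by_energy(measurements, "L1")
print(grouped)
```

Each energy gets its own list of doses, which matches the expected [(Measurements_Energy1), (Measurements_Energy2), ...] layout described in the question.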