Django Aggregation for Goals - python

I'm saving every Sale in a Store. I want to use aggregation to sum all of the sales in a month for every store, and I want to filter for the stores that reach the goal (100.000$).
I've already come up with a solution using Python and a list, but I wanted to know if there is a better solution using only the ORM.
Sales model
Store Sale Date
Store A 5.000 11/01/2014
Store A 3.000 11/01/2014
Store B 1.000 15/01/2014
Store C 8.000 17/01/2014
...
The result should be this:
Month: January
Store Amount
A 120.000
B 111.000
C 150.000
and discard
D 70.000
Thanks for your help.

Other suggested methods discard a lot of data that takes a fraction of a second to load, and that could be useful later on in your code. Hence this answer.
Instead of querying on the Sales object, you can query on the Store object. The query is roughly the same, except for the relations:
from django.db.models import Sum

stores = (
    Store.objects
    .filter(sales__date__month=month, sales__date__year=year)
    .annotate(monthly_sales=Sum('sales__amount'))
    .filter(monthly_sales__gte=100000)
    # optionally prefetch all `sales` objects if you know you need them
    .prefetch_related('sales')
)
>>> [s for s in stores]
[
<Store object 1>,
<Store object 2>,
etc.
]
All Store objects have an extra attribute monthly_sales containing the total amount of sales for that particular month. By filtering on month and year before annotating, the annotation only uses the filtered related objects. Note that the sales attribute on the store still contains all sales for that store.
With this method, all store attributes are easily accessible, unlike when you use .values to group your results.
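For example (assuming Store has a name field, which the question doesn't show), the annotation can be read straight off each object:
for store in stores:
    # monthly_sales is the annotation added in the query above
    print(store.name, store.monthly_sales)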

Without a good look at your models, the best I can do is pseudocode, but I would expect you need something along the lines of:
from django.db.models import Sum
results = Sales.objects.filter(date__month=month, date__year=year)
results = results.values('store')
results = results.annotate(total_sales=Sum('sale'))
return results.filter(total_sales__gte=100000)
Basically, what we're doing is using Django's aggregation capabilities to compute the Sum of sales for each store. Per Django's documentation, we can use the values function to group our results by distinct values in a given field.
In line 2, we filter our sales to only the sales from the given month and year.
In line 3, we limit our results to the values for the field store.
In line 4, we annotate each result with the Sum of all sales from the original query.
In line 5, we filter on that annotation, limiting the returned results to stores with total_sales of at least 100,000 (the stated goal).

You can use annotate to handle this. Since I do not know your model structure, this is an educated guess:
from django.db.models import Sum
Sales.objects.filter(date__month=3, date__year=2014).values('store').annotate(monthly_sale=Sum('sale'))
That will return a QuerySet of stores and their monthly sales, like:
>> [
{"store": 1, "monthly_sale": 120.000},
{"store": 2, "monthly_sale": 100.000},
...
]
The above query assumes that:
Your Sales model has a Date or DateTime field named date
Your Sales model has a ForeignKey relation to Store
Your Sales model has a numeric field (Integer, Decimal, etc.) named sale
In the resulting QuerySet, store is the id of your store record. But since it is a ForeignKey, you can use the relation to get its name, etc.:
Sales.objects.filter(date__month=3, date__year=2014).values('store__name').annotate(monthly_sale=Sum('sale'))
>> [
{"store__name": "Store A", "monthly_sale": 120.000},
{"store__name": "Store B", "monthly_sale": 100.000},
...
]

Related

Assign 0 if the date doesn't exist in the records from the database

I'm trying to generate a list of values to plot on a dashboard.
The list shows how many orders were made per day over the last 7 days; if no order was recorded for a given day, it should show 0.
For example, if we have this order tracker model:
Order: {date_ordered (date), count_orders (int), restaurant (string)}
If we want to list all the orders for the last seven days, I'd do something like:
from datetime import datetime, timedelta
from .models import Order
last_seven_days = datetime.now() - timedelta(days=6)
orders = Order.objects.filter(date_ordered__gte=last_seven_days).order_by('date_ordered')
orders contains something like this:
[<19-05-2022, Restaurant1, 35>, <19-05-2022, Restaurant2, 30>, <22-05-2022, Restaurant1, 19>, <22-05-2022, Restaurant2, 10>, <23-05-2022, Restaurant1, 32>]
The final result I would like to get is stretched over the last 7 days.
If the last seven days are ['18-05-2022', '19-05-2022', '20-05-2022', '21-05-2022', '22-05-2022', '23-05-2022', '24-05-2022']
Then my output needs to be:
[[0,35,0,0,19,32,0], [0,30,0,0,10,0,0]]
The above output basically says 2 things:
1- there should be as many inner lists as restaurants.
2- if the date doesn't exist, it means 0 orders were recorded.
I feel this can be solved using a query rather than looping 3 times and many conditions.
Can you please share some thoughts on this?
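One possible sketch, not from the original thread: do the per-day aggregation in a single query, then fill the missing days with 0 in Python (field names follow the Order model above):
from datetime import date, timedelta
from django.db.models import Sum
from .models import Order

today = date.today()
days = [today - timedelta(days=offset) for offset in range(6, -1, -1)]

# one query: total orders per restaurant per day
rows = (
    Order.objects
    .filter(date_ordered__gte=days[0])
    .values('restaurant', 'date_ordered')
    .annotate(total=Sum('count_orders'))
)

# build {restaurant: {date: total}}, then stretch each restaurant over the 7 days
per_restaurant = {}
for row in rows:
    per_restaurant.setdefault(row['restaurant'], {})[row['date_ordered']] = row['total']

result = [[counts.get(day, 0) for day in days] for counts in per_restaurant.values()]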

How To Determine Bulk Customers (Customers Buying More Than 'N' Items Or So In A Single Transaction) In Python (Sales Data Analytics)?

So I have the following sample dataset:
Column A: Name
Column B: Email
Column C: Products
Column D: Transaction Date
I have two objectives:
To determine bulk customers (customers who purchase, let's say, 5 products or more in a single transaction), where each row represents a unique transaction with a unique timestamp.
To determine, from the recurring customers (customers frequently making different transactions), who all are also bulk customers.
Now, I've already determined the list of recurring customers as follows:
import numpy as np

n = 15
custmost1 = Order_Details['Name'].value_counts().index.tolist()[:n]
custmost2 = Order_Details['Name'].value_counts().values.tolist()[:n]
custmost = np.column_stack((custmost1, custmost2))
Here custmost is an array pairing the customers who make the most frequent purchases with their purchase counts. Order_Details is the dataframe I created for the dataset.
Now, I'm at my wits' end trying to figure out how to maintain a count of the different products purchased in a single transaction (with a unique timestamp) and, possibly, add it as a separate column in the dataframe.
I don't know if it's a feasible approach or not, but two ways came to mind:
One is to count the number of commas, so that the number of commas + 1 is the number of products.
The other is to segregate each product into a separate line (which I already did, by the way, to maintain a total count for a different insight), and to check, per timestamp, the number of products sold.
I'd segregated the Products as follows:
reshaped = (
    Order_Details.set_index(Order_Details.columns.drop('Product').tolist())
    .Product.str.split(',', expand=True)
    .stack()
    .reset_index()
    .rename(columns={0: 'Product'})
    .loc[:, Order_Details.columns]
)
So, in light of this, I would like someone to guide me, as I feel the aforementioned approaches are rather messy.
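For what it's worth, a minimal sketch of the second idea, using the reshaped frame from the snippet above (the column names 'Name' and 'Transaction Date' are assumed from the dataset description):
# count rows per transaction after the explode, i.e. products per timestamp,
# then keep the transactions with 5 or more products
products_per_txn = reshaped.groupby(['Name', 'Transaction Date']).size()
bulk_txns = products_per_txn[products_per_txn >= 5]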
Assuming you already have a proper DataFrame:
>>> df.applymap(lambda e: type(e).__name__).value_counts()
name email product date
str str list Timestamp 29
dtype: int64
(i.e., with columns: ['name', 'email', 'product', 'date'], where the 'product' column contains list objects, and date contains Timestamp),
Then you could do this:
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'name'])
s = df.groupby('name').size() > 1
recur_customers = set(s[s].index)
>>> bulk_customers
{'PERSON_108', 'PERSON_123'}
>>> recur_customers
{'PERSON_105'}
Notes
I changed the row of PERSON_125 to be PERSON_105, so that there would be one repeat customer. Likewise, I used a threshold of n_visits > 1 as the criterion for "recurring", but you may want something different.
You'd be well inspired to assign a unique ID to each of your customers. This could be based on email or perhaps you already have a customer ID. In any case, using name is prone to collisions, plus sometimes customers change name (e.g. through marriage) while keeping the same email or customer ID.
You didn't mention over what period of time a customer needs to visit again in order to be considered "frequent". If that is to be considered, you have to be specific whether it is e.g. "within a calendar month", or "over the past 30 days", etc., as each leads to slightly different expressions.
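For instance, the same lookups keyed on email instead of name (a sketch, assuming the DataFrame described above):
# identical bulk/recurring logic, keyed on email rather than name
bulk_customers = set(df.loc[df['product'].apply(len) >= 5, 'email'])
s = df.groupby('email').size() > 1
recur_customers = set(s[s].index)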
Ok, so after some extensive brainstorming, I've concocted the following way to do this:
In the original dataset's dataframe (Order_Details), I figured out how to get the count of commas in each row of the Product column, which gives the number of products purchased in a single transaction. The code for that is:
Order_Details['Number Of Products'] = Order_Details['Product'].str.count(",")+1
To make sure that I get the names of customers in a sorted order according to the frequency of purchases, I applied the following sort_values() function:
Dup_Order_Details = Order_Details.copy()  # .copy() so the sort doesn't mutate the original frame
Dup_Order_Details.sort_values(["Number Of Products", "Name"], axis=0, ascending=False, inplace=True, na_position='first')
Finally, a filter for those buying 'N' or more products (here I took N=10, as I wanted this insight; y'all can take 'N' as input if you want):
Dup_Order_Details = Dup_Order_Details[Dup_Order_Details["Number Of Products"] >= 10]
Then a simple direct display can be done as per your need, or you can convert it into a list or something in case any visualization is needed (which I did).
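For example, one simple way to pull the names out as a plain list (a sketch; columns as above):
bulk_names = Dup_Order_Details["Name"].unique().tolist()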

Calculate Max of Sum of an annotated field over a grouped by query in Django ORM?

To keep it simple, I have four tables (A, B, Category and Relation); the Relation table stores the Intensity of A in B, and Category stores the type of B.
A <--- Relation ---> B ---> Category
(So the relation between A and B is n to n, when the relation between B and Category is n to 1)
I need an ORM query to group Relation records by Category and A, then calculate the Sum of Intensity in each (Category, A) pair (seems simple till here); then I want to annotate the Max of the calculated Sums in each Category.
My code is something like:
A.objects.values('B_id').annotate(AcSum=Sum('Intensity')).annotate(Max('AcSum'))
Which throws the error:
django.core.exceptions.FieldError: Cannot compute Max('AcSum'): 'AcSum' is an aggregate
The django-group-by package fails with the same error.
For further information please also see this stackoverflow question.
I am using Django 2 and PostgreSQL.
Is there a way to achieve this using ORM, if there is not, what would be the solution using raw SQL expression?
Update
After lots of struggling I found out that what I wrote was indeed an aggregation; however, what I want is to find the maximum AcSum of each A in each Category. So I suppose I have to group the result once more after the AcSum calculation. Based on this insight I found a Stack Overflow question which asks about the same concept (asked 1 year, 2 months ago, without any accepted answer).
Chaining another values('id') onto the queryset functions neither as a group_by nor as a filter for output attributes; it removes AcSum from the set. Adding AcSum to values() is also not an option, due to changes in the grouped-by result set.
I think what I am trying to do is regroup the grouped-by query based on the fields inside a column (i.e. id).
any thoughts?
You can't do an aggregate of an aggregate (Max(Sum())); it's not valid in SQL, whether you're using the ORM or not. Instead, you have to join the table to itself to find the maximum. You can do this using a subquery. The code below looks right to me, but keep in mind I don't have something to run it on, so it might not be perfect.
from django.db.models import OuterRef, Q, Subquery, Sum

annotation = {
    'AcSum': Sum('intensity'),
}
# The basic query is on Relation, grouped by A and Category and annotated
# with the Sum of intensity.
query = Relation.objects.values('a', 'b__category').annotate(**annotation)
# The subquery is joined to the outer query on the Category.
sub_filter = Q(b__category=OuterRef('b__category'))
# The subquery is grouped by A and Category and annotated with the Sum
# of intensity, which is then ordered descending so that when a LIMIT 1
# is applied, you get the Max.
subquery = Relation.objects.filter(sub_filter).values(
    'a', 'b__category').annotate(**annotation).order_by(
    '-AcSum').values('AcSum')[:1]
query = query.annotate(max_intensity=Subquery(subquery))
This should generate SQL like:
SELECT a_id, category_id,
       (SELECT SUM(U0.intensity) AS AcSum
        FROM Relation U0
        JOIN B U1 ON U0.b_id = U1.id
        WHERE U1.category_id = B.category_id
        GROUP BY U0.a_id, U1.category_id
        ORDER BY SUM(U0.intensity) DESC
        LIMIT 1
       ) AS max_intensity
FROM Relation
JOIN B ON Relation.b_id = B.id
GROUP BY Relation.a_id, B.category_id
It may be more performant to eliminate the join in Subquery by using a backend specific feature like array_agg (Postgres) or GroupConcat (MySQL) to collect the Relation.ids that are grouped together in the outer query. But I don't know what backend you're using.
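As a rough, untested sketch of that idea on Postgres (ArrayAgg lives in django.contrib.postgres; field names follow the queries above):
from django.contrib.postgres.aggregates import ArrayAgg
from django.db.models import Sum

# collect the ids of the Relation rows in each (A, Category) group while
# aggregating, so a correlated subquery could filter on them directly
# instead of re-joining through B
query = (
    Relation.objects
    .values('a', 'b__category')
    .annotate(AcSum=Sum('intensity'), relation_ids=ArrayAgg('id'))
)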
Something like this should work for you. I couldn't test it myself, so please let me know the result:
from django.db.models import F, Max, Sum

Relation.objects.annotate(
    b_category=F('B__Category')
).values(
    'A', 'b_category'
).annotate(
    SumIntensityPerCategory=Sum('Intensity')
).values(
    'A', MaxIntensitySumPerCategory=Max('SumIntensityPerCategory')
)

How to get the average of the first 6 elements in a queryset and annotate the value in Django?

I have 2 models that are something like this:
from django.db import models

class Foo(models.Model):
    name = models.CharField(max_length=30)

class FooStatement(models.Model):
    foo = models.ForeignKey(Foo, on_delete=models.CASCADE)
    revenue = models.DecimalField(max_digits=10, decimal_places=2)  # max_digits is required; 10 assumed here
    date = models.DateField()
What I want to know is the average of the FooStatement revenue for the first 6 dates of each Foo object. Is there any way to achieve this?
I was thinking of slicing the first 6 entries (after ordering them), but I cannot seem to get this to work. The statements all start at different dates, so I can't just say that I want all dates less than 'x'.
I'm almost positive the answer lies somewhere in clever annotation, but I just cannot find it.
Edit: Django version 1.11.6 with a postgres database. There are upwards of 4000 Foo objects and they will keep growing.
What I ended up using is F expressions. Though not an ideal solution, it covers roughly 90% of my statements. I annotate the first month and then select all foostatements that are before a certain date and aggregate the average:
import datetime

from django.db.models import Avg, F, Min

qs = Foo.objects.all().annotate(f_month=Min(F('foostatement__date')))
value = qs.filter(
    foostatement__date__lt=F('f_month') + datetime.timedelta(days=6 * 365 / 12)
).aggregate(Avg('foostatement__revenue'))
A problem still persists with foostatements that skip a month, but for my purposes I can live with that.
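An alternative sketch for the exact "first 6 dates" semantics, using Subquery (available since Django 1.11); untested, with the model names from the question:
from django.db.models import Avg, OuterRef, Subquery

# ids of the six earliest statements of each Foo
first_six = (
    FooStatement.objects
    .filter(foo=OuterRef('foo'))
    .order_by('date')
    .values('id')[:6]
)
# average revenue over just those statements, grouped per Foo
avg_per_foo = (
    FooStatement.objects
    .filter(id__in=Subquery(first_six))
    .values('foo')
    .annotate(avg_revenue=Avg('revenue'))
)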

Django: Filtering a queryset then count

I'm trying to limit the number of queries I perform on a page. The queryset returns the objects created within the last 24 hours. I then want to filter that queryset to count the objects based upon a field.
Example:
cars = self.get_queryset()
volvos_count = cars.filter(brand="Volvo").count()
mercs_count = cars.filter(brand="Merc").count()
With this approach, the number of queries grows linearly with the number of brands that must be counted.
How can you make a single query for the cars that returns a dict of all of the unique values for brand and the number of instances within the queryset?
Result:
{'volvos': 4, 'mercs': 50, ...}
Thanks!
EDIT:
The comments so far have been close, but not quite on the mark. Using values_list('brand', flat=True) will return the brands. From there you can use
from collections import Counter
to return the totals. It would be great if there were a way to do this in a single query, but maybe it isn't possible.
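For reference, a minimal sketch of that Counter approach (one query for the brands, counting done in Python):
from collections import Counter

totals = Counter(cars.values_list('brand', flat=True))
# e.g. Counter({'Merc': 50, 'Volvo': 4, ...})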
To generate a count for each distinct brand, you use values in conjunction with annotate.
totals = cars.values('brand').annotate(Count('brand'))
This gives you a queryset, each of whose elements is a dictionary with brand and brand__count. You can convert that directly into a dict with the format you want:
{item['brand']: item['brand__count'] for item in totals}
SELECT brand, COUNT(*) AS total
FROM cars
GROUP BY brand
ORDER BY total DESC
Equivalent:
from django.db.models import Count
cars.values('brand').annotate(total=Count('brand')).order_by('-total')
