Django equivalent of SELECT * GROUP BY in MySQL - python

I'm having troubles using .annotate() and .aggregate() in Django ORM.
My table structure:
+----+----------+----------------------------+
| id | group_id | date_time                  |
+----+----------+----------------------------+
| 1  | 1        | 2020-01-25 19:51:46.603859 |
| 2  | 2        | 2020-01-24 18:40:24.301419 |
| 3  | 1        | 2020-01-25 20:14:11.123860 |
| 4  | 2        | 2020-01-25 05:20:21.507901 |
+----+----------+----------------------------+
Edit: I have the following MySQL query:
SELECT m.*
FROM my_table m
JOIN (
    SELECT group_id, MAX(date_time) AS max_date
    FROM my_table
    GROUP BY group_id
) AS s ON m.group_id = s.group_id AND m.date_time = s.max_date
Which returns:
+----+----------+----------------------------+
| id | group_id | date_time                  |
+----+----------+----------------------------+
| 3  | 1        | 2020-01-25 20:14:11.123860 |
| 4  | 2        | 2020-01-25 05:20:21.507901 |
+----+----------+----------------------------+
And I'm trying to convert it to the Django ORM so I can have a full QuerySet of objects. Until now I have been using this code:
unique_qs = MyModel.objects.filter(id__lte=50).values_list('group_id', flat=True).distinct()
unique_obj = []
for qs in unique_qs:
    unique_obj.append(MyModel.objects.filter(group_id=qs).latest('date_time'))
But it's really inefficient and time-consuming. Could you give me a lead on how to achieve this?

First, import the Max aggregation function:
from django.db.models import Max
Then this is what you need:
MyModel.objects.filter(id__lte=50).values('group_id').order_by('group_id').annotate(date_time_max=Max('date_time'))
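Note that this returns dictionaries of group_id and date_time_max rather than model instances. If you need a QuerySet of full objects, one common pattern (a sketch, not part of the original answer) is a Subquery with OuterRef that picks the latest row per group in a single query:

from django.db.models import OuterRef, Subquery

# for each row, find the newest row sharing its group_id
latest = (
    MyModel.objects
    .filter(group_id=OuterRef('group_id'))
    .order_by('-date_time')
)
# keep only rows whose pk is the newest of their group
full_qs = MyModel.objects.filter(
    id__lte=50,
    pk=Subquery(latest.values('pk')[:1]),
)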

Related

flask-sqlalchemy query for returning newest and using distinct columns

I am really struggling to write the correct Postgres query that I want in Flask-SQLAlchemy.
A sample table of what my data looks like is below. I am trying to get back xxx and yyy for the newest (latest timestamp) entry for each unique name.
So ideally, for the data below, I want my query to return the bottom 4 entries.
+------+-----+-----+------------------------+
| name | xxx | yyy | time                   | ... (other columns)
+------+-----+-----+------------------------+
| aaa  |   2 |  12 | 2021-03-11 20:27:13+00 |
| bbbb |   9 |  13 | 2021-03-11 20:27:13+00 |
| cccc |   2 |  16 | 2021-03-11 20:27:13+00 |
| dddd |  10 |  26 | 2021-03-11 20:27:13+00 |
| aaa  |   4 |  13 | 2021-03-11 20:27:23+00 |
| bbbb |   8 |  12 | 2021-03-11 20:27:23+00 |
| cccc |   1 |  15 | 2021-03-11 20:27:23+00 |
| dddd |  12 |  26 | 2021-03-11 20:27:23+00 |
| aaa  |   3 |  12 | 2021-03-11 20:27:33+00 |
| bbbb |   6 |  11 | 2021-03-11 20:27:33+00 |
| cccc |   1 |  17 | 2021-03-11 20:27:33+00 |
| dddd |  13 |  23 | 2021-03-11 20:27:33+00 |
+------+-----+-----+------------------------+
My most basic query, returning the latest entry for a single name, looks like:
single_query = MyModel \
    .query \
    .filter_by(name='aaa') \
    .order_by(desc(MyModel.time)) \
    .first()
Using my model, I have tried to get my expected result with queries like the one below (based on some other SO answers):
full_query = MyModel \
    .query \
    .with_entities(MyModel.name, MyModel.time, MyModel.xxx) \
    .distinct(MyModel.name) \
    .all()
This gets me most of the way there, but it is returning seemingly random entries. I thought I would easily be able to add an order_by(desc(MyModel.time)), but I can't make it work with the above query.
Any suggestions on how I can get this to work or some pointers to get me in the correct direction? I've been scratching my head for a while. :)
I've done a lot of searching but don't know how to extend answers like this (SQLAlchemy distinct, order_by different column) to my model querying.
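For reference, Postgres's DISTINCT ON keeps the first row of each distinct set according to ORDER BY, and it requires the ORDER BY to lead with the DISTINCT ON column. So one possible fix (a sketch, assuming a Postgres backend) is to order by name first and time descending second; each distinct name then keeps its newest row:

full_query = MyModel \
    .query \
    .order_by(MyModel.name, desc(MyModel.time)) \
    .distinct(MyModel.name) \
    .all()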
UPDATE
If I wanted to query two tables at once, for example sample_table_1 and sample_table_2 via MyModel and MyModel2, how can I translate the Postgres below to Flask-SQLAlchemy?
A raw query that achieves what I want is below.
I was able to base this on @Oluwafemi Sule's helpful answer. Can I extend this, or the solution query, to suit my needs? :)
SELECT
    m.name,
    m.xxx,
    m.yyy,
    n.xxx,
    n.yyy
FROM (
    SELECT
        name,
        xxx,
        yyy,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC) AS rn
    FROM
        sample_table_1
) m, (
    SELECT
        name,
        xxx,
        yyy,
        ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC) AS rn
    FROM
        sample_table_2
) n
WHERE m.name = n.name AND m.rn = 1 AND n.rn = 1;
The SQL query to get your expected results is as follows:
SELECT
    name
    ,xxx
    ,yyy
    ,time
FROM
(
    SELECT
        name
        ,xxx
        ,yyy
        ,time
        -- Number rows after partitioning by name and reverse chronological ordering
        ,ROW_NUMBER() OVER (PARTITION BY name ORDER BY time DESC) AS rn
    FROM sample_table
) subquery
WHERE rn = 1;
Composing the SQLAlchemy query is then as follows:
from sqlalchemy import func

subquery = (
    MyModel
    .query
    .with_entities(
        MyModel.name,
        MyModel.time,
        MyModel.xxx,
        MyModel.yyy,
        func.row_number().over(
            partition_by=MyModel.name,
            order_by=MyModel.time.desc()
        ).label("rn")
    )
    .subquery()
)

full_query = (
    MyModel.query.with_entities(
        subquery.c.name,
        subquery.c.time,
        subquery.c.xxx,
        subquery.c.yyy
    )
    .select_from(subquery)
    .filter(subquery.c.rn == 1)
    .all()
)
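For the two-table case in the update, one way to extend this (a sketch under the assumption that MyModel2 mirrors MyModel's columns) is to build a second ranked subquery and join the two on name:

subquery2 = (
    MyModel2
    .query
    .with_entities(
        MyModel2.name,
        MyModel2.xxx,
        MyModel2.yyy,
        func.row_number().over(
            partition_by=MyModel2.name,
            order_by=MyModel2.time.desc()
        ).label("rn")
    )
    .subquery()
)

joined = (
    MyModel.query.with_entities(
        subquery.c.name,
        subquery.c.xxx,
        subquery.c.yyy,
        subquery2.c.xxx,
        subquery2.c.yyy,
    )
    .select_from(subquery)
    .join(subquery2, subquery.c.name == subquery2.c.name)
    .filter(subquery.c.rn == 1, subquery2.c.rn == 1)
    .all()
)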
You can get the desired result by selecting the distinct first_value for each column, partitioned by name and ordered by time descending, using the SQLAlchemy Core language:
import sqlalchemy as sa
from sqlalchemy import func

stmt = sa.select([
    func.first_value(c)
    .over(partition_by=MyModel.name,
          order_by=MyModel.time.desc())
    .label(c.name)
    for c in MyModel.__table__.c
]).distinct()
This generates SQL resembling:
SELECT DISTINCT
    first_value(my_model.name) OVER (PARTITION BY name ORDER BY time DESC) AS name, ...
FROM my_model
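Since stmt is a Core select rather than a Query, running it goes through the session; a usage sketch (assuming a configured session):

rows = session.execute(stmt).fetchall()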

SQLAlchemy Filter only minimum field value from distinct pairs

I have a ProductPurchase model that describes a purchase made by a client.
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column
from sqlalchemy import DateTime
from sqlalchemy import Integer
from sqlalchemy import String

Base = declarative_base()

class ProductPurchase(Base):
    __tablename__ = "product_client"
    # declarative models need a primary key; a surrogate id is assumed here
    id = Column(Integer, primary_key=True)
    client_id = Column(String(255))
    product_id = Column(String(255))
    purchased_at = Column(DateTime(timezone=True))
What I want to do is get the list of ProductPurchase rows where I would only have the first purchase of a client for a given product_id.
For example:
+-----------+------------+--------------+
| client_id | product_id | purchased_at |
+-----------+------------+--------------+
| c1 | prod1 | 2020-01-01 |
+-----------+------------+--------------+
| c1 | prod1 | 2020-01-02 |
+-----------+------------+--------------+
| c2 | prod1 | 2020-01-01 |
+-----------+------------+--------------+
| c2 | prod2 | 2020-01-01 |
+-----------+------------+--------------+
I want to get following rows:
+-----------+------------+--------------+
| client_id | product_id | purchased_at |
+-----------+------------+--------------+
| c1 | prod1 | 2020-01-01 |
+-----------+------------+--------------+
| c2 | prod1 | 2020-01-01 |
+-----------+------------+--------------+
| c2 | prod2 | 2020-01-01 |
+-----------+------------+--------------+
Note that the client_id=c1, product_id=prod1 pair is missing for date 2020-01-02 because it should be filtered out: the goal is to obtain only the first purchase of the product by the client.
How can I do this using SQLAlchemy?
In SQLAlchemy you can group by client_id and product_id and take the minimum of purchased_at.
Something like this:
from sqlalchemy import func

session.query(
    ProductPurchase.client_id,
    ProductPurchase.product_id,
    func.min(ProductPurchase.purchased_at),
).group_by(
    ProductPurchase.client_id,
    ProductPurchase.product_id,
).all()
Here is the SQL; SQLAlchemy supports OVER() and PARTITION BY as well as CTEs, just follow the SQLAlchemy docs:
;WITH cte AS (
    SELECT
        *,
        RANK() OVER (PARTITION BY p.client_id, p.product_id ORDER BY p.purchased_at ASC) AS rnk
    FROM
        product_client AS p
)
SELECT cte.client_id,
       cte.product_id,
       cte.purchased_at
FROM cte
WHERE cte.rnk = 1
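A sketch of how that CTE might translate into the ORM (an untested sketch, reusing the ProductPurchase model above):

from sqlalchemy import func

# rank each (client, product) pair's purchases, oldest first
rnk = func.rank().over(
    partition_by=[ProductPurchase.client_id, ProductPurchase.product_id],
    order_by=ProductPurchase.purchased_at.asc(),
).label("rnk")

subq = session.query(
    ProductPurchase.client_id,
    ProductPurchase.product_id,
    ProductPurchase.purchased_at,
    rnk,
).subquery()

# keep only the first purchase per (client, product) pair
first_purchases = (
    session.query(subq.c.client_id, subq.c.product_id, subq.c.purchased_at)
    .filter(subq.c.rnk == 1)
    .all()
)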

SQLAlchemy: Insert or Update when column value is a duplicate

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a SQLAlchemy query that does the following:
finds entries in table B that do not yet exist in table A, and inserts them
finds entries in table B that do exist in table A but have a different value for the num column, and updates table A's num
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect the differences, but will adding order_by(B.entry_date.desc()) ensure I only do the existence comparisons with the latest str_identifier values?
My Current Query
from sqlalchemy import and_, exists, join, select

query = (
    select([B.str_identifier, B.value])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        and_(
            ~exists().where(
                and_(
                    B.str_identifier == A.str_identifier,
                    B.value == A.value,
                    ~B.value.in_([None]),
                )
            )
        )
    )
)
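Regarding the order_by question: ordering alone won't restrict which B rows take part in the comparison. A sketch of one alternative (an assumption, not the asker's final query) is to first reduce B to its newest row per str_identifier with a window function, then compare only those rows against A:

from sqlalchemy import func, select

# rank B's rows so the newest entry per str_identifier gets rn = 1
rn = func.row_number().over(
    partition_by=B.str_identifier,
    order_by=B.entry_date.desc(),
).label("rn")
latest_b = select([B.str_identifier, B.value, rn]).alias("latest_b")

# only the newest B rows remain candidates for the insert/update comparison
candidates = (
    select([latest_b.c.str_identifier, latest_b.c.value])
    .where(latest_b.c.rn == 1)
)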

sqlalchemy orm - change column in a table depending on another table

I have 3 tables:
table 1
| id | name |
|:---:|:----:|
| 1 | name |
table 2
| id | name | status |
|:---:|:----:|:------:|
| 1 | name | True |
table 3
| id_table1 | id_table2 | datetime   | status_table2 |
|:---------:|:---------:|:----------:|:-------------:|
| 1         | 1         | 01/11/2011 | True          |
How can I change a status in table 2 when I create a link in table 3, using the SQLAlchemy ORM in Python? The status must be changed when a link in table 3 is created, and also when the link is deleted. Does anyone have any cool and simple ideas?
Solved the problem by using ORM Events.
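A minimal sketch of that approach (the model names Table2 and Table3Link are hypothetical stand-ins for tables 2 and 3), using mapper-level after_insert and after_delete events to flip the status flag:

from sqlalchemy import event

# hypothetical models: Table3Link maps table 3, Table2 maps table 2
@event.listens_for(Table3Link, "after_insert")
def link_created(mapper, connection, target):
    # set table 2's status to True when a link row is inserted
    connection.execute(
        Table2.__table__.update()
        .where(Table2.id == target.id_table2)
        .values(status=True)
    )

@event.listens_for(Table3Link, "after_delete")
def link_deleted(mapper, connection, target):
    # set it back to False when the link row is deleted
    connection.execute(
        Table2.__table__.update()
        .where(Table2.id == target.id_table2)
        .values(status=False)
    )

Inside these events the raw connection is used rather than the session, since flushing the session from within a mapper event is not allowed.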

Find top x by count from MySQL in Python?

I have a csv file like this:
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014
I am reading the above CSV file, extracting the domain name from each email address, and counting emails by domain name and date. I can insert all of this into a MySQL table called domains successfully.
Problem statement: now I need to use the same table to report the top 50 domains by count, sorted by percentage growth of the last 30 days compared to the total. This is the part I don't understand how to do.
Below is the code that successfully inserts into the MySQL database; the reporting task is what I'm stuck on.
#!/usr/bin/python
import csv
import os
import MySQLdb
from collections import defaultdict, Counter

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # get current working directory path
    currentdirpath = os.getcwd()
    filepath = os.path.join(currentdirpath, filename)
    return filepath

# =================================================================
def read_CSV(filepath):
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            domain_counts[row[0].split('@')[1].strip()][row[1]] += 1

    db = MySQLdb.connect(host="localhost",    # your host, usually localhost
                         user="root",         # your username
                         passwd="abcdef1234", # your password
                         db="test")           # name of the database
    cur = db.cursor()
    q = """INSERT INTO domains(domain_name, cnt, date_of_entry) VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""
    for domain, data in domain_counts.iteritems():
        for email_date, email_count in data.iteritems():
            cur.execute(q, (domain, email_count, email_date))
    db.commit()

# ======================= main program ============================
path = get_file_path('emails.csv')
read_CSV(path)  # read the input file
What is the right way to do the reporting task using the domains table?
Update:
Here is my domains table:
mysql> describe domains;
+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name   | varchar(20) | NO   |     | NULL    |                |
| cnt           | int(11)     | YES  |     | NULL    |                |
| date_of_entry | date        | NO   |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+
And here is data I have in them:
mysql> select * from domains;
+----+---------------+-----+---------------+
| id | domain_name   | cnt | date_of_entry |
+----+---------------+-----+---------------+
|  1 | wawa.com      |   2 | 2014-04-30    |
|  2 | wawa.com      |   2 | 2014-05-01    |
|  3 | wawa.com      |   3 | 2014-05-31    |
|  4 | uwaterloo.ca  |   4 | 2014-04-30    |
|  5 | uwaterloo.ca  |   3 | 2014-05-01    |
|  6 | uwaterloo.ca  |   1 | 2014-05-31    |
|  7 | anonymous.com |   2 | 2014-04-30    |
|  8 | anonymous.com |   4 | 2014-05-01    |
|  9 | anonymous.com |   8 | 2014-05-31    |
| 10 | hotmail.com   |   4 | 2014-04-30    |
| 11 | hotmail.com   |   1 | 2014-05-01    |
| 12 | hotmail.com   |   3 | 2014-05-31    |
| 13 | gmail.com     |   6 | 2014-04-30    |
| 14 | gmail.com     |   4 | 2014-05-01    |
| 15 | gmail.com     |   8 | 2014-05-31    |
+----+---------------+-----+---------------+
Your needed report can be done in SQL on the MySQL side, and Python can be used to run the query, fetch the result set, and print the results.
Consider the following aggregate query, with a subquery and derived table, which follows the percentage growth formula:
    ((this month's domain total cnt) - (last month's domain total cnt))
        / (last month's all-domains total cnt)
SQL
SELECT domain_name, pct_growth
FROM (
    SELECT t1.domain_name,
        # SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO
        (SUM(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY)
                  THEN t1.cnt ELSE 0 END)
         -
         # SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
         SUM(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY)
                  THEN t1.cnt ELSE 0 END)
        ) /
        # SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
        (SELECT SUM(t2.cnt) FROM domains t2
         WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
        AS pct_growth
    FROM domains t1
    GROUP BY t1.domain_name
) AS derivedTable
ORDER BY pct_growth DESC
LIMIT 50;
Python
cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE

cur.execute(sql)
for row in cur.fetchall():
    print(row)
If I understand correctly, you just need the ratio of the past thirty days to the total count. You can get this using conditional aggregation. So, assuming that cnt is always greater than 0:
select d.domain_name,
       sum(cnt) as CntTotal,
       sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) as Cnt30Days,
       (sum(case when date_of_entry >= date_sub(now(), interval 1 month) then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;
