SQLAlchemy: Insert or Update when column value is a duplicate - python

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a sqlalchemy query that does the following:
finds entries in table B that do not yet exist in table A, and inserts them
finds entries in table B that do exist in table A but have a different value for the num column, and updates A's num
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect differences, but will adding order_by(B.entry_date.desc()) ensure that the exists comparison only uses the latest value for each str_identifier?
My Current Query
query = (
    select([B.str_identifier, B.num])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        ~exists().where(
            and_(
                B.str_identifier == A.str_identifier,
                B.num == A.num,
                B.num.isnot(None),
            )
        )
    )
)
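Note that order_by() alone will not do this: ORDER BY only sorts rows, it never removes the older entries from the comparison. A sketch of one way to restrict the check to the latest row per str_identifier (assuming the 1.x-style Core API used above, with a func.row_number() window; not the asker's final solution) could be:

from sqlalchemy import and_, func, join, select

# Sketch only: rn = 1 marks the newest entry_date per str_identifier,
# so the outer query compares only those rows against A.
latest = (
    select([
        B.str_identifier,
        B.num,
        func.row_number()
            .over(partition_by=B.str_identifier, order_by=B.entry_date.desc())
            .label("rn"),
    ])
    .alias("latest")
)

query = (
    select([latest.c.str_identifier, latest.c.num])
    .select_from(
        join(latest, A, onclause=latest.c.str_identifier == A.str_identifier,
             isouter=True)
    )
    .where(
        and_(
            latest.c.rn == 1,
            # row is missing from A entirely, or present with a different num
            A.id.is_(None) | (A.num != latest.c.num),
        )
    )
)

On PostgreSQL, the rows this returns could then feed insert().on_conflict_do_update() from sqlalchemy.dialects.postgresql if the insert/update should happen in one statement.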

Related

SQL - Conditionally join and replace values between two tables

I have two tables, one holding "raw" data and the other holding "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.
I want a query with the following conditions:
Select all rows from the first table
If there is a matching row in the second table (i.e. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), we use the most recent row from the second table rather than the first, where "most recent" is determined by primary_key_col_3
Otherwise we use the values from the first table.
Note: I have many more "value" columns in the actual data. Consider the following toy example with two tables, raw_d and edit_d:
raw_d
 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 0           | 1
 src_2             | dest_2            | 5           | 4
 src_3             | dest_3            | 2           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9
edit_d
 primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2
-------------------+-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 2020-05-09        | 7           | 0
 src_2             | dest_2            | 2020-05-08        | 6           | 1
 src_3             | dest_3            | 2020-05-07        | 5           | 2
 src_1             | dest_1            | 2020-05-08        | 3           | 4
 src_2             | dest_2            | 2020-05-09        | 2           | 5
The expected result is as given:
 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 7           | 0
 src_2             | dest_2            | 2           | 5
 src_3             | dest_3            | 5           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9
My proposed solution is to query the "greatest n per group" from the second table and then "overwrite" the matching rows of the first table's result, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select "the greatest n per group" would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data like in Replace column values based on another dataframe python pandas - better way?.
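For what it's worth, a minimal sketch of that pandas overlay (assuming a SQLAlchemy/DBAPI connection named conn; combine_first prefers the edited values and keeps unmatched raw rows) could look like:

import pandas as pd

keys = ["primary_key_col_1", "primary_key_col_2"]

raw = pd.read_sql("SELECT * FROM raw_d", conn)
latest = pd.read_sql(
    """SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) *
       FROM edit_d
       ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC""",
    conn,
)

# Overlay: edited values win; rows with no edit fall through from raw_d.
result = (
    latest.drop(columns=["primary_key_col_3"])
          .set_index(keys)
          .combine_first(raw.set_index(keys))
          .reset_index()
)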
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.
I would suggest phrasing the requirements as:
select the most recent row from the second table
bring in additional rows from the first table that don't match
This is a union all with distinct on:
(select distinct on (primary_key_col_1, primary_key_col_2)
        u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from edit_d u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw_d r
where not exists (select 1
                  from edit_d u
                  where u.primary_key_col_1 = r.primary_key_col_1 and
                        u.primary_key_col_2 = r.primary_key_col_2
                 );
As I understood from your question, there are two ways to solve this:
1. Using FULL OUTER JOIN
with cte as (
    select distinct on (primary_key_col_1, primary_key_col_2) *
    from edit_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
select
    coalesce(t1.primary_key_col_1, t2.primary_key_col_1),
    coalesce(t1.primary_key_col_2, t2.primary_key_col_2),
    coalesce(t1.value_col_1, t2.value_col_1),
    coalesce(t1.value_col_2, t2.value_col_2)
from cte t1
full outer join raw_d t2
    on t1.primary_key_col_1 = t2.primary_key_col_1
    and t1.primary_key_col_2 = t2.primary_key_col_2
2. Using Union
select distinct on (primary_key_col_1, primary_key_col_2)
       primary_key_col_1, primary_key_col_2, value_col_1, value_col_2
from (
    select * from edit_d
    union all
    select primary_key_col_1, primary_key_col_2, null as "primary_key_col_3",
           value_col_1, value_col_2
    from raw_d
) tab
order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last

SQLAlchemy many-to-one array response

I'm working with SQLAlchemy and Flask. I have a content table like:
--------------------------------------------
| id | title | description |
--------------------------------------------
| 1 | example | my content |
| 2 | another piece| my other content|
--------------------------------------------
And a status table like this:
--------------------------------------------------------
| id | content_id | status type | date |
--------------------------------------------------------
| 1 | 1 | written | 1/5/2020 |
| 2 | 1 | edited | 1/7/2020 |
--------------------------------------------------------
I want to be able to query the db and get a content row with all of its statuses in one row, instead of having the content repeated across multiple rows. For example I want:
----------------------------------------------------------
| id | title | description | statuses |
----------------------------------------------------------
| 1 | example | my content | [1,2] |
----------------------------------------------------------
Is there a way to do this with sqlalchemy?
You can use a correlated subquery to attach the status ids to each content row:
SELECT b.*,
       (SELECT GROUP_CONCAT(id) FROM status_table
        WHERE content_id = b.id) AS statuses
FROM content_table b;
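Since the question asks for SQLAlchemy specifically, here is a hedged Core sketch of the same idea (content and status are assumed Table objects for the two tables above, using the 1.x-style select([...]) seen elsewhere in this page; GROUP_CONCAT is MySQL, on Postgres func.array_agg would return a real list):

from sqlalchemy import func, select

# One output row per content, with its status ids collapsed into one value.
stmt = (
    select([
        content.c.id,
        content.c.title,
        content.c.description,
        func.group_concat(status.c.id).label("statuses"),
    ])
    .select_from(
        content.outerjoin(status, status.c.content_id == content.c.id)
    )
    .group_by(content.c.id, content.c.title, content.c.description)
)

The outer join keeps content rows that have no statuses yet, which the plain inner-join SQL above would drop.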

How to use a column value as a key in a JSON field with Django ORM and Postgres?

I need to figure out how to build a query with the Django ORM.
I have two tables with a one-to-many relation.
table A
id = auto-incrementing integer
data = Postgres JSON
...
table B
id = auto-incrementing integer
id_A = foreign key to table A
name = string
t_name = slug-like string
A.data holds a JSON object with the structure {key (a t_name from table B): value (a string)},
so the entity definitions (name and t_name) live in table B, while the values of those entities live in the JSON structure in table A.
Example:
Table A
------------
id | data
1 |{"a_01":"value a","a_02":"value a","b_01":"value b"}
2 |{"a_01":"value a","b_01":"value b"}
Table B
-----------
id | id_A | name | t_name
1 | 1 | A | a_01
2 | 1 | AA | a_02
3 | 1 | B | b_01
4 | 2 | A | a_01
5 | 2 | B | b_01
id_A and t_name are unique together.
I need to get the items from table A together with the name (B.name) and the value (A.data ->> B.t_name), via the Django ORM.
This query solves my problem, but I don't know how to express it in the Django ORM:
SELECT at.id, at.data, bt.name
FROM "A" AS at
JOIN "B" AS bt ON at.id=bt.id_a
WHERE data ->> bt.t_name = 'value a' AND bt.name='AA'
LIMIT 50;
the result is:
1 | {"a_01":"value a","a_02":"value a","b_01":"value b"} | AA
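Django's KeyTextTransform expects a literal key name, so for a key that comes from another column one option is a small Func that renders Postgres's ->> operator directly. A hedged sketch (model names A and B, with B.a = ForeignKey(A) stored in column id_A, are assumptions based on the schema above):

from django.db.models import F, Func, TextField

class JsonText(Func):
    # Renders: <json expression> ->> <key expression>  (Postgres only)
    arg_joiner = " ->> "
    function = ""
    output_field = TextField()

# Assumed models: A with data = JSONField(), B with a = ForeignKey(A, db_column="id_A")
qs = (
    B.objects
    .filter(name="AA")
    .annotate(value=JsonText(F("a__data"), F("t_name")))
    .filter(value="value a")
    .values("a__id", "a__data", "name")[:50]
)

This mirrors the raw SQL above: the join comes from the a__data traversal, data ->> bt.t_name comes from the Func, and the slice becomes LIMIT 50.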

Find top X by count from MySQL in Python?

I have a csv file like this:
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@uwaterloo.ca, 01-05-2014
nohaelprince@gmail.com, 01-05-2014
I am reading the above CSV file, extracting the domain name, and counting email addresses by domain and date. All of this I need to insert into a MySQL table called domains, which I am able to do successfully.
Problem statement: Now I need to use the same table to report the top 50 domains by count, sorted by percentage growth of the last 30 days compared to the total. This is the part I don't understand how to do.
Below is the code that successfully inserts into the MySQL database; the reporting task described above is what I'm stuck on.
#!/usr/bin/python
import csv
import os
from collections import defaultdict, Counter

import MySQLdb

domain_counts = defaultdict(Counter)

# ======================== Defined Functions ======================
def get_file_path(filename):
    # get current working directory path
    currentdirpath = os.getcwd()
    return os.path.join(currentdirpath, filename)

# =================================================================
def read_CSV(filepath):
    with open(filepath) as f:
        reader = csv.reader(f)
        for row in reader:
            # count one email per (domain, date) pair
            domain_counts[row[0].split('@')[1].strip()][row[1].strip()] += 1

    db = MySQLdb.connect(host="localhost",    # your host, usually localhost
                         user="root",         # your username
                         passwd="abcdef1234", # your password
                         db="test")           # name of the database
    cur = db.cursor()
    q = """INSERT INTO domains(domain_name, cnt, date_of_entry)
           VALUES(%s, %s, STR_TO_DATE(%s, '%%d-%%m-%%Y'))"""
    for domain, data in domain_counts.items():
        for email_date, email_count in data.items():
            cur.execute(q, (domain, email_count, email_date))
    db.commit()

# ======================= main program ============================
path = get_file_path('emails.csv')
read_CSV(path)  # read the input file
What is the right way to do the reporting task using the domains table?
Update:
Here is my domains table:
mysql> describe domains;
+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| domain_name   | varchar(20) | NO   |     | NULL    |                |
| cnt           | int(11)     | YES  |     | NULL    |                |
| date_of_entry | date        | NO   |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+
And here is data I have in them:
mysql> select * from domains;
+----+---------------+-----+---------------+
| id | domain_name   | cnt | date_of_entry |
+----+---------------+-----+---------------+
|  1 | wawa.com      |   2 | 2014-04-30    |
|  2 | wawa.com      |   2 | 2014-05-01    |
|  3 | wawa.com      |   3 | 2014-05-31    |
|  4 | uwaterloo.ca  |   4 | 2014-04-30    |
|  5 | uwaterloo.ca  |   3 | 2014-05-01    |
|  6 | uwaterloo.ca  |   1 | 2014-05-31    |
|  7 | anonymous.com |   2 | 2014-04-30    |
|  8 | anonymous.com |   4 | 2014-05-01    |
|  9 | anonymous.com |   8 | 2014-05-31    |
| 10 | hotmail.com   |   4 | 2014-04-30    |
| 11 | hotmail.com   |   1 | 2014-05-01    |
| 12 | hotmail.com   |   3 | 2014-05-31    |
| 13 | gmail.com     |   6 | 2014-04-30    |
| 14 | gmail.com     |   4 | 2014-05-01    |
| 15 | gmail.com     |   8 | 2014-05-31    |
+----+---------------+-----+---------------+
Your report can be done in SQL on the MySQL side; Python then just runs the query, fetches the result set, and prints the rows.
Consider the following aggregate query, with a subquery and a derived table, which follows the percentage-growth formula:
((this month domain total cnt) - (last month domain total cnt))
/ (last month all domains total cnt)
SQL
SELECT domain_name, pct_growth
FROM (
    SELECT t1.domain_name,
           # SUM OF SPECIFIC DOMAIN'S CNT BETWEEN TODAY AND 30 DAYS AGO
           (SUM(CASE WHEN t1.date_of_entry >= (CURRENT_DATE - INTERVAL 30 DAY)
                     THEN t1.cnt ELSE 0 END)
            -
            # SUM OF SPECIFIC DOMAIN'S CNT AS OF 30 DAYS AGO
            SUM(CASE WHEN t1.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY)
                     THEN t1.cnt ELSE 0 END)
           ) /
           # SUM OF ALL DOMAINS' CNT AS OF 30 DAYS AGO
           (SELECT SUM(t2.cnt) FROM domains t2
            WHERE t2.date_of_entry < (CURRENT_DATE - INTERVAL 30 DAY))
           AS pct_growth
    FROM domains t1
    GROUP BY t1.domain_name
) AS derivedTable
ORDER BY pct_growth DESC
LIMIT 50;
Python
cur = db.cursor()
sql = "SELECT * FROM ..."  # SEE ABOVE
cur.execute(sql)
for row in cur.fetchall():
    print(row)
If I understand correctly, you just need the ratio of the past thirty days to the total count. You can get this using conditional aggregation. So, assuming that cnt is always greater than 0:
select d.domain_name,
       sum(cnt) as CntTotal,
       sum(case when date_of_entry >= date_sub(now(), interval 1 month)
                then cnt else 0 end) as Cnt30Days,
       (sum(case when date_of_entry >= date_sub(now(), interval 1 month)
                 then cnt else 0 end) / sum(cnt)) as Ratio30Days
from domains d
group by d.domain_name
order by Ratio30Days desc;

Row number from one column, but then re-order using another column

I'm aggregating (summing) some data from a purchases table, aggregated by total amount per region.
Data looks something like the following:
| id | region | purchase_amount |
| 1 | A | 30 |
| 2 | A | 35 |
| 3 | B | 41 |
The aggregated data then looks like this, ordered by total_purchases:
| region | total_purchases |
| B | 1238 |
| A | 910 |
| D | 647 |
| C | 512 |
I'd like to get a ranking for each region, ordered by total_purchases. I can do this using row_number (using SQLAlchemy at the moment) and this results in a table looking like:
| rank | region | total_purchases |
| 1 | B | 1238 |
| 2 | A | 910 |
| 3 | D | 647 |
| 4 | C | 512 |
However, there's one more requirement:
I want region 'C' to always be the first row, but keep its ranking.
This would ideally result in a table looking like:
| rank | region | total_purchases |
| 4 | C | 512 |
| 1 | B | 1238 |
| 2 | A | 910 |
| 3 | D | 647 |
I can do one or the other, but I can't seem to combine these 2 features together. If I use a row_number() function, I get the proper ordering.
I can bring the region 'C' row always to the top by ordering across 2 columns:
ORDER BY
    CASE WHEN region = 'C' THEN 1 ELSE 0 END DESC,
    total_purchases DESC
However, I can't seem to combine these 2 requirements into the same query.
Use a CTE to achieve that: compute ROW_NUMBER() inside it, then apply the custom ordering in the main query.
;WITH C AS (
    SELECT ROW_NUMBER() OVER (ORDER BY total_purchases DESC) AS Rn,
           region,
           total_purchases
    FROM your_table
)
SELECT *
FROM C
ORDER BY (CASE WHEN region = 'C' THEN 1 ELSE 0 END) DESC,
         total_purchases DESC
Does this work?
select row_number() over (order by total_purchases desc) as rank,
       region, total_purchases
from table t
order by (case when region = 'C' then 1 else 0 end) desc,
         total_purchases desc;
This is about Postgres: we have a proper boolean type and can sort by any boolean expression directly:
SELECT rank() OVER (ORDER BY sum(purchase_amount) DESC NULLS LAST) AS rank
, region
, sum(purchase_amount) AS total_purchases
FROM purchases
GROUP BY region
ORDER BY (region <> 'C'), 1, region; -- region as tiebreaker
Explain
Window functions are executed after aggregate functions, so we don't need a subquery or CTE here. See: Best way to get result count before LIMIT was applied.
NULLS LAST? See: PostgreSQL sort by datetime asc, null first?
The final 1 references ordinal position 1 in the SELECT list, so we don't have to repeat the expression.
ORDER BY (region <> 'C')? In Postgres, false sorts before true, so the row with region = 'C' comes first. See: Sorting null values after all others, except special.
The window function rank() seems adequate. As opposed to row_number(), equal total_purchases rank the same. To break possible ties and get a stable result in such cases, add region (or whatever) as the last ORDER BY item.
If you use row_number() and only ORDER BY sum(purchase_amount), equal totals can switch places across separate calls. You could add another item to the ORDER BY clause of row_number() for a similar result, but an equal rank is more appropriate for equal total_purchases, I'd say.
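Since the question mentions SQLAlchemy, here is a rough Core rendering of the window + CASE ordering (a sketch only; purchases as a Table with region and purchase_amount columns, and the 1.x-style select([...]) and case([...]) signatures, are assumptions):

from sqlalchemy import case, func, select

total = func.sum(purchases.c.purchase_amount)

stmt = (
    select([
        func.row_number().over(order_by=total.desc()).label("rank"),
        purchases.c.region,
        total.label("total_purchases"),
    ])
    .group_by(purchases.c.region)
    .order_by(
        # float region 'C' to the top, keep the rest ordered by total
        case([(purchases.c.region == "C", 1)], else_=0).desc(),
        total.desc(),
    )
)

Swapping func.row_number() for func.rank() gives the tie-friendly ranking the last answer recommends.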
