SQL - Conditionally join and replace values between two tables - python

I have two tables where one is holding "raw" data and another is holding "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.
I want a query with the following conditions:
Select all rows from the first table
If there is a matching row in the second table (i.e. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), use the most recent row from the second table rather than the first, where "most recent" is determined by the primary_key_col_3 column
Otherwise, use the values from the first table.
Note: I have many more "value" columns in the actual data. Consider the following toy example with two similar tables, raw_d (shown first) and edit_d (shown second):
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 0 | 1
src_2 | dest_2 | 5 | 4
src_3 | dest_3 | 2 | 2
src_4 | dest_4 | 6 | 3
src_5 | dest_5 | 9 | 9
primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 2020-05-09 | 7 | 0
src_2 | dest_2 | 2020-05-08 | 6 | 1
src_3 | dest_3 | 2020-05-07 | 5 | 2
src_1 | dest_1 | 2020-05-08 | 3 | 4
src_2 | dest_2 | 2020-05-09 | 2 | 5
The expected result is as given:
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 7 | 0
src_2 | dest_2 | 2 | 5
src_3 | dest_3 | 5 | 2
src_4 | dest_4 | 6 | 3
src_5 | dest_5 | 9 | 9
My proposed solution is to run a "greatest n per group" query against the second table and then "overwrite" the matching rows of the first table's query result, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select "the greatest n per group" would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data like in Replace column values based on another dataframe python pandas - better way?.
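For illustration, that pandas overwrite step could be sketched roughly as follows (raw_df and latest_edits_df stand for the results of the two queries above, e.g. loaded with pandas.read_sql; this is only one possible approach, not necessarily the one from the linked answer):

import pandas as pd

# Toy frames standing in for the results of the two queries above.
raw_df = pd.DataFrame({
    "primary_key_col_1": ["src_1", "src_2", "src_3", "src_4", "src_5"],
    "primary_key_col_2": ["dest_1", "dest_2", "dest_3", "dest_4", "dest_5"],
    "value_col_1": [0, 5, 2, 6, 9],
    "value_col_2": [1, 4, 2, 3, 9],
})
latest_edits_df = pd.DataFrame({
    "primary_key_col_1": ["src_1", "src_2", "src_3"],
    "primary_key_col_2": ["dest_1", "dest_2", "dest_3"],
    "value_col_1": [7, 2, 5],
    "value_col_2": [0, 5, 2],
})

keys = ["primary_key_col_1", "primary_key_col_2"]

# Edited rows win; keys without an edit keep the raw values.
result = (
    latest_edits_df.set_index(keys)
    .combine_first(raw_df.set_index(keys))
    .reset_index()
)
print(result)  # note: combine_first may upcast the value columns to float because of intermediate NaNs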
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.

I would suggest phrasing the requirements as:
select the most recent row from the second table
bring in additional rows from the first table that don't match
This is a union all with distinct on:
(select distinct on (primary_key_col_1, primary_key_col_2)
        u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from updated u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
                  from updated u
                  where u.primary_key_col_1 = r.primary_key_col_1
                    and u.primary_key_col_2 = r.primary_key_col_2
                 );
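Since the question mentions pandas as part of the stack, the combined query can also be pulled straight into a DataFrame in one call (a sketch; the connection string is a placeholder):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN -- point this at the actual PostgreSQL instance.
engine = create_engine("postgresql+psycopg2://user:password@localhost/dbname")

combined_sql = """
(select distinct on (primary_key_col_1, primary_key_col_2)
        u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from updated u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
                  from updated u
                  where u.primary_key_col_1 = r.primary_key_col_1
                    and u.primary_key_col_2 = r.primary_key_col_2)
"""
result = pd.read_sql(combined_sql, engine)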

As I understand from your question, there are two ways to solve this:
1. Using FULL OUTER JOIN
with cte as (
    select distinct on (primary_key_col_1, primary_key_col_2) *
    from edit_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
select
    coalesce(t1.primary_key_col_1, t2.primary_key_col_1),
    coalesce(t1.primary_key_col_2, t2.primary_key_col_2),
    coalesce(t1.value_col_1, t2.value_col_1),
    coalesce(t1.value_col_2, t2.value_col_2)
from cte t1
full outer join raw_d t2
    on t1.primary_key_col_1 = t2.primary_key_col_1
    and t1.primary_key_col_2 = t2.primary_key_col_2
2. Using Union
select distinct on (primary_key_col_1, primary_key_col_2)
    primary_key_col_1, primary_key_col_2, value_col_1, value_col_2
from (
    select * from edit_d
    union all
    select primary_key_col_1, primary_key_col_2, null as "primary_key_col_3",
           value_col_1, value_col_2
    from raw_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last
) tab

Related

Is there a way to improve a MERGE query?

I am using this query to insert new entries into my table:
MERGE INTO CLEAN clean USING DUAL ON (clean.id = :id)
WHEN NOT MATCHED THEN INSERT (ID, COUNT) VALUES (:id, :xcount)
WHEN MATCHED THEN UPDATE SET clean.COUNT = clean.count + :xcount
It seems that I do more inserts than updates; is there a way to improve my current performance?
I am using cx_Oracle with Python 3 and OracleDB 19c.
If you have massive problems with your approach, you are very probably missing an index on the column clean.id, which is required for your approach since the MERGE uses dual as a source for each row.
This is less likely, though, since you say the id is a primary key.
So basically you are doing the right thing, and you will see an execution plan similar to the one below:
---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | | | 2 (100)| |
| 1 | MERGE | CLEAN | | | | |
| 2 | VIEW | | | | | |
| 3 | NESTED LOOPS OUTER | | 1 | 40 | 2 (0)| 00:00:01 |
| 4 | TABLE ACCESS FULL | DUAL | 1 | 2 | 2 (0)| 00:00:01 |
| 5 | VIEW | VW_LAT_A18161FF | 1 | 38 | 0 (0)| |
| 6 | TABLE ACCESS BY INDEX ROWID| CLEAN | 1 | 38 | 0 (0)| |
|* 7 | INDEX UNIQUE SCAN | CLEAN_UX1 | 1 | | 0 (0)| |
---------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
7 - access("CLEAN"."ID"=:ID)
So the execution plan is fine and works effectively, but it has one problem.
Remember: whenever you use an index, you will be happy while processing a few rows, but it will not scale.
If you are processing millions of records, you may fall back to two-step processing:
insert all rows in a temporary table
perform a single MERGE statement using the temporary table
The big advantage is that Oracle can open a hash join and get rid of the index access for each of the million rows.
Here is an example test against the clean table, initialized with 1M ids (not shown), performing 1M inserts and 1M updates:
# Assumes an open cx_Oracle cursor and an existing staging table tmp(id, count).
n = 1000000
data2 = [{"id": i, "xcount": 1} for i in range(2 * n)]  # ids 0..2n-1: ~1M updates + ~1M inserts
sql3 = """insert into tmp (id, count)
values (:id, :xcount)"""
sql4 = """MERGE into clean USING tmp on (clean.id = tmp.id)
when not matched then insert (id, count) values (tmp.id, tmp.count)
when matched then update set clean.count = clean.count + tmp.count"""
cursor.executemany(sql3, data2)  # bulk-load the staging table
cursor.execute(sql4)             # one set-based MERGE instead of one per row
The test runs in approx. 10 seconds, which is less than half of your approach with MERGE using dual.
If this is still not enough, you'll have to use the parallel option.
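For completeness, the staging table tmp is not shown above; one way it might be created (an assumption, since the answer does not include its DDL) is as a global temporary table:

# Hypothetical one-time DDL for the staging table referenced as "tmp" above.
cursor.execute("""
    create global temporary table tmp (
        id    number primary key,
        count number
    ) on commit preserve rows
""")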
MERGE is quite fast. Inserts are usually faster than updates, I'd say.
So, if you're asking how to make inserts faster, then it depends.
If you're inserting one row at a time, there shouldn't be any bottleneck.
If you're inserting millions of rows, see whether there are triggers enabled on the table which fire for each row and do something (slowing the process down).
As for updates, is there an index on the clean.id column? If not, it would probably help.
Otherwise, see what the explain plan says; collect statistics regularly.
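To check both points from Python, the data dictionary can be queried directly, for instance (a sketch reusing the cx_Oracle cursor; user_triggers and user_ind_columns are standard Oracle dictionary views):

# Any enabled triggers on CLEAN that could fire per inserted/updated row?
cursor.execute("""
    select trigger_name, status
    from user_triggers
    where table_name = 'CLEAN'
""")
print(cursor.fetchall())

# Is CLEAN.ID covered by an index?
cursor.execute("""
    select index_name
    from user_ind_columns
    where table_name = 'CLEAN' and column_name = 'ID'
""")
print(cursor.fetchall())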

Retrieving a column value where the primary key is a tuple (a, b) and where rows (a, b') must exist for all b' values in a given list

I have a table of the form A|B|C where the tuple (a, b) in (A, B) is the primary key. I have a list of values (BVALs) for B, and I need the elements a in column A for which a row (a, b') exists for every b' in BVALs.
Currently I have implemented a script that first retrieves all (a, b') for the first element of BVALs, then iterates and refines the list until the last element of BVALs. I believe this will be slow on big databases and that a faster solution exists. I would appreciate any help.
Let's say we have the following table:
+---+---+
| A | B |
+---+---+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 3 | 1 |
+---+---+
If the BVALs list consists of (1, 2), then the query should return 1 and 2.
I understand that you want the a values that have all the b values. If so, you can use group by and having:
select a
from mytable
where b in (1, 2) -- either value
group by a
having count(*) = 2 -- both match
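As a quick self-contained check of that approach against the sample data (sketched with Python's built-in sqlite3; the SQL is standard and should behave the same in most databases):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table mytable (a integer, b integer, primary key (a, b));
    insert into mytable values (1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(3,1);
""")

bvals = (1, 2)
placeholders = ", ".join("?" for _ in bvals)
rows = conn.execute(
    f"select a from mytable where b in ({placeholders}) "
    "group by a having count(*) = ?",
    (*bvals, len(bvals)),
).fetchall()
print(rows)  # [(1,), (2,)] -- a = 3 only has b = 1, so it is excluded

If BVALs could contain duplicate values, deduplicate the list first or use count(distinct b) in the HAVING clause.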
Is this what you want?
select distinct a
from t
where b in ( . . . ); -- list of b values here

SQLAlchemy: Insert or Update when column value is a duplicate

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a sqlalchemy query that does the following:
finds entries in table B that do not yet exist in table A, and inserts them
finds entries in table B that do exist in table A but have a different value for the num column, and updates them
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect differences, but will adding order_by(B.entry_date.desc()) ensure I only do the exists comparison against the latest entry for each str_identifier?
My Current Query
query = (
    select([B.str_identifier, B.value])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        and_(
            ~exists().where(
                and_(
                    B.str_identifier == A.str_identifier,
                    B.value == A.value,
                    ~B.value.in_([None]),
                )
            )
        )
    )
)
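For what it's worth, an ORDER BY on the outer select does not limit which rows an EXISTS subquery considers. One way to pin the comparison to the latest row per str_identifier (a sketch, not part of the original post; it uses PostgreSQL's DISTINCT ON through the same 1.x-style select([...]) API as above) is to reduce B to a subquery first:

# Sketch only: keep just the most recent row per str_identifier in B,
# then use latest_b in place of B in the join / EXISTS comparison above.
latest_b = (
    select([B.str_identifier, B.num, B.entry_date])
    .distinct(B.str_identifier)                       # renders DISTINCT ON (str_identifier) on PostgreSQL
    .order_by(B.str_identifier, B.entry_date.desc())  # newest entry_date wins within each group
    .alias("latest_b")
)
# e.g. join(latest_b, A, onclause=latest_b.c.str_identifier == A.str_identifier, isouter=True)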

Split table into 2 new tables by index with sqlite3 and Python

I have a database.db file that has a table1 like this one. Note that the index is just codes and not numeric:
id | item | price
-------------
45f5 | book | 20
25h8 | copy | 30
as34 | pen | 10
t674 | key | 15
5h6f | ring | 25
67yu | mug | 40
and I would like to create two additional tables in my database.db named table2 and table3, where one contains the first 4 rows and the other the last 2 rows:
table2
id | item | price
-------------
45f5 | book | 20
25h8 | copy | 30
as34 | pen | 10
t674 | key | 15
table3
id | item | price
-------------
5h6f | ring | 25
67yu | mug | 40
I have been trying with CREATE TABLE, but I have too many columns in table1 to write them out one by one. What would be your approach to this problem? Thanks!
CREATE TABLE table2 AS SELECT * FROM table1 WHERE condition
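For example, from Python's sqlite3 this could look like the following (using rowid to stand in for "first 4 rows" is an assumption, since the id values are codes rather than numbers; it works for ordinary tables where rowid reflects insertion order):

import sqlite3

conn = sqlite3.connect("database.db")

# SQLite's CREATE TABLE ... AS SELECT copies the column layout and the data
# (though not constraints), so the many columns of table1 never have to be typed out.
conn.execute("CREATE TABLE table2 AS SELECT * FROM table1 WHERE rowid <= 4")
conn.execute("CREATE TABLE table3 AS SELECT * FROM table1 WHERE rowid > 4")
conn.commit()
conn.close()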
To create a table you normally have to specify the column names and types;
SELECT * FROM an already existing table will not work on its own. See https://www.w3schools.com/sql/sql_create_table.asp
You can then populate such a table from the already existing table like this:
INSERT INTO table2 SELECT * FROM table1 WHERE condition;
Try the above statement without the WHERE clause first to check whether it works, as I don't have access to an SQLite database right now to verify it.

Row number from one column, but then re-order using another column

I'm aggregating (summing) some data from a purchases table, aggregated by total amount per region.
Data looks something like the following:
| id | region | purchase_amount |
| 1 | A | 30 |
| 2 | A | 35 |
| 3 | B | 41 |
The aggregated data then looks like this, ordered by total_purchases:
| region | total_purchases |
| B | 1238 |
| A | 910 |
| D | 647 |
| C | 512 |
I'd like to get a ranking for each region, ordered by total_purchases. I can do this using row_number (using SQLAlchemy at the moment) and this results in a table looking like:
| rank | region | total_purchases |
| 1 | B | 1238 |
| 2 | A | 910 |
| 3 | D | 647 |
| 4 | C | 512 |
However, there's one more requirement I'd like to add on top of this:
I want region 'C' to always be the first row, but keep its ranking.
This would ideally result in a table looking like:
| rank | region | total_purchases |
| 4 | C | 512 |
| 1 | B | 1238 |
| 2 | A | 910 |
| 3 | D | 647 |
I can do one or the other, but I can't seem to combine these 2 features together. If I use a row_number() function, I get the proper ordering.
I can bring the region 'C' row always to the top using an ordering across 2 columns:
ORDER BY
CASE WHEN region = 'C' THEN 1 ELSE 0 END DESC,
total_purchases DESC
However, I can't seem to combine these 2 requirements into the same query.
Use a CTE to achieve that: compute the ROW_NUMBER inside it, then apply your custom ordering in the main query.
;WITH C AS(
SELECT ROW_NUMBER() OVER (ORDER BY total_purchases DESC) AS Rn
,region
,total_purchases
FROM your_table
)
SELECT *
FROM C
ORDER BY (CASE WHEN region = 'C' THEN 1 ELSE 0 END) DESC
,total_purchases DESC
Does this work?
select row_number() over (order by total_purchases desc) as rank,
region, total_purchases
from table t
order by (case when region = 'C' then 1 else 0 end) desc, total_purchases desc;
Since this is about Postgres, we have a proper boolean type and can sort by any boolean expression directly:
SELECT rank() OVER (ORDER BY sum(purchase_amount) DESC NULLS LAST) AS rank
, region
, sum(purchase_amount) AS total_purchases
FROM purchases
GROUP BY region
ORDER BY (region <> 'C'), 1, region; -- region as tiebreaker
Explanation:
Window functions are executed after aggregate functions, so we don't need a subquery or CTE here. See: Best way to get result count before LIMIT was applied.
NULLS LAST? See: PostgreSQL sort by datetime asc, null first?
The final 1 is referencing the ordinal position 1 in the SELECT list, so we don't have to repeat the expression.
ORDER BY (region <> 'C')? See: Sorting null values after all others, except special.
The window function rank() seems adequate. As opposed to row_number(), equal total_purchases rank the same. To break possible ties and get a stable result in such cases, add region (or whatever) as last item to ORDER BY.
If you use row_number() and only ORDER BY sum(purchase_amount), equal totals can switch places in two separate calls. You could add another item to the ORDER BY clause of row_number() for a similar result, but an equal rank is more appropriate for equal total_purchases I'd say.
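To see the rank-then-reorder behaviour end to end without a Postgres instance at hand, here is a small self-contained sketch using Python's sqlite3 (window functions need SQLite 3.25+; in SQLite the boolean expression simply sorts as 0/1):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table purchases (id integer primary key, region text, purchase_amount integer);
    insert into purchases (region, purchase_amount) values
        ('B', 1238), ('A', 910), ('D', 647), ('C', 512);
""")

rows = conn.execute("""
    with totals as (
        select region, sum(purchase_amount) as total_purchases
        from purchases
        group by region
    )
    select rank() over (order by total_purchases desc) as rnk,
           region,
           total_purchases
    from totals
    order by (region <> 'C'), total_purchases desc
""").fetchall()

for row in rows:
    print(row)
# (4, 'C', 512)
# (1, 'B', 1238)
# (2, 'A', 910)
# (3, 'D', 647)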
