How to update a Postgres table column using a pandas DataFrame? - python

I am adding a single column to a Postgres table with 100+ columns via Django (a new migration). How can I update a column in a PostgreSQL table with the data from a pandas DataFrame? The pseudo-SQL for the Postgres UPDATE would be:
UPDATE wide_table wt
SET wt.z = df.z
WHERE date = 'todays_date'
The reason for doing it this way is that I am computing a column in the data_frame using a CSV that is in S3 (this is df.z). The docs for Postgres update are straightforward to use, but I am unsure how to do this via Django, sqlalchemy, pyodbc, or the like.
I apologize if this is a bit convoluted. A small and incomplete example would be:
Wide Table (pre-update column z)
identifier | x | y | z   | date
-----------+---+---+-----+------
foo        | 2 | 1 | 0.0 | ...
bar        | 2 | 8 | 0.0 | ...
baz        | 3 | 7 | 0.0 | ...
foo        | 2 | 8 | 0.0 | ...
foo        | 1 | 5 | 0.0 | ...
baz        | 2 | 8 | 0.0 | ...
bar        | 9 | 3 | 0.0 | ...
baz        | 2 | 3 | 0.0 | ...
Example Python snippet
import pandas as pd

def apply_function(identifier):
    # Maps baz -> 15.0, bar -> 19.6, foo -> 10.0 for a single date
    df = pd.read_csv("s3_file_path/date_file_name.csv")
    # Compute 'z' based on the identifier and the S3 CSV
    return z

postgres_query = "SELECT identifier FROM wide_table"
df = pd.read_sql(sql=postgres_query, con=engine)
df['z'] = df.identifier.apply(apply_function)

# Python / SQL update logic here to update the Postgres column
???
Wide Table (post-update column z)
identifier | x | y | z    | date
-----------+---+---+------+------
foo        | 2 | 1 | 10.0 | ...
bar        | 2 | 8 | 19.6 | ...
baz        | 3 | 7 | 15.0 | ...
foo        | 2 | 8 | 10.0 | ...
foo        | 1 | 5 | 10.0 | ...
baz        | 2 | 8 | 15.0 | ...
bar        | 9 | 3 | 19.6 | ...
baz        | 2 | 3 | 15.0 | ...
NOTE: The values in z will change daily so simply creating another table to hold these z values is not a great solution. Also, I'd really prefer to avoid deleting all of the data and adding it back.

I ran into a similar problem, and the currently accepted solution was too slow for me. My table had 500k+ rows and I needed to update 100k+ of them. After lengthy research and trial and error, I arrived at an efficient and correct solution.
The idea is to use psycopg2 as your writer together with a temp table. df is your pandas DataFrame containing the values you want to set.
import psycopg2

conn = psycopg2.connect("dbname='db' user='user' host='localhost' password='test'")
cur = conn.cursor()

rows = zip(df.id, df.z)

# Stage the new values in a temp table; ON COMMIT DROP cleans it up automatically.
# Adjust the column types to match your data (e.g. REAL instead of INTEGER for floats).
cur.execute("""CREATE TEMP TABLE codelist(id INTEGER, z INTEGER) ON COMMIT DROP""")
cur.executemany("""INSERT INTO codelist (id, z) VALUES (%s, %s)""", rows)

# Update the target table in a single statement from the staged rows
cur.execute("""
    UPDATE table_name
    SET z = codelist.z
    FROM codelist
    WHERE codelist.id = table_name.id;
    """)

print(cur.rowcount)  # number of rows updated
conn.commit()
cur.close()
conn.close()
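A hedged note: if the executemany insert itself becomes the bottleneck on very large frames, psycopg2.extras.execute_values batches the rows into a single multi-row INSERT. A minimal sketch, reusing the cur and df from above:

from psycopg2.extras import execute_values

# Sends the staged rows as one multi-row INSERT statement, which is
# typically much faster than executemany for large row counts
execute_values(cur, "INSERT INTO codelist (id, z) VALUES %s", list(zip(df.id, df.z)))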

I managed to cobble together a solution myself: I zip the id and z values, then execute a generic SQL UPDATE statement utilizing UPDATE ... FROM (VALUES ...).
Data Prep
sql_query = "SELECT id, a FROM wide_table"
df = pd.read_sql(sql=sql_query, con=engine)
df['z'] = df.a.apply(apply_function)

zipped_vals = zip(df.id, df.z)
tuple_to_str = str(tuple(zipped_vals))
entries_to_update = tuple_to_str[1:-1]  # remove the outer parens around the tuple
SQL Query Solution:
# Update column z by matching id between the SQL table and the pandas DataFrame
update_sql_query = f"""UPDATE wide_table t SET z = v.z
FROM (VALUES {entries_to_update}) AS v (id, z)
WHERE t.id = v.id;"""

with engine.begin() as conn:
    conn.execute(update_sql_query)
    conn.execute(sql_query)  # optional: re-run the SELECT to verify the update
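Note that the f-string above interpolates the values directly into the SQL. A hedged alternative sketch (assuming SQLAlchemy 1.4+ and the same wide_table, id, and z names) binds them as parameters in an executemany-style call instead:

from sqlalchemy import text

# A list of dicts makes SQLAlchemy issue the UPDATE once per row with bound
# parameters, avoiding a hand-built VALUES string
params = [{"id": i, "z": z} for i, z in zip(df.id, df.z)]
with engine.begin() as conn:
    conn.execute(text("UPDATE wide_table SET z = :z WHERE id = :id"), params)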
Answer on updating PostgreSQL table column from values
PostgreSQL update docs

Related

SQL - Conditionally join and replace values between two tables

I have two tables: one holds "raw" data and the other holds "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement that this data be stored separately.
I want a query with the following conditions:
Select all rows from the first table
If there is a matching row in the second table (i.e. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), we use the most recent row from the second table (where most recent is based on primary_key_col_3), rather than the row from the first.
Otherwise we use the values from the first table.
Note: I have many more "value" columns in the actual data. Consider the following toy example with two tables, raw_d and edit_d:
raw_d:

 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 0           | 1
 src_2             | dest_2            | 5           | 4
 src_3             | dest_3            | 2           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9

edit_d:

 primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2
-------------------+-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 2020-05-09        | 7           | 0
 src_2             | dest_2            | 2020-05-08        | 6           | 1
 src_3             | dest_3            | 2020-05-07        | 5           | 2
 src_1             | dest_1            | 2020-05-08        | 3           | 4
 src_2             | dest_2            | 2020-05-09        | 2           | 5
The expected result is as given:
 primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------+-------------------+-------------+-------------
 src_1             | dest_1            | 7           | 0
 src_2             | dest_2            | 2           | 5
 src_3             | dest_3            | 5           | 2
 src_4             | dest_4            | 6           | 3
 src_5             | dest_5            | 9           | 9
My proposed solution is to query the "greatest n per group" with the second table and then "overwrite" rows in a query of the first table, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select "the greatest n per group" would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data like in Replace column values based on another dataframe python pandas - better way?.
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.
I would suggest phrasing the requirements as:
select the most recent row from the second table
bring in additional rows from the first table that don't match
This is a union all with distinct on:
(select distinct on (primary_key_col_1, primary_key_col_2)
        u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from updated u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
                  from updated u
                  where u.primary_key_col_1 = r.primary_key_col_1 and
                        u.primary_key_col_2 = r.primary_key_col_2
                 );
As I understood your question, there are two ways to solve this:
1. Using FULL OUTER JOIN
with cte as (
    select distinct on (primary_key_col_1, primary_key_col_2) *
    from edit_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
select
    coalesce(t1.primary_key_col_1, t2.primary_key_col_1),
    coalesce(t1.primary_key_col_2, t2.primary_key_col_2),
    coalesce(t1.value_col_1, t2.value_col_1),
    coalesce(t1.value_col_2, t2.value_col_2)
from cte t1
full outer join raw_d t2
    on t1.primary_key_col_1 = t2.primary_key_col_1
    and t1.primary_key_col_2 = t2.primary_key_col_2
2. Using Union
select distinct on (primary_key_col_1, primary_key_col_2)
    primary_key_col_1, primary_key_col_2, value_col_1, value_col_2
from (
    select * from edit_d
    union all
    select primary_key_col_1, primary_key_col_2, null as "primary_key_col_3",
           value_col_1, value_col_2
    from raw_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last
) tab

SQLAlchemy: Insert or Update when column value is a duplicate

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a sqlalchemy query that does the following:
finds entries in table B that either do not exist yet in table A, and inserts them
finds entries in table B that do exist in table A but have a different value for the num column, and updates them
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect the differences, but will adding order_by(B.entry_date.desc()) ensure I only do the exists comparisons against the latest str_identifier values?
My Current Query
query = (
    select([B.str_identifier, B.num])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        ~exists().where(
            and_(
                B.str_identifier == A.str_identifier,
                B.num == A.num,
                ~B.num.in_([None]),
            )
        )
    )
)
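For the insert-or-update step itself, one hedged sketch uses PostgreSQL's INSERT ... ON CONFLICT via SQLAlchemy's postgresql dialect. This assumes an ORM session, declarative models A and B, a UNIQUE constraint on A.str_identifier, and a server-side default for A.id; none of these are confirmed by the question:

from sqlalchemy.dialects.postgresql import insert

# Newest row per str_identifier from B, via Postgres DISTINCT ON
latest = (
    session.query(B.str_identifier, B.num)
    .distinct(B.str_identifier)
    .order_by(B.str_identifier, B.entry_date.desc())
    .all()
)

# Upsert into A: insert new identifiers, update num for existing ones.
# Assumes A.str_identifier is UNIQUE and A.id has a default.
ins = insert(A.__table__).values(
    [{"str_identifier": s, "num": n} for s, n in latest]
)
upsert = ins.on_conflict_do_update(
    index_elements=["str_identifier"],
    set_={"num": ins.excluded.num},
)
session.execute(upsert)
session.commit()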

Django equivalent of SELECT * GROUP BY in MySQL

I'm having trouble using .annotate() and .aggregate() in the Django ORM.
My table structure:
| id | group_id | date_time                  |
|----|----------|----------------------------|
| 1  | 1        | 2020-01-25 19:51:46.603859 |
| 2  | 2        | 2020-01-24 18:40:24.301419 |
| 3  | 1        | 2020-01-25 20:14:11.123860 |
| 4  | 2        | 2020-01-25 05:20:21.507901 |
I have the following MySQL Query:
SELECT m.*
FROM my_table m
JOIN (
SELECT group_id, max(date_time) as max_date
FROM my_table
GROUP BY group_id
) as s on m.group_id=s.group_id and m.date_time=s.max_date
Which returns:
| id | group_id | date_time                  |
|----|----------|----------------------------|
| 3  | 1        | 2020-01-25 20:14:11.123860 |
| 4  | 2        | 2020-01-25 05:20:21.507901 |
And I'm trying to convert it to Django ORM so I can have a full QuerySet of objects. Until now I have been using this code:
unique_qs = MyModel.objects.filter(id__lte=50).values_list('group_id', flat=True).distinct()
unique_obj = []
for qs in unique_qs:
    unique_obj.append(MyModel.objects.filter(group_id=qs).latest('date_time'))
But it's really inefficient and time-consuming. Could you give me a lead on how to achieve this?
First, import the Max function:
from django.db.models import Max
Then this is what you need:
MyModel.objects.filter(id__lte=50).values('group_id').order_by('group_id').annotate(date_time_max=Max('date_time'))
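The annotate above returns dicts of group_id and the latest date_time rather than full model instances. If complete rows are needed, as the question asks, one hedged sketch (assuming Django 2.0+, where Subquery and OuterRef are available) is:

from django.db.models import OuterRef, Subquery

# For each row's group_id, find the pk of the newest row in that group,
# then keep only the rows whose pk matches it
newest = (
    MyModel.objects
    .filter(group_id=OuterRef('group_id'))
    .order_by('-date_time')
)
qs = MyModel.objects.filter(
    id__lte=50,
    pk=Subquery(newest.values('pk')[:1]),
)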

Getting SQL to filter rows from a join, on a dynamic property

I'm not quite sure how to ask this question, so I'm going to demonstrate what I'm trying to do with an example. I'm using Python SQLAlchemy, but an answer in plain old SQL would be fine, just so I could understand the query.
I have two tables (this is a contrived example), that look something like this:
Table: Users
id | username
1 | john
2 | bob
3 | mary
4 | sally
Table: Updates
id | date | message | user_id
1 | 11-14 | m3 | 1
2 | 11-13 | m2 | 1
3 | 11-12 | m1 | 1
4 | 11-13 | n2 | 2
5 | 11-12 | n1 | 2
6 | 11-12 | o1 | 3
The "updates" table is populated daily, via a script, but each user may not have an update for that day. I'm trying to figure out how I can query both tables to pull the "latest" update for each user. For example, I'd want my output to look something like the following:
username | date | message
john | 11-14 | m3
bob | 11-13 | n2
mary | 11-12 | o1
sally | |
I get tripped up because the "date" value won't be the same for each user, so I can't match that column against a static value.
From a high level, I picture the SQLAlchemy looking something like this:
Updates.query.join(Users).filter((Users.id == Updates.user_id) & (Updates.date == LATEST_DATE_IN_UPDATES_BY_USER)).all()
Where the "LATEST_DATE_IN_UPDATES_BY_USER" represents the most recent row in the update table, for that user. I'm not sure how to get that behavior, though.
In SQL the query would look something like this:
SELECT Us.username, Up.Date, Up.Message
FROM USERS Us
LEFT JOIN UPDATES Up on Us.id = Up.user_id
AND Up.date = (SELECT MAX(date) FROM UPDATES u WHERE u.user_id = Up.user_id)
Using sqlalchemy, the simple solution (similar to the answer from #radu-gheorghiu) would be:
from sqlalchemy import and_, func

sq = (
    session
    .query(Update.user_id, func.max(Update.date).label("max_date"))
    .group_by(Update.user_id)
).subquery("subq")

q = (
    session
    .query(User.username, Update.date, Update.message)
    .outerjoin(sq, User.id == sq.c.user_id)
    .outerjoin(Update, and_(User.id == Update.user_id, Update.date == sq.c.max_date))
)
And if your database supports RANK functions, you can use:
expr = (
    func.rank()
    .over(partition_by=Update.user_id, order_by=Update.date.desc())
    .label("rank")
)

# need a subquery because RANK cannot be used directly in the WHERE clause
sq = session.query(Update, expr).subquery("subq")

q = (
    session
    .query(User.username, Update.date, Update.message)
    .select_entity_from(sq)
    .join(User)
    .filter(sq.c.rank == 1)
)

How to substitute a list in python/mysql

This uses the Python mysql.connector from MySQL.
I want to write an update query where id is in a list, e.g.
UPDATE tbl SET thing=1 WHERE id IN (1,2,3,4,5);
If I was placeholding single elements, I would write:
qry = "UPDATE tbl SET thing=1 WHERE id=%s"
cur.execute(qry, (var,))
I don't know how long my list will be each time, so I can't write %s, %s, %s, ...n placeholders by hand. I could ",".join(list) and build a raw query string each time, but that feels like a hack.
Is there a preferred way to do something like this? This might be a wider question about using placeholders in queries in general, but I'm not sure.
If your query requires IN, then use IN; there is no need to replace it with =. For example:
mysql> select * from foo;
+------+-------+
| id | thing |
+------+-------+
| 1 | 5 |
| 2 | 5 |
| 3 | 5 |
| 4 | 5 |
| 5 | 5 |
+------+-------+
5 rows in set (0.00 sec)
>>> cur = conn.cursor()
>>> my_ids
[1, 3, 5]
>>> sql
'UPDATE foo SET thing=1 WHERE id IN %s'
>>> cur.execute(sql, (my_ids,))
3L
>>> conn.commit()
Then all rows will be updated:
mysql> select * from foo;
+------+-------+
| id | thing |
+------+-------+
| 1 | 1 |
| 2 | 5 |
| 3 | 1 |
| 4 | 5 |
| 5 | 1 |
+------+-------+
5 rows in set (0.00 sec)
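Whether a single %s expands a Python list like this varies between MySQL drivers, so a hedged, driver-agnostic alternative is to build one placeholder per element and keep the values parameterized:

# One %s per id, so any DB-API driver can bind the values safely
my_ids = [1, 3, 5]
placeholders = ", ".join(["%s"] * len(my_ids))
sql = "UPDATE foo SET thing=1 WHERE id IN ({})".format(placeholders)
cur.execute(sql, my_ids)
conn.commit()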
