How to return specific dictionary keys from within a nested list from a jsonb column in sqlalchemy - python

I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
tmp.item->>'id',
tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) as tmp
Example json (each user has 20+ columns)
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, datetime of json extraction, and the jsonb column for the extract
CREATE TABLE tpeople
(
record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass), -- sequence: INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 9223372036854775807 CACHE 1
scrape_time timestamp without time zone NOT NULL,
data jsonb NOT NULL,
CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True, server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # generic connector for my db
from model.models import people
from sqlalchemy import func
sess = get_session()
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I do seems to let me select only specific keys from within item, the way the raw SQL above does. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])

After searching for a few hours, I eventually found someone who had stumbled on this while trying to get a different result.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
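If you also want the result rows to expose these values by name, the same expressions can be labelled; a small sketch based on the answer above (the label names are just illustrative):
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(
    sub.c.item.op('->>')('id').label('id'),
    sub.c.item.op('->>')('name').label('name'),
).all()
for row in test:
    print(row.id, row.name)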

Related

Python sqlAlchemy bulk update but non-serializable JSON attributes

I am trying to better understand how I can bulk update rows in SQLAlchemy, applying to each row a Python function whose result has to be dumped to JSON, without having to iterate over the rows individually:
def do_something(x):
    return x.id + x.offset

table.update({Table.updated_field: do_something(Table)})
This is a simplification of what I am trying to accomplish, except that I get the error TypeError: Object of type InstrumentedAttribute is not JSON serializable.
Any thoughts on how to fix the issue here?
Why are you casting your Table id to a JSON string? Remove it and try.
Edit:
You can't call the Python function on the mapped class itself in a bulk update; you can, for example:
table.update({Table.updated_field: json.dumps(object_of_my_table_variable._asdict())})
If you want to update your column attribute with the whole object, you must loop and dump it in the update, as in:
for table in dbsession.query(Table):
    table.updated_field = json.dumps(table._asdict())
    dbsession.add(table)
If you need to update millions of rows the best way from my experience is with Pandas and bulk_update_mappings.
You can load data from the DB in bulk as a DataFrame with read_sql by passing a query statement and your engine object.
import pandas as pd
query = session.query(Table)
table_data = pd.read_sql(query.statement, engine)
Note that read_sql has a chunksize parameter which causes it to return an iterator, so if the table is too large to fit in memory, you can throw this in a loop with however many rows your pc can handle at once:
for table_chunk in pd.read_sql(query.statement, engine, chunksize=1_000_000):
    ...
From there you can use apply to alter each column with any custom function you want:
table_data["column_1"] = table_data["column_1"].apply(do_something)
Then, converting the DataFrame to a dict with the records orientation puts it in the appropriate format for bulk_update_mappings:
table_data = table_data.to_dict("records")
session.bulk_update_mappings(Table, table_data)
session.commit()
Additionally, if you need to perform a lot of json operations for your updates, I've used orjson for updates like this in the past which also provides a notable speed improvement over the standard library's json.
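As a rough sketch of that last point, reusing the column_1 column from above and assuming its values need to be serialised to JSON strings before the bulk update:
import orjson

# orjson.dumps returns bytes, so decode to store a str in the JSON/text column
table_data["column_1"] = table_data["column_1"].apply(lambda value: orjson.dumps(value).decode())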
Without the requirement to serialise to JSON,
session.query(Table).update({'updated_field': Table.id + Table.offset})
would work fine, performing all computations and updates in the database. However
session.query(Table).update({'updated_field': json.dumps(Table.id + Table.offset)})
does not work, because it mixes Python-level operations (json.dumps) with database-level operations (add id and offset for all rows).
Fortunately, many RDBMS provide JSON functions (SQLite, PostgreSQL, MariaDB, MySQL) so that we can do the work solely in the database layer. This is considerably more efficient than fetching data into the Python layer, mutating it, and writing it back to the database. Unfortunately, the available functions and their behaviours are not consistent across RDBMS.
The following script should work for SQLite, PostgreSQL and MariaDB (and probably MySQL too). These assumptions are made:
id and offset are both columns in the database table being updated
both are integers, as is their sum
the desired result is that their sum is written to a JSON column as a scalar
import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.dialects.postgresql import JSONB

urls = [
    'sqlite:///so73956014.db',
    'postgresql+psycopg2:///test',
    'mysql+pymysql://root:root@localhost/test',
]

for url in urls:
    engine = sa.create_engine(url, echo=False, future=True)
    print(f'Checking {engine.dialect.name}')
    Base = orm.declarative_base()
    JSON = JSONB if engine.dialect.name == 'postgresql' else sa.JSON

    class Table(Base):
        __tablename__ = 't73956014'
        id = sa.Column(sa.Integer, primary_key=True)
        offset = sa.Column(sa.Integer)
        updated_field = sa.Column(JSON)

    Base.metadata.drop_all(engine, checkfirst=True)
    Base.metadata.create_all(engine)

    Session = orm.sessionmaker(engine, future=True)

    with Session.begin() as s:
        ts = [Table(offset=o * 10) for o in range(1, 4)]
        s.add_all(ts)

    # Use DB-specific function to serialise to JSON.
    if engine.dialect.name == 'postgresql':
        func = sa.func.to_jsonb
    else:
        func = sa.func.json_quote

    # MariaDB requires that the argument to json_quote is a character type.
    if engine.dialect.name in ['mysql', 'mariadb']:
        expr = sa.cast(Table.id + Table.offset, sa.Text)
    else:
        expr = Table.id + Table.offset

    with Session.begin() as s:
        s.query(Table).update(
            {Table.updated_field: func(expr)}
        )

    with Session() as s:
        ts = s.scalars(sa.select(Table))
        for t in ts:
            print(t.id, t.offset, t.updated_field)

    engine.dispose()
Output:
Checking sqlite
1 10 11
2 20 22
3 30 33
Checking postgresql
1 10 11
2 20 22
3 30 33
Checking mysql
1 10 11
2 20 22
3 30 33
Other functions can be used if the desired result is an object or array. If updating an existing JSON column value, the column may need to use the Mutable extension.
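For example, a minimal sketch of such a column definition, continuing the sa / JSONB imports from the script above and assuming the column holds JSON objects (dicts) rather than scalars:
from sqlalchemy.ext.mutable import MutableDict

class Table(Base):
    __tablename__ = 't73956014'
    id = sa.Column(sa.Integer, primary_key=True)
    offset = sa.Column(sa.Integer)
    # MutableDict detects in-place mutations (e.g. obj.updated_field['k'] = v)
    # and marks the attribute dirty so the ORM emits an UPDATE on flush
    updated_field = sa.Column(MutableDict.as_mutable(JSONB))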

How to query with like() when using many-to-many relationships in SQLAlchemy?

I have the following many-to-many relationship defined in SQLAlchemy:
training_ids_association_table = db.Table(
    "training_ids_association",
    db.Model.metadata,
    Column("training_id", Integer, ForeignKey("training_sessions.id")),
    Column("ids_id", Integer, ForeignKey("image_data_sets.id")),
)

class ImageDataSet(db.Model):
    __tablename__ = "image_data_sets"

    id = Column(Integer, primary_key=True)
    tags = Column(String)

    trainings = relationship("TrainingSession", secondary=training_ids_association_table, back_populates="image_data_sets")

class TrainingSession(db.Model):
    __tablename__ = "training_sessions"

    id = Column(Integer, primary_key=True)

    image_data_sets = relationship("ImageDataSet", secondary=training_ids_association_table, back_populates="trainings")
Note the field ImageDataSet.tags, which can contain a list of string items (i.e. tags), separated by a slash character. If possible I would rather stick to that format instead of creating a new table just for these tags.
What I want now is to query the TrainingSession table for all entries that have a certain tag set in their related ImageDataSets. Now, if an ImageDataSet has only one tag saved in the tags field, then the following works:
TrainingSession.query.filter(TrainingSession.image_data_sets.any(tags=find_tag))
However, as soon as there are multiple tags in the tags field (e.g. something like "tag1/tag2/tag3"), then of course this filter above does not work any more. So I tried it with a like:
.filter(TrainingSession.image_data_sets.like(f'%{find_tag}%'))
But this leads to a NotImplementedError in SQLAlchemy. So is there a way to achieve what I am trying to do here, or do I necessarily need another table for the tags per ImageDataSet?
You can apply any filters on related model columns if you join this model first:
query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(ImageDataSet.tags.like(f"%{find_tag}%"))
This query is translated to the following SQL statement:
SELECT training_sessions.id FROM training_sessions
JOIN training_ids_association ON training_sessions.id = training_ids_association.training_id
JOIN image_data_sets ON image_data_sets.id = training_ids_association.ids_id
WHERE image_data_sets.tags LIKE %(find_tag)s
Note that you may stumble into a problem with storing tags as strings with separators. If some records have the tags tag1, tag12 and tag123, they will all pass the filter LIKE '%tag1%'.
It would be better to switch to an ARRAY column if your database supports that column type (PostgreSQL, for example). Your column may be defined like this:
tags = Column(ARRAY(String))
And the query may look like this:
query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(ImageDataSet.tags.any(find_tag))
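If the column uses the dialect-specific type from sqlalchemy.dialects.postgresql, a containment filter is another option; a sketch (contains renders as the @> operator):
from sqlalchemy.dialects.postgresql import ARRAY

# column definition on ImageDataSet
tags = Column(ARRAY(String))

# matches rows whose tags array contains find_tag (tags @> ARRAY[find_tag])
query = session.query(TrainingSession). \
    join(TrainingSession.image_data_sets). \
    filter(ImageDataSet.tags.contains([find_tag]))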

SQLAlchemy query against a view not returning full results

I am using Flask-SQLAlchemy (flask_sqlalchemy==2.3.2) for my Flask webapp.
For normal table queries it has performed flawlessly, but now I am transitioning to making some of the logic into SQL Views and SQLAlchemy is not capturing the full results.
This is my specific example:
SQL View view_ticket_counts:
CREATE VIEW view_ticket_counts AS
SELECT event_id, price_id, COUNT(1) AS ticket_count FROM public.tickets
GROUP BY event_id, price_id
When I run this as a normal SQL query with pgAdmin:
SELECT * FROM view_ticket_counts WHERE event_id=1
I get the results:
|event_id|price_id|ticket_count|
| 1 | 1 | 3 |
| 1 | 2 | 1 |
However, if I run a python SQLAlchemy query like so:
ticket_counts = ViewTicketCounts.query.filter_by(event_id=1).all()
for tc in ticket_counts:
    print(tc.event_id, tc.price_id, tc.ticket_count)
It only prints one result: 1 1 3
So for some reason the SQLAlchemy query or implementation is only fetching the first element, even with .all().
For completion this is my View Model class:
class ViewTicketCounts(db.Model):
    event_id = db.Column(BigInteger, primary_key=True)
    price_id = db.Column(BigInteger)
    ticket_count = db.Column(BigInteger)
Your view's actual key is (event_id, price_id), not just event_id. The reason why you are only seeing the first row is that when querying for model objects / entities, the ORM consults the identity map for each fetched row based on its primary key, and if the object has already been included in the results, the row is skipped. So in your case, when the second row is processed, SQLAlchemy finds that an object with primary key 1 already exists in the results and simply ignores the row (since there is no joined eager loading involved).
The fix is simple:
class ViewTicketCounts(db.Model):
    event_id = db.Column(BigInteger, primary_key=True)
    price_id = db.Column(BigInteger, primary_key=True)
    ticket_count = db.Column(BigInteger)
This sort of implicit "distinct on" is mentioned and reasoned about in the ORM tutorial under "Adding and Updating Objects" and "Joined Load".
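An alternative, if you would rather not declare a composite primary key, is to query individual columns instead of the mapped entity; column-only queries bypass the identity map, so no rows are collapsed. A sketch against the same view model:
ticket_counts = db.session.query(
    ViewTicketCounts.event_id,
    ViewTicketCounts.price_id,
    ViewTicketCounts.ticket_count,
).filter(ViewTicketCounts.event_id == 1).all()

for tc in ticket_counts:
    print(tc.event_id, tc.price_id, tc.ticket_count)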

Bulk Upsert with SQLAlchemy Postgres

I'm following the SQLAlchemy documentation here to write a bulk upsert statement with Postgres. For demonstration purposes, I have a simple table MyTable:
class MyTable(base):
    __tablename__ = 'mytable'

    id = Column(types.Integer, primary_key=True)
    test_value = Column(types.Text)
Creating a generic insert statement is simple enough:
from sqlalchemy.dialects import postgresql
values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)
The problem I run into is when I try to add the "on conflict" part of the upsert.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
Trying to execute this statement yields a ProgrammingError:
from sqlalchemy import create_engine
engine = create_engine('postgres://localhost/db_name')
engine.execute(update_stmt)
>>> ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
I think my misunderstanding is in constructing the statement with the on_conflict_do_update method. Does anyone know how to construct this statement? I have looked at other questions on StackOverflow (eg. here) but I can't seem to find a way to address the above error.
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_=dict(data=values)
)
index_elements should either be a list of strings or a list of column objects. So either [MyTable.id] or ['id'] (This is correct)
set_ should be a dictionary with column names as keys and valid sql update objects as values. You can reference values from the insert block using the excluded attribute. So to get the result you are hoping for here you would want set_={'test_value': insert_stmt.excluded.test_value} (The error you made is that data= in the example isn't a magic argument... it was the name of the column on their example table)
So, the whole thing would be
update_stmt = insert_stmt.on_conflict_do_update(
    index_elements=[MyTable.id],
    set_={'test_value': insert_stmt.excluded.test_value}
)
Of course, in a real world example I usually want to change more than one column. In that case I would do something like...
update_columns = {col.name: col for col in insert_stmt.excluded if col.name not in ('id', 'datetime_created')}
update_statement = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)
(This example would overwrite every column except for the id and datetime_created columns)
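Putting the pieces together, a rough sketch of building and executing the full upsert, reusing MyTable, values and engine from above:
from sqlalchemy.dialects import postgresql

values = [{'id': 0, 'test_value': 'a'}, {'id': 1, 'test_value': 'b'}]
insert_stmt = postgresql.insert(MyTable.__table__).values(values)

# update every column except the conflict target, taking values from the proposed insert
update_columns = {col.name: col for col in insert_stmt.excluded if col.name != 'id'}
upsert_stmt = insert_stmt.on_conflict_do_update(index_elements=['id'], set_=update_columns)

# engine.begin() opens a transaction that commits on success
with engine.begin() as conn:
    conn.execute(upsert_stmt)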

Postgresql: Insert from huge csv file, collect the ids and respect unique constraints

In a postgresql database:
class Persons(models.Model):
    person_name = models.CharField(max_length=10, unique=True)
The persons.csv file, contains 1 million names.
$cat persons.csv
Name-1
Name-2
...
Name-1000000
I want to:
Create the names that do not already exist
Query the database and fetch the id for each name contained in the csv file.
My approach:
Use the COPY command or the django-postgres-copy application that implements it.
Also take advantage of the new Postgresql-9.5+ upsert feature.
Now, all the names in the csv file, are also in the database.
I need to get their ids -from the database- either in memory or in another csv file with an efficient way:
Use Q objects
list_of_million_q = <iterate csv and append Qs>
million_names = Names.objects.filter(list_of_million_q)
or
Use __in to filter based on a list of names:
list_of_million_names = <iterate csv and append strings>
million_names = Names.objects.filter(
    person_name__in=list_of_million_names
)
or
?
I do not feel that any of the above approaches for fetching the ids is efficient.
Update
There is a third option, along the lines of this post, which should be a good solution combining all of the above.
Something like:
SELECT * FROM persons;
make a name: id dictionary out of the names received from the database:
db_dict = {'Harry': 1, 'Bob': 2, ...}
Query the dictionary:
ids = []
for name in list_of_million_names:
    if name in db_dict:
        ids.append(db_dict[name])
This way you're using the quick dictionary indexing as opposed to the slower if x in list approach.
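A concrete sketch of that dictionary approach with the Django ORM, assuming the Persons model above (values_list avoids instantiating model objects):
# one query for the whole {name: id} mapping, then resolve the csv names in memory
db_dict = dict(Persons.objects.values_list('person_name', 'id'))
ids = [db_dict[name] for name in list_of_million_names if name in db_dict]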
But the only way to really know for sure is to benchmark these 3 approaches.
This post describes how to use RETURNING with ON CONFLICT, so that while the contents of the csv file are inserted into the database, the ids are saved in another table, both when an insertion succeeds and when, due to the unique constraint, the insertion is skipped.
I have tested it on SQL Fiddle, using a setup that resembles the one used for the COPY command, which inserts into the database straight from a csv file while respecting the unique constraint.
The schema:
CREATE TABLE IF NOT EXISTS label (
id serial PRIMARY KEY,
label_name varchar(200) NOT NULL UNIQUE
);
INSERT INTO label (label_name) VALUES
('Name-1'),
('Name-2');
CREATE TABLE IF NOT EXISTS ids (
id serial PRIMARY KEY,
label_ids varchar(12) NOT NULL
);
The script:
CREATE TEMP TABLE tmp_table
(LIKE label INCLUDING DEFAULTS)
ON COMMIT DROP;
INSERT INTO tmp_table (label_name) VALUES
('Name-2'),
('Name-3');
WITH ins AS(
INSERT INTO label
SELECT *
FROM tmp_table
ON CONFLICT (label_name) DO NOTHING
RETURNING id
)
INSERT INTO ids (label_ids)
SELECT
id FROM ins
UNION ALL
SELECT
l.id FROM tmp_table
JOIN label l USING(label_name);
The output:
SELECT * FROM ids;
SELECT * FROM label;
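If you prefer to drive the same script from Python with SQLAlchemy rather than running it in psql, a rough sketch (the connection URL is hypothetical; all three statements run in one transaction so the ON COMMIT DROP temp table stays visible to the later statements):
import sqlalchemy as sa

engine = sa.create_engine('postgresql+psycopg2:///test')  # hypothetical URL

with engine.begin() as conn:
    conn.execute(sa.text(
        "CREATE TEMP TABLE tmp_table (LIKE label INCLUDING DEFAULTS) ON COMMIT DROP"
    ))
    conn.execute(sa.text(
        "INSERT INTO tmp_table (label_name) VALUES ('Name-2'), ('Name-3')"
    ))
    # same upsert-with-RETURNING statement as above, executed verbatim
    conn.execute(sa.text("""
        WITH ins AS (
            INSERT INTO label
            SELECT * FROM tmp_table
            ON CONFLICT (label_name) DO NOTHING
            RETURNING id
        )
        INSERT INTO ids (label_ids)
        SELECT id FROM ins
        UNION ALL
        SELECT l.id FROM tmp_table
        JOIN label l USING (label_name)
    """))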
