SQLAlchemy query against a view not returning full results - python

I am using Flask-SQLAlchemy (flask_sqlalchemy==2.3.2) for my Flask webapp.
For normal table queries it has performed flawlessly, but now I am moving some of the logic into SQL views, and SQLAlchemy is not returning the full results.
This is my specific example:
SQL View view_ticket_counts:
CREATE VIEW view_ticket_counts AS
SELECT event_id, price_id, COUNT(1) AS ticket_count FROM public.tickets
GROUP BY event_id, price_id
When I run this as a normal SQL query with pgAdmin:
SELECT * FROM view_ticket_counts WHERE event_id=1
I get the results:
| event_id | price_id | ticket_count |
|        1 |        1 |            3 |
|        1 |        2 |            1 |
However, if I run a python SQLAlchemy query like so:
ticket_counts = ViewTicketCounts.query.filter_by(event_id=1).all()
for tc in ticket_counts:
    print(tc.event_id, tc.price_id, tc.ticket_count)
It only prints one result: 1 1 3
So for some reason the SQLAlchemy query or implementation is only fetching the first element, even with .all().
For completeness, this is my view model class:
class ViewTicketCounts(db.Model):
    event_id = db.Column(BigInteger, primary_key=True)
    price_id = db.Column(BigInteger)
    ticket_count = db.Column(BigInteger)

Your view's actual key is (event_id, price_id), not just event_id. The reason you are only seeing the first row is that when querying for model objects / entities, the ORM consults the identity map for each fetched row based on its primary key; if an object with that key has already been included in the results, the row is skipped. So in your case, when the second row is processed, SQLAlchemy finds that an object with primary key 1 already exists in the results and simply ignores the row (since there is no joined eager loading involved).
The fix is simple:
class ViewTicketCounts(db.Model):
    event_id = db.Column(BigInteger, primary_key=True)
    price_id = db.Column(BigInteger, primary_key=True)
    ticket_count = db.Column(BigInteger)
This sort of implicit "distinct on" is mentioned and reasoned about in the ORM tutorial under "Adding and Updating Objects" and "Joined Load".
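As a quick way to confirm the identity-map behaviour, querying individual columns instead of full entities returns plain rows, so nothing is collapsed (a minimal sketch using the model from the question):

# Column-based query: tuples are returned and the identity map is not
# consulted, so both rows for event_id=1 come back even without the fix.
rows = db.session.query(
    ViewTicketCounts.event_id,
    ViewTicketCounts.price_id,
    ViewTicketCounts.ticket_count,
).filter_by(event_id=1).all()

for event_id, price_id, ticket_count in rows:
    print(event_id, price_id, ticket_count)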

Related

Python sqlAlchemy bulk update but non-serializable JSON attributes

I am trying to better understand how I can bulk update rows in SQLAlchemy, using a Python function for each row that requires dumping its result to JSON, without having to iterate over the rows individually:
def do_something(x):
    return x.id + x.offset

table.update({Table.updated_field: do_something(Table)})
This is a simplification of what I am trying to accomplish, except I get the error: TypeError: Object of type InstrumentedAttribute is not JSON serializable.
Any thoughts on how to fix the issue here?
Why are you casting your Table id to a JSON string? Remove it and try.
Edit:
You can't call the same object in bulk. You can, for example:
table.update({Table.updated_field: json.dumps(object_of_my_table_variable._asdict())})
If you want to update your column attribute with the whole object, you must loop and dump it in the update, as in:
for table in dbsession.query(Table):
    table.updated_field = json.dumps(table._asdict())
    dbsession.add(table)
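A hedged variant of that loop for plain ORM instances, which do not have _asdict(), building the dict explicitly from the question's columns:

import json

# Sketch only: per-row update in Python; assumes id and offset are the
# values to serialise, as in the question.
for row in dbsession.query(Table):
    row.updated_field = json.dumps({"id": row.id, "offset": row.offset})
dbsession.commit()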
If you need to update millions of rows the best way from my experience is with Pandas and bulk_update_mappings.
You can load data from the DB in bulk as a DataFrame with read_sql by passing a query statement and your engine object.
import pandas as pd
query = session.query(Table)
table_data = pd.read_sql(query.statement, engine)
Note that read_sql has a chunksize parameter which causes it to return an iterator, so if the table is too large to fit in memory, you can process it in a loop with however many rows your machine can handle at once:
for table_chunk in pd.read_sql(query.statement, engine, chunksize=1_000_000):
    ...
From there you can use apply to alter each column with any custom function you want:
table_data["column_1"] = table_data["column_1"].apply(do_something)
Then, converting the DataFrame to a dict with the records orientation puts it in the appropriate format for bulk_update_mappings:
table_data = table_data.to_dict("records")
session.bulk_update_mappings(Table, table_data)
session.commit()
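If you do use chunksize, the same steps apply per chunk; a minimal sketch combining the pieces above (names as in the question):

import pandas as pd

# Process the table one million rows at a time, applying the same
# transform and bulk update per chunk.
for table_chunk in pd.read_sql(query.statement, engine, chunksize=1_000_000):
    table_chunk["column_1"] = table_chunk["column_1"].apply(do_something)
    session.bulk_update_mappings(Table, table_chunk.to_dict("records"))
session.commit()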
Additionally, if you need to perform a lot of json operations for your updates, I've used orjson for updates like this in the past which also provides a notable speed improvement over the standard library's json.
Without the requirement to serialise to JSON,
session.query(Table).update({'updated_field': Table.id + Table.offset})
would work fine, performing all computations and updates in the database. However
session.query(Table).update({'updated_field': json.dumps(Table.id + Table.offset)})
does not work, because it mixes Python-level operations (json.dumps) with database-level operations (add id and offset for all rows).
Fortunately, many RDBMS provide JSON functions (SQLite, PostgreSQL, MariaDB, MySQL), so we can do the work solely in the database layer. This is considerably more efficient than fetching data into the Python layer, mutating it, and writing it back to the database. Unfortunately, the available functions and their behaviours are not consistent across RDBMS.
The following script should work for SQLite, PostgreSQL and MariaDB (and probably MySQL too). These assumptions are made:
id and offset are both columns in the database table being updated
both are integers, as is their sum
the desired result is that their sum is written to a JSON column as a scalar
import sqlalchemy as sa
from sqlalchemy import orm
from sqlalchemy.dialects.postgresql import JSONB

urls = [
    'sqlite:///so73956014.db',
    'postgresql+psycopg2:///test',
    'mysql+pymysql://root:root@localhost/test',
]

for url in urls:
    engine = sa.create_engine(url, echo=False, future=True)
    print(f'Checking {engine.dialect.name}')
    Base = orm.declarative_base()
    JSON = JSONB if engine.dialect.name == 'postgresql' else sa.JSON

    class Table(Base):
        __tablename__ = 't73956014'
        id = sa.Column(sa.Integer, primary_key=True)
        offset = sa.Column(sa.Integer)
        updated_field = sa.Column(JSON)

    Base.metadata.drop_all(engine, checkfirst=True)
    Base.metadata.create_all(engine)
    Session = orm.sessionmaker(engine, future=True)

    with Session.begin() as s:
        ts = [Table(offset=o * 10) for o in range(1, 4)]
        s.add_all(ts)

    # Use DB-specific function to serialise to JSON.
    if engine.dialect.name == 'postgresql':
        func = sa.func.to_jsonb
    else:
        func = sa.func.json_quote
    # MariaDB requires that the argument to json_quote is a character type.
    if engine.dialect.name in ['mysql', 'mariadb']:
        expr = sa.cast(Table.id + Table.offset, sa.Text)
    else:
        expr = Table.id + Table.offset

    with Session.begin() as s:
        s.query(Table).update(
            {Table.updated_field: func(expr)}
        )

    with Session() as s:
        ts = s.scalars(sa.select(Table))
        for t in ts:
            print(t.id, t.offset, t.updated_field)

    engine.dispose()
Output:
Checking sqlite
1 10 11
2 20 22
3 30 33
Checking postgresql
1 10 11
2 20 22
3 30 33
Checking mysql
1 10 11
2 20 22
3 30 33
Other functions can be used if the desired result is an object or array. If updating an existing JSON column value, the column may need to use the Mutable extension.
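For example, a minimal sketch of the Mutable extension applied to a JSON column (assuming the column stores JSON objects, i.e. dicts, rather than scalars):

import sqlalchemy as sa
from sqlalchemy.ext.mutable import MutableDict

# Reusing Base and JSON from the script above; MutableDict makes in-place
# changes (e.g. row.updated_field['key'] = value) visible to the session.
class MutableExample(Base):
    __tablename__ = 't73956014_mutable'
    id = sa.Column(sa.Integer, primary_key=True)
    updated_field = sa.Column(MutableDict.as_mutable(JSON))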

Get fields used within SQL query

I would like to be able to return a list of all fields (ideally with the table details) used by a given SQL query. E.g. for the input query:
SELECT t1.field1, field3
FROM dbo.table1 AS t1
INNER JOIN dbo.table2 as t2
ON t2.field2 = t1.field2
WHERE t2.field1 = 'someValue'
would return
+--------+-----------+--------+
| schema | tablename | field |
+--------+-----------+--------+
| dbo | table1 | field1 |
| dbo | table1 | field2 |
| dbo | table1 | field3 |
| dbo | table2 | field1 |
| dbo | table2 | field2 |
+--------+-----------+--------+
Really it needs to make use of the SQL kernel (is that the right word? engine?), as there is no way that a reader can know that field3 is in table1, not table2. For this reason I would assume that the solution would be in SQL. Bonus points if it can handle SELECT * too!
I have attempted a Python solution using sqlparse (https://sqlparse.readthedocs.io/en/latest/), but was having trouble with the more complex SQL queries involving temporary tables, subqueries and CTEs. Handling of aliases was also very difficult (particularly if the query used the same alias in multiple places). Obviously it could not handle cases like field3 above, which has no table identifier. Nor can it handle SELECT *.
I was hoping there might be a more elegant solution within SQL Server Management Studio, or even some function within SQL Server itself. We have SQL Prompt from Redgate, which must have some understanding, within its IntelliSense, of the schema and the SQL query it is formatting.
UPDATE:
As requested: the reason I'm trying to do this is to work out which Users can execute which SSRS Reports within our organisation. This is entirely dependent on them having GRANT SELECT permissions assigned to their Roles on all fields used by all datasets (in our case SQL queries) in a given report. I have already managed to report on which Users have GRANT SELECT on which fields according to their Roles. I now want to extend that to which reports those permissions allow them to run.
The table names for the columns may be tricky to get, because column names can be ambiguous or even derived. However, you can get the column names, their sequence and types from virtually any query or stored procedure.
Example
SELECT column_ordinal
      ,name
      ,system_type_name
FROM sys.dm_exec_describe_first_result_set('Select * from YourTable', null, null)
I think I have now found an answer. Please note: I currently do not have permission to execute these functions, so I have not yet tested it; I will update the answer when I've had a chance to test it. Credit for the answer goes to @milivojeviCH; it is copied from here: https://stackoverflow.com/a/19852614/6709902
The ultimate goal, selecting all the columns used in a SQL Server execution plan, is solved as follows:
USE AdventureWorksDW2012
DBCC FREEPROCCACHE
SELECT dC.Gender, dc.HouseOwnerFlag,
SUM(fIS.SalesAmount) AS SalesAmount
FROM
dbo.DimCustomer dC INNER JOIN
dbo.FactInternetSales fIS ON fIS.CustomerKey = dC.CustomerKey
GROUP BY dC.Gender, dc.HouseOwnerFlag
ORDER BY dC.Gender, dc.HouseOwnerFlag
/*
query_hash query_plan_hash
0x752B3F80E2DB426A 0xA15453A5C2D43765
*/
DECLARE @MyQ AS XML;

-- SELECT qstats.query_hash, query_plan_hash, qplan.query_plan AS [Query Plan], qtext.text
SELECT @MyQ = qplan.query_plan
FROM sys.dm_exec_query_stats AS qstats
CROSS APPLY sys.dm_exec_query_plan(qstats.plan_handle) AS qplan
CROSS APPLY sys.dm_exec_sql_text(qstats.plan_handle) AS qtext
WHERE qtext.text LIKE '% fIS %'
  AND query_plan_hash = 0xA15453A5C2D43765

SELECT @MyQ

;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT DISTINCT
    [Database] = x.value('(@Database)[1]', 'varchar(128)'),
    [Schema]   = x.value('(@Schema)[1]', 'varchar(128)'),
    [Table]    = x.value('(@Table)[1]', 'varchar(128)'),
    [Alias]    = x.value('(@Alias)[1]', 'varchar(128)'),
    [Column]   = x.value('(@Column)[1]', 'varchar(128)')
FROM @MyQ.nodes('//ColumnReference') x1(x)
Leads to the following output:
Database Schema Table Alias Column
------------------------- ------ ---------------- ----- ----------------
NULL NULL NULL NULL Expr1004
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] CustomerKey
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] Gender
[AdventureWorksDW2012] [dbo] [DimCustomer] [dC] HouseOwnerFlag
[AdventureWorksDW2012] [dbo] [FactInternetSal [fIS] CustomerKey
[AdventureWorksDW2012] [dbo] [FactInternetSal [fIS] SalesAmount

How to return specific dictionary keys from within a nested list from a jsonb column in sqlalchemy

I am attempting to return some named columns from a jsonb data set that is stored with PostgreSQL.
I am able to run a raw query that meets my needs directly, however I am trying to run the query utilising SQLAlchemy, in order to ensure that my code is 'pythonic' and easy to read.
The query that returns the correct result (two columns) is:
SELECT
tmp.item->>'id',
tmp.item->>'name'
FROM (SELECT jsonb_array_elements(t.data -> 'users') AS item FROM tpeople t) as tmp
Example JSON (each user has 20+ fields):
{ "results":247, "users": [
{"id":"202","regdate":"2015-12-01","name":"Bob Testing"},
{"id":"87","regdate":"2014-12-12","name":"Sally Testing"},
{"id":"811", etc etc}
...
]}
The table is simple enough, with a PK, datetime of json extraction, and the jsonb column for the extract
CREATE TABLE tpeople
(
record_id bigint NOT NULL DEFAULT nextval('"tpeople_record_id_seq"'::regclass) ( INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 9223372036854775807 CACHE 1 ),
scrape_time timestamp without time zone NOT NULL,
data jsonb NOT NULL,
CONSTRAINT "tpeople_pkey" PRIMARY KEY (record_id)
);
Additionally I have a People Class that looks as follows:
class people(Base):
    __tablename__ = 'tpeople'

    record_id = Column(BigInteger, primary_key=True, server_default=text("nextval('\"tpeople_record_id_seq\"'::regclass)"))
    scrape_time = Column(DateTime, nullable=False)
    data = Column(JSONB(astext_type=Text()), nullable=False)
Presently my code to return the two columns looks like this:
from db.db_conn import get_session  # Generic connector for my db
from model.models import people
from sqlalchemy import func

sess = get_session()
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
test = sess.query(sub.c.item).select_entity_from(sub).all()
SQLAlchemy generates the following SQL:
SELECT anon_1.item AS anon_1_item
FROM (SELECT jsonb_array_elements(tpeople.data -> %(data_1)s) AS item
FROM tpeople) AS anon_1
{'data_1': 'users'}
But nothing I do seems to let me select only certain keys within the item itself, the way the raw SQL above does. Some of the approaches I have tried are as follows (they all error out):
test = sess.query("sub.item.id").select_entity_from(sub).all()
test = sess.query(sub.item.["id"]).select_entity_from(sub).all()
aas = func.jsonb_to_recordset(people.data["users"])
res = sess.query("id").select_from(aas).all()
sub = select(func.jsonb_array_elements(people.data["users"]).label("item"))
Presently I can extract the columns I need in a simple for loop, but this seems like a hacky way to do it, and I'm sure there is something dead obvious I'm missing.
for row in test:
    print(row.item['id'])
I searched for a few hours and eventually found someone who had done this accidentally while trying to get another result.
sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
tmp = sub.c.item.op('->>')('id')
tmp2 = sub.c.item.op('->>')('name')
test = sess.query(tmp, tmp2).all()
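An equivalent sketch, under the assumption that coercing the derived column back to JSONB is acceptable, which makes the usual indexing and .astext accessors available instead of the raw .op('->>') calls:

from sqlalchemy import func, type_coerce
from sqlalchemy.dialects.postgresql import JSONB

sub = sess.query(func.jsonb_array_elements(people.data["users"]).label("item")).subquery()
# Tell SQLAlchemy the subquery column is JSONB so ['key'] / .astext work.
item = type_coerce(sub.c.item, JSONB)
test = sess.query(item["id"].astext, item["name"].astext).all()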

Better Alternate instead of using chained union queries in Django ORM

I needed to achieve something like this in the Django ORM:
(SELECT * FROM `stats` WHERE MODE = 1 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 2 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 3 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 6 ORDER BY DATE DESC LIMIT 2)
UNION
(SELECT * FROM `stats` WHERE MODE = 5 AND is_completed != 3 ORDER BY DATE DESC)
# mode 5 can return more than 100 records so NO LIMIT here
for which I wrote this:
query_run_now_job_ids = Stats.objects.filter(mode=5).exclude(is_completed=3).order_by('-date')
list_of_active_job_ids = Stats.objects.filter(mode=1).order_by('-date')[:2].union(
    Stats.objects.filter(mode=2).order_by('-date')[:2],
    Stats.objects.filter(mode=3).order_by('-date')[:2],
    Stats.objects.filter(mode=6).order_by('-date')[:2],
    query_run_now_job_ids)
but somehow the returned list_of_active_job_ids is unordered, i.e. list_of_active_job_ids.ordered returns False, and so when this queryset is passed to the Paginator class it gives:
UnorderedObjectListWarning:
Pagination may yield inconsistent results with an unordered object_list
I have already set ordering in class Meta in models.py
class Meta:
    ordering = ['-date']
Without the paginator the query works fine and the page loads, but with the paginator the view never loads; it just keeps loading.
Is there any better alternative for achieving this without using a chain of unions?
So I tried another alternative for the above MySQL query, but I'm stuck on another problem: writing the condition for mode = 5 into this query:
SELECT
MODE ,
SUBSTRING_INDEX(GROUP_CONCAT( `job_id` SEPARATOR ',' ),',',2) AS job_id_list,
SUBSTRING_INDEX(GROUP_CONCAT( `total_calculations` SEPARATOR ',' ),',',2) AS total_calculations
FROM `stats`
ORDER BY DATE DESC
Even if I were able to write this query, it would lead me to another challenging situation, i.e. converting it to the Django ORM.
So why is my query not ordered even when I have set ordering in class Meta?
Also, if not this query, is there any better alternative for achieving this?
Help would be appreciated!
I'm using Python 2.7 and Django 1.11.
While subqueries may be ordered, the resulting union data is not. You need to explicitly define the ordering.
from django.db import models

def make_query(mode, index):
    return (
        Stats.objects.filter(mode=mode).
        annotate(_sort=models.Value(index, models.IntegerField())).
        order_by('-date')
    )

list_of_active_job_ids = make_query(1, 1)[:2].union(
    make_query(2, 2)[:2],
    make_query(3, 3)[:2],
    make_query(6, 4)[:2],
    make_query(5, 5).exclude(is_completed=3)
).order_by('_sort', '-date')
All I did was add a new, literal value field _sort that has a different value for each subquery, and then order by it in the final query. The rest of the code is just there to reduce duplication. It would have been even cleaner if it weren't for the mode=6 subquery.
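As a usage note (a sketch of the follow-on step from the question): because the union queryset is now explicitly ordered, Paginator should no longer emit UnorderedObjectListWarning.

from django.core.paginator import Paginator

# Page size of 25 is arbitrary here.
paginator = Paginator(list_of_active_job_ids, 25)
page = paginator.page(1)
for stat in page.object_list:
    print(stat.mode, stat.date)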

how to use two Lists to insert multiple records in Django Model query

I have three tables:
table1, table2, table3
From table1 I get the data using:
table1queryset = table1.objects.filter(token=123)
It gives me 50 records.
From table2 I get the data using:
table2queryset = table2.objects.filter(name='andy')
It gives me 10 records.
table3's structure is like:
mytoken = models.ForeignKey(table1)
myname = models.ForeignKey(table2)
Now, for every table1 record I want to insert a table2 record into table3, like:
for eachT1 in table1queryset:
    for eachT2 in table2queryset:
        table3(mytoken=eachT1, myname=eachT2).save()
In my case it will insert 50*10 = 500 records.
What is the most efficient way of doing this?
Can I assign both querysets to the query, something like:
table3(mytoken=table1queryset, myname=table2queryset).save()
Look at RelatedManager.add. It is promised that "Using add() with a many-to-many relationship, however, will not call any save() methods, but rather create the relationships using QuerySet.bulk_create()."
If you define a ManyToMany field on the first model and tell Django that this relationship must work through table3 (see ManyToManyField.through), you can do:
t2_ids = [t2.id for t2 in table2queryset]
for t1 in table1queryset:
    # add() goes through the many-to-many manager on table1; the attribute
    # name matches your ManyToManyField (shown here as "table2s", a placeholder).
    t1.table2s.add(*t2_ids)
And it will probably be more efficient than creating separate instances of the table3 model.
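If table3 is kept as the explicit model from the question, a hedged alternative is to build all 500 rows in memory and insert them with a single query (per batch) via bulk_create:

# Sketch only: one INSERT per batch instead of 500 individual save() calls.
rows = [
    table3(mytoken=each_t1, myname=each_t2)
    for each_t1 in table1queryset
    for each_t2 in table2queryset
]
table3.objects.bulk_create(rows)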
