Python Peewee EXISTS Subquery not working as expected - python

I am using the peewee ORM for a Python application and I am trying to write code to fetch batches of records from a SQLite database. I have a subquery that seems to work by itself, but when it is added to an update query, fn.EXISTS(sub_query) seems to have no effect: every record in the database is updated.
Note: I am using the APSW extension for peewee.
def batch_logic(self, id_1, path_1, batch_size=1000, **kwargs):
    sub_query = (
        self.select(ModelClass.granule_id).distinct().where(
            (ModelClass.status == 'old_status') &
            (ModelClass.collection_id == id_1) &
            (ModelClass.name.contains(path_1))
        ).order_by(ModelClass.discovered_date.asc()).limit(batch_size)
    )
    print(f'len(sub_query): {len(sub_query)}')
    fb_st_2 = time.time()
    updated_records = list(
        (self.update(status='new_status').where(fn.EXISTS(sub_query)).returning(ModelClass))
    )
    print(f'update {len(updated_records)}: {time.time() - fb_st_2}')
    db.close()
    return updated_records
Below is output from testing locally:
id_1: id_1_1676475997_PQXYEQGJWR
len(sub_query): 2
update 20000: 1.0583274364471436
fetch_batch 20000: 1.1167597770690918
count_things 0: 0.02147078514099121
processed_things: 20000
The subquery is correctly returning 2 but the update query where(fn.EXISTS(sub_query)) seems to be ignored. Have I made a mistake in my understanding of how this works?
Edit 1: I believe GROUP BY is needed as rows can have the same granule_id and I need to fetch rows up to batch_size granule_ids

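I think your use of UPDATE...WHERE EXISTS is incorrect here: the subquery is not correlated with the row being updated, so EXISTS is true for every row as long as the subquery returns anything at all. This may work better for you: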
# Unsure why you have a GROUP BY with no aggregation, that seems
# incorrect possibly, so I've removed it.
sub_query = (self.select(ModelClass.id)
             .where(
                 (ModelClass.status == 'old_status') &
                 (ModelClass.collection_id == id_1) &
                 (ModelClass.name.contains(path_1)))
             .order_by(ModelClass.discovered_date.asc())
             .limit(batch_size))
update = (self.update(status='new_status')
          .where(self.id.in_(sub_query))
          .returning(ModelClass))
cursor = update.execute()  # It's good to explicitly execute().
updated_records = list(cursor)
The key idea, at any rate, is that I'm correlating the update with the subquery.
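To see the difference outside of peewee, here is a minimal sketch using the standard-library sqlite3 module. The item table and its columns are made up for illustration; they are not the asker's schema. It runs the same batch subquery once under EXISTS and once under IN and compares how many rows each UPDATE touches.

```python
import sqlite3


def run_updates():
    """Compare an uncorrelated EXISTS with IN on a toy table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE item (id INTEGER PRIMARY KEY, status TEXT, discovered INTEGER)"
    )
    conn.executemany(
        "INSERT INTO item (status, discovered) VALUES ('old', ?)",
        [(n,) for n in range(10)],
    )
    # The "batch" subquery: the two oldest rows that still have the old status.
    batch = "SELECT id FROM item WHERE status = 'old' ORDER BY discovered LIMIT 2"

    # Uncorrelated EXISTS: the subquery never refers to the row being updated,
    # so the condition is simply "true" for every row in the table.
    cur = conn.execute(f"UPDATE item SET status = 'via_exists' WHERE EXISTS ({batch})")
    exists_count = cur.rowcount

    # Reset, then tie each row to the subquery through its id.
    conn.execute("UPDATE item SET status = 'old'")
    cur = conn.execute(f"UPDATE item SET status = 'via_in' WHERE id IN ({batch})")
    in_count = cur.rowcount

    conn.close()
    return exists_count, in_count


exists_count, in_count = run_updates()
print(exists_count, in_count)  # prints: 10 2
```

The EXISTS version updates all ten rows, while the IN version updates only the two rows in the batch, which matches the behaviour described in the question.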

Related

How to print SQLAlchemy update query?

According to the docs, printing a query is as simple as print(query).
But according to the description of the update function, it returns an integer:
:return: the count of rows matched as returned by the database's
"row count" feature.
My code:
state = 'router_in_use'
q = self.db_session.query(Router).filter(
    Router.id == self.router_id,
).update(
    {Router.state: state}, synchronize_session=False
)
# print(q) outputs just 1
self.db_session.commit()
Is there a way to print the query q as SQL?
The query itself works fine.
Python 3.8

Performance SQLAlchemy and or

I use the following SQLAlchemy code to retrieve some data from a database:
q = session.query(hd_tbl).\
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']).\
    filter(or_(and_(hd_tbl.c['object_id'] == get_id(row['object']),
                    hd_tbl.c['data_type'] == get_id(row['type']),
                    hd_tbl.c['data_provider'] == get_id(row['provider']),
                    hd_tbl.c['data_account'] == get_id(row['account']))
               for index, row in data.iterrows())).\
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
where hd_tbl and dt_tbl are two tables in a SQL database, and data is a pandas DataFrame typically containing around 1k-9k entries. hd_tbl currently contains around 90k rows.
The execution time seems to grow exponentially with the length of data. The corresponding SQL statement (generated by SQLAlchemy) looks as follows:
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
...
data_header.object_id = %s AND data_header.data_type = %s AND data_header.data_provider = %s AND data_header.data_account = %s OR
The tables and columns are fully indexed, and the performance is not satisfactory. Currently it is much faster to read all the data of hd_tbl and dt_tbl into memory and merge with the pandas merge function. However, this seems to be suboptimal. Does anyone have an idea how to improve the SQLAlchemy call?
EDIT:
I was able to improve performance significantly by using SQLAlchemy's tuple_ in the following way:
header_tuples = [tuple([int(y) for y in tuple(x)]) for x in
                 data_as_int.values]
q = session.query(hd_tbl). \
    join(dt_tbl, hd_tbl.c['data_type'] == dt_tbl.c['ID']). \
    filter(tuple_(hd_tbl.c['object_id'], hd_tbl.c['data_type'],
                  hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account']).in_(header_tuples)). \
    with_entities(hd_tbl.c['ID'], hd_tbl.c['object_id'],
                  hd_tbl.c['data_type'], hd_tbl.c['data_provider'],
                  hd_tbl.c['data_account'], dt_tbl.c['value_type'])
with corresponding query...
SELECT data_header.`ID`, data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account, basedata_data_type.value_type
FROM data_header INNER JOIN basedata_data_type ON data_header.data_type = basedata_data_type.`ID`
WHERE (data_header.object_id, data_header.data_type, data_header.data_provider, data_header.data_account) IN ((%(param_1)s, %(param_2)s, %(param_3)s, %(param_4)s), (%(param_5)s, ...))
I'd recommend creating a composite index on the fields object_id, data_type, data_provider, ... in the same order in which they appear in the table, and making sure they appear in that same order in your WHERE condition. It may speed up your requests a bit at the cost of disk space.
You may also use several consecutive small SQL requests instead of one large query with a complex OR condition. Accumulate the extracted data on the application side or, if the amount is large enough, in fast temporary storage (a temporary table, NoSQL, etc.).
In addition, you may check the MySQL configuration and increase the values related to memory per thread, per request, etc. It is a good idea to check whether your composite index fits into available memory; if it does not, it is useless.
I guess DB tuning may help a lot to increase performance. Otherwise you may analyze your application's architecture to get more significant results.
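As a rough sketch of the "several small requests" idea combined with a composite index, the following uses the standard-library sqlite3 module rather than MySQL so that it runs anywhere. The table layout, the index name ix_header_key, the helpers chunked and fetch_headers, and the chunk size are all assumptions made for illustration.

```python
import sqlite3
from itertools import islice

# Key columns, listed in the same order as the composite index below.
KEY_COLUMNS = ("object_id", "data_type", "data_provider", "data_account")


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def fetch_headers(conn, keys, chunk_size=200):
    """Look up rows for many 4-column keys, issuing one small query per chunk."""
    one_key = "(" + " AND ".join(f"{col} = ?" for col in KEY_COLUMNS) + ")"
    rows = []
    for chunk in chunked(keys, chunk_size):
        where = " OR ".join([one_key] * len(chunk))
        params = [value for key in chunk for value in key]
        sql = f"SELECT ID, {', '.join(KEY_COLUMNS)} FROM data_header WHERE {where}"
        # Accumulate the results on the application side.
        rows.extend(conn.execute(sql, params).fetchall())
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_header (ID INTEGER PRIMARY KEY, object_id INTEGER, "
    "data_type INTEGER, data_provider INTEGER, data_account INTEGER)"
)
conn.execute(
    "CREATE INDEX ix_header_key ON data_header "
    "(object_id, data_type, data_provider, data_account)"
)
conn.executemany(
    "INSERT INTO data_header (object_id, data_type, data_provider, data_account) "
    "VALUES (?, ?, ?, ?)",
    [(n, n % 3, n % 5, n % 7) for n in range(1000)],
)

# Ask for every second key; each key matches exactly one row.
wanted = [(n, n % 3, n % 5, n % 7) for n in range(0, 1000, 2)]
found = fetch_headers(conn, wanted)
print(len(found))  # prints: 500
```

The chunk size keeps each statement well below the bound-parameter limit of the database; the right value for MySQL would need to be measured.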

SQLAlchemy Left join WHERE clause being converted to zeros and ones

Howdie do,
I have the following SQL, that I'm converting to SQLAlchemy:
select t1.`order_id`, t1.`status_type`
from `tracking_update` AS t1 LEFT JOIN `tracking_update` AS t2
ON (t1.`order_id` = t2.`order_id` AND t1.`last_updated` < t2.`last_updated`)
where t1.`order_id` = '21757' and t2.`last_updated` IS NULL
The SQL is just returning the latest tracking update for order id 21757. I'm accomplishing this by doing a left join back to the same table. In order to do this, I'm aliasing the table first:
tUAlias1 = aliased(TrackingUpdate)
tUalias2 = aliased(TrackingUpdate)
So far, this is what I have for my conversion to SQLAlchemy:
tracking_updates = db.session.query(tUAlias1.order_id, tUAlias1.status_type).\
    outerjoin(tUalias2, (tUAlias1.order_id == tUalias2.order_id) & (tUAlias1.last_updated < tUalias2.last_updated)).\
    filter(and_(tUAlias1.order_id == '21757', tUalias2.last_updated is None))
And this is the SQL that the SQLAlchemy code executes on the server, according to the log:
SELECT tracking_update_1.order_id AS tracking_update_1_order_id, tracking_update_1.status_type AS tracking_update_1_status_type
FROM tracking_update AS tracking_update_1 LEFT OUTER JOIN tracking_update AS tracking_update_2 ON tracking_update_1.order_id = tracking_update_2.order_id AND tracking_update_1.last_updated < tracking_update_2.last_updated
WHERE 0 = 1
As you can see, the filter (WHERE clause) is now 0 = 1.
Now, if I remove the and_ statement and try two filters like so:
tracking_updates = db.session.query(tUAlias1.order_id, tUAlias1.status_type).\
    outerjoin(tUalias2, (tUAlias1.order_id == tUalias2.order_id) & (tUAlias1.last_updated < tUalias2.last_updated)).\
    filter(tUAlias1.order_id == '21757').filter(tUalias2.last_updated is None)
I receive the same result. I know the SQL itself is fine as I can run it with no issue via MySQL workbench.
When the SQL is run directly, I receive the following:
order ID | Status
21757 | D
Also, if I remove the tUalias2.last_updated is None filter, I actually receive some results, but they are not correct. This is the Python code and the SQL log for that:
Python code:
tracking_updates = db.session.query(tUAlias1.order_id, tUAlias1.status_type).\
    outerjoin(tUalias2, (tUAlias1.order_id == tUalias2.order_id) & (tUAlias1.last_updated < tUalias2.last_updated)).\
    filter(tUAlias1.order_id == '21757')
SQLAlchemy run:
SELECT tracking_update_1.order_id AS tracking_update_1_order_id, tracking_update_1.status_type AS tracking_update_1_status_type
FROM tracking_update AS tracking_update_1 LEFT OUTER JOIN tracking_update AS tracking_update_2 ON tracking_update_1.order_id = tracking_update_2.order_id AND tracking_update_1.last_updated < tracking_update_2.last_updated
WHERE tracking_update_1.order_id = '21757'
Any ideas?
Howdie do,
I figured it out
The Python 'is' operator doesn't play nice with SQLAlchemy
I found this out thanks to the following S/O question:
Selecting Null values SQLAlchemy
I've since updated my query to the following:
tracking_updates = db.session.query(tUAlias1.order_id, tUAlias1.status_type).\
    outerjoin(tUalias2, (tUAlias1.order_id == tUalias2.order_id) & (tUAlias1.last_updated < tUalias2.last_updated)).\
    filter(tUAlias1.order_id == '21757').filter(tUalias2.last_updated == None)
The problem is not in how SQLAlchemy processes null values. The problem is that you use an operator that is not supported for instrumented columns, so the expression tUalias2.last_updated is None evaluates to a plain value (False), which is then translated to 0 = 1. You should write tUalias2.last_updated.is_(None) instead of tUalias2.last_updated is None to make your code work.
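The underlying reason is that Python lets a class overload == (through __eq__) but not the is operator, which always compares object identity. The sketch below illustrates this with a toy class; ToyColumn is a hypothetical stand-in written for this example and is not how SQLAlchemy implements its columns.

```python
class ToyColumn:
    """Hypothetical stand-in for an ORM column; not SQLAlchemy's real class."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # `==` can be overloaded, so it may return a SQL fragment instead of a bool.
        if other is None:
            return f"{self.name} IS NULL"
        return f"{self.name} = {other!r}"

    # Defining __eq__ removes the default __hash__; restore it.
    __hash__ = object.__hash__

    def is_(self, other):
        # Explicit method spelling of the NULL test.
        if other is None:
            return f"{self.name} IS NULL"
        return f"{self.name} IS {other!r}"


last_updated = ToyColumn("t2.last_updated")

with_is = last_updated is None        # identity test, evaluated by Python itself
with_eq = last_updated == None        # goes through __eq__  # noqa: E711
with_method = last_updated.is_(None)  # goes through the is_() method

print(with_is)      # prints: False
print(with_eq)      # prints: t2.last_updated IS NULL
print(with_method)  # prints: t2.last_updated IS NULL
```

Because the is expression is already the constant False before any query-building code sees it, a library can only render it as an always-false condition.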

sqlalchemy query using joinedload exponentially slower with each new filter clause

I have this sqlalchemy query:
query = session.query(Store).options(joinedload('salesmen').
                                     joinedload('comissions').
                                     joinedload('orders')).\
    filter(Store.store_code.in_(selected_stores))
stores = query.all()
for store in stores:
    for salesman in store.salesmen:
        for comission in salesman.comissions:
            #generate html for comissions for each salesman in each store
            #print html document using PySide
            pass
This was working perfectly, however I added two new filter queries:
filter(Comissions.payment_status == 0).\
filter(Order.order_date <= self.dateEdit.date().toPython())
If I add just the first filter, the application hangs for a couple of seconds; if I add both, the application hangs indefinitely.
What am I doing wrong here? How do I make this query fast?
Thank you for your help
EDIT: This is the SQL generated. Unfortunately the class and variable names are in Portuguese; I just translated them to English so it would be easier to understand,
so Loja = Store, Vendedores = Salesmen, Pedido = Order, Comissao = Comission
Query generated:
SELECT "Loja"."CodLoja", "Vendedores_1"."CodVendedor", "Vendedores_1"."NomeVendedor", "Vendedores_1"."CodLoja", "Vendedores_1"."PercentualComissao",
"Vendedores_1"."Ativo", "Comissao_1"."CodComissao", "Comissao_1"."CodVendedor", "Comissao_1"."CodPedido",
"Pedidos_1"."CodPedido", "Pedidos_1"."CodLoja", "Pedidos_1"."CodCliente", "Pedidos_1"."NomeCliente", "Pedidos_1"."EnderecoCliente", "Pedidos_1"."BairroCliente",
"Pedidos_1"."CidadeCliente", "Pedidos_1"."UFCliente", "Pedidos_1"."CEPCliente", "Pedidos_1"."FoneCliente", "Pedidos_1"."Fone2Cliente", "Pedidos_1"."PontoReferenciaCliente",
"Pedidos_1"."DataPedido", "Pedidos_1"."ValorProdutos", "Pedidos_1"."ValorCreditoTroca",
"Pedidos_1"."ValorTotalDoPedido", "Pedidos_1"."Situacao", "Pedidos_1"."Vendeu_Teflon", "Pedidos_1"."ValorTotalTeflon",
"Pedidos_1"."DataVenda", "Pedidos_1"."CodVendedor", "Pedidos_1"."TipoVenda", "Comissao_1"."Valor", "Comissao_1"."DataPagamento", "Comissao_1"."StatusPagamento"
FROM "Comissao", "Pedidos", "Loja" LEFT OUTER JOIN "Vendedores" AS "Vendedores_1" ON "Loja"."CodLoja" = "Vendedores_1"."CodLoja"
LEFT OUTER JOIN "Comissao" AS "Comissao_1" ON "Vendedores_1"."CodVendedor" = "Comissao_1"."CodVendedor" LEFT OUTER JOIN "Pedidos" AS "Pedidos_1" ON "Pedidos_1"."CodPedido" = "Comissao_1"."CodPedido"
WHERE "Loja"."CodLoja" IN (:CodLoja_1) AND "Comissao"."StatusPagamento" = :StatusPagamento_1 AND "Pedidos"."DataPedido" <= :DataPedido_1
Your FROM clause is producing a Cartesian product and includes each table twice, once for filtering the result and once for eagerly loading the relationship.
To stop this use contains_eager instead of joinedload in your options. This will look for the related attributes in the query's columns instead of constructing an extra join. You will also need to explicitly join to the other tables in your query, e.g.:
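# Assumes the relationships chain as in the question's joinedload() call:
# Store.salesmen -> Salesman.comissions -> Comissions.orders.
# The Salesman class name is a guess; it is not shown in the question.
query = session.query(Store)\
    .join(Store.salesmen)\
    .join(Salesman.comissions)\
    .join(Comissions.orders)\
    .options(contains_eager('salesmen')
             .contains_eager('comissions')
             .contains_eager('orders'))\
    .filter(Store.store_code.in_(selected_stores))\
    .filter(Comissions.payment_status == 0)\
    .filter(Order.order_date <= self.dateEdit.date().toPython())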
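To make the Cartesian product concrete, here is a small sketch with the standard-library sqlite3 module. It uses a simplified two-table schema (store and commission, with made-up columns) instead of the four tables in the question, and compares the row counts of a query shaped like the slow one with a query that uses a single explicit join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (id INTEGER PRIMARY KEY);
CREATE TABLE commission (id INTEGER PRIMARY KEY, store_id INTEGER, paid INTEGER);
INSERT INTO store (id) VALUES (1), (2);
INSERT INTO commission (store_id, paid) VALUES (1, 0), (1, 0), (1, 1), (2, 0);
""")

# Shape of the slow query: `commission` appears twice, once unjoined (used only
# by the WHERE filter) and once as an aliased LEFT JOIN (used for loading).
# Nothing links the two copies, so their rows are multiplied together.
cartesian = conn.execute("""
    SELECT store.id, c1.id
    FROM commission, store
    LEFT OUTER JOIN commission AS c1 ON c1.store_id = store.id
    WHERE commission.paid = 0
""").fetchall()

# Shape of the fixed query: one explicit join serves both filtering and loading.
joined = conn.execute("""
    SELECT store.id, commission.id
    FROM store
    JOIN commission ON commission.store_id = store.id
    WHERE commission.paid = 0
""").fetchall()

print(len(cartesian), len(joined))  # prints: 12 3
```

With only four commission rows the first query already returns 12 rows instead of 3; with real table sizes and a second unjoined table the result grows multiplicatively, which would explain the hang.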

Bulk create Django with unique sequences or values per record?

I have what is essentially a table which is a pool of available codes/sequences for unique keys when I create records elsewhere in the DB.
Right now I run a transaction where I might grab 5000 codes out of an available pool of 1 billion codes using the slice operator [:code_count] where code_count == 5000.
This works fine, but then for every insert, I have to run through each code and insert it into the record manually when I use the code.
Is there a better way?
Example code (omitting other attributes for each new_item that are similar to all new_items):
code_count=5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]
for pool_cd in pool_cds:
    new_item = Item.objects.create(
        pool_cd=pool_cd.unique_code,
    )
    new_item.save()
cursor = connection.cursor()
update_sql = 'update CodePool set free_ind=%s where pool_cd.id in %s'
instance_param = ()
#Create ridiculously long list of params (5000 items)
for pool_cd in pool_cds:
    instance_param = instance_param + (pool_cd.id,)
params = [False, instance_param]
rows = cursor.execute(update_sql, params)
As I understand it, it works like this:
code_count = 5000
pool_cds = CodePool.objects.filter(free_indicator=True)[:code_count]
ids = []
for pool_cd in pool_cds:
    Item.objects.create(pool_cd=pool_cd.unique_code)
    ids.append(pool_cd.id)
# free_indicator is the field name used in the filter above.
CodePool.objects.filter(id__in=ids).update(free_indicator=False)
By the way, if you create an object using the queryset method create, you don't need to call the save method. See the docs.
