Using top level item in nested subquery with SQLAlchemy - python

I have a query where I'm attempting to find a link between two tables, but I require a few checks against association tables within the same query. I think my problem stems from having to check across multiple levels of relationships: I want to filter a subquery based on the top-level item, but I've hit an issue and have no idea how to proceed.
More specifically, I want to query Script using the name of an Application, but narrow the results down to those where the Application's Language matches the Script's Language.
Tables: Script (id, language_id), Application (id, name), Language (id)
Association Tables: ApplicationLanguage (app_id, language_id), ScriptApplication (script_id, app_id)
Current attempt: (it's important this stays as a single query)
value = 'appname'
# Search applications for a value
app_search = select([Application.id]).where(Application.name==value).as_scalar()
# Search for applications matching the language of the script
lang_search = select([ApplicationLanguage.app_id]).where(
    ApplicationLanguage.language_id==Script.language_id
).as_scalar()
# Find the script based on which applications appear in both subqueries.
script_search = select([ScriptApplication.script_id]).where(and_(
    ScriptApplication.app_id.in_(app_search),
    ScriptApplication.app_id.in_(lang_search),
)).as_scalar()
# Turn it into an SQL expression
query = Script.id.in_(script_search)
Resulting SQL code:
SELECT script.id AS script_id
FROM script
WHERE script.id IN (SELECT script_application.script_id
FROM script_application
WHERE script_application.application_id IN (SELECT application.id
FROM application
WHERE application.name = ?) AND script_application.application_id IN (SELECT application_language.application_id
FROM application_language, script
WHERE script.language_id = application_language.language_id))
My theory
I believe the issue is on the line ApplicationLanguage.language_id==Script.language_id, because if I change it to ApplicationLanguage.language_id==3 (3 being the value I'm expecting), it works perfectly. In the SQL code, I assume it's the FROM application_language, script which is overwriting the top-level script.
How would I go about either rearranging or fixing this query? My current method seems to work fine if it's across a single relationship, just doesn't work if I try and do anything more complex.

I'd still love to know how I'd go about fixing the original query as I believe it'll come in useful in the future, but I managed to rearrange it.
I reversed the lang_search to grab languages for each application from app_search, and used that as part of the final query, instead of attempting to combine it in a subquery.
value = 'appname'
app_search = select([Application.id]).where(Application.name==value).as_scalar()
lang_search = select([ApplicationLanguage.language_id]).where(
    ApplicationLanguage.app_id.in_(app_search)
).as_scalar()
script_search = select([ScriptApplication.script_id]).where(and_(
    ScriptApplication.app_id.in_(app_search),
)).as_scalar()
query = and_(
Script.id.in_(script_search),
Script.language_id.in_(lang_search),
)
Final SQL query:
SELECT script.id AS script_id
FROM script
WHERE script.id IN (SELECT script_application.script_id
FROM script_application
WHERE script_application.application_id IN (SELECT application.id
FROM application
WHERE lower(application.name) = ?)) AND script.language_id IN (SELECT application_language.language_id
FROM application_language
WHERE application_language.application_id IN (SELECT application.id
FROM application
WHERE lower(application.name) = ?))
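On the open question of why the first attempt misbehaves: in its generated SQL the innermost subquery lists script in its own FROM clause, so script.language_id there refers to a fresh scan of script rather than to the outer row. The sketch below reproduces that effect with the standard-library sqlite3 module instead of SQLAlchemy. The table and column names follow the generated SQL above; the sample rows are invented for illustration. Dropping script from the inner FROM turns the subquery into a correlated one, and only then is the language check applied per script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE script (id INTEGER PRIMARY KEY, language_id INTEGER);
    CREATE TABLE application (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE application_language (application_id INTEGER, language_id INTEGER);
    CREATE TABLE script_application (script_id INTEGER, application_id INTEGER);

    -- Invented sample data: one application that supports language 1 only,
    -- and two scripts linked to it, one per language.
    INSERT INTO application VALUES (1, 'appname');
    INSERT INTO application_language VALUES (1, 1);
    INSERT INTO script VALUES (1, 1), (2, 2);
    INSERT INTO script_application VALUES (1, 1), (2, 1);
""")

# Same shape as the generated SQL; only the innermost FROM clause varies.
QUERY = """
SELECT script.id
FROM script
WHERE script.id IN (
    SELECT script_application.script_id
    FROM script_application
    WHERE script_application.application_id IN (
        SELECT application.id FROM application WHERE application.name = ?)
    AND script_application.application_id IN (
        SELECT application_language.application_id
        FROM {inner_from}
        WHERE script.language_id = application_language.language_id))
ORDER BY script.id
"""

def matching_scripts(inner_from):
    sql = QUERY.format(inner_from=inner_from)
    return [row[0] for row in conn.execute(sql, ("appname",))]

# Inner FROM re-lists script: the comparison no longer sees the outer row.
uncorrelated = matching_scripts("application_language, script")
# Inner FROM omits script: script.language_id resolves to the outer row.
correlated = matching_scripts("application_language")

print(uncorrelated)  # script 2 slips through despite its language mismatch
print(correlated)    # only the script whose language the application supports
```

In SQLAlchemy, Select.correlate() is the documented way to ask for a table to be left out of a subquery's FROM clause. Whether adding .correlate(Script) to lang_search is sufficient for this exact model setup is an assumption that was not tested here.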

Related

Using arbitrary sqlalchemy select results (e.g., from a CTE) to create ORM instances

When I create an instance of an object (in my example below, a Company) I want to automagically create default, related objects. One way is to use a per-row, after-insert trigger, but I'm trying to avoid that route and use CTEs which are easier to read and maintain. I have this SQL working (underlying db is PostgreSQL and the only thing you need to know about table company is its primary key is: id SERIAL PRIMARY KEY and it has one other required column, name VARCHAR NOT NULL):
with new_company as (
-- insert my company row, returning the whole row
insert into company (name)
values ('Acme, Inc.')
returning *
),
other_related as (
-- herein I join to `new_company` and create default related rows
-- in other tables. Here we use, effectively, a no-op - what it
-- actually does is not germane to the issue.
select id from new_company
)
-- Having created the related rows, we return the row we inserted into
-- table `company`.
select * from new_company;
The above works like a charm and with the recently added Select.add_cte() (in sqlalchemy 1.4.21) I can write the above with the following python:
import sqlalchemy as sa
from myapp.models import Company
new_company = (
    sa.insert(Company)
    .values(name='Acme, Inc.')
    .returning(Company)
    .cte(name='new_company')
)
other_related = (
    sa.select(sa.text('new_company.id'))
    .select_from(new_company)
    .cte('other_related')
)
fetch_company = (
    sa.select(sa.text('* from new_company'))
    .add_cte(other_related)
)
print(fetch_company)
And the output is:
WITH new_company AS
(INSERT INTO company (name) VALUES (:param_1) RETURNING company.id, company.name),
other_related AS
(SELECT new_company.id FROM new_company)
SELECT * from new_company
Perfect! But when I execute the above query I get back a Row:
>>> result = session.execute(fetch_company).fetchone()
>>> print(result)
(26, 'Acme, Inc.')
I can create an instance with:
>>> result = session.execute(fetch_company).fetchone()
>>> company = Company(**result)
But this instance, if added to the session, is in the wrong state, pending, and if I flush and/or commit, I get a duplicate key error because the company is already in the database.
If I try using Company in the select list, I get a bad query because sqlalchemy automagically sets the from-clause and I cannot figure out how to clear or explicitly set the from-clause to use my CTE.
I'm looking for one of several possible solutions:
annotate an arbitrary query in some way to say, "build an instance of MyModel, but use this table/alias", e.g., query = sa.select(Company).select_from(new_company.alias('company'), reset=True).
tell a session that an instance is persistent regardless of what the session thinks about the instance, e.g., company = Company(**result); session.add(company, force_state='persistent')
Obviously I could do another round-trip to the db with a call to session.merge() (as discussed in early comments of this question) so the instance ends up in the correct state, but that seems terribly inefficient especially if/when used to return lists of instances.

Too many server roundtrips w/ psycopg2

I am making a script that should create a schema for each customer. I'm fetching all metadata from a database that defines what each customer's schema should look like, and then creating it. Everything is well defined: the types, the names of tables, etc. A customer has many tables (e.g., address, customers, contact, item), and each table has the same metadata.
My procedure now:
get everything I need from the metadataDatabase.
In a for loop, create a table, and then Alter Table and add each metadata (This is done for each table).
Right now my script runs in about a minute for each customer, which I think is too slow. It has something to do with me having a loop, and in that loop, I’m altering each table.
I think that instead of altering each table (which might not be such a clever approach), I should do something like the following:
Note that this is just a stupid but valid example:
for table in tables:
con.execute("CREATE TABLE IF NOT EXISTS tester.%s (%s, %s);", (table, "last_seen date", "valid_from timestamp"))
But it gives me this error (it seems like it reads the table name as a string in a string..):
psycopg2.errors.SyntaxError: syntax error at or near "'billing'"
LINE 1: CREATE TABLE IF NOT EXISTS tester.'billing' ('last_seen da...
Consider creating each table with only a serial (i.e., auto-incrementing) ID field and then using ALTER TABLE for all other fields. Use sql.Identifier for identifiers (schema names, table names, column names, function names, etc.) and regular string formatting for data types, which are not literals in the SQL statement.
from psycopg2 import sql
# CREATE TABLE
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (ID serial)"""
cur.execute(sql.SQL(query).format(shm = sql.Identifier("tester"),
tbl = sql.Identifier("table")))
# ALTER TABLE
items = [("last_seen", "date"), ("valid_from", "timestamp")]
query = """ALTER TABLE {shm}.{tbl} ADD COLUMN {col} {typ}"""
for item in items:
    # KEEP IDENTIFIER PLACEHOLDERS
    final_query = query.format(shm="{shm}", tbl="{tbl}", col="{col}", typ=item[1])
    cur.execute(sql.SQL(final_query).format(shm = sql.Identifier("tester"),
                                            tbl = sql.Identifier("table"),
                                            col = sql.Identifier(item[0])))
Alternatively, use str.join with list comprehension for one CREATE TABLE:
query = """CREATE TABLE IF NOT EXISTS {shm}.{tbl} (
"id" serial,
{vals}
)"""
items = [("last_seen", "date"), ("valid_from", "timestamp")]
val = ",\n ".join(["{{}} {typ}".format(typ=i[1]) for i in items])
# KEEP IDENTIFIER PLACEHOLDERS
pre_query = query.format(shm="{shm}", tbl="{tbl}", vals=val)
final_query = sql.SQL(pre_query).format(*[sql.Identifier(i[0]) for i in items],
shm = sql.Identifier("tester"),
tbl = sql.Identifier("table"))
cur.execute(final_query)
SQL (sent to database)
CREATE TABLE IF NOT EXISTS "tester"."table" (
"id" serial,
"last_seen" date,
"valid_from" timestamp
)
However, this becomes heavy as there are too many server roundtrips.
How many tables with how many columns are you creating that this is slow? Could you ssh to a machine closer to your server and run the python there?
I don't get that error. Rather, I get an SQL syntax error. A VALUES list is for conveying data, but ALTER TABLE is not about data; it is about metadata, so you can't use a VALUES list there. You need the names of the columns and types in double quotes (or no quotes) rather than single quotes. You can't have a comma between name and type, and you can't have parentheses around each pair. Each pair also needs to be introduced with its own ADD; you can't have it just once. You are using the wrong tool for the job. execute_batch is almost the right tool, except that it will use single quotes rather than double quotes around the identifiers. Perhaps you could add a flag to it to tell it to use quote_ident.
Not only is execute_values the wrong tool for the job, but I think python in general might be as well. Why not just load from a .sql file?
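Along the lines of the last comment, one way to cut round trips is to generate the whole DDL script first and send it in one go. The sketch below uses only the standard library: quote_ident and build_create_table are hypothetical helpers written for this example (with psycopg2 itself, sql.Identifier is the right tool for quoting), the type allowlist is an assumption, and sqlite3 with its built-in "main" schema stands in for PostgreSQL purely so the example runs anywhere.

```python
import sqlite3

# Column types cannot be passed as bound parameters, so they are checked
# against an allowlist instead (assumed set, extend as needed).
ALLOWED_TYPES = {"date", "timestamp", "integer", "text"}

def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def build_create_table(schema, table, columns):
    """Return one CREATE TABLE statement that declares every column up front."""
    parts = []
    for col_name, col_type in columns:
        if col_type not in ALLOWED_TYPES:
            raise ValueError(f"unexpected column type: {col_type!r}")
        parts.append(f"{quote_ident(col_name)} {col_type}")
    return (
        f"CREATE TABLE IF NOT EXISTS {quote_ident(schema)}.{quote_ident(table)} "
        f"({', '.join(parts)});"
    )

columns = [("last_seen", "date"), ("valid_from", "timestamp")]
tables = ["billing", "address", "contact"]

# One statement per table, no ALTER TABLE loop, all sent as a single script.
ddl_script = "\n".join(build_create_table("main", t, columns) for t in tables)

conn = sqlite3.connect(":memory:")
conn.executescript(ddl_script)
created = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(ddl_script)
print(created)
```

The same generated text could equally be written to a .sql file and loaded with psql. psycopg2 generally accepts several semicolon-separated statements in a single execute() call as well, though that is worth confirming against the version in use.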

SQLAlchemy match with or

I'm getting myself tied up in knots with some sqlalchemy I'm trying to work out. I've got an old web app I'm trying to tart up, and have decided to rewrite it from scratch. As part of that, I'm playing with SQL Alchemy and trying to improve my pythonic skills - so I've got a search object I'm trying to run, where I'm checking to see if the customer query exists in either the account name and customer name fields and match against either of them. However SQL Alchemy registers it as an AND
If I add extra or_ blocks, it fails to recognise them and process appropriately.
I've moved it so it's the first query, but the query planner in sqlalchemy leaves it exactly the same.
Any ideas?
def CustomerCountryMatch(query, page):
    customer=models.Customer
    country=models.CustomerCodes
    query=customer.query.order_by(customer.account_name).\
        group_by(customer.account_name).having(func.max(customer.renewal_date)).\
        join(country, customer.country_code==country.CODE).\
        add_columns(customer.account_name,
                    customer.customer_name,
                    customer.account_id,
                    customer.CustomerNote,
                    country.COUNTRY,
                    country.SupportRegion,
                    customer.renewal_date,
                    customer.contract_type,
                    customer.CCGroup).\
        filter(customer.account_name.match(query)).filter(or_(customer.customer_name.match(query))).\
        paginate(page, 50, False)
The query as executed is below:
sqlalchemy.engine.base.Engine SELECT customer.customer_id AS customer_customer_id,
customer.customer_code AS customer_customer_code,
customer.address_code AS customer_address_code,
customer.customer_name AS customer_customer_name,
customer.account_id AS customer_account_id,
customer.account_name AS customer_account_name,
customer.`CustomerNote` AS `customer_CustomerNote`,
customer.renewal_date AS customer_renewal_date,
customer.contract_type AS customer_contract_type,
customer.country_code AS customer_country_code,
customer.`CCGroup` AS `customer_CCGroup`,
customer.`AgentStatus` AS `customer_AgentStatus`,
customer.comments AS customer_comments,
customer.`SCR` AS `customer_SCR`,
customer.`isDummy` AS `customer_isDummy`,
customer_codes.`COUNTRY` AS `customer_codes_COUNTRY`,
customer_codes.`SupportRegion` AS `customer_codes_SupportRegion`
FROM customer INNER JOIN
customer_codes ON customer.country_code=customer_codes.`CODE` WHERE
MATCH (customer.account_name) AGAINST (%s IN BOOLEAN MODE) AND
MATCH (customer.customer_name) AGAINST (%s IN BOOLEAN MODE) GROUP BY
customer.account_name HAVING max(customer.renewal_date) ORDER BY
customer.account_name LIMIT %s, %s
2015-11-06 03:32:52,035 INFO sqlalchemy.engine.base.Engine ('bob', 'bob', 0, 50)
The filter clause should be:
filter(
    or_(
        customer.account_name.match(query),
        customer.customer_name.match(query)
    )
)
Calling filter twice, as in filter(clause1).filter(clause2) joins the criteria using AND (see the docs).
The construct filter(clause1).filter(or_(clause2)) does not do what you intend: or_() with a single argument is just that argument, so the result is translated into SQL as clause1 AND clause2.
The following example makes sense: filter(clause1).filter(or_(clause2, clause3)), which is translated into SQL as clause1 AND (clause2 OR clause3).
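The difference between the two forms can be seen in a small standard-library sketch. It uses sqlite3 and LIKE as a stand-in for MySQL's MATCH ... AGAINST, and the customer rows are invented for illustration; only the AND versus OR combination of the two conditions is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (account_name TEXT, customer_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [
        ("Bob Ltd", "Alice"),      # term appears in account_name only
        ("Acme", "Bob Smith"),     # term appears in customer_name only
        ("Bob Co", "Bob Jones"),   # term appears in both
        ("Zed", "Zoe"),            # term appears in neither
    ],
)

def search(operator, term):
    """Combine the two column conditions with the given boolean operator."""
    pattern = f"%{term}%"
    sql = (
        "SELECT account_name FROM customer "
        f"WHERE account_name LIKE ? {operator} customer_name LIKE ? "
        "ORDER BY account_name"
    )
    return [row[0] for row in conn.execute(sql, (pattern, pattern))]

and_matches = search("AND", "bob")  # what filter(a).filter(b) produces
or_matches = search("OR", "bob")    # what filter(or_(a, b)) produces

print(and_matches)
print(or_matches)
```

With these rows the AND form keeps only the customer where both columns contain the term, while the OR form keeps every customer where either column does.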
A simpler approach, if you want to find all matches that contain one or more of the words you are searching for, is to put the OR inside the match expression itself using the '|' operator, e.g.
query = query.filter(Table.text_searchable_column.match('findme | orme'))

Changing where clause without generating subquery in SQLAlchemy

I'm trying to build a relatively complex query and would like to manipulate the where clause of the result directly, without cloning/subquerying the returned query. An example would look like:
session = sessionmaker(bind=engine)()
def generate_complex_query():
    return select(
        columns=[location.c.id.label('id')],
        from_obj=location,
        whereclause=location.c.id>50
    ).alias('a')
query = generate_complex_query()
# based on this query, I'd like to add additional where conditions, ideally like:
# `query.where(query.c.id<100)`
# but without subquerying the original query
# this is what I found so far, which is quite verbose and it doesn't solve the subquery problem
query = select(
    columns=[query.c.id],
    from_obj=query,
    whereclause=query.c.id<100
)
# Another option I was considering was to map the query to a class:
# class Location(object):pass
# mapper(Location, query)
# session.query(Location).filter(Location.id<100)
# which looks more elegant, but also creates a subquery
result = session.execute(query)
for r in result:
    print(r)
This is the generated query:
SELECT a.id
FROM (SELECT location.id AS id
FROM location
WHERE location.id > %(id_1)s) AS a
WHERE a.id < %(id_2)s
I would like to obtain:
SELECT location.id AS id
FROM location
WHERE id > %(id_1)s and
id < %(id_2)s
Is there any way to achieve this? The reason for this is that I think query (2) is slightly faster (not much), and the mapper example (2nd example above) which I have in place messes up the labels (id becomes anon_1_id or a.id if I name the alias).
Why don't you do it like this:
query = generate_complex_query()
query = query.where(location.c.id < 100)
Essentially you can refine any query like this. Additionally, I suggest reading the SQL Expression Language Tutorial which is pretty awesome and introduces all the techniques you need. The way you build a select is only one way. Usually, I build my queries more like this: select(column).where(expression).where(next_expression) and so on. The FROM is usually automatically inferred by SQLAlchemy from the context, i.e. you rarely need to specify it.
Since you don't have access to the internals of generate_complex_query try this:
query = query.where(query.c.id < 100)
This should work in your case I presume.
Another idea:
query = query.where(text("id < 100"))
This uses SQLAlchemy's text() expression. This could work for you, but one point is important: if you want to introduce variables, read the API documentation for text(), because using format strings instead of bound parameters will open you up to SQL injection. That is normally a non-issue with SQLAlchemy, but it must be taken care of when working with such literal expressions.
Also note that this works because you label the column as id. If you don't do that and don't know the column name, then this won't work either.
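To make the two SQL shapes from the question concrete, and to illustrate the bound-parameter point, here is a minimal standard-library sketch. It uses sqlite3 rather than SQLAlchemy, the location rows are invented, and the parameter names lower and upper are assumptions made for the example. Both statements take their limits as named bound parameters instead of being built with string formatting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO location (id) VALUES (?)", [(i,) for i in range(1, 151)])

# Values travel separately from the SQL text, so nothing is interpolated.
params = {"lower": 50, "upper": 100}

# Shape produced by wrapping the aliased select in another select.
nested_sql = """
    SELECT a.id
    FROM (SELECT location.id AS id
          FROM location
          WHERE location.id > :lower) AS a
    WHERE a.id < :upper
    ORDER BY a.id
"""

# Shape the question asks for: both conditions on the same SELECT.
flat_sql = """
    SELECT location.id AS id
    FROM location
    WHERE location.id > :lower AND location.id < :upper
    ORDER BY location.id
"""

nested_ids = [row[0] for row in conn.execute(nested_sql, params)]
flat_ids = [row[0] for row in conn.execute(flat_sql, params)]

print(nested_ids == flat_ids, len(flat_ids))
```

Both forms select the same rows here, so the choice between them is about readability, labels and whatever the query planner makes of the subquery, not about the result set.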

Delete rows without a related record using SQLAlchemy

I have 2 tables; we'll call them table1 and table2. table2 has a foreign key to table1. I need to delete the rows in table1 that have zero child records in table2. The SQL to do this is pretty straightforward:
DELETE FROM table1
WHERE 0 = (SELECT COUNT(*) FROM table2 WHERE table2.table1_id = table1.table1_id);
However, I haven't been able to find a way to translate this query to SQLAlchemy. Trying the straightforward approach:
subquery = session.query(sqlfunc.count(Table2).label('t2_count')).select_from(Table2).filter(Table2.table1_id == Table1.table1_id).subquery()
session.query(Table1).filter(0 == subquery.columns.t2_count).delete()
Just yielded an error:
sqlalchemy.exc.ArgumentError: Only deletion via a single table query is currently supported
How can I perform this DELETE with SQLAlchemy?
Python 2.7
PostgreSQL 9.2.4
SQLAlchemy 0.7.10 (Cannot upgrade due to using GeoAlchemy, but am interested if newer versions would make this easier)
I'm pretty sure this is what you want. You should try it out though. It uses EXISTS.
from sqlalchemy.sql import not_
# This fetches rows in python to determine which ones were removed.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
synchronize_session='fetch')
# If you will not be referencing more Table1 objects in this session then you
# can just ignore syncing the session.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
synchronize_session=False)
Explanation of the argument for delete():
http://docs.sqlalchemy.org/en/rel_0_8/orm/query.html#sqlalchemy.orm.query.Query.delete
Example with exists(using any() above uses EXISTS):
http://docs.sqlalchemy.org/en/rel_0_8/orm/tutorial.html#using-exists
Here is the SQL that should be generated:
DELETE FROM table1 WHERE NOT (EXISTS (SELECT 1
FROM table2
WHERE table1.id = table2.table1_id))
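As a quick check that the NOT EXISTS form removes the same rows as the COUNT(*) form from the question, here is a small standard-library sketch. It uses sqlite3 in place of PostgreSQL, follows the question's column names (table1_id, table2_id), and the sample rows are invented for illustration.

```python
import sqlite3

def make_db():
    """Build a fresh in-memory database with one childless parent (id 2)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (table1_id INTEGER PRIMARY KEY);
        CREATE TABLE table2 (
            table2_id INTEGER PRIMARY KEY,
            table1_id INTEGER REFERENCES table1 (table1_id)
        );
        INSERT INTO table1 VALUES (1), (2), (3);
        INSERT INTO table2 VALUES (10, 1), (11, 1), (12, 3);
    """)
    return conn

def run_delete(delete_sql):
    """Run the DELETE on a fresh database; return (rows deleted, ids left)."""
    conn = make_db()
    deleted = conn.execute(delete_sql).rowcount
    left = [row[0] for row in conn.execute("SELECT table1_id FROM table1 ORDER BY table1_id")]
    conn.close()
    return deleted, left

count_form = run_delete("""
    DELETE FROM table1
    WHERE 0 = (SELECT COUNT(*) FROM table2 WHERE table2.table1_id = table1.table1_id)
""")

not_exists_form = run_delete("""
    DELETE FROM table1
    WHERE NOT EXISTS (SELECT 1 FROM table2 WHERE table2.table1_id = table1.table1_id)
""")

print(count_form, not_exists_form)
```

Both statements run as a single round trip and never load the rows into Python, which is the main difference from the select-then-delete loop.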
If you are using declarative I think there is a way to access Table2.table and then you could just use the sql layer of sqlalchemy to do exactly what you want. Although you run into the same issue of making your Session out of sync.
Well, I found one very ugly way to do it. You can do a select with a join to get the rows loaded into memory, then you can delete them individually:
subquery = session.query(Table2.table1_id
                         ,sqlalchemy.func.count(Table2.table2_id).label('t1count')
                         ) \
    .select_from(Table2) \
    .group_by(Table2.table1_id) \
    .subquery()
rows = session.query(Table1) \
    .select_from(Table1) \
    .outerjoin(subquery, Table1.table1_id == subquery.c.table1_id) \
    .filter(subquery.c.t1count == None) \
    .all()
for r in rows:
    session.delete(r)
This is not only nasty to write, it's also pretty nasty performance-wise. For starters, you have to bring the table1 rows into memory. Second, if you were like me and had a line like this on Table2's class definition:
table1 = orm.relationship(Table1, backref=orm.backref('table2s'))
then SQLAlchemy will actually perform a query to pull the related table2 rows into memory, too (even though there aren't any). Even worse, because you have to loop over the list (I tried just passing in the list; didn't work), it does so one table1 row at a time. So if you're deleting 10 rows, it's 21 individual queries (1 for the initial select, 1 for each relationship pull, and 1 for each delete). Maybe there are ways to mitigate that; I would have to go through the documentation to see. All this for things I don't even want in my database, much less in memory.
I won't mark this as the answer. I want a cleaner, more efficient way of doing this, but this is all I have for now.
