I have the following query that is attempting to return authors and their article counts:
SELECT (
SELECT COUNT(*)
FROM aldryn_newsblog_article
WHERE
aldryn_newsblog_article.author_id IN (1,2) AND
aldryn_newsblog_article.app_config_id = 1 AND
aldryn_newsblog_article.is_published IS TRUE AND
aldryn_newsblog_article.publishing_date <= now()
) as article_count, aldryn_people_person.*
FROM aldryn_people_person
However, it is currently returning the same number for each author because it counts all articles for authors with IDs of 1 and 2.
How should the query be modified, so it returns proper article counts for each author?
On a separate note, how can one turn the (1,2) into a list that can be spliced into the query dynamically? That is, suppose I have a Python list of author IDs, for which I would like to look up article counts. How could I pass that information to the SQL?
As commented, for the subquery to work you need to correlate it to the outer query, usually by a unique identifier (assumed here to be author_id), which also appears to be the column you filter on in the outer query's WHERE clause. Also, use table aliases for clarity between the subquery and the outer query.
SELECT main.*
, (SELECT COUNT(*)
FROM aldryn_newsblog_article AS sub
WHERE
sub.author_id = main.author_id AND
sub.app_config_id = 1 AND
sub.is_published IS TRUE AND
sub.publishing_date <= now()
) AS article_count
FROM aldryn_people_person AS main
WHERE main.author_id IN (1, 2)
Alternatively, for a more efficient query, have the main query JOIN to an aggregate subquery so the counts are calculated once, instead of re-running the subquery for every row of the outer query. Note that the INNER JOIN drops authors with no matching articles, whereas the correlated subquery above returns 0 for them.
SELECT main.*
     , sub.article_count
FROM aldryn_people_person AS main
INNER JOIN
(SELECT author_id
, COUNT(*) AS article_count
FROM aldryn_newsblog_article AS sub
WHERE
sub.app_config_id = 1 AND
sub.is_published IS TRUE AND
sub.publishing_date <= now()
GROUP BY author_id
) AS sub
ON sub.author_id = main.author_id
AND main.author_id IN (1, 2)
Re your separate note, there are many SO questions like this one that ask about passing a dynamic list to the IN operator. The approach is to build a prepared statement with a dynamic number of parameter placeholders, either ? or %s depending on the Python DB-API driver (e.g., psycopg2, pymysql, pyodbc), and then pass the parameters as the second argument of cursor.execute(). Do note your database's limit on the number of such values.
# BUILD PARAM PLACEHOLDERS
qmarks = ", ".join('?' for _ in list_of_author_ids)
# INTERPOLATE WITH F-STRING (PYTHON 3.6+)
sql = f'''SELECT ...
FROM ....
INNER JOIN ....
AND main.author_id IN ({qmarks})'''
# BIND PARAMS
cursor.execute(sql, list_of_author_ids)
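For illustration, here is how the pieces might fit together with the JOIN version of the query above, assuming a qmark-style driver such as pyodbc and a Python list of author IDs (a sketch; the example IDs are made up, the table and column names follow the question):
list_of_author_ids = [1, 2, 5]                           # the dynamic Python list
qmarks = ", ".join("?" for _ in list_of_author_ids)      # "?, ?, ?"

sql = f"""SELECT main.*, sub.article_count
FROM aldryn_people_person AS main
INNER JOIN
    (SELECT author_id, COUNT(*) AS article_count
     FROM aldryn_newsblog_article
     WHERE app_config_id = 1
       AND is_published IS TRUE
       AND publishing_date <= now()
     GROUP BY author_id
    ) AS sub
    ON sub.author_id = main.author_id
   AND main.author_id IN ({qmarks})"""

# each author id is bound to its own placeholder
cursor.execute(sql, list_of_author_ids)
rows = cursor.fetchall()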
The way I normally handle these sorts of aggregates is to first design a query that gets a list of author names and articles, then create a column to serve as the article count. At the lowest level this looks silly, because every article counts as 1. Then I wrap that in a subquery and sum from it.
SELECT sub.author, articleCount = sum(sub.rowCount)
FROM (
select distinct
author = x.author_id
, article = x.articleTitle
, rowCount = 1
from aldryn_newsblog_article x
where x.author_id in (1,2) and x.is_published = true --whatever other conditions you need here
) sub
GROUP BY sub.author
As far as replacing the (1,2) with something more dynamic, the way I've seen it done before is to use CHARINDEX to parse a comma-separated string in the WHERE clause, so you would have something like
DECLARE @passedFilter VARCHAR(50) = ',1,2,'
SELECT * FROM aldryn_newsblog_article WHERE CHARINDEX(',' + CAST(author_id AS VARCHAR) + ',', @passedFilter, 0) > 0
What this does is take your list of ids (note the leading and trailing commas) and let the query do a pattern match against the key value. I've read that this doesn't give the absolute best performance, but sometimes that isn't the biggest concern. We used this a lot for passing filters from a web app to SQL Server reports. Another method would be to declare a table variable / temp table, populate it with the authors you want to filter for, and then join the subquery from the first part of my answer to that table.
I am working on a SQL analysis tool that, given a RAW SQL SELECT query, can give some sort of analysis. The first version of the tool is finished and can analyze simple RAW queries. However, when the query contains a subquery it breaks.
So I am looking for a simple but reliable way to parse queries and subqueries. My tool must analyze every subquery individually so for example:
Suppose this is the query that the tool is given as input:
SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
Then I would like to get a list of queries like so:
queries = [
    "SELECT name, email FROM <subquery> WHERE id IN <subquery>",
    "SELECT * FROM user WHERE email IS NOT NULL",
    "SELECT cID FROM customer WHERE points > 5"
]
In my first attempt, I am using the fact that subqueries are always written between brackets, so I scan the initial query for brackets. This works when subqueries aren't nested, i.e. there are no subqueries inside subqueries. I also experimented a bit with an AST, but felt that it was probably too complicated and that there are probably simpler ways.
Can anyone point me in the right direction? I am using Python, but examples in other languages are also much appreciated.
You can use sqlparse:
import sqlparse

def queries(d):
    # d is a leaf Token or a grouped TokenList (Statement, Parenthesis, ...)
    if type(d) != sqlparse.sql.Token:
        paren = isinstance(d, sqlparse.sql.Parenthesis)
        # recurse into the child tokens, skipping the enclosing '(' and ')' of a Parenthesis
        v = [queries(i) for i in (d if not paren else d[1:-1])]
        # subseq: this node's text with subqueries collapsed; qrs: the collected subqueries
        subseq, qrs = ''.join(str(i[0]) for i in v), [x for _, y in v for x in y]
        if [*d][paren].value == 'SELECT':
            # this node is itself a SELECT: report it as <subquery> and record its text
            return '<subquery>', [subseq] + qrs
        return subseq, qrs
    return d, []
s="""SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
"""
_, subqueries = queries(sqlparse.parse(s)[0])
Output:
['SELECT name, email\n FROM <subquery>\n WHERE id IN <subquery>\n', 'SELECT * FROM user WHERE email IS NOT NULL', 'SELECT cID FROM customer WHERE points > 5']
Using the sqlparse library, you can parse a SQL input string into a tokenized stream of keywords, statements, and values. The function queries above takes in a sqlparse.sql.Statement object and searches for any occurrence of a SELECT statement in the query, reformatting the original input along the way to remove subqueries, per the desired output sample.
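Because the function recurses into every Parenthesis group, nested subqueries should be extracted as well, each with its own deeper subqueries collapsed to <subquery>. A quick check, reusing queries from above (the nested statement here is just an illustration):
nested = """SELECT name
FROM (SELECT * FROM (SELECT * FROM user WHERE email IS NOT NULL) WHERE id > 10)
"""
_, subs = queries(sqlparse.parse(nested)[0])
for q in subs:
    print(q)
# subs[0] is the outer query with its FROM clause collapsed to <subquery>;
# the later entries are the nested SELECTs, outermost first, innermost last.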
I am using Python and PyMySQL. I want to fetch a number of items from a MySQL database according to their ids:
items_ids = tuple([3, 2])
sql = f"SELECT * FROM items WHERE item_id IN {items_ids};"
I am using formatted string literals (f-strings, https://docs.python.org/3/whatsnew/3.6.html#whatsnew36-pep498) to evaluate the tuple inside the SQL statement.
However, I want to get the items back in the order specified by the tuple, so first the item with item_id = 3 and then the item with item_id = 2. To accomplish this I have to use the ORDER BY FIELD clause (see also here: Ordering by the order of values in a SQL IN() clause).
But if I write something like this:
items_ids = tuple([3, 2])
sql = f"SELECT * FROM items WHERE item_id IN {items_ids} ORDER BY FIELD{(item_id,) + items_ids};"
then item_id in the ORDER BY FIELD clause is treated as an undeclared Python variable,
and if I write something like this:
items_ids = tuple([3, 2])
sql = f"SELECT * FROM items WHERE item_id IN {items_ids} ORDER BY FIELD{('item_id',) + items_ids};"
then item_id in the ORDER BY FIELD clause is treated as a string literal and not as a column reference, and in this case ORDER BY FIELD does not do anything.
How can I evaluate the tuple (item_id,) + items_ids in the SQL statement while keeping item_id as a column reference in the ORDER BY FIELD clause?
Obviously I can sort the items according to items_ids after they have been returned from the database, without bothering MySQL with it, but I was just wondering how to do this.
Please don't use f-strings, or any string formatting, for passing values to SQL queries. That's the road to SQL injection. Now you may be thinking: "it's a tuple of integers, what could possibly go wrong?" First of all, a single-element Python tuple's string representation is not valid SQL. Secondly, someone may follow the example with user-controllable data other than tuples of ints (so having these bad examples online perpetuates the habit). Finally, the reason you have to resort to your "cunning" solution in the first place is that you're using the wrong tool for the job.
The correct way to pass values to SQL queries is to use placeholders. In the case of pymysql the placeholder is – a bit confusingly – %s. Don't mix it up with manual %-formatting. When you have to pass a variable number of values to a query, you do have to resort to some string building, but you build the placeholders, not the values:
item_ids = (3, 2)
item_placeholders = ', '.join(['%s'] * len(item_ids))
sql = f"""SELECT * FROM items
WHERE item_id IN ({item_placeholders})
ORDER BY FIELD(item_id, {item_placeholders})"""
# Produces:
#
# SELECT * FROM items
# WHERE item_id IN (%s, %s)
# ORDER BY FIELD(item_id, %s, %s)
with conn.cursor() as cur:
    # Build the argument tuple
    cur.execute(sql, (*item_ids, *item_ids))
    res = cur.fetchall()
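As a side note, the same placeholder pattern copes with a single id, which is exactly where str(tuple) falls over, as mentioned above (a small sketch reusing the names from this answer):
item_ids = (3,)                                          # a one-element tuple
item_placeholders = ', '.join(['%s'] * len(item_ids))    # just "%s"
sql = f"""SELECT * FROM items
WHERE item_id IN ({item_placeholders})
ORDER BY FIELD(item_id, {item_placeholders})"""
# valid SQL with one bound value, whereas str((3,)) would have produced the invalid "IN (3,)"
with conn.cursor() as cur:
    cur.execute(sql, (*item_ids, *item_ids))
    res = cur.fetchall()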
Another, simpler way to avoid the single-element tuple problem is to check the length of the list and, when it has only one element, build the parenthesized value yourself rather than passing it as a tuple, e.g.:
if len(get_version_list[1]) == 1:
    port_id = str(port_id[0])
    port_id = '(' + "'" + port_id + "'" + ')'
else:
    port_id = tuple(port_id)
pd.read_sql(sql=get_version_str.format(port_id, src_cd), con=conn)
With the above code you won't run into the (item_id,) single-element tuple error in the SQL any more. :)
A solution with .format() is the following:
items_ids = tuple([3, 2])
items_placeholders = ', '.join(['{}'] * len(items_ids))
sql = "SELECT * FROM items WHERE item_id IN {} ORDER BY FIELD(item_id, {});".format(items_ids, items_placeholders).format(*items_ids)
# with `.format(items_ids, items_placeholders)` you get this: SELECT * FROM items WHERE item_id IN (3, 2) ORDER BY FIELD(item_id, {}, {});
# and then with `.format(*items_ids)` you get this: SELECT * FROM items WHERE item_id IN (3, 2) ORDER BY FIELD(item_id, 3, 2);
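One caveat (echoing the accepted answer above): a single-element tuple breaks the first .format() call, because Python keeps the trailing comma in the tuple's string form:
one_id = tuple([3])
print("SELECT * FROM items WHERE item_id IN {};".format(one_id))
# SELECT * FROM items WHERE item_id IN (3,);  <- the trailing comma is not valid SQL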
A rather tricky solution with f-strings is the following:
sql1 = f"SELECT * FROM items WHERE item_id IN {items_ids} ORDER BY FIELD(item_id, "
sql2 = f"{items_ids};"
sql = sql1 + sql2[1:]
# SELECT * FROM items WHERE item_id IN (3, 2) ORDER BY FIELD(item_id, 3, 2);
But as @Ilja mentions, I may get SQL injection with it, and IN {items_ids} also cannot accommodate one-element tuples (their string form keeps a trailing comma that is not valid SQL).
Additionally, using f-strings to unpack tuples in strings is perhaps more awkward than using .format(), as others have mentioned before (Formatted string literals in Python 3.6 with tuples), since you cannot use * to unpack a tuple within an f-string. However, perhaps there is a solution for this (using an iterator?) that produces
sql = f"SELECT * FROM items WHERE item_id IN ({t[0]}, {t[1]}) ORDER BY FIELD(item_id, {t[0]}, {t[1]});"
even though I do not have such a solution in mind right now. You are welcome to post one if you do.
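For what it's worth, one way to get there with an f-string is to join the ids into a string first; note that this still interpolates values into the SQL (so the injection warning above applies) and is shown purely for illustration:
items_ids = (3, 2)
ids_csv = ', '.join(str(i) for i in items_ids)  # "3, 2" (also fine for a single id)
sql = f"SELECT * FROM items WHERE item_id IN ({ids_csv}) ORDER BY FIELD(item_id, {ids_csv});"
# SELECT * FROM items WHERE item_id IN (3, 2) ORDER BY FIELD(item_id, 3, 2);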
I'm having trouble converting this SQL query into a SQL Alchemy query:
query = """
SELECT i.case_num,
to_char(i.date_time, 'FMMonth FMDD, YYYY'),
to_char(i.date_time, 'HH24:MI'),
i.incident_type,
i.incident_cat,
i.injury,
i.property_damage,
i.description,
i.root_cause,
a.corrective_action,
a.due_date,
i.user_id
FROM incident as i, action_items as a
WHERE i.case_num = a.case_id AND i.case_num = %s;
"""
I have tried the following but have received nothing but errors:
sqlalchemy.orm.exc.NoResultFound: No row was found for one()
results = dbsession.query(Incidents.case_num,
func.to_char(Incidents.date_time, 'FMMonth FMDD, YYYY'),
func.to_char(Incidents.date_time, 'HH24:MI'),
Incidents.incident_type,
Incidents.incident_cat,
Incidents.injury,
Incidents.property_damage,
Incidents.description,
Incidents.root_cause,
Actions.corrective_action,
Actions.due_date,
Incidents.user_id).join(Actions).filter_by(case_id = id).one()
AttributeError: mapper
results = dbsession.query(Incidents.case_num,
func.to_char(Incidents.date_time, 'FMMonth FMDD, YYYY'),
func.to_char(Incidents.date_time, 'HH24:MI'),
Incidents.incident_type,
Incidents.incident_cat,
Incidents.injury,
Incidents.property_damage,
Incidents.description,
Incidents.root_cause,
Incidents.user_id).join(Actions.corrective_action, Actions.due_date).filter_by(case_id = id).one()
I figure I can do two separate queries but would rather figure out how to perform one join query instead.
You shouldn't need to specify a join explicitly to get SQLAlchemy to generate the statement you want.
Also (my opinion): avoid using filter_by.
In this case filter_by is not smart enough to realize that id should refer to a column on Incidents; because id is a built-in function, the name resolves to that instead. filter_by (see the source) accepts WHERE conditions as keyword arguments, unpacks them treating the keys as columns to be looked up (but not the values), and then calls the filter method with all the conditions conjoined.
relevant bit of code:
def filter_by(self, **kwargs):
    clauses = [_entity_descriptor(self._joinpoint_zero(), key) == value
               for key, value in kwargs.items()]
    return self.filter(sql.and_(*clauses))
If id were provided as the keyword, i.e. on the left-hand side,
stmt = dbsession.query(...).join(...).filter_by(id = 123)
The statement would compile. However, the following would not compile
stmt = dbsession.query(...).join(...).filter_by(id = case_id)
because case_id is not a variable in scope.
And, the OP's version
stmt = dbsession.query(...).join(...).filter_by(case_id = id)
can resolve case_id properly, sees that there is something in the current scope named id (the built-in), and tries to use it as the value.
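A minimal illustration of the name resolution, reusing Incidents and dbsession from the question (case_number is a made-up local variable):
case_number = 123  # an ordinary local variable

# fine: the keyword *value* is plain Python, so the local variable becomes the bind value
dbsession.query(Incidents).filter_by(case_num=case_number)

# also valid Python, but here id is the built-in function, not the value you meant
dbsession.query(Incidents).filter_by(case_num=id)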
This should do what you want:
results = dbsession.query(
    Incidents.case_num,
    func.to_char(Incidents.date_time, 'FMMonth FMDD, YYYY'),
    func.to_char(Incidents.date_time, 'HH24:MI'),
    Incidents.incident_type,
    Incidents.incident_cat,
    Incidents.injury,
    Incidents.property_damage,
    Incidents.description,
    Incidents.root_cause,
    Actions.corrective_action,
    Actions.due_date,
    Incidents.user_id
).filter(
    Actions.case_id == Incidents.case_num
).filter(
    Incidents.case_num == 123
).one()
# ^ here's how one would add multiple filters to a query
FYI, you can save query objects and inspect them, like this:
stmt = dbsession.query(...).filter(...)
print(stmt)
And then fetch the results with
stmt.one()
# or stmt.first() or stmt.all() or ...
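Incidentally, the NoResultFound in the question comes from .one() itself: it raises when the filters match no rows. If "no match" is a legitimate outcome, .one_or_none() or .first() may be friendlier (a sketch reusing stmt from above):
row = stmt.one_or_none()  # returns None instead of raising NoResultFound
if row is None:
    # no incident/action pair matched the given case number
    ...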
I want to add another condition to this WHERE clause:
stmt = 'SELECT account_id FROM asmithe.data_hash WHERE percent < {};'.format(threshold)
I have the variable juris, which is a list. The values of account_id and juris are related: when an account_id is created, it contains one of the juris values as a substring.
I want to add to the query the condition that account_id needs to match any one of the juris elements. Normally I would just add ...AND account_id LIKE '{}%'".format(juris), but this doesn't work because juris is a list.
How do I add all elements of a list to the WHERE clause?
Use a regex with the ~ operator:
juris = ['2','7','8','3']
'select * from tbl where id ~ \'^({})\''.format('|'.join(juris))
which leads to this query:
select * from tbl where id ~ '^(2|7|8|3)'
This returns the rows whose id starts with any of 2, 7, 8 or 3. Here is a fiddle for it.
If you want ids that start with 2783, use:
select * from tbl where id ~ '^2783'
and if the id contains any of 2, 7, 8 or 3 anywhere:
select * from t where id ~ '.*(2|7|8|3).*'
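If you take this route, the pattern itself can still be passed as a bound parameter instead of being formatted into the SQL (a sketch assuming psycopg2 and an open cursor named cur, using the question's table and column):
juris = ['2', '7', '8', '3']
pattern = '^({})'.format('|'.join(juris))  # '^(2|7|8|3)'
cur.execute(
    'SELECT account_id FROM asmithe.data_hash WHERE account_id ~ %s',
    (pattern,),
)
rows = cur.fetchall()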
Stop using string formatting with SQL. Right now. Understand?
OK now. There's a construct, ANY in SQL, that lets you take an operator and apply it to an array. psycopg2 supports passing a Python list as an SQL ARRAY[]. So in this case you can just
curs.execute('SELECT account_id FROM asmithe.data_hash WHERE account_id LIKE ANY (%s)', (thelist,))
Note here that %s is the psycopg2 query-parameter placeholder. It's not actually a format specifier. The second argument is a tuple, the query parameters. The first (and only) parameter is the list.
There's also ALL, which works like ANY but is true only if all the matches are true, not just if one or more is true.
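Since the question is about prefix matches, each list element needs the trailing % wildcard before being passed; a sketch (the juris values are made up, psycopg2 adapts the Python list to a PostgreSQL array, and threshold is the question's original variable):
juris = ['12', '34']                   # example jurisdiction prefixes
patterns = [j + '%' for j in juris]    # LIKE patterns: '12%', '34%'
curs.execute(
    'SELECT account_id FROM asmithe.data_hash '
    'WHERE percent < %s AND account_id LIKE ANY (%s)',
    (threshold, patterns),
)
rows = curs.fetchall()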
I am hoping juris is a list of strings? If so, this might help:
myquery = ("SELECT account_id FROM asmithe.data_hash "
           "WHERE account_id IN (%s)" % ",".join(map(str, juris)))
See these links:
python list in sql query as parameter
How to select item matching Only IN List in sql server
String formatting operations
I have a sql query as follows
select column1, column2, count(column1) as c
from Table1 where user_id='xxxxx' and timestamp > xxxxxx
group by column1, column2
order by c desc limit 1;
And I succeeded in writing the SQLAlchemy equivalent:
result = session.query(Table1.field1, Table1.field2, func.count(Table1.field1)).filter(
    Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
    Table1.field1, Table1.field2).order_by(desc(func.count(Table1.field1))).first()
But I want to avoid using func.count(Table1.field1) in the order_by clause.
How can I use alias in sqlalchemy? Can any one show any example?
Aliases are for tables; columns in a query are given a label instead. This trips me up from time to time too.
You can go about this two ways. It is sufficient to store the func.count() result in a local variable first and reuse that:
field1_count = func.count(Table1.field1)
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
    Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
    Table1.field1, Table1.field2).order_by(desc(field1_count)).first()
The SQL produced would still be the same as your own code would generate, but at least you don't have to type out the func.count() call twice.
To give this column an explicit label, call the .label() method on it:
field1_count = func.count(Table1.field1).label('c')
and you can then use that same label string in the order_by clause:
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
    Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
    Table1.field1, Table1.field2).order_by(desc('c')).first()
or you could use the field1_count.name attribute:
result = session.query(Table1.field1, Table1.field2, field1_count).filter(
    Table1.user_id == self.user_id).filter(Table1.timestamp > self.from_ts).group_by(
    Table1.field1, Table1.field2).order_by(desc(field1_count.name)).first()
You could also go through the table's .c collection (an alias for its columns attribute), but in this case a label works fine, as stated.
I'll also point out that filter doesn't need to be called multiple times; you can pass comma-separated criteria.
result = (session.query(Table1.field1, Table1.field2,
                        func.count(Table1.field1).label('total'))
          .filter(Table1.user_id == self.user_id, Table1.timestamp > self.from_ts)
          .group_by(Table1.field1, Table1.field2)
          .order_by(desc('total')).first())