I am working on a SQL analysis tool that, given a RAW SQL SELECT query, can give some sort of analysis. The first version of the tool is finished and can analyze simple RAW queries. However, when the query contains a subquery it breaks.
So I am looking for a simple but reliable way to parse queries and subqueries. My tool must analyze every subquery individually so for example:
Suppose this is the query that the tool is given as input:
SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
Then I would like to get a list of queries like so:
queries = [
"SELECT name, email FROM <subquery> WHERE id IN <subquery>",
"SELECT * FROM user WHERE email IS NOT NULL",
"SELECT cID FROM customer WHERE points > 5"
]
In my first attempt, I used the fact that subqueries are always written between brackets, so I scan the initial query for brackets. This works when subqueries aren't nested, i.e. there are no subqueries inside subqueries. I also experimented a bit with an AST, but felt that it was probably too complicated and that there are probably simpler ways.
Anyone who's able to guide me in the right direction? I am using Python, but examples in other languages are also much appreciated.
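For the nesting problem specifically, the bracket-scanning idea from the question can be extended with a depth counter. A minimal sketch (it ignores parentheses inside string literals and comments, so it is not a substitute for a real parser):

```python
def extract_parenthesized(sql):
    """Return the contents of each top-level '(...)' span, tracking nesting depth."""
    spans, depth, start = [], 0, None
    for i, ch in enumerate(sql):
        if ch == '(':
            if depth == 0:
                start = i          # remember where the outermost group opened
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth == 0 and start is not None:
                spans.append(sql[start + 1:i])
                start = None
    return spans

q = "SELECT a FROM (SELECT b FROM (SELECT c FROM t))"
print(extract_parenthesized(q))  # ['SELECT b FROM (SELECT c FROM t)']
```

Recursing on each returned span then yields the nested subqueries one level at a time.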
You can use sqlparse:
import sqlparse
def queries(d):
    if type(d) != sqlparse.sql.Token:
        # Skip the enclosing "(" and ")" when descending into a parenthesized group
        paren = isinstance(d, sqlparse.sql.Parenthesis)
        v = [queries(i) for i in (d if not paren else d[1:-1])]
        # Reassemble this level's SQL text and gather subqueries found below
        subseq, qrs = ''.join(str(i[0]) for i in v), [x for _, y in v for x in y]
        if [*d][paren].value == 'SELECT':
            # This group is itself a SELECT: emit a placeholder and record the subquery
            return '<subquery>', [subseq] + qrs
        return subseq, qrs
    return d, []
s="""SELECT name, email
FROM (SELECT * FROM user WHERE email IS NOT NULL)
WHERE id IN (SELECT cID FROM customer WHERE points > 5)
"""
_, subqueries = queries(sqlparse.parse(s)[0])
Output:
['SELECT name, email\n FROM <subquery>\n WHERE id IN <subquery>\n', 'SELECT * FROM user WHERE email IS NOT NULL', 'SELECT cID FROM customer WHERE points > 5']
Using the sqlparse library, you can parse a SQL input string into a tokenized stream of keywords, statements, and values. The function queries above takes in a sqlparse.sql.Statement object and searches for any occurrence of a SELECT statement in the query, reformatting the original input along the way to remove subqueries, per the desired output sample.
I have the following query that is attempting to return authors and their article counts:
SELECT (
    SELECT COUNT(*)
    FROM aldryn_newsblog_article
    WHERE
        aldryn_newsblog_article.author_id IN (1,2) AND
        aldryn_newsblog_article.app_config_id = 1 AND
        aldryn_newsblog_article.is_published IS TRUE AND
        aldryn_newsblog_article.publishing_date <= now()
) AS article_count, aldryn_people_person.*
FROM aldryn_people_person
However, it currently returns the same number for each author, because it counts all articles by authors with IDs 1 and 2.
How should the query be modified, so it returns proper article counts for each author?
On a separate note, how can one turn the (1,2) into a list that can be spliced into the query dynamically? That is, suppose I have a Python list of author IDs, for which I would like to look up article counts. How could I pass that information to the SQL?
As commented, for a subquery to work you need to correlate it to the outer query, usually on a unique identifier (assumed here to be author_id), which also appears to be used as a filter condition in the WHERE of the outer query. Also, use table aliases for clarity between the subquery and the outer query.
SELECT main.*
     , (SELECT COUNT(*)
        FROM aldryn_newsblog_article AS sub
        WHERE
            sub.author_id = main.author_id AND
            sub.app_config_id = 1 AND
            sub.is_published IS TRUE AND
            sub.publishing_date <= now()
       ) AS article_count
FROM aldryn_people_person AS main
WHERE main.author_id IN (1, 2)
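To make the correlation concrete, here is a minimal runnable sketch of the same pattern using sqlite3 with toy tables (table and column names are simplified stand-ins for the ones above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (author_id INTEGER, name TEXT);
    CREATE TABLE article (author_id INTEGER);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bo');
    INSERT INTO article VALUES (1), (1), (2);
""")
# The inner COUNT(*) is re-evaluated per outer row, correlated on author_id
rows = conn.execute("""
    SELECT main.name,
           (SELECT COUNT(*)
            FROM article AS sub
            WHERE sub.author_id = main.author_id) AS article_count
    FROM person AS main
    WHERE main.author_id IN (1, 2)
    ORDER BY main.author_id
""").fetchall()
print(rows)  # [('Ann', 2), ('Bo', 1)]
```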
Alternatively, for a more efficient query, have the main query JOIN to an aggregate subquery, so the counts are calculated once rather than the subquery being re-run for every row of the outer query.
SELECT main.*
     , sub.article_count
FROM aldryn_people_person AS main
INNER JOIN
    (SELECT author_id
          , COUNT(*) AS article_count
     FROM aldryn_newsblog_article
     WHERE
         app_config_id = 1 AND
         is_published IS TRUE AND
         publishing_date <= now()
     GROUP BY author_id
    ) AS sub
    ON sub.author_id = main.author_id
    AND main.author_id IN (1, 2)
Re your separate note: there are many SO questions like this one that ask for a dynamic list in the IN operator. The approach is to create a prepared statement with a dynamic number of parameter placeholders, either ? or %s depending on the Python DB-API driver (e.g., psycopg2, pymysql, pyodbc), then pass the parameters as the second argument of cursor.execute(). Do note your database's limit on the number of such values.
# BUILD PARAM PLACEHOLDERS (one per id)
qmarks = ", ".join("?" for _ in list_of_author_ids)

# INTERPOLATE WITH F-STRING (PYTHON 3.6+)
sql = f'''SELECT ...
          FROM ....
          INNER JOIN ....
          AND main.author_id IN ({qmarks})'''

# BIND PARAMS
cursor.execute(sql, list_of_author_ids)
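Putting it together, a self-contained sketch of the placeholder-per-id pattern using sqlite3 and a toy table (sqlite3 uses qmark style; substitute %s for psycopg2/pymysql):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (author_id INTEGER)")
conn.executemany("INSERT INTO article VALUES (?)", [(1,), (1,), (2,), (3,)])

author_ids = [1, 2]
qmarks = ", ".join("?" for _ in author_ids)  # one placeholder per id
sql = (f"SELECT author_id, COUNT(*) FROM article "
       f"WHERE author_id IN ({qmarks}) "
       f"GROUP BY author_id ORDER BY author_id")
rows = conn.execute(sql, author_ids).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

Only the placeholders are interpolated into the SQL string; the values themselves are always bound by the driver.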
The way I normally handle these sorts of aggregates is to first design a query that gets a list of author names and articles, then create a column to serve as the article count. At the lowest level this looks silly, because every article is 1. Then I wrap that in a subquery and sum from it.
SELECT sub.author, articleCount = sum(sub.rowCount)
FROM (
    select distinct
        author = x.author_id
        , article = x.articleTitle
        , rowCount = 1
    from aldryn_newsblog_article x
    where x.author_id in (1,2) and x.is_published = true --whatever other conditions you need here
) sub
GROUP BY sub.author
As far as replacing the (1,2) with something more dynamic, the way I've seen it done before is to use CHARINDEX to parse a comma-separated string in the WHERE clause, so you would have something like
DECLARE @passedFilter VARCHAR(50) = ',1,2,'
SELECT * FROM aldryn_newsblog_article WHERE CHARINDEX(',' + CAST(author_id AS VARCHAR) + ',', @passedFilter, 0) > 0
What this does is take your list of ids (note the leading and trailing commas) and let the query do a pattern match on it against the key value. I've read that this doesn't give the absolute best performance, but sometimes that isn't the biggest concern. We used this a lot when passing filters from a web app to SQL Server reports. Another method would be to declare a table variable / temp table, populate it somehow with the authors you want to filter for, then join the subquery from the first bit of my answer to that table.
I'm trying to execute a raw sql query and safely pass an order by/asc/desc based on user input. This is the back end for a paginated datagrid. I cannot for the life of me figure out how to do this safely. Parameters get converted to strings so Oracle can't execute the query. I can't find any examples of this anywhere on the internet. What is the best way to safely accomplish this? (I am not using the ORM, must be raw sql).
My workaround is just setting ASC/DESC to a variable that I set. This works fine and is safe. However, how do I bind a column name to the ORDER BY? Is that even possible? I can just whitelist a bunch of columns and do something similar as I do with the ASC/DESC. I was just curious if there's a way to bind it. Thanks.
@default.route('/api/barcodes/<sort_by>/<sort_dir>', methods=['GET'])
@json_enc
def fetch_barcodes(sort_by, sort_dir):
    #time.sleep(5)
    # Can't use sort_dir as a parameter, so assign to variable to sanitize it
    ord_dir = "DESC" if sort_dir.lower() == 'desc' else 'ASC'
    records = []
    stmt = text("SELECT bb_request_id,bb_barcode,bs_status, "
                "TO_CHAR(bb_rec_cre_date, 'MM/DD/YYYY') AS bb_rec_cre_date "
                "FROM bars_barcodes,bars_status "
                "WHERE bs_status_id = bb_status_id "
                "ORDER BY :ord_by :ord_dir ")
    stmt = stmt.bindparams(ord_by=sort_by, ord_dir=ord_dir)
    rs = db.session.execute(stmt)
    records = [dict(zip(rs.keys(), row)) for row in rs]
DatabaseError: (cx_Oracle.DatabaseError) ORA-01036: illegal variable name/number
[SQL: "SELECT bb_request_id,bb_barcode,bs_status, TO_CHAR(bb_rec_cre_date, 'MM/DD/YYYY') AS bb_rec_cre_date FROM bars_barcodes,bars_status WHERE bs_status_id = bb_status_id ORDER BY :ord_by :ord_dir "] [parameters: {'ord_by': u'bb_rec_cre_date', 'ord_dir': 'ASC'}]
UPDATE: Solution based on the accepted answer:
def fetch_barcodes(sort_by, sort_dir, page, rows_per_page):
    ord_dir_func = desc if sort_dir.lower() == 'desc' else asc
    query_limit = int(rows_per_page)
    query_offset = (int(page) - 1) * query_limit
    stmt = select([column('bb_request_id'),
                   column('bb_barcode'),
                   column('bs_status'),
                   func.to_char(column('bb_rec_cre_date'), 'MM/DD/YYYY').label('bb_rec_cre_date')]).\
        select_from(table('bars_barcodes')).\
        select_from(table('bars_status')).\
        where(column('bs_status_id') == column('bb_status_id')).\
        order_by(ord_dir_func(column(sort_by))).\
        limit(query_limit).offset(query_offset)
    result = db.session.execute(stmt)
    records = [dict(row) for row in result]
    response = json_return()
    response.addRecords(records)
    #response.setTotal(len(records))
    response.setTotal(1001)
    response.setSuccess(True)
    response.addMessage("Records retrieved successfully. Limit: " + str(query_limit) +
                        ", Offset: " + str(query_offset) + " SQL: " + str(stmt))
    return response
You could use Core constructs such as table() and column() for this instead of raw SQL strings. That'd make your life easier in this regard:
from sqlalchemy import select, table, column, asc, desc, func
ord_dir = desc if sort_dir.lower() == 'desc' else asc
stmt = select([column('bb_request_id'),
               column('bb_barcode'),
               column('bs_status'),
               func.to_char(column('bb_rec_cre_date'),
                            'MM/DD/YYYY').label('bb_rec_cre_date')]).\
    select_from(table('bars_barcodes')).\
    select_from(table('bars_status')).\
    where(column('bs_status_id') == column('bb_status_id')).\
    order_by(ord_dir(column(sort_by)))
table() and column() represent the syntactic part of a full blown Table object with Columns and can be used in this fashion for escaping purposes:
The text handled by column() is assumed to be handled like the name of a database column; if the string contains mixed case, special characters, or matches a known reserved word on the target backend, the column expression will render using the quoting behavior determined by the backend.
Still, whitelisting might not be a bad idea.
Note that you don't need to manually zip() the row proxies in order to produce dictionaries. They act as mappings as is, and if you need dict() for serialization reasons or such, just do dict(row).
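As a sketch of the whitelisting idea (the column names below are the ones from the question; adjust to your schema):

```python
# Whitelist of columns users may sort by; anything else is rejected outright.
ALLOWED_SORT_COLUMNS = {"bb_request_id", "bb_barcode", "bs_status", "bb_rec_cre_date"}

def safe_order_by(sort_by, sort_dir):
    """Validate user-supplied sort column/direction and build an ORDER BY clause."""
    if sort_by not in ALLOWED_SORT_COLUMNS:
        raise ValueError(f"unsupported sort column: {sort_by!r}")
    direction = "DESC" if sort_dir.lower() == "desc" else "ASC"
    return f"ORDER BY {sort_by} {direction}"

print(safe_order_by("bb_barcode", "desc"))  # ORDER BY bb_barcode DESC
```

Because the interpolated identifier can only ever be one of the whitelisted names, this stays safe even though it uses string formatting.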
I'm currently building SQL queries depending on input from the user. An example how this is done can be seen here:
def generate_conditions(table_name, nameValues):
    sql = u""
    for field in nameValues:
        sql += u" AND {0}.{1}='{2}'".format(table_name, field, nameValues[field])
    return sql
search_query = u"SELECT * FROM Enheter e LEFT OUTER JOIN Handelser h ON e.Id == h.Enhet WHERE 1=1"
if "Enhet" in args:
    search_query += generate_conditions("e", args["Enhet"])
c.execute(search_query)
Since the SQL changes every time I cannot insert the values in the execute call which means that I should escape the strings manually. However, when I search everyone points to execute...
I'm also not that satisfied with how I generate the query, so if someone has any idea for another way that would be great also!
You have two options:
Switch to using SQLAlchemy; it'll make generating dynamic SQL a lot more pythonic and ensures proper quoting.
Since you cannot use parameters for table and column names, you'll still have to use string formatting to include these in the query. Your values on the other hand, should always be using SQL parameters, if only so the database can prepare the statement.
It's not advisable to just interpolate table and column names taken straight from user input, it's far too easy to inject arbitrary SQL statements that way. Verify the table and column names against a list of such names you accept instead.
So, to build on your example, I'd go in this direction:
tables = {
'e': ('unit1', 'unit2', ...), # tablename: tuple of column names
}
def generate_conditions(table_name, nameValues):
    if table_name not in tables:
        raise ValueError('No such table %r' % table_name)
    sql = u""
    params = []
    for field in nameValues:
        if field not in tables[table_name]:
            raise ValueError('No such column %r' % field)
        sql += u" AND {0}.{1}=?".format(table_name, field)
        params.append(nameValues[field])
    return sql, params
search_query = u"SELECT * FROM Enheter e LEFT OUTER JOIN Handelser h ON e.Id == h.Enhet WHERE 1=1"
search_params = []
if "Enhet" in args:
    sql, params = generate_conditions("e", args["Enhet"])
    search_query += sql
    search_params.extend(params)
c.execute(search_query, search_params)
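To make the generated output concrete, here is a self-contained run of the same pattern with an invented whitelist (the table key and column names are hypothetical):

```python
# Hypothetical whitelist: table alias -> tuple of allowed column names.
tables = {"e": ("Enhet", "Namn")}

def generate_conditions(table_name, name_values):
    """Build an AND-chain of conditions with placeholders, rejecting unknown names."""
    if table_name not in tables:
        raise ValueError("No such table %r" % table_name)
    sql, params = "", []
    for field, value in name_values.items():
        if field not in tables[table_name]:
            raise ValueError("No such column %r" % field)
        sql += " AND {0}.{1}=?".format(table_name, field)  # identifier is whitelisted
        params.append(value)                               # value stays a bound parameter
    return sql, params

sql, params = generate_conditions("e", {"Namn": "Lampa"})
print(sql)     # " AND e.Namn=?"
print(params)  # ['Lampa']
```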
I hope this has not been asked previously, I was not sure what keywords to use.
Suppose I want to write a function that can take a less than or equal to statement for a query...
import MySQLdb

def query1(date, le):
    '''
    query1('2013-01', <=)
    >>> 10
    '''
    query = '''
        select *
        from table
        where number {x} 1
        and date = {dt}
        '''.format(dt=date, x=le)
    cursor.execute(query)
    rslt = cursor.fetchall()
    return rslt
Then what is the best way to do this?
You can just pass the comparison operator as a string to your function:
query1('2013-01', '<=')
This will insert the string for the operator into the query, resulting in
select *
from table
where number <= 1
and date = 2013-01
Please note that directly building SQL queries by inserting strings is a potential vector for SQL injection. If you allow users to supply their own date strings, a user could inject SQL and run malicious code. Look into query parameterisation for more information.
If you wanted to guard against SQL injection, you should do something like the following. The allowed operators list is carefully whitelisted, so only valid and safe operators can be used. This is used to build the query. The date is then injected in to the query by the cursor.execute() command. MySQLdb then handles constructing a safe query from your data, and will not allow a malicious user to inject their own SQL in place of the date string.
import MySQLdb

def query1(date, comp):
    query = '''
        select *
        from table
        where number {comp} 1
        and date = %s
        '''.format(comp=sql_comp_operator(comp))
    cursor.execute(query, (date,))
    return cursor.fetchall()
def sql_comp_operator(comp):
    operators = {
        'lt': '<',
        'lte': '<=',
        'gt': '>',
        'gte': '>=',
    }
    if comp in operators:
        return operators[comp]
    else:
        raise ValueError("Unknown comparison operator '{}'".format(comp))
query1('2013-01', 'lte')
Ideally, you'd want to use an ORM to prevent SQL injection attacks (I prefer SQLAlchemy/Elixir) which would let you do stuff like:
q = session.query(User).\
    filter(User.id <= 1).\
    filter(User.date_of_birth == date)
Sounds like you want "le" to be a function/lambda that you can pass in, but I don't know of any way to convert that lambda to a string for putting in your query. For example, you could call it like:
query1('2013-01-01', lambda x,y: x <= y)
But there's no real way I know of to convert that in your query to "<=". If you pass it in as a string, however, you can use format with named fields, passing in a dictionary whose keys match those field names, like this:
sql = """
select *
from table
where number {operation} 1
and date = '{date}'
"""
data = {
    "operation": "<=",
    "date": "2013-01-01"
}
query = sql.format(**data)