I am trying to parse all the queries executed by users (within a period of time) in a PostgreSQL DB (by querying the pg_stat_statements view) and to create a report of which tables users touch with SELECT, INSERT, or DELETE queries. Basically, I run something like select query, queryid, userid from pg_stat_statements and then parse each query to check whether it was a SELECT, an INSERT, or a DELETE, and also to extract the table name from the query.
I am using the sqlparse Python module but am very new to it, so I need help.
I am able to get the table name by using something like:
import sqlparse
from sqlparse.sql import Where, Comparison, Parenthesis, Identifier

for token in sqlparse.parse(sql_statement)[0]:
    if isinstance(token, Identifier):
        print(str(token))
but I am not sure how to get the type of statement (SELECT/INSERT/DELETE) together with the name of the table. Also, I need to count COPY statements as SELECTs too.
I tried using psqlparse but I did not see much info/help online regarding this module.
Please suggest.
Thanks.
This is not trivial, and I don't think sqlparse really helps very much. INSERT and DELETE are pretty easy, because they usually start out "INSERT INTO table" and "DELETE FROM table", but SELECT is the wild wild west. Clearly the tables will be mentioned in a FROM clause, but it could be "FROM table1 t1, table2 t2, table3 t3 WHERE" or "FROM table1 t1 LEFT JOIN table2 t2 ON ... INNER JOIN table3 t3 ON ... WHERE".
You might have nested queries, and a SELECT doesn't even have to have a table. Plus, there could be UNIONs that mention further tables. And, of course, "SELECT INTO" is just another way of doing an INSERT. I believe you should start out just doing text processing, looking for the keywords. You might get far enough.
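If you do want to lean on sqlparse for the keyword scanning, here is a minimal sketch of the idea. The function name and structure are mine, not part of sqlparse, and it only handles the simple cases; nested queries, UNIONs, and CTEs would all need extra work:
import sqlparse
from sqlparse.sql import Function, Identifier, IdentifierList
from sqlparse.tokens import Keyword

def statement_type_and_tables(sql):
    # Best-effort extraction: returns (statement_type, [table_names]).
    parsed = sqlparse.parse(sql)[0]
    stmt_type = parsed.get_type()  # 'SELECT', 'INSERT', 'DELETE', or 'UNKNOWN'
    tables = []
    collect = False
    for token in parsed.tokens:
        if token.is_whitespace:
            continue
        if token.ttype is Keyword and (
                token.normalized in ('FROM', 'INTO') or 'JOIN' in token.normalized):
            collect = True  # the next name-like token should be a table
        elif collect:
            if isinstance(token, (Identifier, Function)):
                # "INSERT INTO t (cols) ..." parses the target as a Function
                tables.append(token.get_real_name())
            elif isinstance(token, IdentifierList):  # e.g. "FROM t1, t2"
                for ident in token.get_identifiers():
                    if hasattr(ident, 'get_real_name'):
                        tables.append(ident.get_real_name())
                    else:
                        tables.append(str(ident))
            collect = False
    return stmt_type, tables

print(statement_type_and_tables("SELECT a FROM t1, t2 WHERE a = 1"))
# e.g. ('SELECT', ['t1', 't2'])
For the COPY statements you want to count as SELECTs, get_type() will most likely return 'UNKNOWN' (COPY is not a DML keyword sqlparse classifies as such), so you would have to special-case it by inspecting the statement's first token.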
Related
I'm currently writing a program for a parents' evening system. I have two tables, a bookings table and a teachers table, set up with the following column headings: TeacherSubject | 15:30 | 15:35 | 15:40 etc... When people make a booking, they select a teacher from a drop-down menu and also a time. Therefore, I need the bookingID added into the bookings table where the selected teacher matches the teacher in the table and the selected time matches the time in the database.
At the moment, my code only attempts to match the teacher, but this doesn't work, as I'm getting the following error (at line 5):
TypeError: 'str' object is not callable
Am I doing the whole thing wrong and is this actually possible with the way I have set the table up?
def insert(parent_name, parent_email, student_name, student_form, teacher, app_time, comments):
    conn = sqlite3.connect("parentsevening.db")
    cur = conn.cursor()
    cur.execute("INSERT INTO bookings VALUES (NULL,?,?,?,?,?,?,?)", (parent_name, parent_email, student_name, student_form, teacher, app_time, comments))
    cur.execute("INSERT INTO teachers VALUES (?) WHERE teachers = (?)" (id, teacherName,))
    conn.commit()
    conn.close()
This SQL Query is invalid.
INSERT INTO teachers VALUES (?) WHERE teachers = (?)
It should be
INSERT INTO teachers (id, name) VALUES(?, ?)
Note that I'm guessing the teachers table's columns are (id, name). A WHERE clause on an INSERT isn't valid, because WHERE is used to find existing data (SELECT, UPDATE, DELETE).
OK, let's take out the comments and make this into an answer.
Python error
Your TypeError actually comes from the missing comma between the SQL string and the parameter tuple: "INSERT ..." (id, teacherName,) tries to call the string as if it were a function. Pass the parameters as a second argument instead: cur.execute("INSERT ...", (id, teacherName)).
But...
bad sql syntax
Also, that command as a whole doesn't make much sense SQL-syntax-wise: you seem to be inserting a teacher that doesn't exist, VALUES on an INSERT does not go with WHERE, and WHERE needs a FROM. I.e., once you've solved your Python error, sqlite is going to have a fit as well.
That's already covered by another answer.
But...
probably not what you should be doing
If you have an existing teacher, you only need to insert their teacherid into table bookings. You don't have to, and in fact, you can't insert into table teachers at this point, you'd get a duplicate data error.
So, rather than fixing your second query, just get rid of it entirely.
If you can get a command-line or GUI SQL tool up, try running these queries hardcoded by hand before coding them in Python. The sqlite3 command-line shell should be able to do that for you.
(recommendation) don't use insert table values
Try being explicit with insert into table (<column list>) values .... The reason is that, as soon as the table changes in a way that affects column order (say, via an ALTER TABLE), the values won't line up with the implied insert list. That is hard to debug and makes it hard to know what was intended at the time of writing. Been there, done that, and had to debug a bunch of other folks' code that took this shortcut; it's never fun.
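Putting the above together, here is a minimal sketch of the fixed function. The bookings column names are guesses (the question doesn't show the schema), and the second INSERT is dropped entirely:
import sqlite3

def insert(parent_name, parent_email, student_name, student_form,
           teacher, app_time, comments):
    conn = sqlite3.connect("parentsevening.db")
    cur = conn.cursor()
    # Explicit column list (these names are assumed), a comma before the
    # parameter tuple, and no second INSERT into teachers.
    cur.execute(
        "INSERT INTO bookings (parent_name, parent_email, student_name, "
        "student_form, teacher, app_time, comments) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (parent_name, parent_email, student_name, student_form,
         teacher, app_time, comments))
    conn.commit()
    conn.close()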
I have 2 tables; we'll call them table1 and table2. table2 has a foreign key to table1. I need to delete the rows in table1 that have zero child records in table2. The SQL to do this is pretty straightforward:
DELETE FROM table1
WHERE 0 = (SELECT COUNT(*) FROM table2 WHERE table2.table1_id = table1.table1_id);
However, I haven't been able to find a way to translate this query to SQLAlchemy. Trying the straightforward approach:
subquery = session.query(sqlfunc.count(Table2).label('t2_count')) \
    .select_from(Table2) \
    .filter(Table2.table1_id == Table1.table1_id) \
    .subquery()
session.query(Table1).filter(0 == subquery.columns.t2_count).delete()
Just yielded an error:
sqlalchemy.exc.ArgumentError: Only deletion via a single table query is currently supported
How can I perform this DELETE with SQLAlchemy?
Python 2.7
PostgreSQL 9.2.4
SQLAlchemy 0.7.10 (Cannot upgrade due to using GeoAlchemy, but am interested if newer versions would make this easier)
I'm pretty sure this is what you want. You should try it out though. It uses EXISTS.
from sqlalchemy.sql import not_

# This fetches rows in Python to determine which ones were removed.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
    synchronize_session='fetch')

# If you will not be referencing more Table1 objects in this session, then you
# can just skip syncing the session.
Session.query(Table1).filter(not_(Table1.table2s.any())).delete(
    synchronize_session=False)
Explanation of the argument for delete():
http://docs.sqlalchemy.org/en/rel_0_8/orm/query.html#sqlalchemy.orm.query.Query.delete
Example with exists() (the any() above translates to EXISTS):
http://docs.sqlalchemy.org/en/rel_0_8/orm/tutorial.html#using-exists
Here is the SQL that should be generated:
DELETE FROM table1 WHERE NOT (EXISTS (SELECT 1
FROM table2
WHERE table1.id = table2.table1_id))
If you are using declarative, I think there is a way to access Table2.__table__, and then you could just use the SQL layer of SQLAlchemy to do exactly what you want. You run into the same issue of your Session getting out of sync, though.
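For what it's worth, here is a sketch of that SQL-layer route. It is written against a newer SQLAlchemy than the 0.7 in the question (where delete().where(...) and exists().where(...) are available), so treat it as a pointer rather than a drop-in:
from sqlalchemy import exists, not_

# Core-level delete: this bypasses the ORM, so the Session is not synchronized.
t1 = Table1.__table__
t2 = Table2.__table__
stmt = t1.delete().where(
    not_(exists().where(t2.c.table1_id == t1.c.table1_id)))
session.execute(stmt)
session.commit()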
Well, I found one very ugly way to do it. You can do a select with a join to get the rows loaded into memory, then you can delete them individually:
subquery = session.query(
        Table2.table1_id,
        sqlalchemy.func.count(Table2.table2_id).label('t1count')) \
    .select_from(Table2) \
    .group_by(Table2.table1_id) \
    .subquery()

rows = session.query(Table1) \
    .select_from(Table1) \
    .outerjoin(subquery, Table1.table1_id == subquery.c.table1_id) \
    .filter(subquery.c.t1count == None) \
    .all()

for r in rows:
    session.delete(r)
This is not only nasty to write, it's also pretty nasty performance-wise. For starters, you have to bring the table1 rows into memory. Second, if you were like me and had a line like this on Table2's class definition:
table1 = orm.relationship(Table1, backref=orm.backref('table2s'))
then SQLAlchemy will actually perform a query to pull the related table2 rows into memory, too (even though there aren't any). Even worse, because you have to loop over the list (I tried just passing in the list; didn't work), it does so one table1 row at a time. So if you're deleting 10 rows, it's 21 individual queries (1 for the initial select, 1 for each relationship pull, and 1 for each delete). Maybe there are ways to mitigate that; I would have to go through the documentation to see. All this for things I don't even want in my database, much less in memory.
I won't mark this as the answer. I want a cleaner, more efficient way of doing this, but this is all I have for now.
I just started out with programming and wrote a few lines of code in PyScripter using sqlite3.
The table "gather" is created beforehand. I then select certain rows from "gather" to put them into another table, and I try to sort this table by a specific column, 'date'. But it doesn't seem to work: it doesn't give me an error message or anything like that, the result is just not sorted. If I try the same command (SELECT * FROM matches ORDER BY date) in SQLiteManager, it works fine on the exact same table! What is the problem here? I googled quite some time but don't find a solution; it's probably something stupid I'm missing...
As I said, I'm a total newbie, and I guess you all break out in tears looking at the code. So if you have any tips on how I can shorten the code or make it faster or whatever, you're very welcome :) (but everything works fine except the above-mentioned part).
import sqlite3

connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('DROP TABLE IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? OR team2=?', (a, a))
cursor1.execute("SELECT * FROM matches ORDER BY date")
connection.commit()
OK, I think I understand your problem. First of all, I'm not sure that commit call is necessary at all. However, if it is, you'll definitely want it to come before your SELECT statement; connection.commit() essentially says "commit the changes I just made to the database".
Your second issue is that you are executing the SELECT query but never actually doing anything with its results.
try this:
import sqlite3

connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('DROP TABLE IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? OR team2=?', (a, a))
connection.commit()

# directly iterate over the results of the query:
for row in cursor1.execute("SELECT * FROM matches ORDER BY date"):
    print row
You are executing the query but never actually retrieving the results. There are two ways to do this with sqlite3. One way is shown above, where you use the cursor returned by execute() directly as an iterable object.
The other way is as follows:
import sqlite3

connection = sqlite3.connect("gather.sqlite")
cursor1 = connection.cursor()
cursor1.execute('DROP TABLE IF EXISTS matches')
cursor1.execute('CREATE TABLE matches(date TEXT, team1 TEXT, team2 TEXT)')
cursor1.execute('INSERT INTO matches (date, team1, team2) SELECT * FROM gather WHERE team1=? OR team2=?', (a, a))
connection.commit()

cursor1.execute("SELECT * FROM matches ORDER BY date")
# fetchall() fetches all rows from the last query. Here you put the rows
# into their own result object instead of directly iterating over them.
db_result = cursor1.fetchall()
for row in db_result:
    print row
Try moving the commit before the SELECT * (I'm not 100% sure that this is an issue). You then just need to fetch the results of the query :-) Add a line like res = cursor1.fetchall() after you've executed the SELECT. If you want to display the results the way SQLiteManager does, add
for hit in res:
    print '|'.join(hit)
at the bottom.
Edit: to address your issue of storing the sort order in the table:
I think what you're looking for is something like a clustered index (which doesn't actually sort the values in the table, but comes close; see here).
SQLite doesn't have such indexes, but you can simulate them by actually inserting the data in order. You can only do this once, as you're inserting the data. You would need an SQL command like the following:
INSERT INTO matches (date, team1, team2)
SELECT * FROM gather
WHERE team1=? or team2=?
ORDER BY date;
instead of the one you currently use.
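In the script from the question, that would be something like this (a is the question's placeholder variable):
cursor1.execute(
    'INSERT INTO matches (date, team1, team2) '
    'SELECT * FROM gather WHERE team1=? OR team2=? ORDER BY date',
    (a, a))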
See point 4 here, which is where I got the idea.
I'd like to retrieve the fully referenced column name from a PyOdbc Cursor. For example, say I have 2 simple tables:
Table_1(Id, < some other fields >)
Table_2(Id, < some other fields >)
and I want to retrieve the joined data
select * from Table_1 t1, Table_2 t2 where t1.Id = t2.Id
using pyodbc, like this:
query = 'select * from Table_1 t1, Table_2 t2 where t1.Id = t2.Id'
import pyodbc
conn_string = '<removed>'
connection = pyodbc.connect(conn_string)
cursor = connection.cursor()
cursor.execute(query)
I then want to get the column names:
for row in cursor.description:
    print row[0]
BUT if I do this I'll get Id twice, which I don't want. Ideally I could get t1.Id and t2.Id in the output.
Some of the solutions I've thought of (and why I don't really want to implement them):
rename the columns in the query - in my real-world use case there are dozens of tables, some with dozens of columns, and they are changed far too often
parse my query and automate my SQL query generation (basically checking the query for tables, using the cursor.columns function to get the columns, and then replacing the select * with a set of named columns) - if I have to I'll do this, but it seems like overkill for a testing harness
Is there a better way? Any advice would be appreciated.
The pyodbc docs offer
# columns in table x
for row in cursor.columns(table='x'):
    print(row.column_name)
See the pyodbc wiki; the API docs are useful.
Here's how I do it.
import pyodbc
connection = pyodbc.connect('DSN=vertica_standby', UID='my_user', PWD='my_password', ansi=True)
cursor = connection.cursor()
for row in cursor.columns(table='table_name_in_your_database'):
    print(row.column_name)
You have to have your DSN (data source name) set up via two files: odbc.ini and odbcinst.ini.
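For reference, a minimal odbc.ini entry for a DSN like the one above might look roughly like this; every value is illustrative, and the driver name and paths depend on your installation:
[vertica_standby]
Description = Vertica standby cluster
Driver = /path/to/vertica/odbc/driver.so
ServerName = standby.example.com
Database = mydb
Port = 5433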
It doesn't seem to be possible to do what I want without writing a decent amount of code to wrap it up. None of the other answers actually answered the original question of returning column names qualified by the table they originate from in some relatively automatic fashion.
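To sketch what such a wrapper could look like: qualified_select below is a hypothetical helper (not part of pyodbc) that uses only cursor.columns() to build a SELECT whose output columns are prefixed with their table name:
def qualified_select(cursor, tables):
    # Alias every column as <table>_<column>, so duplicate names like
    # 'Id' stay distinguishable in cursor.description.
    parts = []
    for t in tables:
        for row in cursor.columns(table=t):
            parts.append('%s.%s AS %s_%s' % (t, row.column_name, t, row.column_name))
    return 'SELECT %s FROM %s' % (', '.join(parts), ', '.join(tables))

query = qualified_select(cursor, ['Table_1', 'Table_2']) + ' WHERE Table_1.Id = Table_2.Id'
cursor.execute(query)
for col in cursor.description:
    print col[0]  # e.g. Table_1_Id, Table_2_Id, ...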
I have a table, Foo. I run a query on Foo to get the ids from a subset of Foo. I then want to run a more complicated set of queries, but only on those IDs. Is there an efficient way to do this? The best I can think of is creating a query such as:
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (1, 2, 3, 9, 413, 4324, ..., 939393)
That is, I construct a huge IN clause. Is this efficient? Is there a more efficient way of doing this, or is the only way to JOIN with the initial query that gets the IDs? If it helps, I'm using SQLObject to connect to a PostgreSQL database, and I have access to the cursor that executed the query to get all the IDs.
UPDATE: I should mention that the more complicated queries all either rely on these IDs, or create more IDs to look up in the other queries. If I were to make one large query, I'd end up joining six tables at once or so, which might be too slow.
One technique I've used in the past is to put the IDs into a temp table, and then use that to drive a sequence of queries. Something like:
BEGIN;
CREATE TEMP TABLE search_result ON COMMIT DROP AS
SELECT entity_id
FROM entity /* long complicated search joins and conditions ... */;
-- Fetch primary entities
SELECT entity_id, entity.x /*, ... */
FROM entity JOIN search_result USING (entity_id);
-- Fetch some related entities
SELECT entity_id, related_entity_id, related_entity.x /*, ... */
FROM related_entity JOIN search_result USING (entity_id);
-- And more, as required
END;
This is particularly useful where the search result entities have multiple one-to-many relationships which you want to fetch without either a) doing N*M+1 selects or b) doing a cartesian join of related entities.
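Since the question mentions having the cursor available, the same pattern driven from Python might look roughly like this, assuming a psycopg2-style connection (the table names and search condition are placeholders):
cur = connection.cursor()
# psycopg2 opens a transaction implicitly, so the temp table created with
# ON COMMIT DROP lives until the commit below.
cur.execute(
    "CREATE TEMP TABLE search_result ON COMMIT DROP AS "
    "SELECT entity_id FROM entity WHERE name LIKE %s",
    ('%foo%',))
cur.execute("SELECT entity_id, x FROM entity JOIN search_result USING (entity_id)")
primary_rows = cur.fetchall()
cur.execute(
    "SELECT entity_id, related_entity_id, x "
    "FROM related_entity JOIN search_result USING (entity_id)")
related_rows = cur.fetchall()
connection.commit()  # ON COMMIT DROP removes the temp table here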
I would think it might be useful to use a VIEW. Simply create a view with your query for the IDs, then join to that view by ID. That will limit your results to the required subset of IDs without an expensive IN statement.
I do know that an IN statement is generally more expensive than an EXISTS statement would be.
I think the join with the criteria that select the IDs will be more efficient, because the query optimizer has more options to do the right thing. Use EXPLAIN to see how PostgreSQL will approach it.
You are almost certainly better off with a join; however, another option is to use a sub-select, i.e.
SELECT ... --complicated stuff
WHERE ... --more stuff
AND id IN (select distinct id from Foo where ...)