I am trying to log every SQL statement executed from my scripts, but I have run into one problem I cannot overcome: is there a way to compute the actual SQL statement after the bind variables have been substituted? In SQLite I had to compute the statement to be executed manually, using the code below:
def __sql_to_str__(self, value, args):
    for p in args:
        if isinstance(p, int) or p is None:
            value = value.replace("?", str(p), 1)
        else:
            value = value.replace("?", "'" + p + "'", 1)
    return value
It seems cx_Oracle has cursor.parse() facilities, but I can't figure out how to get cx_Oracle to compute my query before it is executed.
The query is never computed as a single string: the actual text of the query and the parameters are never interpolated into one full string.
That's the whole point of using parameterized queries: you separate the query from the data, preventing SQL injection and quoting problems in one go, and allowing easy query optimization. The database gets both parts separately and does what it needs to do without ever joining them together.
That said, you could generate the query yourself, but note that the query you generate, although probably equivalent, is not what actually gets executed on the database.
Your best bet is to do it at the database server, since a properly implemented Oracle connector will not embed the bind variables into a string before sending the query to the server. See if you can find an Oracle server setting that makes it log the queries it executes.
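If the goal is simply logging from the Python side, one hedged sketch is a thin wrapper that logs the statement text and its bind values side by side, without ever interpolating them. The wrapper class here is hypothetical and is illustrated with the stdlib sqlite3 module; the same pattern should apply to a cx_Oracle cursor.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sql")

class LoggingCursor:
    """Hypothetical wrapper: logs each statement and its bind values,
    then delegates to the real cursor. SQL and data stay separate."""

    def __init__(self, cursor):
        self._cursor = cursor

    def execute(self, sql, params=()):
        log.info("SQL: %s -- params: %r", sql, params)
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        # Delegate everything else (fetchone, fetchall, ...) to the real cursor.
        return getattr(self._cursor, name)

conn = sqlite3.connect(":memory:")
cur = LoggingCursor(conn.cursor())
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.execute("INSERT INTO t VALUES (?, ?)", (1, "alice"))
cur.execute("SELECT name FROM t WHERE id = ?", (1,))
print(cur.fetchone())  # ('alice',)
```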
You might want to consider using Oracle's extended SQL trace feature for this. I would recommend starting here: http://carymillsap.blogspot.com/2011/01/new-paper-mastering-performance-with.html.
I have a sqlite query from an external source that will have an unknown number of WHERE clauses. There will be a limited number of types of clauses (and I know in advance what types they can be), but how many of each type is unknown until I actually receive the query.
I thought this would be an easy problem to solve until I actually got to it.
I can think of a couple of possible solutions. I could specify a long SELECT query with lots of different WHERE clauses for each type of selection, and fill those in with 1=1 when there's not enough selections given to fill them all up. But that's ugly code, and doesn't react well when more space is needed than is given.
I could instead do this not in pure SQL, but with a recursive Python function that iterates over the queries and successively filters the results. This is pseudocode that doesn't come close to running successfully:
queries = (list of queries from external source)
return filter_results(conn.cursor(), (database), queries)

def filter_results(cursor, results, queries):
    if len(queries) == 0:
        return results
    cursor.execute("SELECT * FROM {} WHERE {}".format(results, queries.pop(0)))
    results = cursor.fetchall()
    return filter_results(cursor, results, queries)
As you can see I've fluffed over passing the database into the function, and I'm well aware that I won't be able to pass an SQL query to the result of cursor.fetchall(). At some point I'd either be trying to emulate SQL in Python, or exposing myself to SQL injection.
I'm either grossly overthinking this or trying to solve the unsolvable. I highly suspect it's the former. What's the correct approach to this?
The answer is to use a query building tool. In this case PyPika was the right tool.
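If a query builder is not available, the same idea can be sketched with only the stdlib sqlite3 module: keep a whitelist of allowed clause templates, join the requested ones with AND, and bind every value as a parameter. The table, column, and clause names below are hypothetical.

```python
import sqlite3

# Whitelisted clause templates: the external source may request any of these,
# in any number, but only *values* are accepted from it -- never raw SQL.
ALLOWED_CLAUSES = {
    "min_age": "age >= ?",
    "max_age": "age <= ?",
    "name": "name = ?",
}

def build_query(table, conditions):
    """conditions: list of (clause_key, value) pairs from the external source."""
    where_parts, params = [], []
    for key, value in conditions:
        where_parts.append(ALLOWED_CLAUSES[key])  # KeyError on unknown clause
        params.append(value)
    # The table name comes from our own code, not from the external source.
    sql = "SELECT * FROM {}".format(table)
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("alice", 30), ("bob", 17), ("carol", 45)])

sql, params = build_query("people", [("min_age", 18), ("max_age", 40)])
print(conn.execute(sql, params).fetchall())  # [('alice', 30)]
```

This keeps everything in a single query instead of recursive refiltering, and the database never sees user-supplied SQL text.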
I am writing a class to be used as part of a much larger modeling algorithm. My part does spatial analysis to calculate distances from certain points to other points. There are a variety of conditions involving number of returned distances, cutoff distances, and etc.
Currently, the project specification only indicates hardcoded situations. i.e. "Function #1 needs to list all the distances from point set A to point set B within 500m. Function #2 needs to list all the distances from point set C to point set D..." and so on.
I don't want to hardcode these parameters, and neither does the person who is developing the next stage of the model, because obviously they would like to tweak the parameters or possibly re-use the algorithm in other projects where they will have different conditions.
Now, the problem is that I am using psycopg2 to do this. This is the standard where I work, so I do not have a choice of deviating from it. I have read that it is a very bad idea to let user-supplied parameters end up in executed queries, for the obvious reason of SQL injection. However, I thought that psycopg2 automatically sanitized SQL input; I think the real issue is using the AsIs function.
The easy solution is just to hardcode it as specified in the project but this feels lazy and sloppy to me. I don't like doing lazy and sloppy work.
Is it at all safe to allow the user to supply parameters that will be put into a psycopg2-executed query? Or is it only the use of AsIs that makes it unsafe? If I want to allow users to supply these parameters, do I have to take responsibility for sanitizing the inputs myself, and if so, is there a quick and easy way to do it, such as another Python library?
AsIs is unsafe unless you really know what you are doing. You can use it for unit testing, for example.
Passing parameters is not unsafe in itself, as long as you do not pre-format your SQL query. Never do:
sql_query = 'SELECT * FROM {}'.format(user_input)
cur.execute(sql_query)
Since user_input could be ';DROP DATABASE;' for instance.
Instead, do:
sql_query = 'SELECT * FROM my_table WHERE name = %s'
cur.execute(sql_query, (user_input,))
psycopg2 will sanitize your query. Note that %s placeholders work only for values, not for identifiers such as table or column names. You can also pre-sanitize the parameters with your own logic if you really do not trust your users' input.
Per psycopg2's documentation:
Warning Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
Also, I would never, ever, let my users tell me which table I should query. Your app's logic (or routes) should tell you that.
Regarding AsIs(), per psycopg2's documentation:
AsIs()... for objects whose string representation is already valid as SQL representation.
So don't use it with user input.
You can use psycopg2.sql to compose dynamic queries. Unlike AsIs it will protect you from SQL injection.
If you need to store your query in a variable, you can use the SQL method (documentation):
from psycopg2 import sql
query = sql.SQL("SELECT * FROM Client WHERE id = {clientId}").format(clientId=sql.Literal(clientId))
As far as I understand, prepared statements are (mainly) a database feature that allows you to separate parameters from the code that uses such parameters. Example:
PREPARE fooplan (int, text, bool, numeric) AS
INSERT INTO foo VALUES($1, $2, $3, $4);
EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00);
A parameterized query substitutes the manual string interpolation, so instead of doing
cursor.execute("SELECT * FROM tablename WHERE fieldname = %s" % value)
we can do
cursor.execute("SELECT * FROM tablename WHERE fieldname = %s", [value])
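For a runnable illustration of the parameterized form, here is the same contrast with the stdlib sqlite3 module, whose placeholder style is ? rather than %s (the table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (fieldname TEXT)")
# The value contains a single quote; with a bound parameter that's harmless.
conn.execute("INSERT INTO tablename VALUES (?)", ("a'b",))

value = "a'b"
# The driver keeps the SQL text and the value separate; no escaping in our code.
rows = conn.execute("SELECT fieldname FROM tablename WHERE fieldname = ?",
                    [value]).fetchall()
print(rows)  # [("a'b",)]
```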
Now, it seems that prepared statements are, for the most part, used in the database language and parameterized queries are mainly used in the programming language connecting to the database, although I have seen exceptions to this rule.
The problem is that asking about the difference between prepared statement and parameterized query brings a lot of confusion. Their purpose is admittedly the same, but their methodology seems distinct. Yet, there are sources indicating that both are the same. MySQLdb and Psycopg2 seem to support parameterized queries but don’t support prepared statements (e.g. here for MySQLdb and in the TODO list for postgres drivers or this answer in the sqlalchemy group). Actually, there is a gist implementing a psycopg2 cursor supporting prepared statements and a minimal explanation about it. There is also a suggestion of subclassing the cursor object in psycopg2 to provide the prepared statement manually.
I would like to get an authoritative answer to the following questions:
Is there a meaningful difference between prepared statement and parameterized query? Does this matter in practice? If you use parameterized queries, do you need to worry about prepared statements?
If there is a difference, what is the current status of prepared statements in the Python ecosystem? Which database adapters support prepared statements?
Prepared statement: A reference to a pre-interpreted query routine on the database, ready to accept parameters
Parametrized query: A query made by your code in such a way that you are passing values in alongside some SQL that has placeholder values, usually ? or %s or something of that flavor.
The confusion here seems to stem from the (apparent) lack of distinction between the ability to directly get a prepared statement object and the ability to pass values into a 'parametrized query' method that acts very much like one... because it is one, or at least makes one for you.
For example: the C interface of the SQLite3 library has a lot of tools for working with prepared statement objects, but the Python API makes almost no mention of them. You can't prepare a statement and reuse it whenever you want. Instead, you can use cursor.executemany(sql, seq_of_params), which takes the SQL code, creates a prepared statement internally, then uses that statement in a loop to process each parameter tuple in the iterable you gave.
Many other SQL libraries in Python behave the same way. Working with prepared statement objects can be a real pain, and can lead to ambiguity, and in a language like Python, which has such a lean towards clarity and ease over raw execution speed, they aren't really the greatest option.

Essentially, if you find yourself having to make hundreds of thousands or millions of calls to a complex SQL query that gets re-interpreted every time, you should probably be doing things differently. Regardless, sometimes people wish they could have direct access to these objects because if you keep the same prepared statement around the database server won't have to keep interpreting the same SQL code over and over; most of the time this will be approaching the problem from the wrong direction and you will get much greater savings elsewhere or by restructuring your code.*
Perhaps more importantly in general is the way that prepared statements and parametrized queries keep your data sanitary and separate from your SQL code. This is vastly preferable to string formatting! You should think of parametrized queries and prepared statements, in one form or another, as the only way to pass variable data from your application into the database. If you try to build the SQL statement otherwise, it will not only run significantly slower but you will be vulnerable to other problems.
*e.g., by producing the data that is to be fed into the DB in a generator function then using executemany() to insert it all at once from the generator, rather than calling execute() each time you loop.
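The footnote's generator-plus-executemany pattern might look like the following sketch (stdlib sqlite3 for illustration): the statement is interpreted once and reused for every tuple the generator yields, and nothing is materialized up front.

```python
import sqlite3

def generate_rows(n):
    # Produce rows lazily; the full list never exists in memory.
    for i in range(n):
        yield (i, "name-%d" % i)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")
# One statement, prepared internally once, executed per yielded tuple.
conn.executemany("INSERT INTO items VALUES (?, ?)", generate_rows(1000))
print(conn.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 1000
```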
tl;dr
A parametrized query is a single operation which generates a prepared statement internally, then passes in your parameters and executes.
edit: A lot of people see this answer! I want to also clarify that many database engines also have concepts of a prepared statement that can be constructed explicitly with plaintext query syntax, then reused over the lifetime of a client's session (in postgres for example). Sometimes you have control over whether the query plan is cached to save even more time. Some frameworks use these automatically (I've seen rails' ORM do it aggressively), sometimes usefully and sometimes to their detriment when there are permutations of form for the queries being prepared.
Also if you want to nit pick, parametrized queries do not always use a prepared statement under the hood; they should do so if possible, but sometimes it's just formatting in the parameter values. The real difference between 'prepared statement' and 'parametrized query' here is really just the shape of the API you use.
First, your question shows very good preparation, well done. I am not sure that I am the person to provide an authoritative answer, but I will try to explain my understanding of the situation.
A prepared statement is an object created on the database server as the result of a PREPARE statement, which turns the provided SQL statement into a sort of temporary procedure with parameters. A prepared statement has the lifetime of the current database session and is discarded after the session is over. The SQL statement DEALLOCATE allows destroying a prepared statement explicitly. Database clients can use the SQL statement EXECUTE to run the prepared statement by calling its name with parameters.
Parametrized statement is an alias for prepared statement, as a prepared statement usually has some parameters. Parametrized query seems to be a less frequently used alias for the same thing (24 million Google hits for parametrized statement, 14 million for parametrized query). It is possible that some people use this term for another purpose.
Advantages of prepared statements are:
faster execution of the actual prepared statement call (not counting the time for the PREPARE itself)
resistance to SQL injection attacks
Players in executing an SQL query
A real application will probably have the following participants:
application code
an ORM package (e.g. sqlalchemy)
a database driver
a database server
From the application's point of view it is not easy to know whether the code will really use a prepared statement on the database server, as any of the participants may lack support for prepared statements.
Conclusions
In application code, prevent direct shaping of the SQL query, as it is prone to SQL injection attacks. For this reason it is recommended to use whatever the ORM provides for parametrized queries, even if it does not result in prepared statements on the database server side, as the ORM code can be optimized to prevent this sort of attack.
Decide whether a prepared statement is worthwhile for performance reasons. If you have a simple SQL query that is executed only a few times, it will not help; sometimes it will even slow execution down a bit.
The effect will be the biggest for a complex query that is executed many times and has a relatively short execution time. In such a case, you may follow these steps:
check that the database you are going to use supports the PREPARE statement (in most cases it will be present)
check that the driver you use supports prepared statements, and if not, try to find another one that does
check support for this feature at the ORM package level; sometimes it varies driver by driver (e.g. sqlalchemy states some limitations on prepared statements with MySQL due to how MySQL manages them)
If you are searching for a really authoritative answer, I would head to the authors of sqlalchemy.
An SQL statement can't be executed immediately: the DBMS must interpret it before execution.
Prepared statements are statements that have already been interpreted; the DBMS only substitutes the parameters and the query starts immediately. This is a feature of certain DBMSs, and it lets you achieve fast responses (comparable to stored procedures).
A parametrized statement is just a way of composing the query string in your programming language. Since it makes no difference how the SQL string was formed, you get the ordinary, slower response from the DBMS.
If you measure the time of executing the same query 3-4 times (a SELECT with different conditions), you will see the same time with parametrized queries, while with a prepared statement the time drops from the second execution on (the first time the DBMS has to interpret the script anyway).
I think the comment about using executemany fails to address the case where an application stores data into the database over time and wants each INSERT to be as efficient as possible, hence the desire to prepare the INSERT statement once and reuse the prepared statement.
Alternatively, one could put the desired statement(s) in a stored proc and re-use it.
I would like to construct a sqlite3 database query in python, then write it to a file.
I am a huge fan of Python's interfaces for SQL databases, which AFAICT wrap all the calls you could mess up with nice little '?' parameters that sanitize/escape your strings for you, but that's not what I want here. I actually just want to prepare and escape an SQL statement: to do this, I need to escape/quote arbitrary strings.
For example:
query = "INSERT INTO example_table VALUES ('%s')" % sqlite_escape_string("'")
And so query should contain:
"INSERT INTO example_table VALUES ('''')"
Note that it inserted an additional ' character.
PHP's equivalent is sqlite_escape_string()
Perl's equivalent is DBI's quote()
I feel Python has a better overall interface, but I happen to need the query, pre-exec.
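For illustration only, SQLite's quoting rule for string literals (double every embedded single quote, then wrap in single quotes) can be written as a tiny helper. This is a hedged sketch of a hypothetical escape function, handling only string literals, not a general escaping facility; parameterized queries remain the safer route.

```python
def sqlite_quote_string(s):
    """Render a Python string as a SQLite string literal:
    double embedded single quotes, then wrap in single quotes."""
    return "'" + s.replace("'", "''") + "'"

query = "INSERT INTO example_table VALUES (%s)" % sqlite_quote_string("'")
print(query)  # INSERT INTO example_table VALUES ('''')
```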
When you use SQLite, it doesn't turn parameterized queries back into text. It has an API ("bindings") and stores the values separately; queries can be reused with different values just by changing the bindings, which is what underlies the statement cache. Consequently you'll get no help from Python/SQLite in doing what you describe.
What you didn't describe is why you want to do this. The usual reason is some form of tracing. My alternate Python/SQLite interface (APSW) provides easy tracing; you don't even have to touch your code to use it:
https://rogerbinns.github.io/apsw/execution.html#apsw-trace
SQLite also has an authorizer API which lets you veto operations performed by a statement. This also has a side effect of telling you what operations a statement would end up performing.
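One further note for the tracing use case: the stdlib sqlite3 module exposes Connection.set_trace_callback (Python 3.3+), which reports each statement as it executes. Whether bound parameters appear expanded in the reported text depends on the underlying SQLite build, so treat this as a trace, not a guaranteed pre-execution rendering of the final query.

```python
import sqlite3

statements = []

conn = sqlite3.connect(":memory:")
# The callback is invoked once per executed statement with its SQL text.
conn.set_trace_callback(statements.append)

conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (?)", (42,))

for stmt in statements:
    print(stmt)
```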
So if I do
import MySQLdb
conn = MySQLdb.connect(...)
cur = conn.cursor()
cur.execute("SELECT * FROM HUGE_TABLE")
print "hello?"
print cur.fetchone()
It looks to me as if MySQLdb fetches the entire huge table before it gets to the print. I previously assumed it did some sort of lazy cursor/state retrieval in the background, but it doesn't look like it.
Is this right? If so, is it because it has to be this way, or is it due to a limitation of the MySQL wire protocol? Does this mean that Java/Hibernate behave the same way?
I guess I need to use LIMIT 1 and its relatives if I want to walk through a large table without pulling in the whole thing at once. Or not? Thanks in advance.
In the _mysql module, use the following call:
conn.use_result()
That tells the connection you want to fetch rows one by one, leaving the remainder on the server (but leaving the cursor open).
The alternative (and the default) is:
conn.store_result()
This tells the connection to fetch the entire result set after executing the query, and subsequent fetches will just iterate through the result set, which is now in memory in your Python app. If your result set is very large, you should consider using LIMIT to restrict it to something you can handle.
Note that MySQL does not allow another query to be run until you have fetched all the rows from the one you have left open.
In the MySQLdb module, the equivalent is to use one of these two cursor classes from MySQLdb.cursors:
CursorUseResultMixIn
CursorStoreResultMixIn
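The same incremental-consumption idea, independent of MySQL, is the DB-API fetchmany loop, shown here with the stdlib sqlite3 module for illustration. With a use_result-style cursor in MySQLdb the pattern would be identical, with the added benefit that unfetched rows also stay on the server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER)")
conn.executemany("INSERT INTO big VALUES (?)", ((i,) for i in range(10000)))

cur = conn.execute("SELECT id FROM big")
total = 0
while True:
    chunk = cur.fetchmany(500)   # pull a bounded batch at a time
    if not chunk:
        break
    total += len(chunk)          # process each batch here
print(total)  # 10000
```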
This is correct in every other language I've used. fetchone is only going to retrieve the first row of the result set, which in this case is the entire table. It's more of a convenience method than anything; it's designed to be easier to use when you know there's only one row coming back, or you only care about the first.
oursql is an alternative MySQL DB-API interface that exposes a few more of the lower-level details, and also provides other facilities for dealing with large datasets.