Oracle index not used with SQLAlchemy prepared statements - python

I use Python and SQLAlchemy to access an Oracle 11 database.
As far as I know, SQLAlchemy always uses prepared statements.
On a huge table (4 million records) that is correctly indexed, SQLAlchemy filter queries don't use the index, so they fall back to a full table scan and are very slow; running the same SQL in a "raw" SQL editor (SQL*Plus) makes Oracle use the index with good performance.
We tried adding "+index" hints to the queries, without any effect on the Oracle execution plan, which still doesn't use the index.
Any idea? Is it possible to really force Oracle to use an index, or to make SQLAlchemy not use prepared statements?
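For reference, a hint can be attached through SQLAlchemy's with_hint(); a minimal sketch with a recent SQLAlchemy, where the connection URL, table, column, and index names are all placeholders:
from sqlalchemy import create_engine, MetaData, Table, select

# Placeholder connection URL and table/index names.
engine = create_engine('oracle://user:pass@host:1521/sid')
metadata = MetaData()
big_table = Table('big_table', metadata, autoload_with=engine)

# with_hint() renders an optimizer hint into the generated SQL for the Oracle dialect;
# %(name)s is replaced with the table's name or alias in the statement.
stmt = (
    select(big_table)
    .with_hint(big_table, 'INDEX(%(name)s ix_big_table_col)', dialect_name='oracle')
    .where(big_table.c.some_col == 'some value')
)

with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()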
Best regards,
Thierry

Related

Table metadata transfer from sqlserver to postgresql using SQLAlchemy

I am trying to migrate my database from MS SQL Server to PostgreSQL using a Python script.
Before migrating the data, the script needs to create the required tables.
I intend to use SQLAlchemy to create the required tables and then migrate the actual data. Below is the sample code. While creating a table in PostgreSQL, the script fails because there is no datatype like tinyint in PostgreSQL. I thought SQLAlchemy abstracted these data types.
Any suggestions and best practices for this kind of use case would be of great help.
from sqlalchemy import create_engine, MetaData, select, func, Table
import pandas as pd
engine_pg = create_engine('postgresql://XXXX:YYYY$#10.10.1.4:5432/pgschema')
engine_ms = create_engine('mssql+pyodbc://XX:YY#10.10.1.5/msqlschema?driver=SQL+Server')
ms_metadata = MetaData(bind=engine_ms)
pg_metadata = MetaData(bind=engine_pg)
#extract Node table object from mssql using ms_metadat and engine_ms
Node = Table('Node', ms_metadata, autoload_with=engine_ms)
#create Node table in pgsql using the Node table object
Node.create(bind=engine_pg)
While I have not done the MS SQL Server to PostgreSQL path, I have done some other (small to tiny) migrations and have some minor experience with both databases you are looking at. The solution to your specific problem is probably best handled through a mapping function.
There is a library that I have looked at but never gotten around to using which contains such mappings:
https://pgloader.readthedocs.io/en/latest/ref/mssql.html?highlight=tinyint%20#default-ms-sql-casting-rules
Since a data migration is usually done just once, I would recommend making use of an existing tool. SQLAlchemy is not really such a tool to my understanding, but it could potentially be turned into one with some effort.
Regarding your question about SQLAlchemy abstracting the data types: I would not hold this situation against SQLAlchemy. Tinyint is a 1-byte data type. There is no such data type available in PostgreSQL, which makes a direct mapping impossible; hence the mapping found in pgloader (linked above).
https://learn.microsoft.com/en-us/sql/t-sql/data-types/int-bigint-smallint-and-tinyint-transact-sql?view=sql-server-ver15
https://www.postgresql.org/docs/9.1/datatype-numeric.html
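If you do want to stay with SQLAlchemy for the schema copy, one possible workaround (a sketch, not tested against your schema; the connection strings are placeholders) is to rewrite the reflected TINYINT columns to SmallInteger before emitting CREATE TABLE on the PostgreSQL side:
from sqlalchemy import create_engine, MetaData, Table, SmallInteger
from sqlalchemy.dialects.mssql import TINYINT

# Placeholder connection strings -- substitute your own.
engine_ms = create_engine('mssql+pyodbc://user:pass@host/msqlschema?driver=SQL+Server')
engine_pg = create_engine('postgresql://user:pass@host:5432/pgschema')

ms_metadata = MetaData()
node = Table('Node', ms_metadata, autoload_with=engine_ms)

# PostgreSQL has no TINYINT, so map it to the closest available type before CREATE TABLE.
for column in node.columns:
    if isinstance(column.type, TINYINT):
        column.type = SmallInteger()

node.create(bind=engine_pg)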
Finally, some thoughts on the meta-information available here. It seems you are offering a bounty on this six months after you posted the original question, which is interesting: it is either a huge project or one you don't allocate a lot of time to. Either way, I urge you to use an existing tool rather than trying to make something work beyond its intended usage. Another thing is the inclusion of the pandas import. If you are thinking of using pandas for the data transfer, I want to caution you that pandas is very forgiving with data formats. This might not be a problem for you, but a more controlled data pipeline would probably be less error-prone.
Given the previous paragraph, I'd like some more information on the overall situation before pointing you in the right direction. Database migration can have other unforeseen consequences as well, so I don't want to give the impression that the solution to your overall problem is a quick fix as simple as a tinyint-to-smallint mapping.

Database data Integrity check using Python SqlAlchemy or Sql Query

I imported and will keep importing data from different sources into a SQL Server database. There are logic checks such as: the sum of one column should be within a 5 dollar difference of an amount in another table. It is hard for me to check this logic during import because some data is inserted manually and some is imported using Excel VBA or Python. Data accuracy is very important to me, so I am thinking of checking after the data is inserted. I see two choices:
Python with SQLAlchemy, writing the logic in Python
A stored procedure or direct SQL to verify
What are the advantages and disadvantages of SQLAlchemy vs. a stored procedure for data checks? Or other solutions?
The benefits in my mind for SQLAlchemy with automap:
Possible combination with Jupyter for a nice user interface
Logic is easier to write, e.g. looping over one table where each row should equal the sum of another table with some WHERE conditions
The benefits of a SQL stored procedure:
Everything can be managed in SQL Server Management Studio
The answer is neither. Most RDBMSs have built-in mechanisms to enforce restrictions on the data that is inserted into a row or a column. As you said:
It is hard for me to check this logic during import because some data is inserted manually and some is imported using Excel VBA or Python. Data accuracy is very important to me.
You can't have your code in all these places. What works is constraints.
CHECK constraints enforce domain integrity by limiting the values that are accepted by one or more columns. You can create a CHECK constraint with any logical (Boolean) expression that returns TRUE or FALSE based on the logical operators.
You can of course use stored procedures for this, but constraints are more efficient, more transparent, and easier to maintain.
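As an illustration, here is a sketch of declaring such a constraint through SQLAlchemy (the table and column names are made up; the equivalent CHECK could also be added directly in T-SQL):
from sqlalchemy import Table, Column, Integer, Numeric, MetaData, CheckConstraint

metadata = MetaData()

# Hypothetical table: the database itself rejects out-of-range values,
# no matter whether rows arrive via Python, Excel VBA, or manual inserts.
payments = Table(
    'payments', metadata,
    Column('id', Integer, primary_key=True),
    Column('amount', Numeric(10, 2), nullable=False),
    CheckConstraint('amount >= 0', name='ck_payments_amount_nonnegative'),
)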

Confusion between prepared statement and parameterized query in Python

As far as I understand, prepared statements are (mainly) a database feature that allows you to separate parameters from the code that uses such parameters. Example:
PREPARE fooplan (int, text, bool, numeric) AS
INSERT INTO foo VALUES($1, $2, $3, $4);
EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00);
A parameterized query substitutes the manual string interpolation, so instead of doing
cursor.execute("SELECT FROM tablename WHERE fieldname = %s" % value)
we can do
cursor.execute("SELECT FROM tablename WHERE fieldname = %s", [value])
Now, it seems that prepared statements are, for the most part, used in the database language and parameterized queries are mainly used in the programming language connecting to the database, although I have seen exceptions to this rule.
The problem is that asking about the difference between prepared statement and parameterized query brings a lot of confusion. Their purpose is admittedly the same, but their methodology seems distinct. Yet, there are sources indicating that both are the same. MySQLdb and Psycopg2 seem to support parameterized queries but don’t support prepared statements (e.g. here for MySQLdb and in the TODO list for postgres drivers or this answer in the sqlalchemy group). Actually, there is a gist implementing a psycopg2 cursor supporting prepared statements and a minimal explanation about it. There is also a suggestion of subclassing the cursor object in psycopg2 to provide the prepared statement manually.
I would like to get an authoritative answer to the following questions:
Is there a meaningful difference between prepared statement and parameterized query? Does this matter in practice? If you use parameterized queries, do you need to worry about prepared statements?
If there is a difference, what is the current status of prepared statements in the Python ecosystem? Which database adapters support prepared statements?
Prepared statement: A reference to a pre-interpreted query routine on the database, ready to accept parameters
Parametrized query: A query made by your code in such a way that you are passing values in alongside some SQL that has placeholder values, usually ? or %s or something of that flavor.
The confusion here seems to stem from the (apparent) lack of distinction between the ability to directly get a prepared statement object and the ability to pass values into a 'parametrized query' method that acts very much like one... because it is one, or at least makes one for you.
For example: the C interface of the SQLite3 library has a lot of tools for working with prepared statement objects, but the Python API makes almost no mention of them. You can't prepare a statement and use it multiple times whenever you want. Instead, you can use Connection.executemany(sql, seq_of_params), which takes the SQL code, creates a prepared statement internally, then reuses that statement in a loop to process each of the parameter tuples in the iterable you gave it.
Many other SQL libraries in Python behave the same way. Working with prepared statement objects can be a real pain, and can lead to ambiguity, and in a language like Python which has such a lean towards clarity and ease over raw execution speed they aren't really the greatest option. Essentially, if you find yourself having to make hundreds of thousands or millions of calls to a complex SQL query that gets re-interpreted every time, you should probably be doing things differently. Regardless, sometimes people wish they could have direct access to these objects because if you keep the same prepared statement around the database server won't have to keep interpreting the same SQL code over and over; most of the time this will be approaching the problem from the wrong direction and you will get much greater savings elsewhere or by restructuring your code.*
Perhaps more importantly in general is the way that prepared statements and parametrized queries keep your data sanitary and separate from your SQL code. This is vastly preferable to string formatting! You should think of parametrized queries and prepared statements, in one form or another, as the only way to pass variable data from your application into the database. If you try to build the SQL statement otherwise, it will not only run significantly slower but you will be vulnerable to other problems.
*e.g., by producing the data that is to be fed into the DB in a generator function then using executemany() to insert it all at once from the generator, rather than calling execute() each time you loop.
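To make that footnote concrete, a small sketch using the standard sqlite3 module (the table and the generator are made up):
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (sensor TEXT, value REAL)')

def produce_rows():
    # Hypothetical generator standing in for whatever produces your data.
    for i in range(100000):
        yield ('sensor-%d' % (i % 10), i * 0.5)

# executemany() prepares the INSERT once and reuses it for every parameter tuple,
# instead of re-interpreting the SQL on each separate execute() call.
conn.executemany('INSERT INTO readings (sensor, value) VALUES (?, ?)', produce_rows())
conn.commit()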
tl;dr
A parametrized query is a single operation which generates a prepared statement internally, then passes in your parameters and executes.
edit: A lot of people see this answer! I want to also clarify that many database engines also have concepts of a prepared statement that can be constructed explicitly with plaintext query syntax, then reused over the lifetime of a client's session (in postgres for example). Sometimes you have control over whether the query plan is cached to save even more time. Some frameworks use these automatically (I've seen rails' ORM do it aggressively), sometimes usefully and sometimes to their detriment when there are permutations of form for the queries being prepared.
Also, if you want to nitpick, parametrized queries do not always use a prepared statement under the hood; they should do so if possible, but sometimes the driver just formats the parameter values into the string. The real difference between 'prepared statement' and 'parametrized query' here is really just the shape of the API you use.
First, your question shows very good preparation - well done.
I am not sure if I am the person to provide an authoritative answer, but I will try to explain my understanding of the situation.
A prepared statement is an object, created on the database server side as a result of a PREPARE statement, which turns the provided SQL statement into a sort of temporary procedure with parameters. A prepared statement has the lifetime of the current database session and is discarded when the session is over. The SQL statement DEALLOCATE allows destroying a prepared statement explicitly.
Database clients can use the SQL statement EXECUTE to run the prepared statement by calling its name with parameters.
Parametrized statement is an alias for prepared statement, since a prepared statement usually has some parameters.
Parametrized query seems to be a less frequently used alias for the same thing (24 million Google hits for "parametrized statement", 14 million for "parametrized query"). It is possible that some people use this term for another purpose.
Advantages of prepared statements are:
faster execution of the actual prepared statement call (not counting the time for PREPARE)
resistance to SQL injection attacks
Players in executing an SQL query
A real application will probably have the following participants:
application code
ORM package (e.g. sqlalchemy)
database driver
database server
From the application's point of view it is not easy to know whether the code will really use a prepared statement on the database server, as any of the participants may lack support for prepared statements.
Conclusions
In application code, avoid building SQL queries by hand, as this is prone to SQL injection attacks. For this reason it is recommended to use whatever the ORM provides for parametrized queries, even if it does not result in prepared statements on the database server side, since the ORM code can be optimized to prevent this sort of attack.
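For example, a minimal sketch of a parametrized query with SQLAlchemy Core (the connection URL, table, and column are placeholders):
from sqlalchemy import create_engine, text

# Placeholder connection URL -- substitute your own database.
engine = create_engine('postgresql://user:pass@localhost/mydb')

# The bound parameter :name keeps the value out of the SQL text entirely;
# the driver passes it separately, which prevents SQL injection.
with engine.connect() as conn:
    result = conn.execute(
        text("SELECT id, name FROM users WHERE name = :name"),
        {"name": "O'Brien"},
    )
    for row in result:
        print(row)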
Decide whether a prepared statement is worth it for performance reasons. If you have a simple SQL query which is executed only a few times, it will not help; sometimes it will even slow down execution a bit.
The effect is biggest for a complex query that is executed many times and has a relatively short execution time. In such a case, you may follow these steps:
Check that the database you are going to use supports the PREPARE statement. In most cases it will be present.
Check that the driver you use supports prepared statements, and if not, try to find another one that does.
Check support for this feature at the ORM package level. It can vary driver by driver (e.g. sqlalchemy states some limitations on prepared statements with MySQL due to how MySQL manages them).
If you are in search of a real authoritative answer, I would head to the authors of sqlalchemy.
An SQL statement can't be executed immediately: the DBMS must interpret it before execution.
Prepared statements are statements that have already been interpreted; the DBMS only substitutes the parameters and the query starts immediately. This is a feature of certain DBMSs, and with it you can achieve fast responses (comparable with stored procedures).
Parametrized statements are just a way you compose the query string in your programming language. Since it doesn't matter how the SQL string was formed, you get the slower response from the DBMS.
If you measure the time of executing the same query 3-4 times (a SELECT with different conditions), you will see the same time with parametrized queries, whereas with a prepared statement the time is smaller from the second execution onward (the first time the DBMS has to interpret the script anyway).
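A rough way to see this for yourself (a sketch using psycopg2 against a hypothetical table; PREPARE/EXECUTE are sent as ordinary SQL):
import time
import psycopg2

# Hypothetical connection and table.
conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()

# Prepare the statement once on the server, then execute it repeatedly.
cur.execute("PREPARE get_user (int) AS SELECT * FROM users WHERE id = $1")
for i in range(4):
    start = time.perf_counter()
    cur.execute("EXECUTE get_user (%s)", (42,))
    cur.fetchall()
    print("run %d: %.6f s" % (i, time.perf_counter() - start))

cur.execute("DEALLOCATE get_user")
conn.close()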
I think the comment about using executemany fails to deal with the case where an application stores data into the database OVER TIME and wants each insert statement to be as efficient as possible. Thus the desire to prepare the insert statement once and re-use the prepared statement.
Alternatively, one could put the desired statement(s) in a stored proc and re-use it.

pymssql and placeholders

What placeholders can I use with pymssql? I'm getting my values from the HTML query string, so they are all of type string. Is this safe with regard to SQL injection?
query = dictify_querystring(Response.QueryString)
employeedata = conn.execute_row("SELECT * FROM employees WHERE company_id=%s and name = %s", (query["id"], query["name"]))
What mechanism is being used in this case to avoid injections?
There isn't much in the way of documentation for pymssql...
Maybe there is a better python module I could use to interface with Sql Server 2005.
Thanks,
Barry
Regarding SQL injection, and not knowing exactly how that implementation works, I would say that's not safe.
Some simple steps to make it so:
Change that query into a prepared statement (or make sure the implementation internally does so, though it doesn't look like it).
Make sure you use ' around your query arguments.
Validate the expected type of your arguments (if request parameters that should be numeric are indeed numeric, etc).
Mostly... number one is the key. Using prepared statements is the most important and probably easiest line of defense against SQL injection.
Some ORM's take care of some of these issues for you (notice the ample use of the word some), but I would advise making sure you know these problems and how to work around them before using an abstraction like an ORM.
Sooner or later, you need to know what's going on under those wonderful layers of time-saving.
Maybe there is a better python module I could use to interface with Sql Server 2005.
Well, my advice is to use an ORM like SQLAlchemy to handle this.
>>> from sqlalchemy.ext.sqlsoup import SqlSoup
>>> db = SqlSoup('mssql:///DATABASE?PWD=yourpassword&UID=some_user&dsn=your_dsn')
>>> employeedata = db.employees.filter(db.employees.company_id==query["id"])\
.filter(db.employees.name==query["name"]).one()
You can use one() if you want to raise an exception if there is more than one record, .first() if you want just the first record or .all() if you want all records.
As a side benefit, if you later change to other DBMS, the code will remain the same except for the connection URL.

Python and Postgresql

If you wanted to manipulate the data in a table in a postgresql database using some python (maybe running a little analysis on the result set using scipy) and then wanted to export that data back into another table in the same database, how would you go about the implementation?
Is the only/best way to do this to simply run the query, have python store it in an array, manipulate the array in python and then run another sql statement to output to the database?
I'm really just asking, is there a more efficient way to deal with the data?
Thanks,
Ian
You could use PL/Python to write a PostgreSQL function to manipulate the data.
http://www.postgresql.org/docs/current/static/plpython.html
Although tbh I think it's much of a muchness compared to processing it in an external client for most cases.
I'm not sure I understand what you mean, but I'd say it sounds very much like
INSERT INTO anothertable SELECT stuff FROM the_table RETURNING *
and then work on the returned rows. That is, of course, if you don't want to modify the data when you manipulate it.
Is the only/best way to do this to simply run the query, have python store it in an array, manipulate the array in python and then run another sql statement to output to the database?
Not the only way (see the other answers), but IMHO the best and certainly the simplest. It just requires a PostgreSQL library (I use psycopg). The standard interface is documented in PEP 249.
An example of a SELECT with psycopg:
cursor.execute("SELECT * FROM students WHERE name=%(name)s;",
globals())
and an INSERT:
cursor.execute("INSERT INTO Foobar (t, i) VALUES (%s, %s)",
["I like Python", 42])
pgnumpy seems to be what you're looking for.
I'd think about using http://www.sqlalchemy.org/.
SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
It supports postgres, too - http://www.sqlalchemy.org/docs/05/reference/dialects/postgres.html
You could use an ORM such as SQLAlchemy to retrieve the data into an "object", and manipulate that object. Such an implementation would probably be more elegant than just using an array.
I agree with the SQLAlchemy suggestions or using Django's ORM. Your needs seem too simple for PL/Python to be used.
