SQLAlchemy: Dynamically pass schema and table name avoiding SQL Injection - python

How can I execute an SQL query where the schema and table name are passed in a function? Something like below?
def get(engine, schema: str, table: str):
    query = text("select * from :schema.:table")
    result = engine.connect().execute(query, schema=schema, table=table)
    return result

Two things going on here:
Avoiding SQL injection
Dynamically setting a schema with (presumably) PostgreSQL
The first question has a very broad scope; you might want to look at older questions about SQLAlchemy and SQL injection, such as SQLAlchemy + SQL Injection.
Your second question can be addressed in a number of ways, though I would recommend the following approach from SQLAlchemy's documentation: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#remote-schema-table-introspection-and-postgresql-search-path
PostgreSQL supports a search_path setting which determines the schema used to resolve unqualified table names for subsequent operations on the connection.
So your query code might look like:
qry_str = f"SET search_path TO {schema}"
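Note that the schema value is still interpolated into the statement, so it must come from a trusted or validated source. A minimal sketch of using this with SQLAlchemy (assuming an existing engine, and schema/table values that have already been validated, for example against information_schema as in the next answer):
from sqlalchemy import text

with engine.connect() as conn:
    conn.execute(text(f'SET search_path TO "{schema}"'))                  # schema assumed pre-validated
    rows = conn.execute(text(f'SELECT * FROM "{table}"')).fetchall()      # table assumed pre-validated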
Alternatively, if you use an SQLAlchemy declarative approach, you can use a MetaData object like in this question/answer SQLAlchemy support of Postgres Schemas
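For reference, a tiny sketch of that MetaData-based route (the schema is fixed when the metadata is defined, so by itself it does not cover the dynamic case; 'myschema' is just an example name):
from sqlalchemy import MetaData, Table

metadata = MetaData(schema='myschema')
user = Table('user', metadata, autoload_with=engine)  # reflected as myschema.user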

You could create a collection of existing table and schema names in your database and check inputs against those values before creating your query:
-- assumes we are connected to the correct *database*
SELECT table_schema, table_name
FROM information_schema.tables;
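A minimal sketch of that check with SQLAlchemy (an illustration only; it assumes an existing engine):
from sqlalchemy import text

def get(engine, schema: str, table: str):
    with engine.connect() as conn:
        # Build an allow-list of (schema, table) pairs that actually exist.
        known = {tuple(row) for row in conn.execute(text(
            'SELECT table_schema, table_name FROM information_schema.tables'))}
        if (schema, table) not in known:
            raise ValueError('unknown schema or table')
        # The values are now known identifiers, not raw user input, so interpolation is safe.
        return conn.execute(text(f'SELECT * FROM "{schema}"."{table}"')).fetchall()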

Related

Prevent SQL Injection in BigQuery with Python for table name

I have an Airflow DAG which takes an argument from the user for table.
I then use this value in an SQL statement and execute it in BigQuery. I'm worried about exposing myself to SQL Injection.
Here is the code:
sql = f"""
CREATE OR REPLACE TABLE {PROJECT}.{dataset}.{table} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
client = bigquery.Client()
query_job = client.query(sql)
Both dataset and table get passed through via Airflow, but I'm worried someone could pass something like random_table; truncate other_tbl; -- as the table argument.
My fear is that the above will create a table called random_table and then truncate an existing table.
Is there a safer way to process these passed through arguments?
I've looked into parameterized queries in BigQuery but these don't work for table names.
You will have to create a table name validator. I think you can validate reasonably safely by requiring backticks (`) at the start and at the end of your table name string. It's not a 100% solution, but it worked for the test scenarios I tried. It should work like this:
# validate should check for ` at the beginning and end of your table name
table_name = validate(f"`{project}.{dataset}.{table}`")
sql = f"""
CREATE OR REPLACE TABLE {table_name} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
...
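validate above is not a real library function; one possible sketch of it (my own assumption, using a character allow-list so backticks, semicolons and comment markers cannot hide inside the name):
import re

# Hypothetical validator: accepts only a backtick-wrapped project.dataset.table name
# whose parts contain nothing but letters, digits, underscores and dashes.
_TABLE_RE = re.compile(r'^`[\w-]+\.[\w-]+\.[\w-]+`$')

def validate(table_name: str) -> str:
    if not _TABLE_RE.match(table_name):
        raise ValueError(f'invalid table name: {table_name!r}')
    return table_name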
Note: I suggest reading up on BigQuery SQL injection; there are posts on Medium that cover it.
I checked the official documentation about running parameterized queries, and sadly it only covers the parameterization of values, not table names or other string parts of your query.
As a final note, I recommend opening a feature request with BigQuery for this particular scenario.
You should probably look into sanitization/validation of user input in general. This is done before passing the input to the BQ query.
With Python, you could look for malicious strings in the user input - like truncate in your example - or use a regex to reject input that contains, for instance, --. Those are just some quick examples; I recommend doing more research on the topic, and you will also find quite a few related questions on Stack Exchange.
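For illustration, a quick sketch of that kind of reject-list check (a blunt instrument; an allow-list such as the validator sketched above is generally stronger):
import re

FORBIDDEN = re.compile(r'(--|;|\btruncate\b|\bdrop\b)', re.IGNORECASE)

def sanitize(value: str) -> str:
    # Reject obviously malicious fragments before the value reaches the query string.
    if FORBIDDEN.search(value):
        raise ValueError(f'suspicious input: {value!r}')
    return value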

SQLAlchemy: do not schema qualify columns in a select statement

In SQLAlchemy, a query like
user = sess.query(User).first()
will emit SQL with each column in the select clause qualified with the schema name
select
    myschema.user.id, -- (vs user.id)
    ....
from
    myschema.user
For some dialects, like Presto views, that is a syntax error.
Is there any way to make SQLAlchemy skip the schema name in the columns of the select statement (e.g. user.id instead of myschema.user.id) without using aliased on every table? Or is there a setting such that SQLAlchemy automatically uses an alias?

How to join tables from two different databases using sqlalchemy expression language / sqlalchemy core?

I am using MySQL. I was able to find ways to do this using the SQLAlchemy ORM, but not using the expression language, so I am specifically looking for Core / expression-language-based solutions. The databases lie on the same server.
This is what my connection looks like:
connection = engine.connect().execution_options(
    schema_translate_map={current_database_schema: new_database_schema})

engine_1 = create_engine("mysql+mysqldb://root:user#*******/DB_1")
engine_2 = create_engine("mysql+mysqldb://root:user#*******/DB_2", pool_size=5)
metadata_1 = MetaData(engine_1)
metadata_2 = MetaData(engine_2)
metadata_1.reflect(bind=engine_1)
metadata_2.reflect(bind=engine_2)
table_1 = metadata_1.tables['table_1']
table_2 = metadata_2.tables['table_2']
query = select([table_1.c.name, table_2.c.name]).select_from(
    join(table_1, table_2, table_1.c.id == table_2.c.id, isouter=True))
result = connection.execute(query).fetchall()
However, when I try to join tables from different databases it throws an error obviously because the connection belongs to one of the databases. And I haven't tried anything else because I could not find a way to solve this.
Another way to put the question (maybe) is 'how to connect to multiple databases using a single connection in sqlalchemy core'.
Applying the solution from here to Core only, you could create a single Engine object that connects to your server, but without defaulting to one database or the other:
engine = create_engine("mysql+mysqldb://root:user#*******/")
and then using a single MetaData instance reflect the contents of each schema:
metadata = MetaData(engine)
metadata.reflect(schema='DB_1')
metadata.reflect(schema='DB_2')
# Note: use the fully qualified names as keys
table_1 = metadata.tables['DB_1.table_1']
table_2 = metadata.tables['DB_2.table_2']
You could also use one of the databases as the "default" and pass it in the URL. In that case you would reflect tables from that database as usual and pass the schema= keyword argument only when reflecting the other database.
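For example (a sketch assuming DB_1 is chosen as the default database in the URL):
engine = create_engine("mysql+mysqldb://root:user#*******/DB_1")

metadata = MetaData(engine)
metadata.reflect()               # tables from the default database are keyed as 'table_1'
metadata.reflect(schema='DB_2')  # tables from the other database are keyed as 'DB_2.table_2'

table_1 = metadata.tables['table_1']
table_2 = metadata.tables['DB_2.table_2']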
Use the created engine to perform the query:
query = select([table_1.c.name, table_2.c.name]).\
    select_from(outerjoin(table_1, table_2, table_1.c.id == table_2.c.id))

with engine.begin() as connection:
    result = connection.execute(query).fetchall()

Can I find datatypes of fields returned in a query in mysql

Is there any way I can get the datatypes of the fields returned by a query in MySQL? Say I have the query:
SELECT a.*,b.* FROM tbl_name a LEFT JOIN other_tbl b ON a.id=b.first_id
Is there a command I can use in MySQL that will return the names of the fields this query returns and their datatypes? I know I could create a view from this query and then DESCRIBE the view, but is there another way to do it on the fly?
I'm using SQLAlchemy to perform this raw query and my tables are dynamically generated. Is there an SQLAlchemy way, if not a MySQL way?
You can get the column names and datatypes for a table from information_schema in MySQL:
SELECT COLUMN_NAME, COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_NAME = 'tbl_name';
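Since the query is being run through SQLAlchemy anyway, a minimal sketch of checking both base tables from the example (it assumes an existing engine):
from sqlalchemy import text

with engine.connect() as conn:
    for tbl in ('tbl_name', 'other_tbl'):
        rows = conn.execute(
            text('SELECT COLUMN_NAME, COLUMN_TYPE FROM information_schema.COLUMNS '
                 'WHERE TABLE_NAME = :tbl'),
            {'tbl': tbl},
        ).fetchall()
        print(tbl, rows)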

Parameterized queries with psycopg2 / Python DB-API and PostgreSQL

What's the best way to make psycopg2 pass parameterized queries to PostgreSQL? I don't want to write my own escaping mechanisms or adapters, and the psycopg2 source code and examples are difficult to read in a web browser.
If I need to switch to something like PyGreSQL or another python pg adapter, that's fine with me. I just want simple parameterization.
psycopg2 follows the rules for DB-API 2.0 (set down in PEP 249). That means you can call the execute method on your cursor object using the pyformat binding style, and it will do the escaping for you. For example, the following should be safe (and work):
cursor.execute("SELECT * FROM student WHERE last_name = %(lname)s",
{"lname": "Robert'); DROP TABLE students;--"})
From the psycopg documentation
(http://initd.org/psycopg/docs/usage.html)
Warning: Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
The correct way to pass variables in a SQL command is using the second argument of the execute() method:
SQL = "INSERT INTO authors (name) VALUES (%s);" # Note: no quotes
data = ("O'Reilly", )
cur.execute(SQL, data) # Note: no % operator
Here are a few examples you might find helpful
cursor.execute('SELECT * from table where id = %(some_id)s', {'some_id': 1234})  # psycopg2 only supports %s-style placeholders
Or you can dynamically build your query based on a dict of field name, value:
fields = ', '.join(my_dict.keys())                 # column names must come from your code
placeholders = ', '.join(['%s'] * len(my_dict))
query = 'INSERT INTO some_table ({}) VALUES ({})'.format(fields, placeholders)
cursor.execute(query, list(my_dict.values()))
Note: the fields must be defined in your code, not user input, otherwise you will be susceptible to SQL injection.
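If the identifiers themselves have to be dynamic, psycopg2 (2.7+) also provides the psycopg2.sql module for composing them safely; a rough sketch of the same insert written with it (table and column names are illustrative):
from psycopg2 import sql

fields = list(my_dict.keys())
query = sql.SQL('INSERT INTO {} ({}) VALUES ({})').format(
    sql.Identifier('some_table'),
    sql.SQL(', ').join(map(sql.Identifier, fields)),
    sql.SQL(', ').join([sql.Placeholder()] * len(fields)),
)
cur.execute(query, list(my_dict.values()))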
I love the official docs about this:
https://www.psycopg.org/psycopg3/docs/basic/params.html
