Prevent SQL Injection in BigQuery with Python for table name - python

I have an Airflow DAG which takes an argument from the user for table.
I then use this value in an SQL statement and execute it in BigQuery. I'm worried about exposing myself to SQL Injection.
Here is the code:
sql = f"""
CREATE OR REPLACE TABLE {PROJECT}.{dataset}.{table} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
client = bigquery.Client()
query_job = client.query(sql)
Both dataset and table get passed through via airflow but I'm worried someone could pass through something like: random_table; truncate other_tbl; -- as the table argument.
My fear is that the above will create a table called random_table and then truncate an existing table.
Is there a safer way to process these passed through arguments?
I've looked into parameterized queries in BigQuery but these don't work for table names.

You will have to write a table name validator. I think you can validate safely by wrapping the table name string in backticks (`) at the start and at the end, as long as the validator also rejects any name that itself contains a backtick. It's not a 100% solution, but it worked for the test scenarios I tried. It should work like this:
# validate should look for ` at the beginning and end of your tablename
table_name = validate(f"`{project}.{dataset}.{table}`")
sql = f"""
CREATE OR REPLACE TABLE {table_name} PARTITION BY DATE(start_time) as (
//OTHER CODE
)
"""
...
Note: I suggest reading the following Medium post about BigQuery SQL injection.
I checked the official documentation on running parameterized queries, and sadly parameters can only stand in for values, not for table names or other parts of the query text.
As a final note, I recommend opening a feature request with BigQuery for this particular scenario.
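A minimal sketch of what such a validator could look like. The names validate_identifier and qualified_table are made up for illustration, and the regex is a deliberately conservative assumption (ASCII letters, digits and underscores only); BigQuery itself accepts more characters than this. The project is assumed to come from trusted configuration, as PROJECT does in the question, so only dataset and table are checked.

```python
import re

# Conservative allowlist (an assumption, stricter than BigQuery's own rules):
# a letter or underscore, followed by letters, digits or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,127}")


def validate_identifier(name):
    """Return name unchanged if it matches the allowlist, else raise ValueError."""
    if not isinstance(name, str) or _IDENTIFIER.fullmatch(name) is None:
        raise ValueError(f"invalid identifier: {name!r}")
    return name


def qualified_table(project, dataset, table):
    """Build a backtick-quoted project.dataset.table reference.

    project is assumed to be trusted configuration; dataset and table are
    treated as user input and validated before they reach the SQL text.
    """
    dataset = validate_identifier(dataset)
    table = validate_identifier(table)
    return f"`{project}.{dataset}.{table}`"


table_name = qualified_table("my-project", "sales", "events_2024")
```

With this in place, the value random_table; truncate other_tbl; -- from the question is rejected with a ValueError before any SQL is built, because it contains spaces and semicolons.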

You should probably look into sanitization/validation of user input in general. This is done before passing the input to the BQ query.
With Python, you could look for malicious strings in the user input - like truncate in your example - or use a regex to filter input that for instance contains --. Those are just some quick examples. I recommend you do more research on that topic; you will also find quite a few questions on that topic on SE.

Related

How to avoid SQL Injection in Python for Upsert Query to SQL Server?

I have a sql query I'm executing that I'm passing variables into. In the current context I'm passing the parameter values in as f strings, but this query is vulnerable to sql injection. I know there is a method to use a stored procedure and restrict permissions on the user executing the query. But is there a way to avoid having to go the stored procedure route and perhaps modify this function to be secure against SQL Injection?
I have the below query created to execute within a python app.
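from sqlalchemy import create_engine

def sql_gen(tv, kv, join_kv, col_inst, val_inst, val_upd):
    # Every argument is interpolated straight into the SQL text,
    # which is what makes this statement vulnerable to injection.
    sqlstmt = f"""
        IF NOT EXISTS (
            SELECT *
            FROM {tv}
            WHERE {kv} = {join_kv}
        )
            INSERT {tv} (
                {col_inst}
            )
            VALUES (
                {val_inst}
            )
        ELSE
            UPDATE {tv}
            SET {val_upd}
            WHERE {kv} = {join_kv};
    """
    # username, password, server and database are assumed to be defined elsewhere
    engine = create_engine(f"mssql+pymssql://{username}:{password}@{server}/{database}")
    connection = engine.raw_connection()
    cursor = connection.cursor()
    cursor.execute(sqlstmt)
    connection.commit()
    cursor.close()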
Fortunately, most database connectors support query parameters: you pass the values separately from the SQL text instead of building them into the query string yourself, which avoids the risk you mentioned.
You can read more on this here: https://realpython.com/prevent-python-sql-injection/#understanding-python-sql-injection
Example:
# Vulnerable
cursor.execute("SELECT admin FROM users WHERE username = '" + username + "'")
# Safe
cursor.execute("SELECT admin FROM users WHERE username = %s", (username,))
As Amanzer correctly mentions in his reply, Python has mechanisms to pass parameters safely.
However, other elements of your query (table names and column names) cannot be passed as parameters (bind variables), because database drivers bind values only, not identifiers.
If these come from an untrusted source (or may in the future), you should validate them. This is good coding practice even if you are sure the source is trusted.
There are some options to do this safely:
Limit your tables and columns based on positive validation: make sure that the only values allowed are the ones that are authorized.
If that's not possible (because these are user created?):
    Make sure table and column names are limited to a "safe" set of characters (alphanumeric, dashes, underscores...).
    Quote the table and column names by adding double quotes around the objects. If you do this, you need to validate that there are no quotes in the name, and either error out or escape them. You also need to be aware that adding quotes makes the name case sensitive.
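Putting both answers together, here is a rough sketch of the idea: identifiers are checked against an allowlist, and values only ever travel as parameters. ALLOWED_COLUMNS, build_upsert and the customers table are hypothetical names for illustration; the %s placeholder style is the one pymssql accepts, so adjust it for other drivers.

```python
# Hypothetical allowlist: table name -> columns that may appear in a query.
ALLOWED_COLUMNS = {
    "customers": {"id", "name", "email"},
}


def build_upsert(table, key_col, key_val, values):
    """Return (sql, params) for an IF NOT EXISTS / INSERT / ELSE UPDATE statement.

    Table and column names must be in ALLOWED_COLUMNS; the values are never
    placed in the SQL text, only in the params tuple.
    """
    allowed = ALLOWED_COLUMNS.get(table)
    if allowed is None:
        raise ValueError(f"table not allowed: {table!r}")
    if not values:
        raise ValueError("at least one column to write is required")
    columns = list(values)
    for col in [key_col, *columns]:
        if col not in allowed:
            raise ValueError(f"column not allowed: {col!r}")

    insert_cols = ", ".join([key_col, *columns])
    insert_marks = ", ".join(["%s"] * (len(columns) + 1))
    set_clause = ", ".join(f"{col} = %s" for col in columns)
    sql = (
        f"IF NOT EXISTS (SELECT 1 FROM {table} WHERE {key_col} = %s) "
        f"INSERT INTO {table} ({insert_cols}) VALUES ({insert_marks}) "
        f"ELSE UPDATE {table} SET {set_clause} WHERE {key_col} = %s;"
    )
    row = [values[col] for col in columns]
    # Parameter order follows placeholder order: EXISTS key, INSERT key and
    # values, UPDATE values, UPDATE key.
    params = (key_val, key_val, *row, *row, key_val)
    return sql, params


sql, params = build_upsert("customers", "id", 7, {"name": "Ann", "email": "ann@example.com"})
```

The result would then be executed as cursor.execute(sql, params), so the driver handles quoting of the values.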

Add non-datetime variables in SQL python as function parameters

I have a function that executes many SQL queries with different dates.
What I want is to pass all dates and other query variables as function parameters and then just execute the function. I have figured out how to do this for datetime variables as below. But I also have a query that looks at specific campaign_names in a database and pulls those as strings. I want to be able to pass those strings as function parameters but I haven't figured out the correct syntax for this in the SQL query.
def Camp_eval(start_date,end_1M,camp1,camp2,camp3):
    query1 = f"""SELECT CONTACT_NUMBER, OUTCOME_DATE
    FROM DATABASE1
    where OUTCOME_DATE >= (to_date('{start_date}', 'dd/mm/yyyy'))
    and OUTCOME_DATE < (to_date('{end_1M}', 'dd/mm/yyyy'))"""
    query2 = """SELECT CONTACT_NUMBER
    FROM DATABASE2
    WHERE (CAMP_NAME = {camp1} or
    CAMP_NAME = {camp2} or
    CAMP_NAME = {camp3})"""

Camp_eval('01/04/2022','01/05/2022','Camp_2022_04','Camp_2022_05','Camp_2022_06')
The parameters start_date and end_1M work fine with the {} brackets but the camp variables, which are strings don't return any results even though there are results in the database with those conditions if I were to write them directly in the query.
Any help would be appreciated!!
Please, do not use f-strings for creating SQL queries!
Most likely, any library you use for accessing a database already has a way of creating queries: SQLite docs (check code examples).
Another example: cur.execute("SELECT * FROM tasks WHERE priority = ?", (priority,)).
Not only is this way safer (it fixes the SQL injection problem mentioned by @d-malan in the comments), it also removes the need to care about how data is represented in SQL: the library automatically converts dates, strings, etc. into the form the database expects. Your problem can therefore be fixed by using the proper tools.
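As an illustration only, here is how the campaign lookup could be written with placeholders, using the standard library's sqlite3 module and an in-memory table that mimics DATABASE2. The function name contacts_for_campaigns and the sample rows are made up. Placeholder syntax varies by driver, so "?" would need to be replaced with whatever your database library expects.

```python
import sqlite3


def contacts_for_campaigns(conn, campaigns):
    """Return contact numbers whose CAMP_NAME is one of the given campaigns."""
    if not campaigns:
        return []
    # One "?" per campaign; only the placeholders are formatted into the SQL,
    # the campaign names themselves are passed separately as parameters.
    marks = ", ".join("?" for _ in campaigns)
    sql = (
        "SELECT CONTACT_NUMBER FROM DATABASE2 "
        f"WHERE CAMP_NAME IN ({marks}) ORDER BY CONTACT_NUMBER"
    )
    return [row[0] for row in conn.execute(sql, list(campaigns))]


# Made-up sample data so the sketch can run on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DATABASE2 (CONTACT_NUMBER TEXT, CAMP_NAME TEXT)")
conn.executemany(
    "INSERT INTO DATABASE2 VALUES (?, ?)",
    [("111", "Camp_2022_04"), ("222", "Camp_2022_05"), ("333", "Camp_2021_12")],
)
result = contacts_for_campaigns(conn, ["Camp_2022_04", "Camp_2022_05", "Camp_2022_06"])
```

Because the strings are bound as parameters, there is no need to add quotes around them by hand, which is what the original query2 was missing.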

How does a SQLAlchemy Session manage transactions, when executing multiple raw SQL statements at once?

None of the "similar questions" really get at this specific topic, but I am trying to find out how SQLAlchemy's Session handles transactions, when:
Passing raw SQL text to the execute() method, rather than utilizing any SQLAlchemy model objects, AND
The raw SQL text contains multiple distinct commands.
For instance:
bulk_operation = """
DELETE FROM the_table WHERE id = ...;
INSERT INTO the_table (id, name) VALUES (...);
"""
sql = text(bulk_operation)
session.execute(sql.bindparams(id=foo, name=bar))
The goal here is to restore the original state, if either the DELETE or the INSERT fails for any reason.
But does Session.execute() actually guarantee this, in this context? Is it necessary to include BEGIN and COMMIT commands within the raw SQL text itself, or manage from the Python level with session.commit() or something else? Thanks in advance!

Psycopg2 - Passing variable in the where clause

I am trying to run a SQL script in Python where I am passing a variable in the where clause as below:
cursor.execute(f"""select * from table where type = variable_value""")
In the above query, variable_value has the value that I am trying to use in the where clause. I am however getting an error psycopg2.errors.UndefinedColumn: column "variable_value" does not exist in table
As per psycopg2 documentation the execute function takes variables as an extra parameter.
cursor.execute("""select * from table where type = %(value)s """, {"value": variable_value})
There are more examples in the psycopg2 user manual.
Please also read the section about SQL injection carefully. The gist is that you should not quote parameters in your query yourself; the execute function takes care of that to prevent the injection of harmful SQL.
To explain the error you are getting: the query you're sending compares two identifiers (type and variable_value). The table does not contain a variable_value column, hence the error.
I believe you intended to use string interpolation to construct the query but forgot the {}. It would work like this:
cursor.execute(f"""select * from table where type = '{variable_value}'""")
⚠️ But because of the SQL injection risk mentioned above, this is not a recommended way!

SQLAlchemy: Dynamically pass schema and table name avoiding SQL Injection

How can I execute an SQL query where the schema and table name are passed in a function? Something like below?
def get(engine, schema: str, table: str):
    query = text("select * from :schema.:table")
    result = engine.connect().execute(query, schema=schema, table=table)
    return result
Two things going on here:
Avoiding SQL injection
Dynamically setting a schema with (presumably) PostgreSQL
The first question has a very broad scope; you might want to look at older questions about SQLAlchemy and SQL injection, such as "SQLAlchemy + SQL Injection".
Your second question can be addressed in a number of ways, though I would recommend the following approach from SQLAlchemy's documentation: https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#remote-schema-table-introspection-and-postgresql-search-path
PostgreSQL supports a "search path" command which sets the schema for all operations in the transaction.
So your query code might look like:
qry_str = f"SET search_path TO {schema}";
Alternatively, if you use an SQLAlchemy declarative approach, you can use a MetaData object like in this question/answer SQLAlchemy support of Postgres Schemas
You could create a collection of existing table and schema names in your database and check inputs against those values before creating your query:
-- assumes we are connected to the correct *database*
SELECT table_schema, table_name
FROM information_schema.tables;
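A small sketch of that check, using the standard library's sqlite3 module so it can run on its own. SQLite has no information_schema, so its sqlite_master catalog stands in for the query above; known_tables and select_all are hypothetical helper names, and with PostgreSQL you would compare (schema, table) pairs instead of bare table names.

```python
import sqlite3


def known_tables(conn):
    """Collect the names of the tables that currently exist in the database."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return {row[0] for row in rows}


def select_all(conn, table):
    """Run SELECT * against table, but only if it is a known table name."""
    if table not in known_tables(conn):
        raise ValueError(f"unknown table: {table!r}")
    # The name matched the catalog above; it is still quoted, with embedded
    # double quotes doubled, so unusual but legitimate names keep working.
    quoted = '"' + table.replace('"', '""') + '"'
    return conn.execute(f"SELECT * FROM {quoted}").fetchall()


# Made-up sample table so the sketch can run on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER, title TEXT)")
conn.execute("INSERT INTO tasks VALUES (1, 'write docs')")
rows = select_all(conn, "tasks")
```

Any input that is not an existing table name, including injection attempts, fails the membership check and never reaches the query string.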
