How to protect against SQL Injection with pandas read_gbq - python

How do I use pandas_gbq.read_gbq safely to protect against SQL injection? I cannot find a way to parametrize the query in the docs.
I've looked for a way to parametrize in the docs, on Google's website, and in other sources.
df_valid = read_gbq(QUERY_INFO.format(variable), project_id='project-1622', location='EU')
where the query looks like SELECT name, date FROM table WHERE id = '{0}'
I can input p' or '1'='1 and the injection works.

Per the Google BigQuery docs, you have to pass a query configuration with a parameterized SQL statement:
import pandas as pd

sql = "SELECT name, date FROM table WHERE id = @id"
query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'id',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': variable}  # the untrusted user-supplied id
            }
        ]
    }
}
df = pd.read_gbq(sql, project_id='project-1622', location='EU', configuration=query_config)
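If you prefer not to hand-build the REST configuration, a similar effect can be achieved by querying through the google-cloud-bigquery client and converting the result to a DataFrame. This is only a minimal sketch: the project, location and the variable name come from the question, everything else is an assumption.
from google.cloud import bigquery

client = bigquery.Client(project='project-1622')
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # The untrusted input is bound as a typed parameter, never interpolated into the SQL.
        bigquery.ScalarQueryParameter('id', 'STRING', variable),
    ]
)
df = client.query(
    "SELECT name, date FROM table WHERE id = @id",
    job_config=job_config,
    location='EU',
).to_dataframe()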

Related

Upload table to bigquery using Colab, specifying schema in job_config

I am trying to write a table to BigQuery using Colab. The best way I've found is to use the client with a job_config. It is important that I keep control over how the data is written, as I plan to use the code below for different tasks. The last step that eludes me is setting up the schema: I do not want someone's query to crash because, say, Year is suddenly an integer instead of a string. Should the code below work, or do I need to use "job_config.schema_update_options"? I am also not sure what the schema object should look like. I cannot use pandas-gbq, as it is too slow to write to a dataframe first. The table is overwritten each month, which is why I use WRITE_TRUNCATE. Thanks
schema_1 = [
    {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Product", "type": "STRING", "mode": "NULLABLE"},
]
schemma2 = [
    ('Region', 'STRING', 'NULLABLE', None, ()),
    ('Product', 'STRING', 'NULLABLE', None, ()),
]
"""Create a Google BigQuery input table.
In the code below, the following actions are taken:
* A new dataset is created "natality_regression."
* A query is run, the output of which is stored in a new "regression_input" table.
"""
from google.cloud import bigquery
# Create a new Google BigQuery client using Google Cloud Platform project defaults.
project_id = 'nproject'
client = bigquery.Client(project=project_id)
# Prepare a reference to a new dataset for storing the query results.
dataset_id = "natality_regression"
table_id = "regression_input"
table_id_full = f"{project_id}.{dataset_id}.{table_id}"
# Configure the query job.
job_config = bigquery.QueryJobConfig()
# Set the destination table to where you want to store query results.
job_config.destination = table_id_full
job_config.write_disposition = 'WRITE_TRUNCATE' # WRITE_APPEND
job_config.schema = schemma2
#job_config.schema_update_options = ???
#job_config.schema = schema_1
# Set up a query in Standard SQL
query = """
SELECT * FROM `nproject.SalesData1.Sales1` LIMIT 15
"""
# Run the query.
query_job = client.query(query, job_config=job_config)
query_job.result() # Waits for the query to finish
print('danski')
This is effectively done through the job_config, but the Python syntax looks like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("mode", "STRING"),
    ]
)
You can find more details here: https://cloud.google.com/bigquery/docs/schemas?hl=es_419
You can also create the table first, as you mentioned in the question:
# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
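For the question's setup, a minimal sketch combining both ideas could look like the following. The project, dataset, table and column names are taken from the question; exists_ok and the interaction with WRITE_TRUNCATE (which can replace the destination schema with the query result's schema) are assumptions worth verifying on a test table.
from google.cloud import bigquery

client = bigquery.Client(project="nproject")
table_id_full = "nproject.natality_regression.regression_input"

# Create the destination table up front with an explicit schema.
schema = [
    bigquery.SchemaField("Region", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("Product", "STRING", mode="NULLABLE"),
]
client.create_table(bigquery.Table(table_id_full, schema=schema), exists_ok=True)

# Write the monthly query results into the pre-created table.
job_config = bigquery.QueryJobConfig(
    destination=table_id_full,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
query = "SELECT * FROM `nproject.SalesData1.Sales1` LIMIT 15"
client.query(query, job_config=job_config).result()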

Confused about python data types to insert into database

I am trying to insert this data into a SQL Server table, and I'm not sure whether it is supposed to be a list or a dictionary.
For some context, I am pulling the data from a SharePoint list using Shareplum, with code like this:
import json
import pandas
import pyodbc
from shareplum import Site
from shareplum import Office365
authcookie = Office365('https://company.sharepoint.com', username='username', password='password').GetCookies()
site = Site('https://company.sharepoint.com/sites/sharepoint/', authcookie=authcookie)
sp_list = site.List('Test')
data = sp_list.GetListItems('All Items')
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                      "Server=Server;"
                      "Database=db;"
                      "Trusted_Connection=yes;")
cursor = cnxn.cursor()
insert_query = "INSERT INTO SharepointTest(No,Name) VALUES (%(No)s,%(Name)s)"
cursor.executemany(insert_query, data)
cnxn.commit
Here's the result when I used print(data)
[
    {'No': '1', 'Name': 'Qwe'},
    {'No': '2', 'Name': 'Asd'},
    {'No': '3', 'Name': 'Zxc'},
    {'No': '10', 'Name': 'jkl'}
]
If I tried to execute that code will shows me this message
TypeError: ('Params must be in a list, tuple, or Row', 'HY000')
What should I fix in the code?
Convert your list of dictionaries to a list or tuple of the dictionary values.
I've done it below using a list comprehension to iterate through the list and the values() method to extract the values from each dictionary:
insert_query = "INSERT INTO SharepointTest(No,Name) VALUES (?, ?)" #change your sql statement to include parameter markers
cursor.executemany(insert_query, [tuple(d.values()) for d in data])
cnxn.commit() #finally commit your changes
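If you would rather not rely on the dictionaries' key order, you can also pull the values out by key explicitly. This continues from the snippet above and assumes the 'No' and 'Name' keys shown in the printed data:
rows = [(d['No'], d['Name']) for d in data]  # explicit column order
cursor.executemany(insert_query, rows)
cnxn.commit()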

Create table from dictionary data in a safe way

I have a list of dictionaries with, for example, the following data:
columns = [
    {
        'name': 'column1',
        'type': 'varchar'
    },
    {
        'name': 'column2',
        'type': 'decimal'
    },
    ...
]
From that list I need to dynamically create a CREATE TABLE statement based on each dictionary in the list, which contains the name of the column and the type, and execute it on a PostgreSQL database using the psycopg2 adapter.
I managed to do it with:
columns = "(" + ",\n".join(["{} {}".format(col['name'], col['type']) for col in columns]) + ")"
cursor.execute("CREATE TABLE some_table_name\n {}".format(columns))
But this solution is vulnerable to SQL injection. I tried to do the exact same thing with the sql module from psycopg2, but without luck: I always get a syntax error, because it wraps the type in quotes.
Is there some way this can be done safely?
You can make use of AsIs to get the column types added unquoted:
import psycopg2
from psycopg2.extensions import AsIs
import psycopg2.sql as sql

conn = psycopg2.connect("dbname=mf port=5959 host=localhost user=mf_usr")

columns = [
    {
        'name': "column1",
        'type': "varchar"
    },
    {
        'name': "column2",
        'type': "decimal"
    }
]

# Create a dict, so we can use dict placeholders in the CREATE TABLE query.
column_dict = {c['name']: AsIs(c['type']) for c in columns}

createSQL = sql.SQL("CREATE TABLE some_table_name\n ({columns})").format(
    columns=sql.SQL(',').join(
        sql.SQL(' ').join([sql.Identifier(col), sql.Placeholder(col)]) for col in column_dict
    )
)
print(createSQL.as_string(conn))

cur = conn.cursor()
cur.execute(createSQL, column_dict)
cur.execute("insert into some_table_name (column1) VALUES ('foo')")
cur.execute("select * FROM some_table_name")
print('Result: ', cur.fetchall())
Output:
CREATE TABLE some_table_name
("column1" %(column1)s,"column2" %(column2)s)
Result: [('foo', None)]
Note: psycopg2.sql is safe against SQL injection; AsIs probably is not.
Testing with 'type': "varchar; DROP TABLE foo" resulted in a Postgres syntax error:
b'CREATE TABLE some_table_name\n ("column1" varchar; DROP TABLE foo,"column2" decimal)'
Traceback (most recent call last):
File "pct.py", line 28, in <module>
cur.execute(createSQL, column_dict)
psycopg2.errors.SyntaxError: syntax error at or near ";"
LINE 2: ("column1" varchar; DROP TABLE foo,"column2" decimal)
Expanding on my comment, a complete example:
import psycopg2
from psycopg2 import sql

columns = [
    {
        'name': 'column1',
        'type': 'varchar'
    },
    {
        'name': 'column2',
        'type': 'decimal'
    }
]

con = psycopg2.connect("dbname=test host=localhost user=aklaver")
cur = con.cursor()

col_list = sql.SQL(',').join(
    [sql.Identifier(col["name"]) + sql.SQL(' ') + sql.SQL(col["type"]) for col in columns]
)
create_sql = sql.SQL("CREATE TABLE tablename ({})").format(col_list)
print(create_sql.as_string(con))
CREATE TABLE tablename ("column1" varchar,"column2" decimal)
cur.execute(create_sql)
con.commit()
con.commit()
test(5432)=> \d tablename
            Table "public.tablename"
 Column  |       Type        | Collation | Nullable | Default
---------+-------------------+-----------+----------+---------
 column1 | character varying |           |          |
 column2 | numeric           |           |          |
Iterate over the list of column dicts, rendering the column names as SQL identifiers and the column types as plain SQL inside the sql.SQL construct, then use the result as the parameter to the CREATE TABLE statement.
Caveat: sql.SQL() does not do escaping, so those values would have to be validated before they were used.
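One way to do that validation is to check each type against an allowlist before building the statement. The sketch below reuses sql and columns from the example above; the allowed-type set is an assumption you would adapt to your data:
ALLOWED_TYPES = {"varchar", "decimal", "integer", "text", "boolean"}  # assumed allowlist

def column_sql(col):
    # Reject anything that is not a plain, expected type name.
    if col["type"].lower() not in ALLOWED_TYPES:
        raise ValueError("unexpected column type: {!r}".format(col["type"]))
    return sql.Identifier(col["name"]) + sql.SQL(' ') + sql.SQL(col["type"])

col_list = sql.SQL(',').join(column_sql(col) for col in columns)
create_sql = sql.SQL("CREATE TABLE tablename ({})").format(col_list)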

Google BigQuery: In Python, column addition makes all the other columns Nullable

I have a table that already exists with the following schema:
{
    "schema": {
        "fields": [
            {
                "mode": "required",
                "name": "full_name",
                "type": "string"
            },
            {
                "mode": "required",
                "name": "age",
                "type": "integer"
            }
        ]
    }
}
It already contains entries like:
{'full_name': 'John Doe',
'age': int(33)}
I want to insert a new record with a new field and have the load job automatically add the new column as it loads. The new format looks like this:
record = {'full_name': 'Karen Walker',
'age': int(48),
'zipcode': '63021'}
My code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project=projectname)
table = client.get_table(table_id)

config = bigquery.LoadJobConfig()
config.autodetect = True
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]

job = client.load_table_from_json([record], table, job_config=config)
job.result()
This results in the following error:
400 Provided Schema does not match Table my_project:my_dataset:mytable. Field age has changed mode from REQUIRED to NULLABLE
I can fix this by changing config.schema_update_options as follows:
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
]
This allows me to insert the new record, with zipcode added to the schema, but it causes both full_name and age to become NULLABLE, which is not the behavior I want. Is there a way to prevent schema auto-detect from changing the existing columns?
If you need to add fields to your schema, you can do the following:
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("your-project.your-dataset.your-table")
original_schema = table.schema # Get your current table's schema
new_schema = original_schema[:] # Creates a copy of the schema.
# Add new field to schema
new_schema.append(bigquery.SchemaField("new_field", "STRING"))
# Set new schema in your table object
table.schema = new_schema
# Call API to update your table with the new schema
table = client.update_table(table, ["schema"])
After updating your table's schema, you can load your new records with this additional field without needing any schema update options in the load job.
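For completeness, the follow-up load might then look like this minimal sketch. record, table and client come from the question; leaving out schema_update_options is the point, and it is worth verifying against your own table that the REQUIRED modes are preserved:
config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_json([record], table, job_config=config)
job.result()  # no relaxation options are passed, so existing field modes are left alone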

SQLAlchemy - Update table using list of dictionaries

I have a table containing user data and I would like to update information for many of the users using a list of dictionaries. At the moment I am using a for loop to send an update statement one dictionary at a time, but it is slow and I am hoping that there is a bulk method to do this.
user_data = [
    {'user_id': '12345', 'user_name': 'John'},
    {'user_id': '11223', 'user_name': 'Andy'},
]

connection = engine.connect()
metadata = MetaData()

for row in user_data:
    stmt = update(users_table).where(users_table.columns.user_id == row['user_id'])
    results = connection.execute(stmt, row)
from sqlalchemy.sql.expression import bindparam

connection = engine.connect()

# Use a distinct bind name ('_id') in the WHERE clause so it does not clash
# with the 'user_id' bind used in values().
stmt = users_table.update().\
    where(users_table.c.user_id == bindparam('_id')).\
    values({
        'user_id': bindparam('user_id'),
        'user_name': bindparam('user_name'),
    })

# Passing a list of parameter dicts turns this into an executemany-style bulk update.
connection.execute(stmt, [
    {'user_id': '12345', 'user_name': 'John', '_id': '12345'},
    {'user_id': '11223', 'user_name': 'Andy', '_id': '11223'},
])
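An alternative, if you have an ORM mapping for the table, is Session.bulk_update_mappings. This is only a sketch and assumes a mapped User class whose primary key is user_id:
from sqlalchemy.orm import Session

with Session(engine) as session:
    # Each dict must include the primary key; the remaining keys become SET clauses.
    session.bulk_update_mappings(User, user_data)
    session.commit()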
