SQL query f-string formatting in Python script

I have been trying to apply f-string formatting to the colname parameter inside the SQL query for a script I am building, but I keep getting a parse exception.
def expect_primary_key_have_relevant_foreign_key(spark_df1, spark_df2, colname):
    '''
    Check that all the primary keys have a relevant foreign key
    '''
    # Create Temporary View
    spark_df1.createOrReplaceTempView("spark_df1")
    spark_df2.createOrReplaceTempView("spark_df2")
    # Wrap Query in spark.sql
    result = spark.sql("""
    select df1.*
    from spark_df1 df1
    left join
    spark_df2 df2
    f"on trim(upper(df1.{colname})) = trim(upper(df2.{colname}))"
    f"where df2.{colname} is null"
    """)
    if result == 0:
        print("Validation Passed!")
    else:
        print("Validation Failed!")
    return result

I found the solution: the f prefix goes before the opening triple quotes """, so the whole query becomes a single f-string:
# Wrap Query in spark.sql
result = spark.sql(f"""
select df1.*
from spark_df1 df1
left join
spark_df2 df2
on trim(upper(df1.{colname})) = trim(upper(df2.{colname}))
where df2.{colname} is null
""")

Related

Duplicated values as an output of a Python function

I would like to create a Python function for multiple SQL inserts. I will use this function in my Airflow DAG for inserts into a Snowflake database. I will need to create a SnowflakeOperator which will use this function. I'm just starting to use Airflow, so please correct me if I'm wrong.
My example:
I'm connecting to the Snowflake database in order to get data from a table that holds schema names and table names. I use this output to build inserts per schema. I select a schema and create the variable my_schema = 'my_schema'.
First approach:
sql = "SELECT SCHEMA, TABLE FROM TABLE"
cur.execute(sql)
df = pd.DataFrame.from_records(iter(cur), columns=[x[0] for x in cur.description])
my_dict = dict()
for i in df['SCHEMA'].unique().tolist():
df_x = df[df['SCHEMA'] == i]
my_dict[i] = df_x['TABLE'].tolist()
for schema, tables in my_dict.items():
for table in tables:
query = f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} where col2 = 1;"
try:
cur.execute(query)
except snowflake.connector.errors.ProgrammingError as e:
# Something went wrong with the insert
logging.error(f"Inserting in {schema}.{table}: {e}")
conn.close()
For testing I created a pandas dataframe with two columns, schema and table.
data = [['test', 'table01'], ['test', 'table02'], ['my_schema', 'table03'], ['schemaxxx', 'table04']]
# Create the pandas DataFrame
df_new = pd.DataFrame(data, columns=['schema', 'table'])
I created a function for the inserts.
my_schema = 'my_schema'

def my_insert_fnc(df):
    my_dict = dict()
    for i in df['schema'].unique().tolist():
        df_x = df[df['schema'] == i]
        my_dict[i] = df_x['table'].tolist()
    sql_list = []
    for schema, tables in my_dict.items():
        for table in tables:
            if schema == my_schema:
                sql_list.append(f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} where col2 = 1;")
            print(sql_list)
But I'm getting duplicates.
my_insert_fnc(df_new)
['INSERT INTO my_schema.table03 SELECT * FROM my_schema.table03 where col2 = 1;']
['INSERT INTO my_schema.table03 SELECT * FROM my_schema.table03 where col2 = 1;']
I would like to remove the duplicates and log the errors:
try:
    cur.execute(query)
except snowflake.connector.errors.ProgrammingError as e:
    # Something went wrong with the insert
    logging.error(f"Inserting in {schema}.{table}: {e}")
As I mentioned, I need to use this function in my Airflow DAG, so it needs to give me a string output in order to use it in a SnowflakeOperator. Please correct me if I'm wrong.
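No answer is recorded here, but the duplicated output is consistent with print(sql_list) sitting inside the loop, so the accumulating list is printed on every iteration. Below is a minimal sketch of one way to restructure the function; the set-based de-duplication and the target_schema parameter are my additions, not part of the original question:

def my_insert_fnc(df, target_schema='my_schema'):
    sql_list = []
    seen = set()
    for schema, table in zip(df['schema'], df['table']):
        # skip other schemas and any (schema, table) pair already handled
        if schema == target_schema and (schema, table) not in seen:
            seen.add((schema, table))
            sql_list.append(
                f"INSERT INTO {schema}.{table} SELECT * FROM {schema}.{table} where col2 = 1;"
            )
    print(sql_list)  # print once, after the loop
    return sql_list

Returning the list rather than printing it also fits the Airflow use case: as far as I know, SnowflakeOperator's sql argument accepts either a single string or a list of statement strings.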

Can query name variable be parameterized in Python API?

The following API function works, but I would like to parameterize the query name so that I don't have to use if...else. Instead, I would like to take the parameter from the query URL, concatenate it onto the query name, and execute the resulting query.
In other words, I would like to build "qry_" + <reportId> and use it as the query name variable, like qry_R01, qry_R02, or qry_R03. Is that possible?
def get_report():
    reportId = request.args.get('reportId', '')
    qry_R01 = """
        SELECT
            column1,
            column2,
            column3
        FROM
            table1
        """
    qry_R02 = """
        SELECT
            column1,
            column2,
            column3
        FROM
            table2
        """
    qry_R03 = """
        SELECT
            column1,
            column2,
            column3
        FROM
            table3
        """
    db = OracleDB('DB_RPT')
    if (reportId == 'R01'):
        db.cursor.execute(qry_R01)
    elif (reportId == 'R02'):
        db.cursor.execute(qry_R02)
    elif (reportId == 'R03'):
        db.cursor.execute(qry_R03)
    json_data = db.render_json_data('json_arr')
    db.connection.close()
    return json_data
It seems that what you need in this case is a mapping from reportId to the table; the rest of the query is identical. The solution below uses a dictionary and str.format():
def get_report():
    reportId = request.args.get('reportId', '')
    # this maps our possible report IDs to their relevant tables
    reportTableMap = {
        'R01': 'table1',
        'R02': 'table2',
        'R03': 'table3',
    }
    # ensure this report ID is valid, else we'll end up with a KeyError later on
    if reportId not in reportTableMap:
        return 'Error: invalid report'
    baseQuery = '''
        SELECT
            column1,
            column2,
            column3
        FROM {table}
        '''
    db = OracleDB('DB_RPT')
    db.cursor.execute(baseQuery.format(table=reportTableMap[reportId]))
    json_data = db.render_json_data('json_arr')
    db.connection.close()
    return json_data
This solution only works for fairly simple cases, though, and string formatting in general risks leaving open a SQL injection attack. A better solution would be to use prepared statements, but the exact code for that depends on the database driver being used.
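One caveat on prepared statements here: with Oracle drivers such as cx_Oracle, bind variables can carry values but not identifiers, so a table name still has to come from a whitelist like the map above. A hypothetical sketch combining the two (the :min_qty filter and its column are invented purely for illustration):

# table name from the whitelist, value bound by the driver
query = '''
    SELECT column1, column2, column3
    FROM {table}
    WHERE column2 > :min_qty
    '''.format(table=reportTableMap[reportId])
db.cursor.execute(query, min_qty=100)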

insert into mysql database with pymysql failing to insert

I'm trying to insert dummy data into a mysql database.
The database structure looks like:
database name: messaround
database table name: test
table structure:
id (Primary key, auto increment)
path (varchar(254))
UPDATE 2: method and error below.
I have a method to try to insert via:
def insert_into_db(dbcursor, table, *cols, **vals):
    try:
        query = "INSERT INTO {} ({}) VALUES ('{}')".format(table, ",".join(cols), "'),('".join(vals))
        print(query)
        dbcursor.execute(query)
        dbcursor.commit()
        print("inserted!")
    except pymysql.Error as exc:
        print("error inserting...\n {}".format(exc))

connection=conn_db()
insertstmt=insert_into_db(connection, table='test', cols=['path'], vals=['test.com/test2'])
However, this is failing saying:
INSERT INTO test () VALUES ('vals'),('cols')
error inserting...
(1136, "Column count doesn't match value count at row 1")
Can you please assist?
Thank you.
If you use your code:
def insert_into_db(dbcursor, table, *cols, **vals):
    query = "INSERT INTO {} ({}) VALUES ({})".format(table, ",".join(cols), ",".join(vals))
    print(query)

insert_into_db('cursor_here', 'table_here', 'name', 'city', name_person='diego', city_person='Sao Paulo')
Python returns:
INSERT INTO table_here (name,city) VALUES (name_person,city_person)
Now with this other version:
def new_insert_into_db(dbcursor, table, *cols, **vals):
    vals2 = ''
    for first_part, second_part in vals.items():
        vals2 += '\'' + second_part + '\','
    vals2 = vals2[:-1]
    query = "INSERT INTO {} ({}) VALUES ({})".format(table, ",".join(cols), vals2)
    print(query)

new_insert_into_db('cursor_here', 'table_here', 'name', 'city', name_person='diego', city_person='Sao Paulo')
Python will return the correct SQL:
INSERT INTO table_here (name,city) VALUES ('diego','Sao Paulo')
Generally in Python you pass a parameterized query to the DB driver. See this example in PyMySQL's documentation; it constructs the INSERT query with placeholder characters, then calls cursor.execute() passing the query, and a tuple of the actual values.
Using parameterized queries is also recommended for security purposes, as it defeats many common SQL injection attacks.
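A minimal sketch of that pattern against the table from this question (assuming an open pymysql connection named connection):

# %s is a driver placeholder, not Python string formatting
sql = "INSERT INTO test (path) VALUES (%s)"
with connection.cursor() as cursor:
    cursor.execute(sql, ('test.com/test2',))
connection.commit()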
You should print the SQL statement you've generated; that makes it a lot easier to see what's wrong.
But I guess you need quotes ' around string values in your ",".join(vals) (in case there are string values).
So your code is producing
insert into test (path,) values (test.com/test2,);
but it should produce
insert into test (`path`) values ('test.com/test2');
Otherwise try https://github.com/markuman/MariaSQL/ which makes it super easy to insert data into MariaDB/MySQL using pymysql.
Change your query as below:
query = "INSERT INTO {} ({}) VALUES ('{}')".format(table, ",".join(cols), "'),('".join(vals))
As you are using join, the variables are expected to be lists, not strings:
table = 'test'
cols = ['path']
vals = ['test.com/test2', 'another.com/anothertest']
print(query)
"INSERT INTO test (path) VALUES ('test.com/test2'),('another.com/anothertest')"
Update:
def insert_into_db(dbconnection=None, table='', cols=None, vals=None):
    # validate the arguments before touching the connection
    if not (dbconnection and table and cols and vals):
        print('Must need all values')
        quit()
    mycursor = dbconnection.cursor()
    try:
        query = "INSERT INTO {} ({}) VALUES ('{}')".format(table, ",".join(cols), "'),('".join(vals))
        mycursor.execute(query)
        dbconnection.commit()
        print("inserted!")
    except pymysql.Error as exc:
        print("error inserting...\n {}".format(exc))

connection = conn_db()
insertstmt = insert_into_db(dbconnection=connection, table='test', cols=['path'], vals=['test.com/test2'])
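This update still interpolates values into the SQL string, so the quoting problems remain possible. A parameterized variant is sketched below; it assumes, as in the question, one column per row of vals, and uses pymysql's executemany with %s placeholders:

def insert_into_db(dbconnection, table, cols, vals):
    # identifiers (table and column names) cannot be bound,
    # so they must still come from trusted code
    placeholders = ",".join(["%s"] * len(cols))
    query = "INSERT INTO {} ({}) VALUES ({})".format(table, ",".join(cols), placeholders)
    with dbconnection.cursor() as mycursor:
        # one tuple per row; each value is quoted and escaped by the driver
        mycursor.executemany(query, [(v,) for v in vals])
    dbconnection.commit()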

Defining Python function with input

I have a Postgres table called Records with several AppIds. I am writing a function to pull out the data for one AppId (ABC0123) and want to store the result in a dataframe df.
I have created a variable AppId1 to store my AppId (ABC0123) and passed it to the query.
The script executed, but it did not create the dataframe.
def fun(AppId):
    AppId1 = pd.read_sql_query(""" select "AppId" from "Records" where "AppId"='%s' """ % AppId, con=engine)
    query = """" SELECT AppId from "Records" where AppId='%s' """ % AppId1
    df = pd.read_sql_query(query, con=engine)
    return fun

fun('ABC0123')
Change % AppId1 to % AppId, and return df instead of fun:
def fun(AppId):
    AppId1 = pd.read_sql_query(""" select "AppId" from "Records" where "AppId"='%s' """ % AppId, con=engine)
    query = """ SELECT "AppId" from "Records" where "AppId"='%s' """ % AppId
    df = pd.read_sql_query(query, con=engine)
    return df
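As an aside, pandas can also hand bind parameters to the driver, which avoids the quoting issues entirely. A sketch using read_sql_query's params argument (assuming engine is a SQLAlchemy engine backed by psycopg2, whose placeholder style is %(name)s):

def fun(AppId):
    # the driver substitutes %(app_id)s safely; no string formatting needed
    query = 'SELECT "AppId" FROM "Records" WHERE "AppId" = %(app_id)s'
    df = pd.read_sql_query(query, con=engine, params={"app_id": AppId})
    return df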

How can I get the youngest objects from SQLAlchemy?

Each row in my table has a date. The date is not unique; the same date is present more than once.
I want to get all objects with the youngest date.
My solution works, but I am not sure whether this is an elegant SQLAlchemy way.
query = _session.query(Table._date) \
                .order_by(Table._date.desc()) \
                .group_by(Table._date)
# this is the youngest date (type is datetime.date)
young = query.first()

query = _session.query(Table).filter(Table._date == young)
result = query.all()
Isn't there a way to put all of this into one query object or something like that?
You need a having clause, and you need to import the max function. Then your query will be:
from sqlalchemy import func

stmt = _session.query(Table) \
               .group_by(Table._date) \
               .having(Table._date == func.max(Table._date))
This produces a SQL statement like the following:
SELECT my_table.*
FROM my_table
GROUP BY my_table._date
HAVING my_table._date = MAX(my_table._date)
If you construct your SQL statement with a select, you can examine the SQL produced in your case using str(stmt). (I'm not sure whether this works with Query objects.)
Two ways of doing this using a sub-query:
# note: we do not need to alias, but do so in order to specify `name`
T1 = aliased(MyTable, name="T1")

# version-1:
subquery = (session.query(func.max(T1._date).label("max_date"))
            .as_scalar()
            )

# version-2:
subquery = (session.query(T1._date.label("max_date"))
            .order_by(T1._date.desc())
            .limit(1)
            .as_scalar()
            )

qry = session.query(MyTable).filter(MyTable._date == subquery)
results = qry.all()
The output should be similar to:
# version-1
SELECT my_table.id AS my_table_id, my_table.name AS my_table_name, my_table._date AS my_table__date
FROM my_table
WHERE my_table._date = (
SELECT max("T1"._date) AS max_date
FROM my_table AS "T1")
# version-2
SELECT my_table.id AS my_table_id, my_table.name AS my_table_name, my_table._date AS my_table__date
FROM my_table
WHERE my_table._date = (
SELECT "T1"._date AS max_date
FROM my_table AS "T1"
ORDER BY "T1"._date DESC LIMIT ? OFFSET ?
)
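For what it's worth, on SQLAlchemy 1.4 and later as_scalar() is deprecated in favor of scalar_subquery(), and the alias is optional when you don't need the "T1" name. A minimal equivalent of version-1:

from sqlalchemy import func

# correlate the max date as a scalar subquery in a single statement
subq = session.query(func.max(MyTable._date)).scalar_subquery()
results = session.query(MyTable).filter(MyTable._date == subq).all()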
