Airflow and Templates reference and PostgresHook - python

I have a question about the Airflow templates reference, specifically {{ds}}.
When the template is substituted in a PostgresOperator, everything seems to work.
With a PostgresHook, however, it does not work:
def prc_mymys_update(procedure: str, type_agg: str):
    with PostgresHook(postgres_conn_id=CONNECTION_ID_GP).get_conn() as conn:
        with conn.cursor() as cur:
            with open(URL_YML_2, "r", encoding="utf-8") as f:
                ya_2 = yaml.safe_load(f)
            yml_mymts_2 = ya_2['type_agg']
            query_pg = ""
            if yml_mymts_2[0]['type_agg_name'] == "day" and type_agg == "day":
                sql_1 = yml_mymts_2[0]['sql']
                query_pg = f"""{sql_1}"""
            elif yml_mymts_2[1]['type_agg_name'] == "retention" and type_agg == "retention":
                sql_2 = yml_mymts_2[1]['sql']
                query_pg = f"""{sql_2}"""
            elif yml_mymts_2[2]['type_agg_name'] == "mau" and type_agg == "mau":
                sql_3 = yml_mymts_2[2]['sql']
                query_pg = f"""{sql_3}"""
            cur.execute(query_pg)
            dates_list = cur.fetchall()
            for date_res in dates_list:
                cur.execute(
                    "select from {}(%(date)s::date);".format(procedure),
                    {"date": date_res[0].strftime("%Y-%m-%d")},
                )
    conn.close()
The queries are loaded from a YAML file:
type_agg:
  - type_agg_name: day
    sql: select calendar_date from entertainment_dds.v_calendar where calendar_date between '{{ds}}'::date - interval '7 days' and '{{ds}}'::date - 1 order by 1 desc
  - type_agg_name: retention
    sql: SELECT t.date::date AS date FROM generate_series((date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) - interval '11 month'), date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) , '1 month'::interval) t(date) order by 1 asc
  - type_agg_name: mau
    sql: select dt::date date_ from generate_series('{{execution_date.strftime('%Y-%m-%d')}}'::date - interval '7 days', '{{execution_date.strftime('%Y-%m-%d')}}'::date - interval '1 days', interval '1 days') dt order by 1 asc
When I run the DAG and it reaches the task that uses
  - type_agg_name: retention
    sql: SELECT t.date::date AS date FROM generate_series((date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) - interval '11 month'), date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) , '1 month'::interval) t(date) order by 1 asc
I get this error:
psycopg2.errors.UndefinedColumn: column "y" does not exist
LINE 1: ...((date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}...
I tried to find information on how the templates reference interacts with PostgresHook, but found nothing.
https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html#templates-reference

This is expected. template_fields is an attribute of BaseOperator, from which all Airflow operators inherit, and only the fields listed there are rendered. This is why passing a Jinja expression to the PostgresOperator works just fine.
If you need to write a custom task, you have to render the template values explicitly. Something like this (untested, but it should be easy to adapt to your function):
def prc_mymys_update(procedure: str, type_agg: str, ti):
    ti.render_templates()
    with PostgresHook(postgres_conn_id=CONNECTION_ID_GP).get_conn() as conn:
        with conn.cursor() as cur:
            ...
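The ti kwarg is the Airflow TaskInstance, which is directly accessible as part of the execution context pushed to every task in Airflow. That object has a render_templates() method, which translates the Jinja expressions in the task's templated fields into values. As far as I know, a SQL string that is only loaded from YAML inside the function is not one of those fields, so it may still need to be rendered explicitly.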
If the PostgresOperator doesn't fit your needs you can always subclass the operator and tailor it accordingly.
Also, the sql string itself has single quotes which cause string parsing issues as you're seeing:
'{{execution_date.strftime('%Y-%m-%d')}}'
Should be something like:
'{{execution_date.strftime("%Y-%m-%d")}}'
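To see why PostgreSQL complains about a column named y, here is a minimal sketch of how a SQL parser sees the unrendered string. The helper split_sql_literals is hypothetical and deliberately naive (it ignores doubled '' escapes and dollar quoting); it only separates single-quoted literals from everything else.

```python
def split_sql_literals(sql):
    """Split sql into (kind, text) pieces; kind is 'literal' or 'code'.

    Naive on purpose: every single quote toggles literal mode.
    """
    pieces = []
    buf = []
    in_literal = False
    for ch in sql:
        if ch == "'":
            if in_literal:
                # Closing quote: finish the current literal.
                buf.append(ch)
                pieces.append(("literal", "".join(buf)))
                buf = []
                in_literal = False
            else:
                # Opening quote: flush any code seen so far.
                if buf:
                    pieces.append(("code", "".join(buf)))
                buf = [ch]
                in_literal = True
        else:
            buf.append(ch)
    if buf:
        pieces.append(("literal" if in_literal else "code", "".join(buf)))
    return pieces


broken = "'{{execution_date.strftime('%Y-%m-%d')}}'::date"
fixed = "'{{execution_date.strftime(\"%Y-%m-%d\")}}'::date"

broken_pieces = split_sql_literals(broken)
fixed_pieces = split_sql_literals(fixed)

for kind, text in broken_pieces:
    print("broken", kind, text)
for kind, text in fixed_pieces:
    print("fixed ", kind, text)
```

With single quotes the date format ends up outside any literal, where PostgreSQL presumably reads % as an operator and Y as an identifier (folded to lowercase y). This only matters while the template reaches the database unrendered; once Jinja actually renders the expression, the whole {{ ... }} is replaced by the date.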

Note the single quotes in the following query:
sql: SELECT t.date::date AS date FROM generate_series((date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) - interval '11 month'), date_trunc('month','{{execution_date.strftime('%Y-%m-%d')}}'::date) , '1 month'::interval) t(date) order by 1 asc
Specifically, this part:
'{{execution_date.strftime('%Y-%m-%d')}}'
You have two separate strings here, separated by the date format. Here's the first string:
'{{execution_date.strftime('
This causes the date format to be rendered separately. If you wrap the date format in double quotes instead of single quotes, it should resolve this error. For example:
sql: SELECT t.date::date AS date FROM generate_series((date_trunc('month','{{execution_date.strftime("%Y-%m-%d")}}'::date) - interval '11 month'), date_trunc('month','{{execution_date.strftime("%Y-%m-%d")}}'::date) , '1 month'::interval) t(date) order by 1 asc
Note that you might need to swap the double and single quotes if double quotes in the RDBMS are used for other purposes, for example:
"{{execution_date.strftime('%Y-%m-%d')}}"

Related

There are some problems when using Sqlalchemy to query the data between time1 and time2

My database is SQL Server 2008.
The column I want to filter on (such as FinishDate) has the type datetime2.
I just want the rows between "10-11" and "10-17".
With SQLAlchemy I use
cast(FinishDate, DATE).between(cast(time1, DATE),cast(time2, DATE))
to filter by date, but it does not return any data (I have confirmed that rows within that date range exist).
==============================================
from sqlalchemy import DATE, cast

bb = "2021-10-11 12:21:23"
cc = "2021-10-17 16:12:34"
record = session.query(sa.Name, cast(sa.FinishDate, DATE)).filter(
    cast(sa.SamplingTime, DATE).between(cast(bb, DATE), cast(cc, DATE)),
    sa.SamplingType != 0
).all()
or
record = session.query(sa.Name, cast(sa.FinishDate, DATE)).filter(
    cast(sa.SamplingTime, DATE) >= cast(bb, DATE),
    sa.SamplingType != 0
).all()
Both return []
Something is wrong with my code and I don't know what the trouble is.
It works for me; the only thing I changed was the DATE type you are using, which I replaced with Date:
from sqlalchemy import Date, cast

record = session.query(
    sa.Name, cast(sa.FinishDate, Date)
).filter(
    cast(sa.SamplingTime, Date).between(
        cast(bb, Date), cast(cc, Date)
    ),
    sa.SamplingType != 0
).all()
As a matter of fact, the first parameter of cast can also be a string, so in this case it is fine to pass the date to cast as a string. From the docstring:
:param expression: A SQL expression, such as a
:class:`_expression.ColumnElement`
expression or a Python string which will be coerced into a bound
literal value.
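As a rough illustration of why both sides are cast to a date, here is a small sketch using the standard-library sqlite3 module as a stand-in for SQL Server (SQLite's date() plays the role of CAST(... AS DATE) here). The samples table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (name TEXT, finish_date TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [
        ("early", "2021-10-10 23:59:59"),
        ("inside", "2021-10-12 08:00:00"),
        ("last_day_evening", "2021-10-17 18:00:00"),
        ("late", "2021-10-18 00:00:01"),
    ],
)

bb = "2021-10-11 12:21:23"
cc = "2021-10-17 16:12:34"

# Comparing full timestamps: the evening of the 17th is after cc, so it is dropped.
by_timestamp = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM samples "
        "WHERE finish_date BETWEEN ? AND ? ORDER BY finish_date",
        (bb, cc),
    )
]

# Comparing dates only: everything on the 17th is included.
by_date = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM samples "
        "WHERE date(finish_date) BETWEEN date(?) AND date(?) ORDER BY finish_date",
        (bb, cc),
    )
]

print(by_timestamp)
print(by_date)
```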

Error with SQL string: "Error while connecting to PostgreSQL operator does not exist: date = integer"

I have a Python(3) script that is supposed to run each morning. In it, I call some SQL. However I'm getting an error message:
Error while connecting to PostgreSQL operator does not exist: date = integer
The SQL is based on the concatenation of a string:
ecom_dashboard_query = """
with
days_data as (
    select
        s.date,
        s.user_type,
        s.channel_grouping,
        s.device_category,
        sum(s.sessions) as sessions,
        count(distinct s.dimension2) as daily_users,
        sum(s.transactions) as transactions,
        sum(s.transaction_revenue) as revenue
    from ga_flagship_ecom.sessions s
    where date = """ + run.start_date + """
    group by 1,2,3,4
)
insert into flagship_reporting.ecom_dashboard
select *
from days_data;
"""
Here is the full error:
09:31:25 Error while connecting to PostgreSQL operator does not exist: date = integer
09:31:25 LINE 14: where date = 2020-01-19
09:31:25 ^
09:31:25 HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
I tried wrapping run.start_date within str like so: str(run.start_date) but I received the same error message.
I suspect it may be to do with the way I concatenate the SQL query string, but I am not sure.
The query runs fine in SQL directly with a hard coded date and no concatenation:
where date = '2020-01-19'
How can I get the query string to work correctly?
It's better to pass query parameters to the cursor.execute method. From the psycopg2 docs:
Warning: Never, never, NEVER use Python string concatenation (+) or string parameters interpolation (%) to pass variables to a SQL query string. Not even at gunpoint.
So instead of concatenating strings, pass run.start_date as the second argument of cursor.execute.
In your query, use %s in place of the concatenation:
    where date = %s
    group by 1,2,3,4
In your Python code, add the second argument to the execute call:
cur.execute(ecom_dashboard_query , (run.start_date,))
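Here is a small runnable sketch of the difference, using the standard-library sqlite3 module only so that it runs without a server. Note that sqlite3 uses ? as its placeholder where psycopg2 uses %s, and the sessions table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (date TEXT, sessions INTEGER)")
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?)",
    [("2020-01-19", 10), ("2020-01-20", 7)],
)

start_date = "2020-01-19"

# Concatenation pastes the date in without quotes, so the database
# sees the arithmetic expression 2020 - 01 - 19.
unquoted = conn.execute("SELECT " + start_date).fetchone()[0]

# A bound parameter is sent as a value, so no quoting is needed.
rows = conn.execute(
    "SELECT date, sessions FROM sessions WHERE date = ?",
    (start_date,),
).fetchall()

print(unquoted)
print(rows)
```

The first result is an integer, which matches the "date = integer" complaint in the PostgreSQL error; the second query matches only the row for the requested date.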
Your statement is wrong:
where date = """ + run.start_date + """
The date is inserted without quotes, so PostgreSQL evaluates 2020-01-19 as integer arithmetic and ends up comparing a date with an integer, which is not possible. Convert run.start_date to a date first:
date_format = datetime.strptime(your_date_string, '%Y-%m-%d').date()
and then compare against the converted date as a quoted literal:
where date = '{}'
Final code:
from datetime import datetime

# Parsing validates the input; note %Y (four-digit year), not %y.
date_format = datetime.strptime(your_date_string, '%Y-%m-%d').date()
ecom_dashboard_query = """
with
days_data as (
    select
        s.date,
        s.user_type,
        s.channel_grouping,
        s.device_category,
        sum(s.sessions) as sessions,
        count(distinct s.dimension2) as daily_users,
        sum(s.transactions) as transactions,
        sum(s.transaction_revenue) as revenue
    from ga_flagship_ecom.sessions s
    where date = '{}'
    group by 1,2,3,4
)
insert into flagship_reporting.ecom_dashboard
select *
from days_data;
""".format(date_format)

Sqlalchemy query is very slow after using the in_() method

filters.append(Flow.time_point >= datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S'))
filters.append(Flow.time_point <= datetime.strptime(end_time, '%Y-%m-%d %H:%M:%S'))
if domain_name != 'all':
    filters.append(Bandwidth.domain_name.in_(domain_name.split('|')))
flow_list = db.session.query(Flow.time_point, db.func.sum(Flow.value).label('value')).filter(*filters).group_by(Flow.time_point).order_by(Flow.time_point.asc()).all()
The query time is 3 to 4 seconds when domain_name is 'all', otherwise the query time is 5 minutes. I have tried to add an index to a column but to no avail. What could be the reason for this?
When domain_name is not 'all' you end up performing an implicit CROSS JOIN between Flow and Bandwidth. When you add the IN predicate to your list of filters SQLAlchemy also picks up Bandwidth as a FROM object. As there is no explicit join between the two, the query will end up as something like:
SELECT flow.time_point, SUM(flow.value) AS value FROM flow, bandwidth WHERE ...
--                                                        ^
--                                                        `- This is the problem
In the worst case the planner produces a query that first joins every row from Flow with every row from Bandwidth. If your tables are even moderately big, the resulting set of rows can be huge.
Without seeing your models it is impossible to produce an exact solution, but in general you should include the proper join in your query, if you include Bandwidth:
query = db.session.query(Flow.time_point, db.func.sum(Flow.value).label('value'))
filters.append(Flow.time_point >= datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S'))
filters.append(Flow.time_point <= datetime.strptime(end_time, '%Y-%m-%d %H:%M:%S'))
if domain_name != 'all':
    query = query.join(Bandwidth)
    filters.append(Bandwidth.domain_name.in_(domain_name.split('|')))
flow_list = query.\
    filter(*filters).\
    group_by(Flow.time_point).\
    order_by(Flow.time_point.asc()).\
    all()
If there are no foreign keys connecting your models, you must provide the ON clause as the second argument to Query.join() explicitly.
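To make the effect of the implicit cross join concrete, here is a minimal sketch in plain SQL via the standard-library sqlite3 module. The schema is an assumption, since the question does not show the models; in particular the flow.bandwidth_id column linking the two tables is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bandwidth (id INTEGER PRIMARY KEY, domain_name TEXT)")
conn.execute(
    "CREATE TABLE flow (id INTEGER PRIMARY KEY, bandwidth_id INTEGER, "
    "time_point TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO bandwidth VALUES (?, ?)",
    [(1, "a.com"), (2, "b.com"), (3, "c.com")],
)
conn.executemany(
    "INSERT INTO flow VALUES (?, ?, ?, ?)",
    [(1, 1, "t1", 10), (2, 2, "t1", 20), (3, 3, "t1", 30)],
)

domains = ("a.com", "b.com")

# No join condition: every flow row is paired with every matching bandwidth row.
cross_total = conn.execute(
    "SELECT SUM(flow.value) FROM flow, bandwidth "
    "WHERE bandwidth.domain_name IN (?, ?)",
    domains,
).fetchone()[0]

# Explicit join: each flow row is paired only with its own bandwidth row.
joined_total = conn.execute(
    "SELECT SUM(flow.value) FROM flow "
    "JOIN bandwidth ON flow.bandwidth_id = bandwidth.id "
    "WHERE bandwidth.domain_name IN (?, ?)",
    domains,
).fetchone()[0]

print(cross_total, joined_total)
```

Besides being slow on large tables, the cross join also inflates the sum, because each flow row is counted once per matching bandwidth row.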

Python MySQLdb prevent SQL injection - not working as expected

I am trying to query a MySQL database in a secure way, avoiding SQL injection. I am getting an error when trying to execute the SQL in the DB cursor.
My code looks like this:
reseller_list = ('138',)
for reseller_id in reseller_list:
    cur1 = db.cursor()
    dbQuery = """
        SELECT
            TRIM(CONCAT(TRIM(c1.first_name), ' ', TRIM(c1.name))) AS 'User name',
            FORMAT(sum(cost1),2) AS 'cost1',
            FORMAT(sum(cost2),2) AS 'cost2'
        FROM
            client as c1,
            client as c2
        WHERE
            c2.id = %s
            AND start BETWEEN DATE_FORMAT(CURRENT_DATE - INTERVAL 1 MONTH, '%Y-%m-01 00:00:00')
            AND DATE_FORMAT(LAST_DAY(CURRENT_DATE - INTERVAL 1 MONTH), '%Y-%m-%d 23:59:59')
        GROUP BY
            c1.name
        ORDER BY
            CONCAT(c1.first_name, ' ', c1.name);
    """
    cur1.execute(dbQuery, (reseller_id,))
And what happens is this:
cur1.execute(dbQuery, (reseller_id,))
File "/usr/lib64/python2.7/site-packages/MySQLdb/cursors.py", line 159, in execute
query = query % db.literal(args)
TypeError: not enough arguments for format string
I have read a number of pages both on this site and others but can't see what I am doing wrong. I can easily do this using string substitution into the query but want to do it the right way!
You have % signs in your DATE_FORMAT calls, so you'll need to escape them from the parameter substitution by doubling them:
        WHERE
            c2.id = %s
            AND start BETWEEN DATE_FORMAT(CURRENT_DATE - INTERVAL 1 MONTH, '%%Y-%%m-01 00:00:00')
            AND DATE_FORMAT(LAST_DAY(CURRENT_DATE - INTERVAL 1 MONTH), '%%Y-%%m-%%d 23:59:59')
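The traceback shows that MySQLdb builds the final statement with query % db.literal(args), so the behaviour can be reproduced with plain Python string formatting. The sketch below models only that step; the shortened queries are made up, and the args tuple stands in for the already-quoted value the driver would produce.

```python
bad_query = (
    "SELECT * FROM client WHERE id = %s "
    "AND start >= DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')"
)
good_query = (
    "SELECT * FROM client WHERE id = %s "
    "AND start >= DATE_FORMAT(CURRENT_DATE, '%%Y-%%m-01')"
)

# Stand-in for db.literal(args): one value, already quoted for SQL.
args = ("'138'",)

try:
    bad_query % args
    error = None
except (TypeError, ValueError) as exc:
    # %Y and %m are treated as extra format specifiers, so formatting fails.
    error = exc

# Doubled percent signs collapse to a single % and reach MySQL intact.
rendered = good_query % args

print(repr(error))
print(rendered)
```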

Parameterize a quoted string in Python's SQL DBI

I am using pg8000 to connect to a PostgreSQL database via Python. I would like to be able to send in dates as parameters via the cursor.execute method:
def info_by_month(cursor, year, month):
    query = """
        SELECT *
        FROM info
        WHERE date_trunc('month', info.created_at) =
              date_trunc('month', '%s-%s-01')
    """
    cursor.execute(query, (year, month))
    return cursor
This will raise the error: InterfaceError: '%s' not supported in a quoted string within the query string. It's possible to use Python's string formatting to insert the date in there. The use of the string formatting mini language provides a measure of data validation to prevent SQL injection attacks, but it's still pretty ugly.
def info_by_month(cursor, year, month):
    query = """
        SELECT *
        FROM info
        WHERE date_trunc('month', info.created_at) =
              date_trunc('month', '{:04}-{:02}-01')
    """.format(year, month)
    cursor.execute(query)
    return cursor
How do I send a quoted string into the cursor.execute method?
Do the format ahead of time, and then pass the resulting string into execute. That way you avoid the SQL injection potential, but still get the formatting you want.
e.g. the query becomes:
query = """
    SELECT *
    FROM info
    WHERE date_trunc('month', info.created_at) =
          date_trunc('month', %s)"""
And then the format and execute becomes:
dateStr = "{:04}-{:02}-01".format(year, month)
# Parameters must be passed as a sequence, hence the one-element tuple.
cursor.execute(query, (dateStr,))
I use psycopg2, but it appears pg8000 adheres to the same DBI standard, so I would expect this to work in pg8000, too.
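For a version that can be run without a PostgreSQL server, here is a sketch of the same idea with the standard-library sqlite3 module. It is only a stand-in: sqlite3 uses ? instead of %s, and SQLite's strftime('%Y-%m', ...) replaces date_trunc('month', ...). The info table and its rows are made up.

```python
import sqlite3


def info_by_month(conn, year, month):
    # Build the complete date string in Python first ...
    date_str = "{:04d}-{:02d}-01".format(year, month)
    query = (
        "SELECT id FROM info "
        "WHERE strftime('%Y-%m', created_at) = strftime('%Y-%m', ?) "
        "ORDER BY id"
    )
    # ... then bind it as one parameter, outside of any quoted string.
    return conn.execute(query, (date_str,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE info (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO info VALUES (?, ?)",
    [
        (1, "2021-03-15 10:00:00"),
        (2, "2021-04-02 09:30:00"),
        (3, "2021-03-31 23:59:59"),
    ],
)

march_rows = info_by_month(conn, 2021, 3)
print(march_rows)
```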
It's possible to do this via concatenation, to the detriment of readability.
query = """
    SELECT *
    FROM info
    WHERE date_trunc('month', info.created_at) =
          date_trunc('month', %s || '-' || %s || '-01')
"""
cursor.execute(query, (year, month))
