My goal is to loop over a list that I get from another PythonOperator and, within this loop, save the JSON to the Postgres DB. I am using the TaskFlow API from Airflow 2.0.
The code works fine if I write the SQL statement directly into the sql parameter of the PostgresOperator. But when I put the SQL into a file and pass the file path to the sql parameter, this error is thrown:
psycopg2.errors.SyntaxError: syntax error at or near "sql"
LINE 1: sql/insert_deal_into_deals_table.sql
This is the code of the task:
@task()
def write_all_deals_to_db(all_deals):
    for deal in all_deals:
        deal_json = json.dumps(deal)
        pg = PostgresOperator(
            task_id='insert_deal',
            postgres_conn_id='my_db',
            sql='sql/insert_deal_into_deals_table.sql',
            params={'deal_json': deal_json}
        )
        pg.execute(dict())
The weird thing is that the code works if I use the operator standalone (outside of a PythonOperator), like this:
create_deals_table = PostgresOperator(
    task_id='create_deals_table',
    postgres_conn_id='my_db',
    sql='sql/create_deals_table.sql'
)
I have tried a lot of things, and my guess is that it has to do with the Jinja templating. Somehow, within a PythonOperator, the PostgresOperator can use neither the params nor the .sql file parsing.
Any tip or reference is greatly appreciated!
EDIT:
This code works, but it is more of a quick fix. The actual problem I still have is that Jinja templating does not work for the PostgresOperator when I use it inside a PythonOperator.
@task()
def write_all_deals_to_db(all_deals):
    sql_path = 'sql/insert_deal_into_deals_table.sql'
    for deal in all_deals:
        deal_json = _transform_json(deal)
        sql_query = open(path.join(ROOT_DIRECTORY, sql_path)).read()
        sql_query = sql_query.format(deal_json)
        pg = PostgresOperator(
            task_id='insert_deal',
            postgres_conn_id='my_db',
            sql=sql_query
        )
        pg.execute(dict())
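A likely explanation (an assumption on my part, not something confirmed for every Airflow version): calling pg.execute(dict()) by hand skips the task-execution step in which Airflow renders template fields, so the .sql path and params are never resolved. A minimal sketch that sidesteps templating entirely by using PostgresHook inside the @task (the deals table and payload column names are made up for illustration):

import json

from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task()
def write_all_deals_to_db(all_deals):
    hook = PostgresHook(postgres_conn_id='my_db')  # same connection id as above
    for deal in all_deals:
        # parameterised insert; table and column names are hypothetical
        hook.run(
            "INSERT INTO deals (payload) VALUES (%s)",
            parameters=(json.dumps(deal),),
        )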
Related
I am trying to run a query against my database and I am receiving this error when I run it: "Something went wrong: format requires a mapping."
I'm using Flask in Python and PyMySQL.
This is my class method that is throwing the error:
@classmethod
def get_dojo(cls, data):
    query = 'SELECT * FROM dojos WHERE id = %(id)s;'
    result = connectToMySQL('dojos_and_ninjas').query_db(query, data)
    return cls(result[0])
I thought it might be the data I am passing in, but it looks fine to me, and the query runs fine in Workbench. I tried restarting MySQL, VS Code, and the pipenv.
The data I am passing is:
@app.route('/dojo/<int:id>')
def dojo_page(id):
    dojo_current = Dojo.get_dojo(id)
    return render_template('dojo_page.html', dojo=dojo_current)
My page renders and I receive no error when I enter an id manually instead of passing the data in.
I figured it out: I needed to pass a data dictionary from the route. The %(id)s placeholder expects a mapping (a dict), so passing a bare integer raises "format requires a mapping".
@app.route('/dojo/<int:id>')
def dojo_page(id):
    data = {
        'id': id
    }
    dojo_current = Dojo.get_dojo(data)
    return render_template('dojo_page.html', dojo=dojo_current)
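For context, a short standalone sketch of the same placeholder rule in plain PyMySQL (the connection details are hypothetical and this is not the asker's connectToMySQL helper):

import pymysql

# hypothetical credentials for illustration only
conn = pymysql.connect(host='localhost', user='root', password='root',
                       database='dojos_and_ninjas')
with conn.cursor() as cursor:
    query = 'SELECT * FROM dojos WHERE id = %(id)s;'
    # %(name)s placeholders require a mapping, so parameters must be a dict;
    # cursor.execute(query, 3) would raise "format requires a mapping"
    cursor.execute(query, {'id': 3})
    row = cursor.fetchone()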
After pushing my DAG I get this error
I am new to data engineering. I tried to solve this error in different ways, to the best of my knowledge, but nothing worked. I want to write a DAG that consists of two tasks: the first exports data from database tables on one server as CSV files, and the second imports these CSV files into database tables on another server. An Airflow Variable contains the DAG configuration and the SQL scripts for exporting and importing the data.
Please tell me how I can solve this error.
I have this exporting code:
def export_csv():
    import json
    from airflow.models import Variable
    import pandas as pd

    instruction_data = json.loads(Variable.get('MAIN_SOURCE_DAMDI_INSTRUCTIONS'))
    requirement_data = instruction_data['requirements']
    lst = requirement_data['scripts']
    ms_hook = MsSqlHook(mssql_conn_id='OKTELL')
    connection = ms_hook.get_conn()
    cursor = connection.cursor()
    for i in lst:
        result = cursor.execute(i['export_script'])
        df = pd.DataFrame(result)
        df.to_csv(i['filename'], index=False, header=None, sep=',', encoding='utf-8')
    cursor.close()
And this is my task for exporting:
export_csv_func = PythonOperator(
    task_id='export_csv_func',
    python_callable=export_csv,
    mssql_conn_id='OKTELL'
)
P.S. I import the libraries and Airflow Variables inside the function because there used to be a lot of load on the server, and this approach helped reduce it.
When using the PythonOperator, you pass args to the callable via op_args and/or op_kwargs. In this case, if you want to pass the mssql_conn_id arg, you can try:
export_csv_func = PythonOperator(
    task_id='export_csv_func',
    python_callable=export_csv,
    op_kwargs={'mssql_conn_id': 'OKTELL'},
)
Then you need to update the export_csv() function signature to accept this kwarg too.
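For example, a minimal sketch of the adjusted signature (the provider import path is an assumption; on older Airflow installs MsSqlHook may live under a different module):

def export_csv(mssql_conn_id, **kwargs):
    # the connection id now arrives via op_kwargs instead of being hard-coded
    from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

    ms_hook = MsSqlHook(mssql_conn_id=mssql_conn_id)
    connection = ms_hook.get_conn()
    cursor = connection.cursor()
    # ... rest of the export logic from the question ...
    cursor.close()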
I'm writing a web app for a college assignment using Python/Flask and, to keep my app.py file neat, I have a function that queries a DB stored in another file. This function uses the pymysql and json modules, and I can't manage to import them in a way that makes it work: I keep getting an attribute error saying pymysql is not defined.
I've tried putting import statements in my module file (DBjson.py), within the function contained in my module, and within app.py. This is my module code:
def fetchFromDB(host, port, dbname, user, password, query, jsonString=False):
    import pymysql  # these import statements are in the function in this example - one of several places I've tried them!
    import json

    conn = pymysql.connect(host, user=user, port=port, passwd=password, db=dbname)
    cursorObject = conn.cursor(pymysql.cursors.DictCursor)
    with cursorObject as cursor:
        cursor.execute(query)
        result = cursor.fetchall()
    conn.close()
    if jsonString == True:
        try:
            for i in range(len(result)):
                result[i]['dateTime'] = result[i]['dateTime'].strftime('%Y-%m-%d %H:%M')
        except:
            pass
        result = json.dumps(result)
    return result
And the route from my app.py:
import pymysql
import json

@app.route('/')
def index():
    wds = DBjson.fetchFromDB(host, port, dbname, user, password, weatherQuery)
    bds = DBjson.fetchFromDB(host, port, dbname, user, password, bikesQuery)
    return render_template('weatherDateTime.html', wds=wds, bds=bds)
Any help on how to make this work?
Thanks!
edit - I wrote a test script from which I can load my module and run my function with no problem - in that case I have my import statements at the start of the DBjson.py module file, outside of the function. Is this some quirk of Flask/scoping that I don't know about?
PS - Thanks for all the replies so far
import DBjson

query = "SELECT * FROM dublinBikesInfo WHERE dateTime LIKE (SELECT MAX(datetime) FROM dublinBikesInfo);"

# login details for AWS RDS DB
host = "xyza"
port = 3306
dbname = "xyza"
user = "xyza"
password = "xyza"

a = DBjson.fetchFromDB(host, port, dbname, user, password, query)
print(a)
Hi, in your code there is an indentation error: all the statements have to be inside the function/method that you have created, e.g.

def method():
    # code here

Also, importing the libraries at the beginning of the file, before defining any function/method, is good practice!
In your scenario, please put all the statements that belong to the function/method inside the function/method.
Python is very unforgiving about indentation, and your module code isn't indented correctly. The proper way to do this would be:

def function():
    # your code indented in here
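Putting both answers together, a rough sketch of how DBjson.py could be laid out (imports at module level, everything else indented inside the function; this keeps the asker's logic but swaps the per-row strftime loop for json.dumps(..., default=str), which is my own substitution):

import json
import pymysql

def fetchFromDB(host, port, dbname, user, password, query, jsonString=False):
    conn = pymysql.connect(host=host, user=user, port=port,
                           passwd=password, db=dbname)
    with conn.cursor(pymysql.cursors.DictCursor) as cursor:
        cursor.execute(query)
        result = cursor.fetchall()
    conn.close()
    if jsonString:
        # default=str stringifies datetime values that json cannot serialise
        result = json.dumps(result, default=str)
    return result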
Introduction
I'm developing a Python web app running on Flask. One of the modules I developed uses sqlite3 to access a database file in one of my project directories. Locally it works like a charm, but I am having trouble making it run properly on PythonAnywhere.
Code
Here's an excerpt from my module_database.py (both SQL queries are plain SELECTs):
import sqlite3
import os

PATH_DB = os.path.join(os.path.dirname(__file__), 'res/database.db')
db = sqlite3.connect(PATH_DB)
cursor = db.cursor()

def init():
    cursor.execute(my_sql_query)
    val = cursor.fetchone()

def process():
    cursor.execute(another_sql_query)
    another_val = cursor.fetchone()
I don't know if it's important, but my module is imported like this:
from importlib import import_module

module = import_module(absolute_path_to_module)
module.init()  # module init
And afterwards my webapp will regularly call:
module.process()
So I have one access to the DB in my init() and one access to the DB in my process(). Both work when I run the app locally.
Problem
I pulled my code via GitHub onto PythonAnywhere and restarted the app. I can see in the log file that the DB access in init() worked (I print a value, and it comes out fine).
But then, when my app calls the process() method, I get:
2017-11-06 16:27:55,551: File "/home/account-name/project-name/project_modules/module_database.py", line 71, in my_method
2017-11-06 16:27:55,551: cursor.execute(sql)
2017-11-06 16:27:55,552: sqlite3.DatabaseError: database disk image is malformed
I tried running an integrity check via the console:
PRAGMA integrity_check;
and it prints OK.
I'd be glad to hear if you have any idea where this could come from.
A small thing, and it may not fix your specific problem, but you should always call path.abspath on __file__ before calling path.dirname; otherwise you can get unpredictable results depending on how your code is imported/loaded/run:
PATH_DB = os.path.join(
    os.path.dirname(os.path.abspath(__file__)),
    'res/database.db'
)
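Another thing that may or may not be related to the malformed-image error (an assumption, not a confirmed diagnosis): the module opens a single sqlite3 connection at import time and shares it across all calls, and sqlite3 connections are not safe to share between threads by default, which can bite on a multi-threaded WSGI host. A minimal sketch that opens a short-lived connection per call (the placeholder queries stand in for the ones in module_database.py):

import os
import sqlite3

PATH_DB = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'res/database.db')

# placeholder queries; the real SELECTs are the ones from module_database.py
my_sql_query = "SELECT 1"
another_sql_query = "SELECT 2"

def _query_one(sql, params=()):
    # open a fresh connection per call so no connection object is shared across threads
    conn = sqlite3.connect(PATH_DB)
    try:
        cursor = conn.execute(sql, params)
        return cursor.fetchone()
    finally:
        conn.close()

def init():
    return _query_one(my_sql_query)

def process():
    return _query_one(another_sql_query)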
I am running into the dreaded MySQL "Commands out of Sync" error when using a custom DB library and Celery.
The library is as follows:
import pymysql
import pymysql.cursors
from furl import furl
from flask import current_app


class LegacyDB:
    """Db
    Legacy Database connectivity library
    """
    def __init__(self, app):
        with app.app_context():
            self.rc = current_app.config['RAVEN']
            self.logger = current_app.logger
            self.data = {}
            # setup Mysql
            try:
                uri = furl(current_app.config['DBCX'])
                self.dbcx = pymysql.connect(
                    host=uri.host,
                    user=uri.username,
                    passwd=uri.password,
                    db=str(uri.path.segments[0]),
                    port=int(uri.port),
                    cursorclass=pymysql.cursors.DictCursor
                )
            except:
                self.rc.captureException()

    def query(self, sql, params=None, TTL=36):
        # INPUT 1 : SQL query
        # INPUT 2 : Parameters
        # INPUT 3 : Time To Live
        # OUTPUT  : Array of result
        # check that we're still connected to the
        # database before we fire off the query
        try:
            db_cursor = self.dbcx.cursor()
            if params:
                self.logger.debug("%s : %s" % (sql, params))
                db_cursor.execute(sql, params)
                self.dbcx.commit()
            else:
                self.logger.debug("%s" % sql)
                db_cursor.execute(sql)
            self.data = db_cursor.fetchall()
            if self.data == None:
                self.data = {}
            db_cursor.close()
        except Exception as ex:
            if ex[0] == "2006":
                db_cursor.close()
                self.connect()
                db_cursor = self.dbcx.cursor()
                if params:
                    db_cursor.execute(sql, params)
                    self.dbcx.commit()
                else:
                    db_cursor.execute(sql)
                self.data = db_cursor.fetchall()
                db_cursor.close()
            else:
                self.rc.captureException()
        return self.data
The purpose of the library is to work alongside SQLAlchemy while I migrate a legacy database schema from a C++-based system to a Python-based one.
All configuration is done via a Flask application, and the app.config['DBCX'] value reads the same as a SQLAlchemy connection string ("mysql://user:pass@host:port/dbname"), allowing me to switch over easily in future.
I have a number of tasks that run INSERT statements via Celery, all of which use this library. As you can imagine, the main reason for running Celery is to increase throughput on this application. However, I seem to be hitting an issue with threading in my library or the application: after a while (around 500 processed messages) I see the following in the logs:
Stacktrace (most recent call last):
  File "legacy/legacydb.py", line 49, in query
    self.dbcx.commit()
  File "pymysql/connections.py", line 662, in commit
    self._read_ok_packet()
  File "pymysql/connections.py", line 643, in _read_ok_packet
    raise OperationalError(2014, "Command Out of Sync")
I'm obviously doing something wrong to hit this error, but it doesn't seem to matter whether MySQL has autocommit enabled or disabled, or where I place my connection.commit() call.
If I leave out the connection.commit() then I don't get anything inserted into the database.
I've recently moved from MySQLdb to PyMySQL and the occurrences appear to be less frequent; however, given that these are simple INSERT commands and not complicated SELECTs (there aren't even any foreign key constraints on this database!), I'm struggling to work out where the issue is.
As things stand at present, I am unable to use executemany as I cannot prepare the statements in advance (I am pulling data from a "firehose" message queue and storing it locally for later processing).
First of all, make sure that the Celery workers use their own connection(s), since:
>>> pymysql.threadsafety
1
Which means: "threads may share the module but not connections".
Is the init called once, or per-worker? If only once, you need to move the initialisation.
How about lazily initialising the connection in a thread-local variable the first time query is called?
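A minimal sketch of that idea (assuming the same PyMySQL settings as in LegacyDB above; the helper names and the uri_parts dict are illustrative only):

import threading

import pymysql
import pymysql.cursors

_local = threading.local()

def _get_connection(uri_parts):
    # create one connection per thread, the first time that thread needs it
    if getattr(_local, 'dbcx', None) is None:
        _local.dbcx = pymysql.connect(
            host=uri_parts['host'],
            user=uri_parts['user'],
            passwd=uri_parts['passwd'],
            db=uri_parts['db'],
            port=uri_parts['port'],
            cursorclass=pymysql.cursors.DictCursor,
        )
    return _local.dbcx

def query(uri_parts, sql, params=None):
    conn = _get_connection(uri_parts)
    with conn.cursor() as cursor:
        cursor.execute(sql, params)
        data = cursor.fetchall()
    conn.commit()
    return data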