Programmatically set connections / variables in Airflow - Python

Is there a way to set connections / variables programmatically in Airflow? I am aware this defeats the purpose of not exposing these details in the code, but for debugging it would really help if I could do something like the following pseudocode:
# pseudo code
from airflow import connections
connections.add({name:'...',
user:'...'})

Connection is a DB entity, so you can create one yourself. See below:
from airflow import settings
from airflow.models import Connection
conn = Connection(
    conn_id=conn_id,
    conn_type=conn_type,
    host=host,
    login=login,
    password=password,
    port=port
)
session = settings.Session()
session.add(conn)
session.commit()
As for variables, just use the API. See the example below:
from airflow.models import Variable
Variable.set("my_key", "my_value")
A good blog post on this topic can be found here.

Related

Does SQLAlchemy close sessions after commit()?

So this question is a little like Does SQLAlchemy reset the database session between SQLAlchemy Sessions from the same connection?
I have a Flask/SQLAlchemy/Postgres app, which intermittently seems to drop connections after a commit() that occurs as part of a POST request.
This causes me headaches as I rely upon a customized option (https://www.postgresql.org/docs/9.6/runtime-config-custom.html) to control row level security - in effect executing the following before each Flask request while utilising scoped sessions:
@app.before_request
def load_user():
    ...
    # Set up RLS.
    statement = f"SET app.permitted_workspace_id = '{workspace_id}'"
    db.db_session.execute(statement)
    ...
This pattern generally works fine, but occasionally seems to fail when, so far as I can tell, after a commit(), SQLAlchemy drops the existing session and checks out a new one, in which app.permitted_workspace_id is no longer set.
My workaround for this is to listen for session checkout events, and then re-set the parameter:
@event.listens_for(db_engine, 'checkout')
def receive_checkout(dbapi_connection, connection_record, connection_proxy):
    ...
    cursor = dbapi_connection.cursor()
    statement = f"SET app.permitted_workspace_id = '{g.user.workspace_id}'"
    cursor.execute(statement)
    return
So my question is really: is it unavoidable that SQLAlchemy may close sessions after commit(), meaning I lose my session parameters - even with more DB work still to do?
If so, do we think this pattern is secure or even acceptable practice? Ideally, I'd keep the session open until removed (via @app.teardown_appcontext), but since I'm struggling to achieve that, and still have the relevant info available within the Flask request, I think this is the next best way to go.
Thanks
Edit 1:
In terms of session scoping, the layout is this:
In a database module, I lay out the following:
def get_database_connection():
    ...
    db_engine = sa.create_engine(
        f'postgresql://{user}:{password}@{host}/postgres',
        echo=False,
        poolclass=sa.pool.NullPool
    )
    # Connect - RLS is controlled by db_get_user_details.
    db_connection = db_engine.connect()
    db_session = scoped_session(
        sessionmaker(
            autocommit=False,
            autoflush=False,
            expire_on_commit=False,
            bind=db_engine
        )
    )
    return (db_engine, db_session, db_connection)
This is then called up top from inside the main Flask application:
db_engine, db_session, db_connection = db.get_database_connection()
And session removal is controlled by a function as follows:
@app.teardown_appcontext
def remove_session(exception=None):
    db_session.remove()
So the answer seems to be that commit() does perform a checkin with this pattern:
https://github.com/sqlalchemy/sqlalchemy/issues/4925
if Session is what you're working with then yes, the Session will release connections when any of commit(), rollback(), or close() is called.
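To see why a per-connection setting disappears with this setup, here is a toy model using only the standard library. It is not SQLAlchemy: sqlite3 stands in for Postgres, a temp table stands in for the `SET app.permitted_workspace_id` parameter, and `FreshConnectionPool`, `set_workspace`, and `get_workspace` are hypothetical names invented for the illustration. It assumes, as with the NullPool configuration in the question, that every checkout opens a brand-new connection.

```python
import sqlite3


def set_workspace(conn, workspace_id):
    """Store a per-connection setting (stand-in for a Postgres SET)."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS session_settings (workspace_id TEXT)")
    conn.execute("DELETE FROM session_settings")
    conn.execute("INSERT INTO session_settings VALUES (?)", (workspace_id,))


def get_workspace(conn):
    """Read the per-connection setting, or None if it was never set."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS session_settings (workspace_id TEXT)")
    row = conn.execute("SELECT workspace_id FROM session_settings").fetchone()
    return row[0] if row else None


class FreshConnectionPool:
    """Hypothetical pool that opens a new connection on every checkout."""

    def __init__(self, on_checkout=None):
        self.on_checkout = on_checkout

    def checkout(self):
        conn = sqlite3.connect(":memory:")
        if self.on_checkout is not None:
            # Re-apply connection-level state each time a connection is handed out.
            self.on_checkout(conn)
        return conn


# Without a hook: the setting lives on the first connection only.
plain = FreshConnectionPool()
first = plain.checkout()
set_workspace(first, "ws-1")
first.close()  # roughly what happens when commit() releases the connection
lost = get_workspace(plain.checkout())

# With a checkout hook: every new connection gets the setting again.
hooked = FreshConnectionPool(on_checkout=lambda c: set_workspace(c, "ws-1"))
restored = get_workspace(hooked.checkout())

print(lost, restored)
```

The checkout hook here plays the same role as the 'checkout' event listener in the question: the state is re-applied whenever a connection is handed out, rather than once per request.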

Give an example using GrpcHook and GrpcOperator in Airflow

I am new to Airflow and gRPC.
I run Airflow in Docker with the default settings:
https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
When I try the following, based on this link:
https://airflow.apache.org/docs/apache-airflow-providers-grpc/stable/_api/airflow/providers/grpc/index.html
channel = grpc.insecure_channel('localhost:50051')
number = calculator_pb2.Number(value=25)
con = GrpcHook(grpc_conn_id='grpc_con',
               interceptors=[UnaryUnaryClientInterceptor]
               )
run = GrpcOperator(task_id='square_root',
                   stub_class=calculator_pb2_grpc.CalculatorStub(channel),
                   call_func='SquareRoot',
                   grpc_conn_id='grpc_con',
                   data=number,
                   log_response=True,
                   interceptors=[UnaryUnaryClientInterceptor]
                   )
There is no response in the DAG log, even if the server is shut down or the server port is wrong, but it works if I call the service with a simple client.
What you're looking for, I guess, is a GrpcOperator example.
In your example, the wrong parameter is data.
The data parameter should be data={'request': calculator_pb2.Number(value=25)}, if you don't modify the generated protobuf files.
This is an example:
from airflow.providers.grpc.operators.grpc import GrpcOperator
from some_pb2_grpc import SomeStub
from some_pb2 import SomeRequest
GrpcOperator(task_id="task_id", stub_class=SomeStub, call_func='Function', data={'request': SomeRequest(var='data')})
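The reason data has to be a dict keyed by 'request' can be sketched without Airflow or gRPC. As far as I can tell, the hook instantiates stub_class with a channel and then calls the named method with the entries of data as keyword arguments; treat that as an assumption about the provider's internals rather than a quote of its code. `FakeStub` and `run_call` below are hypothetical stand-ins, and a plain number replaces the protobuf message.

```python
class FakeStub:
    """Hypothetical stand-in for a generated gRPC stub class."""

    def __init__(self, channel):
        self.channel = channel

    def SquareRoot(self, request, timeout=None):
        # Generated stub methods take the message as the `request` argument.
        return request ** 0.5


def run_call(stub_class, call_func, data, channel="fake-channel"):
    """Assumed dispatch: build the stub, look up the method, unpack data."""
    stub = stub_class(channel)
    rpc_func = getattr(stub, call_func)
    return rpc_func(**data)


# Works: stub_class is the class itself and data maps 'request' to the message.
result = run_call(FakeStub, "SquareRoot", {"request": 25})
print(result)

# Fails: passing the bare message instead of a dict cannot be unpacked.
try:
    run_call(FakeStub, "SquareRoot", 25)
except TypeError as exc:
    print("bad data:", exc)

# Fails: passing an already-built stub instance instead of the class.
try:
    run_call(FakeStub("fake-channel"), "SquareRoot", {"request": 25})
except TypeError as exc:
    print("bad stub_class:", exc)
```

Under this assumption, both differences between the question's code and the answer's example matter: stub_class should be the class (SomeStub), not CalculatorStub(channel), and data should be a dict.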

Is there a way to create/modify connections through Airflow API

Going through Admin -> Connections, we have the ability to create/modify a connection's params, but I'm wondering if I can do the same through the API so I can set the connections programmatically.
airflow.models.Connection seems like it only deals with actually connecting to the instance instead of saving it to the list. It seems like a function that should have been implemented, but I'm not sure where I can find the docs for this specific function.
Connection is actually a model which you can use to query and insert a new connection
from airflow import settings
from airflow.models import Connection
conn = Connection(
    conn_id=conn_id,
    conn_type=conn_type,
    host=host,
    login=login,
    password=password,
    port=port
)  # create a connection object
session = settings.Session() # get the session
session.add(conn)
session.commit() # it will insert the connection object programmatically.
You can also add, delete, and list connections from the Airflow CLI if you need to do it outside of Python/Airflow code, via bash, in a Dockerfile, etc.
airflow connections --add ...
Usage:
airflow connections [-h] [-l] [-a] [-d] [--conn_id CONN_ID]
                    [--conn_uri CONN_URI] [--conn_extra CONN_EXTRA]
                    [--conn_type CONN_TYPE] [--conn_host CONN_HOST]
                    [--conn_login CONN_LOGIN] [--conn_password CONN_PASSWORD]
                    [--conn_schema CONN_SCHEMA] [--conn_port CONN_PORT]
https://airflow.apache.org/cli.html#connections
It doesn't look like the CLI currently supports modifying an existing connection, but there is a Jira issue for it with an active open PR on GitHub.
AIRFLOW-2840 - cli option to update existing connection
https://github.com/apache/incubator-airflow/pull/3684
First check whether the connection already exists, then create a new one using Connection from airflow.models:
import logging
from airflow import settings
from airflow.models import Connection
def create_conn(conn_id, conn_type, host, login, pwd, port, desc):
    conn = Connection(conn_id=conn_id,
                      conn_type=conn_type,
                      host=host,
                      login=login,
                      password=pwd,
                      port=port,
                      description=desc)
    session = settings.Session()
    # Look up any existing connection with the same conn_id.
    existing = session.query(Connection).filter(Connection.conn_id == conn.conn_id).first()
    if existing is not None:
        logging.warning(f"Connection {conn.conn_id} already exists")
        return None
    session.add(conn)
    session.commit()
    logging.info(Connection.log_info(conn))
    logging.info(f'Connection {conn_id} is created')
    return conn
You can populate connections from environment variables using the connection URI format.
The environment variable naming convention is AIRFLOW_CONN_<conn_id>, all uppercase.
So if your connection id is my_prod_db then the variable name should be AIRFLOW_CONN_MY_PROD_DB.
In general, Airflow’s URI format is like so:
my-conn-type://my-login:my-password@my-host:5432/my-schema?param1=val1&param2=val2
Note that connections registered in this way do not show up in the Airflow UI.
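A minimal sketch of that naming convention and URI layout, using only the standard library. The helpers `conn_env_name` and `build_conn_uri` are hypothetical names for illustration, not part of Airflow, and the credentials are placeholders. The sketch percent-encodes the login and password, which I believe is needed when they contain characters such as `@` or `/`.

```python
import os
from urllib.parse import quote, unquote, urlsplit


def conn_env_name(conn_id: str) -> str:
    """Return the environment variable name for a connection id."""
    return "AIRFLOW_CONN_" + conn_id.upper()


def build_conn_uri(conn_type, login, password, host, port, schema):
    """Assemble a connection URI, percent-encoding login and password."""
    return (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{schema}"
    )


# Placeholder values; the password deliberately contains '@' and '/'.
uri = build_conn_uri("postgres", "my-login", "p@ss/word", "my-host", 5432, "my-schema")
os.environ[conn_env_name("my_prod_db")] = uri

# Parse it back to check that the pieces survive the round trip.
parts = urlsplit(uri)
print(conn_env_name("my_prod_db"))
print(parts.hostname, parts.port, unquote(parts.password))
```

Setting the variable with os.environ only affects the current process and its children, so for a real deployment the variable would normally be exported in the environment that starts Airflow.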
Using session = settings.Session() assumes the Airflow database backend has been initialized. If you haven't set it up in your development environment, a hybrid method using both the Connection class and environment variables is a workaround.
Below is an example of setting up an S3Hook:
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.models.connection import Connection
import os
import json
aws_default = Connection(
    conn_id="aws_default",
    conn_type="aws",
    login='YOUR-AWS-KEY-ID',
    password='YOUR-AWS-KEY-SECRET',
    extra=json.dumps({'region_name': 'us-east-1'})
)
os.environ["AIRFLOW_CONN_AWS_DEFAULT"] = aws_default.get_uri()
s3_hook = S3Hook(aws_conn_id='aws_default')
s3_hook.list_keys(bucket_name='YOUR-BUCKET', prefix='YOUR-FILENAME')

How to create a pymongo connection per request in Flask

In my Flask application, I hope to use pymongo directly. But I am not sure what's the best way to create pymongo connection for each request and how to reclaim the connection resource.
I know Connection in pymongo is thread-safe and has built-in pooling. I guess I need to create a global Connection instance, and use before_request to put it in flask g.
In the app.py:
from pymongo import Connection
from admin.views import admin
connection = Connection()
db = connection['test']
@app.before_request
def before_request():
    g.db = db
@app.teardown_request
def teardown_request(exception):
    if hasattr(g, 'db'):
        # FIX
        pass
In admin/views.py:
from flask import g
@admin.route('/')
def index():
    # do something with g.db
    ...
It actually works. So questions are:
Is this the best way to use Connection in flask?
Do I need to explicitly reclaim resources in teardown_request and how to do it?
I still think this is an interesting question, but why no response... So here is my update.
For the first question, I think using current_app is clearer in Flask.
In app.py
app = Flask(__name__)
connection = Connection()
db = connection['test']
app.db = db
In the view.py
from flask import current_app
db = current_app.db
# do anything with db
And by using current_app, you can use an application factory to create more than one app, as described at http://flask.pocoo.org/docs/patterns/appfactories/
And for the second question, I'm still figuring it out.
Here's an example of using the flask-pymongo extension.
Put your MongoDB URI (up to the database name) in app.config, like below:
app.config['MONGO_URI'] = 'mongodb://192.168.1.1:27017/your_db_name'
mongo = PyMongo(app, config_prefix='MONGO')
Then, in the API method where you need the database, do the following:
db = mongo.db
Now you can work with this db connection and get your data:
users_count = db.users.count()
I think what you present is ok. Flask is almost too flexible in how you can organize things, not always presenting one obvious and right way. You might make use of the flask-pymongo extension which adds a couple of small conveniences. To my knowledge, you don't have to do anything with the connection on request teardown.

How do I make one instance in Python that I can access from different modules?

I'm writing a web application that connects to a database. I'm currently using a variable in a module that I import from other modules, but this feels nasty.
# server.py
from hexapoda.application import application
if __name__ == '__main__':
    from paste import httpserver
    httpserver.serve(application, host='127.0.0.1', port='1337')
# hexapoda/application.py
from mongoalchemy.session import Session
db = Session.connect('hexapoda')
import hexapoda.tickets.controllers
# hexapoda/tickets/controllers.py
from hexapoda.application import db
def index(request, params):
    tickets = db.query(Ticket)
The problem is that I get multiple connections to the database (I guess that because I import application.py in two different modules, the Session.connect() function gets executed twice).
How can I access db from multiple modules without creating multiple connections (i.e. only call Session.connect() once in the entire application)?
Try the Twisted framework with something like:
from twisted.enterprise import adbapi
class db(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool('MySQLdb',
                                            db='database',
                                            user='username',
                                            passwd='password')
    def query(self, sql):
        self.dbpool.runInteraction(self._query, sql)
    def _query(self, tx, sql):
        tx.execute(sql)
        print(tx.fetchone())
That's probably not what you want to do - a single connection per app means that your app can't scale.
The usual solution is to connect to the database when a request comes in and store that connection in a variable with "request" scope (i.e. it lives as long as the request).
A simple way to achieve that is to put it in the request:
request.db = ...connect...
Your web framework probably offers a way to annotate methods or something like a filter which sees all requests. Put the code to open/close the connection there.
If opening connections is expensive, use connection pooling.
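The request-scoped pattern combined with pooling can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not a production pool: `SimplePool` and `handle_request` are hypothetical names, sqlite3 stands in for the real database driver, and a real web framework would call the checkout from its own request hooks or middleware.

```python
import queue
import sqlite3
from contextlib import contextmanager


class SimplePool:
    """Hypothetical fixed-size pool: connections are opened once and reused."""

    def __init__(self, size, factory):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        # Blocks until a connection is free, then hands it out for one request.
        conn = self._idle.get()
        try:
            yield conn
        finally:
            # Always return the connection, even if the request handler raised.
            self._idle.put(conn)


# Created once at import time; other modules import `pool`, not a connection.
pool = SimplePool(2, lambda: sqlite3.connect(":memory:", check_same_thread=False))


def handle_request(pool):
    """Stand-in for a request handler: the connection lives only this long."""
    with pool.connection() as db:
        return db.execute("SELECT 1 + 1").fetchone()[0]


print(handle_request(pool))
```

Because Python caches imported modules, the module that defines `pool` is executed only once per process, so importing it from several modules shares the same pool object.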
