I would like to connect to a Redshift server using psycopg2 and execute some SQL statements. I am scheduling this on an Airflow server and running it via PythonOperator. I would like the Redshift server console output (STDOUT) to be passed back to the LoggingConnection; however, it only seems to pass the query executed.
The purpose of this is to be able to debug upstream issues and log ETL metadata (in the context of the job). Is this possible?
Instantiating a logger object and initializing the connection with the logger works, but provides no information besides the query executed.
import logging
import sys

import psycopg2 as pg
from psycopg2.extras import LoggingConnection

# LoggingConnection only takes effect when used as the connection factory
conn = pg.connect(<.....>, connection_factory=LoggingConnection)
conn.autocommit = True

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stdout))
conn.initialize(logger)

cur = conn.cursor()
cur.execute("INSERT INTO tableA <select...>")
# Example of what I want to Log:
# [2019-07-15 15:59:51] 2000 rows inserted... in 2 s 318 ms (execution: 2 s 279 ms, fetching: 39 ms)
I want the logger to receive all output from the server console, emulating what one would see when running the statement locally.
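One way to approximate that output is to log it yourself: cursor.rowcount, wall-clock timing, and whatever server messages psycopg2 collects on conn.notices. A minimal sketch (reusing the conn and logger objects from the snippet above; this is a workaround, not a built-in psycopg2 feature):

import time

cur = conn.cursor()
start = time.monotonic()
cur.execute("INSERT INTO tableA <select...>")
elapsed_ms = int((time.monotonic() - start) * 1000)

# rowcount is the number of rows affected by the last statement (-1 if unknown)
logger.info("%s rows inserted... in %d ms", cur.rowcount, elapsed_ms)

# psycopg2 appends server messages (e.g. RAISE NOTICE/INFO) to conn.notices
for notice in conn.notices:
    logger.info(notice.strip())
del conn.notices[:]  # clear, so the next statement logs only its own notices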
I am trying to set up Application Insights (AI) similar to how it is being done here.
I am using Python 3.9.13 and the following packages: opencensus==0.11.0, opencensus-ext-azure==1.1.7, opencensus-context==0.1.3.
My code looks something like this:
import logging
import time
from opencensus.ext.azure.log_exporter import AzureLogHandler
# create the logger
app_insights_logger = logging.getLogger(__name__)
# set the handler
app_insights_logger.addHandler(AzureLogHandler(
    connection_string='InstrumentationKey=00000000-0000-0000-0000-000000000000')
)
# set the logging level
app_insights_logger.setLevel(logging.INFO)
# this prints 'logging level = 20'
print('logging level = ',app_insights_logger.getEffectiveLevel())
# try to log an exception
try:
    result = 1 / 0
except Exception:
    app_insights_logger.exception('Captured a math exception.')

app_insights_logger.handlers[0].flush()
time.sleep(5)
However, the exception does not get logged. I tried adding the explicit flush as mentioned in this post.
Additionally, I tried adding just the instrumentation key as mentioned in the docs; when that didn't work, I tried the entire connection string (the one with the ingestion key).
So:
How can I debug whether my app is indeed sending requests to Azure?
How can I check on the Azure portal whether it is a permission issue?
You can set the severity level before logging any kind of telemetry information to Application Insights.
Note: by default, the root logger is configured with WARNING severity. If you want to log information at other severities, you have to set the level explicitly, e.g. logger.setLevel(logging.INFO).
I am using the below code to log telemetry information in Application Insights:
import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

AI_conn_string = '<Your AI Connection string>'
handler = AzureLogHandler(connection_string=AI_conn_string)
logger = logging.getLogger()
logger.addHandler(handler)
#by default root logger can be configured with warning severity.
logger.warning('python console app warning log in AI ')
# setting severity for information level logging.
logger.setLevel(logging.INFO)
logger.info('Test Information log')
logger.info('python console app information log in AI')
try:
    logger.warning('python console app Try block warning log in AI')
    result = 1 / 0
except Exception:
    logger.setLevel(logging.ERROR)
    logger.exception('python console app error log in AI')
Results: the warning, information, and error logs all show up in Application Insights (portal screenshots omitted).
How can I debug if my app is indeed sending requests to Azure?
We cannot directly debug whether the telemetry data was sent to Application Insights or not, but we can see the resulting telemetry arrive in the portal (screenshot omitted).
How can I check on the Azure portal if it is a permission issue?
The instrumentation key and connection string are what grant permission to access the Application Insights resource.
I am writing a new cloud function and am using the new Google Cloud Logging library as announced at https://cloud.google.com/blog/products/devops-sre/google-cloud-logging-python-client-library-v3-0-0-release.
I am also using functions-framework to debug my code locally before pushing it to GCP. The post "Setup and Invoke Cloud Functions using Python" has been particularly useful here.
The problem I have is that when using these two things together I cannot see logging output in my IDE, I can only see print statements. Here's a sample of my code:
from flask import Request
from google.cloud import bigquery
from datetime import datetime
import google.cloud.logging
import logging
log_client = google.cloud.logging.Client()
log_client.setup_logging()
def main(request) -> str:
    #
    # do stuff to set up a BigQuery job
    #
    bq_client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(labels={"key": "value"})
    nowstr = datetime.now().strftime("%Y%m%d%H%M%S%f")
    job_id = f"qwerty-{nowstr}"
    query_job = bq_client.query(
        query=export_script, job_config=job_config, job_id=job_id
    )
    print("Started job: {}".format(query_job.job_id))
    query_job.result()  # Waits for job to complete.
    logging.info(f"job_id={query_job.job_id}")
    logging.info(f"total_bytes_billed={query_job.total_bytes_billed}")
    return f"{query_job.job_id} {query_job.state} {query_job.error_result}"
However, when I run the function locally the only output I see in my terminal is
Started job: qwerty-20220306181905424093
As you can see, the call to print(...) has written to my terminal but the calls to logging.info(...) have not. Is there a way to redirect logging output to my terminal when running locally using functions-framework, without affecting logging when the function runs as an actual cloud function in GCP?
Thanks to the advice from @cryptofool I figured out that I needed to change the default logging level to get output to appear in the terminal.
from flask import Request
from google.cloud import bigquery
from datetime import datetime
import google.cloud.logging
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def main(request) -> str:
    #
    # do stuff to set up a BigQuery job
    #
    bq_client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(labels={"key": "value"})
    nowstr = datetime.now().strftime("%Y%m%d%H%M%S%f")
    job_id = f"qwerty-{nowstr}"
    query_job = bq_client.query(
        query=export_script, job_config=job_config, job_id=job_id
    )
    print("Started job: {}".format(query_job.job_id))
    query_job.result()  # Waits for job to complete.
    logging.info(f"job_id={query_job.job_id}")
    logging.info(f"total_bytes_billed={query_job.total_bytes_billed}")
    return f"{query_job.job_id} {query_job.state} {query_job.error_result}"
Started job: qwerty-20220306211233889260
INFO:root:job_id=qwerty-20220306211233889260
INFO:root:total_bytes_billed=31457280
However, I still can't see any output in the terminal when using google.cloud.logging:
from flask import Request
from google.cloud import bigquery
from datetime import datetime
import google.cloud.logging
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
log_client = google.cloud.logging.Client()
log_client.setup_logging()
def main(request) -> str:
    #
    # do stuff to set up a BigQuery job
    #
    bq_client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(labels={"key": "value"})
    nowstr = datetime.now().strftime("%Y%m%d%H%M%S%f")
    job_id = f"qwerty-{nowstr}"
    query_job = bq_client.query(
        query=export_script, job_config=job_config, job_id=job_id
    )
    print("Started job: {}".format(query_job.job_id))
    query_job.result()  # Waits for job to complete.
    logging.info(f"job_id={query_job.job_id}")
    logging.info(f"total_bytes_billed={query_job.total_bytes_billed}")
    return f"{query_job.job_id} {query_job.state} {query_job.error_result}"
Started job: qwerty-20220306211718088936
I think I'll start another thread about this.
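One workaround sketch (not from the thread, and the environment check is an assumption): attach the Cloud Logging handler only when running in GCP, detected via the K_SERVICE environment variable that the Cloud Functions/Cloud Run runtimes set, and fall back to a plain stream handler locally.

import logging
import os

import google.cloud.logging


def configure_logging() -> None:
    if os.getenv("K_SERVICE"):
        # Running in GCP: route stdlib logging records to Cloud Logging.
        client = google.cloud.logging.Client()
        client.setup_logging(log_level=logging.INFO)
    else:
        # Running locally under functions-framework: log to the terminal.
        logging.basicConfig(level=logging.INFO)


configure_logging()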
Is there a way to set connections/variables programmatically in Airflow? I am aware this defeats the very purpose of not exposing these details in the code, but to debug it would really help me big time if I could do something like the following pseudo code:
# pseudo code
from airflow import connections
connections.add({'name': '...',
                 'user': '...'})
Connection is a DB entity and you can create it. See below:
from airflow import settings
from airflow.models import Connection

conn = Connection(
    conn_id=conn_id,
    conn_type=conn_type,
    host=host,
    login=login,
    password=password,
    port=port
)

session = settings.Session()
session.add(conn)
session.commit()
As for variables, just use the API. See the example below:
from airflow.models import Variable
Variable.set("my_key", "my_value")
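A value set this way can be read back with Variable.get; passing default_var avoids a KeyError when the key is missing:

my_value = Variable.get("my_key", default_var=None)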
A good blog post on this topic can be found here.
I'm having trouble connecting to the Airflow's metadata database from Python.
The connection is set, and I can query the metadata database using the UI's Ad Hoc Query window.
If I try to use the same connection from Python, it doesn't work.
So, in detail, I have two connections set up, both working from the UI:
Connection 1:
Conn type: MYSQL
Host: airflow-sqlproxy-service
Schema: composer-1-6-1-airflow-1-10-0-57315b5a
Login: root
Connection 2:
Conn type: MYSQL
Host: 127.0.0.1
Schema: composer-1-6-1-airflow-1-10-0-57315b5a
Login: root
As I said, both of them work from the UI (Data Profiling -> Ad Hoc Query).
But whenever I create a DAG and try to trigger it from a PythonOperator using various hooks, I always get the same error message.
Sample code 1:
from airflow import DAG
from datetime import datetime, timedelta
import logging
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.hooks.mysql_hook import MySqlHook

class ReturningMySqlOperator(MySqlOperator):
    def execute(self, context):
        self.log.info('Executing: %s', self.sql)
        hook = MySqlHook(mysql_conn_id=self.mysql_conn_id,
                         schema=self.database)
        return hook.get_records(
            self.sql,
            parameters=self.parameters)

with DAG(
    "a_JSON_db",
    start_date=datetime(2020, 11, 19),
    max_active_runs=1,
    schedule_interval=None,
    # catchup=False  # enable if you don't want historical dag runs to run
) as dag:
    t1 = ReturningMySqlOperator(
        task_id='basic_mysql',
        mysql_conn_id='airflow_db_local',
        sql="select * from xcom")

    def get_records(**kwargs):
        ti = kwargs['ti']
        xcom = ti.xcom_pull(task_ids='basic_mysql')
        string_to_print = 'Value in xcom is: {}'.format(xcom)
        # Get data in your logs
        logging.info(string_to_print)

    t2 = PythonOperator(
        task_id='records',
        provide_context=True,
        python_callable=get_records)
Sample code 2:
def get_dag_ids(**kwargs):
    mysql_hook = MySqlOperator(task_id='query_table_mysql', mysql_conn_id="airflow_db",
                               sql="SELECT MAX(execution_date) FROM task_instance WHERE dag_id = 'Test_Dag'")
    MySql_Hook = MySqlHook(mysql_conn_id="airflow_db_local")
    records = MySql_Hook.get_records(sql="SELECT MAX(execution_date) FROM task_instance")
    print(records)

t1 = PythonOperator(
    task_id="get_dag_nums",
    python_callable=get_dag_ids,
    provide_context=True)
The error message is this:
ERROR - (2003, "Can't connect to MySQL server on '127.0.0.1' (111)")
I looked up the config, and I found this environment variable:
core sql_alchemy_conn mysql+mysqldb://root:@127.0.0.1/composer-1-6-1-airflow-1-10-0-57315b5a env var
I tried to use a Postgres connection with this URI as well; same error message (as above).
I'm starting to think GCP Airflow's IAP is blocking me from having access from a Python DAG.
My Airflow composer version is the following:
composer-1.6.1-airflow-1.10.0
Can anyone help me?
Only the service account associated with Composer has read access to the tenant project for the metadata database. This includes connections from the underlying Kubernetes (Airflow) workers to the tenant project hosting Airflow's Cloud SQL instance.
The accepted connection methods are SQLAlchemy and using the Kubernetes cluster as a proxy. Connections from GKE should use the airflow-sqlproxy-service.default service discovery name.
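So, from a task, a connection whose host field points at the proxy should work; a minimal sketch (the connection id airflow_db_proxy is hypothetical, assumed to be configured like Connection 1 above):

from airflow.hooks.mysql_hook import MySqlHook

# airflow_db_proxy: a connection whose host is airflow-sqlproxy-service, not 127.0.0.1
hook = MySqlHook(mysql_conn_id="airflow_db_proxy")
print(hook.get_records("SELECT MAX(execution_date) FROM task_instance"))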
You also have a CLI option through GKE. Use the following command to run a temporary deployment and pod using the mysql:latest image, which comes preinstalled with the mysql CLI tool:
$ kubectl run mysql-cli-tmp-deployment --generator=run-pod/v1 --rm --stdin --tty --image mysql:latest -- bash
Once in the shell, we can use the mysql tool to open an interactive session with the Airflow database:
$ mysql --user root --host airflow-sqlproxy-service --database airflow-db
Once the session is open, standard queries can be executed against the database.
Going through Admin -> Connections, we have the ability to create/modify a connection's params, but I'm wondering if I can do the same through an API so I can set the connections programmatically.
airflow.models.Connection seems like it only deals with actually connecting to the instance rather than saving a connection to the list. It seems like a function that should have been implemented, but I'm not sure where to find the docs for this specific functionality.
Connection is actually a model which you can use to query and to insert a new connection:
from airflow import settings
from airflow.models import Connection
conn = Connection(
    conn_id=conn_id,
    conn_type=conn_type,
    host=host,
    login=login,
    password=password,
    port=port
)  # create a connection object

session = settings.Session()  # get the session
session.add(conn)
session.commit()  # insert the connection object programmatically
You can also add, delete, and list connections from the Airflow CLI if you need to do it outside of Python/Airflow code, via bash, in a Dockerfile, etc.
airflow connections --add ...
Usage:
airflow connections [-h] [-l] [-a] [-d] [--conn_id CONN_ID]
[--conn_uri CONN_URI] [--conn_extra CONN_EXTRA]
[--conn_type CONN_TYPE] [--conn_host CONN_HOST]
[--conn_login CONN_LOGIN] [--conn_password CONN_PASSWORD]
[--conn_schema CONN_SCHEMA] [--conn_port CONN_PORT]
https://airflow.apache.org/cli.html#connections
It doesn't look like the CLI currently supports modifying an existing connection, but there is a Jira issue for it with an active open PR on GitHub.
AIRFLOW-2840 - cli option to update existing connection
https://github.com/apache/incubator-airflow/pull/3684
First check whether the connection exists; afterwards, create the new Connection using from airflow.models import Connection:
import logging
from airflow import settings
from airflow.models import Connection
def create_conn(conn_id, conn_type, host, login, pwd, port, desc):
    conn = Connection(conn_id=conn_id,
                      conn_type=conn_type,
                      host=host,
                      login=login,
                      password=pwd,
                      port=port,
                      description=desc)
    session = settings.Session()
    conn_name = session.query(Connection).filter(Connection.conn_id == conn.conn_id).first()
    if str(conn_name) == str(conn.conn_id):
        logging.warning(f"Connection {conn.conn_id} already exists")
        return None
    session.add(conn)
    session.commit()
    logging.info(Connection.log_info(conn))
    logging.info(f'Connection {conn_id} is created')
    return conn
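Example usage (all values are illustrative):

create_conn(conn_id="my_postgres",
            conn_type="postgres",
            host="db.example.com",
            login="etl_user",
            pwd="s3cr3t",
            port=5432,
            desc="created programmatically")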
You can populate connections using environment variables using the connection URI format.
The environment variable naming convention is AIRFLOW_CONN_<conn_id>, all uppercase.
So if your connection id is my_prod_db then the variable name should be AIRFLOW_CONN_MY_PROD_DB.
In general, Airflow’s URI format is like so:
my-conn-type://my-login:my-password@my-host:5432/my-schema?param1=val1&param2=val2
Note that connections registered in this way do not show up in the Airflow UI.
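For instance, a minimal sketch (the connection id and credentials are illustrative):

import os

# conn id "my_prod_db" -> env var AIRFLOW_CONN_MY_PROD_DB; the value is the URI
os.environ["AIRFLOW_CONN_MY_PROD_DB"] = (
    "my-conn-type://my-login:my-password@my-host:5432/my-schema"
)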
Using session = settings.Session() assumes the Airflow database backend has already been initialized. For those who haven't set it up in their development environment, a hybrid method using both the Connection class and environment variables is a workaround.
Below is an example of setting up an S3Hook:
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.models.connection import Connection
import os
import json
aws_default = Connection(
    conn_id="aws_default",
    conn_type="aws",
    login='YOUR-AWS-KEY-ID',
    password='YOUR-AWS-KEY-SECRET',
    extra=json.dumps({'region_name': 'us-east-1'})
)
os.environ["AIRFLOW_CONN_AWS_DEFAULT"] = aws_default.get_uri()
s3_hook = S3Hook(aws_conn_id='aws_default')
s3_hook.list_keys(bucket_name='YOUR-BUCKET', prefix='YOUR-FILENAME')