After pushing my DAG I get this error
I am new to data engineering. I tried to solve this error in different ways, as best I could, but nothing worked. I want to write a DAG that consists of two tasks: the first exports data from a database table on one server as CSV files, and the second imports those CSV files into database tables on another server. The Airflow variable contains the DAG configuration and the SQL scripts for exporting and importing the data.
Please tell me how can I solve this error?
I have this exporting code:
def export_csv():
    # Imports are kept inside the callable to keep DAG parsing light (see P.S. below)
    import json
    import pandas as pd
    from airflow.models import Variable
    from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook  # MSSQL provider hook

    instruction_data = json.loads(Variable.get('MAIN_SOURCE_DAMDI_INSTRUCTIONS'))
    requirement_data = instruction_data['requirements']
    lst = requirement_data['scripts']
    ms_hook = MsSqlHook(mssql_conn_id='OKTELL')
    connection = ms_hook.get_conn()
    cursor = connection.cursor()
    for i in lst:
        # execute() does not return rows; fetch them before building the DataFrame
        cursor.execute(i['export_script'])
        df = pd.DataFrame(cursor.fetchall())
        df.to_csv(i['filename'], index=False, header=None, sep=',', encoding='utf-8')
    cursor.close()
And this is my task for exporting:
export_csv_func = PythonOperator(
    task_id='export_csv_func',
    python_callable=export_csv,
    mssql_conn_id='OKTELL'
)
P.S. I import the libraries and Airflow variables inside the function because there was previously a lot of load on the server, and this approach helped reduce it.
When using the PythonOperator, you pass arguments to the callable via op_args and/or op_kwargs. In this case, if you want to pass the mssql_conn_id argument, you can try:
export_csv_func = PythonOperator(
task_id='export_csv_func',
python_callable=export_csv,
op_kwargs={'mssql_conn_id': 'OKTELL'},
)
Then you need to update the export_csv() function signature to accept this kwarg too.
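For example, a minimal sketch of the updated signature (the provider import path for MsSqlHook is assumed here; the rest of the function body stays as in the question):

def export_csv(mssql_conn_id):
    import json
    import pandas as pd
    from airflow.models import Variable
    from airflow.providers.microsoft.mssql.hooks.mssql import MsSqlHook

    ms_hook = MsSqlHook(mssql_conn_id=mssql_conn_id)
    # ... same export logic as in the question ...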
Related
I'm trying to create an Airflow pipeline that downloads data from an API, processes it, saves it as a CSV and then loads the data to a Postgres database (all within a docker container).
The code looks something like this
from datetime import datetime, timedelta
import pandas as pd
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.python import PythonOperator
default_args = {
"owner": "airflow",
"retries": 5,
"retry_delay": timedelta(minutes=1),
"email": ['airflow#domain.com'],
"email_on_failure": True,
"email_on_retry": False
}
import requests

def get_data():
    # Pull the data from the API and persist it as a CSV on the worker's filesystem
    response = requests.get("some_url")
    request_data = response.json()
    all_data = pd.DataFrame.from_dict(request_data["data"])
    all_data.to_csv("/opt/airflow/data/all_data.csv", index=False)
with DAG(
dag_id="my_dag",
default_args=default_args,
start_date=datetime(2022,1,24),
catchup=False,
schedule_interval=timedelta(minutes=5)
) as dag:
create_table = PostgresOperator(
task_id="create_table",
postgres_conn_id="postgres_localhost",
sql="""
create table if not exists my_table(
created_at timestamp,
col1 double precision,
col2 smallint,
primary key (created_at, col1)
)
"""
)
get_data = PythonOperator(
task_id="get_data",
python_callable=get_data
)
load_data = PostgresOperator(
task_id = "load_data",
postgres_conn_id="postgres_localhost",
sql="""
copy my_table
from '/opt/airflow/data/all_data.csv'
delimiter ',' csv;
"""
)
create_table >> get_data >> load_data
The problem is that when I try to run the DAG I get an error in the load_data task saying psycopg2.errors.UndefinedFile: could not open file "/opt/***/data/all_data.csv" for reading: No such file or directory HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.
I don't know why the word airflow is getting replaced in the path or how to save it properly so that the CSV file can be copied into postgres.
This error occurs because the Postgres server is a separate container within Docker, so the CSV written by the Airflow worker does not exist on the database server's filesystem (which is what COPY FROM reads). You could try one of the following ways around this:
Copy the file between servers by using scp to place the data file onto the Postgres server.
Copy the file between servers with the SFTPOperator (which requires an SSH hook instantiation), then run the COPY statement.
Connect to the Postgres database via the BashOperator and run psql's client-side \copy command (rough sketch below).
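A rough, hedged sketch of that last option; the connection string, credentials, and host are placeholders, not taken from the original post:

from airflow.operators.bash import BashOperator

load_data = BashOperator(
    task_id="load_data",
    # psql's \copy reads the file on the client side (the Airflow worker),
    # so the CSV only needs to exist where this task runs.
    bash_command=(
        "psql postgresql://airflow:airflow@postgres:5432/airflow "
        "-c \"\\copy my_table from '/opt/airflow/data/all_data.csv' delimiter ',' csv\""
    ),
)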
If anyone has a more elegant solution, please answer. I have this same problem right now and am working on it. Once I have it, I'll post it back here.
I am using the sample program from the Snowflake documentation on ingesting data into the destination table with Python.
So basically, I have to execute the PUT command to load the data into the internal stage and then run the Python program to notify the Snowpipe to ingest the data into the table.
This is how I create the internal stage and pipe:
create or replace stage exampledb.dbschema.example_stage;
create or replace pipe exampledb.dbschema.example_pipe
as copy into exampledb.dbschema.example_table
from
(
select
t.*
from
@exampledb.dbschema.example_stage t
)
file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE;
put command:
put file://E:\\example\\data\\a.csv @exampledb.dbschema.example_stage OVERWRITE = TRUE;
This is the sample program I use:
from logging import getLogger
from snowflake.ingest import SimpleIngestManager
from snowflake.ingest import StagedFile
from snowflake.ingest.utils.uris import DEFAULT_SCHEME
from datetime import timedelta
from requests import HTTPError
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.serialization import load_pem_private_key
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives.serialization import Encoding
from cryptography.hazmat.primitives.serialization import PrivateFormat
from cryptography.hazmat.primitives.serialization import NoEncryption
import time
import datetime
import os
import logging
logging.basicConfig(
filename='/tmp/ingest.log',
level=logging.DEBUG)
logger = getLogger(__name__)
# If you generated an encrypted private key, implement this method to return
# the passphrase for decrypting your private key.
def get_private_key_passphrase():
return '<private_key_passphrase>'
with open("E:\\ssh\\rsa_key.p8", 'rb') as pem_in:
pemlines = pem_in.read()
private_key_obj = load_pem_private_key(pemlines,
get_private_key_passphrase().encode(),
default_backend())
private_key_text = private_key_obj.private_bytes(
Encoding.PEM, PrivateFormat.PKCS8, NoEncryption()).decode('utf-8')
# Assume the public key has been registered in Snowflake:
# private key in PEM format
# List of files in the stage specified in the pipe definition
file_list=['a.csv.gz']
ingest_manager = SimpleIngestManager(account='<account_identifier>',
host='<account_identifier>.snowflakecomputing.com',
user='<user_login_name>',
pipe='exampledb.dbschema.example_pipe',
private_key=private_key_text)
# List of files, but wrapped into a class
staged_file_list = []
for file_name in file_list:
staged_file_list.append(StagedFile(file_name, None))
try:
resp = ingest_manager.ingest_files(staged_file_list)
except HTTPError as e:
# HTTP error, may need to retry
logger.error(e)
exit(1)
# This means Snowflake has received file and will start loading
assert(resp['responseCode'] == 'SUCCESS')
# Needs to wait for a while to get result in history
while True:
history_resp = ingest_manager.get_history()
if len(history_resp['files']) > 0:
print('Ingest Report:\n')
print(history_resp)
break
else:
# wait for 20 seconds
time.sleep(20)
hour = timedelta(hours=1)
date = datetime.datetime.utcnow() - hour
history_range_resp = ingest_manager.get_history_range(date.isoformat() + 'Z')
print('\nHistory scan report: \n')
print(history_range_resp)
After running the program, I just need to remove the file in the internal stage:
REMOVE @exampledb.dbschema.example_stage;
The code works as expected the first time, but when I truncate the data in that table and run the code again, the table in Snowflake doesn't have any data in it.
Am I missing something here? How can I make this code run multiple times?
Update:
I found that if I use a file with a different name each time I run it, the data loads into the Snowflake table.
So how can I run this code without changing the data filename?
Snowflake uses file loading metadata to prevent reloading the same files (and duplicating data) in a table. Snowpipe prevents loading files with the same name even if they were later modified (i.e. have a different eTag).
The file loading metadata is associated with the pipe object rather than the table. As a result:
Staged files with the same name as files that were already loaded are ignored, even if they have been modified, e.g. if new rows were added or errors in the file were corrected.
Truncating the table using the TRUNCATE TABLE command does not delete the Snowpipe file loading metadata.
However, note that pipes only maintain the load history metadata for 14 days. Therefore:
Files modified and staged again within 14 days:
Snowpipe ignores modified files that are staged again. To reload modified data files, it is currently necessary to recreate the pipe object using the CREATE OR REPLACE PIPE syntax.
Files modified and staged again after 14 days:
Snowpipe loads the data again, potentially resulting in duplicate records in the target table.
For more information, have a look here.
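As a hedged sketch of that remedy, assuming the same pipe definition as in the question and placeholder credentials (key-pair auth would work just as well): recreate the pipe before staging and ingesting the same-named file again.

import snowflake.connector

conn = snowflake.connector.connect(
    account='<account_identifier>',
    user='<user_login_name>',
    password='<password>',
)
# Recreating the pipe resets its load-history metadata, so a file with the
# same name that is staged again will be ingested.
conn.cursor().execute("""
    create or replace pipe exampledb.dbschema.example_pipe
    as copy into exampledb.dbschema.example_table
    from (select t.* from @exampledb.dbschema.example_stage t)
    file_format = (TYPE = CSV) ON_ERROR = SKIP_FILE
""")
conn.close()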
My goal is to loop over a list that I get from another PythonOperator and, within this loop, save the JSON to the Postgres DB. I am using the TaskFlow API from Airflow 2.0.
The code works well if I write the SQL statement directly into the sql parameter of the PostgresOperator. But when I write the SQL to a file and put the file path into the sql parameter, this error is thrown:
psycopg2.errors.SyntaxError: syntax error at or near "sql"
LINE 1: sql/insert_deal_into_deals_table.sql
This is the code of the task:
@task()
def write_all_deals_to_db(all_deals):
    for deal in all_deals:
        deal_json = json.dumps(deal)
        pg = PostgresOperator(
            task_id='insert_deal',
            postgres_conn_id='my_db',
            sql='sql/insert_deal_into_deals_table.sql',
            params={'deal_json': deal_json}
        )
        pg.execute(dict())
The weird thing is that the code works if I use it as a standalone operator (outside of a PythonOperator), like this:
create_deals_table = PostgresOperator(
task_id='create_deals_table',
postgres_conn_id='my_db',
sql='sql/create_deals_table.sql'
)
I have tried a lot of things, and I guess it has to do with the Jinja templating. Somehow, within a PythonOperator, the PostgresOperator can make use of neither the params nor the .sql file parsing.
Any tip or reference is greatly appreciated!
EDIT:
This code works, but it is rather a quick fix. The actual problem I am still having is that Jinja templating is not working for the PostgresOperator when I use it inside a PythonOperator.
@task()
def write_all_deals_to_db(all_deals):
    sql_path = 'sql/insert_deal_into_deals_table.sql'
    for deal in all_deals:
        deal_json = _transform_json(deal)
        sql_query = open(path.join(ROOT_DIRECTORY, sql_path)).read()
        sql_query = sql_query.format(deal_json)
        pg = PostgresOperator(
            task_id='insert_deal',
            postgres_conn_id='my_db',
            sql=sql_query
        )
        pg.execute(dict())
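A hedged alternative sketch that avoids instantiating an operator inside the task entirely is to call PostgresHook directly; it assumes the .sql file uses a plain %s placeholder instead of a Jinja expression and reuses the question's my_db connection and module-level names (task, json, path, ROOT_DIRECTORY):

from airflow.providers.postgres.hooks.postgres import PostgresHook

@task()
def write_all_deals_to_db(all_deals):
    hook = PostgresHook(postgres_conn_id='my_db')
    # Read the SQL once; no operator templating is involved, the hook just
    # hands the parameters to the database driver.
    with open(path.join(ROOT_DIRECTORY, 'sql/insert_deal_into_deals_table.sql')) as f:
        sql_query = f.read()
    for deal in all_deals:
        hook.run(sql_query, parameters=(json.dumps(deal),))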
This is probably an easy fix, but I cannot get this code to run. I have been using AWS Secrets Manager with no issues in PyCharm 2020.2.3. The problems with AWS Wrangler, however, are listed below:
Read in Dataframe
test_df = pd.read_csv(source, encoding='latin-1')
Check df data types
data_types_df = test_df.dtypes
print('Data type of each column of Dataframe:')
print(data_types_df)
Convert columns to correct data types
test_df['C'] = pd.to_datetime(test_df['C'])
test_df['E'] = pd.to_datetime(test_df['E'])
Check df data types
df_new = test_df.dtypes
print('Data type of each column of Dataframe:')
print(df_new)
I have tried both snippets below and I get the same error:
engine = wr.catalog.get_engine("aws-data-wrangler-redshift", region_name=region_name)
engine = wr.catalog.get_engine('redshift+psycopg2://' + Username + ":" + Password + ClusterURL)
Error:
botocore.exceptions.NoRegionError: You must specify a region.
Then I was going to try to write the Pandas DataFrame to a custom table in Redshift using one of the two methods below:
path = f"s3://{bucket}/stage/"
iam_role = 'ARN'
Copy df to redshift custom table
wr.db.copy_to_redshift(
df=df_new,
path=path,
con=engine,
schema="custom",
table="test_df",
mode="overwrite",
iam_role=iam_role,
primary_keys=["c"]
)
Pandas df to redshift
wr.pandas.to_redshift(
dataframe=df_new,
path=path,
schema="custom",
table="test_df",
connection=con,
iam_role="YOUR_ROLE_ARN",
mode="overwrite",
preserve_index=False
)
Any help would be much appreciated :)
Data Wrangler uses Boto3 under the hood, and Boto3 will look for the AWS_DEFAULT_REGION environment variable. So you have two options:
Set this in your ~/.aws/config file:
[default]
region=us-east-1
Or set this as an environment variable on your machine:
export AWS_DEFAULT_REGION=us-east-1
More specifically, you can also set environment variables in PyCharm.
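For example, a minimal sketch of the environment-variable option done from inside the script itself (us-east-1 is just an example region; the get_engine call is the one from the question):

import os

# Must be set before awswrangler/boto3 make any AWS calls; equivalent to the
# export above, but scoped to this Python process.
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

import awswrangler as wr

engine = wr.catalog.get_engine("aws-data-wrangler-redshift")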
I have saved a connection of type "google_cloud_platform" in Airflow, as described here: https://cloud.google.com/composer/docs/how-to/managing/connections
Now, in my DAG, I need to extract the Keyfile JSON from the saved connection.
What is the correct hook to use?
Use airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook to get the stored connection. For example
from airflow.contrib.hooks.gcp_api_base_hook import GoogleCloudBaseHook
gcp_hook = GoogleCloudBaseHook(gcp_conn_id="<your-conn-id>")
keyfile_dict = gcp_hook._get_field('keyfile_dict')
You can just use BaseHook as follows:
from airflow.hooks.base_hook import BaseHook
GCP_CONNECTION_ID="my-gcp-connection"
BaseHook.get_connection(GCP_CONNECTION_ID).extra_dejson["extra__google_cloud_platform__keyfile_dict"]
The other solutions no longer work. Here's a way that's working in 2023:
from airflow.models import Connection
conn = Connection.get_connection_from_secrets(
conn_id='my-gcp-connection'
)
json_key = conn.extra_dejson['keyfile_dict']
with open('gcp_svc_acc.json', 'w') as f:
f.write(json_key)
Mostly because the imports moved around I think.
The question specifically refers to the keyfile JSON, but this is a quick addendum for those who configured a keyfile path instead: take care to check whether it is keyfile_dict or keyfile_path that the Airflow admin configured, as they are two different ways to set up the connection.
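As a rough, hedged sketch of checking for both (the exact extras key names vary by Airflow version, so treat them as assumptions and inspect conn.extra_dejson on your install):

from airflow.models import Connection

conn = Connection.get_connection_from_secrets(conn_id='my-gcp-connection')
extras = conn.extra_dejson

# Older installs prefix the fields with 'extra__google_cloud_platform__',
# newer ones use the short form; both are checked here as an assumption.
keyfile_dict = extras.get('keyfile_dict') or extras.get('extra__google_cloud_platform__keyfile_dict')
keyfile_path = extras.get('key_path') or extras.get('extra__google_cloud_platform__key_path')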