How to add content to the email sent by Airflow on success - Python

I want to send an email with information about the output of my Airflow DAG on success. My two approaches were: first, to execute a function from the main DAG (which I couldn't do, as it seems the DAG is unable to access the outputs of what it runs, which I find logical), and second, more promising, to configure it to send the log of the DAG run. I created it following this idea in Airflow, but while it sends emails, they're blank. I tried finding the log and creating a list to pass as the message, as follows:
def task_success_callback(context):
    outer_task_success_callback(context, email='a1u1k6u3f0v1t0r8#justeat.slack.com')

def outer_task_success_callback(context, email):
    lines = []
    for file in glob.glob("AIRFLOW_HOME/*.log"):
        with open(file) as f:
            lines = [line for line in f.readlines()]
            print(lines)
    mensaje = lines
    subject = "[Airflow] DAG {0} - Task {1}: Success".format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1]
    )
    html_content = """
    DAG: {0}<br>
    Task: {1}<br>
    Log: {2}<br>
    """.format(
        context['task_instance_key_str'].split('__')[0],
        context['task_instance_key_str'].split('__')[1],
        mensaje
    )
It didn't seem to produce anything, and it didn't even return an error. Even weirder, the Airflow log no longer mentions the "email sent" event, which it used to do.

First: pass messages
To send mail with DAG run information, you need two core Airflow features to support you: XCom and the PythonOperator's provide_context argument.
XCom is used to exchange information among tasks; messages can be pushed to and pulled from XCom.
This is a subtle but very important point: in general, if two operators need to share information, like a filename or small amount of data, you should consider combining them into a single operator. If it absolutely can’t be avoided, Airflow does have a feature for operator cross-communication called XCom that is described in the section XComs.
https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom
provide_context (bool) – if set to true, Airflow will pass a set of keyword arguments that can be used in your function. This set of kwargs correspond exactly to what you can use in your jinja templates. For this to work, you need to define **kwargs in your function header.
https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html?highlight=provide_context
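To make that concrete, here is a minimal sketch of how the two pieces fit together (Airflow 1.10-style imports; the task ids, the recipient address, and the dag object are placeholders for this example): a PythonOperator with provide_context=True pushes the run output to XCom, and a templated EmailOperator pulls it into the message body.
from airflow.operators.python_operator import PythonOperator
from airflow.operators.email_operator import EmailOperator

def produce_output(**kwargs):
    # provide_context=True makes the task instance (ti) available here
    result = "rows processed: 42"  # whatever your task produces
    kwargs['ti'].xcom_push(key='run_summary', value=result)

push_task = PythonOperator(
    task_id='produce_output',
    python_callable=produce_output,
    provide_context=True,
    dag=dag,
)

email_task = EmailOperator(
    task_id='send_summary',
    to='someone@example.com',
    subject='[Airflow] DAG {{ dag.dag_id }}: run summary',
    # html_content is templated, so the value pushed above can be pulled here
    html_content="Summary: {{ ti.xcom_pull(task_ids='produce_output', key='run_summary') }}",
    dag=dag,
)

push_task >> email_task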
Second: run the email task after the upstream task succeeds.
There are many ways to do this: one works at the DAG level, another at the task level. Double-check which one fits your logic; a sketch of both is below.
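As a rough sketch of those two options (again assuming Airflow 1.10; the callback, task ids, and address are made up): you can either attach an on_success_callback to a task (or to every task via the DAG's default_args), or simply make the email task a downstream dependency, since the default trigger_rule is all_success.
from airflow.operators.python_operator import PythonOperator
from airflow.utils.email import send_email

def notify_success(context):
    # runs only when the task instance finishes successfully
    send_email(
        to='someone@example.com',
        subject='[Airflow] {}: Success'.format(context['task_instance_key_str']),
        html_content='The task finished successfully.',
    )

# Option 1 - task level: attach the callback to the operator
# (put it in the DAG's default_args to apply it to every task)
my_task = PythonOperator(
    task_id='my_task',
    python_callable=lambda: None,
    on_success_callback=notify_success,
    dag=dag,
)

# Option 2 - dependency level: email_task (e.g. the EmailOperator from the
# previous sketch) only runs once my_task succeeds, because the default
# trigger_rule is 'all_success'
# my_task >> email_task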

Related

Generate dynamic tasks using hooks without running them in the backend

I have a simple DAG that takes arguments (like the SQL and the subject) from a MySQL DB. Then I have a function that creates the report and sends it to a particular email address.
Here is a code snippet.
def s_report(k, **kwargs):
    body_sql = list2[k][4]
    request1 = "({})".format(body_sql)
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    df1 = dwh_hook.get_pandas_df(request1)
    df2 = df1.to_html()
    body_Text = list2[k][3]
    html_content = f"""HI Team, Please find report<br><br>
    {df2} <br> </br>
    <b>Thank you!</b><br>
    """
    return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
                         subject=f"{list2[k][2]}", html_content=html_content, dag=dag)

for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator
The s_report tasks are generated dynamically, but the real problem is that the hook keeps submitting queries to the backend: even while the DAG is stopped, it is still submitting queries.
I can use PythonOperator, but then it does not generate dynamic tasks.
A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator
we can determine that if your first task, mysql_list, succeeds, then the tasks downstream of it, the s_report calls, should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request1) call, so I believe your DAG should be making len(list) calls of this type provided the mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here. I'd need more details. Keep in mind that the call get_pandas_df might have a retry mechanism (i.e. if cannot reach snowflake, retry) which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully, (i.e. end_operator tasks finishes successfully), you are correct. There should be no requests in your Snowflake logs that came post-DAG end.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [ s_report(0)] >> end_operator
and see the behaviour in the logs.
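One hedged aside that may explain queries arriving while the DAG is paused (this is an assumption about your setup, not something visible from the logs): because s_report(j) is called at module level, the dwh_hook.get_pandas_df() call runs every time the scheduler parses the DAG file, not only when a task executes. A minimal sketch of deferring the query to execution time with a PythonOperator (the task ids and the run_report helper are illustrative) would look like this:
from airflow.operators.python_operator import PythonOperator

def run_report(k, **kwargs):
    # the hook is created and queried only when the task actually runs,
    # not when the scheduler parses the DAG file
    body_sql = list2[k][4]
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    df = dwh_hook.get_pandas_df("({})".format(body_sql))
    # build and send the email here, e.g. with airflow.utils.email.send_email

report_tasks = [
    PythonOperator(
        task_id="run_report_{}".format(j),
        python_callable=run_report,
        op_kwargs={"k": j},
        dag=dag,
    )
    for j in range(len(list2))
]

mysql_list >> report_tasks >> end_operator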

boto3: how to get the log stream from a SageMaker transform job?

I am able to create the job (and it fails) using boto3:
import boto3
session = boto3.session.Session()
client = session.client('sagemaker')
descibe = client.describe_transform_job(TransformJobName="my_transform_job_name")
In the UI I can see the button to go to the logs, and I can retrieve the logs with boto3 if I hardcode the group name and the log stream.
But how can I get the log stream from the batch transform job? Shouldn't there be a field with the log stream, or something like that, in describe_transform_job?
SageMaker doesn't provide a direct way to do it; the way to do it is to also use the CloudWatch Logs client.
First, get the log streams corresponding to your batch transform job:
client_logs = boto3.client('logs')
log_groups = client_logs.describe_log_streams(
    logGroupName="the_log_group_name",  # for transform jobs this is typically "/aws/sagemaker/TransformJobs"
    logStreamNamePrefix=transform_job_name,
)
log_streams_names = []
for i in log_groups["logStreams"]:
    log_streams_names.append(i["logStreamName"])
This will give a list of "project_name/virtualMachine_id" entries, i.e. the machines your code ran on, depending on how many instances you set.
After that, you can run the following for each of the log streams:
for i_stream_name in log_streams_names:
    client_logs.get_log_events(logGroupName="the_log_group_name", logStreamName=i_stream_name)
Now you can loop over the returned events and print the lines of each log stream =)
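Put together, a small sketch of that last step (the log group name is a placeholder, as above): get_log_events returns a dict whose events list carries each log line in its message field.
import boto3

client_logs = boto3.client('logs')

for i_stream_name in log_streams_names:
    response = client_logs.get_log_events(
        logGroupName="the_log_group_name",  # placeholder, as above
        logStreamName=i_stream_name,
    )
    # each event has a 'timestamp' and the actual log line in 'message'
    for event in response["events"]:
        print(event["message"])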

Python3.8 Asyncio - Return Results from List of Dictionaries

I am really, really struggling to figure out how to use asyncio to return a bunch of results from a bunch of AWS Lambda calls; here is my example.
My team owns a bunch of AWS accounts. For the sake of time, I want to run the AWS Lambda functions that process each account's information asynchronously and return the results. I'm trying to understand how I can send a whole bunch of accounts quickly rather than doing it one at a time. Here is my example code.
def call_lambda(acct):
    aws_lambda = boto3.client('lambda', region_name='us-east-2')
    aws_payload = json.dumps(acct)
    response = aws_lambda.invoke(
        FunctionName='MyLambdaName',
        Payload=aws_payload,
    )
    return json.loads(response['Payload'].read())

def main():
    scan_time = datetime.datetime.utcnow()
    accounts = []
    scan_data = []
    account_data = account_parser()
    for account_info in account_data:
        account_info['scan_time'] = scan_time
    for account in account_data:
        scan_data.append(call_lambda(account))
I am struggling to figure out how to do this in an asyncio style. I originally managed to pull it off using concurrent.futures' ThreadPoolExecutor, but I ran into some performance issues. Here is what I had:
executor = concurrent.futures.ThreadPoolExecutor(max_workers=50)
sg_data = executor.map(call_lambda, account_data)
So this worked, but not well, and I was told to use asyncio instead. I read the following articles but I am still lost as to how to make this work. I know the AWS Lambda invocation itself can be asynchronous and should work fine without a coroutine.
The tl;dr is: I want to kick off call_lambda(acct) for every dict in my list (account_data is a list of dictionaries) and then return all the results in one big list of dicts again (this eventually gets written to a CSV; company policy issues are why it is not going into a database).
I have read the following, still confused...
https://stackabuse.com/python-async-await-tutorial/
Lambda invocations are synchronous by default (RequestResponse), so you have to specify the InvocationType as Event. By doing that, though, you don't get a response back to collate the account info you want; hence the desire to use async or something similar.
response = client.invoke(
    FunctionName='string',
    InvocationType='Event'|'RequestResponse'|'DryRun',
    LogType='None'|'Tail',
    ClientContext='string',
    Payload=b'bytes'|file,
    Qualifier='string'
)
I've not implemented async in Lambda, but as an alternate solution, if you can create an S3 file and have each invocation of FunctionName='MyLambdaName' just update that file, you might get what you need that way.
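If you do want to keep the RequestResponse invocations and drive them from asyncio, one common pattern is to run the blocking boto3 calls on a thread pool via run_in_executor and gather the results. This is only a sketch of that idea (it reuses the call_lambda function from the question and works on Python 3.8), not a drop-in answer, since boto3 itself is synchronous:
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def scan_accounts(account_data, max_workers=50):
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=max_workers)
    # schedule every blocking call_lambda(acct) call on the thread pool
    futures = [
        loop.run_in_executor(executor, call_lambda, acct)
        for acct in account_data
    ]
    # gather preserves order, so results line up with account_data
    return await asyncio.gather(*futures)

# usage:
# scan_data = asyncio.run(scan_accounts(account_data))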

How to initialize repeating tasks using Django Background Tasks?

I'm working on a Django application which reads a CSV file from Dropbox, parses the data, and stores it in a database. For this I need a background task which checks whether the file has been modified or changed (updated) and then updates the database.
I tried Celery but failed to configure it with Django. Then I found django-background-tasks, which is much simpler than Celery to configure.
My question is: how do I initialize repeating tasks?
It is described in the documentation,
but I'm unable to find any example which explains how to use repeat, repeat_until, or the other constants mentioned in the documentation.
Can anyone explain the following with examples, please?
notify_user(user.id, repeat=<number of seconds>, repeat_until=<datetime or None>)
repeat is given in seconds. The following constants are provided: Task.NEVER (default), Task.HOURLY, Task.DAILY, Task.WEEKLY, Task.EVERY_2_WEEKS, Task.EVERY_4_WEEKS.
You have to call the particular function (notify_user()) when you actually need to schedule it.
Suppose you need to schedule the task when a request comes to the server; then it would look like this:
from background_task import background

@background(schedule=60)
def get_csv(creds):
    # read the csv from Dropbox with the credentials "creds",
    # then update the DB
    ...

def myview(request):
    # do something with my view
    get_csv(creds, repeat=100)
    return SomeHttpResponse
Execution procedure:
1. The request comes to the URL, so it is dispatched to the corresponding view, here myview().
2. The line get_csv(creds, repeat=100) executes and creates an async task in the DB (it won't execute the function now).
3. The HTTP response is returned to the user.
60 seconds after the task was created, get_csv(creds) will execute, and it will then repeat every 100 seconds.
For example, suppose you have this function from the documentation:
@background(schedule=60)
def notify_user(user_id):
    # look up the user by id and send them a message
    user = User.objects.get(pk=user_id)
    user.email_user('Here is a notification', 'You have been notified')
Suppose you want to repeat this task daily until New Year's Day 2019; you would do the following:
import datetime
from background_task.models import Task

new_years_2019 = datetime.datetime(2019, 1, 1)
notify_user(some_id, repeat=Task.DAILY, repeat_until=new_years_2019)
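Tying this back to the original Dropbox/CSV use case, a minimal sketch (the function name, schedule, and repeat interval are made up for illustration; it assumes a standard django-background-tasks setup) would be:
from background_task import background
from background_task.models import Task

@background(schedule=60)
def sync_dropbox_csv():
    # check whether the file changed and update the database here
    ...

# schedule it once (e.g. from a management command); it will then
# repeat hourly until cancelled
sync_dropbox_csv(repeat=Task.HOURLY, repeat_until=None)

# note that scheduling only writes a row to the database; a worker must
# be running to actually process the queue:
#   python manage.py process_tasks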

How to isolate user input for env.hosts in Fabric per task

I have a few tasks in a Fabric script. I'm trying to figure out how one would allow the setting of env.hosts (or the @hosts decorator) to be isolated to a given task. I want some tasks in my fabfile to have preset hosts, while for others I could feed a file that gets parsed into a tuple of hosts. I would also like that file to be determined at run time.
I have this:
def host_list():
    host_file = raw_input("enter the file containing the list of hosts: ")
    host_list = open(host_file, 'r')
    host_list = host_list.read().strip('\n')
    host_list = host_list.split(',')
    return host_list
I have a task:
@task
def hostname():
    run('hostname')
I can get env.hosts set properly when I have the host_list function separated into commands, but I have other tasks where I don't want Fabric to prompt to set env.hosts. I tried adding the steps inside the task functions, but I get prompted on every iteration. I tried feeding the @hosts decorator with the host_list function, but it gave me an error about the function object not being iterable. Is there a way to isolate the host_list function to only certain tasks?
There are plenty of questions on Stack Overflow with this answer, but to give you an idea: you could pass the file as an argument to host_list(), read it there, and pass its result to execute(), as in the sketch below.
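A minimal sketch of that idea (Fabric 1.x API assumed; the task and file names are placeholders): only the task that needs the file reads it, so your other tasks can keep their preset env.hosts.
from fabric.api import run, task, execute

def host_list(host_file):
    # parse a comma-separated list of hosts from the given file
    with open(host_file) as f:
        return f.read().strip('\n').split(',')

@task
def hostname():
    run('hostname')

@task
def hostname_from_file(host_file='hosts.txt'):
    # only this task reads the file, e.g.:
    #   fab hostname_from_file:host_file=myhosts.txt
    execute(hostname, hosts=host_list(host_file))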
