generate dynamic task using hooks without running them in backend

I have a simple DAG that takes its arguments (such as the SQL and the subject) from a MySQL database.
I then have a function that builds a report and sends it to a particular email address.
Here is the code snippet:
    def s_report(k, **kwargs):
        body_sql = list2[k][4]
        request1 = "({})".format(body_sql)
        dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
        df1 = dwh_hook.get_pandas_df(request1)
        df2 = df1.to_html()
        body_Text = list2[k][3]
        html_content = f"""HI Team, Please find report<br><br>
        {df2} <br> </br>
        <b>Thank you!</b><br>
        """
        return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
                             subject=f"{list2[k][2]}", html_content=html_content, dag=dag)
    for j in range(len(list)):
        mysql_list >> [ s_report(j)] >> end_operator
The s_report tasks are generated dynamically, but the real problem is that the hook keeps submitting queries in the backend. Even while the DAG is stopped, it still submits queries.
I could use a PythonOperator, but then the tasks are not generated dynamically.

A couple of things:
Looking at your code, in particular the lines:
    for j in range(len(list)):
        mysql_list >> [ s_report(j)] >> end_operator
we can determine that if your first task, mysql_list, succeeds, then the tasks downstream of it, the s_report tasks, should begin executing. There are exactly len(list) of them. Each s_report call contains exactly one dwh_hook.get_pandas_df(request1) call, so I believe your DAG should make len(list) calls of this type, provided the mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise without more details. Keep in mind that get_pandas_df might have a retry mechanism (i.e. if it cannot reach Snowflake, it retries), which could explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully (i.e. the end_operator task finishes successfully), you are correct: there should be no requests in your Snowflake logs after the DAG has ended.
If you want more insight into how your DAG interacts with your Snowflake resource, I'd suggest running a single s_report task, like so:
    mysql_list >> [ s_report(0)] >> end_operator
and watching the behaviour in the logs.
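One possibility worth ruling out, assuming the snippet sits at the top level of the DAG file: Airflow re-executes the DAG file every time the scheduler parses it, and s_report(j) is called during that parsing, so get_pandas_df would run on each parse rather than only when a task executes, even while the DAG is paused. The sketch below uses plain Python with no Airflow imports to show the difference between running a query when the task is defined and deferring it into a callable that runs only when the task executes. FakeHook, build_eager and build_deferred are hypothetical names used for illustration only.

```python
class FakeHook:
    """Stand-in for SnowflakeHook that only records queries (hypothetical)."""

    def __init__(self):
        self.queries = []

    def get_pandas_df(self, sql):
        self.queries.append(sql)
        return f"<table for {sql}>"


def build_eager(hook, sql):
    # The query runs right here, i.e. whenever the DAG file is parsed.
    html = hook.get_pandas_df(sql)
    return {"html_content": html}


def build_deferred(hook, sql):
    # Nothing is queried here; the query runs only when `run` is invoked,
    # which is what an operator does with its callable at execution time.
    def run(**kwargs):
        return {"html_content": hook.get_pandas_df(sql)}
    return run


hook = FakeHook()
sqls = ["select 1", "select 2"]

# Simulate three parses of the DAG file with the eager style.
for _ in range(3):
    for sql in sqls:
        build_eager(hook, sql)
eager_calls = len(hook.queries)

# Simulate three parses with the deferred style.
hook.queries.clear()
for _ in range(3):
    callables = [build_deferred(hook, sql) for sql in sqls]
deferred_calls_after_parsing = len(hook.queries)

# Simulate one DAG run: each task callable executes once.
for run in callables:
    run()
deferred_calls_after_run = len(hook.queries)
```

In a real DAG, the deferred callable would be passed as python_callable to a PythonOperator created inside the same for loop, which still yields one task per row.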

Related

run 2 instructions sequentially and execute the second instruction based on a condition in Python Lambda

I have a requirement to execute 2 tasks sequentially in a Lambda function in Python.
Because the first task runs asynchronously, I have to check its status regularly and run the second task only once the first has the status AVAILABLE. I did not manage to do that with the code below. The second task is on the last line (create_access_key). I found this doc: https://docs.aws.amazon.com/lambda/latest/operatorguide/synchronous-waiting.html but I don't want to use a second Lambda function to execute the simple second task.
    if e.response["Error"]["Code"] == "NoSuchEntity":
        client_sc = boto3.client('servicecatalog')
        create_user = client_sc.provision_product(
            ProductId='prod-9g59999d2djkg',
            ProvisionedProductName='Create_User',
            ProvisioningArtifactName='v1.1.002',
            ProvisioningParameters=[
                {
                    'Key':'iamUserName',
                    'Value':useriam
                },
            ]
        )
        get_status = client_sc.describe_provisioned_product(Id=pp_id)
        status = status['ProvisionedProductDetail']['Status']
        print (status)
        while describe_status != 'AVAILABLE':
            time.sleep(20)
            if describe_status == 'AVAILABLE':
                create_access_key = client_iam.create_access_key(UserName=useriam) #second task

Before the AWS CLI added wait commands to various services, what you are doing is almost exactly the same pattern that people used in bash scripts trying to create a sequence of CLI invocations. Primitive polling.
There are waiters in Boto3 as documented at
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/clients.html
and
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/servicecatalog.html#ServiceCatalog.Client.get_waiter
but the ServiceCatalog client doesn't have any ready-made ones:
    import boto3
    # boto3.__version__ = '1.24.43'
    sc = boto3.client('servicecatalog')
    sc.waiter_names
    []
So rolling your own (as you do in your example code) may be the best option.
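A hand-rolled waiter is safer with a timeout and a check for failure states, so the Lambda does not loop until it is killed. Here is a minimal sketch, assuming the status is read through a callable so that the loop can be exercised without AWS. wait_for_status and its parameters are hypothetical names, not part of Boto3.

```python
import time


def wait_for_status(get_status, target="AVAILABLE", failed=("ERROR", "TAINTED"),
                    delay=20, timeout=600, sleep=time.sleep):
    """Poll get_status() until it returns `target`.

    Raises RuntimeError on a failure status and TimeoutError once
    `timeout` seconds of waiting have been used up.
    """
    waited = 0
    while True:
        status = get_status()  # re-read the status on every pass
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"provisioning ended in status {status}")
        if waited >= timeout:
            raise TimeoutError(f"still {status} after {waited}s")
        sleep(delay)
        waited += delay


# Example with a fake status source standing in for
# client_sc.describe_provisioned_product(Id=pp_id)['ProvisionedProductDetail']['Status']
statuses = iter(["UNDER_CHANGE", "UNDER_CHANGE", "AVAILABLE"])
naps = []  # records the requested sleeps instead of really sleeping
result = wait_for_status(lambda: next(statuses), delay=20, sleep=naps.append)
```

With the real client, get_status would wrap describe_provisioned_product, and create_access_key would be called only after wait_for_status returns. The timeout should stay below the Lambda function's own timeout.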
I know you don't want to add a second lambda function, but perhaps Step Functions are a useful alternative?
Does this lambda get invoked by someone clicking on a button in a web ui, hence the need to return a finished result? If not, maybe treating this as a series of events and messages could be a possibility?

How to add content to the email send by airflow on success

I wish to send an email with information about the output of my Airflow DAG on success. My two approaches were, first, to execute a function from the main DAG (which I couldn't do, as it seems the DAG is unable to access the outputs of what it runs, which I find logical) and, second and more promising, to configure it to send the log of the DAG run. I created it following this idea in Airflow, but while it sends emails, they're blank. I tried finding the log and creating a list to pass as the message, as follows:
    def task_success_callback(context):
        outer_task_success_callback(context, email='a1u1k6u3f0v1t0r8@justeat.slack.com')
    def outer_task_success_callback(context, email):
        lines = []
        for file in glob.glob("AIRFLOW_HOME/*.log"):
            with open(file) as f:
                lines = [line for line in f.readlines()]
                print(lines)
        mensaje = lines
        subject = "[Airflow] DAG {0} - Task {1}: Success".format(
            context['task_instance_key_str'].split('__')[0],
            context['task_instance_key_str'].split('__')[1]
        )
        html_content = """
        DAG: {0}<br>
        Task: {1}<br>
        Log: {2}<br>
        """.format(
            context['task_instance_key_str'].split('__')[0],
            context['task_instance_key_str'].split('__')[1],
            mensaje
        )
It didn't seem to produce anything. It did not even return an error. Even weirder, the Airflow log no longer even refers to the "email sent" event, which it used to do.

First: pass messages
To send mail with information about a DAG run, you need two core Airflow features:
XCom and the PythonOperator's provide_context argument.
XCom: used to exchange information among tasks. Messages can be pushed to and pulled from XCom.
This is a subtle but very important point: in general, if two
operators need to share information, like a filename or small amount
of data, you should consider combining them into a single operator. If
it absolutely can’t be avoided, Airflow does have a feature for
operator cross-communication called XCom that is described in the
section XComs
https://airflow.apache.org/docs/stable/concepts.html?highlight=xcom
provide_context (bool)
– if set to true, Airflow will pass a set of keyword arguments that
can be used in your function. This set of kwargs correspond exactly to
what you can use in your jinja templates. For this to work, you need
to define **kwargs in your function header.
https://airflow.apache.org/docs/stable/_api/airflow/operators/python_operator/index.html?highlight=provide_context
Second: run the email task after the task has succeeded.
There are several ways to do this. One works at the DAG level, another at the task level. Double-check which one fits your logic.

How to initialize repeating tasks using Django Background Tasks?

I'm working on a Django application which reads a CSV file from Dropbox, parses the data and stores it in a database. For this purpose I need a background task which checks whether the file has been modified or updated and then updates the database.
I tried Celery but failed to configure it with Django. Then I found django-background-tasks, which is much simpler to configure than Celery.
My question is how to initialize repeating tasks.
It is described in the documentation,
but I'm unable to find any example which explains how to use repeat, repeat_until or the other constants mentioned there.
Can anyone explain the following with examples, please?
notify_user(user.id, repeat=<number of seconds>, repeat_until=<datetime or None>)
repeat is given in seconds. The following constants are provided:
Task.NEVER (default), Task.HOURLY, Task.DAILY, Task.WEEKLY,
Task.EVERY_2_WEEKS, Task.EVERY_4_WEEKS.

You have to call the function (notify_user()) at the point where you actually want it executed.
Suppose you need to run the task when a request comes in to the server; then it would look like this:
    @background(schedule=60)
    def get_csv(creds):
        # read the csv from Dropbox with the credentials, "creds"
        # then update the DB
        pass
    def myview(request):
        # do something with my view
        get_csv(creds, repeat=100)
        return SomeHttpResponse
Execution procedure
1. A request comes to the URL and is dispatched to the corresponding view, here myview().
2. The line get_csv(creds, repeat=100) executes and creates an async task in the DB (it won't execute the function now).
3. The HTTP response is returned to the user.
60 seconds after the task was created, get_csv(creds) will run, and it will then repeat every 100 seconds.

For example, suppose you have the function from the documentation:
    @background(schedule=60)
    def notify_user(user_id):
        # lookup user by id and send them a message
        user = User.objects.get(pk=user_id)
        user.email_user('Here is a notification', 'You have been notified')
Suppose you want to repeat this task daily until New Year's Day of 2019; you would do the following:
    import datetime
    from background_task.models import Task
    # integer literals must not have leading zeros in Python 3
    new_years_2019 = datetime.datetime(2019, 1, 1)
    notify_user(some_id, repeat=Task.DAILY, repeat_until=new_years_2019)

I need to scrape logs from cloud watch logs and load it to s3 and from s3 to data warehouse

I have several Lambda functions. I need to scrape the logs generated by all of them and load them into our internal data warehouse. I thought of these solutions:
1. Have a Lambda function subscribed to my Lambda functions' CloudWatch log groups that polishes the log messages and pushes them to S3.
Pros: Works and is simple to implement.
Cons: There is no way for me to "replay". Say my exporter failed for some reason; I wouldn't be able to replay this action.
2. Have a Lambda function that runs every 10 minutes or so, creates an export task, scrapes the logs from CloudWatch and loads them to S3.
    import boto3
    client = boto3.client('logs')
    response = client.create_export_task(
        taskName='export_task',
        logGroupName='/aws/lambda/<lambda_function_1>',
        fromTime=from_time,
        to=to_time,
        destination='<application_logs>',
        destinationPrefix='<lambda_function_1>'
    )
    response = client.create_export_task(
        taskName='export_task',
        logGroupName='/aws/lambda/<lambda_function_2>',
        fromTime=from_time,
        to=to_time,
        destination='<application_logs>',
        destinationPrefix='<lambda_function_2>'
    )
The second create_export_task call fails here:
    An error occurred (LimitExceededException) when calling the CreateExportTask operation: Resource limit exceeded.
I can't create multiple export tasks. Is there a way to address this?

From AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
You can use the code below to wait until the status is no longer 'RUNNING' or 'PENDING' (for example, 'COMPLETED'):
    response = client.create_export_task(
        taskName='export_cw_to_s3',
        logGroupName='/ecs/',
        logStreamNamePrefix=org_id,
        fromTime=int((yesterday-unix_start).total_seconds() * 1000),
        to=int((today-unix_start).total_seconds() * 1000),
        destination='test-bucket',
        destinationPrefix=f'random-string/{today.year}/{today.month}/{today.day}/{org_id}')
    taskId = (response['taskId'])
    status = 'RUNNING'
    while status in ['RUNNING','PENDING']:
        response_desc = client.describe_export_tasks(
            taskId=taskId
        )
        status = response_desc['exportTasks'][0]['status']['code']
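Because only one export task can be active at a time, the exports for several log groups have to run one after another. Here is a minimal sketch of that, assuming a logs client is passed in. export_sequentially, poll_seconds and the FakeLogsClient used to exercise it are hypothetical, while create_export_task and describe_export_tasks are the same calls used above.

```python
import time


def export_sequentially(client, log_groups, from_time, to_time, bucket,
                        poll_seconds=5, sleep=time.sleep):
    """Export each log group in turn, waiting for one task to end before
    starting the next. Returns {log_group: final_status_code}."""
    results = {}
    for group in log_groups:
        task_id = client.create_export_task(
            taskName=f"export{group.replace('/', '-')}",
            logGroupName=group,
            fromTime=from_time,
            to=to_time,
            destination=bucket,
            destinationPrefix=group.strip("/").replace("/", "-"),
        )["taskId"]
        status = "PENDING"
        while status in ("PENDING", "RUNNING"):
            sleep(poll_seconds)  # pause between polls
            desc = client.describe_export_tasks(taskId=task_id)
            status = desc["exportTasks"][0]["status"]["code"]
        results[group] = status
    return results


class FakeLogsClient:
    """Stand-in for boto3.client('logs'); allows one active task at a time."""

    def __init__(self):
        self.active = None
        self.polls_left = 0
        self.created = []

    def create_export_task(self, **kwargs):
        if self.active is not None:
            raise RuntimeError("LimitExceededException")
        self.active = f"task-{len(self.created)}"
        self.polls_left = 2  # report RUNNING twice, then COMPLETED
        self.created.append(kwargs["logGroupName"])
        return {"taskId": self.active}

    def describe_export_tasks(self, taskId):
        if self.polls_left > 0:
            self.polls_left -= 1
            code = "RUNNING"
        else:
            code = "COMPLETED"
            self.active = None
        return {"exportTasks": [{"status": {"code": code}}]}


fake = FakeLogsClient()
outcome = export_sequentially(
    fake, ["/aws/lambda/fn_1", "/aws/lambda/fn_2"],
    from_time=0, to_time=1000, bucket="application-logs",
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
```

Inside a Lambda function the total waiting time still has to fit within the function's timeout, so a long list of log groups may need to be split across invocations.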

I came across the same error message. The reason is that you can only have one running or pending export task per account at a given time, hence this task fails. From the AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html

Sometimes one CreateExportTask task stays in the pending state for a long time, preventing other Lambda functions that create export tasks from running. You can find this task and cancel it, allowing the other functions to run.

Find out whether celery task exists

Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I will always get pending.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.

Celery does not write a state when the task is sent; this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
    from celery import current_app
    # `after_task_publish` is available in celery 3.1+
    # for older versions use the deprecated `task_sent` signal
    from celery.signals import after_task_publish
    # when using celery versions older than 4.0, use body instead of headers
    @after_task_publish.connect
    def update_sent_state(sender=None, headers=None, **kwargs):
        # the task may not exist if sent using `send_task` which
        # sends tasks by name, so fall back to the default result backend
        # if that is the case.
        task = current_app.tasks.get(sender)
        backend = task.backend if task else current_app.backend
        backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly) been sent:
    >>> result.state != "PENDING"

AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not
known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
    >>> from tasks import add
    >>> from celery.utils import uuid
    >>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
    >>> id = r.task_id
    >>> id
    'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
    >>> if not "blubb".startswith("celery-task-id-"): print("Unknown task id")
    ...
    Unknown task id
    >>> if not id.startswith("celery-task-id-"): print("Unknown task id")
    ...

Right now I'm using the following scheme:
Get task id.
Set to memcache key like 'task_%s' % task.id message 'Started'.
Pass task id to client.
Now from client I can monitor task status(set from task messages to memcache).
From task on ready - set to memcache key message 'Ready'.
From client on task ready - start special task that will delete key from memcache and do necessary cleaning actions.

You need to call .get() on the AsyncResult object you create to actually fetch the result from the backend.
See the Celery FAQ.
To clarify my answer further:
Any string is technically a valid ID, there is no way to validate the task ID. The only way to find out if a task exists is to ask the backend if it knows about it and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied, this is by design to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
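    from celery.exceptions import TimeoutError
    from celery.result import AsyncResult
    async_result = AsyncResult('blubb')
    try:
        # fetch the result from the backend
        # your backend must be fast enough to return
        # results within 100ms (0.1 seconds)
        # .get() returns the task's return value, so keep the AsyncResult
        # itself for the state; propagate=False avoids re-raising task errors
        async_result.get(timeout=0.1, propagate=False)
        found = True
    except TimeoutError:
        found = False
    if found:
        print("Result exists; state=%s" % (async_result.state,))
    else:
        print("Result does not exist")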
It should go without saying that this only works if your backend is storing results; if it's not, there's no way to know whether a task ID is valid, because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend because it does not store results, it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.

So I have this idea:
    import project.celery_tasks as tasks
    def task_exist(task_id):
        found = False
        # tasks is my imported task module from celery
        # it is located under /project/project, where the settings.py file is located
        i = tasks.app.control.inspect()
        s = i.scheduled()
        for e in s:
            if task_id in s[e]:
                found = True
                break
        a = i.active()
        if not found:
            for e in a:
                if task_id in a[e]:
                    found = True
                    break
        r = i.reserved()
        if not found:
            for e in r:
                if task_id in r[e]:
                    found = True
                    break
        # if checking the status returns pending, yet we found it in any queues... it means it exists...
        # if it returns pending, yet we didn't find it on any of the queues... it doesn't exist
        return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html the different types of queue inspections are:
active,
scheduled,
reserved,
revoked,
registered,
stats,
query_task,
so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.
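One caveat, based on the reply format shown in the Celery monitoring guide: each inspect call returns a dict mapping worker names to lists of task dicts (or None when no worker replies), and scheduled entries nest the task under a 'request' key, so a plain `task_id in s[e]` compares a string against dicts. Here is a minimal sketch of a lookup that reads the 'id' field instead. reply_contains is a hypothetical helper, and the sample replies are made up for illustration.

```python
def reply_contains(reply, task_id):
    """Return True if task_id appears in one inspect() reply.

    `reply` is expected to look like {worker_name: [task_info, ...]} or None.
    """
    for worker_tasks in (reply or {}).values():
        for info in worker_tasks:
            # scheduled() wraps the task in {'eta': ..., 'request': {...}};
            # active() and reserved() return the task dict directly.
            request = info.get("request", info)
            if request.get("id") == task_id:
                return True
    return False


# Made-up replies in the shape described above.
active_reply = {"worker1@example.com": [{"id": "abc-123", "name": "tasks.add"}]}
scheduled_reply = {
    "worker1@example.com": [
        {"eta": "2010-06-07 09:07:52", "priority": 0,
         "request": {"id": "def-456", "name": "tasks.sleeptask"}}
    ]
}

found_active = reply_contains(active_reply, "abc-123")
found_scheduled = reply_contains(scheduled_reply, "def-456")
found_unknown = reply_contains(active_reply, "blubb")
found_no_workers = reply_contains(None, "abc-123")
```

task_exist could then check the scheduled, active and reserved replies in turn with this helper and return True on the first match.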

Try
AsyncResult('blubb').state
that may work.
It should return something different.

Please correct me if i'm wrong.
    if built_in_status_check(task_id) == 'pending'
        if registry_exists(task_id) == true
            print 'Pending'
        else
            print 'Task does not exist'
