Job deletion and recreation in Azure Batch raises BatchErrorException

Job deletion and recreation in Azure Batch raises BatchErrorException - python

I'm writing a task manager for Azure Batch in Python.
When I run the manager, and add a Job to the specified Azure Batch account, I do:
check if the specified job id already exists
if yes, delete the job
create the job
Unfortunately I fail between step 2 and 3. This is because, even if I issue the deletion command for the specified job and check that there is no job with the same id in the Azure Batch Account, I get a BatchErrorException like the following when I try to create the job again:
Exception encountered:
The specified job has been marked for deletion and is being garbage collected.
The code I use to delete the job is the following:
def deleteJob(self, jobId):
print("Delete job [{}]".format(jobId))
self.__batchClient.job.delete(jobId)
# Wait until the job is deleted
# 10 minutes timeout for the operation to succeed
timeout = datetime.timedelta(minutes=10)
timeout_expiration = datetime.datetime.now() + timeout
while True:
try:
# As long as we can retrieve data related to the job, it means it is still deleting
self.__batchClient.job.get(jobId)
except batchmodels.BatchErrorException:
print("Job {jobId} deleted correctly.".format(
jobId = jobId
))
break
time.sleep(2)
if datetime.datetime.now() > timeout_expiration:
raise RuntimeError("ERROR: couldn't delete job [{jobId}] within timeout period of {timeout}.".format(
jobId = jobId
, timeout = timeout
))
I tried to check the Azure SDK, but couldn't find a method that would tell me exactly when a job was completely deleted.

Querying for existence of the job is the only way to determine if a job has been deleted from the system.
Alternatively, you can issue a delete job and then create a job with a different id, if you do not strictly need to reuse the same job id again. This will allow the job to delete asynchronously from your critical path.

According to the exception log information you provide, I think it occurred because the delete job could consume a certain amount of time and you could't create the same id of the job during this time.
I suggest that you could add the check in step 3 to create the job, ensuring that the job with the same id has not been found in the account before you create it .
You could refer to snippet of the code as below to create job since you did not provide your code of creating job:
import azure.batch.batch_service_client as batch
import azure.batch.batch_auth as batchauth
import azure.batch.models as batchmodels
credentials = batchauth.SharedKeyCredentials(ACCOUNT_NAME,
ACCOUNT_KEY)
batch_client = batch.BatchServiceClient(
credentials,
base_url=ACCOUNT_URL)
def createJob(jobId):
while (batch_client.job.get(jobId)):
print 'job still exists,can not be created'
else:
# Create Job
job = batchmodels.JobAddParameter(
jobId,
batchmodels.PoolInformation(pool_id='mypool')
)
batch_client.job.add(job)
print 'create success'
Hope it helps you.

Related

generate dynamic task using hooks without running them in backend

I have a simple dag - that takes argument from mysql db - (like sql, subject)
Then I have a function creating report out and send to particular email.
Here is code snippet.
def s_report(k,**kwargs):
body_sql = list2[k][4]
request1 = "({})".format(body_sql)
dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
df1 = dwh_hook.get_pandas_df(request1)
df2 = df1.to_html()
body_Text = list2[k][3]
html_content = f"""HI Team, Please find report<br><br>
{df2} <br> </br>
<b>Thank you!</b><br>
"""
return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
subject=f"{list2[k][2]}", html_content=html_content, dag=dag)
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
The s_report is getting generated dynamically, But the real problem is hook is continously submitting query in backend, While dag is stopped still its submitting query in backend.
I can use pythonoperator, but its not generating dynamic task.

A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
we can determine that if your first task succeeds, namely, mysql_list, then the tasks downstream to it, namely, the s_report calls should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request) call, so I believe your DAG should be making len(list) calls of this type provided mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here. I'd need more details. Keep in mind that the call get_pandas_df might have a retry mechanism (i.e. if cannot reach snowflake, retry) which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully, (i.e. end_operator tasks finishes successfully), you are correct. There should be no requests in your Snowflake logs that came post-DAG end.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [ s_report(0)] >> end_operator
and see the behaviour in the logs.

BigQuery execution time

After I send a query to be executed in BigQuery, how can I find the execution time and other statistics about this job?
The Python API for BigQuery has a suggestive field timeline, but barely mentioned in the documentation. However, this method returns an empty iterable.
So far, I have been running this Python code, with the BigQuery Python API.
from google.cloud import bigquery
bq_client = bigquery.Client()
job = bq_client.query(some_sql_query, location="US")
ts = list(job.timeline)
if len(ts) > 0:
for t in ts:
print(t)
else:
print('No timeline results ???') # <-- no timeline results.

When your job is "DONE", the BigQuery API returns a JobStatistics object (described here) that is accessible in Python through the attributes of your job object.
The available attrs are accessible here.
For accessing the time taken by your job you mainly have the job.created, job.started and job.ended attributes.
To come back to your code snippet, you can try something like this:
from google.cloud import bigquery
bq_client = bigquery.client()
job = bq_client.query(some_sql_query, location="US")
while job.running():
print("Running..")
sleep(0.1)
print("The job duration was {}".format(job.ended - job.started))

In case you need to have job logs available after some time and ready for analysis I'd suggest creating a log sink for your big query jobs. It would create table in big query, where job logs are dumped. After that you can easily run analysis to determine how long they took and how much they cost.
First create a sink in Operations/Logging:
Make sure to set sink as BigQuery Dataset (pick one) and add query_job_completed as logs to include in the sink. After a while you will see new table in that data set similar to: {project}.{dataset}.cloudaudit_googleapis_com_data_access_{date}
Go you BigQuery and create a view. This view for example will show how much data each job consumed (and how much it cost):
SELECT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes as totalBilledBytes
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.endTime as endTime
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId targetDataSet
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId targetTable
,protopayload_auditlog.authenticationInfo.principalEmail
,(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / 1000000000000.0) * 5.0 as billed
FROM `{project}.{dataset}.cloudaudit_googleapis_com_data_access_*`
where protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes > 0

I need to scrape logs from cloud watch logs and load it to s3 and from s3 to data warehouse

I have several lambda functions. I need to scrape my logs generated from all of my lambda functions and load to our internal data warehouse. I thought of these solutions.
Have a lambda function subscribed to my lambda function's cloudwatch log groups and polish and log messages and push it to s3.
Pros: Works and simple to implement.
Cons: There is no way for me to
"replay". Say My exporter failed for some reason. I wouldn't be able
to replay this action.
Have a lambda function that runs every 10 min or so and creates export task and scrapes logs from cloudwatch and loads them to s3.
import boto3
client = boto3.client('logs')
response = client.create_export_task(
taskName='export_task',
logGroupName='/aws/lambda/<lambda_function_1>',
fromTime=from_time,
to=to_time,
destination='<application_logs>',
destinationPrefix='<lambda_function_1>'
)
response = client.create_export_task(
taskName='export_task',
logGroupName='/aws/lambda/<lambda_function_2>',
fromTime=from_time,
to=to_time,
destination='<application_logs>',
destinationPrefix='<lambda_function_2>'
)
Second create_export_task fails here
An error occurred (LimitExceededException) when calling the
CreateExportTask operation: Resource limit exceeded."
I cant create multiple export task. Is there a way to address this?

From AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
U can use the below function to check if the status has been changed to 'COMPLETED'
response = client.create_export_task(
taskName='export_cw_to_s3',
logGroupName='/ecs/',
logStreamNamePrefix=org_id,
fromTime=int((yesterday-unix_start).total_seconds() * 1000),
to=int((today-unix_start).total_seconds() * 1000),
destination='test-bucket',
destinationPrefix=f'random-string/{today.year}/{today.month}/{today.day}/{org_id}')
taskId = (response['taskId'])
status = 'RUNNING'
while status in ['RUNNING','PENDING']:
response_desc = client.describe_export_tasks(
taskId=taskId
)
status = response_desc['exportTasks'][0]['status']['code']

Came across the same error message and the reason is you can only have one running/pending export task per account at a given time hence this task is failing. From AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html

Sometimes one createExport task stays in pending state for long preventing other lambda functions with the same task to run. You could see this task and cancel it allowing the other functions to run.

How to replace a task on Google App Engine Task Queue?

When adding a certain task to the task queue, I would like to make sure there is only one such task. If this task already exists, I would like to delete it and add the new task instead (postponing its execution is ok also). This is my code:
queue = taskqueue.Queue()
queue.delete_tasks_by_name('task_name')
task = taskqueue.Task(
name = 'task_name',
url = '/task/url',
method = 'GET',
countdown = 3600)
queue.add(task)
When running the code it raises a TombstonedTaskError, which make sense according to the docs. Is there a way to replace or postpone execution of an existing task?

Use tags instead of names. Give the tag a unique name then do a lease_task_by_tag to see if it exists.
add(taskqueue.Task(payload='parse1', method='PULL', tag='parse'))
lease_tasks_by_tag(lease_seconds, max_tasks, tag=None, deadline=10)

Find out whether celery task exists

Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I will always get pending.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.

Celery does not write a state when the task is sent, this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
from celery import current_app
# `after_task_publish` is available in celery 3.1+
# for older versions use the deprecated `task_sent` signal
from celery.signals import after_task_publish
# when using celery versions older than 4.0, use body instead of headers
#after_task_publish.connect
def update_sent_state(sender=None, headers=None, **kwargs):
# the task may not exist if sent using `send_task` which
# sends tasks by name, so fall back to the default result backend
# if that is the case.
task = current_app.tasks.get(sender)
backend = task.backend if task else current_app.backend
backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly)
been sent:
>>> result.state != "PENDING"

AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not
known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
>>> from tasks import add
>>> from celery.utils import uuid
>>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
>>> id = r.task_id
>>> id
'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
>>> if not "blubb".startswith("celery-task-id-"): print "Unknown task id"
...
Unknown task id
>>> if not id.startswith("celery-task-id-"): print "Unknown task id"
...

Right now I'm using following scheme:
Get task id.
Set to memcache key like 'task_%s' % task.id message 'Started'.
Pass task id to client.
Now from client I can monitor task status(set from task messages to memcache).
From task on ready - set to memcache key message 'Ready'.
From client on task ready - start special task that will delete key from memcache and do necessary cleaning actions.

You need to call .get() on the AsyncTask object you create to actually fetch the result from the backend.
See the Celery FAQ.
To further clarify on my answer.
Any string is technically a valid ID, there is no way to validate the task ID. The only way to find out if a task exists is to ask the backend if it knows about it and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied, this is by design to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
from celery.exceptions import TimeoutError
try:
# fetch the result from the backend
# your backend must be fast enough to return
# results within 100ms (0.1 seconds)
result = AsyncResult('blubb').get(timeout=0.1)
except TimeoutError:
result = None
if result:
print "Result exists; state=%s" % (result.state,)
else:
print "Result does not exist"
It should go without saying that this only work if your backend is storing results, if it's not there's no way to know if a task ID is valid or not because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend because it does not store results, it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.

So I have this idea:
import project.celery_tasks as tasks
def task_exist(task_id):
found = False
# tasks is my imported task module from celery
# it is located under /project/project, where the settings.py file is located
i = tasks.app.control.inspect()
s = i.scheduled()
for e in s:
if task_id in s[e]:
found = True
break
a = i.active()
if not found:
for e in a:
if task_id in a[e]:
found = True
break
r = i.reserved()
if not found:
for e in r:
if task_id in r[e]:
found = True
break
# if checking the status returns pending, yet we found it in any queues... it means it exists...
# if it returns pending, yet we didn't find it on any of the queues... it doesn't exist
return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html the different types of queue inspections are:
active,
scheduled,
reserved,
revoked,
registered,
stats,
query_task,
so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.

Try
AsyncResult('blubb').state
that may work.
It should return something different.

Please correct me if i'm wrong.
if built_in_status_check(task_id) == 'pending'
if registry_exists(task_id) == true
print 'Pending'
else
print 'Task does not exist'

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.