AWS Batch Job Execution Results in Step Function - python

I'm new to AWS Step Functions and AWS Batch. I'm trying to integrate an AWS Batch job with a Step Functions state machine. The Batch job executes a simple Python script that outputs a string value (a high-level, simplified requirement). I need the Python script's output to be available to the next state of the step function. How can I accomplish this? The AWS Batch job output does not contain the results of the Python script; instead it contains all the container-related information along with the input values.
Example: the AWS Batch job executes a Python script that outputs "Hello World". I need "Hello World" available to the next state of the step function so it can be passed to the Lambda associated with that state.

I was able to do it. Below is my state machine; I took the sample project for running a batch job, Manage a Batch Job (AWS Batch, Amazon SNS), and modified it with two Lambdas for passing input/output.
{
  "Comment": "An example of the Amazon States Language for notification on an AWS Batch job completion",
  "StartAt": "Submit Batch Job",
  "TimeoutSeconds": 3600,
  "States": {
    "Submit Batch Job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "BatchJobNotification",
        "JobQueue": "arn:aws:batch:us-east-1:1234567890:job-queue/BatchJobQueue-737ed10e7ca3bfd",
        "JobDefinition": "arn:aws:batch:us-east-1:1234567890:job-definition/BatchJobDefinition-89c42b1f452ac67:1"
      },
      "Next": "Notify Success",
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Notify Failure"
        }
      ]
    },
    "Notify Success": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:readcloudwatchlogs",
      "Parameters": {
        "LogStreamName.$": "$.Container.LogStreamName"
      },
      "ResultPath": "$.lambdaOutput",
      "Next": "ConsumeLogs"
    },
    "ConsumeLogs": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:1234567890:function:consumelogs",
      "Parameters": {
        "randomstring.$": "$.lambdaOutput.logs"
      },
      "End": true
    },
    "Notify Failure": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Message": "Batch job submitted through Step Functions failed",
        "TopicArn": "arn:aws:sns:us-east-1:1234567890:StepFunctionsSample-BatchJobManagement17968f39-e227-47ab-9a75-08a7dcc10c4c-SNSTopic-1GR29R8TUHQY8"
      },
      "End": true
    }
  }
}
The key to reading the logs was in the Submit Batch Job output, which contains LogStreamName. I passed it to my Lambda named function:readcloudwatchlogs, read the logs there, and then passed the logs on to the next function named function:consumelogs. You can see in the attached screenshot the consumelogs function printing the logs.
{
  "Attempts": [
    {
      "Container": {
        "ContainerInstanceArn": "arn:aws:ecs:us-east-1:1234567890:container-instance/BatchComputeEnvironment-4a1593ce223b3cf_Batch_7557555f-5606-31a9-86b9-83321eb3e413/6d11fdbfc9eb4f40b0d6b85c396bb243",
        "ExitCode": 0,
        "LogStreamName": "BatchJobDefinition-89c42b1f452ac67/default/2ad955bf59a8418893f53182f0d87b4b",
        "NetworkInterfaces": [],
        "TaskArn": "arn:aws:ecs:us-east-1:1234567890:task/BatchComputeEnvironment-4a1593ce223b3cf_Batch_7557555f-5606-31a9-86b9-83321eb3e413/2ad955bf59a8418893f53182f0d87b4b"
      },
      "StartedAt": 1611329367577,
      "StatusReason": "Essential container in task exited",
      "StoppedAt": 1611329367748
    }
  ],
  "Container": {
    "Command": [
      "echo",
      "Hello world"
    ],
    "ContainerInstanceArn": "arn:aws:ecs:us-east-1:1234567890:container-instance/BatchComputeEnvironment-4a1593ce223b3cf_Batch_7557555f-5606-31a9-86b9-83321eb3e413/6d11fdbfc9eb4f40b0d6b85c396bb243",
    "Environment": [
      {
        "Name": "MANAGED_BY_AWS",
        "Value": "STARTED_BY_STEP_FUNCTIONS"
      }
    ],
    "ExitCode": 0,
    "Image": "137112412989.dkr.ecr.us-east-1.amazonaws.com/amazonlinux:latest",
    "LogStreamName": "BatchJobDefinition-89c42b1f452ac67/default/2ad955bf59a8418893f53182f0d87b4b",
    "TaskArn": "arn:aws:ecs:us-east-1:1234567890:task/BatchComputeEnvironment-4a1593ce223b3cf_Batch_7557555f-5606-31a9-86b9-83321eb3e413/2ad955bf59a8418893f53182f0d87b4b",
    ..
  },
  ..
  "Tags": {
    "resourceArn": "arn:aws:batch:us-east-1:1234567890:job/d36ba07a-54f9-4acf-a4b8-3e5413ea5ffc"
  }
}
Read Logs Lambda code:
import boto3

client = boto3.client('logs')

def lambda_handler(event, context):
    print(event)
    # LogStreamName is passed in from the "Notify Success" state's Parameters
    response = client.get_log_events(
        logGroupName='/aws/batch/job',
        logStreamName=event.get('LogStreamName')
    )
    # Return the first log event's message (the script's "Hello World" output)
    log = {'logs': response['events'][0]['message']}
    return log
Consume Logs Lambda code:
import json

print('Loading function')

def lambda_handler(event, context):
    # event contains {'randomstring': <log message from the previous state>}
    print(event)

You could pass your step function execution ID ($$.Execution.Id) to the batch process, and the batch process could then write its response to DynamoDB using the execution ID as the primary key (or another field). You would then need a subsequent step to read directly from DynamoDB and capture the process response.
I have been on the hunt for a way to do this without the subsequent step, but thus far no dice.
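For illustration, a rough sketch of that hand-off in boto3, assuming a hypothetical DynamoDB table named batch-results with execution_id as its partition key (the table and attribute names are placeholders, not part of the original answer):
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('batch-results')  # hypothetical table name

# Inside the Batch container: store the script's result keyed by the execution ID,
# which the state machine passes in (e.g. as a job parameter "ExecutionId.$": "$$.Execution.Id").
def save_result(execution_id, result):
    table.put_item(Item={'execution_id': execution_id, 'result': result})

# In the subsequent Lambda state: read the result back with the same execution ID.
def lambda_handler(event, context):
    item = table.get_item(Key={'execution_id': event['ExecutionId']})
    return item['Item']['result']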

While you can't do waitForTaskToken with submitJob, you can still use the callback pattern by passing the task token in the Parameters and referencing it in the command override with Ref::TaskToken:
...
"Submit Batch Job": {
  "Type": "Task",
  "Resource": "arn:aws:states:::batch:submitJob.sync",
  "Parameters": {
    "Parameters": {
      "TaskToken.$": "$$.Task.Token"
    },
    "ContainerOverrides": {
      "Command": [
        "python3",
        "my_script.py",
        "Ref::TaskToken"
      ]
    }
  }
...
Then when your script is done doing its processing, you just call StepFunctions.SendTaskSuccess or StepFunctions.SendTaskFailure:
import sys
import boto3

client = boto3.client('stepfunctions')

def main():
    args = sys.argv[1:]
    # args[0] is the task token passed in via Ref::TaskToken; output must be a valid JSON string
    client.send_task_success(taskToken=args[0], output='"Hello World"')
This will tell StepFunctions your job is complete and the output should be 'Hello World'. This pattern can also be useful if your Batch job completes the work required to resume the state machine, but needs to do some cleanup work afterward. You can send_task_success with the results and the state machine can resume while the Batch job does the cleanup work.

Thanks @samtoddler for your answer.
We used it for a while.
However, my friend @liorzimmerman recently found a better solution, using stepfunctions send-task-success.
When calling the job from the state machine you need to send the task token:
"States": {
"XXX_state": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobDefinition": "arn:aws:batch:us-east-1:XXX:job-definition/task_XXX:4",
"JobQueue": "arn:aws:batch:us-east-1:XXX:job-queue/XXX-queue",
"JobName": "XXX",
"Parameters": {
"TASK_TOKEN.$": "$$.Task.Token",
}
},
"ResultPath": "$.payload",
"End": true
}
Next, inside the Docker container run by the job, the results are sent with:
aws stepfunctions send-task-success --task-token $TASK_TOKEN --task-output $OUTPUT_JSON
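Note that the value passed to --task-output has to be a valid JSON string, for example a hypothetical payload such as '{"result": "Hello World"}' built by the script before calling the CLI.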

Related

AWS Python StepFunctions stepfunctions.steps.Parallel [Generate Definition]

Is there an example of a stepfunctions.steps.Parallel class implementation in the Python AWS Step Functions Data Science SDK?
Parallel execution requires branches, but I can't seem to find the methods or documentation describing them.
Generating a sequential list of steps works fine, but I can't find how to define the parallel step. Does anyone know?
Are there any other libraries that can do this? As far as I have looked, boto3 doesn't have the functionality, and the CDK is not suitable, as this will be a service.
I'd like to be able to generate something like this using just code:
{
  "Comment": "Parallel Example.",
  "StartAt": "LookupCustomerInfo",
  "States": {
    "LookupCustomerInfo": {
      "Type": "Parallel",
      "End": true,
      "Branches": [
        {
          "StartAt": "LookupAddress",
          "States": {
            "LookupAddress": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:AddressFinder",
              "End": true
            }
          }
        },
        {
          "StartAt": "LookupPhone",
          "States": {
            "LookupPhone": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:PhoneFinder",
              "End": true
            }
          }
        }
      ]
    }
  }
}
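For what it's worth, here is a minimal sketch of how such a definition could be built with the Data Science SDK's Parallel class and its add_branch() method, assuming the stepfunctions package is installed; the workflow name and IAM role ARN are placeholders, and this is an illustration rather than a verified, complete program:
from stepfunctions.steps import Chain, Parallel, Task
from stepfunctions.workflow import Workflow

# Each branch of a Parallel state is a state (or a Chain of states) added via add_branch()
parallel_state = Parallel("LookupCustomerInfo")
parallel_state.add_branch(
    Task("LookupAddress", resource="arn:aws:lambda:us-east-1:123456789012:function:AddressFinder")
)
parallel_state.add_branch(
    Task("LookupPhone", resource="arn:aws:lambda:us-east-1:123456789012:function:PhoneFinder")
)

workflow = Workflow(
    name="ParallelExample",  # placeholder name
    definition=Chain([parallel_state]),
    role="arn:aws:iam::123456789012:role/StepFunctionsWorkflowExecutionRole",  # placeholder role
)
print(workflow.definition.to_json(pretty=True))  # renders the ASL JSON shown above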

Boto3 Backup Waiters

I have a script that automates restore jobs from AWS Backup.
I am taking guidance from the boto3 documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/backup.html
I am using the function start_restore_job() to start a job and then describe_restore_job() to query the CreatedResourceArn.
After a restore job is launched, I need to wait for the restore to complete so that I can query the CreatedResourceArn. The issue is that AWS Backup doesn't have any waiters defined in its documentation. Does someone know how to do this?
Also, going through the docs, I see the function get_waiter(). Why is this function available when there are no waiters defined for AWS Backup?
It looks like a waiter doesn't exist for this, but you can create your own custom waiter like this:
import boto3
from botocore.waiter import WaiterModel
from botocore.waiter import create_waiter_with_client

client = boto3.client('backup')

waiter_name = "BackupCompleted"
waiter_config = {
    "version": 2,
    "waiters": {
        "BackupCompleted": {
            "operation": "DescribeRestoreJob",
            "delay": 60,  # Number of seconds to delay
            "maxAttempts": 5,  # Max attempts before failure
            "acceptors": [
                {
                    "matcher": "path",
                    "expected": "COMPLETED",
                    "argument": "Status",
                    "state": "success"
                },
                {
                    "matcher": "path",
                    "expected": "ABORTED",
                    "argument": "Status",
                    "state": "failure"
                },
                {
                    "matcher": "path",
                    "expected": "FAILED",
                    "argument": "Status",
                    "state": "failure"
                }
            ]
        }
    }
}

waiter_model = WaiterModel(waiter_config)
backup_waiter = create_waiter_with_client(waiter_name, waiter_model, client)
backup_waiter.wait(RestoreJobId='MyRestoreJobId')
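To connect this back to the question's flow, here is a sketch of how the custom waiter could sit between start_restore_job() and the CreatedResourceArn lookup (the recovery point ARN, IAM role ARN, and metadata below are placeholders you would supply from your own vault):
# Start the restore, wait for it to finish, then read the ARN of the created resource
job = client.start_restore_job(
    RecoveryPointArn='<recovery-point-arn>',  # placeholder
    IamRoleArn='arn:aws:iam::123456789012:role/BackupRestoreRole',  # placeholder
    Metadata={}  # restore metadata required for your resource type
)
restore_job_id = job['RestoreJobId']

backup_waiter.wait(RestoreJobId=restore_job_id)

created_resource_arn = client.describe_restore_job(
    RestoreJobId=restore_job_id
)['CreatedResourceArn']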

Airflow/Amazon EMR: The VPC/subnet configuration was invalid: Subnet is required : The specified instance type m5.xlarge can only be used in a VPC

I want to create an EMR cluster on Amazon EMR, triggered via Airflow. The EMR cluster shows up in the Amazon EMR UI but with an error saying:
"The VPC/subnet configuration was invalid: Subnet is required : The specified instance type m5.xlarge can only be used in a VPC"
Below are the code snippet and the config details in JSON format for this task, as used in the Airflow script.
My question is: how can I incorporate the information (the IDs) for the VPC and subnet in the JSON (if this is even possible)? There are no explicit examples out there.
Hint: a network and an EC2 subnet are already created.
JOB_FLOW_OVERRIDES = {
    "Name": "sentiment_analysis",
    "ReleaseLabel": "emr-5.33.0",
    "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],  # We want our EMR cluster to have HDFS and Spark
    "Configurations": [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},  # by default EMR uses py2, change it to py3
                }
            ],
        }
    ],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Master node",
                "Market": "SPOT",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core - 2",
                "Market": "SPOT",  # Spot instances are a "use as available" instances
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 2,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,  # this lets us programmatically terminate the cluster
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}
create_emr_cluster = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides=JOB_FLOW_OVERRIDES,
    aws_conn_id="aws_default",
    emr_conn_id="emr_default",
    dag=dag,
)
EmrCreateJobFlowOperator calls create_job_flow from emr.py, which takes the same parameters as the boto3 EMR client's run_job_flow API.
Therefore you can put an item "Ec2SubnetId", with your subnet ID as its value, inside the "Instances" dictionary, as shown below.
This works for me on Apache Airflow 2.0.2.
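For example, the relevant part of the question's JOB_FLOW_OVERRIDES would then look something like this (the subnet ID is a placeholder for your existing EC2 subnet):
"Instances": {
    "InstanceGroups": [
        # ... master and core instance groups as above ...
    ],
    "Ec2SubnetId": "subnet-0123456789abcdef0",  # placeholder: your existing subnet ID
    "KeepJobFlowAliveWhenNoSteps": True,
    "TerminationProtected": False,
},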

Unable to attach Data disk to Azure VM

I created a shared disk following Create Disk, and when I try to attach it to a VM via Update VM, I'm getting an error that createOption cannot be changed. Below is the full error:
Disk attachment failed, request response is - {
  "error": {
    "code": "PropertyChangeNotAllowed",
    "message": "Changing property 'dataDisk.createOption' is not allowed.",
    "target": "dataDisk.createOption"
  }
}
Request body for data disk creation (please note this is a shared disk):
{
  "location": LOCATION,
  "sku": {
    "name": "Premium_LRS"
  },
  "properties": {
    "creationData": {
      "createOption": "empty"
    },
    "osType": "linux",
    "diskSizeGB": SIZE,
    "maxShares": 5,
    "networkAccessPolicy": "AllowAll"
  }
}
Request body for the VM PATCH request:
{
  "properties": {
    "storageProfile": {
      "dataDisks": [
        {
          "caching": "ReadOnly",
          "createOption": "Attach",
          "lun": 0,
          "managedDisk": {
            "id": disk_id, // this disk_id is the id of the disk created above
            "storageAccountType": "Premium_LRS"
          }
        }
      ]
    }
  }
}
Can someone please point out where I'm going wrong? I haven't found much documentation about shared disk attachment through the API.
As far as I can see, there is no problem with your request body that updates the VM. I tried it just now with the same request body as yours and it works fine. So you need to check the disk setup again, such as whether LUN 0 is already in use.
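For instance, one quick way to see which LUNs are already taken is with the azure-mgmt-compute SDK; the subscription, resource group, and VM names below are placeholders, and the original answer did not prescribe this particular check:
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Placeholders: substitute your own subscription ID, resource group, and VM name
compute_client = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")
vm = compute_client.virtual_machines.get("<resource-group>", "<vm-name>")

# List the LUNs already attached so you can pick a free one for the shared disk
for disk in vm.storage_profile.data_disks:
    print(disk.lun, disk.name)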

Azure Error: data protection system cannot create a new key because auto-generation of keys is disabled

I am trying to run an Azure Function on my local machine using Visual Studio Code.
My main.py looks like this:
import logging
import azure.functions as func

def main(event: func.EventHubEvent):
    logging.info('Python EventHub trigger processed an event: %s', event.get_body().decode('utf-8'))
My host.json file looks like this:
{
  "version": "2.0",
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[1.*, 2.0.0)"
  }
}
My function.json looks something like this:
{
  "scriptFile": "main.py",
  "bindings": [
    {
      "type": "eventHubTrigger",
      "name": "event",
      "direction": "in",
      "eventHubName": "myhubName",
      "connection": "myHubConnection",
      "cardinality": "many",
      "consumerGroup": "$Default"
    }
  ]
}
The problem is that when I run this, it throws the following error:
A host error has occurred at
Microsoft.AspNetCore.DataProtection: An error occurred while trying to encrypt the provided data. Refer to the inner exception for more information. Microsoft.AspNetCore.DataProtection: The key ring does not contain a valid default protection key. The data protection system cannot create a new key because auto-generation of keys is disabled.
Value cannot be null.
Parameter name: provider
I am not sure what I am missing. Any help is appreciated.
The problem was with the Azure Storage account:
Make sure the local.settings.json has the correct credentials for the storage account
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "MyStorageKey",
    "FUNCTIONS_WORKER_RUNTIME": "python"
  }
}
