I am not able to pass a parameter to an AWS Batch job - Python

I have a Dockerfile with the following lines:
FROM python
COPY sysargs.py /
CMD python sysargs.py --date Ref::date
And my python file sysargs.py looks like this:
import sys
print('The command line arguments are:')
a = sys.argv[1]
print(a)
I just want to pass the date parameter and print it, but after passing the date value I get "Ref::date" as output.
Can someone help me understand what I have done wrong?
I am trying to replicate what is described in how to retrieve aws batch parameter value in python?.

In the context of your Dockerfile, Ref::date is just a string, hence that is what the Python script prints as output.
If you want date to be a value passed in from an external source, then you can use ARG. Take 10 minutes to read through this guide.

My experience with AWS Batch "parameters"
I have been working on a project for about 4 months now. One of my tasks was to connect several AWS services together to process, in the cloud, the most recently uploaded file that an application had placed in an S3 bucket.
What I needed
The way this works is the following. Through a website, a user uploads a file that is sent to a back-end server and then to an S3 bucket. This event triggers an AWS Lambda function, which creates and runs an instance of an AWS Batch job that had already been defined (based on a Docker image); the job retrieves the file from the S3 bucket, processes it, and saves some results in a database. By the way, all the code I am using is written in Python.
Everything worked like a charm until I found it really hard to get the filename of the S3 object that generated the event as a parameter inside the Python script being executed in the Docker container run by the AWS Batch job.
What I did
After a lot of research and development, I came up with a solution to my problem. The issue was that the word "parameter", for AWS Batch jobs, does not mean what a user may expect. Instead, we need to use containerOverrides, as I show below: defining an "environment" variable inside the running container by providing a name/value pair for that variable.
# At some point we had defined aws_batch like this:
#
# aws_batch = boto3.client(
#     service_name="batch",
#     region_name='<OurRegion>',
#     aws_access_key_id='<AWS_ID>',
#     aws_secret_access_key='<AWS_KEY>',
# )
aws_batch.submit_job(
    jobName='TheJobNameYouWant',
    jobQueue='NameOfThePreviouslyDefinedQueue',
    jobDefinition='NameOfThePreviouslyDefinedJobDefinition',
    # parameters={                 # THIS DOES NOT WORK
    #     'FILENAME': FILENAME     # THIS DOES NOT WORK
    # },                           # THIS DOES NOT WORK
    containerOverrides={
        'environment': [
            {
                'name': 'filename',
                'value': 'name_of_the_file.png'
            },
        ],
    },
)
This way, from my Python script, inside the Docker container, I could access the environment variable value using the well-known os.getenv('<ENV_VAR_NAME>') function.
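For completeness, here is a minimal sketch of the consuming side, assuming the environment variable name 'filename' matches the one set in containerOverrides above:
import os

# 'filename' must match the 'name' used in containerOverrides when submitting the job
filename = os.getenv('filename')
if filename is None:
    raise RuntimeError("Expected environment variable 'filename' was not set by the job")
print(f'Processing {filename}')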
You can also check in your AWS console, under the Batch menu, both the Job configuration and Container details tabs, to make sure everything makes sense. The container that the job runs will never see the job parameters; it will, however, see the environment variables.
Final notes
I do not know if there is a better way to solve this; for now, I am sharing with the community something that does work.
I have tested it myself, and the main idea came from reading the links that I list below:
AWSBatchJobs parameters (Use just as context info)
submit_job function AWS Docs (Ideal to learn about what kind of actions we are allowed to do or configure when creating a job)
I honestly hope this helps you and wish you a happy coding!

Related

How does a Python AWS Lambda interact with the uploaded file?

I'm trying to do the following:
When I upload a file to my S3 storage, the Lambda picks up this JSON file and converts it into a CSV file.
How can I specify in the Lambda code which file it must pick up?
Example of my code running locally:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
In this example, I provide the name of the file... but how can I manage that in a Lambda?
I think I don't understand how Lambda works... could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage, with a default size of 512 MB.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
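As a rough illustration of that in-memory variant (a sketch only; the bucket and key names are placeholders, and boto3/pandas must be packaged with the function):
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')

# Read the JSON object straight into memory instead of writing it to /tmp
obj = s3.get_object(Bucket='my-bucket', Key='movies.json')
df = pd.read_json(io.BytesIO(obj['Body'].read()))

# Write the CSV straight back to S3, again without touching local storage
s3.put_object(
    Bucket='my-bucket',
    Key='csv-movies.csv',
    Body=df.to_csv(index=False).encode('utf-8'),
)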
In order to achieve this, you will need to use S3 as the event source for your Lambda. There is a useful tutorial for this provided by AWS themselves, with some sample Python code to assist you; you can view it here.
To break it down slightly further and answer how you get the name of the file: the Lambda handler will look similar to the following:
def lambda_handler(event, context):
What is important here is the event object. When your event source is the S3 bucket, you will be given the name of the bucket and the S3 key of the object, which is effectively the path to the file in the bucket. With this information you can decide whether you want to download the file from that path. If you do, you can use the S3 get_object() API call as shown in the tutorial.
Once this file is downloaded, it can be used like any other file on your local machine, so you can then proceed to convert the JSON to a CSV. Once it is converted, you will presumably want to put it back in S3; for this you can use the S3 put_object() call and reuse the information in the event object to specify the path.
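Putting those pieces together, a minimal handler might look like the sketch below (the /tmp paths and the derived CSV key are only examples; adjust them to your own naming):
import urllib.parse
import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # The S3 event record tells us which object triggered the function
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = urllib.parse.unquote_plus(record['object']['key'])  # e.g. 'movies.json'

    local_json = '/tmp/input.json'
    local_csv = '/tmp/output.csv'

    # Download the uploaded JSON, convert it, and upload the CSV back
    s3.download_file(bucket, key, local_json)
    pd.read_json(local_json).to_csv(local_csv, index=False)
    s3.upload_file(local_csv, bucket, key.rsplit('.', 1)[0] + '.csv')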

Is it possible to pass an environment variable to VM Instance start up script using Python?

I am using Python in a cloud function to switch on a VM instance when the function is triggered.
request = service.instances().start(project=project, zone=zone, instance=instance)
response = request.execute()
The VM instance will in turn run a start up script when it starts up.
However, is it possible to pass an environment variable to the startup script via the above Python command? If so, how would I do it?
I am planning on having a conditional in my instance start up script which does something like:
if env variable is 'x':
    run python script x.py
else:
    run python script y.py
etc...
Thanks
I'm afraid you won't be able to directly set an environment variable, but you could use a metadata parameter for your VM. Here's how you could do it:
1. Modify the metadata set of your VM before turning it on. Here's how.
2. Start the VM.
3. In your startup script, query the metadata server to retrieve the value you configured in step 1. How to retrieve metadata values?
4. Treat the retrieved value as if it were an environment variable.
I hope this helps you achieve your goal :)
EDIT: Guest attributes, suggested in the comment by guillaume blaquiere, would work the same way; they are just a part of the instance metadata.
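As a rough sketch of steps 1 and 2 in Python (assuming the same googleapiclient compute service object as in the question; the metadata key script_mode is only an example name):
def set_metadata_and_start(service, project, zone, instance, value):
    # The current metadata fingerprint is required when updating metadata
    info = service.instances().get(project=project, zone=zone, instance=instance).execute()
    metadata = info['metadata']
    items = [i for i in metadata.get('items', []) if i['key'] != 'script_mode']
    items.append({'key': 'script_mode', 'value': value})

    service.instances().setMetadata(
        project=project, zone=zone, instance=instance,
        body={'fingerprint': metadata['fingerprint'], 'items': items},
    ).execute()

    # Start the VM; the startup script can then read the value from the metadata
    # server at .../computeMetadata/v1/instance/attributes/script_mode
    # (with the header 'Metadata-Flavor: Google') and branch on it.
    service.instances().start(project=project, zone=zone, instance=instance).execute()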

I want to trigger a Python script using a Cloud Function whenever a specified file is created on Google Cloud Storage

One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the file upload happens very late. So I want to create a Cloud Function that can trigger my Python BQ load script whenever the file is uploaded to the storage bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a detailed description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
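As a rough sketch of that first approach (assuming the google-cloud-bigquery client library and a destination table such as my_dataset.seller_data; these names are placeholders):
from google.cloud import bigquery

def on_file_upload(event, context):
    """Storage-triggered Cloud Function: loads the uploaded CSV into BigQuery."""
    bucket = event['bucket']
    name = event['name']

    # Only react to the daily seller file, e.g. 'seller_data_2021-06-22.csv'
    if not name.startswith('seller_data_'):
        return

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    load_job = client.load_table_from_uri(
        f'gs://{bucket}/{name}', 'my_dataset.seller_data', job_config=job_config
    )
    load_job.result()  # wait for the load job to finish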
The hard way: App Engine with a few tricks.
Having a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to check whether the file exists, download the object, manipulate it, and then do something with it.
This route can act as a custom HTTP-triggered function: once it receives a GET (or POST) request (which could come from a simple curl request, a visit from the browser, a Pub/Sub event, or even another Cloud Function), it downloads the object into the /tmp dir, processes it, and then does something.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay always alive, which means you will not have cold starts or risk the request timing out before the job is done.
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile; keep in mind that Cloud Run will scale down to zero when there's no usage, along with other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not get deep into it as it's overkill by itself.
However, keep in mind that if your file is large you might cause high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., as it will make sure not to keep files open.
b) Downloading the latest object in the bucket:
This is a bit tricky and it needs some extra custom code. There are many ways to achieve it, but the one I use (always paying close attention to memory usage, though) is this: upon creation of the object I upload to the bucket, I get the current time and use a regex to transform it into something like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check if the filename I have in /tmp is the same as the name of the last object in the bucket listing. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
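For reference, here is a small sketch of the "grab the newest object" idea using the google-cloud-storage client; it sorts by creation time rather than by name, which avoids depending on the naming scheme (the bucket name and destination path are placeholders):
from google.cloud import storage

def download_latest(bucket_name, destination='/tmp/latest_file'):
    client = storage.Client()
    blobs = list(client.list_blobs(bucket_name))
    if not blobs:
        return None
    # Pick the most recently created object instead of relying on name order
    latest = max(blobs, key=lambda b: b.time_created)
    latest.download_to_filename(destination)
    return latest.name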

GEE Python API: Export image to Google Drive fails

Using the GEE Python API in an application running with App Engine (on localhost), I am trying to export an image to a file in Google Drive. The task seems to start and complete successfully, but no file is created in Google Drive.
I have tried to execute the equivalent JavaScript code in the GEE code editor and this works; the file is created in Google Drive.
In Python, I have tried various ways to start the task, but it always gives me the same result: the task completes but no file is created.
My python code is as follows:
landsat = ee.Image('LANDSAT/LC08/C01/T1_TOA/LC08_123032_20140515').select(['B4', 'B3', 'B2'])
geometry = ee.Geometry.Rectangle([116.2621, 39.8412, 116.4849, 40.01236])
task_config = {
    'description': 'TEST_todrive_desc',
    'scale': 30,
    'region': geometry,
    'folder': 'GEEtest'
}
task = ee.batch.Export.image.toDrive(landsat, 'TEST_todrive', task_config)
ee.batch.data.startProcessing(task.id, task.config)
# Note: I also tried task.start() instead of this last line, but the problem is the same: task completed, no file created.
# Printing the task list successively
for i in range(10):
    tasks = ee.batch.Task.list()
    print(tasks)
    time.sleep(5)
In the printed task list, the status of the task goes from READY to RUNNING and then COMPLETED. But after completion no file is created in Google Drive in my folder "GEEtest" (nor anywhere else).
What am I doing wrong?
I think the file has been generated and stored on the Google Drive of the 'Service Account' used for the Python API, not on your private account that is normally used with the web code editor.
You can't pass a dictionary of arguments directly in Python. You need to pass it using the kwargs convention (do a web search for more info). Basically, you just need to preface the task_config argument with double asterisks, like this:
task = ee.batch.Export.image.toDrive(landsat, 'TEST_todrive', **task_config)
Then proceed as you have (I assume your use of task.config rather than task_config in the following line is a typo). Also note that you can query the task directly (using e.g. task.status()), which may give more information about when and why the task failed. This isn't well documented as far as I can tell, but you can read about it in the API code.
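Putting that together, a sketch of the corrected export plus a simple status poll (the 5-second interval is arbitrary; landsat and task_config are as defined in the question):
import time

task = ee.batch.Export.image.toDrive(landsat, 'TEST_todrive', **task_config)
task.start()

while task.active():
    print(task.status())
    time.sleep(5)
# The final status includes 'state' and, if the export failed, an 'error_message'
print(task.status())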

Communicating with azure container using server-less function

I have created a Python serverless function in Azure that gets executed when a new file is uploaded to Azure Blob Storage (BlobTrigger). The function extracts certain properties of the file and saves them in the DB. As the next step, I want this function to copy and process the same file inside a container instance running in ACS. The result of the processing should be returned back to the same Azure function.
This is a hypothetical architecture that I am currently brainstorming. I wanted to know if it is feasible. Can you provide me some pointers on how I can achieve this?
I don't see any ContainerTrigger kind of functionality that would allow me to trigger the container and process my next steps.
I have tried utilizing the code examples mentioned here, but they are not really performing the tasks that I need: https://github.com/Azure-Samples/aci-docs-sample-python/blob/master/src/aci_docs_sample.py
Based on the comments above, you can consider the following.
Azure Container Instance
Deploy your container in ACI (Azure Container Instances) and expose an HTTP endpoint from the container, just like any web URL. Trigger the Azure Function using a blob storage trigger and then pass your blob file URL to the HTTP endpoint exposed by your container. Process the file there and return the response back to the Azure Function, just like a normal HTTP request/response.
You can also completely bypass the Azure Function and trigger your ACI (container instance) using Logic Apps, process the file, and save it directly to the database.
When you are using an Azure Function, make sure it is a short-lived process, since the Azure Function will exit after a certain time (default 5 minutes). For long processing you may have to consider Azure Durable Functions.
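A minimal sketch of the first option, assuming the v1 Python programming model for Azure Functions and a container that exposes a /process route (the ACI hostname and route are placeholders, and requests must be listed in requirements.txt):
import logging
import requests
import azure.functions as func

ACI_ENDPOINT = 'http://<your-aci-dns-label>.<region>.azurecontainer.io/process'

def main(myblob: func.InputStream):
    logging.info('Processing blob: %s (%s bytes)', myblob.name, myblob.length)

    # Hand the blob reference to the container and wait for its result
    resp = requests.post(ACI_ENDPOINT, json={'blob_name': myblob.name}, timeout=300)
    resp.raise_for_status()

    # ...save resp.json() to the database here...
    logging.info('Container returned: %s', resp.json())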
The following URL can help you understand this better.
https://github.com/Azure-Samples/aci-event-driven-worker-queue
