I'm building an application that can run user-submitted Python code. I'm considering the following approaches:
1. Spinning up a new AWS Lambda function for each user's request to run the submitted code in it, then deleting the function afterwards. I'm aware of AWS Lambda's time limit, so this would only be used to run small functions.
2. Spinning up a new EC2 instance to run a user's code, one instance per user. Keep the instance running while the user is still interacting with my application, and terminate it after the user is done.
3. Same as the second approach, but also spinning up a Docker container inside the EC2 instance to add an additional layer of isolation (is this necessary?).
Are there any security vulnerabilities I need to be aware of? Will the user be able to do anything if they gain access to environment variables in their own Lambda function/EC2 machine? Are there any better solutions?
Any code you run on AWS Lambda will have the permissions of the function's execution role. Be very careful what you grant it.
Even logging and metrics access can be manipulated to incur additional costs.
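To make the risk concrete: inside the Lambda sandbox, user-submitted code can read the execution role's temporary credentials straight from the environment, since Lambda injects them as standard variables. A minimal sketch of what hostile code could do (the variable names below are the ones Lambda actually sets):

```python
import os

# Lambda exposes the execution role's temporary credentials as environment
# variables; any code running in the function can read them and use them
# with an AWS SDK from anywhere until the session token expires.
for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"):
    print(name, "=", os.environ.get(name))
```

So the answer to "can the user do anything with the environment variables?" is yes: they get whatever the execution role allows, which is why the role should carry no permissions beyond what the sandbox strictly needs.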
I've found that my Azure Function is unavailable for approx. 3-5 minutes when publishing changes. During that window, triggers are not recognized or captured in App Insights.
This is not a big deal at the moment, but if this Function ever requires increased uptime, how can I ensure it will be available during code changes?
As mentioned in the comments, you just need to use Azure Functions deployment slots: publish your changes to a staging slot, and once the deployment has finished, swap the staging slot with production.
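For example, with the Azure CLI the swap looks roughly like this (the resource names are placeholders, and the staging slot must already exist):

```sh
az functionapp deployment slot swap \
    --resource-group my-rg \
    --name my-function-app \
    --slot staging \
    --target-slot production
```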
I have created a Python serverless function in Azure that gets executed when a new file is uploaded to Azure Blob Storage (BlobTrigger). The function extracts certain properties of the file and saves them in the DB. As the next step, I want this function to copy and process the same file inside a container instance running in ACI, with the result of the processing returned back to the same Azure function.
This is a hypothetical architecture that I am currently brainstorming. I wanted to know if it is feasible. Can you provide me some pointers on how I can achieve this?
I don't see any ContainerTrigger-like functionality that would allow me to trigger the container and run my next steps.
I have tried the code examples mentioned here, but they are not really performing the tasks that I need: https://github.com/Azure-Samples/aci-docs-sample-python/blob/master/src/aci_docs_sample.py
Based on the comments above, you can consider the following.
Azure Container Instances
Deploy your container in ACI (Azure Container Instances) and expose an HTTP endpoint from the container, just like any web URL. Trigger the Azure Function using the blob storage trigger, then pass your blob file's URL to the HTTP endpoint exposed by your container. Process the file there and return the response back to the Azure Function, just like a normal HTTP request/response.
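A minimal sketch of that first option, assuming the container exposes a /process endpoint (the ACI hostname and route here are placeholders):

```python
import logging

import azure.functions as func
import requests

# Hypothetical HTTP endpoint exposed by the container running in ACI.
ACI_ENDPOINT = "http://my-container.eastus.azurecontainer.io:8080/process"

def main(blob: func.InputStream):
    logging.info("New blob: %s (%s bytes)", blob.name, blob.length)
    # Hand the blob's URI to the container and wait for the result.
    response = requests.post(ACI_ENDPOINT, json={"blob_uri": blob.uri}, timeout=240)
    response.raise_for_status()
    logging.info("Container returned: %s", response.json())
```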
Alternatively, you can bypass the Azure Function completely and trigger your ACI container instance using Logic Apps, process the file, and save the result directly in the database.
When you are using an Azure Function, make sure the work is short-lived, since an Azure Function will time out after a certain period (5 minutes by default). For long-running processing you may have to consider Azure Durable Functions.
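If the processing can exceed that timeout, a Durable Functions orchestrator can wait on it without keeping the triggering function alive. A minimal sketch, assuming an activity function named "ProcessFile" exists and wraps the call to the container:

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    blob_url = context.get_input()
    # The activity does the long-running work (e.g. calling the container);
    # the orchestrator itself is checkpointed and can outlive the timeout.
    result = yield context.call_activity("ProcessFile", blob_url)
    return result

main = df.Orchestrator.create(orchestrator_function)
```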
The following URL can help you understand this better:
https://github.com/Azure-Samples/aci-event-driven-worker-queue
I am writing a program in Python that will need an uptime of 30 days straight. It connects to an MQTT broker and listens for messages on a number of topics.
I am using an EC2 instance running the Amazon Linux AMI, and I wonder how I can set this up to run constantly for that duration of time.
I have looked at cron jobs and rebooting every X days, but preferably the system should have no downtime at all.
However, I am unsure how to set this up and make sure the script restarts if the server or program ever fails.
The client will connect to an Amazon VPC through OpenVPN, and then run the script and keep it running. Would this be possible to set up?
The version I am running is:
Amazon Linux AMI 2018.03.0.20180811 x86_64 HVM GP2
NAME="Amazon Linux AMI"
VERSION="2018.03"
ID_LIKE="rhel fedora"
VERSION_ID="2018.03"
You can accomplish this by using Auto Scaling to automatically maintain the required number of EC2 instances. If an instance becomes unresponsive or fails health checks, Auto Scaling will launch a new one. See: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-maintain-instance-levels.html
You'll want to create an AMI of your configured system to launch new instances from, or alternatively put your configuration into a user data script.
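For the restart-on-failure part of the question, a process supervisor on the instance is the usual answer. Note that the 2018.03 Amazon Linux AMI uses upstart, so the sketch below assumes Amazon Linux 2 or another systemd-based distribution (the paths and names are illustrative):

```ini
# /etc/systemd/system/mqtt-listener.service
[Unit]
Description=MQTT listener script
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /home/ec2-user/mqtt_listener.py
Restart=always
RestartSec=5
User=ec2-user

[Install]
WantedBy=multi-user.target
```

Enable it with `sudo systemctl enable --now mqtt-listener`; systemd will then start the script on boot and restart it whenever it exits.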
If your use case is simply to receive messages over MQTT, I would recommend that you take a look at the AWS IoT Core service rather than running an EC2 instance. This solves your downtime issue because it's a managed service with a high degree of resiliency built in.
You can choose to route the messages to a variety of targets, including storing them in S3 for batch processing or using AWS Lambda to process them as they arrive, without having to run EC2 instances. With Lambda, you get 1 million invocations per month for free, so if your volume is below this, your compute costs will be zero too.
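As a sketch of the Lambda side (the topic filter and payload shape depend on your IoT rule; the rule SQL below is an assumption):

```python
import json

def lambda_handler(event, context):
    # With an IoT rule such as SELECT * FROM 'sensors/#', `event` is the
    # published MQTT message payload, already parsed from JSON.
    print("Received message:", json.dumps(event))
    # ... process or store the message here ...
    return {"status": "ok"}
```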
I am running a Spark step on AWS EMR; the step is added to EMR through Boto3. I would like to return to the user a percentage of completion of the task. Is there any way to do this?
I was thinking of calculating this percentage from the number of completed Spark stages. I know this won't be very precise, as stage 4 may take double the time of stage 5, but I am fine with that.
Is it possible to access this information with boto3?
I checked the list_steps method (here are the docs), but in the response I only get whether the step is running, without any other information.
DISCLAIMER: I know nothing about AWS EMR and Boto3
I would like to return to the user a percentage of completion of the task. Is there any way to do this?
Any way? Perhaps. Just register a SparkListener and intercept events as they come. That's how the web UI works under the covers (and it is the definitive source of truth for Spark applications).
Use the spark.extraListeners property to register a SparkListener and do whatever you want with the events.
Quoting Application Properties from the official documentation:
spark.extraListeners A comma-separated list of classes that implement SparkListener; when initializing SparkContext, instances of these classes will be created and registered with Spark's listener bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor will be called; otherwise, a zero-argument constructor will be called.
You could also consider the REST API interface:
In addition to viewing the metrics in the UI, they are also available as JSON. This gives developers an easy way to create new visualizations and monitoring tools for Spark. The JSON is available for both running applications, and in the history server. The endpoints are mounted at /api/v1. E.g., for the history server, they would typically be accessible at http://<server-url>:18080/api/v1, and for a running application, at http://localhost:4040/api/v1.
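Building on the stage-counting idea from the question, a rough sketch that polls the running application's REST API and reports the fraction of completed stages (on EMR you may need an SSH tunnel or proxy to reach the driver UI; the port below is the default):

```python
import requests

SPARK_UI = "http://localhost:4040"  # driver UI of the running application

def progress_percent() -> float:
    apps = requests.get(f"{SPARK_UI}/api/v1/applications").json()
    app_id = apps[0]["id"]
    stages = requests.get(f"{SPARK_UI}/api/v1/applications/{app_id}/stages").json()
    if not stages:
        return 0.0
    done = sum(1 for stage in stages if stage["status"] == "COMPLETE")
    return 100.0 * done / len(stages)

print(f"~{progress_percent():.0f}% of stages complete")
```

Note that the denominator grows as new stages are submitted, so the percentage is only a rough signal, as the question already anticipates.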
This is not supported at the moment and I don't think it will be anytime soon.
You'll just have to follow the application logs the old-fashioned way. So maybe consider formatting your logs in a way that makes it clear what has actually finished.
I wrote a Python script that pulls data from a 3rd-party API and pushes it into a SQL table I set up in AWS RDS. I want to automate this script so that it runs every night (the script only takes about a minute to run). I need to find a good place and way to set this up so that it runs each night.
I could set up an EC2 instance, and a cron job on that instance, and run it from there, but it seems expensive to keep an EC2 instance alive all day for only 1 minute of run-time per night. Would AWS data pipeline work for this purpose? Are there other better alternatives?
(I've seen similar topics discussed when googling around but haven't seen recent answers.)
Thanks
Based on your case, I think you can try using ShellCommandActivity in Data Pipeline. It will launch an EC2 instance and execute the command you give to Data Pipeline on your schedule. After the task finishes, the pipeline will terminate the EC2 instance.
Here are the docs:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
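Pulling those two objects together, a trimmed sketch of what the pipeline definition could look like (the IDs, script path, instance type, and schedule are illustrative; see the ShellCommandActivity and Ec2Resource docs above for the full schemas):

```json
{
  "objects": [
    {
      "id": "NightlyEc2",
      "type": "Ec2Resource",
      "instanceType": "t2.micro",
      "terminateAfter": "15 Minutes"
    },
    {
      "id": "RunScript",
      "type": "ShellCommandActivity",
      "command": "python /home/ec2-user/pull_and_push.py",
      "runsOn": { "ref": "NightlyEc2" },
      "schedule": { "ref": "Nightly" }
    },
    {
      "id": "Nightly",
      "type": "Schedule",
      "period": "1 Day",
      "startDateTime": "2018-01-01T02:00:00"
    }
  ]
}
```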
Alternatively, you could use a 3rd-party service like Crono. Crono is a simple REST API to manage time-based jobs programmatically.