I am fairly new to AWS and I need to run a batch process daily and store the data in a MySQL database. Extraction and transformation would take approximately 30 minutes. As a side note, I need to run pandas.
I was reading that Lambda functions are limited to 5 minutes: http://docs.aws.amazon.com/lambda/latest/dg/limits.html
I was thinking of using an EC2 micro instance with Ubuntu or an Elastic Beanstalk instance, and Amazon RDS for the MySQL database.
Am I on the right path? Where is the best place to run my Python code on AWS?
If you need to run these operations once or twice a day, you may want to look into the new AWS Batch service, which will let you run batch jobs without having to worry about DevOps.
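As a rough sketch, assuming you have already created a job queue and a job definition in AWS Batch that point at a container image with pandas installed (the names below are placeholders), submitting the daily run from Python with boto3 could look like this:

    import boto3

    batch = boto3.client("batch")

    # Hypothetical queue/definition names created beforehand in AWS Batch.
    response = batch.submit_job(
        jobName="daily-etl",
        jobQueue="daily-etl-queue",
        jobDefinition="daily-etl-job",
        containerOverrides={
            "environment": [
                {"name": "RUN_DATE", "value": "2017-06-01"},
            ]
        },
    )
    print(response["jobId"])

You could call this from a scheduled CloudWatch Events rule (or a small Lambda) so the job is submitted once a day.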
If you have enough jobs to keep a machine busy for most of the day, I believe the best solution is a Docker-based one, which will let you manage your image more easily and test on your local host (and move to another cloud more easily if you ever have to). AWS ECS makes this about as easy as Elastic Beanstalk.
I have my front end running on Elastic Beanstalk and my back-end workers running on ECS. In my case, my Python workers run in an infinite loop checking for SQS messages, so the server can communicate with them via SQS. I also have CloudWatch rules (as cron jobs) that wake up and call Lambda functions, which then post SQS messages for the workers to handle. I can run three worker containers on the same t2.small ECS instance, and if one of the workers ever fails, ECS will recreate it.
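A minimal sketch of such a worker loop with boto3, assuming a queue name like my-worker-queue and a placeholder handle() function standing in for the real work, is:

    import boto3

    def handle(body):
        # Placeholder for the real work triggered by the message.
        print("processing:", body)

    sqs = boto3.resource("sqs")
    queue = sqs.get_queue_by_name(QueueName="my-worker-queue")  # hypothetical queue

    while True:
        # Long polling: wait up to 20 seconds for a message before looping again.
        for message in queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=20):
            try:
                handle(message.body)
            except Exception:
                # Leave the message on the queue; it becomes visible again after
                # the visibility timeout and can be retried.
                continue
            # Delete the message only after it has been handled successfully.
            message.delete()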
To summarize: run your Python code in Docker on AWS ECS.
I'm using about 2-3 Ubuntu EC2 instances just to run Python scripts (via cron jobs) for different purposes, with RDS for a PostgreSQL database, and all of them have worked well so far. So I think you should give EC2 and RDS a try. Good luck!
I would create an EC2 instance, install Python and MySQL, and host everything on that instance. If you need higher availability, you could use an Auto Scaling group (ASG) to keep at least one instance running. If an AZ goes down, or the system fails, the ASG will launch another instance in a different AZ. Use CloudWatch for EC2 instance monitoring.
If you do not need 24-hour availability for the database, you could even schedule your instance to start and stop when it is not needed, reducing costs.
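A hedged sketch of that start/stop scheduling with boto3 (the instance ID is a placeholder; you could call these functions from two scheduled Lambda functions or cron entries) might look like:

    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance ID

    ec2 = boto3.client("ec2")

    def start_instance():
        # Bring the instance up shortly before the daily batch window.
        ec2.start_instances(InstanceIds=[INSTANCE_ID])

    def stop_instance():
        # Shut it down once the job has finished so you stop paying for it.
        ec2.stop_instances(InstanceIds=[INSTANCE_ID])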
I have a number of very simple Python scripts that are constantly hitting API endpoints 24/7. The amount of data being pulled is very minimal, and they all query the APIs every few seconds. My question is: is it okay to run multiple simple scripts in AWS Lightsail using tmux on a single-core Lightsail instance, or is it better practice to create a new instance for each Python script?
I don't see any limits mentioned for Lightsail that affect your use case. As long as the endpoints are owned by you, or you don't get blocked for hitting them continuously, all seems good.
https://aws.amazon.com/lightsail/faq/
You can also set some alarms on Lightsail instance usage to know if you've hit any limits.
I have a script that downloads large amounts of data from an API. The script takes around two hours to run. I would like to run the script on GCP and schedule it to run once a week on Sundays, so that we have the newest data in our SQL database (also on GCP) by the next day.
I am aware of cron jobs, but I would not like to run an entire server just for this single script. I have taken a look at Cloud Functions and Cloud Scheduler, but because the script takes so long to execute, I cannot run it on Cloud Functions, as the maximum execution time is 9 minutes (from here). Is there any other way I could schedule the Python script to run?
Thank you in advance!
For a script that runs for more than one hour, you need to use Compute Engine (Cloud Run can only run for up to one hour).
However, you can still use Cloud Scheduler to drive it. Here is how:
Create a Cloud Scheduler job with the frequency that you want.
In this scheduler job, call the Compute Engine start API.
In the advanced section, select a service account (create one or reuse one) that has the rights to start a VM instance.
Select OAuth token as the authentication mode (not OIDC).
Create a Compute Engine instance (the one that the Cloud Scheduler job will start).
Add a startup script that triggers your long job.
At the end of the script, add a line that shuts the VM down (with gcloud, for example; see the sketch after the note below).
Note: the startup script runs as the root user. Take care with the default home directory and the permissions of the files it creates.
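As a sketch of that last step, assuming the google-cloud-compute client library and placeholder project/zone/instance names, the long-running job could stop its own VM when it is done (the VM's service account needs permission to stop instances):

    from google.cloud import compute_v1

    def run_job():
        # Placeholder for the ~2 hour download and load into the SQL database.
        pass

    if __name__ == "__main__":
        run_job()
        # Hypothetical project/zone/instance names.
        compute_v1.InstancesClient().stop(
            project="my-project",
            zone="europe-west1-b",
            instance="weekly-download-vm",
        )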
I have a Python script that pulls some data from an Azure Data Lake cluster, performs some simple compute, then stores it into a SQL Server DB on Azure. The whole shebang runs in about 20 seconds. It needs sqlalchemy, pandas, and some Azure data libraries. I need to run this script daily. We also have a Service Fabric cluster available to use.
What are my best options? I thought of containerizing it with Docker and making it an HTTP-triggered API, but then how do I trigger it once per day? I'm not good with Azure or microservices design, so this is where I need the help.
You can use WebJobs in App Service. There are two types of Azure WebJobs to choose from: continuous and triggered. As I see it, you need the triggered type.
You could refer to the document here for more details. In addition, here is a guide that shows how to run tasks in WebJobs.
Also, you can use a timer-based Azure Function in Python, which was made generally available in recent months.
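A minimal sketch of such a timer-triggered function in Python (the daily schedule itself lives in the function's function.json binding, e.g. "0 0 2 * * *" for 02:00 UTC) could look like this:

    import logging
    import azure.functions as func

    def main(mytimer: func.TimerRequest) -> None:
        # The schedule is configured in function.json, not in code.
        if mytimer.past_due:
            logging.info("The timer is past due!")
        # Placeholder for the real work: read from the Data Lake with the Azure SDK,
        # transform with pandas, then write to SQL Server via sqlalchemy.
        logging.info("Daily job executed.")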
I am currently doing some experimentation with AWS Elastic Container Service (ECS) in the context of creating a data processing pipeline, and I had a few questions regarding the specifics of how best to set up the Docker container/ECS task definitions.
The general goal of the project is to create a system that allows users to add data files to an S3 bucket to trigger an ECS task using S3 events and Lambda, then return the outputs to another S3 bucket.
So far I've been able to figure out the S3 triggers and the basics of Lambda, but I am a bit more confused about how to properly set up the Docker container and task definition so that it automatically processes the data using a set of Python scripts. I believe that creating a Docker container that runs a shell script, which copies the necessary files and calls the Python code, makes sense, but I was confused about how to run the Docker container with a bind-mounted volume from an ECS task, and also whether or not this process makes sense. Currently, when I am testing the system on a single EC2 instance, I am running my Docker container using:
docker run -v $(pwd)/data:/home/ec2-user/docker_test/data docker_test
I'm still relatively new to the AWS tools, so please let me know if I can clarify any of my points/questions and thank you in advance!
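For the Lambda-to-ECS hand-off described here, one hedged sketch (cluster, task definition and container names below are placeholders) is to have Lambda start the task and pass the S3 object location as environment variables, so the container downloads its input from S3 instead of relying on a host bind mount:

    import boto3

    ecs = boto3.client("ecs")

    def lambda_handler(event, context):
        # Triggered by the S3 event; forward the bucket/key to an ECS task.
        s3_info = event["Records"][0]["s3"]
        ecs.run_task(
            cluster="data-pipeline-cluster",        # hypothetical cluster name
            taskDefinition="data-processing-task",  # hypothetical task definition
            overrides={
                "containerOverrides": [
                    {
                        "name": "docker_test",      # container name from the task definition
                        "environment": [
                            {"name": "INPUT_BUCKET", "value": s3_info["bucket"]["name"]},
                            {"name": "INPUT_KEY", "value": s3_info["object"]["key"]},
                        ],
                    }
                ]
            },
        )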
I have a Python script that can really eat some CPU and memory. Thus, I figure the most efficient way to run the script is to containerize it in a Docker container. The script is not meant to run forever. Rather, it gets dependency information from environment variables, does its work, and then terminates. Once the script is over, the container exits and no longer consumes resources.
This is good: I am only paying for compute resources while the script is running.
My problem is this: I have a number of different types of scripts I can run. What I want to do is create a manager that, given the name of a script type to run, gets the corresponding container to run in Google Container Engine in such a way that the invocation is configured to use a predefined CPU, disk, and memory allocation environment intended to run the script as fast as possible.
Then, once the script finishes, I want the container removed from the environment so that I am no longer paying for the resources. In other words, I want to be able to do in an automated manner in Container Engine what I can do manually from my local machine at the command line.
I am trying to learn how to get Container Engine to support my need in an automated manner. It seems to me that using Kubernetes might be a bit of overkill, in that I do not really want to guarantee constant availability. Rather, I just want the container to run and die. If for some reason the script fails or terminates before success, the architecture is designed to detect the unsuccessful attempt.
You could use a Kubernetes Job, a controller object that runs a pod 'to completion'.
A Job object such as this can be used to run a single pod.
Once the job (in this case your script) has completed, the pod is terminated and will therefore no longer use any resources. The pod wouldn't be deleted (unless the job is deleted) but will remain in a terminated state. If required and configured correctly, no more pods will be created.
The Job object can also be configured to start a new pod should the job fail for any reason, if you require that functionality.
For more detailed information on this please see this page.
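If you would rather drive this from Python, a hedged sketch of creating such a Job with the official kubernetes client (image name, resource requests and namespace are placeholders) could look like the following:

    from kubernetes import client, config

    def create_script_job(script_name):
        # Assumes kubectl credentials for the cluster are already configured locally.
        config.load_kube_config()

        container = client.V1Container(
            name=script_name,
            image="gcr.io/my-project/%s:latest" % script_name,  # hypothetical image
            resources=client.V1ResourceRequirements(
                requests={"cpu": "1", "memory": "2Gi"},          # tune per script type
            ),
        )
        template = client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="%s-job" % script_name),
            # backoff_limit=0 means no automatic retries; failures are surfaced instead.
            spec=client.V1JobSpec(template=template, backoff_limit=0),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)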
Also, just to add: to keep your billing to a minimum, you could reduce the number of nodes in the cluster to zero when you are not running the job, and then increase it to the required number when the jobs need to be executed. This could be done programmatically via API calls if required, and it should keep your billing as low as possible, as you will only be billed for the nodes while they are running.