I was using AWS and am new to GCP. One feature I used heavily was AWS Batch, which automatically creates a VM when the job is submitted and deletes the VM when the job is done. Is there a GCP counterpart? Based on my research, the closest is GCP Dataflow. The GCP Dataflow documentation led me to Apache Beam. But when I walk through the examples here (link), it feels totally different from AWS Batch.
Any suggestions on submitting jobs for batch processing in GCP? My requirement is simply to retrieve data from Google Cloud Storage, analyze the data using a Python script, and then put the results back into Google Cloud Storage. The process can run overnight, and I don't want the VM to sit idle after the job finishes while I'm asleep.
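To make the workflow concrete, here is a rough sketch of what the job does, using the google-cloud-storage client; the bucket and object names and the analyze() function are just placeholders for my actual code:
from google.cloud import storage

def run_job():
    client = storage.Client()
    bucket = client.bucket("my-input-bucket")        # placeholder bucket name

    # pull the input object down to the VM's local disk
    bucket.blob("input/data.csv").download_to_filename("/tmp/data.csv")

    result_path = analyze("/tmp/data.csv")           # placeholder for my analysis code

    # push the result back to Cloud Storage
    bucket.blob("output/result.csv").upload_from_filename(result_path)

if __name__ == "__main__":
    run_job()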
You can do this using AI Platform Jobs, which is now able to run arbitrary Docker images:
gcloud ai-platform jobs submit training $JOB_NAME \
--scale-tier BASIC \
--region $REGION \
--master-image-uri gcr.io/$PROJECT_ID/some-image
You can define the master instance type and even add worker instances if desired. They should consider creating a sibling product without the AI buzzword so people can find this functionality more easily.
I recommend checking out dsub. It's an open-source tool initially developed by the Google Genomics team for doing batch processing on Google Cloud.
UPDATE: I have now used this service and I think it's awesome.
As of July 13, 2022, GCP now has its own fully managed batch processing service (GCP Batch), which looks very similar to AWS Batch.
See the GCP Blog post announcing it at: https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud (with links to docs as well)
Officially, according to the "Map AWS services to Google Cloud Platform products" page, there is no direct equivalent, but you can put a few things together that might get you close.
I wasn't sure whether you are running, or have the option to run, your Python code in Docker. If so, the Kubernetes controls might do the trick. From the GCP docs:
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
So, if you are running other managed instances anyway, you can scale the node pool up and down to and from zero, but at least one Kubernetes node stays active to run the system Pods.
I'm guessing you are already using something like "Creating API Requests and Handling Responses" to get an ID so you can verify that the process has started, the instance has been created, and the payload is being processed. You can use that same process to report when the job completes as well. That takes care of instance creation and launching the Python script.
You could use Cloud Pub/Sub to keep track of the job's state. Can you modify your Python script to publish a notification when the task completes? Then, in addition to reporting that the task was created and the instance launched, you can report that the Python job is complete and kick off an instance teardown process.
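If you go that route, publishing the completion message from the script is only a few lines with the google-cloud-pubsub client; the project and topic IDs here are placeholders:
from google.cloud import pubsub_v1

def notify_done(job_id):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "job-status")   # placeholder IDs
    # attributes carry the job metadata; the payload itself can stay small
    future = publisher.publish(topic_path, b"done", job_id=job_id)
    future.result()   # block until Pub/Sub has accepted the message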
Another thing you can do to cut costs is to use Preemptible VM Instances, which run at a much lower price and will run for a maximum of 24 hours anyway.
Hope that helps.
The product that best suits your use case in GCP is Cloud Tasks. We are using it for a similar use case, where we retrieve files from another HTTP server and, after some processing, store them in Google Cloud Storage.
The GCP documentation describes in full detail the steps for creating tasks and using them.
You schedule your tasks programmatically in Cloud Tasks, and you have to create task handlers (worker services) in App Engine. There are some limitations for worker services running in App Engine:
In the standard environment:
Automatic scaling: task processing must finish within 10 minutes.
Manual and basic scaling: requests can run for up to 24 hours.
In the flexible environment: all scaling types have a 60-minute timeout.
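For reference, creating a task against an App Engine handler looks roughly like this with the google-cloud-tasks client (a recent v2 release is assumed; the project, location, queue and handler route are placeholders):
from google.cloud import tasks_v2

def enqueue_task(payload):
    client = tasks_v2.CloudTasksClient()
    parent = client.queue_path("my-project", "us-central1", "my-queue")   # placeholders
    task = {
        "app_engine_http_request": {
            # http_method defaults to POST; the handler route is a placeholder
            "relative_uri": "/process",
            "body": payload,   # bytes
        }
    }
    return client.create_task(request={"parent": parent, "task": task})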
I think a cron job can help you in this regard, and you can implement it with the help of App Engine, Pub/Sub and Compute Engine. From "Reliable Task Scheduling on Google Compute Engine": In distributed systems, such as a network of Google Compute Engine instances, it is challenging to reliably schedule tasks because any individual instance may become unavailable due to autoscaling or network partitioning.
Google App Engine provides a Cron service. Using this service for scheduling and Google Cloud Pub/Sub for distributed messaging, you can build an application to reliably schedule tasks across a fleet of Compute Engine instances.
For a detailed look you can check it here: https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine
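On the Compute Engine side, the workers can listen for the scheduled messages with a streaming pull; the project and subscription IDs below are placeholders:
from google.cloud import pubsub_v1

def callback(message):
    # run the actual scheduled work for this message here
    print("processing task:", message.data)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "scheduled-tasks")  # placeholders
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()   # block and keep listening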
I'm trying to run a Python script in Google Cloud which will download 50GB of data once a day to a storage bucket. That download might take longer than the timeout limit on Google Cloud Functions, which is set to 9 minutes.
The request to invoke the Python function is triggered by HTTP.
Is there a way around this problem? I don't need to run an HTTP RESTful service, as this is called once a day from an external source (it can't be scheduled).
The whole premise is to download the big chunk of data directly to the cloud.
Thanks for any suggestions.
9 minutes is a hard limit for Cloud Functions that can't be exceeded. If you can't split your work into smaller units, one for each function invocation, consider using a different product. Cloud Run is limited to 15 minutes, and Compute Engine has no limit that would apply to you.
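If you move the job to Compute Engine, for example, you can stream the download straight into the bucket so the 50GB never has to fit on local disk; the source URL, bucket and object names here are placeholders:
import requests
from google.cloud import storage

def download_to_bucket(source_url, bucket_name, blob_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    with requests.get(source_url, stream=True) as resp:
        resp.raise_for_status()
        # upload_from_file reads the response stream in chunks
        blob.upload_from_file(resp.raw, content_type=resp.headers.get("Content-Type"))

download_to_bucket("https://example.com/big-file", "my-bucket", "daily/big-file")  # placeholders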
Google Cloud Scheduler may work well for that.
Here is a nice Google blog post that shows an example of how to set up a Python script.
P.S. You would probably want to connect it to App Engine for the actual execution.
I have a question about quota monitoring (I mean the quotas in https://console.cloud.google.com -> IAM & admin -> Quotas).
I need to configure alerting for cases when the remaining capacity of a quota for any service drops below 20%, for example. Has anybody done something like that? Maybe Google Cloud has some standard tools for that? If not, is it possible to do with Python and the gcloud module?
If you are interested in Compute Engine quotas, there is a standard Google Cloud tool to list them using this API call. Or use the CLI command below to list them in YAML format:
gcloud compute project-info describe --project myproject
You can use a cron job to perform a regularly scheduled check, calling the API and verifying that the usage/limit < 0.8 condition is met.
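As a rough sketch, the same check can be done from Python with the google-api-python-client, since the Compute Engine projects.get response includes each quota's usage and limit; the project ID and the alerting step are placeholders:
from googleapiclient import discovery

compute = discovery.build("compute", "v1")   # uses Application Default Credentials
project = compute.projects().get(project="myproject").execute()   # placeholder project

for quota in project.get("quotas", []):
    usage, limit = quota["usage"], quota["limit"]
    if limit and usage / limit > 0.8:        # less than 20% capacity left
        print("quota", quota["metric"], "is at", usage, "/", limit)   # send an alert here instead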
I have a couple of Python apps in CloudFoundry. Now I would like to schedule their execution. For example a specific app has to be executed on the second day of each month.
I couldn't find anything on the internet. Is that even possible?
Cloud Foundry will deploy your application inside a container. You could use libraries to execute your code on a specific schedule but either way you're paying to have that instance run the whole time.
What you're trying to do is a perfect candidate for "serverless computing" (also known as "event-driven" or "function as a service" computing).
These deployment technologies execute functions in response to a trigger, e.g. a REST API call, a certain timestamp, a new database insert, etc.
You could execute your python cloud foundry apps using the Openwhisk serverless compute platform.
IBM offers a hosted version of this running on its cloud platform, Bluemix.
I don't know what your code looks like so I'll use this sample hello world function:
import sys

def main(dict):
    if 'message' in dict:
        name = dict['message']
    else:
        name = 'stranger'
    greeting = 'Hello ' + name + '!'
    print(greeting)
    return {'greeting': greeting}
You can upload your actions (functions) to OpenWhisk using either the online editor or the CLI.
Once you've uploaded your actions you can automate them on a specific schedule by using the Alarm Package. To do this in the online editor click "automate this process" and pick the alarm package.
To do this via the CLI we need to first create a trigger:
$ wsk trigger create regular_hello_world --feed /whisk.system/alarms/alarm -p cron '0 0 9 * * *'
ok: created trigger feed regular_hello_world
This will trigger every day at 9am. We then need to link this trigger to our action by creating a rule:
$ wsk rule create regular_hello_rule regular_hello_world hello_world
ok: created rule regular_hello_rule
For more info see the docs on creating python actions.
The CloudFoundry platform itself does not have a scheduler (at least not at this time), and the containers where your application runs do not have cron installed (and are unlikely to ever have it).
If you want to schedule code to periodically run, you have a few options.
You can deploy an application that includes a scheduler. The scheduler can run your code directly in that container or it can trigger the code to run elsewhere (ex: it sends an HTTP request to another application and that request triggers the code to run). If you trigger the code to run elsewhere, you can make the scheduler app run pretty lean (maybe with 64m of memory or less) to reduce costs.
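As a sketch of that first option, here is a tiny scheduler app using the third-party schedule library; the worker URL is a placeholder, and the date check matches the "second day of each month" requirement from the question:
import time
import requests
import schedule

def maybe_trigger():
    # only fire on the second day of the month
    if time.localtime().tm_mday == 2:
        # placeholder URL for the app that actually runs the work
        requests.post("https://my-worker-app.example.com/run-job", timeout=30)

schedule.every().day.at("09:00").do(maybe_trigger)

while True:
    schedule.run_pending()
    time.sleep(60)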
You can look for a third party scheduler service. The availability of and cost of services like this will vary depending on your CF provider, but there are service offerings to handle scheduling. These typically function like the previous example where an HTTP request is sent to your app at a specific time and that triggers your scheduled code. Many service providers offer free tiers, which give you a small number of triggers per month at no cost.
If you have a server outside of CF with cron installed, you can use cron there to schedule the tasks and trigger the code to run on CF. You can do this like the previous examples by sending HTTP requests to your app; however, this option also gives you the possibility of making use of CloudFoundry's task feature.
CloudFoundry has the concept of a task, which is a one-time execution of some code. With it, you can execute the cf run-task command to trigger the task to run. Ex: cf run-task <app-name> "python my-task.py". More on that in the docs, here. The nice part about using tasks is that your provider will only bill you while the task is running.
To see if your provider has tasks available, run cf feature-flags and look to see if task_creation is set to enabled.
Hope that helps!
I'm trying to define an architecture where multiple Python scripts need to be run in parallel and on demand. Imagine the following setup:
script requestors (web API) -> Service Bus queue -> script execution -> result posted back to script requestor
To this end, the script requestor places a script request message on the queue, together with an API endpoint where the result should be posted back to. The script request message also contains the input for the script to be run.
The Service Bus queue decouples producers and consumers. A generic set of "workers" simply look for new messages on the queue, take the input message and call a Python script with said input. Then they post back the result to the API endpoint. But what strategies could I use to "run the Python scripts"?
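In case it helps frame the answers, here is roughly what I have in mind for a worker, sketched with the azure-servicebus (v7) client; the connection string, queue name, message format and run_script() details are all placeholders:
import json
import subprocess
import requests
from azure.servicebus import ServiceBusClient

CONN_STR = "<service-bus-connection-string>"   # placeholder
QUEUE_NAME = "script-requests"                 # placeholder

def run_script(request):
    # call the requested Python script with the provided input (placeholder logic)
    out = subprocess.run(
        ["python", request["script"], json.dumps(request["input"])],
        capture_output=True, text=True, check=True,
    )
    return {"output": out.stdout}

with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_receiver(queue_name=QUEUE_NAME) as receiver:
        for message in receiver:                          # blocks, yielding queue messages
            request = json.loads(str(message))
            result = run_script(request)
            requests.post(request["callback_url"], json=result)   # post the result back
            receiver.complete_message(message)            # remove the message from the queue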
One possible strategy could be to use WebJobs. WebJobs can execute Python scripts and run on a schedule. Let's say you run a WebJob every 5 minutes: the Python script can poll the queue, do some processing and post the results back to your API.
In my experience, there are two strategies you could use.
Develop a Python script for Azure HDInsight. Azure HDInsight is a Hadoop-based platform with the power of parallel compute; you can refer to the doc https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-streaming-python/ to learn more about it.
Develop a Python script based on a parallel compute framework like dispy or jug, running on Azure VMs (see the sketch below).
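For instance, with dispy the fan-out is only a few lines once dispynode is running on each VM; the node addresses and the compute function are placeholders:
import dispy

def compute(n):
    # CPU-bound work that runs on a remote node
    return n * n

cluster = dispy.JobCluster(compute, nodes=["10.0.0.4", "10.0.0.5"])   # placeholder VM addresses
jobs = [cluster.submit(i) for i in range(20)]
for job in jobs:
    result = job()          # blocks until the job finishes and returns its result
    print(job.id, result)
cluster.close()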
Hope it helps. Best Regards.
I wrote a Python script that pulls data from a 3rd-party API and pushes it into a SQL table I set up in AWS RDS. I want to automate this script so that it runs every night (the script only takes about a minute to run). I need to find a good place and way to set up this script so that it runs each night.
I could set up an EC2 instance, and a cron job on that instance, and run it from there, but it seems expensive to keep an EC2 instance alive all day for only 1 minute of run-time per night. Would AWS data pipeline work for this purpose? Are there other better alternatives?
(I've seen similar topics discussed when googling around but haven't seen recent answers.)
Thanks
Based on your case, I think you can try using ShellCommandActivity in Data Pipeline. It will launch an EC2 instance and execute the command you give to Data Pipeline on your schedule. After finishing the task, the pipeline will terminate the EC2 instance.
Here is doc:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
Alternatively, you could use a 3rd-party service like Crono. Crono is a simple REST API to manage time-based jobs programmatically.