Long-running Python process with Google Cloud Functions - python

I'm trying to run a Python script in Google Cloud that will download 50 GB of data once a day to a storage bucket. That download might take longer than the timeout limit on Google Cloud Functions, which is capped at 9 minutes.
The request to invoke the Python function is triggered by HTTP.
Is there a way around this problem? I don't need to run an HTTP RESTful service, as this is called once a day from an external source (it can't be scheduled).
The whole premise is to download the big chunk of data directly to the cloud.
Thanks for any suggestions.

9 minutes is a hard limit for Cloud Functions that can't be exceeded. If you can't split up your work into smaller units, one for each function invocation, consider using a different product. Cloud Run is limited to 15 minutes, and Compute Engine has no limit that would apply to you.
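For illustration, here is a rough sketch of how the daily transfer could stream straight into the bucket from, say, a Compute Engine VM; the source URL, bucket, and object names below are placeholders:

# Rough sketch: stream a large HTTP download directly into a GCS object
# without staging it on local disk. URL, bucket and blob names are placeholders.
import requests
from google.cloud import storage

def download_to_bucket(source_url, bucket_name, blob_name):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    with requests.get(source_url, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        # upload_from_file reads the response stream in chunks, so the
        # 50 GB payload never has to fit in memory
        blob.upload_from_file(resp.raw, content_type=resp.headers.get("Content-Type"))

if __name__ == "__main__":
    download_to_bucket("https://example.com/daily-export.tar",
                       "my-data-bucket", "exports/daily-export.tar")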

Google Cloud Scheduler may work well for that.
Here is a nice Google blog post that shows an example of how to set up a Python script.
P.S. You would probably want to connect it to App Engine for the actual execution.

Related

Processing data from Salesforce to GCP BigQuery using a Python script is throwing a timeout error

When cloud-function-1 is triggered, Salesforce data is stored in GCP_Bucket-1; when Cloud-function-2 is triggered, the data should be stored in a GCP BigQuery database. The Python script itself works fine, but it processes only a few records and then throws a timeout error. Can anyone please suggest a solution for this?
As stated in the Cloud Functions documentation: https://cloud.google.com/functions/docs/concepts/exec#timeout
By default, a function times out after 1 minute, but you can extend this period up to 9 minutes.
So try increasing the timeout; the link above has the details on how to do it.
If 9 minutes are not enough, you can use Cloud Run, which has a timeout of 1 hour. But keep in mind that Cloud Run needs a little more configuration than Cloud Functions.
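If most of the time in Cloud-function-2 is spent writing rows into BigQuery, another option worth considering alongside raising the timeout is to have BigQuery load the file from the bucket itself via a server-side load job, so the function only has to start it. A rough sketch with placeholder project, file and table names, assuming the data lands in the bucket as CSV:

# Rough sketch: start a server-side BigQuery load job from a GCS file.
# Project, dataset, table and file names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.salesforce_data"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
load_job = client.load_table_from_uri(
    "gs://GCP_Bucket-1/salesforce_export.csv", table_id, job_config=job_config
)
load_job.result()  # waits for the job; the heavy lifting happens inside BigQuery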

Deploying a Python Automation script in the cloud

I have a working Python automation program, combine_excel.py. It accesses a server, extracts Excel files, and combines them in an automation workflow. Currently, I need to execute this automation manually.
I would like to host this program on a cloud server and trigger the script at preset times and regular intervals, and I would like to know if there is any service out there that will allow me to do that. Can I do this on Google Cloud or AWS?
The program generates an output that I would like to have saved to my Google Drive.
An easy, cost-effective way to achieve this could be to use AWS Lambda functions. Lambda functions can be set to trigger at certain time intervals using cron syntax.
You might need to make some minor adjustments to match some format requirements Lambda has, and maybe work out a way to include dependencies if you have any, but everything should be pretty straightforward, as there is a lot of information available on the web; a minimal handler is sketched after the links below.
The same can be achieved using Google Cloud Functions.
You could also try the Serverless Framework, which would take care of the deployment for you; you only need to set it up once.
Another option is to try Zeit; it's quite simple to use, and it has a free tier (as do the others).
Some useful links:
https://serverless.com/blog/serverless-python-packaging/
https://docs.aws.amazon.com/lambda/latest/dg/welcome.html
https://cloud.google.com/functions/docs/quickstart-python
https://zeit.co/docs/runtimes#official-runtimes/python
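As a starting point, a minimal handler might look like the sketch below; the module and function names are placeholders, and the schedule itself lives on the CloudWatch Events / EventBridge rule (or in serverless.yml), not in the code:

# handler.py -- minimal sketch of a scheduled Lambda entry point.
# combine_excel.main is assumed to be the existing automation's entry point,
# packaged in the deployment zip alongside this file.
import combine_excel

def run(event, context):
    # invoked by a cron-style EventBridge rule, e.g. cron(0 2 * * ? *)
    combine_excel.main()
    return {"status": "done"}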

AWS Batch analog in GCP?

I was using AWS and am new to GCP. One feature I used heavily was AWS Batch, which automatically creates a VM when the job is submitted and deletes the VM when the job is done. Is there a GCP counterpart? Based on my research, the closest is GCP Dataflow. The GCP Dataflow documentation led me to Apache Beam. But when I walk through the examples here (link), it feels totally different from AWS Batch.
Any suggestions on submitting jobs for batch processing in GCP? My requirement is simply to retrieve data from Google Cloud Storage, analyze it with a Python script, and put the result back into Google Cloud Storage. The process can run overnight, and I don't want the VM to sit idle after the job finishes while I'm sleeping.
You can do this using AI Platform Jobs, which is now able to run arbitrary Docker images:
gcloud ai-platform jobs submit training $JOB_NAME \
--scale-tier BASIC \
--region $REGION \
--master-image-uri gcr.io/$PROJECT_ID/some-image
You can define the master instance type and even add additional worker instances if desired. They should consider creating a sibling product without the AI buzzword so people can find this functionality more easily.
I recommend checking out dsub. It's an open-source tool initially developed by the Google Genomics teams for doing batch processing on Google Cloud.
UPDATE: I have now used this service and I think it's awesome.
As of July 13, 2022, GCP now has its own fully managed batch processing service (GCP Batch), which seems very akin to AWS Batch.
See the GCP Blog post announcing it at: https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud (with links to docs as well)
Officially, according to the "Map AWS services to Google Cloud Platform products" page, there is no direct equivalent, but you can put a few things together that might get you close.
I wasn't sure whether you are running, or have the option to run, your Python code in Docker. If so, the Kubernetes controls might do the trick. From the GCP docs:
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
So, if you are running other managed instances anyway, you can scale the node pool up and down to and from zero, but at least one Kubernetes node will still be active and running the system Pods.
I'm guessing you are already using something like "Creating API Requests and Handling Responses" to get an ID with which you can verify that the process has started, the instance has been created, and the payload is being processed. You can use that same process to report that the job has completed as well. That takes care of creating the instance and launching the Python script.
You could use Cloud Pub/Sub to help you keep track of the job's state: can you modify your Python script to publish a notification when the task completes? When you create the task and launch the instance, you can also report when the Python job is complete and then kick off an instance tear-down process.
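For example, the completion notification could be as small as a single publish call at the end of the script; the project and topic IDs below are placeholders, and the topic is assumed to already exist:

# Rough sketch: publish a "job done" message at the end of the Python job.
# Project and topic IDs are placeholders.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "batch-job-status")

def notify_done(job_id):
    # a subscriber listening on this topic can then tear the instance down
    future = publisher.publish(topic_path, b"done", job_id=job_id)
    future.result()  # block until Pub/Sub has accepted the message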
Another thing you can do to drop costs is to use Preemptible VM Instances, so that the instances run at half the cost and will only run for a maximum of one day anyway.
Hope that helps.
The product in GCP that best suits your use case is Cloud Tasks. We are using it for a similar use case where we retrieve files from another HTTP server and, after some processing, store them in Google Cloud Storage.
The GCP documentation describes in full detail the steps to create tasks and use them.
You schedule your tasks programmatically in Cloud Tasks, and you have to create task handlers (worker services) in App Engine; a sketch of creating such a task follows the list below. There are some limitations for worker services running in App Engine:
In the standard environment:
Automatic scaling: task processing must finish within 10 minutes.
Manual and basic scaling: requests can run for up to 24 hours.
In the flexible environment: all scaling types have a 60-minute timeout.
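A rough sketch of creating a task with the google-cloud-tasks client; the project, location, queue name, handler path and payload are placeholders:

# Rough sketch: enqueue a Cloud Task that an App Engine worker service handles.
# Project, location, queue, handler URI and payload are placeholders.
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "file-processing-queue")

task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/tasks/process-file",        # handler in the worker service
        "body": b"gs://my-bucket/incoming/file-001",   # payload the handler will read
    }
}
# newer client versions take the request= form; older ones accept (parent, task)
response = client.create_task(request={"parent": parent, "task": task})
print("Created task:", response.name)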
I think a cron job can help you in this regard, and you can implement it with the help of App Engine, Pub/Sub, and Compute Engine. As "Reliable Task Scheduling on Google Compute Engine" puts it: in distributed systems, such as a network of Google Compute Engine instances, it is challenging to reliably schedule tasks because any individual instance may become unavailable due to autoscaling or network partitioning.
Google App Engine provides a Cron service. Using this service for scheduling and Google Cloud Pub/Sub for distributed messaging, you can build an application to reliably schedule tasks across a fleet of Compute Engine instances.
For a detailed look you can check it here: https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine

Automating a 3rd-party API pull and push into AWS RDS SQL using Python

I wrote a Python script that pulls data from a 3rd-party API and pushes it into a SQL table I set up in AWS RDS. I want to automate this script so that it runs every night (the script only takes about a minute to run). I need to find a good place and way to set up this script so that it runs each night.
I could set up an EC2 instance with a cron job on that instance and run it from there, but it seems expensive to keep an EC2 instance alive all day for only one minute of run time per night. Would AWS Data Pipeline work for this purpose? Are there other, better alternatives?
(I've seen similar topics discussed when googling around but haven't seen recent answers.)
Thanks
Based on your case, I think you can try using ShellCommandActivity in Data Pipeline. It will launch an EC2 instance and execute the command you give Data Pipeline on your schedule. After finishing the task, the pipeline will terminate the EC2 instance.
Here are the docs:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-shellcommandactivity.html
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-ec2resource.html
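If you prefer to set the pipeline up programmatically rather than in the console, a rough sketch with boto3 might look like the following; the pipeline name, instance type and script path are placeholders, and a complete definition also needs account-specific fields such as role, resourceRole, pipelineLogUri and a Schedule object for the nightly run:

# Rough sketch: define a Data Pipeline that runs a shell command on a
# short-lived EC2 instance. Names, instance type and script path are
# placeholders; role/resourceRole, pipelineLogUri and a Schedule object
# still need to be added for a working nightly pipeline.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")
pipeline = dp.create_pipeline(name="nightly-api-pull", uniqueId="nightly-api-pull")

dp.put_pipeline_definition(
    pipelineId=pipeline["pipelineId"],
    pipelineObjects=[
        {   # the EC2 resource Data Pipeline launches and terminates per run
            "id": "NightlyEc2", "name": "NightlyEc2",
            "fields": [
                {"key": "type", "stringValue": "Ec2Resource"},
                {"key": "instanceType", "stringValue": "t2.micro"},
                {"key": "terminateAfter", "stringValue": "30 Minutes"},
            ],
        },
        {   # the shell command that runs the existing script on that instance
            "id": "RunScript", "name": "RunScript",
            "fields": [
                {"key": "type", "stringValue": "ShellCommandActivity"},
                {"key": "command", "stringValue": "python /home/ec2-user/pull_and_push.py"},
                {"key": "runsOn", "refValue": "NightlyEc2"},
            ],
        },
    ],
)
dp.activate_pipeline(pipelineId=pipeline["pipelineId"])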
Alternatively, you could use a 3rd-party service like Crono. Crono is a simple REST API to manage time-based jobs programmatically.

Google AppEngine and Threaded Workers

I am currently trying to develop something using Google App Engine. I am using Python as my runtime and require some advice on setting up the following.
I am running a web server that provides JSON data to clients. The data comes from an external service from which I have to pull it.
What I need to be able to do is run a background system that checks memcache to see if there are any required IDs; if there is an ID, I need to fetch some data for that ID from the external source and place the data in memcache.
If there are multiple IDs (more than 30), I need to be able to pull all of those requests as quickly and efficiently as possible.
I am new to Python development and App Engine, so any advice you could give would be great.
Thanks.
You can use "backends" or "task queues" to run processes in the background. Tasks have a 10-minute run time limit, and backends have no run time limit. There's also a cronjob mechanism which can trigger requests at regular intervals.
You can fetch the data from external servers with the "URLFetch" service.
Note that using memcache as the communication mechanism between the front end and the back end is unreliable -- the contents of memcache may be partially or fully erased at any time (and it does happen from time to time).
Also note that you can't query memcache if you don't know the exact keys ahead of time. It's probably better to use the task queue to queue up requests instead of using memcache, or to use the datastore as a storage mechanism.
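For example, using the task queue instead of polling memcache could look roughly like this (legacy first-generation App Engine APIs; the handler path and external URL are placeholders):

# Rough sketch using the legacy (first-generation) App Engine APIs mentioned
# above. Handler path and external URL are placeholders.
from google.appengine.api import memcache, taskqueue, urlfetch

def enqueue_fetch(item_id):
    # one task per ID instead of having a backend poll memcache
    taskqueue.add(url="/worker/fetch", params={"id": item_id})

def fetch_and_cache(item_id):
    # runs inside the /worker/fetch task handler; tasks execute in parallel,
    # so 30+ IDs fan out across multiple worker requests
    result = urlfetch.fetch("https://external.example.com/items/%s" % item_id,
                            deadline=30)
    if result.status_code == 200:
        memcache.set("item:%s" % item_id, result.content, time=3600)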
