I have a large number of files (about 100,000) in a Google Storage bucket. The file sizes are about 2-10 MB. I need to apply a simple Python function (just a data transformation) to each of these files: read from one bucket, transform in parallel with the Python function, and store the results in another bucket. I am thinking of a simple Hadoop or Spark cluster to do this. I previously used concurrent threads on a single instance, but I need a more robust approach. What is the best way to accomplish this?
You can use the recently-announced Google Cloud Dataproc (in beta as of 5 Oct 2015), which provides a managed Hadoop or Spark cluster for you. It is integrated with Google Cloud Storage so you can read and write data from your bucket.
You can submit jobs via gcloud, the console, or via SSH to a machine in your cluster.
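For example, a minimal PySpark sketch of the read-transform-write flow might look like the following (the bucket names and the transform function are placeholders, not your actual job, and it assumes the files are plain text):

# Minimal PySpark sketch; submit with e.g.:
#   gcloud dataproc jobs submit pyspark transform.py --cluster=my-cluster
from pyspark import SparkContext

def transform(content):
    # Your existing Python transformation goes here.
    return content.upper()

sc = SparkContext()

# wholeTextFiles yields (path, file_content) pairs, one per object in the bucket.
files = sc.wholeTextFiles("gs://input-bucket/*")
transformed = files.mapValues(transform)

# Write the transformed records to the destination bucket as part files.
transformed.map(lambda kv: kv[1]).saveAsTextFile("gs://output-bucket/transformed")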
I'm trying to run a Python script in Google Cloud which will download 50 GB of data once a day to a storage bucket. That download might take longer than the timeout limit on Google Cloud Functions, which is set to 9 minutes.
The request to invoke the python function is triggered by HTTP.
Is there a way around this problem? I don't need to run an HTTP RESTful service, as this is called once a day from an external source (it can't be scheduled).
The whole premise is to download the big chunk of data directly to the cloud.
Thanks for any suggestions.
9 minutes is a hard limit for Cloud Functions that can't be exceeded. If you can't split up your work into smaller units, one for each function invocation, consider using a different product. Cloud Run is limited to 15 minutes, and Compute Engine has no limit that would apply to you.
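If you do split the work, a hedged sketch of a per-chunk function might look like this (the bucket name, URL, and chunk layout are hypothetical; each invocation fetches one byte range so it stays well under the limit, and the parts can be joined later, e.g. with GCS compose):

import requests
from google.cloud import storage

def download_chunk(request):
    # HTTP-triggered Cloud Function: fetch one byte range and store it in GCS.
    params = request.get_json()
    headers = {"Range": f"bytes={params['start']}-{params['end']}"}
    resp = requests.get(params["url"], headers=headers, stream=True)
    resp.raise_for_status()

    bucket = storage.Client().bucket("my-destination-bucket")  # hypothetical bucket
    blob = bucket.blob(f"daily/part-{params['index']:05d}")
    blob.upload_from_file(resp.raw)  # stream the chunk straight into the object
    return "ok", 200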
Google Cloud Scheduler may work well for that.
Here is a nice Google blog post that shows an example of how to set up a Python script.
P.S. You would probably want to connect it to App Engine for the actual execution.
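If you go that route, a minimal sketch of an App Engine (Flask) handler that Cloud Scheduler could hit once a day might look like this (the source URL and bucket name are placeholders; note that the App Engine request timeout still has to cover the whole download):

import requests
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/daily-download", methods=["POST"])
def daily_download():
    bucket = storage.Client().bucket("my-destination-bucket")  # hypothetical bucket
    blob = bucket.blob("daily/dump.bin")
    # Stream the download so the 50 GB payload never sits fully in memory.
    with requests.get("https://example.com/bigfile", stream=True) as resp:  # placeholder URL
        resp.raise_for_status()
        blob.upload_from_file(resp.raw)
    return "done", 200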
I have a large collection of data stored in a Google Storage bucket with the following structure:
gs://project_garden/plant_logs/2019/01/01/humidity/plant001/hour.gz. What I want is to make a Kubernetes Job which downloads all of it, parses it, and uploads the parsed files to BigQuery in parallel. So far I've managed to do it locally, without any parallelism, by writing Python code which takes a date interval as input and loops over each of the plants, executing gsutil -m cp -r for download, gunzip for extraction, and pandas for transforming. I want to do the same thing, but in parallel for each plant, using Kubernetes. Is it possible to parallelise the process by defining a Job that passes a different plant ID down to each pod and downloads the files for each of them?
A direct upload from Kubernetes to BigQuery is not possible; you can only load data into BigQuery [1] using the following methods:
From Cloud Storage
From other Google services, such as Google Ad Manager and Google Ads
From a readable data source (such as your local machine)
By inserting individual records using streaming inserts
Using DML statements to perform bulk inserts
Using a BigQuery I/O transform in a Cloud Dataflow pipeline to write data to BigQuery
As mentioned in the previous comment, the easiest solution would be to upload the data using Dataflow; you can find a template to upload text from Google Cloud Storage (GCS) to BigQuery in link [2].
If you have to use Google Kubernetes Engine (GKE), you will need to perform the following steps (a sketch follows after the links below):
Read the data from GCS in GKE. You can find an example of how to mount a bucket in your containers in link [3].
Parse the data with your code, as mentioned in your question.
Load the data from GCS into BigQuery; more info in link [4].
[1] https://cloud.google.com/bigquery/docs/loading-data
[2] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttobigquerystream
[3] https://github.com/maciekrb/gcs-fuse-sample
[4] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage
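To illustrate the GKE route, here is a hedged sketch of a per-plant worker entrypoint, assuming each pod gets a PLANT_ID environment variable from the Job spec; the output bucket, dataset, table, and the pandas parsing step are hypothetical placeholders:

import io
import os

import pandas as pd
from google.cloud import bigquery, storage

plant_id = os.environ["PLANT_ID"]  # e.g. "plant001", set per pod in the Job spec
gcs = storage.Client()
bq = bigquery.Client()

src_bucket = gcs.bucket("project_garden")
out_bucket = gcs.bucket("project_garden_parsed")  # hypothetical output bucket

# Step 1: read this plant's files from GCS; Step 2: parse them.
for blob in src_bucket.list_blobs(prefix=f"plant_logs/2019/01/01/humidity/{plant_id}/"):
    raw = blob.download_as_bytes()
    df = pd.read_csv(io.BytesIO(raw), compression="gzip")  # replace with your parsing
    out_name = f"parsed/{plant_id}/{blob.name.split('/')[-1]}.csv"
    out_bucket.blob(out_name).upload_from_string(df.to_csv(index=False))

# Step 3: load everything parsed for this plant from GCS into BigQuery.
load_job = bq.load_table_from_uri(
    f"gs://project_garden_parsed/parsed/{plant_id}/*.csv",
    "my_dataset.plant_humidity",  # hypothetical dataset.table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
        skip_leading_rows=1,
    ),
)
load_job.result()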
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
The easiest option would be to load the GCS data into BigQuery and run your query from there (a sketch follows below).
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore backed by GCS, and query from there. (Presto doesn't have a direct GCS connector yet, as far as I know.) This option is more elaborate.
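To illustrate the first option, here is a hedged sketch using an external BigQuery table over the GCS files (the project, dataset, table, bucket, and field names are placeholders, and it assumes the files can be read as newline-delimited JSON):

from google.cloud import bigquery

client = bigquery.Client()

# Define an external table that points at the JSON files in GCS.
ext = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
ext.source_uris = ["gs://my-json-bucket/*.json"]  # hypothetical bucket
ext.autodetect = True

table = bigquery.Table("my-project.my_dataset.json_files")  # hypothetical table
table.external_data_configuration = ext
client.create_table(table, exists_ok=True)

# _FILE_NAME is a pseudo-column on GCS-backed external tables, so the query
# can return the names of the matching files directly.
query = """
    SELECT DISTINCT _FILE_NAME AS filename
    FROM `my-project.my_dataset.json_files`
    WHERE text_field LIKE '%needle%'
"""
for row in client.query(query).result():
    print(row.filename)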
Hope it helps!
I was using AWS and am new to GCP. One feature I used heavily was AWS Batch, which automatically creates a VM when the job is submitted and deletes the VM when the job is done. Is there a GCP counterpart? Based on my research, the closest is GCP Dataflow. The GCP Dataflow documentation led me to Apache Beam. But when I walk through the examples here (link), it feels totally different from AWS Batch.
Any suggestions on submitting jobs for batch processing in GCP? My requirement is to simply retrieve data from Google Cloud Storage, analyze the data using a Python script, and then put the result back to Google Cloud Storage. The process can take overnight and I don't want the VM to be idle when the job is finished but I'm sleeping.
You can do this using AI Platform Jobs, which is now able to run arbitrary Docker images:
gcloud ai-platform jobs submit training $JOB_NAME \
--scale-tier BASIC \
--region $REGION \
--master-image-uri gcr.io/$PROJECT_ID/some-image
You can define the master instance type and even additional worker instances if desired. They should consider creating a sibling product without the AI buzzword so people can find this functionality more easily.
I recommend checking out dsub. It's an open-source tool initially developed by the Google Genomics team for doing batch processing on Google Cloud.
UPDATE: I have now used this service and I think it's awesome.
As of July 13, 2022, GCP now has its own fully managed batch processing service (GCP Batch), which seems very akin to AWS Batch.
See the GCP Blog post announcing it at: https://cloud.google.com/blog/products/compute/new-batch-service-processes-batch-jobs-on-google-cloud (with links to docs as well)
Officially, according to the "Map AWS services to Google Cloud Platform products" page, there is no direct equivalent, but you can put a few things together that might get you close.
I wasn't sure whether you are running, or have the option to run, your Python code in Docker. If so, the Kubernetes controls might do the trick. From the GCP docs:
Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods).
So, if you are running other managed instances anyway, you can scale the node pool up and down to and from zero, but at least one Kubernetes node stays active to run the system pods.
I'm guessing you are already using something like "Creating API Requests and Handling Responses" to get an ID so you can verify that the process has started, the instance has been created, and the payload is being processed. You can use that same process to report that the work has completed as well. That takes care of instance creation and launching the Python script.
You could use Cloud Pub/Sub to keep track of state: if you can modify your Python script to publish a message when the task completes, then the same machinery that creates the task and launches the instance can also be notified that the Python job is done and kick off an instance teardown.
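As a hedged sketch of that Pub/Sub hand-off (the topic, subscription, project, and instance names are all placeholders):

from google.cloud import pubsub_v1
from googleapiclient import discovery

# --- In the worker script, right after the batch work finishes ---
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "batch-job-done")  # hypothetical topic
publisher.publish(topic_path, b"done",
                  instance="batch-worker-1", zone="us-central1-a").result()

# --- In a small controller, pull the message and tear the instance down ---
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "batch-job-done-sub")
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 1})
for msg in response.received_messages:
    attrs = msg.message.attributes
    compute = discovery.build("compute", "v1")
    compute.instances().delete(
        project="my-project", zone=attrs["zone"], instance=attrs["instance"]
    ).execute()
    subscriber.acknowledge(request={"subscription": sub_path, "ack_ids": [msg.ack_id]})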
Another thing you can do to drop costs is to use preemptible VM instances, which run at half the cost and will run for a maximum of one day anyway.
Hope that helps.
The product that best suits your use case in GCP is Cloud Tasks. We are using it for a similar use case where we retrieve files from another HTTP server and, after some processing, store them in Google Cloud Storage.
This GCP documentation describes in full detail the steps to create tasks and use them.
You schedule your tasks programmatically in Cloud Tasks, and you have to create task handlers (worker services) in App Engine; a minimal sketch of creating a task follows below. There are some limitations for worker services running in App Engine:
In the standard environment:
Automatic scaling: task processing must finish within 10 minutes.
Manual and basic scaling: requests can run for up to 24 hours.
In the flexible environment: all instance types have a 60-minute timeout.
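For illustration, a hedged sketch of creating such a task from Python (the project, location, queue name, and handler path are placeholders, and the queue is assumed to already exist):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "file-processing")  # hypothetical queue

task = {
    "app_engine_http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "relative_uri": "/process-file",  # your App Engine worker endpoint
        "headers": {"Content-Type": "application/json"},
        "body": b'{"source": "gs://my-bucket/input/file-001.json"}',
    }
}

response = client.create_task(request={"parent": parent, "task": task})
print("Created task:", response.name)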
I think a cron job can help you in this regard, and you can implement it with the help of App Engine, Pub/Sub, and Compute Engine. From "Reliable Task Scheduling on Google Compute Engine": in distributed systems, such as a network of Google Compute Engine instances, it is challenging to reliably schedule tasks because any individual instance may become unavailable due to autoscaling or network partitioning.
Google App Engine provides a Cron service. Using this service for scheduling and Google Cloud Pub/Sub for distributed messaging, you can build an application to reliably schedule tasks across a fleet of Compute Engine instances.
For a detailed look you can check it here: https://cloud.google.com/solutions/reliable-task-scheduling-compute-engine
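A hedged sketch of the worker side of this pattern (the project and subscription names are placeholders): each Compute Engine instance listens on the Pub/Sub topic that the App Engine cron handler publishes to, and runs the batch work when a message arrives.

from google.cloud import pubsub_v1

def handle(message):
    # Run the actual batch work here (download, analyze, upload results).
    print("Running scheduled task:", message.data)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("my-project", "nightly-batch-sub")
streaming_pull = subscriber.subscribe(sub_path, callback=handle)
streaming_pull.result()  # block, processing scheduled tasks as they arrive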
I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3 but not for Google.
What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this or is this only valid for files hosted on Google Cloud storage?
The BigQuery Connector is a client side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage, and takes advantage of file creation ordering to start Hadoop processing early to increase overall throughput.
This code should work wherever you happen to locate your Hadoop cluster.
That said, if you are running over large data, you might find network bandwidth to be a problem (how good is your network connection to Google?), and since you are reading data out of Google's network, GCS network egress costs will apply.
Databricks has now documented how to use Google BigQuery via Spark here.
Set the following Spark config in the cluster settings:
credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
In pyspark use:
df = spark.read.format("bigquery") \
.option("table", table) \
.option("project", <project-id>) \
.option("parentProject", <parent-project-id>) \
.load()