Google Cloud Dataflow Dependencies - Python

I want to use Dataflow to process a bunch of video clips I have stored in Google Cloud Storage in parallel. My processing algorithm has non-Python dependencies and is expected to change over development iterations.
My preference would be to use a Docker container with the logic to process the clips, but it appears that custom containers are not supported (in 2017):
use docker for google cloud data flow dependencies
Although they may be supported now - since it was being worked on:
Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job
According to this issue a custom Docker image may be pulled, but I couldn't find any documentation on how to do that with Dataflow.
https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376
Another option might be to use setup.py to install any dependencies as described in this dated example:
https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python
However, when running the example I get an error that there is no module named osgeo.gdal.
For pure Python dependencies I have also tried passing the --requirements_file argument, but I still get an error: Pip install failed for package: -r
I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and based on my tests of --requirements_file and --setup_file it appears the apache_beam instructions do not work.
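For reference, a minimal sketch of how these flags are typically passed from Python (project, bucket, and file paths below are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/bucket names; setup_file and requirements_file are the
# standard Beam flags for shipping dependencies to the workers.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    setup_file='./setup.py',
    # or, for pure-Python dependencies:
    # requirements_file='requirements.txt',
)

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['gs://my-bucket/clips/clip1.mp4'])
     | beam.Map(print))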

This was answered in the comments; rewritten here for clarity:
In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to perform arbitrary commands before the SDK harness starts to receive commands from the runner harness.
A complete example can be found in the Apache Beam repo.
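The pattern in that example (the juliaset sample under apache_beam/examples/complete) boils down to a setup.py that runs a list of shell commands at build time on each worker, so non-Python packages can be installed before the pipeline code runs. A trimmed sketch of that pattern, assuming Debian-based worker images; the ffmpeg package and project name are placeholders:

import subprocess
from distutils.command.build import build as _build
import setuptools

# Shell commands run on each worker container at start-up; ffmpeg here is just
# an example of a non-Python dependency.
CUSTOM_COMMANDS = [
    ['apt-get', 'update'],
    ['apt-get', '--assume-yes', 'install', 'ffmpeg'],
]

class CustomCommands(setuptools.Command):
    """Runs the custom shell commands defined above."""
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

class build(_build):
    """Extends the default build command to also run CustomCommands."""
    sub_commands = _build.sub_commands + [('CustomCommands', None)]

setuptools.setup(
    name='video-processing',
    version='0.0.1',
    packages=setuptools.find_packages(),
    cmdclass={'build': build, 'CustomCommands': CustomCommands},
)

The pipeline is then launched with --setup_file=./setup.py so the workers build and run this file on start-up.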

As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.

Apache Beam Dataflow pipeline build and deploy with Bazel

Looking at the official Python Beam documentation page, it seems like the only way to do deployments is to have a setup.py that defines dependencies living in your repo or externally.
But this doesn't quite work with the Bazel way of managing Python dependencies (i.e. I have no setup.py file, nor a separate requirements.txt file for each pipeline in my repo).
How does one package and deploy jobs to runners using Bazel?
You can have a script that runs pip freeze > ... to generate a requirements file for each pipeline, then use those files when deploying the pipelines.
If multiple pipelines share the same set of Python dependencies and the superset of all their dependencies is not that big, you can build a custom container to submit with all the jobs.
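A rough sketch of the pip freeze approach, assuming the script runs inside each pipeline's Bazel-managed environment (the file name is a placeholder):

import subprocess
import sys

# Freeze the interpreter's current environment into a per-pipeline
# requirements file that can be handed to --requirements_file at deploy time.
with open('my_pipeline_requirements.txt', 'w') as f:
    subprocess.check_call([sys.executable, '-m', 'pip', 'freeze'], stdout=f)

The generated file is then passed to the pipeline with --requirements_file=my_pipeline_requirements.txt.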

GCP Dockerfile using Artifact Registry

I have a question.
What's the best approach to building a Docker image using a pip package from Artifact Registry?
I have a Cloud Build build that runs a Docker build; the Dockerfile essentially just runs pip install -r requirements.txt, and one of the dependencies is a library hosted in Artifact Registry.
When executing a step with the gcr.io/cloud-builders/docker image, I get an error that my Artifact Registry repository is not accessible, which is quite logical: I have access only from the image performing the given step, not from the image that is being built in this step.
Any ideas?
Edit:
For now I will use Secret Manager to pass a JSON key to my Dockerfile, but I hope for a better solution.
When you use Cloud Build, you can forward metadata server access through the Docker build process. It's documented, but not at all obvious (personally, the first time I needed it I emailed the Cloud Build PM to ask, and he sent me the documentation link).
With that in place, your Docker build can reach the metadata server and authenticate as the Cloud Build runtime service account. That should make your process much easier.
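A hypothetical sketch of what becomes possible once the build can reach the metadata server; the repository region, project, and name are placeholders, and keyring-based authentication (keyrings.google-artifactregistry-auth) is usually the cleaner option:

import requests

# Inside a docker build attached to the Cloud Build network, the metadata
# server is reachable (the docs note it may need to be mapped to
# 169.254.169.254 as metadata.google.internal inside the build).
METADATA_URL = ('http://metadata.google.internal/computeMetadata/v1/'
                'instance/service-accounts/default/token')
token = requests.get(METADATA_URL,
                     headers={'Metadata-Flavor': 'Google'}).json()['access_token']

# The short-lived token can then act as the password for the private pip
# repository; region, project, and repository names are placeholders.
index_url = (f'https://oauth2accesstoken:{token}'
             '@us-central1-python.pkg.dev/my-project/my-repo/simple/')
# e.g. pip install --extra-index-url "<index_url>" -r requirements.txt
print(index_url)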

How to upload packages to an instance in a Processing step in Sagemaker?

I have to do large-scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I need some packages that are not installed on SageMaker instances by default, so I want to install them using .whl files.
Another hurdle is that the SageMaker role does not have internet access.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
sess.default_bucket()
region = boto3.session.Session().region_name
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type="ml.t3.medium",
                                     instance_count=1)

sklearn_processor.run(code='script.py')
Attempted resolutions:
1. Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. This failed with the error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based on this post. But the packages never get copied, and no error message is thrown.
I also looked into using the package s3fs, but it didn't seem suitable for copying the wheel files.
Alternatives
My client is hesitant to spin up containers from custom Docker images. Any alternatives?
2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based on this post. But the packages never get copied, and no error message is thrown.
This approach seems sound.
You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script like install_and_run_my_python_code.sh that installs the wheel containing your Python dependencies and then runs your main Python entry point script.
Additionally, instead of downloading your code with AWS CLI calls in a bash script, you could use a ProcessingInput to download it onto the instances, which is the same mechanism the SKLearnProcessor uses to distribute your entry point script.py across all the instances.
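A minimal sketch of the ProcessingInput route, assuming the wheels have already been uploaded to S3 (bucket name and paths are placeholders):

import sagemaker
from sagemaker import get_execution_role
from sagemaker.processing import ProcessingInput
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type='ml.t3.medium',
                                     instance_count=1)

# The wheel files are copied onto every instance before script.py starts.
sklearn_processor.run(
    code='script.py',
    inputs=[ProcessingInput(source='s3://my-bucket/wheels/',
                            destination='/opt/ml/processing/input/wheels')])

Inside script.py, the staged wheels can then be installed without internet access, for example:

import glob
import subprocess
import sys

# Install every staged wheel, resolving dependencies only from the local dir.
wheel_dir = '/opt/ml/processing/input/wheels'
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--no-index',
                       '--find-links', wheel_dir] + glob.glob(f'{wheel_dir}/*.whl'))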

How do I connect to an external Oracle database using the Python cx_Oracle package on Google App Engine Flex?

My Python App Engine Flex application needs to connect to an external Oracle database. Currently I'm using the cx_Oracle Python package which requires me to install the Oracle Instant Client.
I have successfully run this locally (on macOS) by following the Instant Client installation steps. The steps required me to do the following:
Make a directory called /opt/oracle
Create a symlink from /opt/oracle/instantclient_12_2/libclntsh.dylib.12.1 to ~/lib/
However, I am confused about how to do the same thing in App Engine Flex (instructions). Specifically, here's what I'm confused about:
The instructions say I should run sudo yum install libaio to install the libaio package. How do I do this on GAE Flex? Or is this package already available?
I think I can add the Instant Client files to GAE (a whopping ~100MB!), then set the LD_LIBRARY_PATH environment variable in app.yaml to /opt/oracle/instantclient_12_2:$LD_LIBRARY_PATH. Will this work?
Is this even feasible without using custom Docker containers on App Engine Flex?
Overall I'm not sure if I'm on the right track. Would love to hear from someone who has managed this before :)
If any of your dependencies are not available in the base GAE flex images provided by Google and cannot be installed via pip (because they're not Python packages, aren't available on PyPI, or for whatever other reason), then you can't use the requirements.txt file to get them installed in your GAE flex app.
The proper way to satisfy such dependencies would be to build your own custom runtime. From About Custom Runtimes:
Custom runtimes allow you to define new runtime environments, which might include additional components like language interpreters or application servers.
Yes, that means providing a custom Dockerfile. In your particular case you'd be installing the Instant Client and libaio inside this Dockerfile. See also Building Custom Runtimes.
Answering your first question, I think that the instructions on the Oracle website just show that you have to install that library for your application to work.
In the case of App Engine flex, the way to ensure that the libraries are present in the deployment is with the requirements.txt file. There is a documentation page which explains how to do so.
On the other hand, I will assume that the "Instant Client files" are not libraries, but data your app needs to run. You should use Google Cloud Storage to serve them, or another storage alternative within Google Cloud.
I believe that, if this is all you need for your app to work, pushing your own custom container should not be necessary.
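For completeness, once the Instant Client and libaio are available and LD_LIBRARY_PATH points at them, the application code itself is an ordinary cx_Oracle connection; host, credentials, and service name below are placeholders:

import cx_Oracle

# Placeholder connection details; LD_LIBRARY_PATH must include the
# Instant Client directory (e.g. /opt/oracle/instantclient_12_2).
dsn = cx_Oracle.makedsn('db.example.com', 1521, service_name='MYSERVICE')
connection = cx_Oracle.connect(user='app_user', password='app_password', dsn=dsn)

cursor = connection.cursor()
cursor.execute('SELECT 1 FROM dual')
print(cursor.fetchone())
connection.close()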

What does an Apache Beam Dataflow job do locally?

I'm having some issues with a Dataflow job defined with the Apache Beam Python SDK. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating packages in my requirements.txt and apache-beam are getting pip installed and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.
During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks pass, the execution graph is translated into JSON and transmitted to the Dataflow service, where the JSON graph is validated and becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service. So the graph will not show up as a job in the monitoring tool; it will run on the local machine [1]. You can follow the documentation to configure local execution [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution
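Concretely, which path is taken depends on the runner selected in the pipeline options; a minimal sketch (project and bucket names are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner (the default) executes the graph on the local machine and never
# contacts the Dataflow service; DataflowRunner serializes the graph, stages
# the requirements.txt packages, and submits the result as a job that then
# appears in the monitoring tool.
options = PipelineOptions(
    runner='DataflowRunner',                 # or 'DirectRunner' for local runs
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
)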
The packages from requirements.txt are downloaded using pip download and staged to the staging location. Dataflow uses this staging location as a cache: when pip install -r requirements.txt is called on the Dataflow workers, packages are looked up there first, reducing calls to PyPI.
