Does Apache Beam need internet to run GCP Dataflow jobs - python

I am trying to deploy a Dataflow job on a GCP VM that will have access to GCP resources but will not have internet access. When I try to run the job I get a connection timeout error, which would make sense if I were trying to connect to the internet. The code breaks because an HTTP connection is being attempted on behalf of apache-beam.
Python setup:
Before cutting off the VM, I installed all necessary packages using pip and a requirements.txt. This seemed to work because other parts of the code work fine.
The following is the error message I receive when I run the code.
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None))
after connection broken by 'ConnectTimeoutError(
<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at foo>,
'Connection to pypi.org timed out. (connect timeout=15)')': /simple/apache-beam/
Could not find a version that satisfies the requirement apache-beam==2.9.0 (from versions: )
No matching distribution found for apache-beam==2.9.0
If you are running a Python job, does it need to connect to PyPI? Is there a hack around this?

TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path via the SetupOptions sdk_location option in your Dataflow pipeline.
I was also struggling with this setup for a long time. Finally I found a solution which does not need internet access during execution.
There are probably multiple ways to do that, but the following two are rather simple.
As a precondition you'll need to create the Apache Beam SDK source archive as follows:
Clone Apache Beam GitHub
Switch to the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required Beam SDK version:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
Option 1 - Use Flex templates and copy Apache_Beam_SDK in Dockerfile
Documentation: Google Dataflow Documentation
Create a Dockerfile. It has to include a line like COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp, because /tmp is the path you will later set in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Set sdk_location to the path you copied the Apache Beam SDK archive to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
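For illustration, here is a minimal sketch of how this could look inside main.py (the runner, project, region and bucket values below are placeholders, not part of the template itself):
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    region="your-region",
    temp_location="gs://your-bucket-path/temp/",
    staging_location="gs://your-bucket-path/staging/",
)
# Point the workers at the staged SDK archive instead of downloading from pypi.org
options.view_as(SetupOptions).sdk_location = "/tmp/apache-beam-2.28.0.tar.gz"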
Build the Docker image with Cloud Build
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create a Flex template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Run the generated Flex Template in your subnetwork (if required)
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS (gs://) path directly for the sdk_location option, but that didn't work for me. The following does work, though:
Upload the previously generated archive to a bucket that the Dataflow job you'd like to execute can access, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
In your source code, download the Apache Beam SDK archive to e.g. /tmp/apache-beam-2.28.0.tar.gz:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name" (no gs:// prefix)
    # source_blob_name = "path/apache-beam-2.28.0.tar.gz"
    # destination_file_name = "/tmp/apache-beam-2.28.0.tar.gz"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
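For example, it could be called like this before the pipeline options are built (bucket and object names are placeholders):
download_blob(
    bucket_name="your-bucket-name",
    source_blob_name="beam_sdks/apache-beam-2.28.0.tar.gz",
    destination_file_name="/tmp/apache-beam-2.28.0.tar.gz",
)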
Now you can set sdk_location to the path you downloaded the SDK archive to.
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Now your Pipeline should be able to run without internet breakout.

If you run a DataflowPythonOperator in a private Cloud Composer environment, the job needs internet access to download a set of packages from the image project projects/dataflow-service-producer-prod. But within the private cluster, the VMs and GKE nodes don't have access to the internet.
To solve this problem, you need to create a Cloud NAT and a router: https://cloud.google.com/nat/docs/gke-example#step_6_create_a_nat_configuration_using
This allows your instances to send packets to the internet and receive the corresponding responses.

When you use Cloud Composer with private IP enabled, the environment doesn't have access to the internet.
To solve this:
In the GKE cluster, create a new node pool named "default-pool" (use the same name).
Under network tags, add "private".
Under security, check "Allow access to all Cloud APIs".

Related

Could not install Apache Beam SDK from a wheel: could not find a Beam SDK wheel among staged files, proceeding to install SDK from source tarball

I work in a Google Cloud environment where I don't have internet access. I'm trying to launch a Dataflow job, passing it the SDK like this:
python wordcount.py --no_use_public_ip --sdk_location "<basepath>/dist/package-import-0.0.2.tar.gz"
I generated package-import-0.0.2.tar.gz with this setup.py
import setuptools
setuptools.setup(
name='package-import',
version='0.0.2',
install_requires=[
'apache-beam==2.43.0',
'cachetools==4.2.4',
'certifi==2022.12.7',
'charset-normalizer==2.1.1',
'cloudpickle==2.2.0',
'crcmod==1.7',
'dill==0.3.1.1',
'docopt==0.6.2',
'fastavro==1.7.0',
'fasteners==0.18',
'google-api-core==2.11.0',
'google-apitools==0.5.31',
'google-auth==2.15.0',
'google-auth-httplib2==0.1.0',
'google-cloud-bigquery==3.4.1',
'google-cloud-bigquery-storage==2.13.2',
'google-cloud-bigtable==1.7.3',
'google-cloud-core==2.3.2',
'google-cloud-datastore==1.15.5',
'google-cloud-dlp==3.10.0',
'google-cloud-language==1.3.2',
'google-cloud-pubsub==2.13.11',
'google-cloud-pubsublite==1.6.0',
'google-cloud-recommendations-ai==0.7.1',
'google-cloud-spanner==3.26.0',
'google-cloud-videointelligence==1.16.3',
'google-cloud-vision==1.0.2',
'google-crc32c==1.5.0',
'google-resumable-media==2.4.0',
'googleapis-common-protos==1.57.1',
'grpc-google-iam-v1==0.12.4',
'grpcio==1.51.1',
'grpcio-status==1.51.1',
'hdfs==2.7.0',
'httplib2==0.20.4',
'idna==3.4',
'numpy==1.22.4',
'oauth2client==4.1.3',
'objsize==0.5.2',
'orjson==3.8.3',
'overrides==6.5.0',
'packaging==22.0',
'proto-plus==1.22.1',
'protobuf==3.20.3',
'pyarrow==9.0.0',
'pyasn1==0.4.8',
'pyasn1-modules==0.2.8',
'pydot==1.4.2',
'pymongo==3.13.0',
'pyparsing==3.0.9',
'python-dateutil==2.8.2',
'pytz==2022.7',
'regex==2022.10.31',
'requests==2.28.1',
'rsa==4.9',
'six==1.16.0',
'sqlparse==0.4.3',
'typing-extensions==4.4.0',
'urllib3==1.26.13',
'zstandard==0.19.0'
],
packages=setuptools.find_packages(),
)
but in the Dataflow worker log I get this error: Could not install Apache Beam SDK from a wheel: could not find a Beam SDK wheel among staged files, proceeding to install SDK from source tarball.
It then tries to download the SDK, but since the worker doesn't have internet access it can't.
My biggest problem is that the Google Cloud environment doesn't have internet access, so Dataflow can't download what it needs. Do you know of a way to pass it an sdk_location?
If you don't have internet access from your environment, here is a solution based on a custom Docker image.
- Workers
Dataflow Python can use a custom Docker image for the workers in the execution phase.
In this Docker image you can install all the needed packages and then publish it to Container Registry, for example:
FROM apache/beam_python3.8_sdk:2.44.0
# Pre-built python dependencies
RUN pip install lxml
# Pre-built other dependencies
RUN apt-get update \
&& apt-get dist-upgrade \
&& apt-get install -y --no-install-recommends ffmpeg
# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
In the Dataflow job, you then have to specify two pipeline arguments to use the image:
experiments
sdk_container_image
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdk_container_image=$IMAGE_URI
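If you prefer to set these in code rather than on the command line, a rough equivalent sketch with PipelineOptions (all values are placeholders) would be:
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=PROJECT_ID",
    "--region=REGION",
    "--temp_location=TEMP_LOCATION",
    "--experiments=use_runner_v2",
    "--sdk_container_image=IMAGE_URI",
])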
- Runner from your Google environment
The Google environment that launches the job also needs to have the packages installed in order to instantiate the job.
You need to find a way to install the packages on that machine as well. If you can use the same Docker image as for the Dataflow workers and the execution phase, that would be ideal.
I solved it using an internal proxy that allowed me to access the internet. In the command I added --no_use_public_ip and set no_proxy="metadata.google.internal,www.googleapis.com,dataflow.googleapis.com,bigquery.googleapis.com". Thanks!

No FileSystem for scheme "gs" Google Storage Connector in plain PySpark installation

I have already looked at several similar questions - here, here and some other blog posts and Stack overflow questions.
I have the PySpark script below and am looking to read data from a GCS bucket:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.getOrCreate()
bucket_name="my-gcs-bucket"
path=f"gs://{bucket_name}/path/to/file.csv"
df=spark.read.csv(path, header=True)
print(df.head())
which fails with the error -
py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
My environment setup Dockerfile is something like below:
FROM openjdk:11.0.11-jre-slim-buster
# install a whole bunch of apt-get dev essential libraries (unixodbc-dev, libgdbm-dev...)
# some other setup for other services
# copy my repository, requirements file
# install Python-3.9 and activate a venv
RUN pip install pyspark==3.3.1
There is no env variable like HADOOP_HOME, SPARK_HOME, PYSPARK_PYTHON etc. Just a plain installation of PySpark.
I have tried to run -
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.config("spark.jars.package", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar") \
.getOrCreate()
or
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
.getOrCreate()
and some other solutions, but I am still getting the same error
My questions are:
In such a setup, what do I need to do to get this script running? I have seen answers about updating pom files, core-site.xml, etc., but a plain PySpark installation does not seem to come with those files.
How can I make the jar installs/setup a default Spark setting in a PySpark-only installation? I hope to simply run this script with python path/to/file.py, without passing any arguments to spark-submit, setting it in the SparkSession config, etc.
I know that with a regular Spark installation we can add the default jars to the spark-defaults.conf file, but a plain PySpark installation does not seem to come with that file either.
Thank you in advance!
The error message No FileSystem for scheme "gs" indicates that Spark could not find the GCS connector and therefore does not understand the gs:// path to your bucket, so you will have to mount the bucket first. I suggest reviewing the Cloud Storage connector documentation to make sure your settings were applied correctly.
You can also try the following:
Authenticate your user:
from google.colab import auth
auth.authenticate_user()
Then install gcsfuse with the following snippet:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse
Then mount the bucket as follows:
!mkdir mybucket
!gcsfuse mybucket mybucket
You can then store your data to the following path:
df.write.csv('/content/my_bucket/df')
I would also recommend having a look at this thread for an example of a detailed workflow.
You can also try the below once:
To access Google Cloud Storage you have to include the Cloud Storage connector:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
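If you cannot use spark-submit and want to run the script with plain python path/to/file.py, here is a sketch of an in-code alternative. It assumes the connector jar has already been downloaded to a local path (the path and version below are placeholders): point spark.jars at the jar and set the Hadoop filesystem classes with the spark.hadoop. prefix in the SparkSession builder.
from pyspark.sql import SparkSession

# Local path to the downloaded GCS connector jar (placeholder)
gcs_jar = "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar"

spark = (
    SparkSession.builder
    .appName("GCSFilesRead")
    # spark.jars takes a comma-separated list of local jar paths
    .config("spark.jars", gcs_jar)
    # Hadoop options need the "spark.hadoop." prefix when set through the builder
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .getOrCreate()
)

df = spark.read.csv("gs://my-gcs-bucket/path/to/file.csv", header=True)
print(df.head())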

How to load aws profile when logged via saml2aws inside a docker container?

This is a contrived example for the question. I log in locally via saml2aws to generate access keys for my AWS account and issue commands via the AWS CLI. Once I log in via SAML, I use a profile that I have set up to connect to resources in AWS. I am putting those commands in an app.py file and adding a Dockerfile with Python and boto3 installed. From within the container I need to access my_profile set up for AWS on my local machine. How do I access this, or do I need to create/copy the AWS credentials file into the container, and if so, how?
dockerfile
FROM python:alpine
RUN apk add --no-cache --virtual-dependencies python3 \
...
&& python3 -m ensurepip \
...
&& pip install boto3 ...
app.py
import boto3
my_session = boto3.session.Session(profile_name="my_profile")
aws_client = my_session.client('s3', region_name='us-east-1', config=...)
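One way to handle this (a sketch, not a definitive setup) is to mount the host's ~/.aws directory into the container, e.g. docker run -v ~/.aws:/root/.aws:ro -e AWS_PROFILE=my_profile your-image, so the temporary credentials written by saml2aws on the host are visible to boto3 inside the container. The Python side might then look like this (the mount path and profile name are assumptions):
import os
import boto3

# Assumes the host's ~/.aws is mounted at /root/.aws inside the container,
# so boto3 can read the profile that saml2aws refreshed on the host.
profile = os.environ.get("AWS_PROFILE", "my_profile")
my_session = boto3.session.Session(profile_name=profile)
aws_client = my_session.client("s3", region_name="us-east-1")
print(aws_client.list_buckets()["Buckets"])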

Installation DVC on MinIO storage

Has anybody set up DVC with MinIO storage?
I have read the docs but not everything is clear to me.
Which commands should I use to set up MinIO storage with these connection parameters:
storage url: https://minio.mysite.com/minio/bucket-name/
login: my_login
password: my_password
Install
I usually use DVC as a Python package; in this case you need to install:
pip install "dvc[s3]"
Setup remote
By default DVC supports AWS S3 storage, and it works fine.
It also supports "S3-compatible storage", and MinIO in particular. In this case you have a bucket - a directory on the MinIO server where the actual data is stored (similar to an AWS bucket). DVC uses your AWS credentials (the same credential chain the AWS CLI uses) to authenticate, and in the case of MinIO you need to pass the credentials to DVC (not to the minio package).
The commands to set up MinIO as your DVC remote:
# setup default remote (change "bucket-name" to your MinIO bucket name)
dvc remote add -d minio s3://bucket-name -f
# add information about storage url (where "https://minio.mysite.com" is your MinIO url)
dvc remote modify minio endpointurl https://minio.mysite.com
# add MinIO credentials (e.g. from env. variables)
dvc remote modify minio access_key_id my_login
dvc remote modify minio secret_access_key my_password
If you are moving from an old remote, use the following commands to migrate your data:
Before setup (download the old remote's cache to your local machine - note this may take a long time):
dvc pull -r <old_remote_name> --all-commits --all-tags --all-branches
After setup (upload all local cache data to a new remote):
dvc push -r <new_remote_name> --all-commits --all-tags --all-branches
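As a quick sanity check from Python, you can read a tracked file through the new remote with DVC's Python API (a sketch; the file path and remote name are placeholders for whatever your repo actually tracks):
import dvc.api

# Opens a DVC-tracked file, fetching the data from the "minio" remote if needed.
with dvc.api.open("data/train.csv", remote="minio") as f:
    print(f.readline())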

How to store artifacts on a server running MLflow

I define the following docker image:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
and build an image called mlflow-server. Next, I start this server from a local machine:
docker run --rm -it -p 5000:5000 -v ${PWD}/mlruns/:/mnt/mlruns mlflow-server
Next, I define the following function:
import os
import mlflow

def foo(x, with_af=False):
    mlflow.start_run()
    mlflow.log_param("x", x)
    print(x)
    if with_af:
        with open(str(x), 'wb') as fout:
            fout.write(os.urandom(1024))
        mlflow.log_artifact(str(x))
        mlflow.log_artifact('./foo.data')
    mlflow.end_run()
From the same directory I run foo(10) and the parameter is logged correctly. However, foo(10, True) yields the following error: PermissionError: [Errno 13] Permission denied: '/mnt'. Seems like log_artifact tries to save the file on the local file system directly.
Any idea what I am doing wrong?
Good question. Just to make sure, sounds like you're already configuring MLflow to talk to your tracking server when running your script, e.g. via MLFLOW_TRACKING_URI=http://localhost:5000 python my-script.py.
Artifact Storage in MLflow
Artifacts differ subtly from other run data (metrics, params, tags) in that the client, rather than the server, is responsible for persisting them. The current flow (as of MLflow 0.6.0) is:
User code calls mlflow.start_run
MLflow client makes an API request to the tracking server to create a run
Tracking server determines an appropriate root artifact URI for the run (currently: runs' artifact roots are subdirectories of their parent experiment's artifact root directories)
Tracking server persists run metadata (including its artifact root) & returns a Run object to the client
User code calls log_artifact
Client logs artifacts under the active run's artifact root
The issue
When you launch an MLflow server via mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/, the server logs metrics and parameters under /mnt/mlruns in the docker container, and also returns artifact paths under /mnt/mlruns to the client. The client then attempts to log artifacts under /mnt/mlruns on the local filesystem, which fails with the PermissionError you encountered.
The fix
The best practice for artifact storage with a remote tracking server is to configure the server to use an artifact root accessible to both clients and the server (e.g. an S3 bucket or Azure Blob Storage URI). You can do this via mlflow server --default-artifact-root [artifact-root].
Note that the server uses this artifact root only when assigning artifact roots to newly-created experiments - runs created under existing experiments will use an artifact root directory under the existing experiment's artifact root. See the MLflow Tracking guide for more info on configuring your tracking server.
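To illustrate the client side of that setup, here is a hedged sketch (the tracking URI, and the assumption that the server was started with something like --default-artifact-root s3://your-bucket/mlruns, are placeholders): the client resolves the run's artifact root from the server and uploads the file itself, so it never needs to write to the server's local /mnt/mlruns.
import os
import mlflow

# Point the client at the remote tracking server
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_param("x", 10)
    with open("10", "wb") as fout:
        fout.write(os.urandom(1024))
    # The client uploads the file directly to the run's artifact root
    # (e.g. the S3 bucket), not to a path on the tracking server's filesystem.
    mlflow.log_artifact("10")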
I had the same issue, try:
sudo chmod 755 -R /mnt/mlruns
docker run --rm -it -p 5000:5000 -v /mnt/mlruns:/mnt/mlruns mlflow-server
I had to create a folder with the exact same path as in the Docker container and change its permissions.
I did the same inside Docker:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
RUN mkdir /mnt/mlruns/
RUN chmod 777 -R /mnt/mlruns/
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
