I work in a Google Cloud environment where I don't have internet access. I'm trying to launch a Dataflow job, passing it the SDK like this:
python wordcount.py --no_use_public_ip --sdk_location "<basepath>/dist/package-import-0.0.2.tar.gz"
I generated package-import-0.0.2.tar.gz with this setup.py:
import setuptools

setuptools.setup(
    name='package-import',
    version='0.0.2',
    install_requires=[
        'apache-beam==2.43.0',
        'cachetools==4.2.4',
        'certifi==2022.12.7',
        'charset-normalizer==2.1.1',
        'cloudpickle==2.2.0',
        'crcmod==1.7',
        'dill==0.3.1.1',
        'docopt==0.6.2',
        'fastavro==1.7.0',
        'fasteners==0.18',
        'google-api-core==2.11.0',
        'google-apitools==0.5.31',
        'google-auth==2.15.0',
        'google-auth-httplib2==0.1.0',
        'google-cloud-bigquery==3.4.1',
        'google-cloud-bigquery-storage==2.13.2',
        'google-cloud-bigtable==1.7.3',
        'google-cloud-core==2.3.2',
        'google-cloud-datastore==1.15.5',
        'google-cloud-dlp==3.10.0',
        'google-cloud-language==1.3.2',
        'google-cloud-pubsub==2.13.11',
        'google-cloud-pubsublite==1.6.0',
        'google-cloud-recommendations-ai==0.7.1',
        'google-cloud-spanner==3.26.0',
        'google-cloud-videointelligence==1.16.3',
        'google-cloud-vision==1.0.2',
        'google-crc32c==1.5.0',
        'google-resumable-media==2.4.0',
        'googleapis-common-protos==1.57.1',
        'grpc-google-iam-v1==0.12.4',
        'grpcio==1.51.1',
        'grpcio-status==1.51.1',
        'hdfs==2.7.0',
        'httplib2==0.20.4',
        'idna==3.4',
        'numpy==1.22.4',
        'oauth2client==4.1.3',
        'objsize==0.5.2',
        'orjson==3.8.3',
        'overrides==6.5.0',
        'packaging==22.0',
        'proto-plus==1.22.1',
        'protobuf==3.20.3',
        'pyarrow==9.0.0',
        'pyasn1==0.4.8',
        'pyasn1-modules==0.2.8',
        'pydot==1.4.2',
        'pymongo==3.13.0',
        'pyparsing==3.0.9',
        'python-dateutil==2.8.2',
        'pytz==2022.7',
        'regex==2022.10.31',
        'requests==2.28.1',
        'rsa==4.9',
        'six==1.16.0',
        'sqlparse==0.4.3',
        'typing-extensions==4.4.0',
        'urllib3==1.26.13',
        'zstandard==0.19.0'
    ],
    packages=setuptools.find_packages(),
)
but in the Dataflow worker log I get this error: Could not install Apache Beam SDK from a wheel: could not find a Beam SDK wheel among staged files, proceeding to install SDK from source tarball.
Then it tries to download the SDK, but since the workers don't have internet access, it can't.
My biggest problem is that the Google Cloud environment doesn't have internet access, so Dataflow can't download what it needs. Do you know of a way to pass it an sdk_location?
If you don't have internet access from your environment, here is a solution based on a Docker image.
- Workers
Dataflow Python can use a custom Docker image in the execution phase, when creating the workers.
In this Docker image you can install all the needed packages, then publish it to Container Registry, for example:
FROM apache/beam_python3.8_sdk:2.44.0
# Pre-built python dependencies
RUN pip install lxml
# Pre-built other dependencies
RUN apt-get update \
&& apt-get dist-upgrade -y \
&& apt-get install -y --no-install-recommends ffmpeg
# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
In the Dataflow job, you have to specify two program arguments to use the image:
experiments
sdk_container_image
python -m apache_beam.examples.wordcount \
--input=INPUT_FILE \
--output=OUTPUT_FILE \
--project=PROJECT_ID \
--region=REGION \
--temp_location=TEMP_LOCATION \
--runner=DataflowRunner \
--disk_size_gb=DISK_SIZE_GB \
--experiments=use_runner_v2 \
--sdk_container_image=$IMAGE_URI
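If you prefer to set these options in the pipeline code instead of on the command line, a minimal sketch could look like the following. Note that the project, region, bucket and image URI below are placeholders, not values from the question:
# Sketch only: project, region, bucket and image URI are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",                # placeholder
    "--region=europe-west1",               # placeholder
    "--temp_location=gs://my-bucket/tmp",  # placeholder
    "--experiments=use_runner_v2",
    "--sdk_container_image=eu.gcr.io/my-project/beam-custom:2.44.0",  # placeholder
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "world"])
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output"))  # placeholder path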
- Runner from your Google environment
The Google environment that launches the job also needs to have the packages installed in order to be able to construct and submit the job.
You need to find a way to install the packages on the machines in your environment. If you can reuse the same Docker image as for the Dataflow workers and the execution phase, that would be perfect.
I solved it using an internal proxy that allowed me to access the internet. I added --no_use_public_ip to the command and set no_proxy="metadata.google.internal,www.googleapis.com,dataflow.googleapis.com,bigquery.googleapis.com". Thanks!
I have already looked at several similar questions (here and here) and some other blog posts and Stack Overflow questions.
I have the PySpark script below and I'm looking to read data from a GCS bucket:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.getOrCreate()
bucket_name="my-gcs-bucket"
path=f"gs://{bucket_name}/path/to/file.csv"
df=spark.read.csv(path, header=True)
print(df.head())
which fails with the error -
py4j.protocol.Py4JJavaError: An error occurred while calling o29.csv.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "gs"
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
My environment-setup Dockerfile is something like the one below:
FROM openjdk:11.0.11-jre-slim-buster
# install a whole bunch of apt-get dev essential libraries (unixodbc-dev, libgdbm-dev...)
# some other setup for other services
# copy my repository, requirements file
# install Python-3.9 and activate a venv
RUN pip install pyspark==3.3.1
There are no env variables like HADOOP_HOME, SPARK_HOME, PYSPARK_PYTHON, etc. - just a plain installation of PySpark.
I have tried to run -
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.config("spark.jars.package", "/path/to/jar/gcs-connector-hadoop3-2.2.10.jar") \
.getOrCreate()
or
spark = SparkSession.builder\
.appName("GCSFilesRead")\
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")\
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")\
.getOrCreate()
and some other solutions, but I am still getting the same error
My questions are:
In such a setup, what do I need to do to get this script running? I have seen answers about updating pom files, the core-site.xml file, etc., but it looks like a simple PySpark installation does not come with those files.
How can I make the jar install/setup a default Spark setting in a PySpark-only installation? I would like to simply run the script - python path/to/file.py - without passing any arguments via spark-submit, setting it in the SparkSession config, etc.
I know that with a regular Spark installation we can add the default jars to the spark-defaults.conf file, but it looks like a plain PySpark installation does not come with that file either.
Thank you in advance!
The error message No FileSystem for scheme "gs" indicates that Spark does not understand the path to your bucket (gs://) and couldn't find the GCS connector, so you will have to mount the bucket first. I suggest reviewing the Cloud Storage connector documentation to make sure that your settings were applied correctly.
You can also do the following:
Authenticate your user:
from google.colab import auth
auth.authenticate_user()
Then install gcsfuse with the following snippet:
!echo "deb http://packages.cloud.google.com/apt gcsfuse-bionic main" > /etc/apt/sources.list.d/gcsfuse.list
!curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
!apt -qq update
!apt -qq install gcsfuse
Then mount the bucket as follows:
!mkdir mybucket
!gcsfuse mybucket mybucket
You can then store your data at the following path:
df.write.csv('/content/my_bucket/df')
I would also recommend having a look at this thread for an example of a detailed workflow.
You can also try the following:
To access Google Cloud Storage you have to include the Cloud Storage connector jar:
spark-submit --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar your-pyspark-script.py
or
pyspark --jars /path/to/gcs/gcs-connector-latest-hadoop2.jar
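If you want everything configured inside the script itself (so you can just run python path/to/file.py), a sketch along the following lines should work. It assumes the shaded connector jar has been downloaded into the image and a service account key file is available - the jar path and key file path are made up for the example:
from pyspark.sql import SparkSession

# Sketch only: the jar path and key file path below are example values.
spark = (
    SparkSession.builder
    .appName("GCSFilesRead")
    .config("spark.jars", "/opt/jars/gcs-connector-hadoop3-2.2.10-shaded.jar")
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/secrets/key.json")
    .getOrCreate()
)

df = spark.read.csv("gs://my-gcs-bucket/path/to/file.csv", header=True)
print(df.head())
Note the spark.hadoop. prefix on the Hadoop settings (so they reach the Hadoop configuration) and spark.jars, rather than spark.jars.package, for a local jar path.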
Example contrived for the question. I log in locally via saml2aws to generate access keys for my AWS account and issue commands via the AWS CLI. Once I log in via SAML, I use a profile that I have set up to connect to resources in AWS. I am putting those commands in an app.py file and adding a Dockerfile with Python and boto3 installed. From within the container, I need to access my_profile set up for AWS on my local machine. How do I access it? Or do I need to create/copy the AWS credentials file into the container, and if so, how?
Dockerfile
FROM python:alpine
RUN apk add --no-cache --virtual-dependencies python3 \
...
&& python3 -m ensurepip \
...
&& pip install boto3 ...
app.py
import boto3
my_session = boto3.session.Session(profile_name="my_profile")
aws_client = my_session.client('s3', region_name='us-east-1', config=...)
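One common approach (a sketch, not from the question - my-image is a placeholder name) is to mount the host's ~/.aws directory read-only into the container, e.g. docker run -v ~/.aws:/root/.aws:ro my-image, so boto3 can resolve my_profile inside the container exactly as it does on the host:
import boto3

# Assumes the host's ~/.aws was mounted at /root/.aws in the container
# (root's home directory), so boto3 finds the shared credentials/config
# files in their default location and can use the named profile.
my_session = boto3.session.Session(profile_name="my_profile")
s3 = my_session.client("s3", region_name="us-east-1")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
Alternatively, pass the temporary keys that saml2aws generated into the container as the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN environment variables and drop profile_name entirely.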
Has anybody set up DVC with MinIO storage?
I have read the docs, but not everything is clear to me.
Which commands should I use to set up MinIO storage with these input parameters:
storage url: https://minio.mysite.com/minio/bucket-name/
login: my_login
password: my_password
Install
I usually use it as a Python package; in this case you need to install:
pip install "dvc[s3]"
Setup remote
By default DVC supports AWS S3 storage and it works fine.
It also supports "S3-compatible storage", and MinIO in particular. In this case you have a bucket - a directory on a MinIO server where the actual data is stored (similar to an AWS bucket). For AWS, DVC authenticates with your standard AWS credentials; in the case of MinIO you need to pass the credentials to dvc (not to the minio package).
The commands to set up MinIO as your DVC remote:
# set up the default remote (change "bucket-name" to your MinIO bucket name)
dvc remote add -d minio s3://bucket-name -f
# add information about storage url (where "https://minio.mysite.com" is your MinIO url)
dvc remote modify minio endpointurl https://minio.mysite.com
# add MinIO credentials (e.g. from env. variables)
dvc remote modify minio access_key_id my_login
dvc remote modify minio secret_access_key my_password
If you are moving from an old remote, use the following commands to migrate your data:
Before the setup (download the old remote's cache to the local machine - note that it may take a long time):
dvc pull -r <old_remote_name> --all-commits --all-tags --all-branches
After setup (upload all local cache data to a new remote):
dvc push -r <new_remote_name> --all-commits --all-tags --all-branches
I define the following Docker image:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
and build an image called mlflow-server. Next, I start this server from a local machine:
docker run --rm -it -p 5000:5000 -v ${PWD}/mlruns/:/mnt/mlruns mlflow-server
Next, I define the following function:
import os
import mlflow

def foo(x, with_af=False):
    mlflow.start_run()
    mlflow.log_param("x", x)
    print(x)
    if with_af:
        with open(str(x), 'wb') as fout:
            fout.write(os.urandom(1024))
        mlflow.log_artifact(str(x))
        mlflow.log_artifact('./foo.data')
    mlflow.end_run()
From the same directory I run foo(10) and the parameter is logged correctly. However, foo(10, True) yields the following error: PermissionError: [Errno 13] Permission denied: '/mnt'. Seems like log_artifact tries to save the file on the local file system directly.
Any idea what am I doing wrong?
Good question. Just to make sure, sounds like you're already configuring MLflow to talk to your tracking server when running your script, e.g. via MLFLOW_TRACKING_URI=http://localhost:5000 python my-script.py.
Artifact Storage in MLflow
Artifacts differ subtly from other run data (metrics, params, tags) in that the client, rather than the server, is responsible for persisting them. The current flow (as of MLflow 0.6.0) is:
User code calls mlflow.start_run
MLflow client makes an API request to the tracking server to create a run
Tracking server determines an appropriate root artifact URI for the run (currently: runs' artifact roots are subdirectories of their parent experiment's artifact root directories)
Tracking server persists run metadata (including its artifact root) & returns a Run object to the client
User code calls log_artifact
Client logs artifacts under the active run's artifact root
The issue
When you launch an MLflow server via mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/, the server logs metrics and parameters under /mnt/mlruns in the docker container, and also returns artifact paths under /mnt/mlruns to the client. The client then attempts to log artifacts under /mnt/mlruns on the local filesystem, which fails with the PermissionError you encountered.
The fix
The best practice for artifact storage with a remote tracking server is to configure the server to use an artifact root accessible to both clients and the server (e.g. an S3 bucket or Azure Blob Storage URI). You can do this via mlflow server --default-artifact-root [artifact-root].
Note that the server uses this artifact root only when assigning artifact roots to newly-created experiments - runs created under existing experiments will use an artifact root directory under the existing experiment's artifact root. See the MLflow Tracking guide for more info on configuring your tracking server.
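For example (a sketch with a made-up bucket name), you could start the server with mlflow server --host 0.0.0.0 --file-store /mnt/mlruns --default-artifact-root s3://my-mlflow-artifacts and leave the client code unchanged apart from pointing it at the tracking server:
import os
import mlflow

# Client-side sketch: the tracking server records metrics/params, while the
# client uploads artifacts directly to the shared artifact root
# (s3://my-mlflow-artifacts is a made-up bucket the client must be able to reach).
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_param("x", 10)
    with open("10", "wb") as fout:
        fout.write(os.urandom(1024))
    mlflow.log_artifact("10")  # uploaded to the artifact root, not /mnt/mlruns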
I had the same issue, try:
sudo chmod 755 -R /mnt/mlruns
docker run --rm -it -p 5000:5000 -v /mnt/mlruns:/mnt/mlruns mlflow-server
I had to create a folder on the host with the exact same path as inside the Docker container and change its permissions.
I did the same inside Docker:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
RUN mkdir /mnt/mlruns/
RUN chmod 777 -R /mnt/mlruns/
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/