Has anybody set up DVC with MinIO storage?
I have read the docs, but not everything is clear to me.
Which commands should I use to set up MinIO storage with these input parameters:
storage url: https://minio.mysite.com/minio/bucket-name/
login: my_login
password: my_password
Install
I usually use DVC as a Python package; in that case you need to install:
pip install "dvc[s3]"
Setup remote
By default DVC supports AWS S3 storage, and it works fine.
It also supports "S3-compatible storage", and MinIO in particular. In this case you have a bucket, i.e. a directory on the MinIO server where the actual data is stored (similar to an AWS S3 bucket). For AWS, DVC picks up the same credentials that the AWS CLI uses; in the case of MinIO you need to pass the credentials to dvc itself (not to the minio package).
The commands to set up MinIO as your DVC remote:
# setup the default remote (change "bucket-name" to your MinIO bucket name)
dvc remote add -d minio s3://bucket-name -f
# add information about storage url (where "https://minio.mysite.com" is your MinIO url)
dvc remote modify minio endpointurl https://minio.mysite.com
# add MinIO credentials (e.g. from env. variables)
dvc remote modify minio access_key_id my_login
dvc remote modify minio secret_access_key my_password
If you are migrating from an old remote, use the following commands to move your data:
Before the setup (download the old remote's cache to the local machine; note that it may take a long time):
dvc pull -r <old_remote_name> --all-commits --all-tags --all-branches
After setup (upload all local cache data to a new remote):
dvc push -r <new_remote_name> --all-commits --all-tags --all-branches
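If you want to sanity-check the endpoint and credentials outside of DVC, a minimal boto3 sketch (reusing the placeholder URL, login, password, and bucket name from above) could look like this:
import boto3

# The same endpoint and credentials that were passed to `dvc remote modify`.
s3 = boto3.client(
    's3',
    endpoint_url='https://minio.mysite.com',
    aws_access_key_id='my_login',
    aws_secret_access_key='my_password',
)

# If this lists objects without errors, DVC should be able to push and pull
# with the same settings.
print(s3.list_objects_v2(Bucket='bucket-name', MaxKeys=5).get('Contents', []))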
Related
The example is contrived for the question. I log in locally via saml2aws to generate access keys for my AWS account and issue commands via the AWS CLI. Once I log in via SAML, I use a profile that I have set up to connect to resources in AWS. I am putting those commands in an app.py file and adding a Dockerfile with Python and boto3 installed. From within the container I need to access my_profile, which is set up for AWS on my local machine. How do I access it, or do I need to create/copy the AWS credentials file into the container, and if so, how?
dockerfile
FROM python:alpine
RUN apk add --no-cache --virtual-dependencies python3 \
...
&& python3 -m ensurepip \
...
&& pip install boto3 ...
app.py
import boto3
my_session = boto3.session.Session(profile_name="my_profile")
aws_client = my_session.client('s3', region_name='us-east-1', config=...)
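One possible approach (a sketch, not the only option) is to skip the named profile inside the container and rely on boto3's default credential chain, passing the credentials as environment variables (docker run -e AWS_ACCESS_KEY_ID=... -e AWS_SECRET_ACCESS_KEY=... -e AWS_SESSION_TOKEN=...); mounting your local ~/.aws directory into the container as a volume is another common option. A minimal sketch of the environment-variable variant:
import boto3

# No profile_name here: boto3 falls back to its default credential chain,
# which includes the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY /
# AWS_SESSION_TOKEN environment variables passed into the container.
my_session = boto3.session.Session()
aws_client = my_session.client('s3', region_name='us-east-1')
print(aws_client.list_buckets()['Buckets'])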
I have a Python application that I want to connect to SQS and receive messages from. I'm trying to run this application on an EC2 instance via Docker, but when I do I get botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://queue.amazonaws.com/".
I have tested it in four different scenarios: running on my local machine from Python directly, running on my local machine from Docker, running on EC2 from Python directly, and running on EC2 from Docker. In the first three scenarios I get no errors, so I don't think it is related to AWS permissions. Here is the example I'm trying to run that produces the error:
#!/bin/python3
import boto3

session = boto3.Session()
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()
access_key = credentials.access_key
secret_key = credentials.secret_key

sqs_client = boto3.client('sqs')

print('access_key: %s' % access_key)
print('secret_key: %s' % secret_key)

while True:
    try:
        response = sqs_client.receive_message(
            QueueUrl='https://queue.amazonaws.com/blablabla/my-queue',
            WaitTimeSeconds=10,
            MaxNumberOfMessages=10
        )
        print(response)
    except Exception as error:
        print(error)
Here is my Dockerfile:
FROM python:3.6-alpine3.6
COPY ./requirements.txt /my_app/requirements.txt
COPY ./build/tmp/id_rsa /root/.ssh/id_rsa
RUN chmod 400 /root/.ssh/id_rsa && \
    ssh-keyscan -H bitbucket.org > /root/.ssh/known_hosts && \
    pip3 install --no-cache-dir -r /my_app/requirements.txt && \
    rm -rf /root/.ssh/
COPY ./src /my_app/src
WORKDIR /root/
EXPOSE 6092
VOLUME ["/root/"]
ENTRYPOINT ["python3", "/my_app/src/main.py"]
The access_key and secret_key are correct when running via docker from EC2, so it's probably not related to credentials. What am I missing?
It turned out that my application couldn't resolve any domain name. After digging a lot, I discovered that the EC2 subnet was conflicting with the Docker Swarm subnet: both were using the 10.0.0.2 address. The workaround was to add
nameserver 8.8.8.8
nameserver 8.8.4.4
to the /etc/resolv.conf. Not the ideal solution, but it worked.
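If you run into something similar, a quick way to check whether the container can resolve DNS at all is a short Python snippet like the following (the hostnames are just examples):
import socket

# If these lookups raise socket.gaierror inside the container but succeed
# on the host, the container cannot resolve DNS.
for host in ("queue.amazonaws.com", "sqs.us-east-1.amazonaws.com"):
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror as exc:
        print(host, "-> resolution failed:", exc)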
I am trying to deploy a Dataflow job on a GCP VM that will have access to GCP resources but no internet access. When I try to run the job I get a connection timeout error, which would make sense if I were trying to reach the internet. The code breaks because an HTTP connection is being attempted on behalf of Apache Beam.
Python setup:
Before cutting off the VM's internet access, I installed all necessary packages using pip and a requirements.txt. This seems to have worked, because other parts of the code run fine.
The following is the error message I receive when I run the code.
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None))
after connection broken by 'ConnectTimeoutError(
<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at foo>,
'Connection to pypi.org timed out. (connect timeout=15)')': /simple/apache-beam/
Could not find a version that satisfies the requirement apache-beam==2.9.0 (from versions: )
No matching distribution found for apache-beam==2.9.0
If you are running a Python job, does it need to connect to PyPI? Is there a hack around this?
TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path as the sdk_location setup option in your Dataflow pipeline.
I was also struggling with this setup for a long time. Finally I found a solution which does not need internet access during execution.
There are probably multiple ways to do that, but the following two are rather simple.
As a precondition you'll need to create the apache-beam-sdk source archive as follows:
Clone the Apache Beam GitHub repository
Switch to the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required Beam SDK version:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation: Google Dataflow documentation
Create a Dockerfile. You have to include the line COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp, because /tmp is the path you will set in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must
# install libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
    libffi-dev \
    && rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Set sdk_location to the path you copied the apache-beam-2.28.0.tar.gz to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
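For context, here is a minimal, hypothetical sketch of where this line sits in the pipeline code; the project, region, bucket path, and the pipeline itself are placeholders, and in a real Flex Template these options usually come from the command line:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# Placeholder options; a real job needs your actual project, region, and paths.
options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    region="yourRegion",
    temp_location="gs://your-bucket-path/temp/",
)

# Point the runner at the locally copied SDK archive instead of pypi.org.
options.view_as(SetupOptions).sdk_location = "/tmp/apache-beam-2.28.0.tar.gz"

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create(["hello", "world"]) | beam.Map(print)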
Build the Docker image with Cloud Build
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create a Flex template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
  --image=gcr.io/your-project-id/image-name:tag \
  --sdk-language=PYTHON \
  --metadata-file=metadata.json
Run the generated Flex Template in your subnetwork (if required)
gcloud dataflow flex-template run "your-dataflow-job-name" \
  --template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
  --parameters staging_location="gs://your-bucket-path/staging/" \
  --parameters temp_location="gs://your-bucket-path/temp/" \
  --service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
  --region="yourRegion" \
  --max-workers=6 \
  --subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
  --disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS (gs://) path directly for the sdk_location option, but that didn't work for me. The following should work, though:
Upload the previously generated archive to a bucket that the Dataflow job you'd like to execute can access, probably to something like gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
In your source code, download the Apache Beam SDK archive to e.g. /tmp/apache-beam-2.28.0.tar.gz:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    # The bucket name is passed without the gs:// prefix.
    bucket = storage_client.bucket(bucket_name)
    # Construct a client-side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
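Called with the example bucket and paths from above, the usage could look like this:
# Bucket and object names here match the example gs:// path above.
download_blob(
    bucket_name="yourbucketname",
    source_blob_name="beam_sdks/apache-beam-2.28.0.tar.gz",
    destination_file_name="/tmp/apache-beam-2.28.0.tar.gz",
)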
Now you can set sdk_location to the path where you downloaded the SDK archive:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Now your pipeline should be able to run without internet access.
If you run a DataflowPythonOperator in a private Cloud Composer environment, the job needs internet access to download a set of packages from the image project projects/dataflow-service-producer-prod. But within the private cluster, the VMs and GKE nodes don't have access to the internet.
To solve this problem, you need to create a Cloud NAT and a router: https://cloud.google.com/nat/docs/gke-example#step_6_create_a_nat_configuration_using
This will allow your instances to send packets to the internet and receive the corresponding inbound responses.
When we use Google Cloud Composer with Private IP enabled, we don't have access to the internet.
To solve this:
Create a GKE cluster with a new node pool named "default-pool" (use the same name).
In the network tags, add "private".
Under security, check "Allow access to all Cloud APIs".
I define the following docker image:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
and build an image called mlflow-server. Next, I start this server from a local machine:
docker run --rm -it -p 5000:5000 -v ${PWD}/mlruns/:/mnt/mlruns mlflow-server
Next, I define the following function:
import os
import mlflow

def foo(x, with_af=False):
    mlflow.start_run()
    mlflow.log_param("x", x)
    print(x)
    if with_af:
        with open(str(x), 'wb') as fout:
            fout.write(os.urandom(1024))
        mlflow.log_artifact(str(x))
        mlflow.log_artifact('./foo.data')
    mlflow.end_run()
From the same directory I run foo(10) and the parameter is logged correctly. However, foo(10, True) yields the following error: PermissionError: [Errno 13] Permission denied: '/mnt'. It seems like log_artifact tries to save the file directly on the local file system.
Any idea what I am doing wrong?
Good question. Just to make sure: it sounds like you're already configuring MLflow to talk to your tracking server when running your script, e.g. via MLFLOW_TRACKING_URI=http://localhost:5000 python my-script.py.
Artifact Storage in MLflow
Artifacts differ subtly from other run data (metrics, params, tags) in that the client, rather than the server, is responsible for persisting them. The current flow (as of MLflow 0.6.0) is:
User code calls mlflow.start_run
MLflow client makes an API request to the tracking server to create a run
Tracking server determines an appropriate root artifact URI for the run (currently: runs' artifact roots are subdirectories of their parent experiment's artifact root directories)
Tracking server persists run metadata (including its artifact root) & returns a Run object to the client
User code calls log_artifact
Client logs artifacts under the active run's artifact root
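As a small illustration of this flow (the tracking URI assumes the local server from the question), the client can print the artifact root the server assigned to a run:
import mlflow

# Point the client at the tracking server.
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run() as run:
    # This artifact URI is chosen by the server, but any upload to it is
    # performed by the client, which therefore needs access to that location.
    print(run.info.artifact_uri)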
The issue
When you launch an MLflow server via mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/, the server logs metrics and parameters under /mnt/mlruns in the docker container, and also returns artifact paths under /mnt/mlruns to the client. The client then attempts to log artifacts under /mnt/mlruns on the local filesystem, which fails with the PermissionError you encountered.
The fix
The best practice for artifact storage with a remote tracking server is to configure the server to use an artifact root accessible to both clients and the server (e.g. an S3 bucket or Azure Blob Storage URI). You can do this via mlflow server --default-artifact-root [artifact-root].
Note that the server uses this artifact root only when assigning artifact roots to newly-created experiments - runs created under existing experiments will use an artifact root directory under the existing experiment's artifact root. See the MLflow Tracking guide for more info on configuring your tracking server.
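As an illustration (the S3 artifact root is just a placeholder), client code talking to a server started with mlflow server --host 0.0.0.0 --default-artifact-root s3://my-mlflow-artifacts could look roughly like this:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_param("x", 10)         # stored by the tracking server
    mlflow.log_artifact("foo.data")   # uploaded by the client to the artifact root
Because the client uploads artifacts itself, it also needs credentials for the artifact store (e.g. AWS credentials for an S3 artifact root).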
I had the same issue, try:
sudo chmod 755 -R /mnt/mlruns
docker run --rm -it -p 5000:5000 -v /mnt/mlruns:/mnt/mlruns mlflow-server
I had to create a folder with the exact same path as in the Docker container and change its permissions.
I did the same inside Docker:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
RUN mkdir /mnt/mlruns/
RUN chmod 777 -R /mnt/mlruns/
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
I did not find any documentation on how to copy a VHD on Azure using the Python API; can anyone help me? Also, I have tried to create an instance from an image in the VM Depot community, but when I run the following command I get this error:
$ azure vm create instanceahmed -o vmdepot-14776-1-1 -l "West US" ahmed P#ssw0rd --ssh 22 --verbose
......
verbose: Creating VM
verbose: Deleting image
info: VM image deleted: vmdepot-14776-1-1-c5febcb3
verbose: Uri : http://portalvhdsf4048vkh9c007.blob.core.windows.net/vm-images/community-23970-525c8c75-8901-4870-a937-7277414a6eaa-1.vhd
info: Blob deleted: http://portalvhdsf4048vkh9c007.blob.core.windows.net/vm-images/community-23970-525c8c75-8901-4870-a937-7277414a6eaa-1.vhd
info: vm create command OK
I tried to do what "-o" does with Azure using these steps:
Get the URL of your image on the community:
http://vmdepoteastus.blob.core.windows.net/linux-community-store/community-23970-525c8c75-8901-4870-a937-7277414a6eaa-1.vhd
Create a new affinity group (if you already have one, skip this step):
$ azure account affinity-group create mystoragegroup --location "West US"
Create a new storage account (if you already have one, skip this step):
$ azure storage account create mystorageazure --affinity-group mystoragegroup
Get the primary secret key of the new storage account using this command:
$ azure account storage keys list mystorageazure
output:
data: Primary: Your SECRET STORAGE KEY
Create a new storage container inside this storage account:
$ azure storage container create --permission Blob -a mystorageazure -k Your SECRET STORAGE KEY mycontainerazure
Upload your image to your storage container:
$ azure vm disk upload --verbose http://vmdepoteastus.blob.core.windows.net/linux-community-store/community-23970-525c8c75-8901-4870-a937-7277414a6eaa-1.vhd http://mystorageazure.blob.core.windows.net/mycontainerazure/elastichpcvm.vhd Your SECRET STORAGE KEY
Create your local image:
$ azure vm image create mystorageimage --location "West US" --blob-url http://mystorageazure.blob.core.windows.net/mycontainerazure/elastichpcvm.vhd --os linux
Create your virtual machine using your local image:
$ azure vm create mystoragemachine mystorageimage ahmed P#ssw0rd --location "West US" --ssh 22
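The question also asked about copying a VHD with the Python API, which the steps above don't cover. A minimal sketch using the azure-storage-blob package (all names and the connection string are placeholders; the source URL must be readable, e.g. public or carrying a SAS token):
from azure.storage.blob import BlobClient

# Placeholder connection string for the destination storage account.
connection_string = "DefaultEndpointsProtocol=https;AccountName=mystorageazure;AccountKey=YOUR_SECRET_STORAGE_KEY"
source_vhd_url = "http://vmdepoteastus.blob.core.windows.net/linux-community-store/community-23970-525c8c75-8901-4870-a937-7277414a6eaa-1.vhd"

dest_blob = BlobClient.from_connection_string(
    conn_str=connection_string,
    container_name="mycontainerazure",
    blob_name="elastichpcvm.vhd",
)

# Start a server-side copy; Azure performs the copy asynchronously.
copy_props = dest_blob.start_copy_from_url(source_vhd_url)
print("copy status:", copy_props["copy_status"])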
This seems to me like a CLI issue. Can you please tell us:
What version of the Azure CLI are you using? To get the version, execute: azure -v
Can you provide the version of the publishsettings.xml file you are currently using? You can get the publishsettings XML file by executing: azure account download