Sagemaker Training Job Not Uploading/Saving Training Model to S3 Output Path - python

OK, I've been dealing with this issue in SageMaker for almost a week and I'm ready to pull my hair out. I've got a custom training script paired with a data processing script in a BYO-algorithm Docker deployment scenario. It's a PyTorch model built with Python 3.x, and the BYO Dockerfile was originally written for Python 2, but I can't see how that relates to the problem I'm having, which is that after a successful training run SageMaker doesn't save the model to the target S3 bucket.
I've searched far and wide and can't seem to find an applicable answer anywhere. This is all done inside a Notebook instance. Note: I am using this as a contractor and don't have full permissions to the rest of AWS, including downloading the Docker image.
Dockerfile:
FROM ubuntu:18.04
MAINTAINER Amazon AI <sage-learner@amazon.com>
RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
python-pip \
python3-pip \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
pip3 install future numpy torch scipy scikit-learn pandas flask gevent gunicorn && \
rm -rf /root/.cache
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"
COPY decision_trees /opt/program
WORKDIR /opt/program
Docker Image Build:
%%sh
algorithm_name="name-this-algo"
cd container
chmod +x decision_trees/train
chmod +x decision_trees/serve
account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)
region=${region:-us-east-2}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi
# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)
# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}
docker push ${fullname}
Env setup and session start:
common_prefix = "pytorch-lstm"
training_input_prefix = common_prefix + "/training-input-data"
batch_inference_input_prefix = common_prefix + "/batch-inference-input-data"
import os
from sagemaker import get_execution_role
import sagemaker as sage
sess = sage.Session()
role = get_execution_role()
print(role)
Training Directory, Image, and Estimator Setup, then a fit call:
TRAINING_WORKDIR = "a/local/directory"
training_input = sess.upload_data(TRAINING_WORKDIR, key_prefix=training_input_prefix)
print ("Training Data Location " + training_input)
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/image-that-works:working'.format(account, region)
tree = sage.estimator.Estimator(image,
role, 1, 'ml.p2.xlarge',
output_path="s3://sagemaker-directory-that-definitely/exists",
sagemaker_session=sess)
tree.fit(training_input)
The above script is working, for sure. I have print statements in my script and they are printing the expected results to the console. This runs as it's supposed to, finishes up, and says that it's deploying model artifacts when IT DEFINITELY DOES NOT.
Model Deployment:
model = tree.create_model()
predictor = tree.deploy(1, 'ml.m4.xlarge')
This throws an error that the model can't be found. A call to aws sagemaker describe-training-job shows that the training completed, but the time it took to upload the model was suspiciously fast, so obviously there's an error somewhere and it's not telling me what. Hopefully it's not just uploading it into the aether.
{
"Status": "Uploading",
"StartTime": 1595982984.068,
"EndTime": 1595982989.994,
"StatusMessage": "Uploading generated training model"
},
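For reference, the same job description can be pulled with boto3 from the notebook, which also shows where SageMaker thinks it put the artifact (a sketch; the training-job name is a placeholder):
import boto3

sm = boto3.client("sagemaker")
job = sm.describe_training_job(TrainingJobName="your-training-job-name")  # placeholder name
print(job["TrainingJobStatus"])
print(job["ModelArtifacts"]["S3ModelArtifacts"])        # where SageMaker claims it put model.tar.gz
print(job.get("FailureReason", "no failure reason reported"))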
Here's what I've tried so far:
I've tried uploading it to a different bucket. I figured my permissions were the problem, so I pointed it to a bucket I knew I could upload to, since I had done so before. No dice.
I tried backporting the script to Python 2.x, but that caused more problems than it would likely have solved, and I don't really see how that would be the problem anyway.
I made sure the Notebook's IAM role has sufficient permissions; it does have the AmazonSageMakerFullAccess policy attached.
What bothers me is that there's no error log I can see. If someone could point me to one I'd happily dig through it, and if there's some hidden SageMaker kung fu that I don't know about I would be forever grateful.
EDIT
The training job runs and prints to both the Jupyter cell and CloudWatch as expected. I've since lost the cell output in the notebook, but below are the last few lines from CloudWatch. The first number is the epoch and the rest are various custom model metrics.

Can you verify from the training job logs that your training script is running? It doesn't look like your Docker image would respond to the command train, which is what SageMaker requires, and so I suspect that your model isn't actually getting trained/saved to /opt/ml/model.
AWS documentation about how SageMaker runs the Docker container: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-dockerfile.html
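If it helps, the container's stdout/stderr can also be pulled from CloudWatch programmatically (a rough boto3 sketch, assuming the default /aws/sagemaker/TrainingJobs log group; the job name is a placeholder):
import boto3

logs = boto3.client("logs")
job_name = "your-training-job-name"  # placeholder

# SageMaker writes each training job's output to log streams prefixed with the job name
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/TrainingJobs",
    logStreamNamePrefix=job_name,
)["logStreams"]

for stream in streams:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamName=stream["logStreamName"],
    )["events"]
    for event in events:
        print(event["message"])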
edit: summarizing from the comments below - the training script must also save the model to /opt/ml/model (the model isn't saved automatically).
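For example, the end of the training script could look roughly like this (a sketch; the nn.Linear stands in for whatever model your script actually trains, and the file name is arbitrary):
import os
import torch
import torch.nn as nn

# Stand-in for whatever the training script actually produces
model = nn.Linear(10, 1)

# SageMaker tars up everything under /opt/ml/model after the train process
# exits and uploads it as model.tar.gz to the estimator's output_path.
MODEL_DIR = "/opt/ml/model"
os.makedirs(MODEL_DIR, exist_ok=True)
torch.save(model.state_dict(), os.path.join(MODEL_DIR, "model.pth"))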

Have you tried saving to a local file and moving it to S3? I would save it locally (to the root directory of the script) and upload it via boto3.
The SageMaker session object may not have its bucket attribute initialized; doing it explicitly isn't much of an extra step.
import boto3
s3 = boto3.client('s3')
with open("FILE_NAME", "rb") as f:
s3.upload_fileobj(f, "BUCKET_NAME", "DESTINATION_NAME(optional)")
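If you'd rather stay with the SageMaker SDK, the session can also hand you its default bucket explicitly (a sketch; the key prefix is just an example):
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()  # e.g. sagemaker-<region>-<account-id>, created if it doesn't exist
uploaded_uri = sess.upload_data("FILE_NAME", bucket=bucket, key_prefix="manual-model-artifacts")
print(uploaded_uri)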

Related

Serving Universal Sentence Encoder Model using tensorflow serving and docker

I have a Universal Sentence Encoder model saved on my local drive. I am trying to serve the model in a Docker container using TensorFlow Serving.
Command:
sudo docker run -p 8502:8502 --name tf-serve -v /home/ubuntu/first/models:/models -t tensorflow/serving --model_base_path=/models/test
where /home/ubuntu/first/models is the folder path to my model files.
After it starts, it goes into an infinite loop.
What is happening here?! Any help will be appreciated.
edit your command like this:
sudo docker run -p 8502:8502 --name tf-serve -v /home/ubuntu/first/models:/models/test -t tensorflow/serving --model_base_path=/models/test
Your --model_base_path is /models/test, so you need to mount your model files into that folder in your -v mapping.

Does Apache Beam need internet to run GCP Dataflow jobs

I am trying to deploy a Dataflow job on a GCP VM that will have access to GCP resources but will not have internet access. When I try to run the job I get a connection timeout error, which would make sense if I were trying to connect to the internet. The code breaks because an HTTP connection is being attempted on behalf of apache-beam.
Python Set up:
Before cutting the VM off from the internet, I installed all necessary packages using pip and a requirements.txt. This seemed to work, because other parts of the code run fine.
The following is the error message I receive when I run the code.
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None))
after connection broken by 'ConnectTimeoutError(
<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at foo>,
'Connection to pypi.org timed out. (connect timeout=15)')': /simple/apache-beam/
Could not find a version that satisfies the requirement apache-beam==2.9.0 (from versions: )
No matching distribution found for apache-beam==2.9.0
If you are running a Python job, does it need to connect to PyPI? Is there a hack around this?
TL;DR: Copy the Apache Beam SDK archive into an accessible path and provide that path as the sdk_location SetupOption in your Dataflow pipeline.
I was also struggling with this setup for a long time. Finally I found a solution which does not need internet access during execution.
There are probably multiple ways to do that, but the following two are rather simple.
As a precondition you'll need to create the apache-beam-sdk source archive as follows:
Clone the Apache Beam GitHub repo
Check out the required tag, e.g. v2.28.0
cd to beam/sdks/python
Create a tar.gz source archive of your required Beam SDK version:
python setup.py sdist
Now you should have the source archive apache-beam-2.28.0.tar.gz in the path beam/sdks/python/dist/
Option 1 - Use Flex Templates and copy the Apache Beam SDK in the Dockerfile
Documentation : Google Dataflow Documentation
Create a Dockerfile --> you have to include COPY utils/apache-beam-2.28.0.tar.gz /tmp (adjust the source path to wherever your archive lives), because /tmp is the path you can set in your SetupOptions.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
# Due to a change in the Apache Beam base image in version 2.24, you must install
# libffi-dev manually as a dependency. For more information:
# https://github.com/GoogleCloudPlatform/python-docs-samples/issues/4891
# update used packages
RUN apt-get update && apt-get install -y \
libffi-dev \
&& rm -rf /var/lib/apt/lists/*
COPY setup.py .
COPY main.py .
COPY path_to_beam_archive/apache-beam-2.28.0.tar.gz /tmp
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
RUN python -m pip install --user --upgrade pip setuptools wheel
Set sdk_location to the path you've copied the Apache Beam SDK archive to:
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Build the Docker image with Cloud Build
gcloud builds submit --tag $TEMPLATE_IMAGE .
Create a Flex template
gcloud dataflow flex-template build "gs://define-path-to-your-templates/your-flex-template-name.json" \
--image=gcr.io/your-project-id/image-name:tag \
--sdk-language=PYTHON \
--metadata-file=metadata.json
Run generated flex-template in your subnetwork (if required)
gcloud dataflow flex-template run "your-dataflow-job-name" \
--template-file-gcs-location="gs://define-path-to-your-templates/your-flex-template-name.json" \
--parameters staging_location="gs://your-bucket-path/staging/" \
--parameters temp_location="gs://your-bucket-path/temp/" \
--service-account-email="your-restricted-sa-dataflow@your-project-id.iam.gserviceaccount.com" \
--region="yourRegion" \
--max-workers=6 \
--subnetwork="https://www.googleapis.com/compute/v1/projects/your-project-id/regions/your-region/subnetworks/your-subnetwork" \
--disable-public-ips
Option 2 - Copy sdk_location from GCS
According to the Beam documentation you should even be able to provide a GCS gs:// path directly for the sdk_location option, but that didn't work for me. The following does work, though:
Upload the previously generated archive to a bucket that the Dataflow job you'd like to execute can access, e.g. gs://yourbucketname/beam_sdks/apache-beam-2.28.0.tar.gz
In your source code, download the Apache Beam SDK archive to e.g. /tmp/apache-beam-2.28.0.tar.gz:
# see: https://cloud.google.com/storage/docs/samples/storage-download-file
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name" (without the gs:// prefix)
    # source_blob_name = "beam_sdks/apache-beam-2.28.0.tar.gz"
    # destination_file_name = "/tmp/apache-beam-2.28.0.tar.gz"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

download_blob("your-bucket-name", "beam_sdks/apache-beam-2.28.0.tar.gz",
              "/tmp/apache-beam-2.28.0.tar.gz")
Now you can set the sdk_location to the path you've downloaded the sdk archive.
options.view_as(SetupOptions).sdk_location = '/tmp/apache-beam-2.28.0.tar.gz'
Now your Pipeline should be able to run without internet breakout.
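Putting it together, a minimal pipeline sketch (project, region, and bucket names are placeholders carried over from above):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project-id",
    region="your-region",
    temp_location="gs://your-bucket-path/temp/",
    staging_location="gs://your-bucket-path/staging/",
)
# Point Beam at the locally available SDK archive instead of pypi.org
options.view_as(SetupOptions).sdk_location = "/tmp/apache-beam-2.28.0.tar.gz"

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["no", "internet", "needed"])
     | "Print" >> beam.Map(print))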
If you run a DataflowPythonOperator in a private Cloud Composer environment, the job needs internet access to download a set of packages from the image projects/dataflow-service-producer-prod, but within the private cluster the VMs and GKE nodes don't have access to the internet.
To solve this problem, you need to create a Cloud NAT and a router: https://cloud.google.com/nat/docs/gke-example#step_6_create_a_nat_configuration_using
This will allow your instances to send packets to the internet and receive inbound traffic.
When we use Google Cloud Composer with private IP enabled, we don't have access to the internet.
To solve this:
Create a GKE cluster and create a new node pool named "default-pool" (use the same name).
In the network tags, add "private".
In security, check "Allow access to all Cloud APIs".

How to run a python script in a docker container in jenkins pipeline

I have a docker image in dockerhub that I want to add as an agent in my jenkins pipeline script. As a part of the image, I perform git clone to fetch a repository from github that has multiple python scripts which corresponds to multiple stages in my jenkins pipeline. I tried searching everywhere but I'm not able to find relevant information that talks about how to access the files inside a docker container in jenkins pipeline.
I'm running jenkins on a VM and it has docker installed. The pipeline performs a build on a docker container. Since there are many steps involved in every single stage of the pipeline, I tried to use python API as much as possible.
This is what my Dockerfile looks like; the image builds successfully and I'm able to host it on Docker Hub. When I run the container, I can see the "jenkins_pipeline_scripts" directory, which contains all the necessary Python scripts for the pipeline stages.
FROM ros:melodic-ros-core-stretch
RUN apt-get update && apt-get -y install python-pip
RUN git clone <private-repo with token>
This is what my current Jenkins pipeline script looks like.
pipeline {
    agent {
        docker {
            image '<image name>'
            registryUrl 'https://registry.hub.docker.com'
            registryCredentialsId 'docker-credentials'
            args '--network host -u root:root'
        }
    }
    stages {
        stage('Test') {
            steps {
                sh 'python jenkins_pipeline_scripts/scripts/test.py'
            }
        }
    }
}
This is the error I'm getting when I execute the job.
$ docker top 05587cd75db5c4282b86b2f1ded2c43a0f4eae161d6c7d7c03d065b0d45e1 -eo pid,comm
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Test)
[Pipeline] sh
+ python jenkins_pipeline_scripts/scripts/test.py
python: can't open file 'jenkins_pipeline_scripts/scripts/test.py': [Errno 2] No such file or directory
When the Jenkins pipeline launches the agent container, it changes the container's WORKDIR via the -w option and mounts the Jenkins job's workspace folder via the -v option.
As a result of both options, the Jenkins job's workspace folder becomes your container's WORKDIR.
The following is my Jenkins job console output:
docker run -t -d -u 29001:100
-w /bld/workspace/test/agent-poc
-v /bld/workspace/test/agent-poc:/bld/workspace/test/agent-poc:rw,z
-v /bld/workspace/test/agent-poc@tmp:/bld/workspace/test/agent-poc@tmp:rw,z
-e ******** -e ******** -e ******** -e ********
docker.hub.com/busybox cat
You clone the code when building the image, and it is not inside the WORKDIR, which is why you get the "no such file" error.
Two approaches can fix your issue.
1) cd into your code folder first; you need to know that path inside the container.
stage('Test') {
    steps {
        sh '''
            cd <your code folder in container>
            python jenkins_pipeline_scripts/scripts/test.py
        '''
    }
}
2) Move the git clone of the code repo from the Dockerfile into a pipeline stage.
As I explained at the beginning, your job's workspace becomes the container's WORKDIR,
so you can clone your code into the Jenkins job workspace via a pipeline step, and then you no longer need to cd <your code folder in container>.

Writing csv files to local host from a docker container

I am trying to set up a very basic data processing project where I use Docker to create an Ubuntu environment on an EC2 instance, install Python, take an input CSV, perform some simple data manipulation, then output the data to a new CSV in the folder where the input was. I have been able to run my Python code successfully both locally and on the EC2 instance, but when I run it in the Docker container, the data appears to be processed (my script prints out the data), yet the results are not saved at the end of the run. Is there a command I am missing from my Dockerfile that is causing the results not to be saved? Alternatively, is there a way I can save the output directly to an S3 bucket?
EDIT: The path to the input files is "/home/ec2-user/docker_test/data" and the path to the code is "/home/ec2-user/docker_test/code". After the data is processed, I want the result to be written as a new file in the "/home/ec2-user/docker_test/data" directory on the host.
Dockerfile:
FROM ubuntu:latest
RUN apt-get update \
&& apt-get install -y --no-install-recommends software-properties-common \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -q -y --no-install-recommends python3.6 python3.6-dev python3-pip python3-setuptools \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
VOLUME /home/ec2-user/docker_test/data
VOLUME /home/ec2-user/docker_test/code
WORKDIR /home/ec2-user/docker_test/
COPY requirements.txt ./
RUN cat requirements.txt | xargs -n 1 -L 1 python3.6 -m pip install --no-cache-dir
COPY . .
ENV LC_ALL C.UTF-8
ENV LANG=C.UTF-8
CMD python3.6 main.py
Python Script:
import pandas as pd
import os
from code import processing

path = os.getcwd()

def main():
    df = pd.read_csv(path + '/data/table.csv')
    print('input df: \n{}'.format(df))
    df_out = processing.processing(df)
    df_out.to_csv(path + '/data/updated_table.csv', index = False)
    print('\noutput df: \n{}'.format(df_out))

if __name__ == '__main__':
    main()
EDIT: I have been running the image with "docker run docker_test".
Ok, gotcha, with the edit about expectations of the CSV being output to the host, we do have a problem with how this is set up.
You've got two VOLUMEs declared in your Dockerfile, which is fine. These are Docker-managed volumes, which are great for persisting data between containers going up and down on a single host, but you can't easily browse them like a normal file system from your host.
If you want the file to show up on your host, you can create a bind mounted volume at runtime, which maps a path in your host filesystem to a path in the Docker container's filesystem.
docker run -v $(pwd):/home/ec2-user/docker_test/data docker_test will do this. $(pwd) is an expression that evaluates to your current working directory if you're on a *nix system, where you're running the command. Take care with that and adjust as needed (like if you're using Windows as your host).
With a volume set up this way, when the CSV is created in the container file system at the location you intend, it will be accessible on your host in the location relative to however you've mapped it.
Read up on volumes. They're vital to using Docker and not hard to grasp at first glance, but there are some gotchas in the details.
Regarding uploading to S3, I would recommend using the boto3 library and doing it in your Python script. You could also use something like s3cmd if you find that simpler.
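For instance, a minimal boto3 sketch you could call at the end of main() (the bucket and key names are placeholders):
import boto3

def upload_results(local_path, bucket, key):
    """Upload the processed CSV to S3 instead of (or in addition to) writing it to a volume."""
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key)

# e.g. after df_out.to_csv(...):
# upload_results(path + "/data/updated_table.csv", "your-output-bucket", "results/updated_table.csv")
Boto3 should pick up the EC2 instance profile credentials from inside the container, provided the instance role allows writing to the bucket.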
You could use S3FS Fuse to mount the S3 bucket as a drive in your docker container. This basically creates a folder on your filesystem that is actually the S3 bucket. Anything that you save/modify in that folder will be reflected in the S3 bucket.
If you delete the docker container or unmount the drive you still have your S3 bucket intact, so you don't need to worry too much about erasing files in the S3 bucket through normal docker use.

How to store artifacts on a server running MLflow

I define the following docker image:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
and build an image called mlflow-server. Next, I start this server from a local machine:
docker run --rm -it -p 5000:5000 -v ${PWD}/mlruns/:/mnt/mlruns mlflow-server
Next, I define the following function:
import os
import mlflow

def foo(x, with_af=False):
    mlflow.start_run()
    mlflow.log_param("x", x)
    print(x)
    if with_af:
        with open(str(x), 'wb') as fout:
            fout.write(os.urandom(1024))
        mlflow.log_artifact(str(x))
        mlflow.log_artifact('./foo.data')
    mlflow.end_run()
From the same directory I run foo(10) and the parameter is logged correctly. However, foo(10, True) yields the following error: PermissionError: [Errno 13] Permission denied: '/mnt'. Seems like log_artifact tries to save the file on the local file system directly.
Any idea what am I doing wrong?
Good question. Just to make sure, sounds like you're already configuring MLflow to talk to your tracking server when running your script, e.g. via MLFLOW_TRACKING_URI=http://localhost:5000 python my-script.py.
Artifact Storage in MLflow
Artifacts differ subtly from other run data (metrics, params, tags) in that the client, rather than the server, is responsible for persisting them. The current flow (as of MLflow 0.6.0) is:
User code calls mlflow.start_run
MLflow client makes an API request to the tracking server to create a run
Tracking server determines an appropriate root artifact URI for the run (currently: runs' artifact roots are subdirectories of their parent experiment's artifact root directories)
Tracking server persists run metadata (including its artifact root) & returns a Run object to the client
User code calls log_artifact
Client logs artifacts under the active run's artifact root
The issue
When you launch an MLflow server via mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/, the server logs metrics and parameters under /mnt/mlruns in the docker container, and also returns artifact paths under /mnt/mlruns to the client. The client then attempts to log artifacts under /mnt/mlruns on the local filesystem, which fails with the PermissionError you encountered.
The fix
The best practice for artifact storage with a remote tracking server is to configure the server to use an artifact root accessible to both clients and the server (e.g. an S3 bucket or Azure Blob Storage URI). You can do this via mlflow server --default-artifact-root [artifact-root].
Note that the server uses this artifact root only when assigning artifact roots to newly-created experiments - runs created under existing experiments will use an artifact root directory under the existing experiment's artifact root. See the MLflow Tracking guide for more info on configuring your tracking server.
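For example, a minimal client-side sketch once the server is restarted with an S3 artifact root (bucket name and paths are placeholders; the client machine needs boto3 and AWS credentials that can write to the bucket):
# Server side (for reference):
#   mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/ \
#       --default-artifact-root s3://your-mlflow-bucket/artifacts
import os
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run():
    mlflow.log_param("x", 10)
    with open("10", "wb") as fout:
        fout.write(os.urandom(1024))
    mlflow.log_artifact("10")  # the client now uploads this straight to the S3 artifact root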
I had the same issue, try:
sudo chmod 755 -R /mnt/mlruns
docker run --rm -it -p 5000:5000 -v /mnt/mlruns:/mnt/mlruns mlflow-server
I had to create a folder on the host with the exact same path as in the Docker container and change its permissions.
I did the same inside the Docker image:
FROM python:3.6
RUN pip install --upgrade pip
RUN pip install --upgrade mlflow
RUN mkdir /mnt/mlruns/
RUN chmod 777 -R /mnt/mlruns/
ENTRYPOINT mlflow server --host 0.0.0.0 --file-store /mnt/mlruns/
