Does MLflow allow logging artifacts from remote locations like S3? - python

My setting
I have developed an environment for ML experiments that looks like the following: training happens in the AWS cloud with SageMaker Training Jobs. The trained model is stored in the /opt/ml/model directory, which is reserved by SageMaker to pack models as a .tar.gz in SageMaker's own S3 bucket. Several evaluation metrics are computed during training and testing, and recorded to an MLflow infrastructure consisting of an S3-based artifact store (see Scenario 4). Note that this is a different S3 bucket than SageMaker's.
A very useful feature from MLflow is that any model artifacts can be logged to a training run, so data scientists have access to both metrics and more complex outputs through the UI. These outputs include (but are not limited to) the trained model itself.
A limitation is that, as I understand it, the MLflow API for logging artifacts only accepts as input a local path to the artifact itself, and will always upload it to its artifact store. This is suboptimal when the artifacts are stored somewhere outside MLflow, as you have to store them twice. A transformer model may weigh more than 1GB.
My questions
Is there a way to pass an S3 path to MLflow and make it count as an artifact, without having to download it locally first?
Is there a way to avoid pushing a copy of an artifact to the artifact store? If my artifacts already reside in another remote location, it would be ideal to just have a link to such location in MLflow and not a copy in MLflow storage.

You can use a Tracking Server with S3 as a backend
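A minimal sketch, assuming a tracking server started with an S3 default artifact root (the bucket names and tracking URI below are placeholders). For the second question, a common workaround is to record the external S3 location as a tag or parameter instead of copying the artifact into the MLflow store:

# Tracking server side (Scenario 4-style), for example:
#   mlflow server --backend-store-uri sqlite:///mlflow.db \
#                 --default-artifact-root s3://my-mlflow-bucket/artifacts \
#                 --host 0.0.0.0 --port 5000

import mlflow

mlflow.set_tracking_uri("http://my-tracking-server:5000")  # placeholder URI

with mlflow.start_run():
    mlflow.log_metric("test_accuracy", 0.93)

    # log_artifact() uploads a *local* file into the artifact store:
    # mlflow.log_artifact("metrics_report.json")

    # Workaround for artifacts that already live in another bucket:
    # record the S3 URI as a tag instead of copying the ~1 GB model.
    mlflow.set_tag("model_s3_uri", "s3://sagemaker-bucket/path/to/model.tar.gz")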

Related

Deploying a new model to a SageMaker endpoint without updating the config?

I want to deploy a new model to an existing AWS SageMaker endpoint. The model is trained by a different pipeline and stored as a model.tar.gz in S3. The SageMaker endpoint config is pointing to this as the model data URL. SageMaker, however, doesn't reload the model and I don't know how to convince it to do so.
I want to deploy a new model to an AWS SageMaker endpoint. The model is trained by a different pipeline and stored as a model.tar.gz in S3. I provisioned the SageMaker endpoint using AWS CDK. Now, within the training pipeline, I want to allow the data scientists to optionally upload their newly trained model to the endpoint for testing. I don't want to create a new model or a new endpoint config. Also, I don't want to change the infrastructure (AWS CDK) code.
The model is uploaded to the S3 location that the SageMaker endpoint config is using as the model_data_url. Hence it should use the new model, but it doesn't load it. I know that SageMaker caches models inside the container, but I don't know how to force a new load.
This documentation suggests storing the model tarball under another name in the same S3 folder and altering the code that invokes the model. This is not possible for my application. And I don't want SageMaker to default to an old model once the TargetModel parameter is not present.
Here is what I am currently doing after uploading the model to S3. Even though the endpoint transitions into the Updating state, it does not force a model reload:
from typing import Any, Dict

import boto3


def update_sm_endpoint(endpoint_name: str) -> Dict[str, Any]:
    """Forces the sagemaker endpoint to reload model from s3"""
    sm = boto3.client("sagemaker")
    return sm.update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=[
            {"VariantName": "main", "DesiredWeight": 1},
        ],
    )
Any ideas?
If you want to modify the model used by a SageMaker endpoint, you have to create a new model object and a new endpoint configuration, then call update_endpoint. This will not change the name of the endpoint.
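A minimal sketch of that flow with boto3 follows; the container image URI, role ARN, and resource names are placeholders you would replace with your own.

import time

import boto3

sm = boto3.client("sagemaker")

# Placeholders: adjust to your account and resources.
endpoint_name = "my-endpoint"
image_uri = "<inference-container-image-uri>"
model_data = "s3://my-bucket/path/model.tar.gz"   # the newly trained model
role_arn = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

suffix = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
model_name = f"my-model-{suffix}"
config_name = f"my-endpoint-config-{suffix}"

# 1) Register a new model object pointing at the new artifact.
sm.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data},
    ExecutionRoleArn=role_arn,
)

# 2) Create a new endpoint configuration that uses the new model.
sm.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[{
        "VariantName": "main",
        "ModelName": model_name,
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# 3) Point the existing endpoint at the new configuration (the endpoint name stays the same).
sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)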
Comments on your question and the SageMaker docs:
The documentation you mention ("This documentation suggests to store the model tarball with another name in the same S3 folder, and alter the code to invoke the model") is for SageMaker Multi-Model Endpoints, a service to host multiple models in the same endpoint in parallel. This is not what you need. You need a single-model SageMaker endpoint, one that you update with a new endpoint configuration as described above.
Also, the API you mention, sm.update_endpoint_weights_and_capacities, is not needed for what you want (unless you want a progressive rollout of the traffic from model 1 to model 2).

What is the proper way of transferring large files (ML/AI models) from a local server to AWS?

I have my own local server and I train ML/AI models with Python. After training, I store the model files that will be used for predicting results. I want to transfer those files to AWS and store them where my web application will be. The models will be trained and updated every day. I don't want to train the models on AWS since it's expensive.
First of all, is this approach applicable? If so, what is the best way of transferring those files?
Is there any way to send that kind of file with APIs?
If you already use Python, I would suggest starting with the official AWS SDK for Python, boto3.
You even have some examples in the docs:
Uploading files
File transfer configuration
Note the latter: it enables multipart transfers, which are especially useful for big files.
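As a rough sketch (the file name, bucket, key, and thresholds are placeholders), a multipart upload with boto3 could look like this:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart settings: files larger than ~100 MB are split into 100 MB parts
# and uploaded with several threads in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8,
)

# Placeholders: replace with your own file, bucket, and key.
s3.upload_file(
    Filename="model.pkl",
    Bucket="my-model-bucket",
    Key="models/model.pkl",
    Config=config,
)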
I can also suggest another simple tool called rclone.
It is very simple to use and, in my experience, sometimes even faster than boto3.

How to use GitLab cache to store model weights for an ML pipeline?

I am using GitLab to host a Python machine-learning pipeline. The pipeline includes trained weights of a model which I do not want to store in Git. The weights are stored in some remote data storage that the pipeline automatically pulls from when running its job.
This works, but I have a problem when trying to run end-to-end automatic CI tests with this setup. I do not want to download the model weights from the remote every time my CI is triggered (since that can get expensive). In fact, I want to completely block my internet connection within all CI tests for security reasons (for example by configuring socket in my conftest.py).
If I do this, obviously I am not able to access the location where my model weights are stored. I know I can mock the result of the model for testing, but I actually want to test whether the weights of the model are sensible or not, so mocking is out of the question.
I posted a similar question before and one of the solutions I got was to take advantage of GitLab's caching mechanism to store the model weights.
However, I am not able to figure out how to do that exactly. From what I understand of caching, if I enable it, GitLab will download the necessary files from the internet once and reuse them in later pipelines. However, the solution that I am looking for would look something like this:
Upload a file to gitlab manually.
This file is accessible to all my CI jobs, however, this is not tracked by git.
When the file becomes outdated (because I created a new model), I manually upload the updated file.
With the cache workflow, from what I understand, if I want to update the file, I will have to enable the internet in the testing suite, have the pipeline automatically download the new set of weights, and then disable the internet again once the new cache is set up. This feels hacky and unsafe (unsafe, because I never want to enable internet during testing).
Is there a good solution for this problem?
One possible solution, though it may not be flexible enough, is keeping the model file in a GitLab CI/CD variable and putting it into the correct path in the job. GitLab CI supports file-type variables as well.
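As a rough sketch (the variable name MODEL_WEIGHTS_FILE and the target path are hypothetical): GitLab exposes a file-type CI/CD variable as an environment variable whose value is the path of a temporary file on the runner, so a conftest.py hook could copy it into the location the pipeline expects without any internet access:

# conftest.py
import os
import pathlib
import shutil

WEIGHTS_ENV = "MODEL_WEIGHTS_FILE"            # hypothetical file-type CI/CD variable
TARGET = pathlib.Path("models/weights.bin")   # hypothetical path the pipeline expects


def pytest_configure(config):
    # On the runner, the variable's value is the path to a temp file
    # holding the uploaded weights; copy it to where the tests look for it.
    src = os.environ.get(WEIGHTS_ENV)
    if src and os.path.exists(src):
        TARGET.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, TARGET)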

SageMaker not outputting TensorBoard logs to S3 during training

I'm training a model with TensorFlow using Amazon SageMaker, and I'd like to be able to monitor training progress while the job is running. During training, however, no TensorBoard files are output to S3; the files are uploaded to S3 only once the training job has completed. After training has completed, I can download the files and see that TensorBoard has been logging values correctly throughout training, despite S3 only being updated once, after training completes.
I'd like to know why SageMaker isn't uploading the TensorBoard information to S3 throughout the training process.
Here is the code from my notebook on SageMaker that kicks off the training job:
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig
import time

bucket = 'my-bucket'
output_prefix = 'training-jobs'
model_name = 'my-model'
dataset_name = 'my-dataset'

dataset_path = f's3://{bucket}/datasets/{dataset_name}'
output_path = f's3://{bucket}/{output_prefix}'

job_name = f'{model_name}-{dataset_name}-training-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
s3_checkpoint_path = f"{output_path}/{job_name}/checkpoints"  # Checkpoints are updated live as expected
s3_tensorboard_path = f"{output_path}/{job_name}/tensorboard"  # Tensorboard data isn't appearing here until the training job has completed

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=s3_tensorboard_path,
    container_local_output_path='/opt/ml/output/tensorboard'  # I have confirmed this is the unaltered path being provided to tf.summary.create_file_writer()
)

role = sagemaker.get_execution_role()

estimator = TensorFlow(entry_point='main.py', source_dir='./', role=role, max_run=60*60*24*5,
                       output_path=output_path,
                       checkpoint_s3_uri=s3_checkpoint_path,
                       tensorboard_output_config=tensorboard_output_config,
                       instance_count=1, instance_type='ml.g4dn.xlarge',
                       framework_version='2.3.1', py_version='py37', script_mode=True)

estimator.fit({'train': dataset_path}, wait=True, job_name=job_name)
There is an issue on the TensorFlow GitHub related to the S3 client in version 2.3.1, which is the one you are using. Check in the CloudWatch logs whether you have an error like
OP_REQUIRES failed at whole_file_read_ops.cc:116 : Failed precondition: AWS Credentials have not been set properly. Unable to access the specified S3 location
The suggested solution is to add the GetObjectVersion permission on the bucket. Alternatively, to confirm that it is a TensorFlow issue, you can try a different framework version.
First some speculation without any facts: SageMaker could work like other systems that sync files between a local drive and S3. They might check that a file hasn't been accessed recently before syncing it, so that they don't copy it while someone is writing to it. The log files are written to constantly until shutdown, so that might result in them never being copied.
I have used SageMaker Docker containers with the same problem. I've tried two ways to circumvent it, and they both seemed to work.
The first one is to periodically create a new log file, e.g. every 30 minutes call tf.summary.create_file_writer(...) again to switch to a new log file. The old file is synced to S3 once it's no longer in use (a rough sketch follows below).
The second one is to write logs directly to S3: tf.summary.create_file_writer('s3://bucket/dir/'). This is a more immediate way of getting the info into S3.
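A minimal sketch of the first approach, assuming the container-local log directory from the question and a hypothetical 30-minute rotation interval:

import time
import tensorflow as tf

LOG_DIR = "/opt/ml/output/tensorboard"  # container_local_output_path from the question
ROTATE_EVERY_S = 30 * 60                # hypothetical rotation interval

part = 0
writer = tf.summary.create_file_writer(f"{LOG_DIR}/part-{part}")
last_rotate = time.time()


def maybe_rotate_writer():
    """Close the current event file so it can be synced, then start a new one."""
    global part, writer, last_rotate
    if time.time() - last_rotate > ROTATE_EVERY_S:
        writer.close()
        part += 1
        writer = tf.summary.create_file_writer(f"{LOG_DIR}/part-{part}")
        last_rotate = time.time()


# Inside the training loop:
# with writer.as_default():
#     tf.summary.scalar("loss", loss_value, step=step)
# maybe_rotate_writer()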

How to deploy multiple TensorFlow models using AWS?

I've trained 10 different TensorFlow models for style transfer; basically, each model is responsible for applying filters to an image based on a style image. So every model functions independently, and I want to integrate this into an application. Is there any way to deploy these models using AWS?
I've tried deploying these models using AWS SageMaker, then using the endpoint with AWS Lambda, and finally creating an API using API Gateway. But the catch here is that we can only deploy a single model on SageMaker, whereas in my case I want to deploy 10 different models.
I expect to provide a link to each model in my application, so the selected filter will trigger the corresponding model on AWS and apply the filter.
What I did for something similar was to create my own Docker container with API code capable of loading and predicting with multiple models. When the API starts, it copies a model.tar.gz from an S3 bucket; inside that tar.gz are the weights for all my models. My code then scans the contents and loads all the models. If your models are too big (RAM consumption), you might need to handle this differently; as is said here, you can load a model only when you call predict. I load all the models at the beginning to have faster predictions, which is not actually a big change in the code.
Another approach I'm trying right now is to have API Gateway call multiple SageMaker endpoints, although I did not find good documentation for that.
There are a couple of options, and the final choice depends on your priorities in terms of cost, latency, reliability, and simplicity.
A different SageMaker endpoint per model - one benefit is better robustness, because models are isolated from one another: if one model gets called a lot, it won't bring the whole fleet down. Each endpoint lives its own life and can be hosted on a separate type of machine, to achieve better economics. Note that to achieve high availability it is even recommended to double the hardware backend (2+ instances per SageMaker endpoint), so that endpoints are multi-zone, as SageMaker does its best to host the backend instances in different availability zones when an endpoint has two or more instances. (A rough sketch of this option follows after this list.)
One SageMaker TFServing multi-model endpoint - if all your models are TensorFlow models and their artifacts are compatible with TFServing, you may be able to host all of them in a single SageMaker TFServing endpoint. See this section of the docs: Deploying more than one model to your endpoint.
One SageMaker Multi-Model Endpoint - a feature released at the end of 2019 that enables hosting multiple models in the same container.
Serverless deployment in AWS Lambda - this can be cost-effective: models generate charges only when called. This is limited to pairs of {DL model; DL framework} that fit within Lambda memory and storage limits and that do not require a GPU. It has been documented a couple of times in the past, notably with TensorFlow and MXNet.
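As a rough sketch of the first option with the SageMaker Python SDK (the model artifact paths, framework version, and instance type below are placeholders), you could deploy one endpoint per style model in a loop:

import sagemaker
from sagemaker.tensorflow import TensorFlowModel

role = sagemaker.get_execution_role()

# Placeholder artifacts: one model.tar.gz per trained style-transfer model.
style_models = {
    "style-1": "s3://my-bucket/models/style-1/model.tar.gz",
    "style-2": "s3://my-bucket/models/style-2/model.tar.gz",
    # ... up to style-10
}

predictors = {}
for name, model_data in style_models.items():
    model = TensorFlowModel(
        model_data=model_data,
        role=role,
        framework_version="2.3.1",  # placeholder, match your training version
    )
    # One endpoint per model; each can scale and fail independently.
    predictors[name] = model.deploy(
        initial_instance_count=1,
        instance_type="ml.m5.large",
        endpoint_name=f"style-transfer-{name}",
    )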
