We currently have a system running on AWS SageMaker in which several units each have their own trained machine learning model artifact (produced by an SKLearn training script with the SageMaker SKLearn estimator).
Through SageMaker's multi-model endpoints, we are able to host all of these units on a single instance.
The problem is that we need to scale this system up so that we can train individual models for hundreds of thousands of units and then host the resulting model artifacts on a multi-model endpoint. However, SageMaker limits the number of training jobs you can run in parallel (our limit is 30).
Aside from training our models in batches, does anyone have ideas on how to implement a system in AWS SageMaker where, for hundreds of thousands of units, we can have a separate trained model artifact for each unit?
Is there a way to output multiple model artifacts from a single SageMaker training job using an SKLearn estimator?
Furthermore, how does SageMaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object, or is it handled automatically?
Here are some ideas:
1. Does anyone have ideas on how to implement a system in AWS SageMaker where, for hundreds of thousands of units, we can have a separate trained model artifact for each unit? Is there a way to output multiple model artifacts from a single SageMaker training job using an SKLearn estimator?
I don't know whether the 30-job training concurrency is a hard limit; if it is a blocker for you, open a support ticket to ask and request an increase. Otherwise, as you point out, you can train multiple models in one job and produce multiple artifacts that you can either (a) upload to S3 yourself, or (b) save to /opt/ml/model so that they all end up in the model.tar.gz artifact in S3. Note that if this artifact gets too big, that approach could become impractical. A sketch of option (b) follows.
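As a rough illustration of option (b), here is a minimal sketch of an SKLearn entry-point script that trains one model per unit inside a single training job. The column names (unit_id, target), the CSV file name, and the LinearRegression choice are placeholders, not details from your setup.

```python
# Sketch of a SageMaker SKLearn entry-point script that trains one model per
# unit in a single training job. Column names and file names are placeholders.
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # SageMaker script mode injects these environment variables into the container.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"))
    args = parser.parse_args()

    df = pd.read_csv(os.path.join(args.train, "train.csv"))

    # Train and persist one artifact per unit; everything written under
    # /opt/ml/model is packed into a single model.tar.gz when the job ends.
    for unit_id, unit_df in df.groupby("unit_id"):
        X = unit_df.drop(columns=["unit_id", "target"])
        y = unit_df["target"]
        model = LinearRegression().fit(X, y)
        joblib.dump(model, os.path.join(args.model_dir, f"model_{unit_id}.joblib"))
```

For option (a), you could instead tar each model individually inside the job and upload the tarballs (e.g. with boto3) to the S3 prefix backing your multi-model endpoint, since multi-model endpoints expect one model.tar.gz per model under a common prefix.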
2. How does SageMaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object, or is it handled automatically?
This depends on the type of training container you are using. SageMaker built-in containers are developed by Amazon teams and designed to use the available resources efficiently. If you bring your own code, such as a custom Python script in the SKLearn container, you are responsible for making sure your code is efficiently written and uses the available hardware. Hence framework choice is quite important :) For example, some scikit-learn models support using multiple CPUs explicitly (e.g. the n_jobs parameter in random forests), but I don't think scikit-learn natively supports GPU, multi-GPU, or multi-node training.
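To illustrate the n_jobs point, here is a tiny example; the synthetic dataset is only there to make the snippet runnable.

```python
# scikit-learn only parallelises where the estimator supports it;
# n_jobs=-1 asks it to use every CPU core on the training instance.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)
model = RandomForestRegressor(n_estimators=200, n_jobs=-1).fit(X, y)
```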
Related
I already have a predictive model written in Python; however, it is currently executed by hand and operates on a single data file. I am hoping to generalize the model so that it can read different datasets from my backend, each time effectively producing a different model since we are training on different data. How would I add the model to my backend?
Store the model as a pickle file and read it from your backend when you need it, analogous to how you read your training data.
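A minimal sketch of that pattern, using joblib (the standard pickle module works too); the dataset and file path here are placeholders.

```python
# Persist a fitted model to disk and reload it later in the backend.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train on whatever dataset the backend provides (iris is just a stand-in).
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Save the fitted model; the path is a placeholder for wherever your backend stores artifacts.
joblib.dump(model, "model_v1.joblib")

# Later, in the backend, load it back and predict.
restored = joblib.load("model_v1.joblib")
print(restored.predict(X[:5]))
```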
But you might want to check out MLflow for an integrated model-handling solution. It is possible to run it on-premises. With MLflow you can easily implement a proper ML lifecycle: you can store your training stats and keep a history of your trained models.
From what I can see in the docs, H2O supports calibration for GBM, DRF, and XGBoost models only, and it has to be specified prior to the training phase.
I find it confusing. If calibration is a post-processing step and is model agnostic, shouldn't it be possible to calibrate any model trained using H2O, even after the training process is finished?
Currently, I'm dealing with a model that I've trained using AutoML. Even though it is a GBM model, I'm not able to easily calibrate it by providing a calibrate_model parameter as it is not supported by AutoML. I don't see any option to calibrate it after it's trained either.
Does anyone know an easy way to calibrate already-trained H2O models? Is it necessary to "manually" calibrate them using algorithms such as Platt scaling or is there a way to do it without using any extra libraries?
Thanks
I find it confusing. If calibration is a post-processing step
The reason why it is part of model training right now is to have it in MOJO (our deployment artifact).
and is model agnostic, shouldn't it be possible to calibrate any model trained using H2O, even after the training process is finished?
Calibrating a model ex post makes a lot of sense; all the code is already there, it "just" needs to be exposed to users. We created a ticket for this here.
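Until that is exposed, one way to do the "manual" calibration the question mentions is Platt scaling on a held-out calibration frame with scikit-learn. This is only a hedged sketch: `model` stands for your already-trained H2O binary classifier (e.g. the AutoML leader), `calib` for a held-out H2OFrame with a 0/1 target column "y", and `p1` is the positive-class probability column that H2O's binomial models emit.

```python
# Hedged sketch of ex-post Platt scaling for a trained H2O binary classifier.
# `model` (an H2OEstimator) and `calib` (a held-out H2OFrame with 0/1 target "y")
# are assumed to exist already; they are not produced by this snippet.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw = model.predict(calib).as_data_frame()["p1"].to_numpy()   # uncalibrated P(y=1)
y_true = calib["y"].as_data_frame()["y"].to_numpy()

# Fit the Platt scaler (a logistic regression) on the log-odds of the raw scores.
raw = np.clip(raw, 1e-6, 1 - 1e-6)
platt = LogisticRegression().fit(np.log(raw / (1 - raw)).reshape(-1, 1), y_true)

def calibrated_p1(frame):
    """Return calibrated positive-class probabilities for an H2OFrame."""
    p = np.clip(model.predict(frame).as_data_frame()["p1"].to_numpy(), 1e-6, 1 - 1e-6)
    return platt.predict_proba(np.log(p / (1 - p)).reshape(-1, 1))[:, 1]
```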
I'm new to AWS and I am considering using Amazon SageMaker to train my deep learning model because I'm having memory issues due to the large dataset and neural network I have to train. I'm confused about whether to store my data on my notebook instance or in S3. If I store it in S3, would I be able to access it to train on my notebook instance? I'm confused about the concepts. Can anyone explain the use of S3 in machine learning on AWS?
Yes, you can use S3 as storage for your training datasets.
Refer to the diagram at this link describing how everything works together: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
You may also want to check out the following blogs, which detail File mode and Pipe mode, the two mechanisms for transferring training data:
https://aws.amazon.com/blogs/machine-learning/accelerate-model-training-using-faster-pipe-mode-on-amazon-sagemaker/
In File mode, the training data is downloaded first to an encrypted EBS volume attached to the training instance prior to commencing the training. However, in Pipe mode the input data is streamed directly to the training algorithm while it is running.
https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode. This is because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.
The blog also contains Python code snippets using Pipe input mode for reference.
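As a rough sketch of how this looks in practice: you point a SageMaker estimator at an S3 prefix, and the service moves the data into the training container for you. The bucket name, script name, and framework version below are placeholders, the argument names are from v2 of the SageMaker Python SDK, and the same pattern applies to the TensorFlow and other framework estimators.

```python
# Hedged sketch: training with data kept in S3 instead of on the notebook instance.
# "my-bucket", "train.py" and the framework_version are placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()

estimator = SKLearn(
    entry_point="train.py",        # your training script
    framework_version="1.2-1",     # pick a version supported in your region
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
)

# In the default File mode, SageMaker copies this S3 prefix onto the training
# instance's EBS volume before the script starts; Pipe mode (where supported)
# streams it instead.
estimator.fit({"train": "s3://my-bucket/path/to/training-data/"})
```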
Many of TensorFlow's example applications create Experiments and run one of the Experiment's methods by calling tf.contrib.learn.learn_runner.run. It looks like an Experiment is essentially a wrapper for an Estimator.
The code needed to create and run an Experiment looks more complex than the code needed to create, train, and evaluate an Estimator. I'm sure there's an advantage to using Experiments, but I can't figure out what it is. Could someone fill me in?
tf.contrib.learn.Experiment is a high-level API for distributed training. Here's an excerpt from its documentation:
Experiment is a class containing all information needed to train a model. After an experiment is created (by passing an Estimator and inputs for training and evaluation), an Experiment instance knows how to invoke training and eval loops in a sensible fashion for distributed training.
Just like tf.estimator.Estimator (and the derived classes) is a high-level API that hides matrix multiplications, saving checkpoints and so on, tf.contrib.learn.Experiment tries to hide the boilerplate you'd need to do for distributed computation, namely tf.train.ClusterSpec, tf.train.Server, jobs, tasks, etc.
You can train and evaluate the tf.estimator.Estimator on a single machine without an Experiment. See the examples in this tutorial.
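For the single-machine case, here is a minimal hedged sketch of training and evaluating an Estimator directly, without an Experiment or learn_runner. It uses the TF 1.x tf.estimator API, and the toy data, feature name, and LinearRegressor choice are only placeholders to make the snippet self-contained.

```python
# Hedged sketch (TF 1.x): train and evaluate a canned Estimator directly on one
# machine, no Experiment or learn_runner involved. Data and features are toy values.
import numpy as np
import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x", shape=[1])]
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

def input_fn():
    # A four-point toy dataset fed through tf.data.
    x = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)
    y = np.array([0.0, -1.0, -2.0, -3.0], dtype=np.float32)
    return tf.data.Dataset.from_tensor_slices(({"x": x}, y)).repeat().batch(4)

estimator.train(input_fn=input_fn, steps=1000)
print(estimator.evaluate(input_fn=input_fn, steps=10))
```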
Does anyone know the difference between using Google Cloud Machine Learning and a virtual machine instance on Google Compute Engine?
I am using Keras with Python 3 and feel that Cloud ML is more restrictive (Python 2.7, an older version of TensorFlow, having to follow the given project structure, ...). I guess there are benefits to using Cloud ML over a VM in GCE, but I would like to know what they are.
Google Cloud ML is a fully managed service, whereas Google Compute Engine is not (the latter is IaaS).
Assuming that you just want to know some differences for the case when you have your own model, here are some:
The most noticeable feature of Cloud ML is the deployment itself. You don't have to take care of things like setting up your cluster (that is, scaling), launching it, installing the packages, and deploying your model for training. This is all done automatically; in Compute Engine you would have to do it yourself, although you would be unrestricted in what you can install.
Although you can automate more or less all of that deployment yourself, there is no magic to it. In fact, the logs of a Cloud ML training job show that it is quite rudimentary, in the sense that a cluster of instances is launched, TensorFlow is installed, and your model is run with the options you set. This is because TensorFlow is a framework decoupled from Google's systems.
However, there is a substantial difference between Cloud ML and Compute Engine when it comes to prediction, and that is mostly what you pay for with Cloud ML, I would say. You can have a model deployed in Cloud ML for online and batch prediction pretty much out of the box (see the short sketch at the end of this answer). In Compute Engine, you would have to take care of all the quirks of TensorFlow Serving, which are not that trivial (compared to training your model).
Another advantage of Cloud ML is hyper-parameter tuning. It is no more than a somewhat smart brute-forcing tool to find the best combination of parameters for your given model. You could possibly automate this in Compute Engine, but you would have to figure out the optimisation algorithm yourself to find the combination of parameter values that improves the objective function (usually maximising your accuracy or reducing your loss).
Finally, pricing differs slightly between the two services. Until recently, Cloud ML pricing was on par with other competitors (you would pay for computing time in both training and prediction, but also per prediction, which you could compare with the computing time in Compute Engine). However, now you only pay for that computing time (and it is even cheaper than before), which probably makes managing and scaling your own cluster (with TensorFlow) in Compute Engine pointless in most scenarios.
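To give a feel for that "out of the box" online prediction, here is a hedged sketch of calling a model already deployed on Cloud ML Engine from Python. PROJECT, MODEL, and the instance payload are placeholders, and the shape of each instance depends entirely on your model's serving signature.

```python
# Hedged sketch: online prediction against a model already deployed on Cloud ML
# Engine, using the google-api-python-client library with application default
# credentials. PROJECT/MODEL and the instance payload are placeholders.
from googleapiclient import discovery

PROJECT = "my-project"    # placeholder
MODEL = "my-keras-model"  # placeholder

service = discovery.build("ml", "v1")
name = "projects/{}/models/{}".format(PROJECT, MODEL)  # optionally append "/versions/v1"

response = service.projects().predict(
    name=name,
    body={"instances": [{"input": [0.1, 0.2, 0.3]}]},  # depends on your serving signature
).execute()

print(response["predictions"])
```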