I'm trying to create a machine learning model following this tutorial: Get Started with Amazon SageMaker
Unless I missed something in the tutorial, I didn't find any step where we specify the target variable. Can someone explain where/when we specify our target variable when creating an ML model using SageMaker built-in algorithms?
Thanks a lot!
It depends on which modeling option you're using in SageMaker :)
SageMaker built-in algorithms each have their own input specification, described in their respective documentation. For example, for SageMaker Linear Learner and SageMaker XGBoost the target is assumed to be the first column.
With custom code, such as Bring-Your-Own-Docker or the SageMaker Framework containers (for Sklearn, TensorFlow, PyTorch, MXNet), you are the one writing the code, so you can implement any logic you like, and the target can be any column of your dataset.
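To make that concrete, here is a minimal sketch of a script-mode entry point for the pre-built SKLearn container; the --target hyperparameter, column names, and file names are all assumptions for illustration:

    # train.py - sketch of a SageMaker SKLearn script-mode entry point
    import argparse
    import os

    import joblib
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--target", default="target")  # hypothetical hyperparameter
        args = parser.parse_args()

        # SageMaker mounts the 'train' channel at this path inside the container.
        train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
        df = pd.read_csv(os.path.join(train_dir, "train.csv"))

        # Since you own the code, the target can be any column you choose.
        y = df[args.target]
        X = df.drop(columns=[args.target])

        model = LogisticRegression().fit(X, y)

        # SageMaker picks up artifacts written to SM_MODEL_DIR.
        model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
        joblib.dump(model, os.path.join(model_dir, "model.joblib"))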
After a little bit of research, I found the answer. If you are using a CSV file pulled from an S3 bucket, the target variable is assumed to be in the first column.
If you need more details, you can check out this part of AWS documentation:
Common Data Format For Training
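In practice that just means reordering your columns before writing the CSV. A minimal sketch (the "label" column name and file names are made up), following the header-less layout that document describes:

    import pandas as pd

    # Hypothetical dataset whose target column is named "label".
    df = pd.read_csv("my_dataset.csv")

    # Built-in algorithms expect the target in the first column,
    # with no header row and no index column.
    cols = ["label"] + [c for c in df.columns if c != "label"]
    df[cols].to_csv("train.csv", header=False, index=False)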
Related
I'm trying to create an index on OpenSearch for knn_vectors using Faiss's IVF and PQ indexing methods. I'm following the instructions here and here. I need to create a model on a subset of data and use that model to perform the indexing on the complete set of documents. To perform the training, I should use the train API.
The problem is that my entire code is in Python and I'm using opensearch-py and/or elasticsearch-py to access the OpenSearch cluster. These libraries don't seem to provide any wrappers for the train API. Is there a way I can run train through Python without using these libraries, or am I missing something in the library documentation that would allow me to use them for training?
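One workaround, sketched below under assumptions (the model id, index name, field name, and method parameters are all placeholders), is to call the k-NN train endpoint through the low-level transport that opensearch-py exposes, since the client can issue raw requests even where no typed wrapper exists:

    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # Body fields follow the k-NN train API docs; all values here are examples.
    train_body = {
        "training_index": "train-index",   # index holding the sample vectors
        "training_field": "train-field",   # knn_vector field to train on
        "dimension": 128,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "parameters": {
                "nlist": 128,
                "encoder": {"name": "pq", "parameters": {"m": 8}},
            },
        },
    }

    # No train() helper exists, but the raw transport can hit any endpoint.
    response = client.transport.perform_request(
        "POST", "/_plugins/_knn/models/my-model/_train", body=train_body
    )
    print(response)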
New to SageMaker...
Trained a "linear-learner" classification model using the Sagemaker API, and it saved a "model.tar.gz" file in my s3 path. From what I understand SM just used an image of a scikit logreg model.
Finally, I'd like to gain access to the model object itself, so I unpacked the "model.tar.gz" file only to find another file called "model_algo-1" with no extension.
Can anyone tell me how I can find the "real" modeling object without using the inference/Endpoint deploy API provided by SageMaker? There are some things I want to look at manually.
Thanks,
Craig
Linear Learner is a built-in algorithm written using MXNet, and the binary is also MXNet-compatible. You can't use this model outside of SageMaker, as there is no open-source implementation of it.
I am trying to create a simple linear learner in AWS SageMaker with MXNet. I have never worked with SageMaker or MXNet before. Fitting the model gives a runtime error as follows and shuts down the instance:
    UnexpectedStatusException: Error for Training job linear-learner-2020-02-11-06-13-22-712: Failed. Reason: ClientError: Unable to read data channel 'train'. Requested content-type is 'application/x-recordio-protobuf'. Please verify the data matches the requested content-type. (caused by MXNetError)
I think the data should be converted to protobuf format before being passed as training data. Could someone please explain what the correct format for MXNet models is? What is the best way to convert a simple data frame into protobuf?
This end-to-end demo shows usage of Linear Learner with input data pre-processed in pandas dataframes and then converted to protobuf using the SDK. But note that:
There is no need to use protobuf; you can also pass CSV data with the target variable in the first column of the files, as indicated here.
There is no need to know MXNet in order to use the SageMaker Linear Learner: just use the SDK of your choice, bring your data to S3, and orchestrate training and inference :)
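For reference, here is a rough sketch of the protobuf route the demo takes (the arrays below are placeholders for your dataframe values); the SageMaker Python SDK can serialize a numpy feature matrix and label vector to recordIO-protobuf before upload:

    import io

    import numpy as np
    import sagemaker.amazon.common as smac

    # Placeholder data; e.g. features = df.drop("label", axis=1).values.astype("float32")
    features = np.random.rand(100, 5).astype("float32")
    labels = np.random.randint(0, 2, 100).astype("float32")

    # Serialize to the recordIO-protobuf format the error message asks for.
    buf = io.BytesIO()
    smac.write_numpy_to_dense_tensor(buf, features, labels)
    buf.seek(0)

    # buf can now be uploaded to S3, e.g.:
    # boto3.Session().resource("s3").Bucket(bucket).Object(key).upload_fileobj(buf)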
I have a model built by roughly following the tutorial for tf.estimator.BoostedTreesClassifier in the docs. I then exported it using the tf.Estimator.export_saved_model method, as described in the SavedModels from Estimators section of the SavedModel docs. This loads into TensorFlow Serving and answers gRPC and REST requests.
I'd now like to include the explanation factors along with any predictions, or, less ideally, expose them as a second signature on the exported model. tf.estimator._BoostedTreesBase.experimental_predict_with_explanations already implements an appropriate algorithm, as described in the Local Interpretability section of the docs.
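For context, calling the method locally is straightforward; the sketch below assumes est is an already-trained BoostedTreesClassifier and eval_input_fn is the same kind of input_fn used with est.predict. The difficulty is only in exposing it through serving:

    # Each yielded dict carries directional feature contributions ("dfc")
    # and a "bias" term alongside the usual prediction keys.
    for pred in est.experimental_predict_with_explanations(eval_input_fn):
        print(pred["probabilities"], pred["bias"], pred["dfc"])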
I thought it would be possible to 'extend' the existing estimator in a way that would let me expose this method as another served signature. I've thought of several approaches, but only tried the first two so far:
I've Tried
Change which signatures export_saved_model exports
This didn't go very far. The exposed signatures are a little dynamic, but seem to be limited to the train, predict or eval options defined by tensorflow_core.python.saved_model.model_utils.mode_keys.KerasModeKeys.
Just use an eval_savedmodel?
I briefly thought Eval might be what I was looking for, and followed some of the getting started guide for TensorFlow Model Analysis. The further I go on this path the more it seems like the main difference with an Eval model is how the data is loaded, and that isn't what I want to change.
Subclass the estimator
There are extra caveats around exporting subclassed models. On top of that, an Estimator isn't a Model; it's a model plus extra metadata around inputs, outputs and configuration, so I'm not clear whether a subclassed Estimator would even be exportable the same way a Keras Model is.
I abandoned this subclassing approach without writing much code.
Pull the BoostedTrees Model out of the Estimator
I am not savvy enough to arrange a BoostedTrees model myself using the low-level primitives. The code in the Estimator that sets it up looks fairly complex. It would be nice to leverage that work, but it seems that the Estimator deals in model_fns, which change depending on the train/predict/eval mode, and it isn't clear what their relationship to a Keras Model is.
I wrote a little code for this, but also gave up on it quickly.
What Next?
Given the above dead ends, which angle should I be pursuing further?
Both the low-level export API, and the low-level model building API look like they could get me closer to a solution. The gap between setting up an Estimator, and re-creating one using either API seems fairly wide.
Is it possible I could continue using the existing Estimator, but use the low-level export API to create something with an "interpret" signature that calls through to experimental_predict_with_explanations? Or even "predict and interpret" in a single step? Which tutorial will put me on that path?
I understand that you can pass a CSV file from S3 into a SageMaker XGBoost container using the following code:
    train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
    valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')

    data_channels = {'train': train_channel, 'validation': valid_channel}
    xgb_model.fit(inputs=data_channels, logs=True)
But I have an ndarray stored in an S3 bucket. These are processed, label-encoded, feature-engineered arrays. I would want to pass this into the container instead of the CSV. I understand I can always convert my ndarray into CSV files before saving it in S3; just checking if there is an array option.
There are multiple options for algorithms in SageMaker:
Built-in algorithms, like the SageMaker XGBoost you mention
Custom, user-created algorithm code, which can be:
Written for a pre-built Docker image, available for Sklearn, TensorFlow, PyTorch, MXNet
Written in your own container
When you use built-ins (option 1), your choice of data formats is limited to what the built-ins support, which is only CSV and libsvm in the case of the built-in XGBoost; a minimal sketch of the CSV route is below. If you want custom data formats and pre-processing logic before XGBoost, that is absolutely possible if you use your own script leveraging the open-source XGBoost. You can get inspiration from the Random Forest demo to see how to create custom models in pre-built containers.
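Here is that CSV route as a short sketch (the bucket, keys, and file names are placeholders), stacking the labels onto the features so the target lands in the first column:

    import io

    import boto3
    import numpy as np

    # Hypothetical arrays: labels in y, features in X.
    X = np.load("features.npy")
    y = np.load("labels.npy")

    # Built-in XGBoost expects header-less CSV with the target first.
    data = np.column_stack((y, X))
    csv_buf = io.StringIO()
    np.savetxt(csv_buf, data, delimiter=",", fmt="%g")

    boto3.client("s3").put_object(
        Bucket="my-bucket",
        Key="xgboost/train/train.csv",
        Body=csv_buf.getvalue().encode("utf-8"),
    )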