I understand that you can pass a CSV file from S3 into a SageMaker XGBoost container using the following code:
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')
data_channels = {'train': train_channel, 'validation': valid_channel}
xgb_model.fit(inputs=data_channels, logs=True)
But I have an ndarray stored in an S3 bucket. These are processed, label-encoded, feature-engineered arrays. I would like to pass these into the container instead of a CSV. I understand I can always convert my ndarrays to CSV files before saving them to S3; I'm just checking whether there is an array option.
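For reference, the CSV fallback I mention would look something like this (a minimal sketch; the bucket/key names are placeholders, and y_train/X_train stand in for the in-memory arrays):
import boto3
import numpy as np

# Hypothetical bucket/key; y_train and X_train are the arrays before upload
bucket = 'my-bucket'
key = 'xgboost/train/train.csv'

# Built-in XGBoost expects CSV with no header and the label in the first column
train_array = np.hstack([y_train.reshape(-1, 1), X_train])
np.savetxt('train.csv', train_array, delimiter=',')
boto3.client('s3').upload_file('train.csv', bucket, key)

train_data = 's3://{}/{}'.format(bucket, key)  # then build the s3_input as above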
There are multiple options for algorithms in SageMaker:
1. Built-in algorithms, like the SageMaker XGBoost you mention
2. Custom, user-created algorithm code, which can be:
- written for a pre-built Docker image, available for Sklearn, TensorFlow, PyTorch, and MXNet
- written in your own container
When you use built-ins (option 1), your choice of data formats is limited to what the built-in supports, which is only CSV and libsvm in the case of the built-in XGBoost. If you want custom data formats and pre-processing logic before XGBoost, that is absolutely possible with your own script leveraging the open-source XGBoost. You can get inspiration from the Random Forest demo to see how to create custom models in pre-built containers.
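For illustration, a minimal sketch of what such an entry-point script could look like if the ndarrays were uploaded to the training channel as .npy files (the file names and hyperparameters are hypothetical; SM_CHANNEL_TRAIN and SM_MODEL_DIR are the standard locations exposed inside SageMaker training containers):
# train.py -- hypothetical entry point run inside a SageMaker container
import os
import numpy as np
import xgboost as xgb

# SageMaker exposes the data channel and model output locations as environment variables
train_dir = os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')

# Assumes the arrays were saved as .npy files (hypothetical file names)
X = np.load(os.path.join(train_dir, 'features.npy'))
y = np.load(os.path.join(train_dir, 'labels.npy'))

dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({'objective': 'binary:logistic'}, dtrain, num_boost_round=100)
booster.save_model(os.path.join(model_dir, 'xgboost-model'))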
New to SageMaker.
Trained a "linear-learner" classification model using the Sagemaker API, and it saved a "model.tar.gz" file in my s3 path. From what I understand SM just used an image of a scikit logreg model.
Finally, I'd like to gain access to the model object itself, so I unpacked the "model.tar.gz" file only to find another file called "model_algo-1" with no extension.
Can anyone tell me how I can find the "real" modeling object without using the inference/endpoint deploy API provided by SageMaker? There are some things I want to look at manually.
Thanks,
Craig
Linear Learner is a built-in algorithm written using MXNet, and the binary is also MXNet-compatible. You can't use this model outside of SageMaker, as there is no open-source implementation of it.
I'm trying to use the keras ImageDataGenerator.flow_from_dataframe() method to generate image data on the fly as the dataset I'm working on is too large to load into memory in one go.
The source image data files are DICOM files, which are not supported by the flow_from_dataframe() method.
Is it possible to (easily) extend flow_from_dataframe() to handle DICOM (or other unsupported) images/input?
Perhaps a custom pre-processing function could be run on each unsupported file, returning a normalised (windowed/photometric corrected) numpy array, then allowing the ImageDataGenerator instance to proceed.
I could edit the source in my own installation, but a general solution that can be used on vanilla Keras is preferred, to ensure portability to other platforms (especially Kaggle)!
A solution to this can be found on the Keras github issue tracker/feature requests: https://github.com/keras-team/keras/issues/13665
A custom data generator can be created with keras.utils.Sequence as its superclass.
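A minimal sketch of such a Sequence subclass (reading the DICOMs with pydicom is an assumption on my part, and the class/variable names are hypothetical):
import numpy as np
import pydicom  # assumed dependency for reading DICOM files
from keras.utils import Sequence

class DicomSequence(Sequence):  # hypothetical name
    def __init__(self, filepaths, labels, batch_size=32):
        self.filepaths = filepaths
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.filepaths) / self.batch_size))

    def __getitem__(self, idx):
        batch_paths = self.filepaths[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_labels = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        images = []
        for path in batch_paths:
            ds = pydicom.dcmread(path)
            img = ds.pixel_array.astype('float32')
            img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # simple normalisation
            images.append(img[..., np.newaxis])  # add a channel axis
        return np.stack(images), np.array(batch_labels)

# model.fit_generator(DicomSequence(train_paths, train_labels), epochs=10)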
I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, which will serialize arbitrary Python objects, but the resulting file
- is opaque if I wanted to store the pipeline in version control,
- will serialize any Python object and therefore might not be safe to deserialize, and
- may run into issues with different Python or library versions.
It seems like ONNX is a great way to save models in a safe and interoperable way. Is there any equivalent for data pre-processing pipelines?
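For illustration, skl2onnx appears to handle pre-processing-only Pipelines as well; a minimal sketch (the toy pipeline and file name are hypothetical, and skl2onnx/onnxruntime are assumed as dependencies):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt

# A toy pre-processing pipeline; substitute your own steps
pipe = Pipeline([('scale', StandardScaler())])
X = np.random.rand(10, 4).astype(np.float32)
pipe.fit(X)

# Convert; the input signature must be declared explicitly
onnx_model = convert_sklearn(pipe, initial_types=[('input', FloatTensorType([None, 4]))])
with open('preprocess.onnx', 'wb') as f:
    f.write(onnx_model.SerializeToString())

# Re-load and apply the pipeline with onnxruntime
sess = rt.InferenceSession('preprocess.onnx')
transformed = sess.run(None, {'input': X})[0]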
I'm trying to create a Machine Learning algorithm following this tutorial: Get Started with Amazon SageMaker
Unless I missed something in the tutorial, I didn't find any steps where we specify the target variable. Can someone explain where / when we specify our target variable when creating an ML model using SageMaker built-in algorithms?
Thanks a lot!
It depends on the scientific paradigm you're using in SageMaker :)
SageMaker built-in algorithms all have their own input specification, described in their respective documentation. For example, for SageMaker Linear Learner and SageMaker XGBoost the target is assumed to be the first column.
With custom code, such as Bring-Your-Own-Docker or SageMaker Framework containers (for Sklearn, TF, PyTorch, MXNet), you are the one writing the code, so you can implement any sort of logic and the target can be any column of your dataset.
After a little bit of research, I found the answer. If you are using a CSV file pulled from an S3 bucket, the target variable is assumed to be in the first column.
If you need more details, you can check out this part of AWS documentation:
Common Data Format For Training
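For example, a minimal sketch of moving the target into the first column before writing a header-less CSV for a built-in algorithm (the column and file names are hypothetical):
import pandas as pd

df = pd.read_csv('raw_data.csv')  # hypothetical local file
target = 'label'                  # hypothetical target column name

# Built-in algorithms expect the target in the first column and no header row
ordered = pd.concat([df[target], df.drop(columns=[target])], axis=1)
ordered.to_csv('train.csv', header=False, index=False)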
This is my first attempt at training a model in the cloud and I am struggling with all the little intricacies.
I have training data stored in a bucket on Google Cloud Platform, along the lines of gs://test/train. The dataset is about 100k. Currently the data is distributed in separate folders according to its labels. I don't know the ideal way to get access to the data.
Usually in Keras I use ImageDataGenerator with flow_from_directory and it automatically creates a generator which I can feed into my model. Is there such a function for Python on Google Cloud Platform? If not, what is the ideal way to access the data via a generator so I can feed it to Keras model.fit_generator?
Thank you.
ImageDataGenerator.flow_from_directory() currently does not allow you to stream data directly from a GCS bucket. I think you have a couple of options:
1/ Copy your data from GCS to a disk local to the VM you are using to run your script. I suppose you are doing this either via ML Engine or on a Compute Engine instance. Either way, you can copy the data at the beginning of your training script using gsutil or the Python cloud storage API. There is a downside here: this will cost you some time at the beginning of your script, especially when the dataset is large.
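For example, a minimal sketch of that copy step using the Python client library (the bucket/prefix match the gs://test/train path from the question; the local directory is hypothetical):
import os
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs('test', prefix='train/'):
    if blob.name.endswith('/'):  # skip folder placeholders
        continue
    local_path = os.path.join('/tmp/data', blob.name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    blob.download_to_filename(local_path)

# afterwards, flow_from_directory('/tmp/data/train', ...) works as usual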
2/ When using tf.keras, you can train your model on a tf.data Dataset. The nice thing here is that TensorFlow's io utilities allow you to read directly from GCS buckets. If you convert your data to TFRecords, you can instantiate a Dataset object without downloading the data to local disk first:
# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
# Fit a tf.keras model
model.fit(ds_train)
See this question for more information on the TFRecord option. This also works fine with a Dataset object instantiated directly from images on GCS with Dataset.from_tensor_slices, so you don't have to store your data in the TFRecord format first:
def load_and_preprocess_image(path):
    """Read an image from a GCS path and process it into an image tensor.
    Args:
        path (tensor): string tensor, pointer to a GCS or local image path
    Returns:
        tensor: processed image tensor
    """
    image = tf.read_file(path)
    image = tf.image.decode_image(image, channels=3)  # handles JPEG and PNG
    return image

image_paths = ['gs://my-bucket/img1.png',
               'gs://my-bucket/img2.png', ...]  # list of image paths on GCS
path_ds = tf.data.Dataset.from_tensor_slices(image_paths)
image_ds = path_ds.map(load_and_preprocess_image)
label_ds = tf.data.Dataset.from_tensor_slices(labels)  # labels can be a list of labels
model.fit(tf.data.Dataset.zip((image_ds, label_ds)))
See the tutorials on the TF website for more examples.
3/ Finally, it should also be possible to write your own Python generator or to adapt the source code of the ImageDataGenerator such that the images are read in with a TensorFlow io function. Again, these work fine with gs:// paths:
import tensorflow as tf
tf.enable_eager_execution()
path = 'gs://path/to/my/image.png'
tf.image.decode_png(tf.io.read_file(path)) # this works
See also this related question. This will probably be slower than the options listed above.
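For completeness, a minimal sketch of such a hand-written generator for model.fit_generator (assumes all images share the same dimensions; image_paths and labels are the lists from option 2 above):
import numpy as np
import tensorflow as tf
tf.enable_eager_execution()

def gcs_image_generator(paths, labels, batch_size=32):
    """Yield (image_batch, label_batch) tuples read directly from gs:// paths."""
    while True:  # fit_generator expects the generator to loop indefinitely
        for start in range(0, len(paths), batch_size):
            batch_paths = paths[start:start + batch_size]
            batch_labels = labels[start:start + batch_size]
            # read and decode each image with the TensorFlow io functions shown above
            images = [tf.image.decode_png(tf.io.read_file(p)).numpy() for p in batch_paths]
            yield np.stack(images), np.array(batch_labels)

# model.fit_generator(gcs_image_generator(image_paths, labels, 32),
#                     steps_per_epoch=len(image_paths) // 32)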