I'm trying to create an index in OpenSearch for knn_vector fields using Faiss's IVF and PQ indexing methods. I'm following the instructions here and here. I need to create a model on a subset of the data and then use that model to index the complete set of documents. To perform the training I should use the Train API.
The problem is that my entire codebase is in Python and I'm using opensearch-py and/or elasticsearch-py to access the OpenSearch cluster. These libraries don't seem to provide any wrappers for the Train API. Is there a way to run training through Python without these libraries, or am I missing something in the library documentation that would let me use them for training?
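The closest I have come is issuing the HTTP request by hand through the low-level transport that opensearch-py exposes. A minimal sketch of what I mean is below; the endpoint and request body follow my reading of the OpenSearch kNN documentation for Faiss IVF+PQ, and the index, field and model names are placeholders, so the details may well be off:

    from opensearchpy import OpenSearch

    client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

    # Body per my reading of the kNN Train API docs; field names (and where
    # space_type goes) may differ between OpenSearch versions.
    train_body = {
        "training_index": "train-vectors",   # subset of documents used for training
        "training_field": "embedding",       # knn_vector field in that index
        "dimension": 128,
        "space_type": "l2",
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "parameters": {
                "nlist": 128,
                "encoder": {"name": "pq", "parameters": {"code_size": 8}},
            },
        },
    }

    # opensearch-py exposes the raw transport, so any REST endpoint can be
    # called even when there is no high-level wrapper for it.
    response = client.transport.perform_request(
        "POST", "/_plugins/_knn/models/my-ivfpq-model/_train", body=train_body
    )
    print(response)

Is relying on perform_request like this the intended approach, or is there library support I'm missing?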
I already have a predictive model written in Python, but currently it is run manually and works on a single data file. I'm hoping to generalize the model so that it can read in different datasets from my backend, each time effectively producing a different model since the training data differs. How would I go about adding the model to my backend?
Store the model as a pickle and read it from your backend when you need it, analogous to how you read your training data.
But you might want to check out MLflow for an integrated model-handling solution. It is possible to run it on-premises. With MLflow you can easily implement a proper ML lifecycle: you can store your training stats and keep the history of your trained models.
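A minimal sketch of the pickle approach, assuming a scikit-learn model as a stand-in for whatever you actually train (the file path and helper names are placeholders):

    import pickle
    from sklearn.linear_model import LogisticRegression  # stand-in for your model

    def train_and_store(X, y, path="model.pkl"):
        """Fit on whichever dataset the backend hands us and persist the result."""
        model = LogisticRegression().fit(X, y)
        with open(path, "wb") as f:
            pickle.dump(model, f)
        return path

    def load_and_predict(X, path="model.pkl"):
        """Read the stored model back and score new data."""
        with open(path, "rb") as f:
            model = pickle.load(f)
        return model.predict(X)

With MLflow, the pickle.dump/pickle.load pair would be replaced by mlflow.sklearn.log_model and mlflow.sklearn.load_model, which also record parameters and metrics for each training run.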
I'm trying to use the Keras ImageDataGenerator.flow_from_dataframe() method to generate image data on the fly, as the dataset I'm working with is too large to load into memory in one go.
The source image data files are DICOM files, which are not supported by the flow_from_dataframe() method.
Is it possible to (easily) extend flow_from_dataframe() to handle DICOM (or other unsupported) images/input?
Perhaps a custom pre-processing function could be run on each unsupported file, returning a normalised (windowed/photometric corrected) numpy array, then allowing the ImageDataGenerator instance to proceed.
I could edit the source in my own installation, but a general solution that works with vanilla Keras is preferred, to ensure portability to other platforms (especially Kaggle)!
A solution to this can be found on the Keras GitHub issue tracker / feature requests: https://github.com/keras-team/keras/issues/13665
A custom data generator can be created by subclassing keras.utils.Sequence.
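A rough sketch of such a generator, assuming pydicom for reading the files and a dataframe with "path" and "label" columns (both assumptions, adapt to your data):

    import numpy as np
    import pydicom                      # assumed DICOM reader
    import tensorflow as tf
    from tensorflow import keras

    class DicomSequence(keras.utils.Sequence):
        """Yields batches of normalised DICOM images straight from disk."""

        def __init__(self, df, batch_size=32, target_size=(224, 224)):
            self.df = df.reset_index(drop=True)
            self.batch_size = batch_size
            self.target_size = target_size

        def __len__(self):
            return int(np.ceil(len(self.df) / self.batch_size))

        def __getitem__(self, idx):
            batch = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
            images = np.stack([self._load(p) for p in batch["path"]])
            labels = batch["label"].to_numpy()
            return images, labels

        def _load(self, path):
            ds = pydicom.dcmread(path)
            img = ds.pixel_array.astype("float32")
            # Simple min-max normalisation; windowing/photometric correction
            # would go here instead.
            img = (img - img.min()) / max(float(img.max() - img.min()), 1e-6)
            img = tf.image.resize(img[..., np.newaxis], self.target_size)
            return img.numpy()

    # Usage: model.fit(DicomSequence(train_df), validation_data=DicomSequence(val_df))

Because the Sequence does the decoding itself, nothing in Keras needs to be patched, so it runs on vanilla installations such as Kaggle.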
I want to use additional data to 'update' an already trained Light Gradient Boosting Model (LGBM). Is there a way to do that?
I am looking for an approach that uses the Sklearn API and thus can be used in a pipeline.
An LGBM model in Python can be fitted both with the original model API and with the Sklearn API.
I couldn't find any examples of using the Sklearn API for continuous learning.
Regardless of that, you can fit a model either way, and the result is compatible with the train() function from the original API, which can continue training from it.
It can be saved with save_model() or with joblib.dump().
This does not affect its compatibility with sklearn's Pipeline(); it is perfectly compatible.
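A sketch of what I mean, using a toy regression dataset; init_model is what carries the previously fitted booster into the new fit, as I understand the LightGBM API:

    import lightgbm as lgb
    from sklearn.datasets import make_regression
    from sklearn.pipeline import Pipeline

    # Toy data standing in for the original data and the additional batch.
    X_old, y_old = make_regression(n_samples=500, n_features=10, random_state=0)
    X_new, y_new = make_regression(n_samples=200, n_features=10, random_state=1)

    # Initial fit with the sklearn API, so it can live inside a Pipeline.
    base = lgb.LGBMRegressor(n_estimators=100)
    base.fit(X_old, y_old)

    # "Update" with new data: init_model continues boosting from the fitted booster.
    updated = lgb.LGBMRegressor(n_estimators=50)
    updated.fit(X_new, y_new, init_model=base.booster_)

    # The original API accepts the same booster:
    # lgb.train({"objective": "regression"}, lgb.Dataset(X_new, y_new),
    #           init_model=base.booster_)

    # Still Pipeline-compatible.
    pipe = Pipeline([("lgbm", updated)])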
I'm trying to create a machine learning model by following this tutorial: Get Started with Amazon SageMaker.
Unless I missed something in the tutorial, I didn't find any steps where we specify the target variable. Can someone explain where / when we specify our target variable when creating an ML model using SageMaker built-in algorithms?
Thanks a lot!
It depends on the scientific paradigm you're using in SageMaker :)
SageMaker built-in algorithms all have their own input specification, described in their respective documentation. For example, for SageMaker Linear Learner and SageMaker XGBoost the target is assumed to be the first column.
With custom code, such as Bring-Your-Own-Docker or the SageMaker Framework containers (for Sklearn, TF, PyTorch, MXNet), you are the one writing the code, so you can implement any logic you want and the target can be any column of your dataset.
After a little bit of research, I found the answer. If you are using a CSV file pulled from an S3 bucket, the target variable is assumed to be in the first column (and the file should have no header row).
If you need more details, you can check out this part of AWS documentation:
Common Data Format For Training
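For example, reordering a pandas dataframe so that the label ends up first and writing it without a header or index (the file and column names are placeholders):

    import pandas as pd

    df = pd.read_csv("raw_data.csv")

    # SageMaker built-in algorithms expect CSV training data with no header row
    # and the target variable in the first column.
    target = "label"
    cols = [target] + [c for c in df.columns if c != target]
    df[cols].to_csv("train.csv", header=False, index=False)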
I'm using TensorFlow 1.14 and the tf.keras API to build a number (>10) of different neural networks. (I'm also interested in answers to this question for TensorFlow 2.) I'm wondering how I should organize my project.
I convert the Keras models into estimators using tf.keras.estimator.model_to_estimator and use TensorBoard for visualization. I also sometimes use model.summary(). Each of my models has a number (>20) of hyperparameters and takes as input one of three types of input data. I sometimes use hyperparameter optimization, so I often manually delete models and call tf.keras.backend.clear_session() before trying the next set of hyperparameters.
Currently I'm using functions that take hyperparameters as arguments and return the respective compiled keras model to be turned into an estimator. I use three different "Main_Datatype.py" scripts to train models for the three different input data types. All data is loaded from .tfrecord files and there is an input function for each data type, which is used by all estimators taking that type of data as input. I switch between models (i.e. functions returning a model) in the Main scripts. I also have some building blocks that are part of more than one model, for which I use helper functions returning them, piecing together the final result using the Keras functional API.
The slight incompatibilities between the different models are beginning to confuse me, and I've decided to organise the project using classes. I'm planning to make a class for each model that keeps track of hyperparameters and the correct naming of each model and its model directory. However, I'm wondering if there are established or recommended ways to do this in TensorFlow.
Question: Should I be subclassing tf.keras.Model instead of using functions to build models, or Python classes that encapsulate them? Would subclassing keras.Model break (or require much work to enable) any of the functionality that I use with Keras estimators and TensorBoard? I've seen many issues people have had with custom Model classes and am somewhat reluctant to put in the work only to find that it doesn't work for me. Do you have other suggestions for how to better organize my project?
Thank you very much in advance.
Subclass only if you absolutely need to. I personally prefer to follow this order of implementation; if the complexity of the model you are designing cannot be achieved with the first two options, then of course subclassing is the only option left.
tf.keras Sequential API
tf.keras Functional API
Subclass tf.keras.Model
These guides seem like a reasonable place to start: https://www.tensorflow.org/guide/keras/custom_layers_and_models and https://www.tensorflow.org/api_docs/python/tf/keras/Model.
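As a sketch of the functional route applied to a setup like yours (hyperparameters in, a compiled model out, still convertible with model_to_estimator); the layer sizes and names are only illustrative:

    import tensorflow as tf

    def build_model(n_features, hidden_units=64, dropout=0.2, lr=1e-3, name="mlp"):
        """Builds and compiles a Keras model from hyperparameters (functional API)."""
        inputs = tf.keras.Input(shape=(n_features,), name="features")
        x = tf.keras.layers.Dense(hidden_units, activation="relu")(inputs)
        x = tf.keras.layers.Dropout(dropout)(x)
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
        model = tf.keras.Model(inputs, outputs, name=name)
        model.compile(optimizer=tf.keras.optimizers.Adam(lr),
                      loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    # model.summary() and TensorBoard keep working, and the conversion you
    # already rely on is unchanged:
    estimator = tf.keras.estimator.model_to_estimator(
        keras_model=build_model(n_features=20), model_dir="models/mlp_baseline")

Shared building blocks can stay as plain helper functions returning layers or tensors; a thin wrapper class around build_model can then handle naming and the model directory without touching any Keras internals.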