Read PMML using sklearn2pmml?

I was using sklearn2pmml to export my model to a .pmml file.
How can I read the PMML file back into a Python PMMLPipeline?
I checked the repo but could not find a solution in the documentation.

If you want to save a fitted Scikit-Learn pipeline so that it can be loaded back into Scikit-Learn/Python, then you need to use "native" Python serialization libraries and data formats such as Pickle.
The conversion from Scikit-Learn to PMML should be regarded as a one-way operation; there is no easy way of getting PMML back into Scikit-Learn.
You can always save the fitted Scikit-Learn pipeline in multiple data formats: save one copy in Pickle data format so that it can be loaded back into Scikit-Learn/Python, and save another copy in PMML data format so that it can be loaded in non-Python environments.
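For example, here is a minimal sketch of the two-copy approach, assuming a trivial LogisticRegression pipeline trained on the Iris dataset (the sklearn2pmml() converter function requires Java to be installed):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

pipeline = PMMLPipeline([("classifier", LogisticRegression())])
pipeline.fit(X, y)

# Copy 1: Pickle/joblib data format, loadable back into Scikit-Learn/Python
joblib.dump(pipeline, "pipeline.pkl.z", compress=3)

# Copy 2: PMML data format, for non-Python scoring environments (one-way)
sklearn2pmml(pipeline, "pipeline.pmml")

# Later, in Python, restore the pipeline from the Pickle copy:
pipeline = joblib.load("pipeline.pkl.z")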

Related

How to convert LightGBM to PMML?

I would like to know if there exists a way to convert a LightGBM model to PMML. Starting from the lightgbm.basic.Booster object, I would like to know how to convert it to a PMML or MOJO/POJO object. If that is not possible, I would like to know if it is possible to save the LGBM model as Pickle and then convert it to a PMML (or MOJO/POJO) object.
For now, there are at least two ways to create PMML from LightGBM, namely sklearn2pmml and Nyoka, but neither can create PMML from a learned Booster.
To create PMML, we need to use the Scikit-learn API, such as LGBMClassifier and Pipeline. Both packages create PMML in almost the same way. The detailed usage is described here for sklearn2pmml and here for Nyoka, and both are pretty simple.
Nyoka just uses the normal Scikit-learn API for training, but sklearn2pmml requires Java to be installed and PMMLPipeline to be used during training, so if you are using Python and sklearn, Nyoka may be a better choice.
It would be nice if there were a way to create PMML directly from a trained Booster, or a way to convert a Booster to an LGBMClassifier and then create PMML, but there are no other packages that create PMML from a Booster directly and, according to this, there is no official way to convert a Booster to an LGBMClassifier.
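As a minimal sketch of the sklearn2pmml route, assuming the Iris dataset as stand-in training data and Java available on the PATH:

from lightgbm import LGBMClassifier
from sklearn.datasets import load_iris
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Train through the Scikit-learn API (LGBMClassifier), not lightgbm.train(),
# so that the fitted model sits inside a convertible pipeline
pipeline = PMMLPipeline([("classifier", LGBMClassifier(n_estimators=100))])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "lgbm.pmml")

The Nyoka route is analogous: train a plain sklearn.pipeline.Pipeline with LGBMClassifier as the final step and export it with Nyoka's lgb_to_pmml() function, which does not need Java.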

Is there a format like ONNX for persisting data pre-processing / feature-engineering pipelines without serializing?

I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, which serialize arbitrary Python objects, but the resulting file
is opaque if I wanted to store the pipeline in version control,
will serialize any Python object and therefore might not be safe to deserialize, and
may run into issues with different Python versions or library versions.
It seems like ONNX is a great way to save models in a safe & interoperable way -- is there any alternative for data pre-processing pipelines?
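For what it's worth, skl2onnx can convert many simple pre-processing pipelines to ONNX. A hedged sketch, assuming a StandardScaler + LogisticRegression pipeline (converter coverage for more exotic transformers varies):

import numpy as np
import onnxruntime as rt
from skl2onnx import to_onnx
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = X.astype(np.float32)

# A pre-processing + model pipeline
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X, y)

# Convert to ONNX; the sample row lets skl2onnx infer the input schema
onx = to_onnx(pipe, X[:1])
with open("pipeline.onnx", "wb") as f:
    f.write(onx.SerializeToString())

# Re-load and predict without unpickling anything
sess = rt.InferenceSession("pipeline.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
preds = sess.run(None, {input_name: X[:5]})[0]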

SageMaker to use processed pickled ndarray instead of CSV files from S3

I understand that you can pass a CSV file from S3 into a SageMaker XGBoost container using the following code:
train_channel = sagemaker.session.s3_input(train_data, content_type='text/csv')
valid_channel = sagemaker.session.s3_input(validation_data, content_type='text/csv')
data_channels = {'train': train_channel, 'validation': valid_channel}
xgb_model.fit(inputs=data_channels, logs=True)
But I have ndarrays stored in an S3 bucket. These are processed, label-encoded, feature-engineered arrays. I want to pass these into the container instead of CSV files. I understand I can always convert my ndarrays into CSV files before saving them to S3; I'm just checking if there is an array option.
There are multiple options for algorithms in SageMaker:
Built-in algorithms, like the SageMaker XGBoost you mention
Custom, user-created algorithm code, which can be:
Written for a pre-built Docker image, available for Scikit-learn, TensorFlow, PyTorch, MXNet
Written in your own container
When you use built-ins (option 1), your choice of data formats is limited to what the built-ins support, which in the case of the built-in XGBoost is only CSV and LibSVM. If you want to use custom data formats and pre-processing logic before XGBoost, that is absolutely possible if you use your own script leveraging the open-source XGBoost; a sketch follows below. You can get inspiration from the Random Forest demo to see how to create custom models in pre-built containers.
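For instance, here is a hedged sketch of a script-mode training entry point that reads NumPy arrays instead of CSV. The file names train_X.npy and train_y.npy are hypothetical, but SM_CHANNEL_TRAIN and SM_MODEL_DIR are the standard environment variables that SageMaker sets inside the container:

import os
import numpy as np
import xgboost as xgb

if __name__ == "__main__":
    train_dir = os.environ["SM_CHANNEL_TRAIN"]   # where SageMaker copies the S3 train channel
    model_dir = os.environ["SM_MODEL_DIR"]       # where the model artifact must be written

    # Load the pre-processed arrays in their native format (hypothetical file names)
    X = np.load(os.path.join(train_dir, "train_X.npy"))
    y = np.load(os.path.join(train_dir, "train_y.npy"))

    # Train with the open-source XGBoost library
    booster = xgb.train({"objective": "reg:squarederror"},
                        xgb.DMatrix(X, label=y),
                        num_boost_round=100)
    booster.save_model(os.path.join(model_dir, "xgboost-model"))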

Saving data in ".t7" format with Python

Is there any possible way to save data in ".t7" format with Python?
".t7" is the serialization format that can be opened by Lua Torch. However, simply saving my data with pickle under a file name ending in ".t7" does not work.
I have been searching the internet, but I could not find a working answer.
There is currently no such converter, but there is a workaround: convert the PyTorch model to a Caffe model, and then from Caffe to a Torch (Lua) model. Here is the table of converters by framework.

Importing PMML models into Python (Scikit-learn)

There seem to be a few options for exporting PMML models out of scikit-learn, such as sklearn2pmml, but far less information about going in the other direction. My case is an XGBoost model previously built in R, saved to PMML using r2pmml, that I would like to use in Python. Scikit-learn normally uses pickle to save/load models, but is it also possible to import models into scikit-learn using PMML?
You can't connect two specialized representations (such as R and Scikit-Learn native data structures) over a generalized representation (such as PMML). You may have better luck translating R data structures to Scikit-Learn data structures directly.
XGBoost is really an exception to the above rule, because its R and Scikit-Learn implementations are just thin wrappers around the native XGBoost library. Inside a trained R XGBoost object there is a blob raw, which is the model in its native XGBoost representation. Save it to a file, and load it in Python using the xgb.Booster.load_model(fname) method.
If you know that you need to deploy the XGBoost model in Scikit-Learn, then why train it in R?
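A minimal sketch of the Python side, assuming the model was written out in R with xgb.save(model, "xgboost.model"), which dumps the native binary representation:

import xgboost as xgb

# Load the native-format file produced in R (the path is illustrative)
booster = xgb.Booster()
booster.load_model("xgboost.model")

# Score new data through the native API:
# preds = booster.predict(xgb.DMatrix(X))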
