How to convert LIGHTGBM to PMML? - python

I would like to know if there is a way to convert a LightGBM model to PMML. Starting from a lightgbm.basic.Booster object, how can I convert it to a PMML or MOJO/POJO object? If that is not possible, I would like to know whether I can save the LGBM model as a Pickle and then convert it to a PMML (or MOJO/POJO) object.

For now, there are at least two packages that can create PMML from LightGBM, sklearn2pmml and Nyoka, but neither can create PMML from an already-trained Booster.
To create PMML, you need to train through the Scikit-learn API, e.g. LGBMClassifier inside a Pipeline. Both packages create PMML in almost the same way. The detailed usage is described here for sklearn2pmml and here for Nyoka, and both are pretty simple.
Nyoka just uses the normal Scikit-learn API for training, but sklearn2pmml requires Java to be installed and a PMMLPipeline to be used during training, so if you are working in Python with sklearn, Nyoka may be the better choice.
It would be nice if there were a way to create PMML directly from a trained Booster, or a way to convert a Booster to an LGBMClassifier and then create PMML, but there is no package that creates PMML from a Booster directly, and according to this, there is no official way to convert a Booster to an LGBMClassifier.

Related

How to use scikit learn model from inside sagemaker 'model.tar.gz' file?

New to SageMaker.
Trained a "linear-learner" classification model using the SageMaker API, and it saved a "model.tar.gz" file in my S3 path. From what I understand, SM just used an image of a scikit logreg model.
Finally, I'd like to gain access to the model object itself, so I unpacked the "model.tar.gz" file only to find another file called "model_algo-1" with no extension.
Can anyone tell me how I can find the "real" modeling object without using the inference/Endpoint deploy API provided by SageMaker? There are some things I want to look at manually.
Thanks,
Craig
Linear-Learner is a built-in algorithm written using MXNet, and the binary is also MXNet-compatible. You can't use this model outside of SageMaker, as there is no open-source implementation of it.

Is there a format like ONNX for persisting data pre-processing / feature-engineering pipelines without serializing?

I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, which serialize arbitrary Python objects, but the resulting file
is opaque if I wanted to store the pipeline in version control,
can contain arbitrary Python objects and therefore might not be safe to deserialize, and
may run into issues with different Python or library versions.
It seems like ONNX is a great way to save models in a safe & interoperable way -- Is there any alternative for data pre-processing pipelines?

read pmml using sklearn2pmml?

I was using sklearn2pmml to export my model to .pmml file.
How do I read the PMML file back into a Python PMMLPipeline?
I was checking the repo but could not find the solution in the documentation.
If you want to save a fitted Scikit-Learn pipeline so that it can be loaded back into Scikit-Learn/Python, then you need to use "native" Python serialization libraries and data formats such as Pickle.
The conversion from Scikit-Learn to PMML should be regarded as a one-way operation. There is no easy way of getting PMML back into Scikit-Learn.
You can always save the fitted Scikit-Learn pipeline in multiple data formats. Save one copy in the Pickle data format so that it can be loaded back into Scikit-Learn/Python, and save another copy in the PMML data format so that it can be loaded in non-Python environments.

How can I use pmml model in PySpark script?

I have an XGBoost model that was trained in pure Python and converted to the PMML format. Now I need to use this model in a PySpark script, but I am out of ideas on how to do it. Are there methods that allow importing a PMML model in Python and using it for prediction? Thanks for any suggestions.
BR,
Vladimir
Spark does not support importing from PMML directly. While I have not encountered a PySpark PMML importer, there is one for Java (https://github.com/jpmml/jpmml-evaluator-spark). What you can do is wrap the Java (or Scala) code so you can access it from Python (e.g. see http://aseigneurin.github.io/2016/09/01/spark-calling-scala-code-from-pyspark.html).
You could use PyPMML-Spark to import PMML in a PySpark script, for example:
from pypmml_spark import ScoreModel
model = ScoreModel.fromFile('the/pmml/file/path')
score_df = model.transform(df)  # df is a Spark DataFrame with the input feature columns

Importing PMML models into Python (Scikit-learn)

There seem to be a few options for exporting PMML models out of scikit-learn, such as sklearn2pmml, but a lot less information going in the other direction. My case is an XGBoost model previously built in R, and saved to PMML using r2pmml, that I would like to use in Python. Scikit-learn normally uses pickle to save/load models, but is it also possible to import models into scikit-learn using PMML?
You can't connect different specialized representations (such as R and Scikit-Learn native data structures) over a generalized representation (such as PMML). You may have better luck trying to translate R data structures to Scikit-Learn data structures directly.
XGBoost is really an exception to the above rule, because its R and Scikit-Learn implementations are just thin wrappers around the native XGBoost library. Inside a trained R XGBoost object there is a blob raw, which is the model in its native XGBoost representation. Save it to a file, and load it in Python using the xgb.Booster.load_model(fname) method.
If you know that you need to deploy the XGBoost model in Scikit-Learn, then why do you train it in R?
