Importing PMML models into Python (Scikit-learn)

There seem to be a few options for exporting PMML models out of scikit-learn, such as sklearn2pmml, but a lot less information going in the other direction. My case is an XGBoost model previously built in R, and saved to PMML using r2pmml, that I would like to use in Python. Scikit normally uses pickle to save/load models, but is it also possible to import models into scikit-learn using PMML?

You can't bridge two specialized representations (such as R and Scikit-Learn native data structures) through a generalized representation (such as PMML). You may have better luck translating the R data structures to Scikit-Learn data structures directly.
XGBoost is really an exception to the above rule, because its R and Scikit-Learn implementations are just thin wrappers around the native XGBoost library. Inside a trained R XGBoost object there is a binary blob called raw, which holds the model in its native XGBoost representation. Save it to a file, and load it in Python using the xgb.Booster.load_model(fname) method.
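On the Python side, loading the saved native model is short; a minimal sketch (the file name xgboost.model is a placeholder, and it assumes the R side wrote the booster out in native format, e.g. with xgb.save):

    import xgboost as xgb

    # Load the native-format model written from R
    # ("xgboost.model" is a placeholder file name)
    booster = xgb.Booster()
    booster.load_model("xgboost.model")

    # Predictions then go through xgb.DMatrix:
    # preds = booster.predict(xgb.DMatrix(X))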
If you know that you need to deploy the XGBoost model in Scikit-Learn, then why do you train it in R?

Related

How to convert LIGHTGBM to PMML?

I would like to know if there exists a way to convert a LightGBM model to PMML. Starting from the lightgbm.basic.Booster object, I would like to know how to convert it to a PMML or MOJO/POJO object. If that is not possible, I would like to know whether it is possible to save the LGBM model as a Pickle and then convert it to a PMML (or MOJO/POJO) object.
For now, there are at least two ways to create PMML from LightGBM, sklearn2pmml and Nyoka, but neither can create PMML from an already-trained Booster.
To create PMML, we need to use the Scikit-learn API, i.e. LGBMClassifier and Pipeline. Both packages create PMML in almost the same way. The detailed usage is described here for sklearn2pmml and here for Nyoka, and both are pretty simple.
Nyoka just uses the normal Scikit-learn API for training, whereas sklearn2pmml requires Java to be installed and PMMLPipeline to be used during training, so if you are using Python and sklearn, Nyoka may be the easier choice.
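For illustration, a minimal sklearn2pmml sketch (assumes a Java runtime is installed; the Nyoka flow is similar, but pure Python):

    from lightgbm import LGBMClassifier
    from sklearn.datasets import load_iris
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = load_iris(return_X_y=True)

    # Train through the Scikit-learn API; a bare lightgbm Booster
    # cannot be converted
    pipeline = PMMLPipeline([("classifier", LGBMClassifier(n_estimators=50))])
    pipeline.fit(X, y)

    sklearn2pmml(pipeline, "lightgbm.pmml")  # runs the Java converter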
It would be nice if there were a way to create PMML directly from a trained Booster, or a way to convert a Booster to an LGBMClassifier and then create PMML, but there are no other packages that create PMML from a Booster directly and, according to this, there is no official way to convert a Booster to an LGBMClassifier.

Is there a format like ONNX for persisting data pre-processing / feature-engineering pipelines without serializing?

I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, which serialize arbitrary Python objects, but the resulting file:
- is opaque if I wanted to store the pipeline in version control,
- will serialize any Python object and therefore might not be safe to unserialize, and
- may run into issues with different Python or library versions.
It seems like ONNX is a great way to save models in a safe and interoperable way. Is there any alternative for data pre-processing pipelines?
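For what it's worth, skl2onnx can convert many scikit-learn pre-processing steps together with the final estimator; a minimal sketch, assuming every step in the pipeline has an ONNX converter:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from skl2onnx import to_onnx

    # Toy data, just to make the sketch self-contained
    X = np.random.rand(100, 4).astype(np.float32)
    y = (X[:, 0] > 0.5).astype(np.int64)

    pipe = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression())])
    pipe.fit(X, y)

    # Convert the whole pipeline (pre-processing included) to ONNX
    onx = to_onnx(pipe, X[:1])
    with open("pipeline.onnx", "wb") as f:
        f.write(onx.SerializeToString())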

read pmml using sklearn2pmml?

I was using sklearn2pmml to export my model to a .pmml file.
How can I read the PMML file back into a Python PMMLPipeline?
I was checking the repo but could not find the solution in the documentation.
If you want to save a fitted Scikit-Learn pipeline so that it can be loaded back into Scikit-Learn/Python, then you need to use "native" Python serialization libraries and data formats such as Pickle.
The conversion from Scikit-Learn to PMML should be regarded as a one-way operation. There is no easy way of getting PMML back into Scikit-Learn.
You can always save the fitted Scikit-Learn pipeline in multiple data formats. Save one copy in the Pickle data format so that it can be loaded back into Scikit-Learn/Python, and save another copy in the PMML data format so that it can be loaded in non-Python environments.
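A minimal sketch of the two-copy approach (file names are placeholders; the PMML step assumes sklearn2pmml and a Java runtime):

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn2pmml import sklearn2pmml
    from sklearn2pmml.pipeline import PMMLPipeline

    X, y = load_iris(return_X_y=True)
    pipeline = PMMLPipeline([("classifier", LogisticRegression(max_iter=1000))])
    pipeline.fit(X, y)

    # Copy 1: Pickle/joblib, for loading back into Scikit-Learn/Python
    joblib.dump(pipeline, "pipeline.pkl")
    pipeline_again = joblib.load("pipeline.pkl")

    # Copy 2: PMML, for non-Python scoring environments
    sklearn2pmml(pipeline, "pipeline.pmml")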

How to get the exact structure from python sklearn machine learning algorithms?

I am new to machine learning and have mostly been using python's sklearn to create classification models on various data sets (iris data set etc). I have also been watching videos/tutorials which explain the underlying principles involved.
I want to know whether there is any way to see "under the hood" of the Python sklearn algorithms. For example, I have created a decision tree classifier using sklearn and would like to see the exact structure of the tree created. Is this possible, or will I need to write/develop my own algorithms from scratch?
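For the decision tree example specifically, scikit-learn can print the fitted structure out of the box; a minimal sketch using export_text:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    clf = DecisionTreeClassifier(max_depth=3).fit(data.data, data.target)

    # Print the learned tree as nested if/else rules; the raw arrays are
    # also exposed on clf.tree_ (thresholds, children, leaf values, ...)
    print(export_text(clf, feature_names=list(data.feature_names)))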

Random Forest interpretation in scikit-learn

I am using scikit-learn's RandomForestRegressor to fit a random forest regressor on a dataset. Is it possible to interpret the output in a format where I can then implement the fitted model without using scikit-learn or even Python?
The solution would need to be implemented in a microcontroller or maybe even an FPGA. I am doing analysis and learning in Python but want to implement on a uC or FPGA.
You can check out graphviz, which uses the 'dot' language for storing models (quite human-readable, should you want to build a custom interpreter; it shouldn't be hard). Scikit-learn provides an export_graphviz function. You can load and process the model in C++ through the Boost library's read_graphviz method, or with one of the other custom interpreters available.
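A minimal sketch of dumping every tree of a fitted forest to 'dot' files (file names are placeholders):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import export_graphviz

    X, y = load_diabetes(return_X_y=True)
    forest = RandomForestRegressor(n_estimators=10).fit(X, y)

    # One .dot file per tree; each file holds the full split structure
    # (feature indices, thresholds, leaf values) in plain text
    for i, tree in enumerate(forest.estimators_):
        export_graphviz(tree, out_file="tree_%d.dot" % i)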
You could try extracting rules from the tree ensemble model and implementing the rules in hardware.
You can use TE2Rules (Tree Ensembles to Rules) to extract human-understandable rules that explain a scikit-learn tree ensemble (like GradientBoostingClassifier). It provides levers to control interpretability, fidelity and run-time budget to extract useful explanations. Rules extracted by TE2Rules are guaranteed to closely approximate the tree ensemble, by considering the joint interactions of multiple trees in the ensemble.
References:
TE2Rules: code at https://github.com/linkedin/TE2Rules, documentation at https://te2rules.readthedocs.io/en/latest/.
Disclosure: I'm one of the core developers of TE2Rules.
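A sketch of the TE2Rules flow, based on the project's README (the ModelExplainer names below are assumptions; check the linked documentation for the current API):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from te2rules.explainer import ModelExplainer  # assumed import path

    data = load_breast_cancer()
    model = GradientBoostingClassifier(n_estimators=10).fit(data.data, data.target)

    # Extract human-readable rules that approximate the ensemble's predictions
    explainer = ModelExplainer(model=model, feature_names=list(data.feature_names))
    rules = explainer.explain(X=data.data.tolist(),
                              y=model.predict(data.data).tolist())
    for rule in rules:
        print(rule)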
It's unclear what you mean by this part:
Now that I have the results, is it possible to interpret this in some format where I can then implement the fit without using sklearn or even python?
Implement the fitting process for a given dataset? The tree topology? The choice of parameters?
As to 'implement... without using sklearn or python', did you mean 'port the bytecode or binary' or 'clean-code a totally new implementation'?
Assuming you meant the latter, I'd suggest GPU rather than FPGA or uC.
