How can I convert a LightGBM model to ONNX?

I'm trying to save my model so it can be used in an ASP.NET program, and I think ONNX is a good way to do so. The problem is that even after checking the docs and googling all day, I still get the same error:
ValueError: Initial types are required. See usage of convert(...) in skl2onnx.convert for details.
I have no idea what's going on, and any help is greatly appreciated!
My Code
import onnxmltools
from skl2onnx import convert
import lightgbm as lgb
import pandas as pd
parameters = {
    'boosting': 'gbdt',
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'num_boost_round': 10000,
    'verbose': -1  # maybe?
}
model_lgbm = lgb.train(parameters, train_data, valid_sets=test_data, early_stopping_rounds=200)
onnx_model = convert.convert_sklearn(model_lgbm, ???)

I think this doc will help you.
You have to use:
onnxmltools.convert_lightgbm
and not:
convert.convert_sklearn

As the other answer mentions, you have to use onnxmltools.convert_lightgbm and not convert.convert_sklearn.
The error is also caused by not defining initial_types. The docs describe initial_types as follows:
Example of initial_types: Assume that the specified scikit-learn model takes a heterogeneous list as its input. If the first 5 elements are floats and the last 10 elements are integers, we need to specify initial types as below. The [None] in [None, 5] indicates the batch size here is unknown.
from skl2onnx.common.data_types import FloatTensorType, Int64TensorType
initial_type = [('float_input', FloatTensorType([None, 5])),
                ('int64_input', Int64TensorType([None, 10]))]
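Putting the two answers together, a conversion could look roughly like the sketch below. It assumes that onnxmltools and lightgbm are installed, that every feature is a float, and that the trained object is the lgb.Booster returned by lgb.train. The helper names, the input name "input" and the output path are placeholders chosen for illustration, not anything from the question.

```python
def float_input_shape(n_features):
    """Shape of a 2-D float input: unknown batch size, fixed number of features."""
    return [None, n_features]


def booster_to_onnx(booster, path="model.onnx"):
    """Convert a trained LightGBM booster to ONNX and write it to `path` (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the converter installed.
    import onnxmltools
    from onnxmltools.convert.common.data_types import FloatTensorType

    # One float tensor holding all features; the name "input" is an arbitrary choice.
    shape = float_input_shape(booster.num_feature())
    initial_types = [("input", FloatTensorType(shape))]

    onnx_model = onnxmltools.convert_lightgbm(booster, initial_types=initial_types)
    onnxmltools.utils.save_model(onnx_model, path)
    return onnx_model
```

With the question's variable names, this would be called as booster_to_onnx(model_lgbm) once training has finished. If the model mixes float and integer inputs, the initial_types list would need one entry per input, as in the docs excerpt above.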

Related

Getting ValueError when expanding my GMMHMM from 2 to three states

I am trying to expand my GMMHMM model from two to three states but get the error below:
"ValueError: startprob_ must sum to 1 (got nan)"
The message suggests that my initial state distribution does not sum to one, but it does (see Pi). I also get the following warning, which might have something to do with it:
"UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=1."
Furthermore, when I look into it, I can see that my state transition matrix contains nan values.
import numpy as np
from hmmlearn.hmm import GMMHMM
import pandas as pd
Pi = np.array([0.24, 0.37, 0.39])
A = np.array([[0.74, 0.20, 0.06],
[0.20, 0.53, 0.27 ],
[0.05, 0.40, 0.54]])
model = GMMHMM(n_components=3, n_mix=1, startprob_prior=Pi, transmat_prior=A,
min_covar=0.001, tol=0.0001, n_iter=10000)
Obs = df[['gdp','un','inf','inx','itr']].to_numpy()
print(Obs)
model.fit(Obs)
print(model.transmat_)
seq = model.decode(Obs)
print(seq)
I am not a very experienced Python programmer, so this might be easy to solve, but unfortunately I do not see how. Any help would be highly appreciated!

Summarise the posterior of a single parameter from an array with arviz

I am estimating a model using the pyMC3 library in python. In my "real" model, there are four parameter arrays, two of which have over 170,000 parameters in them. Summarising this array of parameters is too computationally intensive on my computer. I have been trying to figure out if the summary function in arviz will allow me to only summarise one (or a small number) of parameters in the array. Below is a reprex where the same problem is present, though the model is a lot simpler. In the linear regression model below, the parameter array b has three parameters in it b[0], b[1], b[2]. I would like to know how to get the summary for just b[0] and b[1] or alternatively for just a single parameter, e.g., b[0].
import pandas as pd
import pymc3 as pm
import arviz as az
d = pd.read_csv("https://quantoid.net/files/mtcars.csv")
mpg = d['mpg'].values
hp = d['hp'].values
weight = d['wt'].values
with pm.Model() as model:
    b = pm.Normal("b", mu=0, sigma=10, shape=3)
    sig = pm.HalfCauchy("sig", beta=2)
    mu = pm.Deterministic('mu', b[0] + b[1]*hp + b[2]*weight)
    like = pm.Normal('like', mu=mu, sigma=sig, observed=mpg)
    fit = pm.fit(10000, method='advi')
    samp = fit.sample(1500)
with model:
    smry = az.summary(samp, var_names=["b"])
It looked like the coords argument to the summary() function would do it, but after googling around and finding a few examples, like the one here with plot_posterior() instead of summary(), I was unable to get something to work. In particular, I tried the following in the hopes that it would return the summary for b[0] and b[1].
with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": range(1)})
or this to return the summary of b[0]:
with model:
    smry = az.summary(samp, var_names=["b"], coords={"b_dim_0": [0]})
I suspect I am missing something simple (I'm an R user who dabbles occasionally with Python). Any help is greatly appreciated.
(BTW, I am using Python 3.8.0, pyMC3 3.9.3, arviz 0.10.0)

To use coords for this, you need to update to the development version of ArviZ (which still reports 0.11.2 but contains the code from GitHub) or to any release after 0.11.2. Until 0.11.2, the coords argument in summary was not used to subset the data (as it is in all plotting functions); it was only taken into account if the input was not already InferenceData, in which case it was passed to the converter.
With older versions, you need to use xarray to subset the data before passing it to summary. Therefore, you need to explicitly convert the trace to InferenceData beforehand. In the example above it would look like:
with model:
    ...
    samp = fit.sample(1500)
    idata = az.from_pymc3(samp)
az.summary(idata.posterior[["b"]].sel({"b_dim_0": [0]}))
Moreover, you may also want to tell summary to compute only a subset of the stats/diagnostics, as shown in the docstring examples.

Order of priors in sklearn LinearDiscriminantAnalysis

I'm fitting a Linear Discriminant Analysis model using the stock market data (Smarket.csv) from here. I'm trying to predict Direction with columns Lag1 and Lag2. Direction has two values: Up or Down.
Here is my reproducible code and the result:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
url="https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/master/Notebooks/Data/Smarket.csv"
Smarket=pd.read_csv(url, usecols=range(1,10), index_col=0, parse_dates=True)
X_train = Smarket[:'2004'][['Lag1', 'Lag2']]
y_train = Smarket[:'2004']['Direction']
LDA = LinearDiscriminantAnalysis()
model = LDA.fit(X_train, y_train)
print(model.priors_)
[0.49198397 0.50801603]
How do I know which prior value corresponds to which class (Up or Down)? I looked at the documentation but there seems to be nothing.
Can someone explain it to me or point me to a resource that explains this?

Although I cannot find an explicit reference in the documentation (I'm sure there is a general one somewhere), in such cases the classes are ordered alphabetically, i.e. in your case the order is ['Down', 'Up'].
You can easily verify that this is consistent with your results: when priors=None, as here, the priors_ attribute holds, according to the documentation, the class proportions inferred from the training data:
y_train.value_counts(normalize=True)
gives:
Up 0.508016
Down 0.491984
Name: Direction, dtype: float64
and
model.priors_[0] == y_train.value_counts(normalize=True)['Down']
# True
model.priors_[1] == y_train.value_counts(normalize=True)['Up']
# True
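A less error-prone way to read the mapping is to pair priors_ with the fitted estimator's classes_ attribute, which holds the class labels in the same order as the per-class outputs. The sketch below relies only on an object that has those two attributes. The helper name and the stand-in object are illustrative, and the numbers are the ones printed in the question.

```python
from types import SimpleNamespace


def priors_by_class(fitted_lda):
    """Map each class label to its prior, using the order stored in classes_."""
    return {str(label): float(p) for label, p in zip(fitted_lda.classes_, fitted_lda.priors_)}


# Stand-in for the fitted LinearDiscriminantAnalysis from the question
# (assumption: classes_ is sorted, as the answer above describes).
fake_model = SimpleNamespace(classes_=["Down", "Up"], priors_=[0.49198397, 0.50801603])
mapping = priors_by_class(fake_model)
print(mapping)
```

On the real model, priors_by_class(model) should give the same pairing without having to compare against value_counts by hand.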

RandomizedSearchCV takes longer time with fewer elements to search

I am having a strange issue. I am using RandomizedSearchCV to optimize my parameters.
para_RS = {"max_depth": randint(1, 70),
           "max_features": ["log2", "sqrt"],
           "min_samples_leaf": randint(5, 50),
           "criterion": ["entropy", "gini"],
           "class_weight": ['balanced'],
           "max_leaf_nodes": randint(2, 20)
           }
dt = DecisionTreeClassifier()
If I include all these parameters, the output comes in 2-3 minutes. However, if I remove all of them and keep only the parameter below, it takes forever to run and I have to kill the notebook:
para_RS = {
"max_depth": randint(1,70)
}
Also, if I remove only a few of them, it takes a long time to run (5-10 minutes).
Below is the code:
if randomsearch:
    tick = time.time()
    print("Random_Search_begin")
    rs = RandomizedSearchCV(estimator=dt, cv=5, param_distributions=para_RS,
                            n_jobs=4, n_iter=30, scoring="roc_auc", return_train_score=True)
    rs.fit(trainx_outer, trainy_outer)
    # other code irrelevant to the issue...
    print("Random_Search_end")

This is due to the random nature of the following:
"max_depth": randint(1,70)
"max_leaf_nodes":randint(2,20)
randint(1, 70) will return an integer between 1 and 70, so a different value of max_depth is generated during different runs.
So it may happen that during a certain run, the value generated is very high. The speed of DecisionTreeClassifier is affected by the values of max_depth and max_leaf_nodes. If these are very large, the run time will be very large.
Also, I am not sure how you are able to run this code, because RandomizedSearchCV takes a dictionary of iterables as its parameter grid, but your code will generate a single int for "max_depth" and "max_leaf_nodes" instead of an array or iterable, so it should throw an error. Which version of sklearn are you using? Or is the code shown here different from the actual code?

You can close this, as it seems the problem went away when I started using a random seed in both the classifier and RandomizedSearchCV. Thanks for all the help.
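For readers who want to reproduce that fix, a seeded setup might look like the sketch below. It assumes randint in the question is scipy.stats.randint, which returns a frozen distribution that RandomizedSearchCV samples once per candidate; if it were random.randint, a single integer would be drawn up front instead. The seed value and the helper name are arbitrary choices, and only two of the question's parameters are shown.

```python
SEED = 0  # arbitrary; any fixed integer makes the runs repeatable


def build_seeded_search(seed=SEED):
    """Return a RandomizedSearchCV whose tree and parameter draws are both seeded."""
    # Imported lazily; requires scipy and scikit-learn.
    from scipy.stats import randint
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    param_distributions = {
        "max_depth": randint(1, 70),       # frozen distribution, sampled per candidate
        "max_leaf_nodes": randint(2, 20),
    }
    tree = DecisionTreeClassifier(random_state=seed)
    return RandomizedSearchCV(
        estimator=tree,
        param_distributions=param_distributions,
        n_iter=30,
        cv=5,
        scoring="roc_auc",
        n_jobs=4,
        return_train_score=True,
        random_state=seed,  # fixes which candidates are drawn
    )
```

Calling build_seeded_search().fit(trainx_outer, trainy_outer) twice should then evaluate the same 30 candidates, which makes slow runs easier to compare.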

Feature Importance with XGBClassifier

Hopefully I'm reading this wrong, but in the XGBoost library documentation there is a note about extracting the feature importance attributes using feature_importances_, much like sklearn's random forest.
However, for some reason, I keep getting this error: AttributeError: 'XGBClassifier' object has no attribute 'feature_importances_'
My code snippet is below:
from sklearn import datasets
import xgboost as xg
iris = datasets.load_iris()
X = iris.data
Y = iris.target
Y = iris.target[ Y < 2] # arbitrarily removing class 2 so it can be 0 and 1
X = X[range(1,len(Y)+1)] # cutting the dataframe to match the rows in Y
xgb = xg.XGBClassifier()
fit = xgb.fit(X, Y)
fit.feature_importances_
It seems that you can compute feature importance using the Booster object by calling the get_fscore attribute. The only reason I'm using XGBClassifier over Booster is because it is able to be wrapped in a sklearn pipeline. Any thoughts on feature extractions? Is anyone else experiencing this?

As the comments indicate, I suspect your issue is a versioning one. However if you do not want to/can't update, then the following function should work for you.
def get_xgb_imp(xgb, feat_names):
    # booster() is the older API; newer xgboost releases use get_booster()
    imp_vals = xgb.booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    # sum() over the dict values also works on Python 3
    total = sum(imp_dict.values())
    return {k: v/total for k, v in imp_dict.items()}
>>> import numpy as np
>>> from xgboost import XGBClassifier
>>>
>>> feat_names = ['var1','var2','var3','var4','var5']
>>> np.random.seed(1)
>>> X = np.random.rand(100,5)
>>> y = np.random.rand(100).round()
>>> xgb = XGBClassifier(n_estimators=10)
>>> xgb = xgb.fit(X,y)
>>>
>>> get_xgb_imp(xgb,feat_names)
{'var5': 0.0, 'var4': 0.20408163265306123, 'var1': 0.34693877551020408, 'var3': 0.22448979591836735, 'var2': 0.22448979591836735}

For xgboost, if you use xgb.fit(), then you can use the following method to get the feature importance.
import pandas as pd
xgb_model = xgb.fit(x, y)
xgb_fea_imp = pd.DataFrame(list(xgb_model.get_booster().get_fscore().items()),
                           columns=['feature', 'importance']).sort_values('importance', ascending=False)
print('', xgb_fea_imp)
xgb_fea_imp.to_csv('xgb_fea_imp.csv')
from xgboost import plot_importance
plot_importance(xgb_model)

I found out the answer. It appears that version 0.4a30 does not have the feature_importances_ attribute. Therefore, if you install the xgboost package using pip install xgboost, you will be unable to extract feature importances from the XGBClassifier object; you can refer to @David's answer if you want a workaround.
However, what I did was build it from source by cloning the repo and running . ./build.sh, which installs version 0.4, where the feature_importances_ attribute works.
Hope this helps others!

Get feature importance as a sorted data frame:
import pandas as pd
import numpy as np
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.booster().get_fscore()
    feats_imp = pd.DataFrame(imp_vals, index=np.arange(2)).T
    feats_imp.iloc[:, 0] = feats_imp.index
    feats_imp.columns = ['feature', 'importance']
    feats_imp.sort_values('importance', inplace=True, ascending=False)
    feats_imp.reset_index(drop=True, inplace=True)
    return feats_imp
feature_importance_df = get_xgb_imp(xgb, feat_names)

For those having the same problem as Luís Bianchin, "TypeError: 'str' object is not callable", I found a solution (that works for me at least) here.
In short, I found modifying David's code from
imp_vals = xgb.booster().get_fscore()
to
imp_vals = xgb.get_fscore()
worked for me.
For more detail I would recommend visiting the link above.
Big thanks to David and ianozsvald

You can also use the built-in plot_importance function:
from xgboost import XGBClassifier, plot_importance
fit = XGBClassifier().fit(X,Y)
plot_importance(fit)

The alternatives to the built-in feature importance are:
permutation-based importance from scikit-learn (the permutation_importance function)
importance with Shapley values (the shap package)
I really like the shap package because it provides additional plots, for example an importance plot, a summary plot and a dependence plot.
You can read about alternative ways to compute feature importance in Xgboost in this blog post of mine.
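To make the permutation idea concrete, here is a small from-scratch sketch in plain Python. It is not scikit-learn's sklearn.inspection.permutation_importance, only an illustration of the same principle: shuffle one feature column at a time and record how much the accuracy drops. The function name, the toy model and the toy data are all made up for the example.

```python
import random


def permutation_importance_sketch(predict, rows, labels, n_repeats=5, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(predict(row) == label for row, label in zip(data, labels))
        return hits / len(labels)

    baseline = accuracy(rows)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[col] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this one column permuted.
            permuted = [row[:col] + [value] + row[col + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model (hypothetical): the prediction depends on feature 0 only.
def toy_predict(row):
    return int(row[0] > 0.5)


rows = [[0.1, 7.0], [0.9, 3.0], [0.2, 5.0], [0.8, 1.0], [0.3, 2.0], [0.7, 9.0]]
labels = [toy_predict(row) for row in rows]
importances = permutation_importance_sketch(toy_predict, rows, labels)
print(importances)
```

Because the toy model ignores feature 1, shuffling that column never changes a prediction and its importance comes out as exactly zero. With a fitted XGBClassifier, predict would wrap the model's own predict method, and in practice the scikit-learn function is the more robust choice.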

An update of the accepted answer since it no longer works:
def get_xgb_imp(xgb_model, feat_names):
    imp_vals = xgb_model.get_fscore()
    imp_dict = {feat: float(imp_vals.get(feat, 0.)) for feat in feat_names}
    total = sum(list(imp_dict.values()))
    return {k: round(v/total, 5) for k, v in imp_dict.items()}

It seems like the API keeps changing. For xgboost version 1.0.2, just changing imp_vals = xgb.booster().get_fscore() to imp_vals = xgb.get_booster().get_fscore() in @David's answer does the trick. The updated code is:
def get_xgb_imp(xgb, feat_names):
    imp_vals = xgb.get_booster().get_fscore()
    imp_dict = {feat_names[i]: float(imp_vals.get('f'+str(i), 0.)) for i in range(len(feat_names))}
    # sum() works on dict views in Python 3, where array(imp_dict.values()) does not
    total = sum(imp_dict.values())
    return {k: v/total for k, v in imp_dict.items()}
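The normalisation step in these get_xgb_imp variants does not depend on xgboost, so it can be tried in isolation on any dictionary shaped like the output of get_fscore(). The sketch below is one way to write it, with a guard for the case where no feature was ever used in a split. The function name is hypothetical, and the split counts are invented so that the resulting shares match the proportions printed in the earlier answer; the real counts are not shown in the thread.

```python
def normalize_fscores(fscores, feat_names):
    """Turn raw split counts into shares that sum to 1 (all zeros if no splits)."""
    # get_fscore() omits features that were never used, hence the default of 0.
    raw = {name: float(fscores.get("f%d" % i, 0.0)) for i, name in enumerate(feat_names)}
    total = sum(raw.values())
    if total == 0:
        return raw
    return {name: value / total for name, value in raw.items()}


# Hypothetical split counts, picked to reproduce the shares printed earlier in the thread.
example_scores = {"f0": 17, "f1": 11, "f2": 11, "f3": 10}
shares = normalize_fscores(example_scores, ["var1", "var2", "var3", "var4", "var5"])
print(shares)
```

Keeping this step separate from the booster call means only the one line that fetches the scores has to change when the xgboost API moves again.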

I used the following code to get the feature importance. I also used DictVectorizer() in the pipeline for one-hot encoding. If you use
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer()
X_to_dict = X.to_dict("records")
X_transformed = v.fit_transform(X_to_dict)
# recent scikit-learn releases name this method get_feature_names_out()
feature_names = v.get_feature_names()
best_model.get_booster().feature_names = feature_names
xgb.plot_importance(best_model.get_booster())
you can obtain the f_score plot. But I wanted to plot the feature importance against the feature names, so I modified it further:
f, ax = plt.subplots(figsize=(10, 30))
plt.barh(feature_names, best_model.feature_importances_)
plt.xticks(rotation=90)
plt.show()
