Export machine learning model - Python

I am creating a machine learning model and want to export it.
Suppose I am using the scikit-learn library and the Random Forest algorithm.
modelC = RandomForestClassifier(n_estimators=30)
m = modelC.fit(trainvec, yvec)
modelC.model
How can I export it, or is there a function for it?

If you follow the scikit-learn documentation on model persistence:
In [1]: from sklearn.ensemble import RandomForestClassifier
In [2]: from sklearn import datasets
In [3]: from sklearn.externals import joblib
In [4]: iris = datasets.load_iris()
In [5]: X, y = iris.data, iris.target
In [6]: m = RandomForestClassifier(2).fit(X, y)
In [7]: m
Out[7]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=2, n_jobs=1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
In [8]: joblib.dump(m, "filename.cls")
In fact, you can use pickle.dump instead of joblib, but joblib does a much better job at compressing the large NumPy arrays inside fitted estimators.
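To load the model back, call joblib.load with the same path. Note that the sklearn.externals.joblib import used above was removed in scikit-learn 0.23; on recent versions, install the standalone joblib package and import it directly:
In [9]: import joblib  # standalone package; required on scikit-learn >= 0.23
In [10]: m2 = joblib.load("filename.cls")
In [11]: m2.predict(X[:3])  # the restored model predicts as before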

Related

What is the LightGBM equivalent of .fit() to DecisionTreeRegressor/Classifier (from scikit-learn)?

I am trying to replicate the .fit() method using the lightgbm library in Python, but there seem to be several candidate methods on the lightgbm Booster:
.update()
.refit()
.train()
I have tried all three to no avail.
tree = DecisionTreeRegressor(criterion='friedman_mse', max_depth=3,
                             max_features=self.max_features, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=5,
                             min_weight_fraction_leaf=0.0, random_state=0)
tree.fit(X, gradient)
This works. The following, however, doesn't:
tree = lgb.Booster(model_file='lgbm_model.txt')
train_data = lgb.Dataset(X, label=gradient, free_raw_data=False)
valid_data = lgb.Dataset(Xtest, label=gradient_t, free_raw_data=False)
Solution 1:
tree.update(train_data) # gives me this error:
AttributeError: 'Booster' object has no attribute 'train_set'
Solution 2:
tree.refit(X, gradient, predict_disable_shape_check=True)
runs but doesn't seem to update the tree all that much
Solution 3:
tree = lgb.train(self.params,
                 train_data,
                 valid_sets=valid_data,
                 num_boost_round=10,
                 keep_training_booster=True,
                 init_model=tree)
This doesn't run at all.
The LightGBM package for Python has different APIs. If you are using the Training API, then you should use the train function, which the documentation describes as:
Perform the training with given parameters
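For reference, a minimal sketch of the Training API (the params dict and the X, y names below are illustrative assumptions, not taken from the question):
import lightgbm as lgb

params = {"objective": "regression", "learning_rate": 0.1}  # assumed example parameters
train_data = lgb.Dataset(X, label=y)  # wrap the training data
booster = lgb.train(params, train_data, num_boost_round=10)
preds = booster.predict(X)  # a Booster has no .fit(); training happens inside lgb.train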
However, if you want to stick to scikit-learn conventions, then you should simply use the scikit-learn API with an LGBMClassifier, which offers a fit method:
import lightgbm as lgb
clf = lgb.LGBMClassifier()
clf.fit(X, y)
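After fitting, the wrapper also exposes the underlying Booster through its booster_ attribute, in case you need the lower-level object later.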

How to predict new data with trained sklearn model - Random Forest Regressor?

I am making an sklearn model (Random Forest Regressor) and have been successful in training it with my data; however, I am unsure of how to make predictions with it.
My CSV contains 2 items per row: a year (expressed in years since 2003) and a number (what's being predicted), usually above 1,000. When I use model.predict([[20]]), I get a decimal for a number that is supposed to be in the thousands, despite a very high R-squared value:
R-squared: 0.9804779528842772 Prediction: [0.67932727]
I have a feeling I'm not using this method correctly, but I couldn't really find anything online. A user on another question of mine said that the last item in a CSV row is supposed to be the output, so I assumed that is how it works. Please forgive me if something is unclear; just comment and I will try my best to clarify, I am a noob at this.
Code:
import pandas
import scipy
import numpy
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn import set_config
from pandas import read_csv
names = ['YEAR', 'TOTAL']
url = 'energy/energyTotal.csv'
dataset = read_csv(url, names=names)
array = dataset.values
x = array[:, 0:1]
y = array[:, 1]
y=y.astype('int')
# rfr = RandomForestRegressor(max_depth=3)
# rfr.fit(x, y)
# print(rfr.predict([[0, 1, 0, 1]]))
x = scale(x)
y = scale(y)
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.10)
#Train model
set_config(print_changed_only=False)
rfr = RandomForestRegressor()
print(rfr)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)
rfr.fit(xtrain, ytrain)
score = rfr.score(xtrain, ytrain)
print("R-squared:", score)
print(rfr.predict([[20]]))
The CSV:
18,28564
17,28411
16,27515
15,24586
14,26653
13,26836
12,26073
11,27055
10,26236
9,26020
8,26538
7,25800
6,26682
5,24997
4,25100
3,24651
2,12053
1,11500
Your data has been scaled, so your predictions are not in the original range of the TOTAL variable. You could also train your model without scaling the data; the results are still quite good.
I would recommend scaling only the training set, to avoid leaking information about the whole dataset into the test set. You also need to keep the fitted scaler so you can map predictions back into the original range, as sketched below.
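A minimal sketch with StandardScaler (fitted on the training split only; x and y are the arrays from the question's code, and the model settings are assumptions):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.10)

# Fit the scalers on the training split only
x_scaler = StandardScaler().fit(xtrain)
y_scaler = StandardScaler().fit(ytrain.reshape(-1, 1))

rfr = RandomForestRegressor()
rfr.fit(x_scaler.transform(xtrain),
        y_scaler.transform(ytrain.reshape(-1, 1)).ravel())

# Scale the new input with the same x_scaler, then undo the y scaling
pred_scaled = rfr.predict(x_scaler.transform([[20]]))
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred)  # back in the original TOTAL units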

Display all parameters of an estimator including defaults

I am using Watson Studio with a notebook. In the notebook, I write the code:
from sklearn.tree import DecisionTreeClassifier
Tree_Loan = DecisionTreeClassifier(criterion="entropy", max_depth=4)
Tree_Loan
and it displays
DecisionTreeClassifier(criterion='entropy', max_depth=4)
However, it should display something in the form of (this is from a different lab I've done using Skills Network Labs):
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
As best I can tell, it is not importing the decision tree classifier. I have the same problem with svm from sklearn. Other functions in scikit-learn, like train_test_split and k-nearest neighbors, work fine. A classmate says the rest of my code is correct and there is no reason for the error. What might be causing it?
It is importing the DecisionTreeClassifier; no problem there. But by default, sklearn prints only the parameters that were set to non-default values.
If you want to see the "full" output, you can set the print_changed_only configuration option to False via set_config, like so:
>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
>>> # only displays the changed parameters
>>> tree
DecisionTreeClassifier(criterion='entropy', max_depth=4)
>>> from sklearn import get_config, set_config  # public top-level aliases of sklearn._config
>>> # default setting
>>> get_config()["print_changed_only"]
True
>>> # now changing it
>>> set_config(print_changed_only=False)
>>> # now we get the default values, too
>>> tree
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
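As an aside, the exact parameter list you see depends on your scikit-learn version: min_impurity_split and presort, for example, were deprecated and later removed, so recent releases print a shorter set of defaults even with print_changed_only=False.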

TypeError: Expected sequence or array-like, got estimator RandomForestRegressor

I am building a Random Forest regression model using scikit-learn. When I try to calculate the RMSE for this model, I get an error. Can anybody tell me how to solve it?
The code snippet is shown below:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, forest_reg)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
The error is shown below:
Expected sequence or array-like, got estimator RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
    max_features='auto', max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
    oob_score=False, random_state=None, verbose=0, warm_start=False)
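The estimator itself is being passed to mean_squared_error instead of its predictions. Comparing the labels against forest_predictions, which is already computed above, resolves the error:
forest_mse = mean_squared_error(housing_labels, forest_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse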

Print decision tree and feature_importance when using BaggingClassifier

Obtaining the decision tree and the important features is easy when using DecisionTreeClassifier in scikit-learn. However, I am not able to obtain either of them when I use a bagging ensemble such as BaggingClassifier.
Since we need to fit the model through the BaggingClassifier, I cannot get at the results related to the underlying DecisionTreeClassifier (printing the trees (graphs), feature_importances_, ...).
Here is my script:
seed = 7
n_iterations = 199
DTC = DecisionTreeClassifier(random_state=seed,
                             max_depth=None,
                             min_impurity_split=0.2,
                             min_samples_leaf=6,
                             max_features=None,  # if None, then max_features=n_features
                             max_leaf_nodes=20,
                             criterion='gini',
                             splitter='best')
#parametersDTC = {'max_depth': range(3, 10), 'max_leaf_nodes': range(10, 30)}
parameters = {'max_features': range(1, 200)}
dt = RandomizedSearchCV(BaggingClassifier(base_estimator=DTC,
                                          #max_samples=1,
                                          n_estimators=100,
                                          #max_features=1,
                                          bootstrap=False,
                                          bootstrap_features=True,
                                          random_state=seed),
                        parameters, n_iter=n_iterations, n_jobs=14, cv=kfold,
                        error_score='raise', random_state=seed, refit=True)  #min_samples_leaf=10
# Fit the model
fit_dt = dt.fit(X_train, Y_train)
print(dir(fit_dt))
tree_model = dt.best_estimator_
# Print the important features (NOT WORKING)
features = tree_model.feature_importances_
print(features)
rank = np.argsort(features)[::-1]
print(rank[:12])
print(sorted(list(zip(features))))
# Exporting the image (NOT WORKING)
from sklearn.externals.six import StringIO
tree.export_graphviz(dt.best_estimator_, out_file='tree.dot')  # write the graph to a .dot file
dot_data = StringIO()  # in-memory buffer to hold the dot source
tree.export_graphviz(dt.best_estimator_, out_file=dot_data, filled=True,
                     class_names=target_names, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
img = Image(graph.create_png())
print(dir(img))  # dir shows what is available on the image object
with open("my_tree.png", "wb") as png:
    png.write(img.data)
I obtain errors like 'BaggingClassifier' object has no attribute 'tree_' and 'BaggingClassifier' object has no attribute 'feature_importances'. Does anyone know how I can obtain them? Thanks.
Based on the documentation, the BaggingClassifier object indeed doesn't have a feature_importances_ attribute. You can still compute it yourself, as described in the answer to this question: Feature importances - Bagging, scikit-learn (a short sketch is given at the end of this answer).
You can access the trees that were produced during the fitting of BaggingClassifier using the attribute estimators_, as in the following example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import BaggingClassifier
iris = datasets.load_iris()
clf = BaggingClassifier(n_estimators=3)
clf.fit(iris.data, iris.target)
clf.estimators_
clf.estimators_ is a list of the 3 fitted decision trees:
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_split=1e-07, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        presort=False, random_state=1422640898, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_split=1e-07, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        presort=False, random_state=1968165419, splitter='best'),
 DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                        max_features=None, max_leaf_nodes=None,
                        min_impurity_split=1e-07, min_samples_leaf=1,
                        min_samples_split=2, min_weight_fraction_leaf=0.0,
                        presort=False, random_state=2103976874, splitter='best')]
So you can iterate over the list and access each one of the trees.
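For example, one way to compute an aggregate importance yourself (a sketch, assuming base estimators that expose feature_importances_, as decision trees do) is to average the importances over the fitted trees:
import numpy as np

# Mean importance of each feature across the bagged trees.
# With the default settings every tree sees all features; if you set
# max_features or bootstrap_features, first map each tree's importances
# back to the original columns via clf.estimators_features_.
importances = np.mean(
    [tree.feature_importances_ for tree in clf.estimators_], axis=0
)
print(importances)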
