Feature importance or model summary in sktime - python

I'm going through the documentation of the sktime package. One thing I just cannot find is the feature importance (that we'd get with sklearn models) or model summary (like the one we can obtain from statsmodels). Is it something that is just not implemented yet?
It seems that this functionality is implemented for models like AutoETS or AutoARIMA.
from matplotlib import pyplot as plt
from sktime.datasets import load_airline
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.ets import AutoETS
from sktime.forecasting.model_selection import temporal_train_test_split

y = load_airline()
y_train, y_test = temporal_train_test_split(y)
fh = ForecastingHorizon(y_test.index, is_relative=False)

model = AutoETS(trend='add', seasonal='mul', sp=12)
model.fit(y_train, fh=fh)  # reuse the ForecastingHorizon defined above
model.summary()
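For reference, the same kind of summary appears to be exposed by AutoARIMA as well; a minimal hedged sketch (assuming pmdarima is installed as the AutoARIMA backend, parameters are illustrative only):
from sktime.forecasting.arima import AutoARIMA

# hedged sketch: AutoARIMA delegates to pmdarima and exposes a summary() of the fitted model
arima = AutoARIMA(sp=12, suppress_warnings=True)
arima.fit(y_train)
print(arima.summary())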
I wonder if these summaries are accessible from instances like ForecastingPipeline.

Ok, I was able to solve it myself. I'm really glad the functionality is there!
The source code for ForecastingPipeline indicates that an instance of this class has an attribute steps_, which holds the fitted instances of the estimators in the pipeline.
from sktime.forecasting.compose import ForecastingPipeline

model = ForecastingPipeline(steps=[
    ("forecaster", AutoETS(sp=1)),
])
model.fit(y_train)
model.steps_[-1][1].summary()  # model.steps[-1][1].summary() would throw an error
The output of model.steps_ is [('forecaster', AutoETS())] (as mentioned above, the AutoETS() instance is in this case already fitted).
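Since steps_ is a list of (name, fitted estimator) tuples, the fitted forecaster can also be looked up by name; a small sketch in plain Python (not a dedicated sktime API, just a dict lookup over steps_):
# steps_ is a list of (name, fitted estimator) tuples, so a dict lookup works
fitted_forecaster = dict(model.steps_)["forecaster"]
print(fitted_forecaster.summary())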

Related

Customizing spaCy's Text Categorizer

I am trying to train a spaCy model with a small dataset in spaCy 2.2. It is overfitting, so I want to customize the architecture of the TextCategorizer. I referred to this post on GitHub:
https://github.com/explosion/spaCy/issues/3320
However, I am unable to get it working. Here is what I have so far:
from spacy.pipeline import TextCategorizer
from thinc.api import layerize
from spacy.language import Language


class StupidTextCategorizer(TextCategorizer):
    name = 'stupid_textcat'

    @classmethod
    def Model(cls, nr_class, **cfg):
        return create_dummy_model(nr_class, cfg.get('preferred_class', 0))


def create_dummy_model(nr_class, preferred_class):
    """Create a Thinc model that always predicts the same class."""
    def dummy_model(docs, drop=0.):
        scores = model.ops.allocate((len(docs), nr_class))
        scores[:, preferred_class] = 1.0
        return scores
    model = layerize(dummy_model)
    return model
However, when I try to pass it to my training script, it throws this error, which I can't seem to understand:
"[E002] Can't find factory for 'stupid_textcat'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['stupid_textcat']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
PS: Still learning spaCy, but I can't find any helpful documentation or tutorial for the above.
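For reference, the error message itself suggests registering the component; a minimal hedged sketch for spaCy 2.x (the factory lambda and the blank-English pipeline below are assumptions based on that hint, not tested against this exact setup):
import spacy
from spacy.language import Language

# register the custom component so nlp.create_pipe('stupid_textcat') can find it,
# as the error message suggests (spaCy 2.x factory registration)
Language.factories['stupid_textcat'] = lambda nlp, **cfg: StupidTextCategorizer(nlp.vocab, **cfg)

nlp = spacy.blank('en')
textcat = nlp.create_pipe('stupid_textcat')
nlp.add_pipe(textcat)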

cast xgboost.Booster class to XGBRegressor or load XGBRegressor from xgboost.Booster

I get a model from SageMaker of type:
<class 'xgboost.core.Booster'>
I can score this locally, which is great, but some Google searches suggest that it may not be possible to do "standard" things like this (taken from here):
plt.barh(boston.feature_names, xgb.feature_importances_)
Is it possible to transform xgboost.core.Booster to XGBRegressor? Maybe one could use the save_raw method, looking at this? Thanks!
So far I tried:
xgb_reg = xgb.XGBRegressor()
xgb_reg._Boster = model
xgb_reg.feature_importances_
but this results in:
NotFittedError: need to call fit or load_model beforehand
Something along these lines appears to work fine:
import tarfile
import xgboost as xgb

local_model_path = "model.tar.gz"
with tarfile.open(local_model_path) as tar:
    tar.extractall()

model = xgb.XGBRegressor()
model.load_model(model_file_name)  # model_file_name: name of the model file extracted from the archive
model can then be used as usual - model.tar.gz is an artifact coming from SageMaker.
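Once load_model has run, the sklearn wrapper behaves like a fitted estimator, so the "standard" things from the question should work; a hedged sketch (feature names are assumed to be known separately, since the extracted Booster may not carry them):
import matplotlib.pyplot as plt

importances = model.feature_importances_
plt.barh(range(len(importances)), importances)  # substitute real feature names for the y positions if available
plt.show()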

Why isn't `model.fit` defined in scikit-learn?

I am following step 3 of this example:
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)
I don't understand why scikit-learn doesn't recognize model.fit.
Do I need to assign that variable first?
Is there a missing import?
I'm working in Jupyter, with scikit-learn 0.17.1.
You first need to instantiate whatever model you're using:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
fit(X, y) is a method that can be called on an estimator.
To use this method on model, you first have to create model and make sure it is an instance of an estimator class.
Documentation
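Putting it together, a minimal end-to-end sketch (load_iris is used here only as an illustrative dataset; the dataset in the linked example is not specified):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()
model = DecisionTreeClassifier()  # instantiate the estimator first
model.fit(dataset.data, dataset.target)

expected = dataset.target
predicted = model.predict(dataset.data)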

Access the underlying (tree_) object of a single tree in a Random-Forest model (Python, scikit-learn)

I need to convert a Random Forest model to a rule-based (if-then) model. I have now created my model and it is well tuned. The problem I face is that I cannot "access" the base_estimator or the underlying tree_ object, which would make it possible to create a function that extracts the rules from the trees in the forest. I would be very thankful if you could help me with this issue. To create the model I use:
estimator = RandomForestRegressor(oob_score=True, n_estimators=10, max_features='auto')
I tried to use the estimator.estimators_ attribute to access a single tree, and then, for example, estimator.estimators_[0].tree_ to get the decision tree (DecisionTreeRegressor object) used to build the forest. Unfortunately, this approach does not work.
If possible, I want something like:
estimator = RandomForestRegressor(oob_score=True, n_estimators=10, max_features='auto')
estimator.fit(training_data, training_target)

tree1 = estimator.estimators_[0]
leftChild = tree1.tree_.children_left
rightChild = tree1.tree_.children_right
To access the underlying structure of a DecisionTreeRegressor object in a Random Forest model, you need to follow the steps described below:
from sklearn.ensemble import RandomForestRegressor

estimator = RandomForestRegressor(oob_score=True, n_estimators=10, max_features='auto')
estimator.fit(training_data, training_target)

tree1 = estimator.estimators_[0]
leftChilds = tree1.tree_.children_left    # array of left children
rightChilds = tree1.tree_.children_right  # array of right children
i.e. essentially what is already described in the question.
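Since the goal is an if-then representation, here is a minimal hedged sketch of walking those children_left / children_right arrays to print rules (tree_to_rules and feature_names are hypothetical names introduced only for illustration):
from sklearn.tree import _tree

def tree_to_rules(decision_tree, feature_names):
    """Print one if-then rule per leaf of a fitted decision tree (illustrative sketch)."""
    tree_ = decision_tree.tree_

    def recurse(node, conditions):
        if tree_.children_left[node] == _tree.TREE_LEAF:
            # leaf node: print the accumulated conditions and the leaf prediction
            print("IF " + " AND ".join(conditions) + " THEN predict " + str(tree_.value[node].ravel()))
            return
        name = feature_names[tree_.feature[node]]
        threshold = tree_.threshold[node]
        recurse(tree_.children_left[node], conditions + ["{} <= {:.3f}".format(name, threshold)])
        recurse(tree_.children_right[node], conditions + ["{} > {:.3f}".format(name, threshold)])

    recurse(0, [])

# usage (feature_names assumed to be available from the training data):
# tree_to_rules(estimator.estimators_[0], feature_names)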

Python statsmodels OLS: how to save learned model to file

I am trying to learn an ordinary least squares model using Python's statsmodels library, as described here.
sm.OLS(...).fit() returns the learned model. Is there a way to save it to a file and reload it? My training data is huge and it takes around half a minute to learn the model, so I was wondering whether any save/load capability exists for the OLS model.
I tried the repr() method on the model object but it does not return any useful information.
The models and results instances all have a save and load method, so you don't need to use the pickle module directly.
Edit to add an example:
import statsmodels.api as sm
data = sm.datasets.longley.load_pandas()
data.exog['constant'] = 1
results = sm.OLS(data.endog, data.exog).fit()
results.save("longley_results.pickle")
# we should probably add a generic load to the main namespace
from statsmodels.regression.linear_model import OLSResults
new_results = OLSResults.load("longley_results.pickle")
# or more generally
from statsmodels.iolib.smpickle import load_pickle
new_results = load_pickle("longley_results.pickle")
Edit 2: We've now added a load method to the main statsmodels API in master, so you can just do
new_results = sm.load('longley_results.pickle')
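Since the question mentions that the training data is huge, it may also be worth noting that save() accepts a remove_data flag to drop the attached data arrays before pickling; a hedged addition reusing the file name from the example above:
results.save("longley_results.pickle", remove_data=True)  # pickle only the fitted results, not the data arrays
new_results = sm.load("longley_results.pickle")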
I've installed the statsmodels library and found that you can save the values using the pickle module in Python.
Models and results are pickleable via save/load, optionally saving the model data.
[source]
As an example:
Given that you have the results saved in the variable results:
To save the file:
import pickle

with open('learned_model.pkl', 'wb') as f:  # pickle requires binary mode
    pickle.dump(results, f)
To read the file:
import pickle

with open('learned_model.pkl', 'rb') as f:  # pickle requires binary mode
    model_results = pickle.load(f)
