How to clone a scikit-learn estimator including its data? - python

I am attempting to perform a partial fit on a naive Bayes estimator while also retaining a copy of the estimator as it was before the partial fit. sklearn.base.clone only clones an estimator's parameters, not its data, so it is not useful in this case: performing a partial fit on the clone uses only the data added during the partial fit, since the clone is effectively empty.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
fit_model = model.fit(np.array(X), np.array(y))
fit_model2 = model.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
In the above example fit_model and fit_model2 will be the same since they both point to the same object. I would like to keep the original copy unaltered. My workaround is to pickle the original model and load it into a new object, then perform the partial fit on that. Like this:
model = MultinomialNB()
fit_model = model.fit(np.array(X), np.array(y))
import pickle
with open('saved_model', 'wb') as f:
    pickle.dump([model], f)
with open('saved_model', 'rb') as f:
    [model2] = pickle.load(f)
fit_model2 = model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
I could also completely refit with the new data each time, but since I need to perform this thousands of times I'm looking for something more efficient.

model.fit() returns the model itself (the same object), so assigning the result to a different variable just creates an alias.
You can use deepcopy to copy the object, achieving the same effect as dumping and re-loading a pickle.
So if you do something like:
from copy import deepcopy

model = MultinomialNB()
model.fit(np.array(X), np.array(y))
model2 = deepcopy(model)
model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
# ...
model2 will be a distinct object, with the copied parameters of model, including the "trained" parameters.

from copy import deepcopy

model = MultinomialNB()
model.fit(np.array(X), np.array(y))
model2 = deepcopy(model)

# weight vectors are identical right after the deepcopy
weight_vector_model = np.array(model.coef_[0])
weight_vector_model2 = np.array(model2.coef_[0])

model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))

# after partial_fit on model2 the weight vectors differ
weight_vector_model = np.array(model.coef_[0])
weight_vector_model2 = np.array(model2.coef_[0])
model and model2 are now completely distinct objects: partial_fit() on model2 has no impact on model. The two weight vectors are identical right after the deepcopy but differ after partial_fit() is called on model2.

I tried deepcopy but I got a memory leak when deleting the variables. I found in the documentation that it is recommended to use clone instead:
sklearn.base.clone
from sklearn.base import clone
model2 = clone(model)
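Note, though, that clone copies only the constructor parameters, not the fitted state, so it avoids the deepcopy overhead only when you don't need the already-fitted data. A minimal sketch, reusing X and y from the question:
from sklearn.base import clone
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X, y)  # X, y as in the question
model2 = clone(model)              # copies hyperparameters only

hasattr(model, "class_count_")     # True: model is fitted
hasattr(model2, "class_count_")    # False: model2 must be fit from scratch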

Related

How to combine already trained classifiers with StackingClassifier?

StackingClassifier in sklearn can stack several models. When the .fit method is called, the underlying models are trained.
A typical use case for StackingClassifier:
model1 = LogisticRegression()
model2 = RandomForestClassifier()
combination = StackingClassifier(estimators=[('lr', model1), ('rf', model2)])
combination.fit(X_train, y_train)
However, what I need is the following:
model1 = LogisticRegression()
model1.fit(X_train_1, y_train_1)
model2 = RandomForestClassifier()
model2.fit(X_train_2, y_train_2)
combination = StackingClassifier(estimators=[('lr', model1), ('rf', model2)], refit=False)
combination.fit(X_train_3, y_train_3)
where the refit parameter does not exist; it is what I would need.
I have already trained models model1 and model2 and do not want to re-fit them. I just need to fit the stacking model that combines the two. How do I elegantly combine them into one model that produces an end-to-end .predict?
Of course, I could predict with the first and second models, build a data frame from their predictions, and fit the third model on that. I would like to avoid that because then I cannot ship the model as a single end-to-end artifact.
You're close: it's cv="prefit", not refit=False. From the API docs:
cv : int, cross-validation generator, iterable, or “prefit”, default=None
[...]
"prefit" to assume the estimators are prefit. In this case, the estimators will not be refitted.

cast xgboost.Booster class to XGBRegressor or load XGBRegressor from xgboost.Booster

I get a model from Sagemaker of type:
<class 'xgboost.core.Booster'>
I can score this locally, which is great, but some Google searches suggest it may not be possible to do "standard" things like this (taken from here):
plt.barh(boston.feature_names, xgb.feature_importances_)
Is it possible to transform xgboost.core.Booster to XGBRegressor? Maybe one could use the save_raw method, looking at this? Thanks!
So far I tried:
xgb_reg = xgb.XGBRegressor()
xgb_reg._Boster = model
xgb_reg.feature_importances_
but this results in:
NotFittedError: need to call fit or load_model beforehand
Something along these lines appears to work fine:
import tarfile
import xgboost as xgb

local_model_path = "model.tar.gz"
with tarfile.open(local_model_path) as tar:
    tar.extractall()

model = xgb.XGBRegressor()
model.load_model(model_file_name)  # model_file_name: the file extracted from the archive
model can then be used as usual - model.tar.gz is an artifact coming from SageMaker.
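Once the model is loaded into the sklearn wrapper, the "standard" attributes from the question become available. A small sketch, assuming matplotlib is installed (feature names are only present if they were saved with the model):
import matplotlib.pyplot as plt

names = model.get_booster().feature_names  # may be None if not stored in the file
plt.barh(names, model.feature_importances_)
plt.show()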

How to deep copy structure and data of a sklearn Pipeline structure into a new variable?

Suppose I have defined a sklearn Pipeline structure. I need to deep-copy its structure and data into another variable so that when refitting the original one, the new variable does not change. I tried to use clone from sklearn.base in a similar way to the following code:
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

temp_pipe = Pipeline([
    ('Scaler', StandardScaler()),
    ('LinearRegression', LinearRegression())])
for i in iterations:
    temp_pipe.fit(X, y)
    ....
    if check_condition:
        final = clone(temp_pipe)
but it seems to deep-copy only the structure, not the data, as stated here:
Clone does a deep copy of the model in an estimator without actually
copying attached data
I know I can do something like:
final = Pipeline([
    ('Scaler', StandardScaler()),
    ('LinearRegression', LinearRegression())])
for i in iterations:
    temp_pipe = clone(final)
    temp_pipe.fit(X, y)
    ....
    if check_condition:
        final = temp_pipe
but is there a way to deep-copy also the fitted data?
from copy import deepcopy
estimator_deep_copy = deepcopy(pipeline)
Note that the purpose of clone is to get an unfitted/clean estimator.
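A self-contained sketch with made-up random data, showing that deepcopy keeps the fitted state and decouples the copy from later refits of the original:
from copy import deepcopy

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)
y = np.random.rand(100)

pipeline = Pipeline([
    ('Scaler', StandardScaler()),
    ('LinearRegression', LinearRegression())])
pipeline.fit(X, y)

final = deepcopy(pipeline)  # structure and fitted data are copied
pipeline.fit(X + 1.0, y)    # refitting the original...
final.predict(X)            # ...does not change the copy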

What does calling fit() multiple times on the same model do?

After I instantiate a scikit model (e.g. LinearRegression), if I call its fit() method multiple times (with different X and y data), what happens? Does it fit the model on the data as if I had just re-instantiated the model (i.e. from scratch), or does it take into account data already fitted in previous calls to fit()?
Trying with LinearRegression (and looking at its source code), it seems to me that every time I call fit(), it fits from scratch, ignoring the result of any previous call to the same method. I wonder if this is true in general, and whether I can rely on this behavior for all models/pipelines of scikit-learn.
If you execute model.fit(X_train, y_train) a second time, it will overwrite all previously fitted coefficients, weights, intercept (bias), etc.
If you want to fit on just a portion of your data set and later improve your model by fitting on new data, you can use estimators that support "incremental learning" (those that implement the partial_fit() method).
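As an illustration, a minimal sketch with SGDClassifier, one of the estimators that implements partial_fit() (toy random data in two batches):
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.random((50, 4)), rng.integers(0, 2, 50)
X2, y2 = rng.random((50, 4)), rng.integers(0, 2, 50)

clf = SGDClassifier()
clf.partial_fit(X1, y1, classes=np.array([0, 1]))  # first call must declare all classes
clf.partial_fit(X2, y2)                            # updates, rather than replaces, the weights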
You can use the terms fit() and train() interchangeably in machine learning. Depending on the classification model you have instantiated, say clf = GaussianNB() or clf = SVC(), your model uses the corresponding machine learning technique.
As soon as you call clf.fit(features_train, label_train), your model starts training using the features and labels you have passed.
You can then use clf.predict(features_test) to predict.
If you call clf.fit(features_train2, label_train2) again, it will start training afresh on the newly passed data and will discard the previous results. Your model will reset the following:
Weights
Fitted Coefficients
Bias
And other training related stuff...
You can use the partial_fit() method instead if you want the previously learned state to be kept and additionally trained on the next data.
Beware that the model is passed kind of "by reference". Here, model1 will be overwritten:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df1 = pd.DataFrame(np.random.rand(100).reshape(10, 10))
df2 = df1.copy()
df2.iloc[0, 0] = df2.iloc[0, 0] - 2  # change one value
pca = PCA()
model1 = pca.fit(df1)
model2 = pca.fit(df2)
np.unique(model1.explained_variance_ == model2.explained_variance_)
Returns
array([ True])
To avoid this, use
from copy import deepcopy
model1 = deepcopy(pca.fit(df1))
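With the deepcopy, model1 keeps its own copy of the fitted attributes, so a later pca.fit(df2) no longer overwrites them and the elementwise comparison above no longer reports all True.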

Why isn't `model.fit` defined in scikit-learn?

I am following step 3 of this example:
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)
I don't understand why scikit doesn't recognize model.fit.
Do I need to assign that variable first?
Is there a missing import?
I'm working in Jupyter with scikit-learn 0.17.1.
You first need to instantiate whatever model you're using:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
fit(X, y) is a method that can be called on an estimator.
In order to use this method on model, you have to create model first and make sure it's of an estimator class.
Documentation
