StackingClassifier in sklearn can stack several models. When the .fit method is called, the underlying models are trained.
A typical use case for StackingClassifier:
model1 = LogisticRegression()
model2 = RandomForestClassifier()
combination = StackingClassifier(estimators=[("lr", model1), ("rf", model2)])
combination.fit(X_train, y_train)
However, what I need is the following:
model1 = LogisticRegression()
model1.fit(X_train_1, y_train_1)
model2 = RandomForestClassifier()
model2.fit(X_train_2, y_train_2)
combination = StackingClassifier(estimators=[("lr", model1), ("rf", model2)], refit=False)
combination.fit(X_train_3, y_train_3)
where the refit parameter does not exist; it is what I would need.
I have already trained model1 and model2 and do not want to re-fit them. I just need to fit the stacking model that combines the two. How do I elegantly combine them into one model that produces an end-to-end .predict?
Of course, I could generate predictions from the first and second models, build a data frame, and fit a third model on it. I would like to avoid that, because then I cannot ship the model as a single end-to-end artifact.
You're close: it's cv="prefit", not refit=False. From the API docs:
cv : int, cross-validation generator, iterable, or “prefit”, default=None
[...]
"prefit" to assume the estimators are prefit. In this case, the estimators will not be refitted.
I'm using the Keras functional API and attempting to stack and train two models with a non-linear step in between them.
Say I want to train a chain of two models, Model A and Model B, where the output of Model A is used as the input of Model B, as one model, Model C. My understanding of how to do this is:
input_A = Input(input_shape_A)
output_A = ModelA(input_A)
output_B = ModelB(output_A)
model_C = Model(input_A, output_B)
The problem is that, in my case, I want to slice up the output of Model A before it goes into Model B, call another function on the slices which returns only one of them, and then use that slice as the input to Model B. So, Model B is only being trained on a fixed-size subset of Model A's output, which could be at arbitrary indices.
Something more like:
input_A = Input(input_shape_A)
output_A = ModelA(input_A)
input_B = custom_function(output_A)
output_B = ModelB(input_B)
model_C = Model(input_A, output_B)
I have not found any code examples so far that resemble this, and I am still trying to figure out whether it can be done. Model B's loss has to be integrated into Model A's training, but I need a function in between them. I was considering keeping the models separate and writing a custom loss function for Model A instead, but custom loss functions in Keras seem to be very restrictive, and I haven't seen any examples of that approach so far, either.
Is this possible?
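One possible approach, not from the original thread but a sketch under the assumption that custom_function can be expressed with differentiable tensor operations: wrap the slicing in a Lambda layer, so gradients from Model B's loss flow back through it into Model A. The slice indices here are purely illustrative:

from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

input_A = Input(input_shape_A)
output_A = ModelA(input_A)
# Illustrative stand-in for custom_function: keep a fixed-size slice of
# Model A's output. Tensor slicing is differentiable, so training
# model_C end to end still updates Model A's weights.
input_B = Lambda(lambda t: t[:, 2:6])(output_A)
output_B = ModelB(input_B)
model_C = Model(input_A, output_B)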
After I instantiate a scikit-learn model (e.g. LinearRegression), what happens if I call its fit() method multiple times (with different X and y data)? Does it fit the model on the new data from scratch, as if I had just re-instantiated the model, or does it take into account data from previous calls to fit()?
Trying it with LinearRegression (and looking at its source code), it seems to me that every time I call fit(), it fits from scratch, ignoring the result of any previous call to the same method. I wonder if this is true in general, and whether I can rely on this behavior for all models/pipelines in scikit-learn.
If you execute model.fit(X_train, y_train) a second time, it will overwrite all previously fitted coefficients, weights, intercept (bias), etc.
If you want to fit on just a portion of your data set and later improve the model by fitting new data, you can use estimators that support incremental learning (those that implement the partial_fit() method), as in the sketch below.
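A minimal sketch using SGDClassifier, one of the estimators that implements partial_fit(); the batch arrays and all_classes are placeholders:

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
# all_classes must list every label that will ever appear and is
# required only on the first call to partial_fit.
clf.partial_fit(X_batch1, y_batch1, classes=all_classes)
clf.partial_fit(X_batch2, y_batch2)  # continues from the previous state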
You can use the terms fit() and train() interchangeably in machine learning. Depending on the classification model you have instantiated, e.g. clf = GaussianNB() or clf = SVC(), your model uses the specified machine learning technique.
As soon as you call clf.fit(features_train, label_train), your model starts training on the features and labels you have passed.
You can then use clf.predict(features_test) to predict.
If you call clf.fit(features_train2, label_train2) again, it will start training from scratch on the newly passed data and discard the previous results. The model resets the following internally:
Weights
Fitted Coefficients
Bias
And other training related stuff...
You can use the partial_fit() method as well if you want your previously learned parameters to stay and to continue training on the next batch of data.
Beware that the model is effectively passed by reference. Here, model1 will be overwritten:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df1 = pd.DataFrame(np.random.rand(100).reshape(10, 10))
df2 = df1.copy()
df2.iloc[0, 0] = df2.iloc[0, 0] - 2  # change one value
pca = PCA()
model1 = pca.fit(df1)
model2 = pca.fit(df2)
np.unique(model1.explained_variance_ == model2.explained_variance_)
Returns
array([ True])
To avoid this (fit returns the estimator itself, so model1 and model2 above are the very same PCA object), use
from copy import deepcopy
model1 = deepcopy(pca.fit(df1))
I created the following function in Python:
import os
import re
from shutil import move

import joblib
from sklearn.model_selection import cross_val_score

filenameL = []  # collects the saved model filenames

def cross_validate(algorithms, data, labels, cv=4, n_jobs=-1):
    print("Cross validation using: ")
    for alg, predictors in algorithms:
        print(alg)
        print()
        # Compute the accuracy score for all the cross validation folds.
        scores = cross_val_score(alg, data, labels, cv=cv, n_jobs=n_jobs)
        # Take the mean of the scores (because we have one for each fold)
        print(scores)
        print("Cross validation mean score = " + str(scores.mean()))
        name = re.split(r'\(', str(alg))
        filename = str('%0.5f' % scores.mean()) + "_" + name[0] + ".pkl"
        # We might use this another time
        joblib.dump(alg, filename, compress=1)
        filenameL.append(filename)
        try:
            move(filename, "pkl")
        except OSError:
            os.remove(filename)
        print()
I thought that in order to do cross-validation, sklearn had to fit your model.
However, when I try to use it later (f is the pkl file I saved above with joblib.dump(alg, filename, compress=1)):
alg = joblib.load(f)
predictions = alg.predict_proba(train_data[predictors]).astype(float)
I get no error on the first line (so it looks like the load is working), but the second line raises NotFittedError: Estimator not fitted, call fit before exploiting the model.
What am I doing wrong? Can't I reuse the model fitted during cross-validation? I looked at Keep the fitted parameters when using a cross_val_score in scikits learn, but either I don't understand the answer or it is not what I am looking for. What I want is to save the whole model with joblib so that I can use it later without re-fitting.
It's not quite correct that cross-validation has to fit your model; rather, k-fold cross-validation fits your model k times on partial data sets. If you want the model itself, you need to fit it again on the whole dataset; this isn't part of the cross-validation process. So it would not actually be redundant to call
alg.fit(data, labels)
to fit your model after your cross validation.
Another approach: rather than using the specialized function cross_val_score, you could treat this as a special case of a cross-validated grid search, with a single point in the parameter space, as sketched below. In this case GridSearchCV will by default refit the model on the entire dataset (it has a parameter refit=True), and it also exposes predict and predict_proba methods in its API.
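A sketch of that idea, reusing the alg, data, and labels names from the question (and assuming alg supports predict_proba):

from sklearn.model_selection import GridSearchCV

# An empty param_grid yields a single candidate: alg with its current
# hyperparameters. refit=True (the default) refits best_estimator_ on
# the full dataset once cross-validation is done.
gs = GridSearchCV(alg, param_grid={}, cv=4, refit=True)
gs.fit(data, labels)
probabilities = gs.predict_proba(data)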
The real reason your model is not fitted is that the function cross_val_score first copies your model and fits the copies (source link), so your original model has never been fitted.
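A quick way to see this, with placeholder X and y:

from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf = LogisticRegression()
cross_val_score(clf, X, y, cv=4)  # fits clones of clf, never clf itself
try:
    clf.predict(X)
except NotFittedError:
    print("the original estimator is still unfitted")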
cross_val_score does not keep the fitted model, only the scores.
cross_val_predict keeps the out-of-fold predictions instead.
There is no cross_val_predict_proba, but you can get probabilities for a cross-validated model as sketched below (see: predict_proba for a cross-validated model).
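A sketch, keeping the clf, X, and y placeholders from above; cross_val_predict takes a method argument:

from sklearn.model_selection import cross_val_predict

# One row of out-of-fold class probabilities per sample.
probs = cross_val_predict(clf, X, y, cv=4, method='predict_proba')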
I am creating a GridSearchCV classifier as
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {}
gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
# Fit/train the gridSearchClassifier on Training Set
gridSearchClassifier.fit(Xtrain, ytrain)
This works well, and I can predict. However, now I want to retrain the classifier. For this I want to do a fit_transform() on some feedback data.
gridSearchClassifier.fit_transform(Xnew, yNew)
But I get this error
AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'
Basically I am trying to call fit_transform() on the classifier's internal TfidfVectorizer. I know that I can access a Pipeline's internal components using the named_steps attribute. Can I do something similar with the gridSearchClassifier?
Just call them step by step.
gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)
fit_transform is nothing more than these two calls; it is simply not implemented as a single method on GridSearchCV.
Update

From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-method for fitting a model over multiple hyperparameter candidates. Thus, once you call fit, you get an estimator in the best_estimator_ field of your object. In your case it is a pipeline, and you can extract any part of it as usual, thus
gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements etc.
# for example: print(clf.named_steps['vect'])
You should not use GridSearchCV as a classifier; it is only a method for fitting hyperparameters. Once you have found them, you should work with best_estimator_ instead. However, remember that if you refit the TF-IDF vectorizer, your classifier becomes useless: you cannot change the data representation and expect the old model to keep working well. You have to refit the whole classifier once your data changes (unless the change is carefully designed and you make sure the old dimensions mean exactly the same thing; sklearn does not support such operations, so you would have to implement this from scratch).
@lejot is correct that you should call fit() on the gridSearchClassifier.
Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.
You can access the already fitted steps:
tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']
You can then transform new text in new_X using:
X_vec = tfidf.transform(new_X)
You can make predictions using this X_vec with:
x_pred = clf.predict(X_vec)
You can also make predictions for text going through the entire pipeline with:
X_pred = gridSearchClassifier.predict(new_X)
I am attempting to perform a partial fit on a naive Bayes estimator while retaining a copy of the estimator from before the partial fit. sklearn.base.clone only clones an estimator's parameters, not its data, so it is not useful in this case. Performing a partial fit on the clone only uses the data added during the partial fit, since the clone is effectively empty.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
fit_model = model.fit(np.array(X), np.array(y))
fit_model2 = model.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
In the above example fit_model and fit_model2 will be the same since they both point to the same object. I would like to retain the original copy unaltered. My workaround is to pickle the original and load it into a new object to perform a partial fit on. Like this:
model = MultinomialNB()
fit_model = model.fit(np.array(X), np.array(y))

import pickle
with open('saved_model', 'wb') as f:
    pickle.dump([model], f)
with open('saved_model', 'rb') as f:
    [model2] = pickle.load(f)
fit_model2 = model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
Also I can completely refit with the new data each time, but since I need to perform this thousands of times I'm trying to find something more efficient.
model.fit() returns the model itself (the same object), so you don't have to assign the result to a different variable; it is just aliasing.
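A quick check of the aliasing, with the X and y placeholders from the question:

model = MultinomialNB()
fit_model = model.fit(np.array(X), np.array(y))
assert fit_model is model  # fit returns self, so both names point to one object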
You can use deepcopy to copy the object, similarly to what pickling and re-loading it does.
So if you do something like:
from copy import deepcopy
model = MultinomialNB()
model.fit(np.array(X), np.array(y))
model2 = deepcopy(model)
model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
# ...
model2 will be a distinct object, with the copied parameters of model, including the "trained" parameters.
from copy import deepcopy
model = MultinomialNB()
model.fit(np.array(X), np.array(y))
model2 = deepcopy(model)
# (in newer sklearn versions, use feature_log_prob_ instead of coef_)
weight_vector_model = np.array(model.coef_[0])
weight_vector_model2 = np.array(model2.coef_[0])
model2.partial_fit(np.array(Z), np.array(w), classes=np.unique(y))
weight_vector_model = np.array(model.coef_[0])
weight_vector_model2 = np.array(model2.coef_[0])
model and model2 are now completely different objects: calling partial_fit() on model2 has no impact on model. The two weight vectors are the same right after deepcopy but differ after partial_fit() on model2.
I tried deepcopy, but I got a memory leak when deleting the variables. I found in the documentation that it is recommended to use clone instead:
sklearn.base.clone
from sklearn.base import clone
model2 = clone(model)