I have been working on a machine learning model and I'm currently using a Pipeline with GridSearchCV. My data is scaled with MinMaxScaler and I'm using an SVR with an RBF kernel. My question is: now that my model is complete, fitted, and has a decent evaluation score, do I need to also scale new data for predictions with MinMaxScaler, or can I just make predictions with the data as is? I've read 3 books on scikit-learn, but they all focus on feature engineering and fitting. They don't really cover any additional steps in the prediction step other than using the predict method.
This is the code:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
Sure, if you get new (in the sense of unprocessed) data, you need to apply the same preparation steps you used when training the model. For example, if you use MinMaxScaler with default parameters, the model expects each feature to be scaled to the [0, 1] range learned from the training data; if you don't preprocess new data the same way, the model can't produce accurate results.
Keep in mind to use exactly the same MinMaxScaler object you used for the training data. So if you save your model to a file, save your preprocessing objects as well.
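A minimal sketch of that advice, assuming X_train, y_train, and new_data are already defined (here the scaler and SVR are trained outside a pipeline, so both objects are saved explicitly):
import pickle
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

scaler = MinMaxScaler().fit(X_train)   # learn the per-feature min/max on the training data only
model = SVR(kernel='rbf').fit(scaler.transform(X_train), y_train)

# Save both objects together so the prediction code reuses the identical scaling.
with open('model_and_scaler.pkl', 'wb') as fh:
    pickle.dump({'scaler': scaler, 'model': model}, fh)

# Later, at prediction time:
with open('model_and_scaler.pkl', 'rb') as fh:
    bundle = pickle.load(fh)
preds = bundle['model'].predict(bundle['scaler'].transform(new_data))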
I wanted to follow up on my question with a solution, thanks to pythonic833's answer. I think the proper procedure to scale new data for prediction, if you used a pipeline, is to repeat the whole scaling process from beginning to end on the original training data that was used in the pipeline. Even though the pipeline did the scaling for you during training, you still need a MinMaxScaler object fitted on that training data so that new data can be scaled correctly and predicted accurately. Below is my code, based on pythonic833's answer and some of the other comments, such as saving the model with pickle.
import pickle

from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

pipe = Pipeline([('scaler', MinMaxScaler()), ('clf', SVR())])
time_split = TimeSeriesSplit(n_splits=5)
param_grid = {'clf__kernel': ['rbf'],
              'clf__C': [0.0001, 0.001],
              'clf__gamma': [0.0001, 0.001]}
grid = GridSearchCV(pipe, param_grid, cv=time_split,
                    scoring='neg_mean_squared_error', n_jobs=-1)
grid.fit(X_train, y_train)
# Pickle the fitted model with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'wb') as file:
    pickle.dump(grid, file)
# Load the pickle with a context manager
with open('Pickles/{}.pkl'.format(file_name), 'rb') as file:
    model = pickle.load(file)
scaler = MinMaxScaler()
scaler.fit(X_train)  # Original training data used for the pipeline
X_train_scaled = scaler.transform(X_train)
new_data_scaled = scaler.transform(new_data)
model.predict(new_data_scaled)
Related
I built an application that trains a Keras binary classifier model (0 or 1) at a fixed interval (hourly or daily) on the new data. The data preparation, training, and testing work well, or at least as expected. It tries different features and scales them with MinMaxScaler (some values are negative).
On live data, predictions for a single data point are unrealistic (around 0.9987 to 1 most of the time, which is inaccurate). Since the result should indicate how close the prediction is to "1", getting such high numbers constantly raises alerts.
The code for live prediction is as follows.
current_df is a pandas DataFrame containing a single row of live data with the column headers. We select only the "features" from it (which is why we load the feature list from the db: we implement dynamic feature selection when training the model, so each model may use a different set of features).
Get the features as a list:
# Convert literal str to list
features = ast.literal_eval(features)
Then select only the features that I need in the dataframe:
# Select the features
selected_df = current_df[features]
Get the values as a list:
# Get the values of the df
current_list = selected_df.values.tolist()[0]
Then I reshape it:
# Reshape to allow scaling and predicting
current_list = np.reshape(current_list, (-1, 1))
# Scale the data (scaler is a new, unfitted MinMaxScaler at this point)
current_list = scaler.fit_transform(current_list)
If I call "transform" instead of "fit_transform" in the line above, I get the following error: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Reshape again:
# Reshape to be able to scale
current_list = np.reshape(current_list, (1, -1))
Load the model using Keras (model_location is a Path) and predict:
# Loads the model from the local folder
reconstructed_model = keras.models.load_model(model_location)
prediction = reconstructed_model.predict(current_list)
prediction = prediction.flat[0]
Updated
The data gets scaled using fit_transform and transform (MinMaxScaler, although it could be StandardScaler):
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
And this is run when training the model (the "model" config is not shown):
# Compile the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['binary_accuracy'])
# build the model
model.fit(X_train, y_train, epochs=epochs, verbose=0)
# Evaluate using Keras built-in function
scores = model.evaluate(X_test, y_test, verbose=0)
testing_accuracy = scores[1]
# create model with sklearn KerasClassifier for evaluation
eval_model = KerasClassifier(model=model, epochs=epochs, batch_size=10, verbose=0)
# Evaluate model using RepeatedStratifiedKFold
accuracy = ML.evaluate_model_KFold(eval_model, X_test, y_test)
# Predict testing data
pred_test = model.predict(X_test, verbose=0)
pred_test = pred_test.flatten()
# extract the predicted class labels
y_predicted_test = np.where(pred_test > 0.5, 1, 0)
Regarding feature selection, the features are not always the same: I use either SelectKBest (10 or 15 features) or RFECV, and I select the trained model with the highest accuracy, which means the features can differ from model to model.
Is there anything I'm doing wrong here? I'm thinking maybe the scaling should be done before the feature selection, or there's some issue with the scaling (since some values might be around 0 during training and around 100 in live use, and the features are not necessarily the same when scaling).
The issue seems to stem from the StandardScaler / MinMaxScaler. The following example shows how to apply the former. However, if separate scripts handle training and prediction, then the scaler also needs to be serialized and loaded at prediction time.
Set up a classification problem:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
Fit a StandardScaler instance on the training set and use the same parameters to .transform the test set:
import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

# Train time: serialize the scaler to a pickle file.
with open("scaler.pkl", "wb") as fh:
    pickle.dump(scaler, fh)

# Test time: load the scaler and apply it to the test set.
with open("scaler.pkl", "rb") as fh:
    new_scaler = pickle.load(fh)
X_test = new_scaler.transform(X_test)
This means that the model is fit and evaluated on features with similar distributions:
import numpy as np
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    layers.Dense(100),
    layers.Dropout(0.1),
    layers.Dense(1, activation="relu")])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
model.fit(X_train, y_train, epochs=25)
y_pred = np.where(model.predict(X_test)[:, 0] > 0.5, 1, 0)
print(accuracy_score(y_test, y_pred))
# 0.8708
Alexander's answer is correct; I think there is just some confusion between testing and live prediction. What he said about the testing step is equally applicable to the live prediction step. After you've called scaler.fit_transform on your training set, add the following code to save the scaler:
with open("scaler.pkl", "wb") as fh:
pickle.dump(scaler, fh)
Then, during the live prediction step, you don't call fit_transform. Instead, you load the scaler saved during training and call transform:
with open("scaler.pkl", "rb") as fh:
new_scaler = pickle.load(fh)
# Load features, reshape them, etc
# Scaling step
current_list = new_scaler.transform(current_list)
# Features are scaled properly now, put the rest of your prediction code here
You call fit_transform only once per model, during the training step on your training set. After that (during testing, or when calculating predictions after the model is deployed), you never call it again; you only call transform. Treat the scaler as part of the model: you fit the model on the training set, and then during testing and live prediction you use that same model without refitting it. The same should be true for the scaler.
If you call scaler.fit_transform on live prediction features, it fits a new scaler that has no knowledge of the feature distribution in the training set.
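A compact sketch of that rule, assuming model, X_train, y_train, X_test, and live_features (a single, already reshaped row) are defined:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # the only fit_transform call, on the training set
model.fit(X_train_scaled, y_train)

# Testing and live prediction reuse the already-fitted scaler:
X_test_scaled = scaler.transform(X_test)
live_scaled = scaler.transform(live_features)
prediction = model.predict(live_scaled)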
I previously saw a post with code like this:
from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score

scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv=cv)
My understanding is that when we apply the scaler, we should use 3 out of the 4 folds to calculate the mean and standard deviation, and then apply that mean and standard deviation to all 4 folds.
In the above code, how can I know whether sklearn is following the same strategy? If sklearn is not following it, meaning it calculates the mean/std from all 4 folds, would that mean I should not use the above code?
I do like the above code because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to use only the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model; the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that was fit on your training data set to your testing data set.
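A rough sketch of that final check, reusing the pipeline defined above (so the StandardScaler fitted inside it on X_train is the one applied to X_test):
# After tuning with cross_val_score on the training set only,
# fit the chosen pipeline once on all of X_train and check it on the held-out test set.
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))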
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.
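For intuition, here is a manual loop that mirrors what cross_val_score does internally: clone the whole pipeline, fit it on the training folds only (so the StandardScaler never sees the held-out fold), and score on the held-out fold. It reuses pipeline, cv, X, and y from the question and assumes X and y are NumPy arrays:
import numpy as np
from sklearn.base import clone

manual_scores = []
for train_idx, test_idx in cv.split(X):
    fold_pipe = clone(pipeline)                  # a fresh, unfitted copy of scaler + LinearSVC
    fold_pipe.fit(X[train_idx], y[train_idx])    # mean/std are computed from the 3 training folds only
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))
print(np.mean(manual_scores))                    # should roughly match cross_val_score(pipeline, X, y, cv=cv)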
I have my code here; it loops through each label (category) and creates a model for each one. However, what I want is a general model that will be able to accept new inputs from a user and predict their labels.
I'm aware that the code below only saves the model fitted for the last category in the loop. How can I fix this so that a model for each category is saved, so that when I load those models, I can predict a label for a new text?
vectorizer = TfidfVectorizer(strip_accents='unicode', stop_words=stop_words,
                             analyzer='word', ngram_range=(1, 3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['question_body'], axis=1)
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
    print('... Processing {}'.format(category))
    # train the SVC model using X_dtm & y
    SVC_pipeline.fit(x_train, train[category])
    # compute the testing accuracy of SVC
    svc_prediction = SVC_pipeline.predict(x_test)
    print("SVC Prediction:")
    print(svc_prediction)
    print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
    print("\n")

# save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental; it will not learn from both datasets if called twice. Each call to fit() forgets everything learned from previous calls. You should never fit (learn) anything on test data.
What you need to do is this:
vectorizer.fit(train_text)
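A tiny demonstration of that point, with made-up toy corpora:
from sklearn.feature_extraction.text import TfidfVectorizer

v = TfidfVectorizer()
v.fit(["first corpus alpha"])
v.fit(["second corpus beta"])
print(sorted(v.vocabulary_))   # only tokens from the second corpus remain: ['beta', 'corpus', 'second']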
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
Note that you are passing LinearSVC inside the OneVsRestClassifier, so it will be used automatically without the need for a Pipeline; the Pipeline does nothing here. A Pipeline is useful when you want to pass your data through multiple estimators sequentially, something like this:
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('pca', PCA()),
    ('logistic', LogisticRegression())
])
What the above pipe does is pass the data to PCA, which transforms it; the transformed data is then passed to LogisticRegression, and so on.
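Usage would then look like this (a sketch with hypothetical X_train, y_train, X_test):
pipe.fit(X_train, y_train)      # PCA is fitted on X_train, then LogisticRegression on the transformed data
y_pred = pipe.predict(X_test)   # X_test is transformed by the already-fitted PCA before prediction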
Correct usage of a Pipeline in your case would be:
SVC_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
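With the vectorizer inside the pipeline, raw text goes in directly, so fitting and predicting on new user input looks like this (a rough sketch, reusing train_text and train[category] from the question):
SVC_pipeline.fit(train_text, train[category])
new_prediction = SVC_pipeline.predict(["some new question text from a user"])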
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
You need to describe your "categories" in more detail and show some examples of your data. You are not using y_train and y_test anywhere. Are the categories different from "question_body"?
I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(train_df.shape[0], random_state=1)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
I don't think cross_val_score or cross_val_predict uses fit before predicting. It does it on the fly. If you look at the documentation (section 3.1.1.1), you'll see that they never mention fit anywhere.
As #DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a cross-validation method that lets you estimate the accuracy of your model. Take a look at sklearn's cross-validation page.
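Putting the two pieces together, a sketch of the usual workflow, reusing lr, kf, train_df, test_df, and features_columns from the question:
# Estimate generalization accuracy with cross-validation on the training set...
cv_pred = cross_val_predict(lr, train_df[features_columns], train_df["target"], cv=kf)
cv_accuracy = (cv_pred == train_df["target"].values).mean()

# ...then fit once on the full training set and predict the separate test set.
lr.fit(train_df[features_columns], train_df["target"])
test_pred = lr.predict(test_df[features_columns])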
I am not sure the question was answered; I had a similar thought. I want to compare the results (accuracy, for example) with a method that does not apply CV. The cross-validated accuracy is computed on X_train and y_train. The other method fits the model using X_train and y_train and tests it on X_test and y_test, so the comparison is not fair since they are evaluated on different datasets.
What you can do is use the fitted estimators returned by cross_validate:
cv_results = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
# cross_validate returns a dict; the fitted estimators (one per fold) are under the "estimator" key
y_pred = cv_results["estimator"][0].predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()
I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
print(CV_score)
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
print(AUC)
The matrices X_train,y_train,X_test and y_test are saved in a .mat file available at this link. My problem is the following :
Using this approach, I get a good performance on training data (CV_score = 0.8) but the performance on the test data is much worse : AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC<0.5 if a naive classifier should score AUC = 0.5 ? Is this due to the fact that I have a small number of samples for training ?
I have noticed that the performance on test data improves if I change the code for :
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
print(AUC)
Indeed, AUC = 0.87 for C=1 and 0.90 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross-validation allows predictions to be made on subsets of the test data which do not contain objects/samples that decrease the AUC?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier on part of your testing data and then predicting on the rest, so the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related to your question: since the number of training samples is much smaller than the feature dimension, it may be helpful to perform dimensionality reduction before feeding the data to the classifier.
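For example, dimensionality reduction can be added as another step in the question's pipeline (a hypothetical sketch; 20 components is an arbitrary illustrative choice, and with 60 training samples n_components must stay below 60):
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, reduce to 20 components, then classify.
clf = make_pipeline(StandardScaler(), PCA(n_components=20),
                    LogisticRegression(C=1, class_weight='balanced'))
clf.fit(X_train, y_train)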
It looks like your training and test processes are inconsistent. Although your code shows that you intend to standardize your data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf.fit) but only the logistic regression. This matters, because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but at test time, instead of using the pipeline to predict as intended, you use only LogReg, so the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5) you get predictions by doing cross-validation on the test data alone, ignoring the training data. Here you do standardize the data, since you use clf, and thus your score is high; this is evidence that the standardization step is important.
To summarize: standardizing the test data in the same way as the training data should improve your test score.
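A minimal sketch of that fix, reusing clf, metrics, and the train/test matrices from the question's code:
# Fit the whole pipeline, so the StandardScaler fitted on X_train is applied to X_test as well.
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)[:, 1]
print(metrics.roc_auc_score(y_test, probas))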
First, it makes little sense to have 258 features for only 60 training items. Second, cv=10 with 60 items means the data is split into 10 train/test folds, each with only 6 items in the test fold, so whatever results you obtain will be unreliable. You need more training data and fewer features.