Keras prediction incorrect with scaler and feature selection - python

I build an application that trains a Keras binary classifier model (0 or 1) every x time (hourly,daily) given the new data. The data preparation, training and testing works well, or at least as expected. It tests different features and scales it with MinMaxScaler (some values are negative).
On live data predictions with one single data point, the values are unrealistic (around 0.9987 to 1 most of the time, which is inaccurate). Since the result should be how close to "1" the prediction is, getting such high numbers constantly raises alerts.
Code for live prediction is as follows
current_df is a pandas dataframe that contains the 1 row with the data pulled live and the column headers, we select the "features" (since why load the features from the db and we implement dynamic feature selection when training the model, which could mean on every model there are different features)
Get the features as a list:
# Convert literal str to list
features = ast.literal_eval(features)
Then select only the features that I need in the dataframe:
# Select the features
selected_df = current_df[features]
Get the values as a list:
# Get the values of the df
current_list = selected_df.values.tolist()[0]
Then I reshape it:
# Reshape to allow scaling and predicting
current_list = np.reshape(current_list, (-1, 1))
If I call "transform" instead of "fit_transform" in the line above, I get the following error: This MinMaxScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Reshape again:
# Reshape to be able to scale
current_list = np.reshape(current_list, (1, -1))
Loads the model using Keras (model_location is a Path) and predict:
# Loads the model from the local folder
reconstructed_model = keras.models.load_model(model_location)
prediction = reconstructed_model.predict(current_list)
prediction = prediction.flat[0]
Updated
The data gets scaled using fit_transform and transform (MinMaxScaler although it can be Standard Scaler):
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)
And this is run when training the model (the "model" config is not shown):
# Compile the model
model.compile(optimizer=optimizer,
loss=loss,
metrics=['binary_accuracy'])
# build the model
model.fit(X_train, y_train, epochs=epochs, verbose=0)
# Evaluate using Keras built-in function
scores = model.evaluate(X_test, y_test, verbose=0)
testing_accuracy = scores[1]
# create model with sklearn KerasClassifier for evaluation
eval_model = KerasClassifier(model=model, epochs=epochs, batch_size=10, verbose=0)
# Evaluate model using RepeatedStratifiedKFold
accuracy = ML.evaluate_model_KFold(eval_model, X_test, y_test)
# Predict testing data
pred_test= model.predict(X_test, verbose=0)
pred_test = pred_test.flatten()
# extract the predicted class labels
y_predicted_test = np.where(pred_test > 0.5, 1, 0)
Regarding feature selection, the features are not always the same --I use both SelectKBest (10 or 15 features) or RFECV. And select the trained model with highest accuracy, meaning the features can be different.
Is there anything I'm doing wrong here? I'm thinking maybe the scaling should be done before the feature selection or there's some issue with the scaling (since maybe some values might be 0 when training and 100 when using it and the features are not necessarily the same when scaling).

The issues seems to stem from a StandardScaler / MinMaxScaler. The following example shows how to apply the former. However, if there are separate scripts handling learning/prediction, then the scaler will also need to be serialized and loaded at prediction time.
Set up a classification problem:
X, y = make_classification(n_samples=10_000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
Fit a StandardScaler instance on the training set and use the same parameters to .transform the test set:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
# Train time: Serialize the scaler to a pickle file.
with open("scaler.pkl", "wb") as fh:
pickle.dump(scaler, fh)
# Test time: Load the scaler and apply to the test set.
with open("scaler.pkl", "rb") as fh:
new_scaler = pickle.load(fh)
X_test = new_scaler.transform(X_test)
Which means that the model should be fit on features with similar distributions:
model = keras.Sequential([
keras.Input(shape=X_train.shape[1]),
layers.Dense(100),
layers.Dropout(0.1),
layers.Dense(1, activation="relu")])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["binary_accuracy"])
model.fit(X_train, y_train, epochs=25)
y_pred = np.where(model.predict(X_test)[:, 0] > 0.5, 1, 0)
print(accuracy_score(y_test, y_pred))
# 0.8708

Alexander's answer is correct, I think there is just some confusion between testing and live prediction. What he said regarding the testing step is equally applicable to live prediction step. After you've called scaler.fit_transform on your training set, add the following code to save the scaler:
with open("scaler.pkl", "wb") as fh:
pickle.dump(scaler, fh)
Then, during live prediction step, you don't call fit_transform. Instead, you load the scaler saved during training and call transform:
with open("scaler.pkl", "rb") as fh:
new_scaler = pickle.load(fh)
# Load features, reshape them, etc
# Scaling step
current_list = new_scaler.transform(current_list)
# Features are scaled properly now, put the rest of your prediction code here
You always call fit_transform only once per model, during the training step on your training pool. After that (during testing or calculating predictions after model deployment) you never call it, only call transform. Treat scaler as part of the model. Naturally, you fit the model on the training set and then during testing and live prediction you use the same model, never refitting it. The same should be true for the scaler.
If you call scaler.fit_transform on live prediction features it creates a new scaler that has no prior knowledge of feature distribution on training set.

Related

Using Keras Multi Layer Perceptron with Cross Validation prediction [duplicate]

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.cross_validation import StratifiedKFold
def load_data():
# load your data using this function
def create model():
# create your model using this function
def train_and_evaluate__model(model, data[train], labels[train], data[test], labels[test)):
# fit and evaluate here.
if __name__ == "__main__":
X, Y = load_model()
kFold = StratifiedKFold(n_splits=10)
for train, test in kFold.split(X, Y):
model = None
model = create_model()
train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that the knowledge representation of the neural network is in the synaptic weights and during the network tracing process, the weights that are updated to thereby reduce the network error rate and improve its performance. (In my case, I'm using Supervised Learning)
For better training and assessment of neural network performance, a common method of being used is cross-validation that returns partitions of the data set for training and evaluation of the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
model = None
model = create_model()
train_evaluate(model, X[train], Y[train], X[test], Y[test])
We define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
model = None
model = create_model()
train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not so?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
If my goal is to fine-tune the network for the entire dataset
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV which is often only implied (and frequently missed by beginners) is that, after you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train again your model, this time with the entire available data.
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import *
from tensorflow.keras.layers import *
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
def buildmodel():
model= Sequential([
Dense(10, activation="relu"),
Dense(5, activation="relu"),
Dense(1)
])
model.compile(optimizer='adam', loss='mse', metrics=['mse'])
return(model)
estimator= KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold= RepeatedKFold(n_splits=5, n_repeats=100)
results= cross_val_score(estimator, x, y, cv=kfold, n_jobs=2) # 2 cpus
results.mean() # Mean MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking" or circular logic. Essentially - you want to make sure that none of data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on a data not used during training process in order to simulate a new data arrival.
So basically - the data you should finally test your model should mimic the first data portion you'll get from your client/application to apply your model on.
So that's why cross-validation is so powerful - it makes every data point in your whole dataset to be used as a simulation of new data.
And now - to answer your question - every cross-validation should follow the following pattern:
for train, test in kFold.split(X, Y
model = training_procedure(train, ...)
score = evaluation_procedure(model, test, ...)
because after all, you'll first train your model and then use it on a new data. In your second approach - you cannot treat it as a mimicry of a training process because e.g. in second fold your model would have information kept from the first fold - which is not equivalent to your training procedure.
Of course - you could apply a training procedure which uses 10 folds of consecutive training in order to finetune network. But this is not cross-validation then - you'll need to evaluate this procedure using some kind of schema above.
The commented out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and at the end provide either those lower level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
def train_evaluate(model, x_train, y_train, x_test, y_test):
model.fit(x_train, y_train)
return model.score(x_test, y_test)
X, Y = load_model()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
idx = 0
for train, test in kFold.split(X, Y):
model = create_model()
scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
idx += 1
print(scores)
print(scores.mean())
So yes you do want to create a new model for each fold as the purpose of this exercise is to determine how your model as it is designed performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.

Data Leakage with Huggingface Trainer and KFold?

I'm fine-tuning a Huggingface model for a downstream task, and I am using StratifiedKFold to evaluate performance on unseen data. The results I'm getting are very encouraging for my particular domain, and its making me think that I might be leaking data somehow. Suspiciously, f1 seems to be consistently increasing over each fold. I'm assuming that there is something hanging around between folds that is causing this increase in performance but I can't see what.
checkpoint = 'roberta-large-mnli'
# LM settiungs
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
# set training arguments
batch_size=32
training_args = TrainingArguments(num_train_epochs=5,
weight_decay=0.1,
learning_rate=1e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
output_dir="content/drive/My Drive/Projects/test-trainer")
metric4 = load_metric("f1")
# function to tokenize each dataset
def tokenize_function(example):
return tokenizer(example["message"], example["hypothesis"], truncation=True)
# 5-fold cross validation loop
for train_index, test_index in cv.split(data, data['label']):
# split into train and test regions based on index positions
train_set, test_set = data.iloc[list(train_index)],data.iloc[list(test_index)]
# split training set into train and validation sub-regions
train_set, val_set = train_test_split(train_set,
stratify=train_set['label'],
test_size=0.10, random_state=42)
# convert datasets to Dataset object and gather in dictionary
train_dataset = Dataset.from_pandas(train_set, preserve_index=False)
val_dataset = Dataset.from_pandas(val_set, preserve_index=False)
test_dataset = Dataset.from_pandas(test_set, preserve_index=False)
combined_dataset = DatasetDict({'train':train_dataset,
'test': test_dataset,
'val':val_dataset})
# tokenize
tokenized_datasets = combined_dataset.map(tokenize_function, batched=True)
# instantiate trainer
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["val"],
data_collator=data_collator,
tokenizer=tokenizer)
# train
trainer.train()
# get predictions
predictions = trainer.predict(tokenized_datasets["test"])
preds = np.argmax(predictions.predictions, axis=-1)
print("F1 score ", metric4.compute(predictions=preds,
references=predictions.label_ids,
average='macro',pos_label=2))
Based on the above, what I think I'm doing is (1) splitting the data into seperate folds, (2) splitting the training set into train and val regions, (3) tokenizing, (4) training on the train/val sets, (5) testing on the test set, (6) starting the next with a new trainer instance. I'd like to think that that is correct, but I cannot see in the above why the F1 at fold-level would consistently get better over time.

Not able to get correct results even though mean absolute error is low

import pandas as pd
df=pd.read_csv('final sheet for project.csv')
features=['moisture','volatile matter','fixed carbon','calorific value','carbon %','oxygen%']
train_data=df[features]
target_data=df.pop('Activation energy')
X_train, X_test, y_train, y_test = train_test_split(train_data,target_data, test_size=0.09375, random_state=1)
standard_X_train=pd.DataFrame(StandardScaler().fit_transform(X_train))
standard_X_test=pd.DataFrame(StandardScaler().fit_transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_train)
normalized_y_train = scaler.transform(y_train)
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(y_test)
normalized_y_test = scaler.transform(y_test)
model=keras.Sequential([layers.Dense(units=20,input_shape=[6,]),layers.Dense(units=1,activation='tanh')])
model.compile(
optimizer='adam',
loss='mae',
)
history = model.fit(standard_X_train,normalized_y_train, validation_data=(standard_X_test,normalized_y_test),epochs=200)
I wish to create a model to predict activation energy using some features . I am getting training loss: 0.0629 and val_loss: 0.4213.
But when I try to predict the activation energies of some other unseen data ,I get bizarre results. I am a beginner in ML.
Can someone please help what changes can be made in the code. ( I want to make a model with one hidden layer of 20 units that has activation function tanh.)
You should not use fit_transform for test data. You should use fit_transform for training data and apply just transform to test data, in order to use the same parameters for training data, on the test data.
So, the transformation part of your code should change like this:
scaler_x = StandardScaler()
standard_X_train = pd.DataFrame(scaler_x.fit_transform(X_train))
standard_X_test = pd.DataFrame(scaler_x.transform(X_test))
y_train=y_train.values
y_train = y_train.reshape((-1, 1))
y_test=y_test.values
y_test = y_test.reshape((-1, 1))
scaler_y = MinMaxScaler(feature_range=(0, 1))
normalized_y_train = scaler_y.fit_transform(y_train)
normalized_y_test = scaler_y.transform(y_test)
Furthermore, since you are scaling your data, you should do the same thing for any prediction. So, your prediction line should be something like:
preds = scaler_y.inverse_transform(
model.predict(scaler_x.transform(pred_input)) #if it is standard_X_test you don't need to transform again, since you already did it.
)
Additionally, since you are transforming your labels in range 0 and 1, you may need to change your last layer activation function to sigmoid instead of tanh, or even may better to use an activation function like relu in your first layer if you are still getting poor results after above modifications.
model=keras.Sequential([
layers.Dense(units=20,input_shape=[6,],activation='relu'),
layers.Dense(units=1,activation='sigmoid')
])

Using scaler in Sklearn PIpeline and Cross validation

I previously saw a post with code like this:
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=4)
scores = cross_val_score(pipeline, X, y, cv = cv)
My understanding is that: when we apply scaler, we should use 3 out of the 4 folds to calculate mean and standard deviation, then we apply the mean and standard deviation to all 4 folds.
In the above code, how can I know that Sklearn is following the same strategy? On the other hand, if sklearn is not following the same strategy, which means sklearn would calculate the mean/std from all 4 folds. Would that mean I should not use the above codes?
I do like the above codes because it saves tons of time.
In the example you gave, I would add an additional step using sklearn.model_selection.train_test_split:
folds = 4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1/folds), random_state=0, stratify=y)
scalar = StandardScaler()
clf = svm.LinearSVC()
pipeline = Pipeline([('transformer', scalar), ('estimator', clf)])
cv = KFold(n_splits=(folds - 1))
scores = cross_val_score(pipeline, X_train, y_train, cv = cv)
I think best practice is to only use the training data set (i.e., X_train, y_train) when tuning the hyperparameters of your model, and the test data set (i.e., X_test, y_test) should be used as a final check, to make sure your model isn't biased towards the validation folds. At that point you would apply the same scaler that you fit on your training data set to your testing data set.
Yes, this is done properly; this is one of the reasons for using pipelines: all the preprocessing is fitted only on training folds.
Some references.
Section 6.1.1 of the User Guide:
Safety
Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
The note at the end of section 3.1.1 of the User Guide:
Data transformation with held out data
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction:
...code sample...
A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation:
...
Finally, you can look into the source for cross_val_score. It calls cross_validate, which clones and fits the estimator (in this case, the entire pipeline) on each training split. GitHub link.

DictVectorizer learns more features for the training set

I have the following code which works as expected:
clf = Pipeline([
('vectorizer', DictVectorizer(sparse=False)),
('classifier', DecisionTreeClassifier(criterion='entropy'))
])
clf.fit(X[:size], y[:size])
score = clf.score(X_test, y_test)
I wanted to do the same logic without using Pipeline:
v = DictVectorizer(sparse=False)
Xdv = v.fit_transform(X[:size])
Xdv_test = v.fit_transform(X_test)
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(Xdv[:size], y[:size])
clf.score(Xdv_test, y_test)
But I receive the following error:
ValueError: Number of features of the model must match the input. Model n_features is 8251 and input n_features is 14303
It seems that DictVectorizer learns more features for the test set than for the training set. I want to know how does Pipeline handle this issue and how can I accomplish the same.
Dont call fit_transform again.
Do this:
Xdv_test = v.transform(X_test)
When you do fit() or fit_transform(), the dict vectorizer will forget the features learnt during previous call (on training data) and re-fits again, hence different number of features.
Pipeline will automatically handle the test data appropriately when you do clf.score(X_test, y_test) on the pipeline.

Categories