plot and save history of kfold training - python

i am trying to train model with kfold cross validation , now i want to keep history for plotting and saving the history. how can i do that?
it seems that some questions post answers of this question but i want to save and plot all history once, not with parted files
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True)
# K-fold Cross Validation model evaluation
fold_no = 1
a = []
for train, test in kfold.split(X, label):
print("---"*20)
history = siamese.fit(
[tf.gather(X[:,0], train),tf.gather(X[:,1], train)],
tf.gather(label, train),
validation_data=([tf.gather(X[:,0], test),tf.gather(X[:,1], test)], tf.gather(label, test)),
batch_size=batch_size,
epochs=epochs,
)
a.append(history)

by adding a dictionary to my code i can use all histories at once
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True)
# K-fold Cross Validation model evaluation
fold_no = 1
histories = {'accuracy':[], 'loss':[], 'val_accuracy':[], 'val_loss':[]}
for train, test in kfold.split(X, label):
print("---"*20)
history = siamese.fit(
[tf.gather(X[:,0], train),tf.gather(X[:,1], train)],
tf.gather(label, train),
validation_data=([tf.gather(X[:,0], test),tf.gather(X[:,1], test)], tf.gather(label, test)),
batch_size=batch_size,
epochs=epochs,
)
histories['accuracy'].append(history.history['accuracy'])
histories['loss'].append(history.history['loss'])
histories['val_accuracy'].append(history.history['val_accuracy'])
histories['val_loss'].append(history.history['val_loss'])
with open('./trainHistoryDict', 'wb') as file_pi:
pickle.dump(histories, file_pi)

Related

Data Leakage with Huggingface Trainer and KFold?

I'm fine-tuning a Huggingface model for a downstream task, and I am using StratifiedKFold to evaluate performance on unseen data. The results I'm getting are very encouraging for my particular domain, and its making me think that I might be leaking data somehow. Suspiciously, f1 seems to be consistently increasing over each fold. I'm assuming that there is something hanging around between folds that is causing this increase in performance but I can't see what.
checkpoint = 'roberta-large-mnli'
# LM settiungs
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
# set training arguments
batch_size=32
training_args = TrainingArguments(num_train_epochs=5,
weight_decay=0.1,
learning_rate=1e-5,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
output_dir="content/drive/My Drive/Projects/test-trainer")
metric4 = load_metric("f1")
# function to tokenize each dataset
def tokenize_function(example):
return tokenizer(example["message"], example["hypothesis"], truncation=True)
# 5-fold cross validation loop
for train_index, test_index in cv.split(data, data['label']):
# split into train and test regions based on index positions
train_set, test_set = data.iloc[list(train_index)],data.iloc[list(test_index)]
# split training set into train and validation sub-regions
train_set, val_set = train_test_split(train_set,
stratify=train_set['label'],
test_size=0.10, random_state=42)
# convert datasets to Dataset object and gather in dictionary
train_dataset = Dataset.from_pandas(train_set, preserve_index=False)
val_dataset = Dataset.from_pandas(val_set, preserve_index=False)
test_dataset = Dataset.from_pandas(test_set, preserve_index=False)
combined_dataset = DatasetDict({'train':train_dataset,
'test': test_dataset,
'val':val_dataset})
# tokenize
tokenized_datasets = combined_dataset.map(tokenize_function, batched=True)
# instantiate trainer
trainer = Trainer(
model,
training_args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["val"],
data_collator=data_collator,
tokenizer=tokenizer)
# train
trainer.train()
# get predictions
predictions = trainer.predict(tokenized_datasets["test"])
preds = np.argmax(predictions.predictions, axis=-1)
print("F1 score ", metric4.compute(predictions=preds,
references=predictions.label_ids,
average='macro',pos_label=2))
Based on the above, what I think I'm doing is (1) splitting the data into seperate folds, (2) splitting the training set into train and val regions, (3) tokenizing, (4) training on the train/val sets, (5) testing on the test set, (6) starting the next with a new trainer instance. I'd like to think that that is correct, but I cannot see in the above why the F1 at fold-level would consistently get better over time.

Why no ability to generalize, but loss seems good? (ANN, keras, high dimensional data)

I am using tensorflow and keras for a binary classification problem.
I have only a training set of 81 samples (Testsize 21), but ~1900 features. I know its too less samples and too many features, but its a biological problem (gene-expression data), so i have to deal with it.
My model looks like this: (using different neurons per layer, different number of hidden layers, regularization and dropout to deal with the high dimensional data)
model = Sequential()
model.add(Input((input_shape,)))
for i in range(num_hidden):
model.add(Dense(n_neurons, activation="relu",kernel_regularizer=keras.regularizers.l1_l2(l1_reg, l2_reg)))
model.add(Dropout(dropout_rate))
model.add(Dense(1, activation="sigmoid"))
ann_optimizer= keras.optimizers.Adam()
model.compile(loss="binary_crossentropy",
optimizer=ann_optimizer, metrics=['accuracy'])
I am using a 10 fold nested cross validation and grid search in the inner fold like this:
# fit and evaluate the model
# configure the inner cross-validation procedure (5 fold, 80 inner training dataset, 20 inner test dataset)
cv_inner = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
# define the mode
ann = KerasRegressor(build_fn=regressionModel_sequential, input_shape=X_train.shape[1],
batch_size=batch_size)
# use pipeline as prevent from Leaky Preprocessing (StandardScaler on 80% inner-training dataset))
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
# define the grid search of with inner cv to get good parameters
grid_search_result = GridSearchCV(
pipe, param_grid, n_jobs=-1, cv=cv_inner, refit=True, verbose=0)
#refit = True a final model with the entire inner-training dataset
# execute search
grid_search_result.fit(X_train, y_train, ann__verbose=0)
logger.info('>>>>> est=%.3f, params=%s' % (grid_search_result.best_score_, grid_search_result.best_params_))
# to get loss curve
ann_val = regressionModel_sequential(input_shape=X_train.shape[1],
n_neurons=grid_search_result.best_params_['ann__n_neurons'],
l1_reg=grid_search_result.best_params_['ann__l1_reg'],
l2_reg=grid_search_result.best_params_['ann__l2_reg'],
num_hidden=grid_search_result.best_params_['ann__num_hidden'],
dropout_rate=grid_search_result.best_params_['ann__dropout_rate'])
# Validation with outer 20 %
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
history = ann_val.fit(X_train, y_train, batch_size=batch_size, verbose=0,
validation_split=0.25, shuffle=True, epochs=grid_search_result.best_params_['ann__epochs'])
plot_history(history, directory, i)
# use best grid search reult for predicting on outer test dataset
y_predicted = ann_val.predict(X_test)
# print predicted
logger.info(y_predicted[:5])
logger.info(y_test[:5])
rmse = (np.sqrt(metrics.mean_squared_error(y_test, y_predicted)))
mae = (metrics.mean_squared_error(y_test, y_predicted))
r_squared = metrics.r2_score(y_test, y_predicted)
My loss seems good: loss
But accuracy is very bad. accuracy (example from one outer fold)
Does anyone have suggestions on what i could do to improve my results?
I also know that the biological question behind is very hard/maybe not possible to solve.

How to split dataset into K-fold without loading the whole dataset at once?

I can't load all of my dataset at once, so I used tf.keras.preprocessing.image_dataset_from_directory() in order to load batches of images during training. It works well if I want to split my dataset into 2 subsets (train and validation), however, I'd like to divide my dataset into K-folds in order to make cross validation. (5 folds would be nice)
How can I make K-folds without loading my whole dataset ?
Do I have to give up using tf.keras.preprocessing.image_dataset_from_directory() ?
Personally I recommend that you switch to tf.data.Dataset().
Not only is it more efficient but it gives you more flexibility in terms of what you can implement.
Say you have images(image_paths) and labels as an example.
In that way, you could create a pipeline like:
training_data = []
validation_data = []
kf = KFold(n_splits=5,shuffle=True,random_state=42)
for train_index, val_index in kf.split(images,labels):
X_train, X_val = images[train_index], images[val_index]
y_train, y_val = labels[train_index], labels[val_index]
training_data.append([X_train,y_train])
validation_data.append([X_val,y_val])
Then you could create something like:
for index, _ in enumerate(training_data):
x_train, y_train = training_data[index][0], training_data[index][1]
x_valid, y_valid = validation_data[index][0], validation_data[index][1]
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.map(mapping_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
train_dataset = train_dataset.batch(batch_size)
train_dataset = train_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
validation_dataset = tf.data.Dataset.from_tensor_slices((x_valid, y_valid))
validation_dataset = validation_dataset.map(mapping_function, num_parallel_calls=tf.data.experimental.AUTOTUNE)
validation_dataset = validation_dataset.batch(batch_size)
validation_dataset = validation_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
model.fit(train_dataset,
validation_data=validation_dataset,
epochs=epochs,
verbose=2)

Validation set with TensorFlow Dataset

From Train and evaluate with Keras:
The argument validation_split (generating a holdout set from the training data) is not supported when training from Dataset objects, since this features requires the ability to index the samples of the datasets, which is not possible in general with the Dataset API.
Is there a workaround? How can I still use a validation set with TF datasets?
No, you can't use use validation_split (as described clearly by documentation), but you can create validation_data instead and create Dataset "manually".
You can see an example in the same tensorflow tutorial:
# Prepare the training dataset
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)
# Prepare the validation dataset
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(64)
model.fit(train_dataset, epochs=3, validation_data=val_dataset)
You could create those two datasets from numpy arrays ((x_train, y_train) and (x_val, y_val)) using simple slicing as shown there:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
There are also other ways to create tf.data.Dataset objects, see tf.data.Dataset documentation and related tutorials/notebooks.

Issues with KeyError: 'val_acc'

This is rather a popular error, but I couldn't find a proper answer given my setup.
I found this tutorial code, but when running, I get this error:
val_acc = history.history['val_acc']
KeyError: 'val_acc'
The fit_generator() function unlike fit(), doesn't allow a validation split. So how to fix it?
Here is the code:
def plot_training(history):
print (history.history.keys())
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'r.')
plt.plot(epochs, val_acc, 'r')
plt.title('Training and validation accuracy')
# plt.figure()
# plt.plot(epochs, loss, 'r.')
# plt.plot(epochs, val_loss, 'r-')
# plt.title('Training and validation loss')
plt.show()
plt.savefig('acc_vs_epochs.png')
#....
finetune_model = build_finetune_model(base_model, dropout=dropout, fc_layers=FC_LAYERS, num_classes=len(class_list))
adam = Adam(lr=0.00001)
finetune_model.compile(adam, loss='categorical_crossentropy', metrics=['accuracy'])
filepath="./checkpoints/" + "ResNet50" + "_model_weights.h5"
checkpoint = ModelCheckpoint(filepath, monitor=["acc"], verbose=1, mode='max')
callbacks_list = [checkpoint]
history = finetune_model.fit_generator(train_generator, epochs=NUM_EPOCHS, workers=8,
steps_per_epoch=steps_per_epoch,
shuffle=True, callbacks=callbacks_list)
plot_training(history)
Hi writting my suggestions here because I'm not able to comment yet,
You are right the fuction fit_generator() dosen't have the validation split attribute.
Therefore you need to make your own validation dataset and feed it to the fit generator through validation_data=(val_X, val_y) as eg.:
history = finetune_model.fit_generator(train_generator, epochs=NUM_EPOCHS, workers=8, validation_data=(val_X, val_y),
steps_per_epoch=steps_per_epoch,
shuffle=True, callbacks=callbacks_list)
Hope this helps.
EDIT
To get a validation dataset from your data you can use the methode train_test_split() from sklearn. For example a split with 77% train and 33% validation data:
X_train, val_X, y_train, val_y= train_test_split(
X, y, test_size=0.33, random_state=42)
Look here for more information.
Alternatively you could write your own split methode :)
Edit 2
If you don't have the possibility to use train_test split and with the assumption you have a pandas dataframe called train_data with the features and labels together:
val_data=train_data.sample(frac=0.33,random_state=1)
This should creates a validation dataset with 33% of the data and a train dataset with 77% of the data.
Edit3
It turns out you are using ImageDataGenerator() to create your data. This is quite handy because you can set your validation percentage via validation_split= while you initialize the ImageDataGenerator() as seen in the documentation (here). This should look something like this:
train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
validation_split=0.33)
After this you need two "generated" datasets. One to train and one to make your validation. This should look as following:
train_generator = train_datagen.flow_from_directory(TRAIN_DIR,
target_size=(HEIGHT, WIDTH),
batch_size=BATCH_SIZE,subset="training")
validation_generator = train_datagen.flow_from_directory(TRAIN_DIR,
target_size=(HEIGHT, WIDTH),
batch_size=BATCH_SIZE,subset="validation")
Finally you can use both sets in your fit_generator as following:
history = finetune_model.fit_generator(train_generator,epochs=NUM_EPOCHS, workers=8,
validation_data=validation_generator, validation_steps = validation_generator.samples,steps_per_epoch=steps_per_epoch,
shuffle=True, callbacks=callbacks_list)
Let me know if this solves your problem :)

Categories