I am training a CNN model using tf.keras passing training and validation generators as follows:
model.fit(
    x=training_data_generator,
    validation_data=validation_data_generator,
    epochs=n_epochs,
    use_multiprocessing=False,
    max_queue_size=100,
    workers=50
)
The generators are based on tf.keras.utils.Sequence.
The problem is, my data set is huge. Training one epoch takes about a day (despite training on two Titan RTX GPUs) and validation after each epoch takes a few hours.
During training I can see the progress displayed, but during validation all I see is the last snapshot of the training progress bar:
130339/130340 [==============================] - 147432s 1s/step
until the validation finishes, and only then do I see my validation accuracy, loss, etc.
Is there a way to display a progress bar for validation?
I'm thinking of doing something like this:
for epoch in range(n_epochs):
    model.fit(
        x=training_data_generator,
        epochs=1,
        use_multiprocessing=False,
        max_queue_size=100,
        workers=50
    )
    validation_results = model.evaluate(
        x=validation_data_generator,
        use_multiprocessing=False,
        max_queue_size=100,
        workers=50
    )
    print(validation_results)
Another option I was considering is to create a custom callback that validates the model on_epoch_end, but this seems very non-standard.
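Roughly, the callback idea would look something like this sketch (untested; the class name is mine, and I'd drop validation_data from fit() to avoid validating twice):
import tensorflow as tf

class ValidationProgress(tf.keras.callbacks.Callback):
    """Hypothetical callback: run evaluate() with its own progress bar each epoch."""

    def __init__(self, val_generator):
        super().__init__()
        self.val_generator = val_generator

    def on_epoch_end(self, epoch, logs=None):
        # verbose=1 gives evaluate() its own progress bar
        results = self.model.evaluate(self.val_generator, verbose=1)
        print(f"Epoch {epoch + 1} validation results: {results}")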
Is there a better approach to this?
You can set steps_per_epoch on the fit method.
Based on the documentation:
Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted. This argument is not supported with array inputs.
With this you can limit the number of steps per epoch, so setting a lower value means you get the validation loss and accuracy sooner, after each shortened epoch.
Note that setting steps_per_epoch to a smaller value means you need to increase the number of epochs accordingly.
With steps_per_epoch=1000, for example, training and validation loss and accuracy are shown after every 1000 steps rather than only after the entire dataset has been exhausted.
history = model.fit(x_train, y_train,
                    batch_size=2,
                    epochs=30,
                    steps_per_epoch=1000,
                    # We pass some validation data for
                    # monitoring validation loss and metrics
                    # at the end of each epoch
                    validation_data=(x_val, y_val))
I found code for training a model by creating batches at a fixed rate from different types of data.
The code is as follows:
def train_model(self):
    print('train the model')
    i = 0
    while (i < self.iterations) and (self.file_date == os.path.getmtime(sys.argv[0])):
        # x, y, d, e, s = self.fgen.load_train(self.nbatch, scenario='doubletalk')
        x, y, d, e, s = self.fgen.load_train(self.nbatch)
        self.model.fit([x, y, d, e, s], None, batch_size=self.nbatch,
                       epochs=1, verbose=0, callbacks=[self.logger])
        i += 1
This model uses the Adam optimizer.
However, my question is as follows: will the optimizer state be updated in this setup, where fit() with epochs=1 is called repeatedly inside the while loop?
For example, my concern is whether the learning rate stays constant for every training iteration.
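As a sanity check (a minimal sketch, assuming TF 2.x and a freshly compiled toy model): if the optimizer state persists across fit() calls, Adam's internal step counter should keep increasing and its moment estimates should carry over; the base learning rate itself stays constant unless a schedule or callback changes it.
import tensorflow as tf

# Minimal sketch: the optimizer object belongs to the compiled model, so its
# slot variables and step counter should persist across repeated fit() calls
# as long as the model is not recompiled.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')

x = tf.random.normal((64, 4))
y = tf.random.normal((64, 1))

for i in range(3):
    model.fit(x, y, batch_size=16, epochs=1, verbose=0)
    # the step counter keeps growing -> Adam's state is carried over
    print(i, model.optimizer.iterations.numpy())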
To learn PyTorch, I started with the Quickstart Tutorial. In the train() method, I noticed that they don't print the training accuracy during the training session; only the training loss is printed.
Coming from Keras, this was very unusual for me, since the training accuracy is printed automatically when you call fit().
So, I decided to modify the tutorial code like the following to print the training accuracy:
def train(dataloader, model, optimizer, loss_fn):
    model.train()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    training_loss = 0.0
    correct = 0.0
    for batch, (imgs, labels) in enumerate(dataloader):
        imgs = imgs.to(device=device)
        labels = labels.to(device=device)
        predictions = model(imgs)
        loss = loss_fn(predictions, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # accumulate the training loss - each batch's loss is added to training_loss
        training_loss += loss.item()
        # count the number of correct predictions
        correct += (predictions.argmax(1) == labels).type(torch.float).sum().item()
    # end of for loop - all batches are processed
    # after all batches are processed, compute the average training loss
    training_loss = training_loss / num_batches
    # training accuracy: number of correct predictions / number of samples in the dataset
    correct = correct / size
    print(f"{datetime.datetime.now()} Training Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {training_loss:>8f} \n")
Is this OK? As a beginner to PyTorch, I wanted to make sure it is correct before I start training my neural networks.
It all looks correct. Doing things like this should not influence training: loss.backward() computes your gradients, and anything not connected to that cannot change them. By the way, just run the training; you can't break anything :) (Yet. Just wait until you start building self-driving cars.)
I thought that in Keras/TensorFlow fit() does not compute accuracy automatically either; you still have to specify this metric when compiling the model, e.g.:
model.compile(optimizer='sgd',
              loss='mse',
              metrics=[tf.keras.metrics.Accuracy()])
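A side note, hedged: tf.keras.metrics.Accuracy() compares predictions to labels element-wise, so with softmax class outputs the usual pattern is to pass the string 'accuracy' instead, which Keras resolves to the accuracy metric matching your loss. A minimal sketch:
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              # 'accuracy' is resolved to SparseCategoricalAccuracy for this loss
              metrics=['accuracy'])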
I am trying to optimize a given neural network (e.g., a multilayer perceptron with 2 hidden layers) by finding the number of epochs and the batch size that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
    for batch from 40 to 200 (in steps of 20):
        model.fit(X_train, Y_train, epochs=epoch, batch_size=batch)
        save batch, epoch, accuracy
Afterwards, I kept the smallest epoch count with the smallest corresponding batch size that gave the highest accuracy, e.g. best_params: epoch = 10, batch = 150 => Accuracy = 94%.
My problem is that when I re-run my model with the best_params, it doesn't give me the same results (loss, accuracy); sometimes the accuracy is even very low (e.g. 10%).
I tried to fix the seed, but that didn't improve things.
Regards
Djam75
df = pd.DataFrame(columns=['Nb_Batch', 'Nb_Epoch', 'Accuracy'])
i = 0
lst_loss = []
lst_accuracy = []
lst_epoch = list(np.arange(10, 200, 10))
lst_batch = list(np.arange(100, 400, 20))
for epoch in lst_epoch:
    print('---------------- Epoch ' + str(epoch) + ' ------------------')
    for batch in lst_batch:
        modelSimple.fit(X_train, Y_train, epochs=epoch, batch_size=batch, verbose=0)
        score = modelSimple.evaluate(X_test, Y_test)
        df.loc[i, "Nb_Batch"] = batch
        df.loc[i, "Nb_Epoch"] = epoch
        df.loc[i, "Accuracy"] = score[1] * 100
        i = i + 1
This might be happening due to random parameter initialization: if you build a model end to end without transferring pretrained weights, the architecture gets fresh random values for its parameters every time you train.
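If run-to-run comparability matters for your search, a minimal seeding sketch (assuming TF 2.x; note that some GPU ops remain non-deterministic regardless):
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy RNG
tf.random.set_seed(SEED)   # TensorFlow RNG (initializers, dropout, shuffling)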
In this case, a good practice is to use batch normalization layers after some layers, according to your architecture. See:
tensorflow-implementation
pytorch-implementation
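A minimal sketch of what that could look like in Keras (the layer sizes and data dimensions are placeholders, not from the question):
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_classes = 20, 3  # placeholder sizes

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.BatchNormalization(),  # normalizes activations between layers
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(n_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])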
Extra idea:
Do not use any for or while loops inside the model implementation; you can follow the templates in TensorFlow or PyTorch.
Or, if you build a complete model from scratch, vectorize operations by using a NumPy-like matrix-operation library, as in the sketch below.
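For instance, a toy comparison of a Python loop versus the vectorized equivalent:
import numpy as np

x = np.random.rand(10000, 128)
w = np.random.rand(128, 64)

# Loop version: one Python iteration per sample (slow)
out_loop = np.stack([x[i] @ w for i in range(x.shape[0])])

# Vectorized version: a single matrix multiplication (fast)
out_vec = x @ w

assert np.allclose(out_loop, out_vec)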
Thanks for the update.
I resolved my problem by saving the model and loading it afterwards.
Thanks for the idea (batch normalization) and the extra idea: not using any for loops ;-)
regards
I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well so we can see the problem.
I am working on a seq2seq keras/tensorflow 2.0 model. Every time the user inputs something, my model prints the response perfectly fine. However on the last line of each response I get this:
You: WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2 batches). You may need to use the repeat() function when building your dataset.
The "You:" is my last output, before the user is supposed to type something new in. The model works totally fine, but I guess no error is ever good, but I don't quite get this error. It says "interrupting training", however I am not training anything, this program loads an already trained model. I guess this is why the error is not stopping the program?
In case it helps, my model looks like this:
intent_model = keras.Sequential([
    keras.layers.Dense(8, input_shape=[len(train_x[0])]),        # input layer
    keras.layers.Dense(8),                                       # hidden layer
    keras.layers.Dense(len(train_y[0]), activation="softmax"),   # output layer
])
intent_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
intent_model.fit(train_x, train_y, epochs=epochs)
test_loss, test_acc = intent_model.evaluate(train_x, train_y)
print("Tested Acc:", test_acc)
intent_model.save("models/intent_model.h5")
To make sure that you have "at least steps_per_epoch * epochs batches", set the steps_per_epoch to
steps_per_epoch = len(X_train)//batch_size
validation_steps = len(X_test)//batch_size # if you have validation data
You can see the maximum number of batches that model.fit() can consume from the progress bar at the point where training is interrupted:
5230/10000 [==============>...............] - ETA: 2:05:22 - loss: 0.0570
Here, the maximum would be 5230 - 1
Importantly, keep in mind that by default, batch_size is 32 in model.fit().
If you're using a tf.data.Dataset, you can also add the repeat() method, but be careful: it will loop indefinitely unless you specify a finite number.
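For example, a small sketch with a finite repeat count (the toy data and names here are hypothetical):
import tensorflow as tf

# hypothetical toy data
features = tf.random.normal((100, 8))
labels = tf.random.uniform((100,), maxval=2, dtype=tf.int32)

num_epochs = 5
dataset = (tf.data.Dataset.from_tensor_slices((features, labels))
           .shuffle(100)
           .batch(32)
           .repeat(num_epochs))  # finite; call repeat() with no argument to loop forever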
I have also had a number of models crash with the same warning while trying to train them. The training dataset is created using tf.keras.preprocessing.image_dataset_from_directory() and split 80/20. I have created a variable to try not to run out of images. I am using ResNet50 with my own images:
TRAIN_STEPS_PER_EPOCH = np.ceil((image_count * 0.8 / BATCH_SIZE) - 1)
# to ensure that there are enough images for the training batch
VAL_STEPS_PER_EPOCH = np.ceil((image_count * 0.2 / BATCH_SIZE) - 1)
but it still happens. BATCH_SIZE is set to 32, so I am taking 80% of the number of images, dividing by 32, and then subtracting 1 to have a surplus... or so I thought.
history = model.fit(
    train_ds,
    steps_per_epoch=TRAIN_STEPS_PER_EPOCH,
    epochs=EPOCHS,
    verbose=1,
    validation_data=val_ds,
    validation_steps=VAL_STEPS_PER_EPOCH,
    callbacks=tensorboard_callback)
The warning, after 3 hours of processing and a single successful epoch, is:
Epoch 1/25
374/374 [==============================] - 8133s 22s/step - loss: 7.0126 - accuracy: 0.0028 - val_loss: 6.8585 - val_accuracy: 0.0000e+00
Epoch 2/25
1/374 [..............................] - ETA: 0s - loss: 6.0445 - accuracy: 0.0000e+00
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 9350.0 batches). You may need to use the repeat() function when building your dataset.
This might help:
print(train_ds)
# <BatchDataset shapes: ((None, 224, 224, 3), (None,)), types: (tf.float32, tf.int32)>
print(val_ds)
# <BatchDataset shapes: ((None, 224, 224, 3), (None,)), types: (tf.float32, tf.int32)>
print(TRAIN_STEPS_PER_EPOCH)
# 374.0
print(VAL_STEPS_PER_EPOCH)
# 93.0
The solution which worked for me was to set drop_remainder=True when batching the dataset. This automatically drops any leftover data that does not fill a complete batch.
For example:
dataset = tf.data.Dataset.from_tensor_slices((images, targets)) \
    .batch(12, drop_remainder=True)
I had the same problem, and decreasing validation_steps from 50 to 10 solved the issue.
If you create a dataset with image_dataset_from_directory, remove the steps_per_epoch and validation_steps parameters from model.fit.
The reason is that the number of steps is already determined when batch_size is passed into image_dataset_from_directory, and you can get the number of steps with len, as sketched below.
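For example (a sketch; the directory path is hypothetical and `model` is assumed to be a compiled Keras model):
import tensorflow as tf

# 'data/train' is a hypothetical directory; batch_size fixes the steps implicitly
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    'data/train', batch_size=32)

print(len(train_ds))  # number of batches per epoch
model.fit(train_ds, epochs=10)  # no steps_per_epoch / validation_steps needed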
I had the same problem in TF 2.1. It had something to do with the shape/type of the input, namely the query. In my case, I solved the problem as follows (my model takes 3 inputs):
x = [[test_X[0][0]], [test_X[1][0]], [test_X[2][0]]]
MODEL.predict(x)
Output:
WARNING:tensorflow:Your input ran out of data; interrupting training.
Make sure that your dataset or generator can generate at least
steps_per_epoch * epochs batches (in this case, 7 batches). You may
need to use the repeat() function when building your dataset.
array([[2.053718]], dtype=float32)
Solution:
x = [np.array([test_X[0][0]]), np.array([test_X[1][0]]), np.array([test_X[2][0]])]
MODEL.predict(x)
Output:
array([[2.053718]], dtype=float32)
I understand that it is completely fine. Firstly, it is a warning, not an error. Secondly, the situation is similar to this: one batch of data is trained per epoch, and you have set the epochs value too high, e.g. 500, while your data can supply only, say, 480 batches. The remaining epochs then have no data left to process, hence the warning. As a result, the model simply returns to its state after the last batch was trained.
I hope this helps. Do let me know if the concept is misunderstood. Thanks!
Try reducing the steps_per_epoch value below the value you have currently set. This helped me solve the problem
I am seeing this issue with TF-2.9 and a custom ImageDataGenerator.
The core issue appears to be that TF/Keras is not selecting the correct data adapter:
<python site-packages>/keras/engine/data_adapter.py
select_data_adapter() was selecting a GeneratorDataAdapter when it should have been a KerasSequenceAdapter.
I updated the following file to work around the issue:
<python site-packages>/keras_preprocessing/image/iterator.py
try:
    DataSequence = get_keras_submodule('utils').Sequence
except:
    try:
        # Work-around for TF-2.9
        from keras.utils.data_utils import Sequence
        DataSequence = Sequence
    except:
        DataSequence = object
A better solution is to use:
data_amount = 0.5  # fraction of the data to use when the dataset is large and you train for few epochs
steps_per_epoch = int(data_amount * (len(train_data) / EPOCHS))
# if you have validation data, then:
validation_steps = int(data_amount * (len(val_data) / EPOCHS))
Here 0.5 is a float value which determines how much of the data you want to fit.
This approach is better than the ones based on BATCH_SIZE, but it will always fit the whole dataset, so change the data_amount value to adjust how much of it is used.
I also got this while training a model in Google Colab. The reason was that there was not enough memory/RAM to store the data for each batch, so after I lowered the batch_size it all ran perfectly.
I'm trying to write some logic that selects the best epoch to run a neural network in Keras. My code saves the training loss and the test loss for a set number of epochs and then picks the best fitting epoch according to some logic. The code looks like this:
from pandas import DataFrame, concat

ini_epochs = 100
df_train_loss = DataFrame(data=history.history['loss'], columns=['Train_loss'])
df_test_loss = DataFrame(data=history.history['val_loss'], columns=['Test_loss'])
df_loss = concat([df_train_loss, df_test_loss], axis=1)

# start from the largest test loss so that any epoch can improve on it
Min_loss = max(df_loss['Test_loss'])
best_epoch = 0
for i in range(ini_epochs):
    Test_loss = df_loss['Test_loss'][i]
    Train_loss = df_loss['Train_loss'][i]
    if Test_loss > Train_loss and Test_loss < Min_loss:
        Min_loss = Test_loss
        best_epoch = i  # remember which epoch achieved this loss
The idea behind the logic is this: to get the best model, the selected epoch should be the one with the lowest test loss value, but that value must be above the training loss to avoid overfitting.
In general, this epoch selection method works OK. However, if the test loss value is below the train loss from the start, then this method picks an epoch of zero (see below).
Now I could add another if statement assessing whether the difference between the test and train losses is positive or negative, and then write logic for each case, but what happens if the difference starts positive and then ends up negative? I get confused and haven't been able to write effective code.
So, my questions are:
1) Can you show me what code you would write to account for the situation shown in the graph (and for the case where the test and train loss curves cross)? I'd say the strategy would be to take the epoch with the minimum difference between the two.
2) There is a good chance that I'm going about this the wrong way. I know Keras has a callbacks feature but I don't like the idea of using the save_best_only feature because it can save overfitted models. Any advice on a more efficient epoch selection method would be great.
Use EarlyStopping, which is available in Keras. Early stopping basically stops training once your validation loss starts to increase (or, in other words, once validation accuracy starts to decrease). Use ModelCheckpoint to save the model wherever you want.
from keras.callbacks import EarlyStopping, ModelCheckpoint

STAMP = 'simple_lstm_glove_vectors_%.2f_%.2f' % (rate_drop_lstm, rate_drop_dense)
early_stopping = EarlyStopping(monitor='val_loss', patience=5)
bst_model_path = STAMP + '.h5'
model_checkpoint = ModelCheckpoint(bst_model_path, save_best_only=True, save_weights_only=True)

hist = model.fit(data_train, labels_train,
                 validation_data=(data_val, labels_val),
                 epochs=50, batch_size=256, shuffle=True,
                 callbacks=[early_stopping, model_checkpoint])

model.load_weights(bst_model_path)
refer to this link for more info
Here is a simple example illustrating how to use early stopping in Keras.
First, the necessary imports:
from keras.callbacks import EarlyStopping, ModelCheckpoint
Set up early stopping:
# Set callback functions to early stop training and save the best model so far
callbacks = [EarlyStopping(monitor='val_loss', patience=2),
             ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
Train the neural network:
history = network.fit(train_features,          # Features
                      train_target,            # Target vector
                      epochs=20,               # Number of epochs
                      callbacks=callbacks,     # Early stopping and checkpointing
                      verbose=0,               # Suppress per-epoch output
                      batch_size=100,          # Number of observations per batch
                      validation_data=(test_features, test_target))  # Data for evaluation
See the full example here.
Please also check Stop Keras Training when the network has fully converged; see Daniel's answer there.