I am trying to optimize a given neural network (e.g. a multilayer perceptron with 2 hidden layers) by finding the number of epochs and the batch size that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
    for batch from 40 to 200 (in steps of 20):
        model.fit(X_train, Y_train, epochs=epoch, batch_size=batch)
        save (batch, epoch, accuracy)
Afterwards I keep the smallest epoch count, with the smallest corresponding batch size, that achieves the highest accuracy.
e.g. best_params: epoch = 10, batch = 150 => accuracy = 94%
My problem is that when I re-run my model with best_params, it doesn't give me the same results (loss, accuracy); sometimes the accuracy is even very low (e.g. 10%).
I tried fixing the random seed, but that didn't help.
Regards
Djam75
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['Nb_Batch', 'Nb_Epoch', 'Accuracy'])
i = 0
lst_epoch = list(np.arange(10, 200, 10))
lst_batch = list(np.arange(100, 400, 20))

for epoch in lst_epoch:
    print('---------------- Epoch ' + str(epoch) + ' ------------------')
    for batch in lst_batch:
        # note: the model is not re-created here, so each fit() continues
        # from the weights left by the previous (epoch, batch) combination
        modelSimple.fit(X_train, Y_train, epochs=epoch, batch_size=batch, verbose=0)  # 'nb_epoch' is the deprecated Keras 1.x argument name
        score = modelSimple.evaluate(X_test, Y_test)
        df.loc[i, 'Nb_Batch'] = batch
        df.loc[i, 'Nb_Epoch'] = epoch
        df.loc[i, 'Accuracy'] = score[1] * 100
        i += 1
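For reference, here is a minimal sketch of one way to make each grid point start from fresh, reproducible weights. It assumes a hypothetical build_model() factory that returns a freshly compiled Keras model; note that full determinism on GPU can require additional settings:

import os, random
import numpy as np
import tensorflow as tf

def set_seed(seed=42):
    # Seed every RNG that Keras training touches (TF 2.x; TF 1.x uses tf.set_random_seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

for epoch in lst_epoch:
    for batch in lst_batch:
        set_seed(42)
        model = build_model()  # hypothetical factory: returns a new, freshly compiled model
        model.fit(X_train, Y_train, epochs=epoch, batch_size=batch, verbose=0)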
This might be happening due to random parameter initialization. If you are building an end-to-end model without transfer-learning the weights, the architecture gets random values for its parameters every time you train it.
In this case, a good practice is to use batch normalization layers after some layers, according to your architecture.
TensorFlow implementation
PyTorch implementation
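For example, a minimal Keras sketch of that suggestion (the layer sizes here are placeholders, not the asker's architecture):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),
    layers.BatchNormalization(),  # normalizes activations, stabilizing training
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])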
Extra idea:
Do not use any 'for' or 'while' loops in the model implementation.
You can follow the templates in TensorFlow or PyTorch.
Or, if you build a complete model from scratch, vectorize operations by using a matrix-operation library like NumPy (see the toy comparison below).
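As an illustration of the vectorization point, a toy comparison (the operation itself is arbitrary):

import numpy as np

x = np.random.rand(100_000)
w, b = 2.0, 0.5

# Loop version: one Python-level iteration per element (slow)
out = np.empty_like(x)
for i in range(len(x)):
    out[i] = w * x[i] + b

# Vectorized version: a single NumPy expression over the whole array (fast)
out_vec = w * x + b
assert np.allclose(out, out_vec)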
Thanks for the update.
I resolved my problem by saving the model and loading it afterwards (sketch below).
Thanks for the idea (batch normalization) and the extra idea: not using any 'for' loops ;-)
regards
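For completeness, a minimal sketch of that save-and-reload approach, assuming the Keras HDF5 format (the file name is a placeholder):

from tensorflow.keras.models import load_model

modelSimple.save('best_model.h5')       # persists architecture, weights, and optimizer state
restored = load_model('best_model.h5')  # later: restore and evaluate without retraining
score = restored.evaluate(X_test, Y_test)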
I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well in order to see the problem
Related
I am working on a seq2seq keras/tensorflow 2.0 model. Every time the user inputs something, my model prints the response perfectly fine. However on the last line of each response I get this:
You: WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2 batches). You may need to use the repeat() function when building your dataset.
The "You:" is my last output, before the user is supposed to type something new in. The model works totally fine, but I guess no error is ever good, but I don't quite get this error. It says "interrupting training", however I am not training anything, this program loads an already trained model. I guess this is why the error is not stopping the program?
In case it helps, my model looks like this:
intent_model = keras.Sequential([
    keras.layers.Dense(8, input_shape=[len(train_x[0])]),        # input layer
    keras.layers.Dense(8),                                       # hidden layer
    keras.layers.Dense(len(train_y[0]), activation="softmax"),   # output layer
])
intent_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
intent_model.fit(train_x, train_y, epochs=epochs)
test_loss, test_acc = intent_model.evaluate(train_x, train_y)
print("Tested Acc:", test_acc)
intent_model.save("models/intent_model.h5")
To make sure that you have "at least steps_per_epoch * epochs batches", set the steps_per_epoch to
steps_per_epoch = len(X_train)//batch_size
validation_steps = len(X_test)//batch_size # if you have validation data
You can see the maximum number of batches that model.fit() can take from the progress bar when the training is interrupted:
5230/10000 [==============>...............] - ETA: 2:05:22 - loss: 0.0570
Here, the maximum would be 5230 - 1
Importantly, keep in mind that by default, batch_size is 32 in model.fit().
If you're using a tf.data.Dataset, you can also add the repeat() method, but be careful: it will loop indefinitely (unless you specify a number).
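A sketch combining both suggestions, assuming X_train/Y_train are NumPy arrays and that model and epochs come from the surrounding code:

import tensorflow as tf

batch_size = 32
dataset = (tf.data.Dataset.from_tensor_slices((X_train, Y_train))
           .shuffle(1024)
           .batch(batch_size)
           .repeat())  # repeats indefinitely; fit() below stops at steps_per_epoch per epoch

model.fit(dataset,
          epochs=epochs,
          steps_per_epoch=len(X_train) // batch_size)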
I have also had a number of models crash with the same warnings while trying to train them. The training dataset is created using tf.keras.preprocessing.image_dataset_from_directory() and split 80/20. I have created a variable to try not to run out of images. I am using ResNet50 with my own images...

TRAIN_STEPS_PER_EPOCH = np.ceil((image_count * 0.8 / BATCH_SIZE) - 1)  # to ensure there are enough images for the training batch
VAL_STEPS_PER_EPOCH = np.ceil((image_count * 0.2 / BATCH_SIZE) - 1)

but it still does. BATCH_SIZE is set to 32, so I am taking 80% of the number of images, dividing by 32, then taking away 1 to have a surplus... or so I thought.
history = model.fit(
    train_ds,
    steps_per_epoch=TRAIN_STEPS_PER_EPOCH,
    epochs=EPOCHS,
    verbose=1,
    validation_data=val_ds,
    validation_steps=VAL_STEPS_PER_EPOCH,
    callbacks=tensorboard_callback)
The error, after 3 hours of processing a single successful epoch, is:
Epoch 1/25
374/374 [==============================] - 8133s 22s/step - loss: 7.0126 - accuracy: 0.0028 - val_loss: 6.8585 - val_accuracy: 0.0000e+00
Epoch 2/25
1/374 [..............................] - ETA: 0s - loss: 6.0445 - accuracy: 0.0000e+00
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 9350.0 batches). You may need to use the repeat() function when building your dataset.
This might help:

print(train_ds)
<BatchDataset shapes: ((None, 224, 224, 3), (None,)), types: (tf.float32, tf.int32)>

print(val_ds)
<BatchDataset shapes: ((None, 224, 224, 3), (None,)), types: (tf.float32, tf.int32)>

print(TRAIN_STEPS_PER_EPOCH)
374.0

print(VAL_STEPS_PER_EPOCH)
93.0
Solution which worked for me was to set drop_remainder=True while generating the dataset. This automatically handles any extra data that is left over.
For example:
dataset = tf.data.Dataset.from_tensor_slices((images, targets)) \
    .batch(12, drop_remainder=True)
I had the same problem, and decreasing validation_steps from 50 to 10 solved the issue.
If you create a dataset with image_dataset_from_directory, remove the steps_per_epoch and validation_steps parameters from model.fit.
The reason is that the steps were already determined when batch_size was passed into image_dataset_from_directory, and you can get the number of steps with len.
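A sketch of that point (the directory path and sizes are placeholders, and the compiled model is assumed from the surrounding code):

import tensorflow as tf

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    'data/train', batch_size=32, image_size=(224, 224))
print(len(train_ds))            # number of batches the dataset yields per epoch
model.fit(train_ds, epochs=10)  # no steps_per_epoch / validation_steps needed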
I had the same problem in TF 2.1. It has something to do with the shape/type of the input, namely the query. In my case, I solved the problem as follows (my model takes 3 inputs):
x = [[test_X[0][0]], [test_X[1][0]], [test_X[2][0]]]
MODEL.predict(x)
Output:
WARNING:tensorflow:Your input ran out of data; interrupting training.
Make sure that your dataset or generator can generate at least
steps_per_epoch * epochs batches (in this case, 7 batches). You may
need to use the repeat() function when building your dataset.
array([[2.053718]], dtype=float32)
Solution:
x = [np.array([test_X[0][0]]), np.array([test_X[1][0]]), np.array([test_X[2][0]])]
MODEL.predict(x)
Output:
array([[2.053718]], dtype=float32)
I understand that it is completely fine. Firstly, it is a warning, not an error. Secondly, the situation is like this: one portion of the data is trained during one epoch, the next epoch trains the next portion, and you have set the epochs value too high, e.g. 500 (assuming your data size is not fixed but is approximately <= 500). Suppose the data size is 480. Then the remaining epochs don't have any data left to process, hence the warning. As a result, the model returns to the state from when the last data was trained.
I hope this helps. Do let me know if I have misunderstood the concept. Thanks!
Try reducing the steps_per_epoch value below the value you have currently set. This helped me solve the problem
I am seeing this issue with TF-2.9 and a custom ImageDataGenerator.
The core issue appears to be that TF/Keras is not selecting the correct data adapter:
<python site-packages>/keras/engine/data_adapter.py
select_data_adapter() was selecting a GeneratorDataAdapter when it should be a KerasSequenceAdapter.
I updated the following file to work around the issue:
<python site-packages>/keras_preprocessing/image/iterator.py
try:
    DataSequence = get_keras_submodule('utils').Sequence
except:
    try:
        # Work-around for TF-2.9
        from keras.utils.data_utils import Sequence
        DataSequence = Sequence
    except:
        DataSequence = object
A better solution is:

data_amount = 0.5  # fraction of the data to fit; lower it when the data is large and you train for few epochs
steps_per_epoch = int(data_amount * (len(train_data) / EPOCHS))
# if you have validation data:
validation_steps = int(data_amount * (len(val_data) / EPOCHS))

Here 0.5 is a float value which determines how much of the data you want to fit. This approach is better than the ones based on BATCH_SIZE, but it will always fit the whole dataset, so change the data_amount value to adjust it.
I also got this while training a model in Google Colab, and the reason was that there was not enough memory/RAM to store the data for each batch (if you are using batches); after I lowered the batch_size, it all ran perfectly.
I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched the training to train by batches. However, the model over-fits and the validation loss explodes, and even with accumulated gradients I get the same behavior, so I'm looking for hints and leads toward more accurate ways to do the update.
Here is my training function (only the batch section):
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()  # note: gradients are only reset once per epoch
        epoch_loss = 0
        # using equidistant batch starts up to the last one is much faster
        # than iterating over X.size()[0] directly
        for i in list(np.arange(0, (X_train_tensor_1.size()[0] // batch_size - 1), batch_size)):
            sequence = X_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i + batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward and optimize
            loss.backward()
            optimizer.step()
        epoch_loss = epoch_loss / k
        model.eval()
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation: {}'.format(
            epoch + 1, num_epochs, epoch_loss, validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')
Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
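To make the adjustment concrete, here is a self-contained sketch (toy model and random stand-in data) of gradient accumulation where the loss is divided by the number of accumulation steps, so the summed gradient approximates an average over the effective batch rather than growing by a factor of B:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
accum_steps = 4  # effective batch = batch_size * accum_steps

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in batch
    loss = criterion(model(x), y) / accum_steps  # scale so accumulated gradients average out
    loss.backward()                              # gradients sum across backward() calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()                    # reset before the next accumulation window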
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.
I'm coming over from Keras to PyTorch, and one of the surprising things I've found is that I'm supposed to implement my own training loop.
In Keras, there is a de facto fit() function that: (1) runs gradient descent and (2) collects a history of metrics for loss and accuracy over both the training set and validation set.
In PyTorch, it appears that the programmer needs to implement the training loop. Since I'm new to PyTorch, I don't know if my training loop implementation is correct. I just want to compare apples-to-apples loss and accuracy metrics with what I'm seeing in Keras.
I've already read through:
the official PyTorch 60-minute blitz, where they provide a sample training loop.
official PyTorch example code, where I've found the training loop placed in-line with other code.
the O'Reilly book Programming PyTorch for Deep Learning with its own training loop.
Stanford CS230 sample code.
various blog posts (e.g. here and here).
So I'm wondering: is there a definitive, universal training loop implementation that does the same thing and reports the same numbers as the Keras fit() function?
My points of frustration:
Pulling data out of the dataloader is not consistent between image data and NLP data.
Correctly computing loss and accuracy is not consistent in any sample code I've seen.
Some code examples use Variable, while others do not.
Unnecessarily detailed: moving data to/from the GPU; knowing when to call zero_grad().
For what it's worth, here is my current implementation. Are there any obvious bugs?
import time
import torch

def train(model, optimizer, loss_fn, train_dl, val_dl, epochs=20, device='cuda'):
    '''
    Runs training loop for classification problems. Returns Keras-style
    per-epoch history of loss and accuracy over training and validation data.

    Parameters
    ----------
    model : nn.Module
        Neural network model
    optimizer : torch.optim.Optimizer
        Search space optimizer (e.g. Adam)
    loss_fn :
        Loss function (e.g. nn.CrossEntropyLoss())
    train_dl :
        Iterable dataloader for training data.
    val_dl :
        Iterable dataloader for validation data.
    epochs : int
        Number of epochs to run
    device : string
        Specifies 'cuda' or 'cpu'

    Returns
    -------
    Dictionary
        Similar to Keras' fit(), the output dictionary contains per-epoch
        history of training loss, training accuracy, validation loss, and
        validation accuracy.
    '''
    print('train() called: model=%s, opt=%s(lr=%f), epochs=%d, device=%s\n' %
          (type(model).__name__, type(optimizer).__name__,
           optimizer.param_groups[0]['lr'], epochs, device))

    history = {}  # Collects per-epoch loss and acc like Keras' fit().
    history['loss'] = []
    history['val_loss'] = []
    history['acc'] = []
    history['val_acc'] = []

    start_time_sec = time.time()

    for epoch in range(epochs):

        # --- TRAIN AND EVALUATE ON TRAINING SET -----------------------------
        model.train()
        train_loss = 0.0
        num_train_correct = 0
        num_train_examples = 0

        for batch in train_dl:
            optimizer.zero_grad()

            x = batch[0].to(device)
            y = batch[1].to(device)
            yhat = model(x)
            loss = loss_fn(yhat, y)

            loss.backward()
            optimizer.step()

            train_loss += loss.item() * x.size(0)
            num_train_correct += (torch.max(yhat, 1)[1] == y).sum().item()
            num_train_examples += x.shape[0]

        train_acc = num_train_correct / num_train_examples
        train_loss = train_loss / len(train_dl.dataset)

        # --- EVALUATE ON VALIDATION SET -------------------------------------
        model.eval()
        val_loss = 0.0
        num_val_correct = 0
        num_val_examples = 0

        for batch in val_dl:
            x = batch[0].to(device)
            y = batch[1].to(device)
            yhat = model(x)
            loss = loss_fn(yhat, y)

            val_loss += loss.item() * x.size(0)
            num_val_correct += (torch.max(yhat, 1)[1] == y).sum().item()
            num_val_examples += y.shape[0]

        val_acc = num_val_correct / num_val_examples
        val_loss = val_loss / len(val_dl.dataset)

        print('Epoch %3d/%3d, train loss: %5.2f, train acc: %5.2f, val loss: %5.2f, val acc: %5.2f' %
              (epoch + 1, epochs, train_loss, train_acc, val_loss, val_acc))

        history['loss'].append(train_loss)
        history['val_loss'].append(val_loss)
        history['acc'].append(train_acc)
        history['val_acc'].append(val_acc)

    # END OF TRAINING LOOP

    end_time_sec = time.time()
    total_time_sec = end_time_sec - start_time_sec
    time_per_epoch_sec = total_time_sec / epochs
    print()
    print('Time total:     %5.2f sec' % (total_time_sec))
    print('Time per epoch: %5.2f sec' % (time_per_epoch_sec))

    return history
Short answer: there is no equivalent training loop for PyTorch and TF.keras, and there never shall be one.
First of all, a training loop is syntactic sugar that is supposed to make one's life easier. From my point of view, "making life easier" is the motto of the TF.keras framework, and this is the main reason it has one. A training loop cannot be formalized as a well-defined practice; it can vary a lot depending on the task/dataset/procedure/metric/you_name_it, and it would require a lot of effort to match all the options for the two frameworks. Furthermore, creating a defining interface for a training loop in PyTorch might be too restrictive for many actual users of the framework.
Matching the outputs of the networks would require matching the behavior of every operation within the two frameworks, which is impossible. First of all, the frameworks don't necessarily provide the same sets of operations. Operations can be grouped into higher-level abstractions differently. Also, some common functions like sigmoid or BatchNorm might look mathematically well defined on paper, but in reality have dozens of implementation-specific details. Furthermore, when improvements are introduced to the operations, it is up to the community to integrate these updates into the main framework distributions or to plainly ignore them. Needless to say, the developers of the two frameworks make these decisions independently and likely have different motivations behind them.
To sum it all up, matching the high-level details of the two frameworks would require enormous effort and would probably be very disruptive for existing users.
Indeed, the Pytorch Module class (source code) doesn't have a fit() method, so you have to implement your own according to your needs.
However there are some implementations which mimic the Keras training API, such as this one:
https://github.com/ncullen93/torchsample
or a simpler one:
https://github.com/henryre/pytorch-fitmodule
A close equivalent to Keras' model.fit in PyTorch is the PyTorch extension called Torchbearer.
From the MNIST example notebook:
trial = Trial(model, optimizer, loss, metrics=['acc', 'loss'], callbacks=callbacks).to(device)
trial.with_generators(train_generator=traingen, val_generator=valgen, test_generator=testgen)
history = trial.run(epochs=5, verbose=1)
The similarity is there, although the usage requires some reading.
Best of luck!
I wanted to point out that there is now a fit(...) equivalent. PyTorch Lightning is a wrapper around PyTorch that allows for a clean, object-oriented approach to creating ML models in PyTorch. It provides a fit(...) loop through its Trainer class. I would check out their official website for a more detailed answer.
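As a hedged sketch of what that looks like (API names as documented in recent Lightning versions; the model is a toy placeholder, and train_dl is assumed to be a PyTorch DataLoader):

import torch
import torch.nn as nn
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
        self.loss_fn = nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.net(x), y)
        self.log('train_loss', loss)  # metrics are collected for you, like Keras' history
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

trainer = pl.Trainer(max_epochs=5)   # owns the training loop
trainer.fit(LitClassifier(), train_dl)  # Keras-like fit() over your dataloader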
I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with a batch size of 1. The recognition rate increased for the first few epochs; however, after several epochs it started to drop greatly. (In the first 20 epochs, the recognition rate rose to ~94%. Afterwards, it went from 90 -> 50 -> 40 -> 30 -> 20.) What is the reason for this?
Also, with a batch size of 1, the performance is worse than with a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which is the case in this situation?
Edit: I also added a figure of the recognition rate of the training dataset and the test dataset. [Figure: Recognition rate vs. epoch]
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1

x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

# model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                    'biases': tf.Variable(tf.random_normal([n_classes]), name='lo_b')}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

# train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epochs = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epochs):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance followed by shortest training time. In my experience, it's been best to increase the batch size until the time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
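Regardless, here is a framework-agnostic sketch of the hold-out-and-stop procedure described above (train_one_epoch and evaluate are hypothetical callables standing in for your training and validation code):

import numpy as np

def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=5):
    best_val, bad_epochs = -np.inf, 0
    for epoch in range(max_epochs):
        train_one_epoch()   # hypothetical: one pass over the training split
        val = evaluate()    # hypothetical: recognition rate on the held-out split
        if val > best_val:
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation stopped improving: overfitting is starting in earnest
    return best_val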
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
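Applied to the question's model, that swap would look like this (tf.nn.softplus is the TF 1.x op):

l1 = tf.nn.softplus(l1)  # was: tf.nn.relu(l1)
l2 = tf.nn.softplus(l2)  # was: tf.nn.relu(l2)
l3 = tf.nn.softplus(l3)  # was: tf.nn.relu(l3)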