Previously I thought that a smaller batch_size would lead to faster training, but in practice with Keras I am getting the opposite result: a bigger batch_size makes training faster.
I am running a sample script, and as I increase the batch_size the training gets faster, which contradicts what I previously believed (that a smaller batch_size results in faster training).
Here's the sample code:
# fit model
import time

start = time.time()
history = model.fit(trainX, trainy, validation_data=(testX, testy),
                    epochs=1000, batch_size=500, verbose=0)
end = time.time()
elapsed = end - start
print(elapsed)
I tried 500, 250, 50 and 10 as the batch_size. I expected the lower batch_size to train faster, but batch_size 500 takes 6.3 sec, 250 takes 6.7 sec, 50 takes 28.0 sec, and 10 takes 140.2 sec!
This makes sense. I do not know what model you are using, but Keras is highly optimized and makes use of vectorization for fast matrix operations. So if you split your data of 5,000 samples into batches of 500, with 1,000 epochs there are (5000/500) x 1000 = 10,000 iterations through the model.
If you do the same with a batch size of 10, there are (5000/10) x 1000 = 500,000 iterations: many more passes through the model, both forward and backward.
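To make the comparison concrete, here is a small sketch (assuming 5,000 training samples and the 1,000-epoch setup above) that simply counts the weight-update steps each batch size implies:

# rough count of weight-update steps per training run
# (assumes 5,000 samples and 1,000 epochs, as in the answer above)
n_samples = 5000
epochs = 1000
for batch_size in (500, 250, 50, 10):
    steps = (n_samples // batch_size) * epochs
    print(batch_size, steps)   # 500 -> 10,000 steps ... 10 -> 500,000 steps

Each of those steps is a full forward and backward pass, so the wall-clock times in the question scale roughly with this count.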
Hope that helps.
On the hardware side, GPUs are very good at parallelising calculations, specifically the matrix operations that happen in forward and backward propagation. The same thing happens on the software side: TensorFlow and other DL libraries optimise matrix operations.
Therefore, a larger batch size lets the GPU and the DL libraries batch more matrix computation into each step, which leads to faster training time.
Related
I have a training set containing 272 images.
batch size = 8, steps per epoch = 1 > does it train the model on just 8 images and then jump to the next epoch?
batch size = 8, steps per epoch = 34 (no shuffle) > does it train the model on all 272 images and then jump to the next epoch?
Does it update the weights of the model at the end of each step within an epoch?
If so, does increasing the number of steps per epoch give a better result?
Is there a convention in selecting batch size & steps per epoch?
Defining the terms using your 272 images as the training dataset and 8 as the batch size:
batch size - the number of images that will be fed together to the neural network.
epoch - an iteration over all the dataset images
steps - usually the batch size and number of epochs determine the steps. By default, here, steps = 272/8 = 34 per epoch. In total, if you want 10 epochs, you get 10 x 34 = 340 steps.
Now, if your dataset is very large, or if there are many possible ways to augment your images (which can again lead to a dataset of effectively infinite or dynamic length), how do you define an epoch? You simply use steps per epoch to set a boundary: you pick an arbitrary value, say 100 steps, which with a batch size of 8 amounts to assuming a total dataset length of 800. How you do the augmentation is a separate question; normally you rotate, crop, or scale by random amounts each time.
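As a minimal sketch of that boundary (assuming a compiled Keras model, numpy arrays trainX/trainy, and a hypothetical random_augment helper), an endless augmenting generator only becomes epoch-sized through steps_per_epoch:

import numpy as np

# hypothetical generator that yields augmented (images, labels) batches forever
def augmented_batches(images, labels, batch_size=8):
    while True:
        idx = np.random.randint(0, len(images), batch_size)
        yield random_augment(images[idx]), labels[idx]   # random_augment is a stand-in

# steps_per_epoch=100 caps the endless generator at 100 batches per epoch,
# i.e. 100 x 8 = 800 (possibly augmented) images are treated as one epoch
model.fit_generator(augmented_batches(trainX, trainy),
                    steps_per_epoch=100, epochs=10)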
Anyway, coming to the answers to your questions -
Yes
Yes
Yes if you are using Mini-batch gradient descent
Well, yes unless it overfits or your data is very small or ... there are a lot of other things to consider.
I am not aware of any. But for a ballpark figure, you can check on the training mechanism of high accuracy open source trained models in your domain.
(Note: I am not actively working in this field any more. So some things may have changed or I may be mistaken.)
The batch size defines the number of samples that propagate through the network before the model parameters are updated.
Each batch of samples goes through one full forward and backward propagation.
Example:
Total training samples (images) = 3000
batch_size = 32
epochs = 500
Then…
32 samples will be taken at a time to train the network.
To go through all 3000 samples takes 3000/32 ≈ 94 iterations, which is 1 epoch (the last batch is smaller than 32).
This process continues 500 times (epochs).
You may be limited to small batch sizes based on your system hardware (RAM + GPU).
Smaller batches mean each step in gradient descent may be less accurate, so it may take longer for the algorithm to converge.
But, it has been observed that for larger batches there is a significant degradation in the quality of the model, as measured by its ability to generalize.
Batch size of 32 or 64 is a good starting point.
Summary:
Larger batch sizes result in faster progress in training, but don't always converge as fast.
Smaller batch sizes train slower, but can converge faster.
I'm training a sequence to sequence (seq2seq) model, and I am trying different values for the input_sequence_length.
For values 10 and 15 I get acceptable results, but when I try to train with 20 I get memory errors, so I switched the training to run in batches. However, the model overfits and the validation loss explodes, and even with accumulated gradients I get the same behaviour, so I'm looking for hints and leads towards more accurate ways to do the update.
Here is my training function (only the batch section):
if batch_size is not None:
    k = len(list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size)))
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        epoch_loss = 0
        # using equidistant batches up to the last one is much faster
        # than using X.size()[0] directly
        for i in list(np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size)):
            sequence = X_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, input_size).to(device)
            labels = y_train_tensor[i:i+batch_size, :, :].reshape(-1, sequence_length, output_size).to(device)
            # Forward pass
            outputs = model(sequence)
            loss = criterion(outputs, labels)
            epoch_loss += loss.item()
            # Backward
            loss.backward()
        # optimize once per epoch, on the accumulated gradient
        optimizer.step()
        epoch_loss = epoch_loss / k
        model.eval()
        validation_loss, _ = evaluate(model, X_test_hard_tensor_1, y_test_hard_tensor_1)
        model.train()
        training_loss_log.append(epoch_loss)
        print('Epoch [{}/{}], Train MSELoss: {}, Validation: {}'.format(
            epoch + 1, num_epochs, epoch_loss, validation_loss))
EDIT:
Here are the parameters I'm training with:
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')
Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, accumulating B batches makes it B times larger; if the gradients are i.i.d., it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
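For reference, here is a minimal sketch of the more usual per-batch update pattern, keeping the tensor and variable names from the question (the alternative is to keep the accumulation but divide the loss, or the learning rate, by the number of accumulated batches):

for epoch in range(num_epochs):
    epoch_loss = 0
    for i in np.arange(0, (X_train_tensor_1.size()[0]//batch_size - 1), batch_size):
        sequence = X_train_tensor[i:i+batch_size].reshape(-1, sequence_length, input_size).to(device)
        labels = y_train_tensor[i:i+batch_size].reshape(-1, sequence_length, output_size).to(device)

        optimizer.zero_grad()              # reset the gradient for every batch
        loss = criterion(model(sequence), labels)
        loss.backward()                    # gradient of this batch only
        optimizer.step()                   # one update per batch
        epoch_loss += loss.item()

With per-batch updates the learning rate no longer has to compensate for a gradient summed over the whole epoch.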
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.
In Keras documentation - steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
I have 3000 samples.
If I set steps_per_epoch=3000 it works slowly. If I set steps_per_epoch=300 it works faster, and I thought that batching was working.
But then I compared how much video memory is allocated in the first and second cases and did not notice a big difference, whereas with a simple fit() the difference is large. So is it a real speed-up, or am I just processing 300 examples instead of 3000?
What is this parameter for, and how can I speed up the training?
My generator code:
def samples_generator(self, path_source, path_mask):
    while 1:
        file_paths_x = self.get_files(path_source)
        file_paths_y = self.get_files(path_mask)
        for path_x, path_y in zip(file_paths_x, file_paths_y):
            x = self.load_pixels(path_x, 3, cv2.INTER_CUBIC)
            y = self.load_pixels(path_y, 0, cv2.INTER_NEAREST)
            yield (x, y)
The steps_per_epoch parameter is the number of batches of samples it will take to complete one full epoch. This is dependent on your batch size. The batch size is set where you initialize your training data. For example, if you're doing this with ImageDataGenerator.flow() or ImageDataGenerator.flow_from_directory(), the batch size is specified with the batch_size parameter in each of these.
You said you have 3000 samples.
If your batch size was 100, then steps_per_epoch would be 30.
If your batch size was 10, then steps_per_epoch would be 300.
If your batch size was 1, then steps_per_epoch would be 3000.
This is because steps_per_epoch should be equivalent to the total number of samples divided by the batch size. The process of implementing this in Keras is available in the two videos below.
The reason why you have to set steps_per_epoch is that the generator is designed to run indefinitely (See the docs:
"The generator is expected to loop over its data indefinitely."
). You implemented this by setting while 1.
Since fit_generator() is supposed to run for epochs=x epochs, the method must know when the next epoch begins within this indefinite loop (and hence when the data has to be drawn from the beginning again).
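As a minimal sketch (assuming the model is already compiled and gen is a hypothetical instance of the class that defines samples_generator above), fit_generator only learns where an epoch ends from steps_per_epoch:

# the generator above yields one sample at a time, so one "step" is one sample:
# steps_per_epoch=3000 means one full pass over the 3000 samples per epoch
model.fit_generator(gen.samples_generator(path_source, path_mask),
                    steps_per_epoch=3000, epochs=10)

# if the generator instead yielded batches of 100 samples, the same pass over
# the data would be steps_per_epoch = 3000 // 100 = 30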
Image preparation for CNN training with Keras
Create and train a CNN in Keras
The problem is that I am only seeing a one-second improvement per epoch when I switch from a CPU to a GPU. I believe this is because my training process is highly iterative instead of vectorized.
Any recommendations on vectorizing this Keras training?
e_num = 0
sample_count = len(trainX[0])
for e in range(epochs):
    e_num += 1
    print(e_num)
    for i in range(len(trainX)):
        model.fit(trainX[i], trainY[i], epochs=1, batch_size=32,
                  verbose=2, shuffle=False,
                  validation_split=fix_validation_split(0.05, sample_count,
                                                        batch_size))
    model.reset_states()
The problem with Keras is that I'm having difficulty passing the fit function a 4-D dataset, so I'm iteratively training over each 3-D dataset inside trainX. When I increased the batch_size from 1 to 32, training obviously became much faster. However, there is still only a 1 second time difference per epoch between the CPU and GPU. Is this because my training process is not properly vectorized? If so, what recommendations would you have whilst using Keras?
Thank you!
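One way to vectorize the loop above (a sketch, assuming every trainX[i]/trainY[i] pair has the same shape with samples on the first axis, and that the per-series statefulness being reset by model.reset_states() is not actually required) is to stack the 3-D arrays into a single array and make one fit call:

import numpy as np

# stack the list of 3-D arrays along the sample axis:
# result shape is (n_series * samples_per_series, timesteps, features)
X = np.concatenate([trainX[i] for i in range(len(trainX))], axis=0)
Y = np.concatenate([trainY[i] for i in range(len(trainY))], axis=0)

# one fit call over the whole stacked dataset lets Keras keep the GPU busy
model.fit(X, Y, epochs=epochs, batch_size=32, verbose=2,
          shuffle=False, validation_split=0.05)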
I am using TensorFlow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with a batch size of 1. The recognition rate increased for the first few epochs, but after several epochs it started to drop greatly (in the first 20 epochs the recognition rate goes up to ~94%; afterwards it went from 90 -> 50 -> 40 -> 30 -> 20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than with a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which is the case in this situation?
Edit: I also added a figure of the recognition rate on the training dataset and the test dataset (Recognition rate vs. epoch).
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1

x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

# model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                    'biases': tf.Variable(tf.random_normal([n_classes]), name='lo_b')}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

# train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance followed by the shortest training time. In my experience, it's been best to increase the batch size until the time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used the available RAM.
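As a sketch of that procedure (assuming a compiled Keras model and training arrays trainX/trainy as in the earlier question; the sweep values are arbitrary):

import time

# time one epoch at several batch sizes and report seconds per sample;
# stop increasing the batch size once time per sample stops improving (or RAM runs out)
for bs in (32, 64, 128, 256, 512):
    start = time.time()
    model.fit(trainX, trainy, epochs=1, batch_size=bs, verbose=0)
    print(bs, (time.time() - start) / len(trainX))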
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
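If you want to track this with the question's code, a small fragment along these lines could be placed inside the epoch loop, after the accuracy op is defined (mnist.validation is the held-out split that input_data already provides, so no manual split is needed):

# evaluate on the held-out validation split after each epoch
val_acc = accuracy.eval({x: mnist.validation.images, y: mnist.validation.labels})
train_acc = accuracy.eval({x: mnist.train.images, y: mnist.train.labels})
print('epoch', epoch, 'train acc', train_acc, 'validation acc', val_acc)
# stop training (or keep the best checkpoint) once val_acc starts falling
# while train_acc keeps rising -- that is where overfitting sets in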
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
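For illustration, here is a sketch of how a hard cap on the gradient norm could be added to the question's TF1 code, replacing the single minimize() call (the clip_norm value of 5.0 is an arbitrary choice):

# compute, clip, and apply gradients explicitly instead of optimizer.minimize(cost)
opt = tf.train.AdamOptimizer()
grads_and_vars = opt.compute_gradients(cost)
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)  # cap the global gradient norm
optimizer = opt.apply_gradients(zip(clipped, variables))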
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
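For instance, a sketch of that swap in the question's neural_network_model, with everything else unchanged:

# Softplus has the same general shape as ReLU but a smooth, nonzero gradient everywhere
l1 = tf.nn.softplus(tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases']))
l2 = tf.nn.softplus(tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases']))
l3 = tf.nn.softplus(tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases']))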