In Keras, when we are training a model for a fixed number of epochs using model.fit(), one of its parameters is shuffle (a boolean). The Keras documentation about it reads:
"Boolean (whether to shuffle the training data before each epoch)."
Essentially, I am training a Convolutional Neural Network and trying to get reproducible results. So, I followed the instructions and specified seeds as mentioned in this answer.
That only partially worked (I got reproducible results on my local machine only), so I thought setting shuffle=False would help by keeping the data order identical. But, reproducibility aside for a second, doing that dramatically reduced the performance of the model. Specifically, the metrics give the same results after each epoch (i.e. they do not improve), and increasing the number of epochs makes no difference (accuracy is ~75 after 3 epochs and after 30 epochs). With shuffle=True, the results show a normal, gradual improvement.
Training data shape: (143256, 1, 150, 3)
Target data shape: (143256, 3)
Batch Size: 64
metrics = ['accuracy']
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=metrics)
....
model.fit(x_train, to_categorical(y_train), batch_size=batch_size,
          epochs=epochs, verbose=verbose,
          validation_data=(x_val, to_categorical(y_val)),
          shuffle=False, callbacks=[metrics],
          class_weight=class_weights)
Is this normal behaviour when shuffle is set to False? Even though the data is not permuted, the weights should still be updated in each epoch, and hence the metrics should improve over time.
Assuming there is some issue with my implementation, should there be any significant difference in model performance between the two approaches (shuffling or not)?
And how can the results be made reproducible with shuffle=True, given that the seeds are specified (which they apparently are)?
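For context, the kind of seed setup referred to above usually looks roughly like this (a sketch of the commonly suggested recipe; the seed values are illustrative and the exact answer linked may differ):

import os
import random
import numpy as np
import tensorflow as tf

# Fix the main sources of randomness before building the model.
os.environ['PYTHONHASHSEED'] = '0'
random.seed(1)
np.random.seed(2)
tf.random.set_seed(3)   # tf.set_random_seed(3) on TensorFlow 1.x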
Any help will be really appreciated. Thanks!
Related
I'm using a tf.data dataset containing my training data, consisting of (let's say) 100k images.
I'm also using a tf.data dataset containing my validation set.
Since an epoch over all 100k images takes quite a long time (in my case approximately one hour) before I get any feedback on performance on the validation set, I set the steps_per_epoch parameter in tf.keras.Model's fit() to 10000.
Using a batch size of 1, this results in 10 validation scores by the time all 100k images have been seen.
In order to complete one epoch over all 100k images of my training dataset, I set the epochs parameter to 10.
However, I'm not sure if using steps_per_epoch and epochs this way has any other consequences. Is it correct to use these parameters in order to get more frequent feedback on performance?
And also a more specific question, does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
I already dug into the TensorFlow docs and read several different Stack Overflow questions, but I couldn't find anything conclusive to answer my own question. Hope you can help!
The TensorFlow version I'm using is 2.2.0.
Is it correct to use these parameters in order to get more frequent feedback on performance?
Yes, it is correct to use these parameters. Here is the code that I used to fit the model.
model.fit(
    train_data,
    steps_per_epoch=train_samples // batch_size,
    epochs=epochs,
    validation_data=test_data,
    verbose=1,
    validation_steps=test_samples // batch_size)
Does it use all 100k images or does it use the same first 10k images of my training set at every 'epoch'?
It uses all the images in your training data.
For better understanding: epochs is the number of times the learning algorithm will work through the entire training data set.
Whereas steps_per_epoch is the total number of samples in your training data set divided by the batch size.
For example, if you have 100000 training samples and use a batch size of 100, one epoch is equivalent to 1000 steps (steps_per_epoch = 1000).
Note: batch sizes are generally chosen to be a power of 2, because optimized matrix operation libraries work more efficiently with such sizes.
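To connect this back to the original setup (100k images, steps_per_epoch=10000, epochs=10), here is a minimal sketch of how it might look with a repeated tf.data dataset; train_ds, val_ds and model stand in for the objects described in the question:

import tensorflow as tf

# train_ds / val_ds are assumed to be existing tf.data datasets of (image, label) pairs.
batch_size = 1
steps_per_epoch = 10000   # one "epoch" = 10k samples
epochs = 10               # 10 * 10k steps = one full pass over the 100k images

# Repeating the training dataset gives an effectively endless iterator, so Keras
# keeps consuming it across epochs instead of restarting at the first sample.
train_batches = train_ds.repeat().batch(batch_size)
val_batches = val_ds.batch(batch_size)

model.fit(train_batches,
          steps_per_epoch=steps_per_epoch,
          epochs=epochs,
          validation_data=val_batches)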
I have a few questions about interpreting the performance of certain optimizers on MNIST using a LeNet-5 network, and about what the validation loss/accuracy vs. training loss/accuracy graphs tell us exactly.
Everything is done in Keras using a standard LeNet-5 network, and it is run for 15 epochs with a batch size of 128.
There are two graphs, train acc vs val acc and train loss vs val loss. I made 4 graphs because I ran it twice, once with validation_split=0.1 and once with validation_data=(x_test, y_test) in the model.fit parameters. Specifically the difference is shown here:
train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_test,y_test), verbose=1)
train = model.fit(x_train, y_train, epochs=15, batch_size=128, validation_split=0.1, verbose=1)
These are the graphs I produced:
using validation_data=(x_test, y_test):
using validation_split=0.1:
So my two questions are:
1.) How do I interpret both the train acc vs val acc and the train loss vs val loss graphs? What do they tell me exactly, and why do different optimizers perform differently (i.e. the graphs differ as well)?
2.) Why do the graphs change when I use validation_split instead? Which one would be a better choice to use?
I will attempt to provide an answer.
You can see that towards the end, training accuracy is slightly higher than validation accuracy and training loss is slightly lower than validation loss. This hints at overfitting, and if you train for more epochs the gap should widen.
Even if you use the same model with the same optimizer, you will notice slight differences between runs because the weights are initialized randomly and because of randomness associated with the GPU implementation. You can look here for how to address this issue.
Different optimizers will usually produce different graphs because they update the model parameters differently. For example, vanilla SGD updates all parameters at a constant rate at every training step, but if you add momentum the update also depends on previous updates, which usually results in faster convergence. This means you can achieve the same accuracy as vanilla SGD in a lower number of iterations.
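As a small illustration of that point, a sketch of how the two variants would be set up in Keras (the rest of the training code stays the same):

from tensorflow.keras.optimizers import SGD

# Plain SGD: every parameter moves by -learning_rate * gradient at each step.
plain_sgd = SGD(learning_rate=0.01)

# SGD with momentum: the update accumulates a velocity term, so each step also
# depends on previous gradients, which usually speeds up convergence.
momentum_sgd = SGD(learning_rate=0.01, momentum=0.9)

Training the same LeNet-5 once with each optimizer and overlaying the two history curves makes the difference in convergence speed visible.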
The graphs will also change because the training data changes if you split it randomly. But for MNIST you should use the standard test split provided with the dataset.
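For MNIST the standard split can be loaded directly from Keras, so there is no need to carve a validation set out of the training data yourself:

from tensorflow.keras.datasets import mnist

# Loads the canonical 60,000 / 10,000 train/test split shipped with the dataset.
(x_train, y_train), (x_test, y_test) = mnist.load_data()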
Here is my code:
from keras import backend as K
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

for _ in range(5):
    # Reset the session so each run starts from fresh (random) weights.
    K.clear_session()
    model = Sequential()
    model.add(LSTM(256, input_shape=(None, 1)))
    model.add(Dropout(0.2))
    model.add(Dense(256))
    model.add(Dropout(0.2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop', metrics=['accuracy'])
    hist = model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0,
                     validation_data=(x_val, y_val))

    p = model.predict(x_test)
    print(mean_squared_error(y_test, p))

    plt.plot(y_test)
    plt.plot(p)
    plt.legend(['testY', 'p'], loc='upper right')
    plt.show()
Total params : 330,241
samples : 2264
and below is the result.
I haven't changed anything; I only ran the for loop.
As you can see in the picture, the MSE varies hugely between runs, even though all I did was re-run the loop.
I think the fundamental reason for this problem is that the optimizer cannot find the global minimum; it finds a local minimum and converges there. I say that because, after checking all the loss graphs, the loss is no longer reduced significantly (after about 20 epochs). So in order to solve this problem, I have to find the global minimum. How should I do this?
I tried adjusting the batch_size and the number of epochs. I also tried changing the hidden layer size, the number of LSTM units, adding a kernel_initializer, changing the optimizer, etc., but could not get any meaningful result.
I wonder how can I solve this problem.
Your valuable opinions and thoughts will be very much appreciated.
If you want to see the full source, here is the link: https://gist.github.com/Lay4U/e1fc7d036356575f4d0799cdcebed90e
From your example, the problem simply comes from the fact that you have over 100 times more parameters than you have samples. If you reduce the size of your model, you will see less variance.
The wider question you are asking is actually very interesting and usually isn't covered in tutorials. Nearly all machine learning models are stochastic by nature: the output predictions will change slightly every time you run the training, which means you will always have to ask the question: which model do I deploy to production?
Off the top of my head there are two things you can do:
Choose the first model trained on all the data (after cross-validation, ...)
Build an ensemble of models that all have the same hyper-parameters and implement a simple voting strategy (a sketch follows below)
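For the second option, a minimal sketch of what averaging an ensemble could look like for this regression setup; build_model() is a placeholder for the Sequential LSTM defined above:

import numpy as np

# Train several models with identical hyper-parameters; only the random
# initialization (and other sources of stochasticity) differs between them.
models = []
for _ in range(5):
    m = build_model()   # placeholder for the Sequential LSTM above
    m.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)
    models.append(m)

# For regression, "voting" reduces to averaging the individual predictions,
# which smooths out the run-to-run variance of any single model.
ensemble_pred = np.mean([m.predict(x_test) for m in models], axis=0)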
References:
https://machinelearningmastery.com/train-final-machine-learning-model/
https://machinelearningmastery.com/randomness-in-machine-learning/
If you want to always start from the same point, you should set a seed. You can do it like this if you use the TensorFlow backend in Keras:
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
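On TensorFlow 2.x the second import no longer exists at the top level; the equivalent would be roughly:

import numpy as np
import tensorflow as tf

np.random.seed(1)        # seeds NumPy-level randomness (e.g. data shuffling)
tf.random.set_seed(2)    # TF 2.x name for what set_random_seed did in TF 1.x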
If you want to learn why you get different results in ML/DL models, I recommend this article.
Following this great post, Scaling Keras Model Training to Multiple GPUs, I tried to upgrade my model to run in parallel on my multi-GPU instance.
At first I ran the MNIST example as proposed here (MNIST in Keras), with the additional syntax in the compile command as follows:
# Prepare the list of GPUs to be used in training
NUM_GPU = 8  # or the number of GPUs available on your machine
gpu_list = []
for i in range(NUM_GPU):
    gpu_list.append('gpu(%d)' % i)

# Compile your model by setting the context to the list of GPUs to be used in training.
model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'],
              context=gpu_list)
then I trained the model:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
So far so good. It ran at less than 1 s per epoch and I was really excited and happy until I tried data augmentation.
Up to that point, my training images were a NumPy array of shape (6000,1,28,28) and the labels of shape (10,60000), one-hot encoded. For data augmentation I used the ImageDataGenerator class:
gen = image.ImageDataGenerator(rotation_range=8, width_shift_range=0.08, shear_range=0.3,
                               height_shift_range=0.08, zoom_range=0.08)
batches = gen.flow(x_train, y_train, batch_size=NUM_GPU*64)
test_batches = gen.flow(x_test, y_test, batch_size=NUM_GPU*64)
and then:
model.fit_generator(batches, batches.N, nb_epoch=1,
                    validation_data=test_batches, nb_val_samples=test_batches.N)
And unfortunately, from 1 s per epoch I started getting ~11 s per epoch... I suppose the "impact" of the ImageDataGenerator is destructive: it probably runs the whole (reading -> augmenting -> writing to GPU) pipeline slowly and inefficiently.
Scaling Keras to multiple GPUs is great, but data augmentation is essential for my model to be robust enough.
I guess one solution could be to load all images from the directory and write my own function that shuffles and augments those images. But I'm sure there must be some easier way to optimize this process using the Keras API.
Thanks!
OK, I've found the solution. You need to use MXNet's iterator instead of Keras's data generator. See here:
Image IO - Loading and pre-processing images
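As an illustration only (not copied from the linked page), a record-based iterator is typically set up along these lines; the .rec path, the shape and the options below are placeholder assumptions:

import mxnet as mx

# Reads pre-packed images from a RecordIO file and does decoding, shuffling and
# augmentation in native, multi-threaded code instead of Python.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='train.rec',    # assumed RecordIO file produced with the im2rec tool
    data_shape=(3, 28, 28),     # (channels, height, width), placeholder shape
    batch_size=8 * 64,          # e.g. NUM_GPU * 64 as in the question
    shuffle=True,
    rand_mirror=True,           # one of the built-in augmentation flags
    preprocess_threads=4)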
I am using Keras with the Theano backend and I want to train my network on a GPU. That actually works pretty well. But when I want to train on a huge amount of data, I noticed that there is a bottleneck in the model.fit() function (I am using the functional API).
In the model.fit() function, Keras starts to use the GPU for the training. But before it starts on the GPU, it needs a lot of CPU effort to prepare the training (I don't know exactly what fit() does before the actual training). The problem is that this part only uses one thread, so it takes quite long.
Is it possible to force Keras to use multiprocessing at this step?
Edit: Added additional data to my function call:
My function call looks like this:
optimizer = SGD(lr=0.00001)
early_stopping = EarlyStopping(monitor='val_loss', patience=30, verbose=1, mode='auto')
outname = join(outdir, save_base_name + ".model")
checkpoint = ModelCheckpoint(outname, monitor='val_loss', verbose=1, save_best_only=True)

model.compile(loss='hinge', optimizer=optimizer, metrics=['accuracy'])

model.fit(
    train_instances.x,
    train_instances.y,
    batch_size=60,
    epochs=50,
    verbose=1,
    callbacks=[checkpoint, early_stopping],
    validation_data=(valid_instances.x, valid_instances.y),
    shuffle=True
)
The model I use (you can find the implementation here: https://github.com/pexmar/DSCNN_document) has 90 inputs (shared layers) of dimension 100 x 300 (word2vec embedding layer: 100 words, each with 300 dimensions). I give 12500 training instances and 1000 validation instances to the network.
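For reference, the part of the Keras API that does expose multiprocessing knobs is the generator-based training path (workers / use_multiprocessing in fit_generator); note that these only parallelize batch preparation, not whatever fit() does internally before training starts. A rough sketch with a hypothetical Sequence wrapper, assuming single-array inputs rather than the 90-input setup above:

import numpy as np
from keras.utils import Sequence

class ArraySequence(Sequence):
    """Hypothetical Sequence that serves batches from in-memory arrays."""
    def __init__(self, x, y, batch_size=60):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[sl], self.y[sl]

# workers / use_multiprocessing parallelize only the data feeding.
model.fit_generator(ArraySequence(x_train, y_train),
                    epochs=50,
                    validation_data=(x_val, y_val),
                    workers=4,
                    use_multiprocessing=True)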