I'm attempting to get 100% reproducibility when resuming from a checkpoint for a reinforcement learning agent I'm training in PyTorch. What I currently find is that if I train the agent from scratch twice in a row, at 10000 timesteps the training plots (loss, return, etc.) are identical. However, if I save a checkpoint at 5000 timesteps, then resume training from that timestep and continue out to 10000 timesteps, performance is slightly off, as can be seen from the following plot (where blue is the run trained from scratch to 10k steps, and green is the run resumed from blue's 5k-timestep checkpoint and trained out to 10k steps):
I've stepped through my code and found that the parameters of my models and the RNG states are identical at the 5k step mark with both training from scratch, and after loading from the 5k checkpoint.
I set my seeding as follows:
import os
import random

import numpy as np
import torch


def set_seed(seed, device):
    # Seed every RNG source the training script touches.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if device.type == "cuda":
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Force deterministic cuDNN kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
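I call this once right at the start of the run, along the lines of (simplified sketch; the device selection is just the usual pattern):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
set_seed(args.seed, device)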
When I generate my environment I also set the following seeding:
env.seed(seed)
env.action_space.seed(seed)
env.observation_space.seed(seed)
When loading from a checkpoint, besides loading the state dicts for all my models and optimizers, I also set the RNG states after setting the seeds at the beginning of my code as follows:
if args.resume:
    random.setstate(checkpoint["rng_states"]["random_rng_state"])
    np.random.set_state(checkpoint["rng_states"]["numpy_rng_state"])
    torch.set_rng_state(checkpoint["rng_states"]["torch_rng_state"])
    if device.type == "cuda":
        torch.cuda.set_rng_state(checkpoint["rng_states"]["torch_cuda_rng_state"])
        torch.cuda.set_rng_state_all(
            checkpoint["rng_states"]["torch_cuda_rng_state_all"]
        )
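For reference, the saving side captures the states under the same keys, roughly like this (simplified sketch; the model entry and the file name stand in for the ones in my actual script):

# Capture RNG states so the resume branch above can restore them exactly.
rng_states = {
    "random_rng_state": random.getstate(),
    "numpy_rng_state": np.random.get_state(),
    "torch_rng_state": torch.get_rng_state(),
}
if device.type == "cuda":
    rng_states["torch_cuda_rng_state"] = torch.cuda.get_rng_state()
    rng_states["torch_cuda_rng_state_all"] = torch.cuda.get_rng_state_all()
checkpoint = {
    "model_state_dict": model.state_dict(),  # plus optimizer state dicts, etc.
    "rng_states": rng_states,
}
torch.save(checkpoint, "checkpoint_5000.pt")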
The more complete script is here (I've included only what I thought were the relevant sections above, for brevity and to keep things less confusing): https://pastebin.com/1yqn3CLt
Would anyone have any ideas as to what I might be doing wrong such that I can't get exact reproducibility when I'm resuming from my checkpoint? Thanks in advance!
I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss is jumping up and down - for example from 0.5 to 2.0 a few steps later, and then back to 0.75:
So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during training; it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the 2nd half of the training).
Also when I smooth the curves in Tensorboard, as in the 2nd image above, one can see the total loss going down, so the direction is correct, even though it's still on quite a high value. I would be interested why I can't achieve better results with my training set, but I guess that is another question. First I would be really interested why the total loss is going up and down so much the whole training. Any ideas?
PS: The pipeline.config file for my training can be found here.
In your config it states that your batch size is 2. This is tiny and will cause a very volatile loss.
Try increasing your batch size substantially; try a value of 256 or 512. If you are constrained by memory, try increasing it via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would look like this (I'm using PyTorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:
accumulations = 10

for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    # Scale the loss so the accumulated gradient matches the mean over one big batch.
    loss = get_loss(predictions, truths) / accumulations
    loss.backward()  # gradients accumulate in .grad across iterations
    if (i + 1) % accumulations == 0:
        # Update the parameters once every `accumulations` mini-batches.
        optimizer.step()
        optimizer.zero_grad()
Reading
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
How to accumulate gradients in tensorflow?
https://towardsdatascience.com/how-to-easily-use-gradient-accumulation-in-keras-models-fa02c0342b60
Understanding accumulated gradients in PyTorch
Here is a typical plot of the train/test loss behaviour as the number of epochs increases.
I'm not an expert but I have read several topics on similar problems. Well, let me explain what I'm doing.
First, I have used the implementation given by https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py for resnet18 & resnet50, and by https://github.com/akamaster/pytorch_resnet_cifar10 for resnet32 and resnet56. For all these nets I got the same kind of erratic test-loss behaviour.
Second, my inputs are 5x64x64 images, so I have adapted the first convolutional layer, and the output of the last fully-connected layer consists of 180 neurons. I have used batch sizes of 64, 128, and 256 for training, and 128 for testing: the same behaviour persists. I have also used either 300k or 100k images for training (100k for testing): the same behaviour persists too. The images are not "standard" RGB photos: first, as you have probably already noticed, there are 5 channels; second, the pixel values can be negative (e.g. spanning the range (-0.01, 500)).
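To illustrate what I mean by adapting the network, the change is essentially the following (a minimal sketch using the torchvision resnet18; my actual code may differ in details):

import torch.nn as nn
from torchvision.models import resnet18

# Stock resnet18, with a 180-way output and a first conv accepting 5 channels.
model = resnet18(num_classes=180)
model.conv1 = nn.Conv2d(5, 64, kernel_size=7, stride=2, padding=3, bias=False)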
Third, I am aware of the model.train() statement for the training phase, as well as the model.eval() statement (coupled with with torch.no_grad():) for the testing phase. It is clear that if I do not use model.eval() during the test phase, the test loss gently decreases like the training loss. But that is not allowed, is it?
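The pattern I follow is the usual one, roughly (stripped-down sketch; the loaders, criterion, and optimizer stand in for my actual ones):

# Training phase: BatchNorm uses batch statistics and updates its running stats.
model.train()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Test phase: BatchNorm switches to its running statistics, no gradients tracked.
model.eval()
with torch.no_grad():
    for inputs, targets in test_loader:
        test_loss = criterion(model(inputs), targets)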
I have tried several things after reading posts concerning BatchNorm behaviour, without any success:
I have used SGD and Adam (& SWATS).
I have tried lr = 0.1 down to lr = 1e-5.
I have modified the BN momentum (default = 0.1): 0.5 and 0.01, as well as the eps parameter.
Now, I have managed to get nice results (i.e. good training & testing losses) with a classical CNN (i.e. without any batch normalization or shortcuts), but I would like to study ResNet behaviour against adversarial attacks, so I would like to get ResNet to fit my images.
Any idea?
Thanks
After making some tests, I have found something. I have used the standard resnet20 (h=1). Then I have used as the test set the same samples (100,000 images) as the train set. BUT, for the test set (1) I do not use shuffling, and (2) I do not apply any horizontal/vertical flips or 90/180/270-degree rotations. I observe the same kind of fluctuations for the test loss.
Moreover, when I completely switch OFF the transformations of the train set, and use the same set for train & test, I get the same behaviour:
And finally, if I switch off the shuffling and random transforms (flips & rotations) of the train set, and I use the same set for test, then I get:
It seems that the test loss is converging towards a value, but a different one from the train loss. Why?
As we know, Caffe supports resuming training when a snapshot is given. An explanation of Caffe's training continuation scheme can be found here. However, I found that the training loss and validation loss are inconsistent. I give the following example to illustrate my point. Suppose I am training a neural network with a maximum of 1000 iterations, and every 100 training iterations it keeps a snapshot. This is done using the following command:
caffe train -solver solver.prototxt
where the batch size is selected to be 64, and in solver.prototxt we have:
test_iter: 4
max_iter: 1000
snapshot: 100
display: 100
test_interval: 100
We select test_iter=4 carefully so that it will perform testing on nearly all the validation dataset (there are 284 validation samples, a little larger than 4*64).
This will give us a list of .caffemodel and .solverstate files. For example, we may have solver_iter_300.solverstate and solver_iter_300.caffemodel. When generating these two files, we can also see the training loss (13.7466) and validation loss (2.9385).
Now, if we use the snapshot solver_iter_300.solverstate to continue training:
caffe train -solver solver.prototxt -snapshot solver_iter_300.solverstate
We can see the training loss and validation loss are 12.6 and 2.99 respectively. They are different from before. Any ideas? Thanks.
I have a dataset of 3372149 rows, and I batch them every 3751 rows, as shown in the code below:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_features_numpy},
    y=train_labels_numpy,
    batch_size=3751,
    num_epochs=1,
    shuffle=False)

# Train
nn.train(input_fn=train_input_fn)  # , steps=10000)
If I set num_epochs=1 as in the code above, it means that the training process goes through the dataset once, right? And that leads to a total number of steps equal to 3372149/3751 = 899.
If I uncomment the steps=10000 part and set num_epochs=None, the training would be forced to run all the way to step 10000.
I have two questions then:
Since I only have 899 steps' worth of valid data but I set steps to 10000, what is TensorFlow training on after step 899? Does it just go back to the top and repeat the training?
If I train for more than 899 steps, is it going to mess up the model that relates the features and labels? Or is it redundant, since the training loop just goes over the same dataset again and again?
I did ask about the loss not decreasing during training in my other posts, and I am now wondering if I have too little data to train on and thus all the excess steps are useless.
Iterating over a dataset many times is quite common and normal. Each "step" of your model (that is each batch) takes one gradient update step. In intuitive terms it has taken one step towards the goal in the direction dictated by that mini batch. It does NOT learn everything it can about a particular sample by seeing it once, it just takes a step closer to the goal, and how big a step is dictated by the learning rate (and other more complex factors). If you cut your learning rate in half you'd need twice as many steps to get there. Notice how that had nothing to do with epochs, just "update steps" (aka batches).
The typical way of knowing when it's time to stop is to plot test data accuracy over time as you train your model. It is certainly possible that your model will begin to overfit at some point. If it does so test accuracy will start to get worse, this is an obvious optimal stopping point.
Also note that batches of data are not sequential, each batch is randomly selected by permuting the data. The next time through the dataset will end up with different batches of data, and thus each of these batches will produce a different gradient update. So even going through the dataset twice will not produce the same set of updates on each epoch.
I don't actually know the answer to question #1 because I don't use the estimator API much, but I'm 90% sure it simply permutes the samples and iterates through them again after each epoch. That's the most common approach.
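If you do want multiple passes with this setup, the usual pattern is something like the following (a sketch reusing the question's own input function; num_epochs=None repeats the data indefinitely, steps caps the number of updates, and shuffle=True gives freshly permuted batches on each pass):

# Let the input_fn cycle through the data indefinitely and cap training by steps.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_features_numpy},
    y=train_labels_numpy,
    batch_size=3751,
    num_epochs=None,  # repeat the dataset until `steps` is reached
    shuffle=True)     # re-permute the samples on each pass

nn.train(input_fn=train_input_fn, steps=10000)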
During training my LSTM performs well (I use training, validation, and test datasets). I use my test dataset once at the end after training, and I get really good values, so I save the meta file and checkpoint.
Then, during inference, I load my checkpoint and meta file, initialize the weights (using sess.run(tf.initialize_variables())), but when I use a second test dataset (different from the dataset I used during training) my LSTM performance goes from 96% to 20%.
My second test dataset was recorded in similar conditions as my training, validation, and first test dataset, but it was recorded on a different day.
All my dataset was recorded using the same webcam, and with the same background in all images, so technically I should get similar performance in my first and second test set.
I shuffled my dataset during training.
I am using tensorflow 1.1.0
What could be the issue here?
Well, I was reloading my checkpoint during inference, and somehow tensorflow would complain if I did not call the initializer after starting my session like this:
init = tf.global_variables_initializer()
lstm_sess.run(init)
Somehow that seems to randomly initialize my weights rather than reloading the last used weight values.
So what I did instead is freeze my graph as soon as I finish training, and now during inference I reload that frozen graph, and I get the same performance as I got with my test dataset during training. It's kind of weird. Maybe I am not saving/reloading my checkpoint correctly?
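For reference, my understanding is that the usual checkpoint restore flow looks roughly like this (a sketch with placeholder paths), and that calling the global initializer after restoring is exactly what overwrites the restored weights:

import tensorflow as tf

with tf.Session() as sess:
    # Rebuild the graph from the .meta file, then load the trained values.
    saver = tf.train.import_meta_graph("model.ckpt.meta")
    saver.restore(sess, tf.train.latest_checkpoint("./checkpoints"))
    # No tf.global_variables_initializer() here: running it after restore
    # would replace the restored weights with freshly initialized ones.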