PyTorch: VGG16 model not deterministic whereas ResNet18 model is

I am currently trying to train two models (VGG16 and ResNet18) on two datasets (MNIST and CIFAR10). The goal is to later test the effect that different changes (like another loss function or a manipulated dataset) have on the accuracy of the models. To make my results comparable, I tried to make the learning process deterministic. To achieve this I set a fixed seed for all the random generators with the following code.
import os
import random

import numpy as np
import torch

def update_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    os.environ['PYTHONHASHSEED'] = str(seed)
And for the ResNet18 model this works perfectly fine (the results are deterministic). But for the VGG16 model it does not. That is the point I don't understand: why is the above enough for ResNet18 to be deterministic, but not for VGG16?
So where is this extra randomness for VGG16 coming from, and how can I disable it?
To get VGG16 deterministic I currently have to disable CUDA and use the CPU only, but this makes the whole computation very slow and is therefore not really an option.
The only difference between the two models is the loading, shown below, and the learning rate when using CIFAR10.
def setup_vgg16(is_mnist_used):
    vgg16_model = models.vgg16()
    if is_mnist_used:
        vgg16_model.features[0] = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1)
    vgg16_model.classifier[-1] = nn.Linear(4096, 10, bias=True)
    return vgg16_model

def setup_resnet(is_mnist_used):
    resnet_model = models.resnet18()
    if is_mnist_used:
        resnet_model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    resnet_model.fc = nn.Linear(512, 10, bias=True)
    return resnet_model
What I have already tried (but with no success):
- Adding bias=False to the VGG16 model, as it is the obvious difference between the two models.
- Testing the model before training (maybe the model is initialized with random values), but without training the model is deterministic.
- Adding more stuff to the update_seed(seed) function:
  - torch.backends.cudnn.benchmark = False
  - torch.backends.cudnn.enabled = False (these two just decrease the performance)
  - torch.use_deterministic_algorithms(True) -> this results in a CUDA error
- Setting num_workers=0 in the DataLoader (this was suggested as a workaround for a similar problem in another thread; a sketch of the usual seeded-worker alternative follows this list).
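For completeness, the usual alternative to forcing num_workers=0 is to seed the DataLoader workers and the shuffling generator explicitly. A minimal sketch along the lines of PyTorch's reproducibility notes (train_dataset and the batch size are placeholders, not the asker's values):

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # Each worker re-seeds NumPy and random from the torch seed it was handed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                              num_workers=4, worker_init_fn=seed_worker, generator=g)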
This is the training function. Before this function is called the model is deterministic; after it is called for the first time, VGG16 is no longer deterministic.
def train_loop(dataloader, f_model, f_loss_fn, f_optimizer):
    # setting the model into the train mode
    f_model.train()
    for batch, (x, y) in tqdm(enumerate(dataloader)):
        # Moving the data to the same device as the model
        x, y = x.to(device), y.to(device)
        # Compute prediction and loss
        pred = f_model(x)
        loss = f_loss_fn(pred, y)
        # Backpropagation
        f_optimizer.zero_grad()
        loss.backward()
        f_optimizer.step()

I think that's because torchvision's VGG models use AdaptiveAvgPool2d, whose CUDA backward pass does not have a deterministic implementation; it will even throw a runtime error when used together with torch.use_deterministic_algorithms(True).
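If the adaptive pooling really is the source, one possible workaround is the following sketch, assuming the images are resized to 224x224 so that vgg16.features already emits a 512x7x7 map and the adaptive pool is effectively a no-op; in that case it can be replaced with an identity module. The CUBLAS_WORKSPACE_CONFIG setting is what CUDA >= 10.2 additionally requires before torch.use_deterministic_algorithms(True) can run without the cuBLAS error:

import os
import torch
import torch.nn as nn
from torchvision import models

# Replace the layer whose CUDA backward pass is non-deterministic. This only
# works when features already outputs 7x7 maps (e.g. 224x224 inputs).
vgg16_model = models.vgg16()
vgg16_model.avgpool = nn.Identity()

# Needed (CUDA >= 10.2) before deterministic algorithms can be enforced; set it
# before any CUDA work happens in the process.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)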

Related

kerastuner tune hyperparameter that selects a subset of input features

I have a toy problem, where I have some data (X,Y) where the labels Y are frequencies, and X are cosine functions with frequency Y: X=cos(Y*t+phi)+N, where t is a time vector, phi is some random phase shift and N is additive noise. I am developing a CNN in keras (tensorflow backend) to learn Y from X. However, I don't know how long my time window needs to be. So, I would like to use keras-tuner to help identify the best hyperparameters (winStart,winSpan) to determine which times to select t[winStart:winSpan].
It is unclear if/how I can slice my learning features X using tuned hyperparameters.
First, I defined my data as:
import numpy as np

# given sine waves X, estimate frequencies Y
t = np.linspace(0, 1, 1000)
Y = np.divide(2*np.pi, (np.random.random((100)) + 1))
X = np.transpose(np.cos(np.expand_dims(t, axis=1)*np.expand_dims(Y, axis=0)
                        + np.ones((t.size, 1))*np.random.normal(loc=0, scale=np.pi, size=(1, Y.size)))
                 + np.random.normal(loc=0, scale=.1, size=(t.size, Y.size)))
Y = np.expand_dims(Y, axis=1)
Following tutorials, I have written a function to construct my CNN model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Dropout, Flatten, Dense

def build_model(inputSize):
    model = Sequential()
    model.add(Conv1D(10,
                     kernel_size=(15,),
                     padding='same',
                     activation='ReLU',
                     batch_input_shape=(None, inputSize, 1)))
    model.add(MaxPool1D(pool_size=(2,)))
    model.add(Dropout(.2))
    model.add(Conv1D(10,
                     kernel_size=(15,),
                     padding='same',
                     activation='ReLU',
                     batch_input_shape=(None, model.layers[-1].output_shape[1], 1)))
    model.add(MaxPool1D(pool_size=(2,)))
    model.add(Dropout(.2))
    model.add(Flatten())
    # add a dense layer
    model.add(Dense(10))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error',
                  optimizer='adam')
    return model
Additionally, I have written a hypermodel class:
class myHypermodel(HyperModel):
    def __init__(self, inputSize):
        self.inputSize = inputSize

    def build_hp_model(self, hp):
        inputSize = 1000
        self.winStart = hp.Int('winStart', min_value=0, max_value=inputSize-100, step=100)
        self.winSpan = hp.Int('fMax', min_value=100, max_value=inputSize, step=100)
        return build_model(self.winSpan)

    def run_trial(self, trial, x, y, *args, **kwargs):
        hp = trial.hyperparameters
        # build the model with the current hyperparameters
        model = self.build_hp_model(hp)
        # Window the feature vectors
        x = x[:, self.winStart:np.min([self.winStart+self.winSpan, self.inputSize])]
        print('here')
        return model.fit(x, y, *args, **kwargs)
Here, the build_hp_model() method is intended to link the hyperparameters to internal variables so that they can be used when the run_trial() method is called. My understanding is that run_trial() will be called by tuner.search() when performing hyperparameter optimization. I expect the run_trial() method to pick a new combination of the winStart and winSpan hyperparameters, rebuild the model, remove all values of x except those in the window defined by winStart and winSpan, and then run model.fit().
I call my hypermodel class and attempt to perform the hyperparameter search using:
tuner_model = myHypermodel(X.shape[1])

tuner = kt.Hyperband(tuner_model.build_hp_model,
                     overwrite=True,
                     objective='val_loss',
                     max_epochs=25,
                     factor=3,
                     hyperband_iterations=3)

tuner.search(x=np.expand_dims(X, axis=2),
             y=np.expand_dims(Y, axis=2),
             epochs=9,
             validation_split=0.25)
When I run the script, I get the error:
ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 100, 1), found shape=(None, 1000, 1)
So it seems like the build_model() function is being called for a hyperparameter winSpan=100, but then the model is being fit using the full feature vectors X instead of X[:,winStart:winStart+winSpan].
Any suggestions on how I can properly implement this tuning?

Issue fitting multiple keras Sequential models in the same python script

Please see gist here for the full code. In this question I've pasted what I regard the important part of the code at the bottom.
First I generate a dataset as shown in [2], which consists of 2 features (x and y) and 3 labels (grey, blue and orange). Then I make 4 keras sequential models with identical layer structure, optimizer, loss function, etc. Lastly, I call fit on each model and plot the resulting metrics shown in [3]. As you can see the models all perform differently and I'm wondering why? I've locked the random seed value so each time I run this script I get the exact same result.
In this example the models have identical structures, so I would expect the metric plots to be identical. Eventually I would like to vary the number of layers, layer size, loss function, etc. between the models to see its effects, but that does not seem feasible in this setup. Am I approaching this incorrectly?
An interesting thing to note is that by setting the batch_size to 32 this effect is not as prominent, but it is still present and reproducible (See [4]).
# ---- MAKE MODELS ---- #
NUMBER_OF_MODELS = 4
models = []
for i in range(NUMBER_OF_MODELS):
model = keras.models.Sequential(name=f'{i}')
model.add(keras.layers.Dense(8, activation='relu', input_shape=df_train['features'].values.shape[-1:]))
model.add(keras.layers.Dense(3, activation='softmax'))
model.compile(optimizer=keras.optimizers.Adam(),
loss=keras.losses.CategoricalCrossentropy(),
metrics=[keras.metrics.CategoricalAccuracy()])
model.summary()
models.append(model)
# --------------------- #
# ---- TRAIN MODELS ---- #
histories = []
for model in models:
with tf.device('/cpu:0'):
history = model.fit(x=df_train['features'].values, y=df_train['labels'].values,
validation_data=(df_val['features'].values, df_val['labels'].values),
batch_size=512, epochs=100, verbose=0)
histories.append(history)
# ---------------------- #
You simply need to set the seed every time you define and fit a model.
Following your code, I collapse it all into these lines:
NUMBER_OF_MODELS = 4

models = []
histories = []
for i in range(NUMBER_OF_MODELS):
    set_seed_TF2(33)
    model = keras.models.Sequential(name=f'{i}')
    model.add(keras.layers.Dense(8, activation='relu', input_shape=df_train['features'].values.shape[-1:]))
    model.add(keras.layers.Dense(3, activation='softmax'))
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.CategoricalCrossentropy(),
                  metrics=[keras.metrics.CategoricalAccuracy()])
    with tf.device('/cpu:0'):
        history = model.fit(x=df_train['features'].values, y=df_train['labels'].values,
                            validation_data=(df_val['features'].values, df_val['labels'].values),
                            batch_size=512, epochs=100, verbose=0)
    histories.append(history)
    models.append(model)
The magic function is set_seed_TF2:
def set_seed_TF2(seed):
    tf.random.set_seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
It must be called every time you initialize the model and fit it.
With this in mind, you can produce the same metrics/predictions every time.
Here is the running notebook: https://colab.research.google.com/drive/1nHEDI6d3LsRPQUXGiTOYfNOfw954Pu-H?usp=sharing
This works on CPU only.
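One caveat on the last line: on GPU the same re-seeding is usually not enough, because several TensorFlow GPU kernels are non-deterministic by default. A hedged sketch of the extra switch (which one applies depends on the TF version):

import os
import tensorflow as tf

# TF 2.1 - 2.8: request deterministic GPU kernels via an environment variable,
# set before the first TensorFlow op runs.
os.environ['TF_DETERMINISTIC_OPS'] = '1'

# TF >= 2.9: the supported API that replaces the environment variable.
# tf.config.experimental.enable_op_determinism()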

The difference of loading model parameters between load_state_dict and nn.Parameter in pytorch

When I want to assign part of a pre-trained model's parameters to another module defined in a new PyTorch model, I get two different outputs using two different methods.
The Network is defined as follows:
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.resnet = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
        self.resnet = nn.Sequential(*list(self.resnet.children())[:-1])
        self.freeze_model(self.resnet)
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 3),
        )

    def forward(self, x):
        out = self.resnet(x)
        out = out.flatten(start_dim=1)
        out = self.classifier(out)
        return out
What I want is to assign pre-trained parameters to classifier in the net module. Two different ways were used for this task.
# First way
net.load_state_dict(torch.load('model_CNN_pretrained.ptl'))

# Second way
params = torch.load('model_CNN_pretrained.ptl')
net.classifier[1].weight = nn.Parameter(params['classifier.1.weight'], requires_grad=False)
net.classifier[1].bias = nn.Parameter(params['classifier.1.bias'], requires_grad=False)
net.classifier[3].weight = nn.Parameter(params['classifier.3.weight'], requires_grad=False)
net.classifier[3].bias = nn.Parameter(params['classifier.3.bias'], requires_grad=False)
The parameters are assigned correctly, but I get two different outputs from the same input data. The first method works correctly; the second does not. Could someone point out the difference between these two methods?
Finally, I found out where the problem is.
During pre-training, the buffer parameters in the BatchNorm2d layers of the ResNet18 model were changed even though we set requires_grad of the parameters to False. These buffers (the running statistics) are updated from the input data while the model is in model.train() mode, and stay unchanged after model.eval().
Here is a link about how to freeze the BN layers:
How to freeze BN layers while training the rest of network (mean and var wont freeze)
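As a minimal sketch of the freezing idea behind that link (not the asker's exact code): keep the network in training mode, but put every BatchNorm module into eval mode so that its running_mean / running_var buffers stop being updated by forward passes.

import torch.nn as nn

def freeze_bn_stats(model):
    # Only the BatchNorm layers are switched to eval mode; their running
    # statistics (buffers) are then no longer updated during training.
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()

# Hypothetical usage with the Net from the question:
# net.train()                  # regular training mode for the new classifier
# freeze_bn_stats(net.resnet)  # keep the frozen backbone's BN buffers fixed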

Access output of intermediate layers in Tensor-flow 2.0 in eager mode

I have a CNN that I have built using TensorFlow 2.0. I need to access the outputs of the intermediate layers. I was going over other Stack Overflow questions that were similar, but they all had solutions involving the Keras Sequential model.
I have tried using model.layers[index].output but I get
Layer conv2d has no inbound nodes.
I can post my code here (which is super long), but I am sure that even without it someone can point me to how this can be done using just TensorFlow 2.0 in eager mode.
I stumbled onto this question while looking for an answer and it took me some time to figure out as I use the model subclassing API in TF 2.0 by default (as in here https://www.tensorflow.org/tutorials/quickstart/advanced).
If somebody is in a similar situation, all you need to do is assign the intermediate output you want as an attribute of the class. Then keep test_step without the @tf.function decorator and create a decorated copy of it, say val_step, for efficient internal computation of validation performance during training. As a short example, I have modified a few functions of the tutorial from the link accordingly. I'm assuming we need to access the output after flattening.
def call(self, x):
    x = self.conv1(x)
    x = self.flatten(x)
    self.intermediate = x  # assign it as an object attribute for accessing later
    x = self.d1(x)
    return self.d2(x)

# Remove the @tf.function decorator from test_step for prediction
def test_step(images, labels):
    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)
    return

# Create a decorated val_step for the object's internal use during training
@tf.function
def val_step(images, labels):
    return test_step(images, labels)
Now when you run model.predict() after training, using the un-decorated test step, you can access the intermediate output using model.intermediate, which will be an EagerTensor whose value is obtained simply via model.intermediate.numpy(). However, if you don't remove the @tf.function decorator from test_step, this will return a Tensor whose value is not so straightforward to obtain.
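A short usage sketch under those assumptions (x_test is a placeholder for your own evaluation batch):

# An eager forward pass through the un-decorated path stores the flattened
# activations on the model as a side effect of call().
preds = model(x_test)
flattened_features = model.intermediate.numpy()  # EagerTensor -> NumPy array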
Thanks for answering my earlier question. I wrote this simple example to illustrate how what you're trying to do might be done in TensorFlow 2.x, using the MNIST dataset as the example problem.
The gist of the approach:
- Build an auxiliary model (aux_model in the example below), which is a so-called "functional model" with multiple outputs. The first output is the output of the original model and is used for loss calculation and backprop, while the remaining output(s) are the intermediate-layer outputs you want to access.
- Use tf.GradientTape() to write a custom training loop and expose the detailed gradient values on each individual variable of the model. Then you can pick out the gradients that are of interest to you. This requires that you know the ordering of the model's variables, but that should be relatively easy for a sequential model.
import tensorflow as tf

(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()

# This is the original model.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28, 1]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")])

# Make an auxiliary model that exposes the output from the intermediate layer
# of interest, which is the first Dense layer in this case.
aux_model = tf.keras.Model(inputs=model.inputs,
                           outputs=model.outputs + [model.layers[1].output])

# Define a custom training loop using `tf.GradientTape()`, to make it easier
# to access gradients on specific variables (the kernel and bias of the first
# Dense layer in this case).
cce = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.optimizers.Adam()
with tf.GradientTape() as tape:
    # Do a forward pass on the model, retrieving the intermediate layer's output.
    y_pred, intermediate_output = aux_model(x_train)
    print(intermediate_output)  # Now you can access the intermediate layer's output.
    # Compute loss, to enable backprop.
    loss = cce(tf.one_hot(y_train, 10), y_pred)

# Do backprop. `gradients` here are for all variables of the model.
# But we know we want the gradients on the kernel and bias of the first
# Dense layer, which happens to be the first two variables of the model.
gradients = tape.gradient(loss, aux_model.variables)

# This is the gradient on the first Dense layer's kernel.
intermediate_layer_kernel_gradients = gradients[0]
print(intermediate_layer_kernel_gradients)

# This is the gradient on the first Dense layer's bias.
intermediate_layer_bias_gradients = gradients[1]
print(intermediate_layer_bias_gradients)

# Update the variables of the model.
optimizer.apply_gradients(zip(gradients, aux_model.variables))
The most straightforward solution would go like this:
mid_layer = model.get_layer("layer_name")
You can now treat mid_layer as a model and, for instance, call:
mid_layer.predict(X)
Oh, also, to get the name of a hidden layer, you can use this:
model.summary()
This will give you some insight into the layer inputs/outputs as well.
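One caveat on the snippet above: model.get_layer() returns a Layer object, not a Model, so it cannot be used with predict() directly. A more standard sketch for running inputs up to a hidden layer, assuming a Sequential or functional model ("layer_name" is a placeholder taken from model.summary()):

import tensorflow as tf

# Wrap the original inputs and the hidden layer's output in a new model.
feature_extractor = tf.keras.Model(
    inputs=model.inputs,
    outputs=model.get_layer("layer_name").output)

intermediate_activations = feature_extractor.predict(X)

Note that this only works for Sequential/functional models; a subclassed model hits the "no inbound nodes" error from the question, which is why the attribute-based approach in the first answer is needed there.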

Adding batch normalization decreases the performance

I'm using PyTorch to implement a classification network for skeleton-based action recognition. The model consists of three convolutional layers and two fully connected layers. This base model gave me an accuracy of around 70% on the NTU-RGB+D dataset. I wanted to learn more about batch normalization, so I added batch normalization for all the layers except the last one. To my surprise, the evaluation accuracy dropped to 60% rather than increasing, while the training accuracy increased from 80% to 90%. Can anyone say what I am doing wrong? Or does adding batch normalization not necessarily increase accuracy?
The model with batch normalization
class BaseModelV0p2(nn.Module):
    def __init__(self, num_person, num_joint, num_class, num_coords):
        super().__init__()
        self.name = 'BaseModelV0p2'
        self.num_person = num_person
        self.num_joint = num_joint
        self.num_class = num_class
        self.channels = num_coords
        self.out_channel = [32, 64, 128]
        self.loss = loss
        self.metric = metric
        self.bn_momentum = 0.01

        self.bn_cv1 = nn.BatchNorm2d(self.out_channel[0], momentum=self.bn_momentum)
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels=self.channels, out_channels=self.out_channel[0],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv1,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv2 = nn.BatchNorm2d(self.out_channel[1], momentum=self.bn_momentum)
        self.conv2 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[0], out_channels=self.out_channel[1],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv2,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv3 = nn.BatchNorm2d(self.out_channel[2], momentum=self.bn_momentum)
        self.conv3 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[1], out_channels=self.out_channel[2],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv3,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_fc1 = nn.BatchNorm1d(256 * 2, momentum=self.bn_momentum)
        self.fc1 = nn.Sequential(nn.Linear(self.out_channel[2]*8*3, 256*2),
                                 self.bn_fc1,
                                 nn.ReLU(),
                                 nn.Dropout2d(p=0.5))  # TO check

        self.fc2 = nn.Sequential(nn.Linear(256*2, self.num_class))

    def forward(self, input):
        list_bn_layers = [self.bn_fc1, self.bn_cv3, self.bn_cv2, self.bn_cv1]
        # set the momentum of the batch norm layers to the given momentum value during training and 0 during evaluation
        # ref: https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561
        # ref: https://github.com/pytorch/pytorch/issues/4741
        for bn_layer in list_bn_layers:
            if self.training:
                bn_layer.momentum = self.bn_momentum
            else:
                bn_layer.momentum = 0

        logits = []
        for i in range(self.num_person):
            out = self.conv1(input[:, :, :, :, i])
            out = self.conv2(out)
            out = self.conv3(out)
            logits.append(out)

        out = torch.max(logits[0], logits[1])
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.fc2(out)
        t = out
        assert not ((t != t).any())  # find out nan in tensor
        assert not (t.abs().sum() == 0)  # find out 0 tensor
        return out
My interpretation of the phenomenon you are observing is that instead of reducing the covariate shift, which is what Batch Normalization is meant to do, you are increasing it. In other words, instead of decreasing the distribution differences between train and test, you are increasing them, and that is what causes the bigger gap between train and test accuracy. Batch Normalization does not guarantee better performance in every case; for some problems it does not work well. I have several ideas that could lead to an improvement:
- Increase the batch size if it is small, which would help the mean and std computed in the Batch Norm layers be more robust estimates of the population parameters.
- Decrease the bn_momentum parameter a bit, to see if that also stabilizes the Batch Norm parameters.
- I am not sure you should set bn_momentum to zero at test time; I think you should just call model.train() when you want to train and model.eval() when you want to use your trained model to perform inference.
- You could alternatively try Layer Normalization instead of Batch Normalization, since it does not require accumulating any statistics and usually works well (a minimal sketch follows this list).
- Try regularizing your model a bit with dropout.
- Make sure you shuffle your training set in every epoch. Not shuffling the dataset may lead to correlated batches that make the statistics in batch normalization cycle. That may impact your generalization.
I hope one of these ideas works for you.
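As a minimal sketch of the Layer Normalization idea from the list above (the channel counts are illustrative, not the question's exact model): nn.GroupNorm keeps no running statistics, so it behaves identically in train() and eval() mode, and with num_groups=1 it acts as a layer norm over the channel and spatial dimensions.

import torch.nn as nn

conv1 = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1),
    nn.GroupNorm(num_groups=1, num_channels=32),  # per-sample normalization, no running stats
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2))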
The problem may be your momentum. I see you are using 0.01.
Here is how I tried different betas to fit points with momentum: with beta=0.01 I got bad results. Usually beta=0.1 is used.
It almost always happens because of two major reasons: 1. a non-stationary training procedure and 2. different train/test distributions.
If possible, try other regularization techniques like dropout. I faced this problem and found that my train and test distributions might be different, so after I removed BN and used dropout instead, I got a reasonable result. Read this for more.
- Use nn.BatchNorm2d(out_channels, track_running_stats=False); this disables the running statistics of the batches and uses the current batch's mean and variance to do the normalization.
- In training mode, run some forward passes on data in a with torch.no_grad() block. This stabilizes the running_mean / running_std values.
- Use the same batch_size in your dataset for both model.train() and model.eval().
- Increase the momentum of the BN. This means that the means and stds learned will be much more stable during training.
- This is helpful whenever you use a pre-trained model:
for child in model.children():
    for ii in range(len(child)):
        if type(child[ii]) == nn.BatchNorm2d:
            child[ii].track_running_stats = False
